Collaborative Filtering: Why working on static data sets is not enough

As scientists, we should question our assumptions. So far, most of the hard Computer Science research on collaborative filtering has used static data sets such as the Netflix data set. Specifically, it assumes that recommender systems affect neither the ratings nor which items get rated. A related assumption is that polls do not change how people vote (thanks to Peter for this observation).

Yet, people’s preferences are often constructed in the process of elicitation. That is, collaborative filtering is a nonlinear problem: ratings feed into the recommender system which helps to determine what people rate, which, in turn, feeds back into the recommender system…

How could a researcher take this into account? It would be too expensive to try to simulate e-commerce sites with volunteers. We need to submit simulated users to a recommender system. The usefulness of the recommendations is tricky to measure, and cross-validation errors are probably not what you want to study exclusively: diversity might be an important factor too.
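To make the loop concrete, here is a toy sketch in Python. Everything in it is an assumption made for illustration: the "recommender" simply shows the most-rated items so far, and the simulated users have no preferences at all (they either rate a recommended item or a random one). The names `simulate_feedback_loop`, `follow_prob` and `top_share` are made up. This is a placeholder, not sane user modelling.

```python
import random


def simulate_feedback_loop(n_items=50, n_steps=5000, follow_prob=0.8,
                           top_k=5, seed=0):
    """Return the number of ratings each item has after n_steps ratings."""
    rng = random.Random(seed)
    counts = [1] * n_items  # every item starts with one rating
    for _ in range(n_steps):
        if rng.random() < follow_prob:
            # Toy recommender: show the top_k most-rated items so far.
            recommended = sorted(range(n_items),
                                 key=lambda i: counts[i],
                                 reverse=True)[:top_k]
            item = rng.choice(recommended)
        else:
            # The user ignores the recommender and rates a random item.
            item = rng.randrange(n_items)
        counts[item] += 1  # this rating feeds back into the recommender
    return counts


def top_share(counts, top=5):
    """Fraction of all ratings that went to the `top` most-rated items."""
    return sum(sorted(counts, reverse=True)[:top]) / sum(counts)


with_loop = top_share(simulate_feedback_loop(follow_prob=0.8))
without_loop = top_share(simulate_feedback_loop(follow_prob=0.0))
print(f"top-5 share with recommender:    {with_loop:.2f}")
print(f"top-5 share without recommender: {without_loop:.2f}")
```

Even with users who have no opinions, the recommender decides which items collect ratings. A static data set records only the outcome of such a loop, not the loop itself.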

Note 1: If someone out there knows how to simulate users (something I do not know how to do), please get in touch! I have no idea how to do sane user modelling and I need help!

Note 2: Peter also once pointed me to the Iterated Prisoner’s Dilemma problem as something related.

6 thoughts on “Collaborative Filtering: Why working on static data sets is not enough”

  1. Hi Daniel,

    This related work comes to mind:

    Intelligent Information Access Publications

    – A Learning Agent that Assists the Browsing of Software Libraries
    – A Learning Apprentice For Browsing
    – Accelerating Browsing by Automatically Inferring a User’s Search Goal

  2. I agree that this is a very important complication in the evaluation of collaborative filtering.
    To sharpen the point, I think that there are two separate issues:
    First, the interactive recommender system influences the users’ behavior, which, in turn, feeds back into the CF system, and so on in a loop. In other words, the CF mechanism is an active part of the system that it is supposed to learn and judge.

    Second, all the feedback to the collaborative filtering is conditioned on the fact that the users actually performed an action. All our observations on a product are based on the very narrow and unrepresentative sub-population that chose to reflect their opinion (implicitly or explicitly) on that product. Naturally, such a population is highly biased to like the product. For example, when we say that “the average rating for The Sixth Sense movie is 4.5 stars” we really mean to say: “the average rating for The Sixth Sense movie AMONG PEOPLE THAT CHOSE TO RATE THAT MOVIE is 4.5 stars”. Now what is really the average rating for The Sixth Sense across the whole population? Well, that’s hard to know. But the whole population is the one that really counts…
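The gap between the observed average and the population average can be illustrated with a small simulation. This is a sketch under made-up assumptions: true opinions are uniform on 1 to 5 stars, and the chance that someone bothers to rate is taken to be 10% per star, so people who like the item rate more often. The function name and the numbers are hypothetical.

```python
import random


def simulate_selection_bias(n_users=10_000, seed=0):
    """Return (population mean, mean among users who chose to rate)."""
    rng = random.Random(seed)
    everyone = []
    raters = []
    for _ in range(n_users):
        opinion = rng.randint(1, 5)  # true opinion, in stars
        everyone.append(opinion)
        # Assumption: the more a user likes the item, the more likely
        # they are to rate it (10% chance per star).
        if rng.random() < 0.1 * opinion:
            raters.append(opinion)
    return sum(everyone) / len(everyone), sum(raters) / len(raters)


population_mean, observed_mean = simulate_selection_bias()
print(f"whole population: {population_mean:.2f} stars")
print(f"people who rated: {observed_mean:.2f} stars")
```

Under these assumptions the observed average comes out well above the population average, even though no single rating is wrong.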

    I used to be much more concerned about the second issue…


  3. [Caveat: I’m a programmer, not a researcher]

    The production recommendation systems that I’ve had experience with attempt to avoid self-reinforcing behavior by introducing a degree of randomness. In other words, you determine recommendations based on the user’s rating profile, but then you augment that with some percentage of more remotely related items and possibly even a small percentage of unrelated items. I wish I could provide evidence that this helps, but it’s mostly a hack.
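A minimal sketch of that kind of mixing, in Python. The function name `mix_recommendations`, the parameters and the 20%/10% shares are hypothetical choices for illustration, not taken from any particular production system.

```python
import random


def mix_recommendations(personalized, related, catalog, k=10,
                        related_share=0.2, random_share=0.1, rng=None):
    """Build k recommendations: mostly profile-based, plus some noise.

    personalized: items ranked from the user's rating profile, best first
    related:      more remotely related items
    catalog:      every item (source of the "unrelated" picks)
    """
    rng = rng or random.Random()
    n_random = int(k * random_share)
    n_related = int(k * related_share)
    n_personal = k - n_related - n_random

    picked = list(personalized[:n_personal])
    seen = set(picked)

    def take(pool, n):
        # Sample up to n items from pool that were not picked already.
        candidates = [item for item in pool if item not in seen]
        chosen = rng.sample(candidates, min(n, len(candidates)))
        seen.update(chosen)
        return chosen

    picked += take(related, n_related)
    picked += take(catalog, n_random)  # uniform draw from what is left
    return picked


personalized = [f"p{i}" for i in range(20)]
related = [f"r{i}" for i in range(10)]
catalog = personalized + related + [f"c{i}" for i in range(50)]
print(mix_recommendations(personalized, related, catalog,
                          rng=random.Random(1)))
```

With the defaults, seven slots come from the profile, two from the remotely related pool and one from anywhere in the rest of the catalog.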

    There are a couple of other biases in rating data though, at least in the area that I’m familiar with (music). One is the “selection bias”, or the fact that people don’t rate everything that’s presented to them but rather only things they love or hate. The other is that people’s rating behavior can differ substantially from their actual listening behavior (probably more when their rating profile is public).

    It might be possible to model users in the sense of reproducing the distribution of ratings in a dataset like Netflix’s. But I think the bigger challenge for recommendation technology right now is to capture the things we aren’t getting from users, like how to correlate mood to preferences, or how to distinguish true favorites from temporary enthusiasms.

  4. I’ve always figured sites powering recommendation systems would need to perform some sort of experimentation on their users to control for the effect of recommendations. This could include selectively omitting recommendations (perhaps altogether for certain items and/or users) to establish control groups.
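One way to set up such a control group is to hash each user ID into a stable bucket, so the same users are always held out. This is a sketch only: the function name `in_control_group`, the 5% holdout and the salt string are assumptions made for illustration.

```python
import hashlib


def in_control_group(user_id, holdout_fraction=0.05, salt="recs-holdout-v1"):
    """Return True if this user should see no recommendations.

    The assignment is deterministic: the same user_id and salt always
    land in the same group, so the control group is stable over time.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < holdout_fraction * 2**64


users = [f"user-{i}" for i in range(20_000)]
held_out = [u for u in users if in_control_group(u)]
print(f"{len(held_out) / len(users):.1%} of users are in the control group")
```

Comparing what the held-out users rate with what everyone else rates gives a rough estimate of how much the recommendations themselves shape the data.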

    Regarding Note 1, I think a simulation of human behaviour adequate to explore the consequences of ratings on human behaviour would require already knowing the answer, so that’s a circular and prohibitive way of going about things.
