Implementing a Rating-Based Item-to-Item Recommender System in PHP/SQL

Following some requests I got about the paper Slope One Predictors for Online Rating-Based Collaborative Filtering, I decided to make available a technical report which actually gives some SQL and PHP code: Implementing a Rating-Based Item-to-Item Recommender System in PHP/SQL.

Published by

Daniel Lemire

A computer science professor at the Université du Québec (TELUQ).

8 thoughts on “Implementing a Rating-Based Item-to-Item Recommender System in PHP/SQL”

  1. Hi Daniel,

    I have implemented your algorithms and created a dev table using the MovieLens 100k database. Is there any way to return similar items to item 1 rather than just items that are rated higher than it?

    If I use say:

    SELECT dev.itemid1 , dev.sum, dev.count, dev.sum / dev.count AS average, movies.movietitle
    FROM dev, movies
    WHERE dev.count > 5 AND dev.itemid2 = 98 AND dev.itemid1 = movies.movieid
    ORDER BY average DESC;

    (item 98 is One Flew Over the Cuckoo’s Nest, query takes 0.0388s) I get 3 Wallace and Grommit movies in the top 5. I presume this is because people who liked Wallace and Grommit did not like One Flew Over the Cuckoo’s Nest.

    Also, you mention in your paper that you can initialize your dev table with 0 default values for all entries and then use the predict_all function. I use the following query for that:

    SELECT dev.itemID1, movies.movietitle, sum(dev.sum + dev.count*rating.ratingvalue) / sum(dev.count) as avgrat
    FROM dev, rating, movies
    WHERE rating.userid = 1
    AND dev.count > 30
    AND dev.itemID1 < > rating.itemid
    AND dev.itemID2 = rating.itemid
    AND dev.itemID1 = movies.movieid
    GROUP BY dev.itemID1, movies.movietitle
    ORDER BY avgrat DESC
    LIMIT 100;

    (that query takes about 5 seconds with a MyISAM table, almost 200 seconds with an InnoDB table! (there are 1,967,832 rows in the dev table)). But I can’t see any reason to pre-populate with 0 values.

    I initially tried this with PostgreSQL but it was way too slow (even with triggers and functions.) I am currently writing stored procedures for MySQL 5 and hopefully they will make things even faster. What do you use on your site?

    Thanks for all the work you’ve put into this!

    Chris

  2. May I ask what you are working on Chris?

    I have implemented your algorithms and created a dev table using the MovieLens 100k database. Is there any way to return similar items to item 1 rather than just items that are rated higher than it?

    Actually, what I suggest people do is to return items that people liked even more than the current item. You could cook up a similarity measure, but most often, isn’t your goal to help people find items they will like rather than items similar to what they’ve found already? Of course, you could argue that if they’ve found a musical and they like musicals, we should suggest other musicals, but slope one is only focused on *rating-based* collaborative filtering.

    Obviously, you could throw in a correlation measure between two items in the database. Myself, I believe you should rather send people see items they are likely to like at least as much as the current item.


    If I use say:

    SELECT dev.itemid1 , dev.sum, dev.count, dev.sum / dev.count AS average, movies.movietitle

    FROM dev, movies

    WHERE dev.count > 5 AND dev.itemid2 = 98 AND dev.itemid1 = movies.movieid

    ORDER BY average DESC;

    (item 98 is One Flew Over the Cuckoo’s Nest, query takes 0.0388s) I get 3 Wallace and Grommit movies in the top 5. I presume this is because people who liked Wallace and Grommit did not like One Flew Over the Cuckoo’s Nest.

    First of all, in collaborative filtering, when someone rates an item, they, most often, like it. Rating something is a positive reaction. If you hate “Flew Over the Cuckoo”, you won’t bother to rate it, most of the time. So, what your result means is that people who rated “Flew Over the Cuckoo’s Nest” preferred “Wallace and Grommit”. In other words, people who cared enough about “Flew Over the Cuckoo’s Nest” to rate it, actually preferred “Wallace and Grommit”.

    If you think “oh, but I don’t want movies of different genres to be recommended this way”, then the best bet is to throw in some content-based techniques. For example, when the user is browsing a drama, only recommend drama. You can achieve this result by having one dev table per genre.

    Also, you mention in your paper that you can initialize your dev table with 0 default values for all entries and then use the predict_all function. I use the following query for that:

    SELECT dev.itemID1, movies.movietitle, sum(dev.sum + dev.count*rating.ratingvalue) / sum(dev.count) as avgrat

    FROM dev, rating, movies

    WHERE rating.userid = 1

    AND dev.count > 30

    AND dev.itemID1 < > rating.itemid

    AND dev.itemID2 = rating.itemid

    AND dev.itemID1 = movies.movieid

    GROUP BY dev.itemID1, movies.movietitle

    ORDER BY avgrat DESC

    LIMIT 100;

    (that query takes about 5 seconds with a MyISAM table, almost 200 seconds with an InnoDB table! (there are 1,967,832 rows in the dev table)). But I can’t see any reason to pre-populate with 0 values.

    Now that you ask, I cannot see why on Earth one would need to populate the table with zeroes. This is very puzzling to me as I always assumed you needed to have a dense matrix for this particular trick to work, but I don’t see why it is needed now. I will add a note in the tech. report to reflect this realization.

    I initially tried this with PostgreSQL but it was way too slow (even with triggers and functions.) I am currently writing stored procedures for MySQL 5 and hopefully they will make things even faster. What do you use on your site?

    inDiscover.net uses MySQL 4.x I believe.

  3. Thanks for the reply, Daniel. I am currently working on a top-secret project 😉 Details will be revealed soon…

    I think the problem of W&G being recommended when you look at One Flew Over… could also be due to there being such a few number of moview in the ML database. If there were say 100,000 movies, members’ ratings would be better “grouped” into similar items because then people would only rate the movies they liked, as you said, by searching for them rather than being presented a list to rate. But 100,000^2 makes a pretty big dev table!

    I have a few more suggestions for your paper that I will post in a few days. Still trying to get the damn stored procedures working in MySQL! Their cursor implementation seems very strange to me.

  4. This is just something I though about over the weekend regarding similar items. If there was a “genre” column it would be even better (“and r.genre = r2.genre”):

    select count(r.itemid), r.itemid, m.movietitle
    from rating r, rating r2, movies m
    where r.userid = r2.userid
    and r2.itemid = 168
    and r.itemid 168
    and r2.ratingvalue = 5
    and r.ratingvalue = 5
    and r.itemid = m.movieid
    group by r.itemid
    order by count(r.itemid) desc

Leave a Reply

Your email address will not be published. Required fields are marked *

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax