It’s about the people and linkage

General, Netflix October 13th, 2006

The last few nights I’ve been working on and off on the training_set. It’s tempting to think up elaborate algorithms and compete them against each other, but it’s not practical yet for me to do that. I still need to get a better idea of what data I have to know how to apply it, and that started the daunting task of re-indexing the data.

You see, Netflix gave us 17700 files with ratings (one file per movie.) The number of ratings varies widely- one movie only has 10 ratings (Hockey Mom,) the other movie has almost 233k (Miss Congeniality.) Also, there are 480k customers in this dataset, and 100M ratings. Indexing that information by movie makes sense; it’s the smallest number.

But the interesting stuff happens when you start looking at people, and what people do. I tried a few times to break individual customer ratings out to files, but it just took too long. The scope of the data is a little mindboggling. Going through 17700 files to break out 480000 files was too much to do on my simple computer. So before I think about ways I want to store that data, I need to find out what I’m looking at.

Each person can only rate one movie once, so at most I’m looking at 17700 * 480k ratings. That’s 8.4B ratings; this dataset has 100M. On average, that’s 203 ratings a member. But I doubt that average rings true on the whole dataset. Someone on the boards mentioned there’s a customer who has ranked all 17700 movies. That would pull the average up a bit (but then again, over 480k users, maybe not.)

So, files mostly in place, I think I need to find out what I’m looking at before I go to the next step. My next algorithm depends on associating user ratings with some notion of commonality — do 80% of the users only have 20 movies ranked? That would effect my algorithm, so I need to do the legwork first.

I also got Junior hooked on this, so it should be fun bouncing ideas off each other. (Remember, $100k club!) Also, thanks to Mike for linking in. I’ve setup a Netflix category so these posts can be easier tracked.

Netflix prize random whole numbers

General, Netflix October 11th, 2006

Discussing last night’s efforts with Andrew on the way in this morning, we were hashing out why the random algorithms always generated the same RMS values, and the fact that predictions can have decimal values while recommendations are whole integers. I realized I never tested random whole numbers against the recommendations, so I did a quick test. For whole values 1 to 5 on the probe set, I got a RMSE of 1.929. That’s better than my 0.00-5.00 attempt (2.176) but not as good as my 1.00-5.00 attempt (1.748). And that’s about what I expected.

I might need to do some poking around ACM for papers on predictive algorithms. Would Bayes help me here?

Other interesting data I have to play with is the date the recommendations were made, the title and year of the movie, and the general overall numbers for how many recommendations were made for a certain movie. I’m sure I can pull something probative out of that.

Netflix Prize night one

General, Netflix October 11th, 2006

I broke down and registered for the Netflix Prize so I could see their datasets. Here’s the results from day one of testing. If you get bored by posts where I geek out, you best browse away now. Also, if you’re thinking about tinkering with the Netflix Prize stuff, and don’t want any discovery/spoilers, you should probably skip this too. (Not that I would tell you my winning solution, but this is the start of me fleshing out a search for an algorithm. *cue the Indiana Jones music*)

The dataset is kinda fun. It’s 650MB compressed, and just over 2GB decompressed. There are 17,700 movies with anonomyzed rating data associated with it. Some movies have just a few dozen ratings, others have tens of thousands. The file formats are all text. Ratings go from 1-5 on an integer scale, (but predictions can be floats.)

The general premise of the contest is to predict ratings for movies. They “test” this against what people actually rated the movie. The difference between your prediction and the actual rating is evaluated. The root mean square of these errors (RMSE) is the metric. Their current system has a RMSE of .9514. To win a “progress prize,” an algorithm has to get down to .9419. To win the $1M, you have to do better than .8563.

They only allow one submission a week, so they give you a small subset of data (probe set) similar to what they eval you on (1.4M ratings to predict, the qualifying set is 2.8M) that you can compare against the training set to act as solutions for self-evaluation.

My first task was getting that probe set collected. A small perl script took the data out of the training set file and created one 1.4M line text file with the movie/user/rating information together on each line.

Then, it was time to test an algorithm. We talked about randomness at wings tonight, so my first stab (as a base case) was to randomly select a 0.00-5.00 value for each entry in the probe set and calculate the RMSE from there. It was 2.17647. That’s not good, but it gives lots of room to improve. I realized that users cannot select below one, so I really want rand()*4+1 to give 1.00-5.00. That yielded 1.748. Progress! The odd thing is, a 1.4M sample set negated any benefit from the randomness in the algorithm. Each time I ran it, I got almost exactly the same RMS value. I ran it several times. I guess I need to learn more about RMS, as it’s telling me that on average any random value I can come up with is going to be 1.748 away from my 1.00-5.00 target. That’s interesting; I expected more swing because of the randomness, but I guess not with 1.4M samples.

My next algoritm reduced the error distance in the RMS equation — I predicted everyone would rate every movie a 3. It’s in the middle between 1 and 5, so it seemed logical. My program generated a delta between each real rating and the number three, and my RMSE shrunk to 1.31335.

The next algorithm (you didn’t expect me to give up there, did you?) involved averages. Given the 100M or so ratings in the training set, what’s the average movie rating? Obviously, that would be closer than the median for the training set, so it should see some improvement. A small script to mine the movie averages, and then assess a global average over all movies in the training set, told me the global average was 3.61. Sounds reasonable. That value as a constant prediction yielded a RMSE of 1.12923; my best so far.

But since I have the per-movie averages, I’ll do better to recommend the individual average for that movie instead of the global average. This is significant because it’s the first time one of my algorithms looks at the movie data to determine a recommendation. It’s probably the simplest one I can do, but it’s a start. Those averages ranged from 1.29 to 4.72, and yielded a RMSE of 1.05195. If I can drop another .10 in a future iteration, I’ll not only be as good as Cinematch, I’ll also be on the Leaderboard.

I have ideas on how to improve upon these numbers, both entirely from data mined within the training set and from outside information, but from a little bit of work tonight, at least I have a starting point. And a new project. If you’re interested in helping me out, let me know. I’ll work on my homework tomorrow. ;)

links for 2006-10-11

Del.icio.us links October 10th, 2006

links for 2006-10-07

Del.icio.us links October 6th, 2006

2006-07 Illini men’s basketball schedule iCal

General October 6th, 2006

My script that takes the data off fightingillini.com and massages the men’s basketball schedule into something I can easily import into Outlook worked again this year, and surprisingly without any functional changes to the script. Using a vCal/iCal file is so much easier than entering all those games by hand. The one thing I wish I could set in an iCal file was the color/label Outlook applies. I have to go through after the import and manually set all the home games orange and the away games blue.

In a truly dynamic iCal environment, I could point my calendar program to a ‘calendar feed’ and these things would automatically appear in my calendar. Outlook 2003 doesn’t have that capability, so you have to save the file locally, and goto File / Import and Export / Import an iCalendar or vCalendar file (.vcs) and point it at the file. Maybe next year I’ll have an even more automated procedure.

2006-07 Illini Men’s basketball schedule in iCal format (right click, save as)

Of note, check out the week of November 13th… There’s a home game on Monday, Wednesday, Friday, Sunday and the next Tuesday. Woot, Illini week! Sure they’re schools no one has ever heard of, but that’s not important. (Think confidence builder on a rebuilding year.) The only exciting pre-Big1T1en game is Maryland at home. Also, my calendars didn’t include the Big10 Tourney or the post season tournaments, so be sure to add those. Enjoy!

links for 2006-10-06

Del.icio.us links October 5th, 2006

links for 2006-10-04

Del.icio.us links October 3rd, 2006

links for 2006-10-03

Del.icio.us links October 2nd, 2006

Attention crazy people

General October 2nd, 2006

Please stop shooting up schools, students, and children. Thanks.

Love,
Kenny