It’s about the people and linkage
General, Netflix October 13th, 2006
The last few nights I’ve been working on and off on the training_set. It’s tempting to think up elaborate algorithms and pit them against each other, but it’s not practical for me to do that yet. I still need a better idea of what data I have before I know how to apply it, and that started the daunting task of re-indexing the data.
You see, Netflix gave us 17700 files with ratings (one file per movie). The number of ratings varies widely: one movie has only 10 ratings (Hockey Mom), while another has almost 233k (Miss Congeniality). Also, there are 480k customers in this dataset, and 100M ratings. Indexing that information by movie makes sense; movies are the smallest of those three counts.
But the interesting stuff happens when you start looking at people, and what people do. I tried a few times to break individual customer ratings out to files, but it just took too long. The scope of the data is a little mind-boggling. Going through 17700 files to break them out into 480000 files was too much to do on my simple computer. So before I think about ways I want to store that data, I need to find out what I’m looking at.
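One way around the 480000-file problem is to hash customers into a small, fixed number of bucket files in a single pass over the movie files. The sketch below is only an illustration, not what I actually ran. It assumes each per-movie file starts with a `<movie_id>:` line followed by `<customer_id>,<rating>,<date>` lines (a layout not described in this post), and the names `parse_movie_file`, `bucket_by_customer` and `bucket_NNN.csv` are made up for the example.

```python
import os


def parse_movie_file(lines):
    """Yield (movie_id, customer_id, rating) from one per-movie file.

    Assumed layout: the first line is "<movie_id>:", every following
    line is "<customer_id>,<rating>,<date>".
    """
    lines = iter(lines)
    movie_id = int(next(lines).strip().rstrip(":"))
    for line in lines:
        line = line.strip()
        if not line:
            continue
        customer_id, rating, _date = line.split(",")
        yield movie_id, int(customer_id), int(rating)


def bucket_by_customer(movie_files, out_dir, num_buckets=64):
    """Write every rating to the bucket file chosen by customer_id % num_buckets.

    movie_files is an iterable of line iterables (open files work too).
    Returns the list of bucket file paths, indexed by bucket number.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = [os.path.join(out_dir, f"bucket_{i:03d}.csv") for i in range(num_buckets)]
    handles = [open(path, "w") for path in paths]
    try:
        for lines in movie_files:
            for movie_id, customer_id, rating in parse_movie_file(lines):
                row = f"{customer_id},{movie_id},{rating}\n"
                handles[customer_id % num_buckets].write(row)
    finally:
        for handle in handles:
            handle.close()
    return paths
```

All of one customer's ratings end up in the same bucket, so each bucket can later be loaded and grouped on its own without ever holding the full 100M ratings in memory.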
Each person can rate a given movie only once, so at most I’m looking at 17700 * 480k ratings. That’s about 8.5B possible ratings; this dataset has 100M. On average, that’s about 208 ratings a member. But I doubt that average holds across the whole dataset. Someone on the boards mentioned there’s a customer who has rated all 17700 movies. That would pull the average up a bit (but then again, over 480k users, maybe not).
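A quick sanity check on that arithmetic, using only the rounded figures quoted in this post (the exact counts in the dataset will shift the results slightly):

```python
# Rounded figures from the post, not the exact dataset counts.
MOVIES = 17_700
CUSTOMERS = 480_000
RATINGS = 100_000_000

# Upper bound: every customer rates every movie exactly once.
max_possible = MOVIES * CUSTOMERS

# Fraction of that customer-by-movie grid that is actually filled in.
density = RATINGS / max_possible

# Mean ratings per customer and per movie.
avg_per_customer = RATINGS / CUSTOMERS
avg_per_movie = RATINGS / MOVIES

# How much one customer who rated every movie contributes to the mean.
power_user_share = MOVIES / CUSTOMERS

print(f"max possible ratings: {max_possible:,}")
print(f"density:              {density:.2%}")
print(f"avg per customer:     {avg_per_customer:.1f}")
print(f"avg per movie:        {avg_per_movie:.1f}")
print(f"one power user adds:  {power_user_share:.3f} to the per-customer mean")
```

So only a little over 1% of the possible ratings exist, and a single customer who rated everything moves the per-customer average by a few hundredths of a rating.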
So, with the files mostly in place, I need to find out what I’m looking at before I go to the next step. My next algorithm depends on associating user ratings with some notion of commonality: do 80% of the users only have 20 movies rated? That would affect my algorithm, so I need to do the legwork first.
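The legwork for that question boils down to counting ratings per customer and asking what share of customers fall at or below some cutoff. A minimal sketch, assuming the customer IDs have already been pulled out of the rating files as a plain stream of integers (the function names here are hypothetical):

```python
from collections import Counter


def ratings_per_customer(customer_ids):
    """Count how many ratings each customer ID appears with."""
    return Counter(customer_ids)


def share_with_at_most(counts, threshold):
    """Fraction of customers who have `threshold` ratings or fewer."""
    if not counts:
        return 0.0
    small = sum(1 for n in counts.values() if n <= threshold)
    return small / len(counts)


# Toy data: customer 1 has 3 ratings, customer 2 has 1, customer 3 has 2.
counts = ratings_per_customer([1, 1, 1, 2, 3, 3])
print(share_with_at_most(counts, 2))  # 2 of the 3 customers have <= 2 ratings
```

On the real data the call would be `share_with_at_most(counts, 20)`, and a result near 0.8 would mean the 80% guess holds. A counter with 480k entries fits comfortably in memory even on a modest machine.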
I also got Junior hooked on this, so it should be fun bouncing ideas off each other. (Remember, $100k club!) Also, thanks to Mike for linking in. I’ve set up a Netflix category so these posts can be tracked more easily.