Netflix rental statistics

Entertainment, Technology March 6th, 2005

More data mining….

Last week, I discovered Netflix can email you a history of all the movies you’ve rented, and when they were checked out and checked in. I thought, “Neat, I can use this to tell how many movies I’ve rented and what my average rental time for a movie is.” So I started writing a parser. Tonight, I’ve been working on adapting that parser to the web so other people can run stats on their Netflix histories.

You can see an example of what it does here. The sample is just a small subset of my history. I’ve run it on the full thing, and I know my average turnaround time is about 27 days and that each movie over my entire history with Netflix has cost me just under $7. I also know the shortest turnaround time from Champaign to St. Louis, even if I watch them as soon as I get them and put them in the mail the next day, is 7 days. Pathetic, eh?
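Here is a rough sketch of how numbers like these can be computed. It is not the parser itself: the input shape (a list of shipped/returned date pairs), the function name `rental_stats`, and the $18.00 monthly fee are all assumptions made up for illustration, and the sample dates are invented.

```python
from datetime import date


def rental_stats(rentals, months_subscribed, monthly_fee):
    """Summarize completed rentals.

    rentals: list of (shipped, returned) date pairs (hypothetical format).
    months_subscribed, monthly_fee: used only for the cost-per-movie figure.
    """
    days_out = [(returned - shipped).days for shipped, returned in rentals]
    return {
        "movies": len(rentals),
        "avg_days_out": sum(days_out) / len(days_out),
        "shortest_days_out": min(days_out),
        # Total fees paid divided by the number of movies rented.
        "cost_per_movie": months_subscribed * monthly_fee / len(rentals),
    }


# Invented sample data, not a real rental history.
sample = [
    (date(2005, 1, 3), date(2005, 1, 10)),
    (date(2005, 1, 3), date(2005, 2, 2)),
    (date(2005, 1, 12), date(2005, 2, 14)),
]

# The fee here is a placeholder, not an actual plan price.
stats = rental_stats(sample, months_subscribed=2, monthly_fee=18.00)
for name, value in stats.items():
    print(name, value)
```

The cost figure falls as the movie count rises for the same number of months, which is the whole argument for turning discs around faster.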

What I hope to get out of this parser is a better understanding of how I use Netflix, and hopefully a push to use it more so it becomes more economical (the more movies I rent per month, the more value I get out of it). Another fun thing to do would be to take my rental ‘out’ times and work out what the exact same rentals would have cost me at Blockbuster, to see whether Netflix, Blockbuster online, or traditional Blockbuster rentals (with or without late fees) would be the better deal for me.

If you are a Netflix subscriber and want to help me test it, or think it should have new stats/features, please let me know. The only “bug” I’m aware of at this time is it assumes a 3-out plan. I should fix that to either figure out how many movies you have out by the history, or prompt the user. I’ll probably clean it up a bit and lure some of the Netflix bloggers to it and see what they think.
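One way to work out the plan size from the history instead of assuming 3-out is to find the largest number of movies that were ever out at the same time. The sketch below is only an illustration of that idea, not the fix that went into the parser. The function name `max_movies_out` and the input format are assumptions, and it assumes a disc returned on a given day frees its slot before anything new ships that same day.

```python
from datetime import date


def max_movies_out(rentals):
    """Return the peak number of movies out at once.

    rentals: list of (shipped, returned) date pairs; returned may be None
    for a movie that is still out (hypothetical format).
    """
    events = []
    for shipped, returned in rentals:
        events.append((shipped, 1))        # a movie goes out
        if returned is not None:
            events.append((returned, -1))  # a movie comes back

    # Tuples sort by date first, then by delta, so on a shared date the
    # returns (-1) are processed before the shipments (+1).
    events.sort()

    out = peak = 0
    for _, delta in events:
        out += delta
        peak = max(peak, out)
    return peak


# Invented sample: three discs out, one comes back, a replacement ships.
sample = [
    (date(2005, 1, 3), date(2005, 1, 10)),
    (date(2005, 1, 3), date(2005, 1, 20)),
    (date(2005, 1, 4), date(2005, 1, 25)),
    (date(2005, 1, 10), date(2005, 2, 1)),
]
print(max_movies_out(sample))
```

The peak is only a lower bound on the plan size: someone on a larger plan who never fills every slot would be undercounted, so prompting the user is still a reasonable fallback.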

Geeky, but cool? Fun, and useful? I can live with that.

New pee-sea

Education & Development, Science & Nature March 6th, 2005

I used to get a new computer once a year. That was back when I was self-employed and needed the tax write-offs (and the newer technology was fun). But that stopped in 1998, and I haven’t bought a new computer for my home use since. I used that 1998 computer up until the move this last summer; something with the motherboard/CPU didn’t POST right when I tried to bring it up in my new house. (I hope that it’s only a mobo issue, and my drives are still working okay. I don’t have good backups, and that’s one of the first things I want to address.) So, I fixed that today and bought a new PC.

Okay, technically, I bought the pieces to a new PC. New motherboard, CPU, memory and case (since all of those were required), plus a DVD burner and new keyboard/mouse. I could have bought a whole new cheap machine from Dell, but I got quality pieces (read: not Celeron/Sempron and the mobo has SATA). I’ll upgrade the hard drive and video card later – what I bought was just enough to get me back up and running at home. I used the Ars Technica buying guides and basically spec’d the pieces from the Budget Box. Plus there’s something cool and fun about building a computer from pieces. (Although when I’m swearing because stuff isn’t playing nice together or I didn’t order the right stuff, you can laugh at me.)

The stuff should be here in a few days, and I’m sure I’ll post more when I get it up and running. I went cheap this run because I ultimately want to get a laptop for home use (daaaaamn you Kresl) but that’s going to wait until grad school.

Let’s get it started…

Quotes, Sports & Leisure March 6th, 2005

I’ll let the sports websites recap the Illini’s amazing regular season, which ended today. We were responsible for more records broken and streaks shattered than I could have dreamed possible back in November. I knew we were good, but we’re damn good, and it’s been a lot of fun being an Illini fan this year.

Twenty-nine and one is respectable. The loss today, our first in a long, long time, is a little sad considering we could have gone perfect. But the loss won’t hurt us. We should still be ranked #1 tomorrow (although I doubt with all the votes), and we still won the Big Ten outright and will have the #1 seed in the Big Ten tourney this coming weekend. We will still be a #1 seed on Selection Sunday (just seven days away!) and we’ll roll all the way through Indy, Chicago, and St. Louis. I leave you with Coach Weber:

“Being undefeated was never one of our goals, it just kind of snuck in there,” said Weber. “We’ll learn from this and move on. The next stretch is the most important of the year. That’s what people are going to remember.

“The way I look at it, you’re going to lose sometime. Better to lose now than three weeks from now.”

Spam control

Science & Nature, Site/Blog, Work March 6th, 2005

I’m going to geek out here, so if you don’t like the data-mining nerd in me, move along now.

Campus is getting close to announcing their spam control solution for @uiuc.edu. DCS is going to set up something similar for mail going to @cs.uiuc.edu. Chuck is upgrading our SpamAssassin installation this week so we get even better filtering and Bayesian analysis. So, I figured I would generate some pre-upgrade stats to look back on later.

A CITES security brief email Friday afternoon told us their new anti-virus filtering on anything to @uiuc.edu deleted over 13,000 viruses from nearly 800k emails in the 24 hours before that message. This is excellent, and I’m glad to see campus is making progress towards keeping viruses out of our mailboxes. (Forget about the few people in the CS department who actually enjoy getting viruses because they analyze them. For everyone else, it’s just mailbox clutter and wasted disk space.)

I took my current SpamAssassin filtered spam mailbox and crunched some numbers on it. I flag anything with an SA value over 4 to go to this box. The first messages in this box appeared to be from January 1st, 2005 (the last time I recycled the mailbox). In each row, the first number is the SA spam level and the second is the count of messages at that level. The rest of the row shows that level’s share of the total, the cumulative share so far, and the share left at higher levels.

Total spam: 2916
2       1       (0.03%)         so far 0.03%    left 100%
3       1       (0.03%)         so far 0.07%    left 100%
4       125     (4.29%)         so far 4.36%    left 96%
5       200     (6.86%)         so far 11.21%   left 89%
6       204     (7.00%)         so far 18.21%   left 82%
7       256     (8.78%)         so far 26.99%   left 73%
8       317     (10.87%)        so far 37.86%   left 62%
9       251     (8.61%)         so far 46.47%   left 54%
10      261     (8.95%)         so far 55.42%   left 45%
11      237     (8.13%)         so far 63.55%   left 36%
12      191     (6.55%)         so far 70.10%   left 30%
13      172     (5.90%)         so far 75.99%   left 24%
14      154     (5.28%)         so far 81.28%   left 19%
15      103     (3.53%)         so far 84.81%   left 15%
16      93      (3.19%)         so far 88.00%   left 12%
17      70      (2.40%)         so far 90.40%   left 10%
18      72      (2.47%)         so far 92.87%   left  7%
19      50      (1.71%)         so far 94.58%   left  5%
20      24      (0.82%)         so far 95.40%   left  5%
21      33      (1.13%)         so far 96.54%   left  3%
22      11      (0.38%)         so far 96.91%   left  3%
23      14      (0.48%)         so far 97.39%   left  3%
24      22      (0.75%)         so far 98.15%   left  2%
25      11      (0.38%)         so far 98.53%   left  1%
26      11      (0.38%)         so far 98.90%   left  1%
27      6       (0.21%)         so far 99.11%   left  1%
28      10      (0.34%)         so far 99.45%   left  1%
29      2       (0.07%)         so far 99.52%   left  0%
30      2       (0.07%)         so far 99.59%   left  0%
31      1       (0.03%)         so far 99.62%   left  0%
32      2       (0.07%)         so far 99.69%   left  0%
50      9       (0.31%)         so far 100.00%  left  0%
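A table like this can be rebuilt from a list of spam levels with a running total. The following is a minimal sketch, not the script that produced the numbers above. The function names are made up, and the mbox reader assumes each message carries SpamAssassin's `X-Spam-Level` header with one asterisk per point, which depends on how the local installation is configured.

```python
import mailbox
from collections import Counter


def levels_from_mbox(path):
    """Yield one integer spam level per message in an mbox file.

    Assumes an X-Spam-Level header made of asterisks; messages without
    the header count as level 0.
    """
    for msg in mailbox.mbox(path):
        yield str(msg.get("X-Spam-Level", "")).count("*")


def spam_level_table(levels):
    """Return (total, rows); each row is
    (level, count, percent, cumulative percent, percent left)."""
    counts = Counter(levels)
    total = sum(counts.values())
    rows = []
    running = 0
    for level in sorted(counts):
        running += counts[level]
        rows.append((
            level,
            counts[level],
            100 * counts[level] / total,
            100 * running / total,
            100 * (total - running) / total,
        ))
    return total, rows


def format_table(total, rows):
    """Render the rows in roughly the layout used in the post."""
    lines = [f"Total spam: {total}"]
    for level, count, pct, so_far, left in rows:
        lines.append(
            f"{level}\t{count}\t({pct:.2f}%)\tso far {so_far:.2f}%\tleft {left:.0f}%"
        )
    return "\n".join(lines)


# Tiny invented input; a real run would pass levels_from_mbox("some.mbox").
total, rows = spam_level_table([4, 4, 5, 12])
print(format_table(total, rows))
```

Keeping the counting separate from the mbox reading makes it easy to feed the same function from a different source, such as scores pulled out of a log.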

I’m thinking of configuring procmail to send all the spam with levels over 12 straight to the bit bucket. Going by the table, that would eliminate about 30% of my spam (872 of the 2,916 messages). Do any of you do level filtering on SA tagged messages? If so, what are your thresholds? If you sort the messages into different mailboxes, what’s your thinking for that, and how often do you review them?
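To compare candidate cutoffs, the counts in the table can be summed above each threshold. This is a small sketch using the counts copied from the table; the function name `share_above` is made up for illustration, and "over" is read as strictly greater than the threshold.

```python
# Spam level -> message count, copied from the table in this post.
COUNTS = {
    2: 1, 3: 1, 4: 125, 5: 200, 6: 204, 7: 256, 8: 317, 9: 251,
    10: 261, 11: 237, 12: 191, 13: 172, 14: 154, 15: 103, 16: 93,
    17: 70, 18: 72, 19: 50, 20: 24, 21: 33, 22: 11, 23: 14, 24: 22,
    25: 11, 26: 11, 27: 6, 28: 10, 29: 2, 30: 2, 31: 1, 32: 2, 50: 9,
}


def share_above(counts, threshold):
    """Return (messages, percent) with a level strictly above threshold."""
    total = sum(counts.values())
    dropped = sum(n for level, n in counts.items() if level > threshold)
    return dropped, 100 * dropped / total


for threshold in (10, 12, 13, 15):
    dropped, pct = share_above(COUNTS, threshold)
    print(f"over {threshold}: {dropped} messages ({pct:.1f}%)")
```

With these counts, a cutoff of 12 drops 872 messages (about 29.9%) and a cutoff of 13 drops 700 (about 24.0%).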

I also tried to look at which email addresses that eventually get delivered to my box (all of the *.uiuc.edu ones, and several service aliases in cs.uiuc.edu) get the most spam. For example, could I block anything to ews.uiuc.edu because I don’t use that address anymore? But, I found that to be a harder analysis and I haven’t finished it yet. If you’re interested in running my script on your email, let me know. It scans a mailbox in mbox format and prints out the data.
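A first pass at the per-address question could simply count recipient domains from the visible headers. The sketch below is an illustration under assumptions, not the unfinished script: the function name is made up, and it only reads `To` and `Cc`. Much spam never lists the real recipient there, so headers such as `Delivered-To` or `Received` may be needed, which is part of why this analysis is harder.

```python
import email
from collections import Counter
from email.utils import getaddresses


def spam_by_recipient_domain(messages):
    """Count messages per recipient domain found in To/Cc headers.

    Each message counts at most once per domain, even if several
    addresses in that domain are listed.
    """
    counts = Counter()
    for msg in messages:
        headers = [str(h) for h in msg.get_all("To", []) + msg.get_all("Cc", [])]
        domains = {
            addr.rpartition("@")[2].lower()
            for _, addr in getaddresses(headers)
            if "@" in addr
        }
        counts.update(domains)
    return counts


# Invented messages for demonstration; a real run could iterate over
# mailbox.mbox(path) from the standard library instead.
demo_messages = [
    email.message_from_string("To: Me <me@ews.uiuc.edu>, you@cs.uiuc.edu\n\nbody"),
    email.message_from_string("To: me@EWS.uiuc.edu\nCc: other@example.com\n\nbody"),
]
demo_counts = spam_by_recipient_domain(demo_messages)
for domain, n in demo_counts.most_common():
    print(domain, n)
```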