Spam control
Science & Nature, Site/Blog, Work March 6th, 2005
I’m going to geek out here, so if you don’t like the data-mining nerd in me, move along now.
Campus is getting close to announcing their spam control solution for @uiuc.edu. DCS is going to set up something similar for mail going to @cs.uiuc.edu. Chuck is upgrading our SpamAssassin installation this week so we get even better filtering and Bayesian analysis. So I figured I would generate some stats now, pre-upgrade, to look back on later.
A CITES security brief email Friday afternoon told us their new anti-virus filtering on anything to @uiuc.edu deleted over 13,000 viruses from nearly 800k emails in the 24 hours before that message. This is excellent, and I’m glad to see campus is making progress towards keeping viruses out of our mailboxes. (Forget about the few people in the CS department who actually enjoy getting viruses because they analyze them. For everyone else, it’s just mailbox clutter and wasted disk space.)
I took my current SpamAssassin-filtered spam mailbox and crunched some numbers on it. I flag anything with an SA value over 4 to go to this box. The first messages in this box appeared to be from January 1st, 2005 (the last time I recycled the mailbox). In each row, the first number is the SA spam level and the second is the count. The rest of the numbers show how that level fits in statistically: its share of the total, the cumulative share so far, and the share left above that level.
Total spam: 2916
2 1 (0.03%) so far 0.03% left 100%
3 1 (0.03%) so far 0.07% left 100%
4 125 (4.29%) so far 4.36% left 96%
5 200 (6.86%) so far 11.21% left 89%
6 204 (7.00%) so far 18.21% left 82%
7 256 (8.78%) so far 26.99% left 73%
8 317 (10.87%) so far 37.86% left 62%
9 251 (8.61%) so far 46.47% left 54%
10 261 (8.95%) so far 55.42% left 45%
11 237 (8.13%) so far 63.55% left 36%
12 191 (6.55%) so far 70.10% left 30%
13 172 (5.90%) so far 75.99% left 24%
14 154 (5.28%) so far 81.28% left 19%
15 103 (3.53%) so far 84.81% left 15%
16 93 (3.19%) so far 88.00% left 12%
17 70 (2.40%) so far 90.40% left 10%
18 72 (2.47%) so far 92.87% left 7%
19 50 (1.71%) so far 94.58% left 5%
20 24 (0.82%) so far 95.40% left 5%
21 33 (1.13%) so far 96.54% left 3%
22 11 (0.38%) so far 96.91% left 3%
23 14 (0.48%) so far 97.39% left 3%
24 22 (0.75%) so far 98.15% left 2%
25 11 (0.38%) so far 98.53% left 1%
26 11 (0.38%) so far 98.90% left 1%
27 6 (0.21%) so far 99.11% left 1%
28 10 (0.34%) so far 99.45% left 1%
29 2 (0.07%) so far 99.52% left 0%
30 2 (0.07%) so far 99.59% left 0%
31 1 (0.03%) so far 99.62% left 0%
32 2 (0.07%) so far 99.69% left 0%
50 9 (0.31%) so far 100.00% left 0%
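For anyone who wants to check the arithmetic, here is a minimal Python sketch of how the per-level share, the running "so far" percentage, and the "left" percentage can be derived from the raw counts. It is not my actual script; the function name `level_table` and the output formatting are just assumptions for illustration, and the counts are copied from the numbers above.

```python
def level_table(counts):
    """Return (level, count, pct, so_far_pct, left_pct) rows, sorted by level.

    counts maps an integer SA spam level to the number of messages at it.
    """
    total = sum(counts.values())
    rows = []
    running = 0
    for level in sorted(counts):
        running += counts[level]
        pct = 100.0 * counts[level] / total   # this level's share of all spam
        so_far = 100.0 * running / total      # cumulative share up to this level
        rows.append((level, counts[level], pct, so_far, 100.0 - so_far))
    return rows


# Counts per SA level, copied from the stats in this post.
counts = {
    2: 1, 3: 1, 4: 125, 5: 200, 6: 204, 7: 256, 8: 317, 9: 251,
    10: 261, 11: 237, 12: 191, 13: 172, 14: 154, 15: 103, 16: 93,
    17: 70, 18: 72, 19: 50, 20: 24, 21: 33, 22: 11, 23: 14, 24: 22,
    25: 11, 26: 11, 27: 6, 28: 10, 29: 2, 30: 2, 31: 1, 32: 2, 50: 9,
}

print(f"Total spam: {sum(counts.values())}")
for level, n, pct, so_far, left in level_table(counts):
    print(f"{level} {n} ({pct:.2f}%) so far {so_far:.2f}% left {left:.0f}%")
```

The "left" column is simply 100% minus the cumulative share, so it tells you what fraction of the box sits strictly above a given level.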
I’m thinking of configuring procmail to send all the spam with levels over 12 straight to the bit bucket. That would eliminate roughly 25 to 30% of my spam (30% if level 13 is included, 24% if the cut starts at 14). Do any of you do level filtering on SA-tagged messages? If so, what are your thresholds? If you sort the messages into different mailboxes, what’s your thinking for that, and how often do you review them?
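To make the idea concrete, here is a rough Python sketch of the routing decision (not a procmail recipe, and not something I am running). It assumes the message carries the `X-Spam-Level` header that SpamAssassin typically adds, a run of asterisks with one per whole point of score. The function names `spam_level` and `route` and the bucket labels are made up for illustration; the thresholds of 4 and 12 are the ones discussed above.

```python
import email


def spam_level(msg):
    """Count the asterisks in X-Spam-Level (0 if the header is missing)."""
    return (msg.get("X-Spam-Level") or "").count("*")


def route(msg, spam_threshold=4, discard_threshold=12):
    """Pick a destination for a message based on its SA spam level.

    Levels over discard_threshold are thrown away, levels over
    spam_threshold go to the spam mailbox, everything else is delivered.
    """
    level = spam_level(msg)
    if level > discard_threshold:
        return "discard"
    if level > spam_threshold:
        return "spam"
    return "inbox"


# Hypothetical sample message with a spam level of 13.
sample = email.message_from_string(
    "X-Spam-Level: *************\nSubject: cheap pills\n\nbody\n"
)
print(spam_level(sample), route(sample))
```

Keeping the two thresholds as parameters makes it easy to try a more cautious cut-off first and only tighten it after reviewing what would have been discarded.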
I also tried to look at which of the email addresses that eventually get delivered to my box (all of the *.uiuc.edu ones, and several service aliases in cs.uiuc.edu) get the most spam. For example, could I block anything to ews.uiuc.edu because I don’t use that address anymore? But I found that to be a harder analysis, and I haven’t finished it yet. If you’re interested in running my script on your email, let me know. It scans a mailbox in mbox format and prints out the data.
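As a starting point for the per-address question, here is a hedged Python sketch (again, not my actual script) that walks an mbox file with the standard `mailbox` module and counts messages per recipient domain. It only looks at the `To` and `Cc` headers, which is an assumption and probably part of why this analysis is harder: spam often hides the real delivery address, so a fuller version would likely need to dig through `Received` headers too. The names `tally_recipient_domains` and `scan_mbox` are hypothetical.

```python
import mailbox
from collections import Counter
from email.utils import getaddresses


def tally_recipient_domains(messages):
    """Count recipient domains seen in the To and Cc headers of each message."""
    domains = Counter()
    for msg in messages:
        headers = msg.get_all("To", []) + msg.get_all("Cc", [])
        # Header values may be Header objects rather than str; coerce first.
        for _name, addr in getaddresses([str(h) for h in headers]):
            if "@" in addr:
                domains[addr.rsplit("@", 1)[1].lower()] += 1
    return domains


def scan_mbox(path):
    """Open an mbox file read-only and tally its recipient domains."""
    box = mailbox.mbox(path, create=False)
    try:
        return tally_recipient_domains(box)
    finally:
        box.close()


def print_report(domains):
    """Print domains from most to least spammed."""
    for domain, n in domains.most_common():
        print(f"{n:6d}  {domain}")
```

Usage would be something like `print_report(scan_mbox("/path/to/spam-mailbox"))`, where the path is whatever mbox file your spam gets filed into.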