I have developed a style of dealing with spam over the years based on a couple of unfortunate facts:
It may not be much, but I've been attila@stalphonsos.com for a pretty long time now, and snl@cluefactory.com for nearly as long. It is convenient and useful to me to keep my personal and work-related hacking activities sorted out that way, and I've spent a long time using those identities. I'm not going to start morphing my email addresses or changing them every so often just because of spam.
Alright, so much for the soap-box. The main problem for me is false positives: mail misclassified as spam. False negatives don't happen that often and are easily dealt with using the nifty spam command that I wrote as an extension to flail. It runs whatever messages you select (by default the current message) through bmf the right way. If you say:
flail> spam/re
then we run the current message through bmf -S (to tell it about a false negative). In practice I only get a few false negatives a week, mainly due to OpenBSD's built-in support for grey-listing combined with qmail's generally high level of quality. Whatever is left is generally caught by bmf, and I hardly ever have to spam/re anything.
However the false positives are sitting there in a spam folder, waiting to be discovered. I have designed the entire process of fetching mail from my various mail spools to deal with the process of helping me find these false positives.
Once in a while a false positive is missed. It happens very rarely, but in my experience it is frequently painful and embarrassing when it does. It has on occasion been so painful and embarrassing, and disk is getting so cheap, that I see no reason why I should ever throw spam away. Spam compresses nicely (you could probably derive a reliable spam-or-not-spam metric based on the compression ratio of various algorithms over the text).
Furthermore, there is a utilitarian angle to keeping all of your spam: sometimes you need to re-train your statistical spam filter. For instance, my laptop recently died, and I lost my ~/.bmf directory. However, I had backups of known-to-be-spam archives since forever, so I just did:
$ zcat /spam/archive/spam*.gz | bmf -s
to get myself back to more or less the same state I was in when I lost stuff.
Incoming mail is vetted into two rough piles: spam and non-spam. Spam is kept in numbered spam folders, e.g. spam314, spam315, where each folder only stores some maximum number of messages and/or grows to some maximum size; non-spam goes into my inbox. The latest spam folders are periodically scanned for likely false positives, which I must then go and deal with individually to stop warnings about possible false positives from irritating me. Over time old spam folders are presumed to contain 100% spam and are compressed and archived, but not thrown away, just in case...
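The archiving step at the end of that cycle could be sketched roughly as follows. This is only an illustration, not something that ships with flail: the function name archive_old_spam, its arguments, and the default of leaving the five newest folders alone are all assumptions made for the example. It also assumes folder numbers carry no leading zeros.

```shell
#!/bin/sh
# Hypothetical helper (not part of flail): compress every numbered spam
# folder except the newest few, and move the result into an archive dir.
#   $1 = directory holding the spamNNN mbox files
#   $2 = archive directory (created if missing)
#   $3 = how many of the newest folders to leave alone (assumed default: 5)
archive_old_spam() {
    src=$1
    dest=$2
    keep=${3:-5}
    newest=0
    # First pass: find the highest folder number in use.
    for f in "$src"/spam[0-9]*; do
        n=${f##*/spam}
        case $n in
            ''|*[!0-9]*) continue ;;   # ignore anything that is not spam<digits>
        esac
        if [ "$n" -gt "$newest" ]; then
            newest=$n
        fi
    done
    mkdir -p "$dest"
    # Second pass: gzip the old folders, removing each original only if
    # the compressed copy was written successfully.
    for f in "$src"/spam[0-9]*; do
        n=${f##*/spam}
        case $n in
            ''|*[!0-9]*) continue ;;
        esac
        if [ "$n" -le $((newest - keep)) ]; then
            gzip -c "$f" > "$dest/spam$n.gz" && rm "$f"
        fi
    done
}
```

Something like archive_old_spam $HOME/Mail /spam/archive would then leave behind spamNNN.gz files of the kind that can be fed back through zcat when the filter needs retraining.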
I fetch mail from my various accounts using fetchmail over one of my poor man's VPN setups. My config (suitably sanitized) looks something like:
poll cluefactory via localhost.cluefactory.com
    with proto imap port 30143
    authenticate password user "myusername" with pass "password" is "attila" here
    no keep mda "/usr/bin/maildrop -d %T"
# ... for gmail, we do:
poll gmail via imap.gmail.com
    with proto imap and options no dns
    user "cluefactory@gmail.com" there with password "password" is "attila" here
    keep mda "/usr/bin/maildrop -d %T" options ssl
In all cases fetchmail invokes maildrop on every message I get. Maildrop, in turn, has its own config file, misleadingly called ~/.mailfilter:
exception {
    `bmf`
    if ( $RETURNCODE == 0 )
        to `latest_spam_folder`
}
This involves a script that comes with flail called latest_spam_folder. This script decides when the current numbered spam folder is too big. As long as e.g. spam123 is below 1.5MB (in the default setup), it continues to return spam123. When it becomes larger, it starts to respond with spam124. The effect of this maildrop config, therefore, is to throw anything that bmf claims is spam into the latest numbered spam folder. Everything else ends up in my inbox.
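The rollover behaviour described above can be sketched in a few lines of shell. To be clear, this is not the latest_spam_folder script that comes with flail, just a guess at the shape of the logic: the function name, its arguments, and the assumption that folders are plain files named spam followed by digits (no leading zeros) are all made up for the example. The 1.5MB default is taken from the description above.

```shell
#!/bin/sh
# Hypothetical sketch of the rollover logic; not the script shipped with flail.
# Prints the name of the spam folder that new spam should be delivered to.
#   $1 = directory holding the spamNNN mbox files
#   $2 = size limit in bytes (defaults to 1.5MB)
latest_spam_folder_sketch() {
    dir=$1
    limit=${2:-1572864}
    latest=0
    # Find the highest-numbered spamNNN folder that exists.
    for f in "$dir"/spam[0-9]*; do
        [ -f "$f" ] || continue
        n=${f##*/spam}
        case $n in
            ''|*[!0-9]*) continue ;;   # skip spam12.gz and the like
        esac
        if [ "$n" -gt "$latest" ]; then
            latest=$n
        fi
    done
    # Move on to the next number once the current folder exceeds the limit.
    if [ -f "$dir/spam$latest" ]; then
        size=$(($(wc -c < "$dir/spam$latest")))
        if [ "$size" -gt "$limit" ]; then
            latest=$((latest + 1))
        fi
    fi
    echo "spam$latest"
}
```

Because the answer is recomputed on every delivery, nothing has to remember which folder is current; the files on disk are the only state.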
This deceptively simple scheme turns out to be extremely useful, and is also necessary given that flail's default storage format is the venerable old Unix chestnut mbox. Yes, that's right. Mbox. I know all about Maildir and agree it is swell, but flail was born of haste, is a child of the profane, and a dirty damn shame to look at to boot. Mbox sucks but I have lots of dirty, sweaty, guilty little tools that know how to deal with it.
Alright, so I'm old. Be that as it may, mbox files with more than, say, a couple hundred messages in them suck. A lot. The numbered-folder scheme does help keep nice, neat little boxes around all the spam, too. I mean if you're going to keep it forever, you might as well put it in a box.
So, now we have to go fishing in those boxes after the fact and find likely false positives. That's what the spamfish.pl script is for, from the examples that come with flail. If you don't give spamfish any arguments, it will run latest_spam_folder and go fishing in it. If you give it folder names (or just numbers) as arguments it will fish in those instead. In either case, it does its fishing based on your own autofile settings, a sample of which comes in autofile_config.pl. Spamfish loads autofile_config.pl and makes note of any message in the folders it examines that matches any of those regexps. It spits out a note on each match, including the folder and message number.
This could be greatly improved. I should hack spamfish to check for addresses from your flail address book (or any other address book). It could do a great many other things, too, but just this much tends to catch everything important.
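To make the idea concrete, here is a rough stand-in for what spamfish does, written in shell and awk rather than Perl. It is a sketch under several assumptions and not the real spamfish.pl: it takes a single regexp on the command line instead of loading autofile_config.pl, it only looks at the From, To, Cc and Subject headers, and it numbers messages from 1, which may not match flail's numbering. The name spamfish_sketch is made up.

```shell
#!/bin/sh
# Hypothetical stand-in for spamfish (not the real spamfish.pl).
#   $1 = extended regexp to look for (passed through awk -v, so backslash
#        escapes are interpreted; use [.] rather than \. for a literal dot)
#   remaining arguments = mbox files to fish in
spamfish_sketch() {
    pattern=$1
    shift
    for folder in "$@"; do
        awk -v pat="$pattern" -v folder="${folder##*/}" '
            # An mbox "From " line starts a new message and its headers.
            /^From / { msg++; in_header = 1; next }
            # The first blank line ends the headers.
            in_header && /^$/ { in_header = 0 }
            # Report a matching header with the folder and message number.
            in_header && /^(From|To|Cc|Subject):/ && $0 ~ pat {
                printf "%s #%d: %s\n", folder, msg, $0
            }
        ' "$folder"
    done
}
```

Running spamfish_sketch 'boss@example[.]org' spam123 prints one line per matching header, or nothing at all if the folder looks clean.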
To give myself the most control over what is going on, I wrote a script that is intended to be run from cron. My crontab file looks like:
MAILTO=attila
PATH=/home/attila/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin
# m h dom mon dow command
*/15 * * * * mail_grinder
This arranges to run my mail_grinder script every 15 minutes. Note the MAILTO environment variable; under Linux, at any rate, cron will not send you email with output unless you set it. By doing so, I have arranged to get any output produced by mail_grinder delivered to my system mail spool. This is, coincidentally, where all of my other mail from fetchmail will arrive, too.
The mail_grinder script that comes with flail is still a bit lame, so here's my improved one until the next release:
#!/bin/sh
##
# Time-stamp: <2008-08-03 17:19:18 attila@stalphonsos.com>
##
BASE=${HOME-/home/attila}
WORK=$BASE/tmp
TEE=""
if [ x"$1" = "x-f" ]; then
    TEE="tee"
    echo $0: copying output to stdout
fi
##
export PATH=/home/attila/bin:/usr/bin:/usr/sbin:/bin:/sbin:/usr/local/bin:/usr/local/sbin
MGPID=$WORK/mail_grinder.pid
FMLOG=$WORK/fetchmail.log
SFLOG=$WORK/spamfish.log
##
MGPROC=""
[ -f $MGPID ] && MGPROC=`cat $MGPID`
MGRUNNING=0
if [ "x$MGPROC" != "x" ]; then
    echo $0: checking mail_grinder pid $MGPROC ...
    (kill -0 $MGPROC >/dev/null 2>&1) && MGRUNNING=1
    if [ $MGRUNNING = 0 ]; then
        echo $0: removing stale $MGPID file pid $MGPROC
        rm $MGPID
    fi
fi
##
if [ $MGRUNNING = 1 ]; then
    echo $0: already running, pid $MGPROC - skipped ...
    ps wwu $MGPROC
else
    echo $$ > $MGPID
    if [ x"$TEE" != x ]; then
        (fetchmail -v 2>&1) | $TEE $FMLOG
    else
        fetchmail -v > $FMLOG 2>&1
    fi
    #cat $FMLOG
    if [ x"$TEE" != "x" ]; then
        (spamfish -nodots -nmd -recent 2>&1) | $TEE $SFLOG
    else
        spamfish -nodots -nmd -recent >$SFLOG 2>&1
        cat $SFLOG
    fi
    #grep -v MAILER $SFLOG
    rm $MGPID
fi
exit 0
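A couple niceties of this script: it keeps a pid file so that only one copy runs at a time (cleaning up after a stale one if need be), and the -f flag copies the output to stdout as well as to the log files.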
This means I can always do
$ mail_grinder -f
from the command line if I am urgently awaiting some piece of email to arrive, without fear that it will clash in any way with the version about to run from cron; if it is already running from cron then I will be informed of this fact.
This setup is able to deal with fairly large quantities of spam. If mail_grinder produces no output, that means that spamfish saw nothing worth noting; no news is good news. If mail_grinder says something, then I get an email with what it said. These emails tell me the spam folder and message number in that folder, along with what part of the message matched.
Typically, I go look at each one of the things that spamfish tells me about. I use the cd command to go into that spam folder, and the cat command to examine messages, and the mark command to mark false positives. Assuming there are any false positives, I then do something like:
flail> spam/re/no -marked
to run the spam/re/no command over all marked messages. This command runs each message through bmf -N to reclassify it as non-spam, then mv's it into my inbox folder.
In the case where the matches are actually spam and not false positives, I do one of two things: remove the message or move it into another folder that isn't searched by spamfish. Generally it doesn't hurt to delete the odd true spam or two. Who cares. There's plenty around. Once in a while a spam is interesting for some other reason, so I stash it somewhere.
Finally, I flush the changes and go back to my inbox:
flail> sync
flail> incoming
That's it. False negatives appear in my inbox and are reclassified by hand. False positives are sussed out by spamfish, assuming that I keep my autofile_config.pl up to date. By using fetchmail to grab my mail I can get by using flail only to read my local mail spool, which is fine by me.