
June 11, 2008

My email setup

by fluffy at 5:26 PM
Okay, lately I've seen a lot of people hating on their crappy email programs' spam filters and saying they were just going to switch to GMail for their email. Personally I think GMail still kind of sucks for a variety of reasons, and I rather like my email setup so I might as well explain to people what I do.

Note that this is probably much more fiddly and geeky than most people want to deal with, and it requires an email host which allows you to use IMAP, procmail, and custom filters (or at least specifically bogofilter).

Note that I don't specifically have anything against GMail, it's just that I like to be in control of my hosting setup and spam filtration and so on, and I've noticed that GMail's spam filter is a bit less-than-stellar. (Certainly it's better than pretty much every client-side filter, of course, but I've had much better luck with my setup.)

Getting the mail

IMAP is king. IMAP is based on sync rather than fetch. If you read a message on one system it gets marked read on the server, and then it shows up as read on other clients as well. You don't have to worry about your hard drive or mobile device filling up because devices just cache the stuff you're actually looking at. IMAP is great. Yes, POP sucks. IMAP doesn't suck. Some IMAP clients treat it like POP though, and that sucks. But most IMAP clients treat IMAP as IMAP. Use IMAP. Even if you switch to GMail, if you want to keep using a regular email client with GMail, use IMAP, not POP. Seriously.

Spam filtering

Classification

My email provider (Dreamhost) has SpamAssassin installed. SpamAssassin is okay at tagging email with various high-level characteristics (e.g. "this came from a server that's on a blacklist" or "this server said it has a different hostname than it actually does" or whatever), but it treats each rule as an independent one-dimensional score, when really it's combinations of factors that should be the tell. So I set SpamAssassin to just tag email but not to filter it. The tags are useful because they are still visible to bogofilter. (However, they're not strictly necessary, so if you don't have SpamAssassin don't worry too much.)

The actual filtering happens with bogofilter. It is a tool which looks at a stream of text and, based on word frequencies compared to a database of prior word frequencies, decides whether an email is spam or not. It learns. It is nice. And, by having it on the mail server, you only have to train it once.
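
To make the word-frequency idea concrete, here is a toy sketch in Python. To be clear, this is not bogofilter's actual algorithm or code: it is plain naive Bayes with add-one smoothing, and the `ToyFilter` class, the `tokenize` helper, and both cutoff values are made up for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical cutoffs for this toy; they are not bogofilter's defaults.
SPAM_CUTOFF = 0.95
HAM_CUTOFF = 0.10


def tokenize(text):
    """Lower-case the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


class ToyFilter:
    """Naive Bayes over word counts; a stand-in, not bogofilter's method."""

    def __init__(self):
        # For each label, how many training messages contained each word.
        self.words = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record the words of one message under 'spam' or 'ham'."""
        self.words[label].update(set(tokenize(text)))
        self.messages[label] += 1

    def spamicity(self, text):
        """Return a score between 0 (ham) and 1 (spam)."""
        log_odds = 0.0
        for word in set(tokenize(text)):
            # Add-one smoothing so unseen words never give a zero probability.
            p_spam = (self.words["spam"][word] + 1) / (self.messages["spam"] + 2)
            p_ham = (self.words["ham"][word] + 1) / (self.messages["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        # Clamp so math.exp cannot overflow on very long messages.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1 / (1 + math.exp(-log_odds))

    def classify(self, text):
        """Map the score onto the three verdicts."""
        score = self.spamicity(text)
        if score >= SPAM_CUTOFF:
            return "Spam"
        if score <= HAM_CUTOFF:
            return "Ham"
        return "Unsure"
```

Train it on a handful of messages of each kind, and `classify()` hands back "Spam", "Ham", or "Unsure" depending on where the score lands relative to the two cutoffs. The middle band is the whole point: anything the filter can't call confidently gets left for a human to look at.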

bogofilter itself doesn't actually decide what to do with the messages, though. Instead it just adds a header to the message, which procmail can then see. procmail takes a message and, given a bunch of rules, decides which folder to put it into. If your email provider lets you set up mail filters, it'll almost certainly have procmail. Here is my .procmailrc file:

MAILDIR=$HOME/Maildir/
PMDIR=$HOME/.procmail
LOGFILE=$PMDIR/log-`date +%Y-%m`
MAIL=$HOME/Maildir/
BOGOFILTER=$HOME/bin/bogofilter

DEFAULT=$MAIL

# Strip SpamAssassin crap out of the subject
:0fw
* ^Subject:.*\*\*\*SPAM\*\*\*
| sed 's/Subject: \*\*\*SPAM\*\*\*/Subject:/'

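# Run the message through bogofilter: -p passes the message through with an
# X-Bogosity header added, -e makes it exit 0 regardless of the verdict
:0fw
|$BOGOFILTER -e -p

# Definite spam goes to the spam folder
:0:
* ^X-Bogosity: Spam
.spam/

# Anything bogofilter isn't sure about goes to the review folder
:0:
* ^X-Bogosity: Unsure
.review/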

Your configuration will probably be slightly different but this is a pretty simple script. It just says, "take this email and run it through bogofilter in tag mode. If it says it's spam, put it in the spam folder. If you're not sure, put it in the review folder. Otherwise put it in the inbox." (I also have a few other rules for mailing lists and such.)
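
For anyone who doesn't read procmail, the routing decision boils down to something like this Python sketch. It is only an illustration of the logic, not a replacement for the recipes: `pick_folder` and the `FOLDERS` table are hypothetical names, "inbox" stands in for the default delivery, and the sample header value is just an assumption about what the header looks like (the recipes only care how it starts).

```python
from email import message_from_string

# Verdict-to-folder table mirroring the recipes; anything else is the inbox.
FOLDERS = {"Spam": ".spam/", "Unsure": ".review/"}


def pick_folder(raw_message):
    """Return the folder a message would land in, judged by X-Bogosity."""
    msg = message_from_string(raw_message)
    header = msg.get("X-Bogosity", "")
    # The verdict is the first word of the header value.
    verdict = header.split(",")[0].strip()
    return FOLDERS.get(verdict, "inbox")


sample = "Subject: hello\nX-Bogosity: Unsure, tests=bogofilter\n\nbody\n"
print(pick_folder(sample))  # .review/
```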

Obviously this requires there to be two folders, "spam" and "review," on your email account. Well, actually there need to be a few more than that.

Training

The other two mailboxes are named "train-spam" and "train-notspam." Their functions should be pretty obvious. I occasionally check the "review" folder and move messages into the appropriate training folder. (Also, the occasional spam will show up in my inbox. Just move it into "train-spam" and be done with it.) It's extremely unlikely that a legitimate email will show up in the spam folder, because bogofilter is extremely conservative (which is why it has the "unsure" classification to begin with). It will only put something in the spam folder if it's absolutely sure (well, 99.5% certain by default).

So, okay, when you've put messages into these directories, how do they get back into bogofilter? It's pretty simple, really... I have a script, "bogotrain," which runs every 20 minutes (via a cron job), or if you don't have cron access you can just run it manually when you've accumulated a bunch of messages:

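#!/bin/bash

export bogofilter=~plaidfluff/bin/bogofilter
export procmail=/usr/bin/procmail

# train FILE FLAG: show the current verdict, register the message as
# ham (-n) or spam (-s), redeliver it through procmail, then delete the original
function train {
        $bogofilter -e -vvv < "$1" &&
        $bogofilter "$2" < "$1" &&
        $procmail < "$1" &&
        rm -f "$1"
}

# IFS= and -r keep odd filenames (leading spaces, backslashes) intact
find ~/Maildir/.train-notspam/{cur,new,tmp} -type f |
while IFS= read -r fname ; do
        train "$fname" -n
done

find ~/Maildir/.train-spam/{cur,new,tmp} -type f |
while IFS= read -r fname ; do
        train "$fname" -s
done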

This script is also pretty simple. It basically just goes through the files in train-notspam and train-spam, trains bogofilter as not spam or spam respectively, and then reruns procmail on the file, as if it were just mailed to you. (The reason I do this rather than have it just filter or delete the email outright is that I want it to still go into the "wrong" folder if it needs to be trained more. Also, I like retaining all my recent spam as a training set in case I need to recreate my bogofilter database for some reason.)
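
If you'd like to see what the script would pick up before letting it loose, a dry-run listing can be sketched in Python. `pending_training` and `TRAINING_FLAGS` are hypothetical names, the folder layout is assumed to match the script, and unlike the script this skips `tmp`, since Maildir uses that directory for messages that are still being written.

```python
from pathlib import Path

# Folder-to-flag mapping taken from the script: -n registers ham, -s spam.
TRAINING_FLAGS = {".train-notspam": "-n", ".train-spam": "-s"}


def pending_training(maildir):
    """Yield (message_path, bogofilter_flag) pairs waiting to be trained."""
    root = Path(maildir)
    for folder, flag in TRAINING_FLAGS.items():
        for sub in ("cur", "new"):
            directory = root / folder / sub
            if not directory.is_dir():
                continue  # a missing folder just means nothing to train
            for path in sorted(directory.iterdir()):
                if path.is_file():
                    yield path, flag
```

Looping over `pending_training(Path.home() / "Maildir")` and printing each pair shows exactly which files would be fed to bogofilter and with which flag, without touching anything.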

Message archival

Finally, I really like keeping a complete archive of my email for a long time, but of course just having it all pile up in my inbox becomes untenable. So, I have two more folders, "Read" and "Sent" - but with a twist, as IMAP allows folders to have subfolders. I have a script which runs once a month (also via cron) called "maildirs":
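
#!/bin/sh

# Name of the archive folder for the month that just ended, e.g. 2008.05
# (-d 'last month' is a GNU date extension)
TARGET=`date -d 'last month' +%Y.%m`

cd "$HOME/Maildir" || exit 1
for i in .Read .Sent ; do
        # Move the live folder aside unless that month's archive already exists
        [ -d "$i" ] && [ ! -d "$i.$TARGET" ] && mv "$i" "$i.$TARGET"
        # Recreate an empty Maildir folder in its place
        mkdir -p "$i/cur" "$i/new" "$i/tmp"
done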

So, when I'm done with messages, I move them into "Read" (and my email clients are all set up to put outgoing messages in "Sent"), and once a month these become subfolders like Read/2008/03 on the server.
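
The only slightly tricky bit is working out last month's name. The script leans on GNU date's `-d 'last month'`; where that isn't available, the same label can be computed along these lines in Python. `last_month_label` is a hypothetical helper, not part of the script.

```python
import datetime


def last_month_label(today):
    """Return the previous month as 'YYYY.MM', matching the script's naming."""
    # Step back from the first of this month to land in the previous month,
    # which also handles the January-to-December rollover.
    first_of_this_month = today.replace(day=1)
    last_of_previous = first_of_this_month - datetime.timedelta(days=1)
    return last_of_previous.strftime("%Y.%m")
```

For example, `last_month_label(datetime.date(2008, 6, 11))` returns `"2008.05"`, so `.Read` would be archived as `.Read.2008.05`.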

Search is limited by your choice of email client. OSX's Mail.app does a great job of this. Thunderbird... not so much. Outlook is somewhere in between. If I need to do some heavy-duty searching and I'm not on a Mac I'll just ssh into the mail server and do some complex find/grep fiddling, or I'll ssh to my Mac at home and do mdfind or something. Okay, this is somewhere that GMail definitely wins.
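
For the not-on-a-Mac case, the find/grep fiddling could also be done with Python's standard `mailbox` module. A minimal sketch, assuming the folder is a normal Maildir directory; `search_subjects` is a hypothetical helper, and it only looks at Subject headers rather than full message text.

```python
import mailbox


def search_subjects(maildir_path, needle):
    """Return the subjects in one Maildir folder that contain `needle`."""
    # create=False: refuse to conjure up a new folder if the path is wrong.
    box = mailbox.Maildir(maildir_path, create=False)
    needle = needle.lower()
    hits = []
    for message in box:
        subject = str(message.get("Subject", ""))
        if needle in subject.lower():
            hits.append(subject)
    return hits
```

Pointed at one of the monthly archive folders (say `~/Maildir/.Read.2008.03`, expanded with `os.path.expanduser`), this gives a quick case-insensitive subject search without leaving the ssh session.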

Anyway. GMail is fine (especially with Google Apps for Domains, so it's not like you even have to be stuck with somelongusername@gmail.com or whatever), but you don't have to switch to it to get everything the way you want it.

Comments

#10969 influx 06/11/2008 08:05 pm
I've recently become a fan of mairix for indexing my maildir mail. I actually use fetchmail + mairix + procmail + muttng at work now.
#10970 fluffy 06/11/2008 08:11 pm
Huh, interesting! Unfortunately the documentation page seems to be gone. I like to see how one goes about working with software before spending the time/effort to try to get it up and running. Like, does it require running a daemon? How do you issue queries? etc.
#10971 dusk 06/11/2008 09:17 pm
Sorry to burst your bubble a bit, but word has it that DH is inching away from doing email hosting. A little bird told me that it's partly because email turns out to scale really badly, especially when people are running crazy procmail recipes on it, and partly because people tend to notice really quickly (and get really angry) when it breaks.

I've had the idea for a while of using GMail to collect and send mail, but doing all the spam filtering and whatnot locally by pulling everything via IMAP. Never gotten around to implementing it, though.


(Also, you have a parse error in index.php.)
#10972 fluffy 06/11/2008 10:11 pm
Yeah, I've noticed their "gentle" insistence on having people use gmail instead, and their new "features" of having new shell accounts not hosted on the same fileserver as the website. That doesn't make this any less useful though.

And argh, I can't figure out where that parse error is coming from. I hate you, php.
#10974 fluffy 06/11/2008 10:19 pm
Ah, found out what it was. The real WTF was that I had a bunch of stuff just commented out on the page, including an entire copy of the most recent entry, and of course the sed expression in the procmail recipe was causing the comment block to end unexpectedly. I guess I had commented something out as a temporary thing and forgot to make the change permanent.
#10975 dusk 06/11/2008 11:00 pm
Oh, and actually, I figured out a trivial way to filter GMail locally. Of course, this makes the web client a pain to use, but I hardly ever use it anyway.

#!/usr/bin/env python3
GMAIL_USER='your_gmail_username'
GMAIL_PASS='y0urgmai1pa$sw0rd'

import imaplib

class ImapChecker(object):
    """Wrap an imaplib connection so any non-OK response raises."""
    class ServerError(Exception): pass
    def __init__(self, imap):
        self._imap = imap
    def __getattr__(self, attr):
        innerFunc = getattr(self._imap, attr)
        def _wrap(*args, **kwargs):
            [result, retval] = innerFunc(*args, **kwargs)
            if result != "OK":
                raise ImapChecker.ServerError(result, retval)
            return retval
        return _wrap

imap = ImapChecker(imaplib.IMAP4_SSL(host='imap.gmail.com'))
imap.login(GMAIL_USER, GMAIL_PASS)

# select() reports how many messages are in the mailbox
[count] = imap.select("[Gmail]/Spam")
count = int(count)
if not count:
    print("No new spam! Hurrah!")
else:
    print("You've got %d spam! Copying them to inbox..." % count)
    # Copy every message in the spam folder (1 through count) to the inbox
    imap.copy("1:%d" % count, "INBOX")
    print("Done!")
#10976 kwsNI 06/12/2008 10:25 am
Eh, I use gmail rather than my own domain's email addresses anymore. Out of the last year and 10,000+ spam messages sent to me, I think 3 have slipped into my inbox and I don't think any legit messages have been flagged as spam.
#10977 fluffy 06/12/2008 10:31 am
One nice thing I did notice about gmail's IMAP interface is that now you can finally mark things as spam from the client. Honestly I'm pretty likely to switch to gmail-for-domains the next time there's a major problem with Dreamhost's email anyway. It's just that migrating is always a huge pain (though of course the IMAP interface probably helps a lot with that too).