Scam Spam, the death of email, and Machine Learning

Tim Bray has predicted the end of email as we know it:

I don’t know about you, but in recent weeks I’ve been hit with high volumes of spam promoting penny stocks. They are elaborately crafted and go through my spam defenses like a hot knife through butter. (…) This could be the straw that finally breaks the back of email as we know it, the kind that costs nothing to send and something to receive.

Yes, Tim, I’ve been bombed by spam too, to the point that non-spam email has fallen below 10% of what I receive for the first time in years. Before you think I’m an extreme case, ask your local IT experts about the amount of spam they are receiving. Currently, no spam filter can cope with the amount of spam I’m receiving.

The only spam filter that does anything to help is Google Mail’s, but it still lets through more spam than legitimate email (if I exclude mailing lists).

What is really failing us here is not the Internet per se: it is rather trivial to think of a better way to design email protocols. What is failing us is the blunt application of Machine Learning to a real-world problem.

Many Machine Learning researchers would have you believe, mostly because they really believe it, that Bayes or Neural Networks (add your favorite algorithm here) are ideally suited to solve most classification problems. That they can be tweaked to a particular problem. That in some small way, we have strong AI at our door. But we don’t. The failure of spam filters is symbolic. There is really no free lunch as far as algorithms go.

This is not to say that Machine Learning does not work. Recommender systems like those based on collaborative filtering or PageRank work. But in the real world, the best they can do is assist us. And how fancy your algorithm is does not change the equation.

The lesson here is that until we have strong AI, which could be a long way off if it ever comes, we should collectively work on finding algorithms that can assist us better instead of trying to replace us.

For example, spam filters should work with the user on defining what is spam. And I don’t mean having the user train the algorithm. I mean that the user should be allowed to change and add to the spam filter. Naturally, in practice, this is hard work, very hard work, and thus, it might be simpler and better to replace the email protocols.
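To make the idea concrete, here is a minimal sketch of a filter whose rules are plain data that the user can read, add to, or delete. It is only an illustration under my own assumptions, not a description of how any existing filter works: the names `Rule` and `TransparentFilter`, the dictionary layout of an email, and the weights are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Email = Dict[str, str]  # hypothetical layout: {"sender": ..., "body": ...}


@dataclass
class Rule:
    """A named, human-readable test with a weight (positive = more spammy)."""
    name: str
    test: Callable[[Email], bool]
    weight: float


@dataclass
class TransparentFilter:
    """A filter that is just a list of rules the user owns and can edit."""
    rules: List[Rule] = field(default_factory=list)
    threshold: float = 1.0  # assumed cut-off; the user could change it too

    def add_rule(self, rule: Rule) -> None:
        self.rules.append(rule)

    def remove_rule(self, name: str) -> None:
        self.rules = [r for r in self.rules if r.name != name]

    def explain(self, email: Email) -> List[str]:
        """List the names of the rules that fired, so a verdict is never a black box."""
        return [r.name for r in self.rules if r.test(email)]

    def score(self, email: Email) -> float:
        return sum(r.weight for r in self.rules if r.test(email))

    def is_spam(self, email: Email) -> bool:
        return self.score(email) >= self.threshold


# The user writes the rules directly instead of only labelling examples.
spam_filter = TransparentFilter()
spam_filter.add_rule(
    Rule("mentions penny stocks", lambda e: "penny stock" in e["body"].lower(), 1.0)
)
spam_filter.add_rule(
    Rule("from a known colleague", lambda e: e["sender"].endswith("@example.edu"), -2.0)
)
```

The design choice is that `explain` returns the names of the rules that fired, so the user can see why a message was flagged and then remove or reweight the offending rule. Maintaining such rules by hand is exactly the hard work mentioned above.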

We have to move away from black box algorithms and embrace the fact that we lack strong AI. The intelligence is in your users, not in your software.

3 thoughts on “Scam Spam, the death of email, and Machine Learning”

  1. Nice post!
    I can only agree with that.
    I would reformulate it (in a weaker form): the power is not in the algorithms but in the features you use. It is very hard to determine which are the right features, or the right representation of an email to be used by a learning algorithm.
    But the question is: how can users help here?

    Here are a couple of random thoughts:
    One way could be that they suggest high level features (e.g. in the form of rules) that can then be combined.
    Maybe there should be a combination of examples, rules and features.
    Someone may say:
    [examples] this is a good email, this is a bad one
    [rules] when the sender is from @bla.com it is spam
    [features] how many times the term ‘viagra’ appears is a good feature

    Then you can imagine an algorithm that uses this to build its model. But ideally this model should remain understandable to the user (probably using something like rules again) so that he can modify it, or complement it….

  2. Those who are in academia have their e-mail addresses on websites and are easy targets for spammers. One way to avoid this is to have an e-mail system for such people that requires an “electronic stamp”. To send an e-mail to user X, one must access user X’s website and get an “electronic stamp”, which involves reading and typing distorted patterns, something machines are not good at. An e-mail received without a stamp goes to a junk folder.

  3. These sorts of “spam fads” (spads?) happen every couple of months or so; I’ve noticed some of these spams leaking through my filters since late spring. I rarely see more than a couple of spams a day, despite my email addresses being all over the net (and let’s face it, once one spammer has your email address, they all have it). Each of my email addresses passes email through two levels of spam filtering: the first either on a forwarding server (e.g., IEEE or ACM) or on the email host (Univ. of Washington), and the second the built-in filter in Apple’s Mail program. When a new spad starts, there’s a spike in leakage, and then Mail learns the new spam’s characteristics and the leakage drops to a trickle.

    I don’t see it as the death of email; just the email version of fast forwarding through the commercials on a TiVo. Yes, I’d rather not have to do it.
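
The “electronic stamp” idea from the second comment can be sketched roughly as follows. This is one possible reading of it, not a worked-out protocol: I assume the recipient’s website holds a secret key and hands out a keyed hash of the sender’s address once the distorted-pattern challenge has been solved (that step is not modelled here). The function names `issue_stamp`, `verify_stamp`, and `route` are hypothetical.

```python
import hashlib
import hmac
from typing import Optional


def issue_stamp(secret: bytes, sender: str) -> str:
    """Create a stamp tying `sender` to the recipient's secret key.

    Assumption: the website calls this only after the sender has solved
    the distorted-pattern challenge.
    """
    message = sender.lower().encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_stamp(secret: bytes, sender: str, stamp: str) -> bool:
    """Recompute the stamp and compare it in constant time."""
    expected = issue_stamp(secret, sender)
    return hmac.compare_digest(expected.encode("utf-8"), stamp.encode("utf-8"))


def route(secret: bytes, sender: str, stamp: Optional[str] = None) -> str:
    """Deliver stamped mail to the inbox; everything else goes to junk."""
    if stamp is not None and verify_stamp(secret, sender, stamp):
        return "inbox"
    return "junk"
```

Because the stamp is bound to the sender’s address, a stamp obtained for one address does not work for another. The sketch ignores expiry and the fact that sender addresses are easy to forge, both of which a real system would have to handle.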
