SpamAssassin

History

SpamAssassin was created by Justin Mason, who had maintained a number of patches against an earlier program named filter.plx by Mark Jeftovic, which in turn was begun in August 1997. Mason rewrote all of Jeftovic's code from scratch and uploaded the resulting codebase to SourceForge.net on April 20, 2001.

Methods of usage

SpamAssassin is a Perl-based application (Mail::SpamAssassin in CPAN) which is usually used to filter all incoming mail for one or several users. It can be run as a standalone application or as a client (spamc) that communicates with a daemon (spamd). The latter mode of operation has performance benefits, but under certain circumstances may introduce additional security risks.

Typically either variant of the application is set up in a generic mail filter program, or it is called directly from a mail user agent that supports this, whenever new mail arrives. Mail filter programs such as procmail can be made to pipe all incoming mail through SpamAssassin with an adjustment to the user's .procmailrc file.

Anti-spam techniques

SpamAssassin comes with a large set of rules which are applied to determine whether an email is spam or not. To decide, specific fields within the email header and the email body are typically searched for certain regular expressions, and if these expressions match, the email is assigned a certain score, depending on the test, and several (customizable) headers are added to the mail. The total score resulting from all tests or other criteria can then be used by the end user or by the ISP to set the conditions under which email is moved to a separate spam folder, deleted, flagged, etc.

Each test has a label and a description. The label is usually an all upper case identifier separated with underscores, such as "LIMITED_TIME_ONLY", with the description for that label being "Offers a limited time offer". A mail that fails that test (in this case, contains certain variants of the "limited time only" phrase) might be assigned a score of +0.3. With a spam threshold of 5 (default as of SpamAssassin version 2.55), several other tests would usually have to fail for the mail to be classified as spam. On the other hand, some tests, such as those for invalid message IDs or years, result in a very high score being assigned, where even a single test can almost put a mail "over the edge".
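
The scoring scheme described above can be illustrated with a minimal Python sketch. It is not SpamAssassin's own code (which is written in Perl). Only the LIMITED_TIME_ONLY label, its +0.3 score and the threshold of 5 come from the text; the other rule labels, all patterns and the remaining scores are hypothetical and chosen purely for illustration.

```python
import re

# (label, pattern, score). Only LIMITED_TIME_ONLY and its +0.3 score are
# taken from the article; everything else here is a made-up example.
RULES = [
    ("LIMITED_TIME_ONLY",
     re.compile(r"limited[\s-]+time[\s-]+(only|offer)", re.I), 0.3),
    ("EXAMPLE_FREE_MONEY", re.compile(r"free\s+money", re.I), 2.5),
    ("EXAMPLE_ACT_NOW", re.compile(r"act\s+now", re.I), 2.4),
]

SPAM_THRESHOLD = 5.0  # the default threshold mentioned in the article


def score_message(text, rules=RULES):
    """Return (total score, labels of the rules that matched)."""
    hits = [(label, score) for label, pattern, score in rules
            if pattern.search(text)]
    total = sum(score for _, score in hits)
    return total, [label for label, _ in hits]


def is_spam(text, threshold=SPAM_THRESHOLD):
    """Strict comparison, following the article's wording ("higher than")."""
    total, _ = score_message(text)
    return total > threshold


if __name__ == "__main__":
    sample = "Limited time only! Act now for free money."
    total, matched = score_message(sample)
    print(matched, round(total, 2), is_spam(sample))
```

In this sketch a single low-scoring match such as LIMITED_TIME_ONLY stays far below the threshold, while several matches together push the total over it, which mirrors the behaviour described above.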

When a mail's total score is higher than the "required_score" setting in SpamAssassin's configuration, the mail is treated as spam and rewritten according to several options. In the default configuration, the content of the mail is appended as a MIME attachment, with a brief excerpt in the message body, and a description of the tests which resulted in the mail being classified as spam. If the score is lower than the defined settings, by default the information about the passed tests and total score is still added to the email headers and can be used in post-processing for less severe actions, such as tagging the mail as suspicious.
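
As an illustration of such post-processing, the following Python sketch reads the score back out of an already-filtered message. It assumes the result is carried in an X-Spam-Status header of the form "No, score=3.1 required=5.0 tests=...", with "hits=" in place of "score=" in older releases; the exact header layout depends on the version and configuration and should be checked against a real message. The sample message, the EXAMPLE_RULE label and the "suspicious" cut-off of 3.0 are hypothetical.

```python
import re
from email import message_from_string


def parse_spam_status(raw_message):
    """Extract verdict, score, threshold and rule labels from X-Spam-Status.

    Returns None when the header is absent.
    """
    msg = message_from_string(raw_message)
    header = msg.get("X-Spam-Status")
    if header is None:
        return None
    verdict = header.strip().lower().startswith("yes")
    # Undo header folding so that a long tests= list becomes one token.
    compact = re.sub(r",\s+", ",", " ".join(header.split()))
    fields = dict(re.findall(r"(\w+)=(\S+)", compact))
    score = float(fields.get("score", fields.get("hits", "nan")))
    required = float(fields.get("required", "nan"))
    tests = [t for t in fields.get("tests", "").split(",")
             if t and t != "none"]
    return {"spam": verdict, "score": score,
            "required": required, "tests": tests}


# Hypothetical, already-filtered message used only for this example.
SAMPLE = (
    "From: sender@example.com\n"
    "Subject: hello\n"
    "X-Spam-Status: No, score=3.1 required=5.0 tests=LIMITED_TIME_ONLY,\n"
    "\tEXAMPLE_RULE autolearn=no\n"
    "\n"
    "Body text.\n"
)

status = parse_spam_status(SAMPLE)
# A less severe action: tag mail that is not spam but scores 3.0 or more.
suspicious = (not status["spam"]) and status["score"] >= 3.0
```

A mail filter or mail user agent could use a value like `suspicious` to file the message in a review folder instead of deleting it.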

The user can customize these filters using a file "user_prefs" in their home directory or a database. Within this file, they can specify individuals whose emails are never considered spam, or change the scores for certain rules. The user can also define a list of languages which they want to receive mail in, and SpamAssassin then assigns a higher score to all mails that appear to be written in another language. This can be very useful to users receiving a lot of foreign spam but never actually corresponding with people in that language.

Network-based filtering methods

SpamAssassin also supports:
  • DNS-based blackhole lists
  • URI blacklists such as SURBL or URIBL.com which track spam websites
  • checksum-based filters such as the Distributed Checksum Clearinghouse, Vipul's Razor and the Cloudmark Authority plug-in (commercial)
  • Hashcash
  • Sender Policy Framework

as a means to tell 'ham' from 'spam'.

More methods can be added reasonably easily by writing a Perl plug-in for SpamAssassin.
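
To give an idea of how one of the network checks listed above works, the Python sketch below shows the usual convention for querying a DNS-based blackhole list: the octets of the sender's IPv4 address are reversed and prepended to the list's zone, and the address counts as listed if that name resolves. This is a general illustration and not SpamAssassin's implementation; the zone dnsbl.example.org is a placeholder, and real lists have their own zones, return codes and usage policies.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up for an IPv4 address in a DNSBL zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on bad input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip, zone):
    """Return True if the address resolves in the zone (needs network access)."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        # No record for this name: the address is not listed.
        return False
    return True


if __name__ == "__main__":
    # 192.0.2.99 is a documentation address; the zone is a placeholder.
    print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
```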

Bayesian filtering

SpamAssassin by default tries to reinforce its own rules through Bayesian filtering, but Bayesian learning is most effective with actual user input. Typically, the user is expected to "feed" example spam mails and example "ham" (useful) mails to the filter, which can then learn the difference between the two. For this purpose, SpamAssassin provides the command-line tool sa-learn, which can be instructed to learn a single mail or an entire mailbox as either ham or spam.

Typically, the user will move unrecognized spam to a separate folder for a while, and then run sa-learn on the folder of non-spam and on the folder of spam separately. Alternatively, if the mail user agent supports it, sa-learn can be called for individual emails. Regardless of the method used to perform the learning, SpamAssassin's Bayesian test will subsequently assign a higher score to e-mails that are similar to previously received spam (or, more precisely, to those emails that are different from non-spam in ways similar to previously received spam e-mails).
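
The learning step can be pictured with a small naive Bayes classifier in Python. This is a textbook-style sketch with add-one smoothing, not the algorithm or token database that SpamAssassin and sa-learn actually use; the class name TinyBayes, the tokenizer and the training mails are all invented for the example.

```python
import math
import re
from collections import Counter


class TinyBayes:
    """Toy naive Bayes filter trained on example spam and ham texts."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        # Each distinct lower-cased word counts once per message.
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    def learn(self, text, label):
        """Record one message as 'spam' or 'ham'."""
        self.messages[label] += 1
        self.counts[label].update(self.tokenize(text))

    def spam_probability(self, text):
        """Estimate the probability that the text is spam."""
        n_spam, n_ham = self.messages["spam"], self.messages["ham"]
        log_odds = math.log((n_spam + 1) / (n_ham + 1))
        for token in self.tokenize(text):
            # Add-one smoothing keeps unseen tokens from zeroing the result.
            p_spam = (self.counts["spam"][token] + 1) / (n_spam + 2)
            p_ham = (self.counts["ham"][token] + 1) / (n_ham + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


bayes = TinyBayes()
for mail in ("cheap pills buy now", "buy cheap watches now"):
    bayes.learn(mail, "spam")
for mail in ("meeting agenda for monday", "lunch on monday with the team"):
    bayes.learn(mail, "ham")
```

After this training, a text that shares words with the spam examples receives a probability above 0.5 and one that resembles the ham examples a probability below 0.5. A real filter would turn such a probability into a rule score rather than use it directly.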

Licensing

SpamAssassin is free/open source software, licensed under the Apache License 2.0. Versions prior to 3.0 are dual-licensed under the Artistic License and the GNU General Public License.

sa-compile

sa-compile is a utility distributed with SpamAssassin as of version 3.2.0. It compiles a SpamAssassin ruleset into a deterministic finite automaton that allows SpamAssassin to use processor power more efficiently.

Testing SpamAssassin

Most implementations of SpamAssassin will trigger on the GTUBE, a 68-byte string not unlike the antivirus EICAR test file. If this string is inserted in an RFC 2822 formatted message and passed through the SpamAssassin engine, SpamAssassin will trigger with a weight of 1000.
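
A test message of this kind can be assembled with Python's standard email package, as in the sketch below. The GTUBE constant is believed to match the string published by the SpamAssassin project, but it should be checked against the official copy before use; the addresses and subject line are placeholders.

```python
from email.message import EmailMessage

# 68-character GTUBE test string (verify against the official copy).
GTUBE = "XJS*C4JDBQADN1.NSBN3*2IDNEN*GTUBE-STANDARD-ANTI-UBE-TEST-EMAIL*C.34X"


def build_gtube_message(sender="sender@example.net",
                        recipient="recipient@example.net"):
    """Return a minimal message whose body contains the GTUBE string."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Test spam mail (GTUBE)"
    # The string goes on a line of its own, unmodified and unwrapped.
    msg.set_content("This is the GTUBE test string:\n\n" + GTUBE + "\n")
    return msg


raw = build_gtube_message().as_string()
```

The resulting text in `raw` can be saved to a file or passed on the standard input of the filter (for example through spamc) to confirm that mail is actually being scanned.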
