SpamAssassin
SpamAssassin is a computer program released under the Apache License 2.0 and used for e-mail spam filtering based on content-matching rules. It is now part of the Apache Software Foundation.

SpamAssassin uses a variety of spam-detection techniques, including DNS-based and checksum-based spam detection, Bayesian filtering, external programs, blacklists and online databases.

The program can be integrated with the mail server to automatically filter all mail for a site. It can also be run by individual users on their own mailbox and integrates with several mail programs. SpamAssassin is highly configurable; if used as a system-wide filter it can still be configured to support per-user preferences.

SpamAssassin was awarded the Linux New Media Award 2006 as the 'Best Linux-based Anti-spam Solution'.

History

SpamAssassin was created by Justin Mason, who had maintained a number of patches against an earlier program named filter.plx by Mark Jeftovic, which in turn was begun in August 1997. Mason rewrote all of Jeftovic's code from scratch and uploaded the resulting codebase to SourceForge.net on April 20, 2001. In summer 2004 the project became an Apache Software Foundation project and was later officially renamed Apache SpamAssassin. The project involved algorithms developed in part by Gary Robinson and others.

Methods of usage

SpamAssassin is a Perl-based application (Mail::SpamAssassin in CPAN) which is usually used to filter all incoming mail for one or several users. It can be run as a standalone application or as a subprogram of another application (such as Milter, SA-Exim, Exiscan, MailScanner, MIMEDefang, Amavis) or as a client (spamc) that communicates with a daemon (spamd). The client/server or embedded mode of operation has performance benefits, but under certain circumstances may introduce additional security risks.

Typically either variant of the application is set up in a generic mail filter program, or it is called directly from a mail user agent that supports this, whenever new mail arrives. Mail filter programs such as procmail can be made to pipe all incoming mail through SpamAssassin with an adjustment to the user's .procmailrc file.
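
The pipe-through-a-filter pattern can be sketched in Python. This is a minimal illustration, not how procmail or any particular integration does it. It assumes that spamc is on the PATH, that a spamd daemon is running, and that flagged mail comes back with an "X-Spam-Flag: YES" header. The function names and the PASSTHROUGH stand-in are hypothetical and exist only for this example.

```python
import subprocess
import sys
from email import message_from_bytes


def filter_message(raw_message: bytes, command=("spamc",)) -> bytes:
    """Pipe one raw message through a filter command and return its output.

    The default command assumes spamc is installed and spamd is running.
    """
    result = subprocess.run(
        list(command),
        input=raw_message,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout


def is_flagged(filtered_message: bytes) -> bool:
    """Report whether the filtered message carries an X-Spam-Flag: YES header."""
    headers = message_from_bytes(filtered_message)
    return headers.get("X-Spam-Flag", "").strip().upper() == "YES"


# Stand-in filter for machines without spamd: copies stdin to stdout unchanged.
PASSTHROUGH = (
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())",
)
```

Calling filter_message(raw, PASSTHROUGH) exercises the plumbing without a daemon; dropping the second argument sends the message through spamc instead.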

Operation

SpamAssassin comes with a large set of rules which are applied to determine whether an email is spam or not. Most rules are based on regular expressions that are matched against the body or header fields of the message, but SpamAssassin also employs a number of other spam-fighting techniques. The rules are called 'tests' in the SpamAssassin documentation.

Each test has a score value that will be assigned to a message if it matches the test's criteria. The scores can be positive or negative, with positive values indicating 'spam' and negative 'ham' (non-spam messages). A message is matched against all tests and SpamAssassin combines the results into a global score which is assigned to the message. The higher the score, the higher the probability that the message is spam.

SpamAssassin has an internal (configurable) score threshold to classify a message as spam. Usually a message will only be considered as spam if it matches multiple criteria; matching just a single test will not usually be enough to reach the threshold.
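
The scoring model described above can be sketched as follows. This is a simplified illustration and not SpamAssassin's actual rule engine. The rule names, patterns and scores are invented for the example, and the threshold of 5.0 is an assumed value that a real installation would configure.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str            # hypothetical test name
    pattern: re.Pattern  # regular expression matched against the raw message
    score: float         # positive leans spam, negative leans ham


# Illustrative tests only; names, patterns and scores are invented.
RULES = [
    Rule("DEMO_FREE_MONEY", re.compile(r"free money", re.I), 2.5),
    Rule("DEMO_CLICK_HERE", re.compile(r"click here", re.I), 1.5),
    Rule("DEMO_SHOUTING_SUBJECT", re.compile(r"^Subject: [A-Z !]{10,}$", re.M), 2.0),
    Rule("DEMO_MAILING_LIST", re.compile(r"^List-Id: ", re.M), -1.0),
]

REQUIRED_SCORE = 5.0  # assumed threshold


def score_message(text, rules=RULES):
    """Run every test and return (total score, names of the tests that hit)."""
    hits = [rule for rule in rules if rule.pattern.search(text)]
    return sum(rule.score for rule in hits), [rule.name for rule in hits]


def is_spam(text, required=REQUIRED_SCORE):
    """Classify as spam only when the combined score reaches the threshold."""
    total, _ = score_message(text)
    return total >= required
```

In this sketch a message that only says "click here" scores 1.5 and stays below the threshold, while one that also matches the other two positive tests reaches 6.0 and is classified as spam.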

If SpamAssassin considers a message to be spam, it can be further rewritten. In the default configuration, the content of the mail is appended as a MIME attachment, with a brief excerpt in the message body, and a description of the tests which resulted in the mail being classified as spam. If the score is lower than the defined threshold, by default the information about the passed tests and total score is still added to the email headers and can be used in post-processing for less severe actions, such as tagging the mail as suspicious.
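
Post-processing on the added headers might look like the sketch below. It assumes the score is reported in an X-Spam-Status header of the form "No, score=3.4 required=5.0 tests=...", already unfolded onto a single line. The triage policy and its margin parameter are hypothetical and not part of SpamAssassin.

```python
import re

# Assumed single-line layout: "<Yes|No>, score=<n> required=<n> tests=<a,b,...> ..."
STATUS_RE = re.compile(
    r"^(?P<verdict>Yes|No),\s+score=(?P<score>-?\d+(?:\.\d+)?)\s+"
    r"required=(?P<required>-?\d+(?:\.\d+)?)"
    r"(?:\s+tests=(?P<tests>\S*))?"
)


def parse_spam_status(value: str) -> dict:
    """Extract verdict, score, threshold and test names from the header value."""
    match = STATUS_RE.match(value.strip())
    if match is None:
        raise ValueError("unrecognised X-Spam-Status value: %r" % value)
    tests = match.group("tests") or ""
    return {
        "is_spam": match.group("verdict") == "Yes",
        "score": float(match.group("score")),
        "required": float(match.group("required")),
        "tests": [] if tests in ("", "none") else tests.split(","),
    }


def triage(status: dict, margin: float = 2.0) -> str:
    """Hypothetical policy: mark near-misses as suspicious instead of spam."""
    if status["score"] >= status["required"]:
        return "spam"
    if status["score"] >= status["required"] - margin:
        return "suspicious"
    return "ham"
```

A mail filter could use the "suspicious" result to add a subject tag or file the message in a review folder rather than discarding it.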

SpamAssassin allows for a per-user configuration of its behaviour, even if installed as a system-wide service; the configuration can be read from a file or a database. In their configuration, users can specify individuals whose emails are never considered spam, or change the scores for certain rules. Users can also define a list of languages in which they want to receive mail, and SpamAssassin then assigns a higher score to all mails that appear to be written in another language.

Network-based filtering methods

SpamAssassin also supports:
  • DNS-based blackhole lists and DNS-based whitelists
  • URI blacklists such as SURBL or URIBL.com which track spam websites
  • checksum-based filters such as the Distributed Checksum Clearinghouse, Vipul's Razor and the Cloudmark Authority plug-in (commercial)
  • Hashcash
  • Sender Policy Framework and DomainKeys Identified Mail

More methods can be added reasonably easily by writing a Perl plug-in for SpamAssassin.
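
As an illustration of the first item in the list, a DNS-based blackhole list is conventionally queried by reversing the octets of the sender's IPv4 address and looking the result up under the list's zone; an answer means "listed", a failed lookup means "not listed". The sketch below follows that convention in Python rather than as a Perl plug-in. The zone name dnsbl.example.org is a placeholder, and the resolver parameter is a hypothetical hook added so the logic can be tried without network access.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up for an IPv4 address in a DNSBL zone."""
    addr = ipaddress.ip_address(ip)
    if addr.version != 4:
        raise ValueError("this sketch only handles IPv4 addresses")
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone.strip('.')}"


def is_listed(ip: str, zone: str, resolver=socket.gethostbyname) -> bool:
    """Return True if the lookup resolves, False if the name does not exist."""
    try:
        resolver(dnsbl_query_name(ip, zone))
    except OSError:  # socket.gaierror is a subclass of OSError
        return False
    return True
```

For example, dnsbl_query_name("192.0.2.99", "dnsbl.example.org") yields "99.2.0.192.dnsbl.example.org".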

Bayesian filtering

SpamAssassin by default tries to reinforce its own rules through Bayesian filtering, but Bayesian learning is most effective with actual user input. Typically, the user is expected to "feed" example spam mails and example "ham" (useful) mails to the filter, which can then learn the difference between the two. For this purpose, SpamAssassin provides the command-line tool sa-learn, which can be instructed to learn a single mail or an entire mailbox as either ham or spam.

Typically, the user will move unrecognized spam to a separate folder for a while, and then run sa-learn on the folder of non-spam and on the folder of spam separately. Alternatively, if the mail user agent supports it, sa-learn can be called for individual emails. Regardless of the method used to perform the learning, SpamAssassin's Bayesian test will subsequently assign a higher score to e-mails that are similar to previously received spam (or, more precisely, to those emails that are different from non-spam in ways similar to previously received spam e-mails).
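
The learning step can be illustrated with a very small naive Bayes classifier. This is a textbook-style sketch with add-one smoothing, not the algorithm SpamAssassin implements; the class name TinyBayes and its methods are hypothetical.

```python
import math
import re
from collections import Counter


class TinyBayes:
    """Toy naive Bayes filter trained on example spam and ham messages."""

    def __init__(self):
        self.token_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        # Each distinct lowercase word counts once per message.
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    def learn(self, text, label):
        """Record one example message; label is 'spam' or 'ham'."""
        self.message_counts[label] += 1
        self.token_counts[label].update(self.tokenize(text))

    def spam_probability(self, text):
        """Estimate P(spam | tokens) from the tokens present in the message."""
        n_spam = self.message_counts["spam"]
        n_ham = self.message_counts["ham"]
        if n_spam == 0 or n_ham == 0:
            return 0.5  # nothing to compare against yet
        log_odds = math.log(n_spam / n_ham)
        for token in self.tokenize(text):
            p_spam = (self.token_counts["spam"][token] + 1) / (n_spam + 2)
            p_ham = (self.token_counts["ham"][token] + 1) / (n_ham + 2)
            log_odds += math.log(p_spam / p_ham)
        log_odds = max(-700.0, min(700.0, log_odds))  # keep exp() in range
        return 1 / (1 + math.exp(-log_odds))
```

Feeding it a folder of spam and a folder of ham, one learn() call per message, mirrors the workflow described above: tokens seen mostly in spam push the probability up, tokens seen mostly in ham pull it down.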

Licensing

SpamAssassin is free/open source software, licensed under the Apache License 2.0. Versions prior to 3.0 are dual-licensed under the Artistic License and the GNU General Public License.

sa-compile

sa-compile is a utility distributed with SpamAssassin as of version 3.2.0. It compiles a SpamAssassin ruleset into a deterministic finite automaton that allows SpamAssassin to use processor power more efficiently.

Testing SpamAssassin

Most implementations of SpamAssassin will trigger on the GTUBE, a 68-byte string similar to the antivirus EICAR test file. If this string is inserted in an RFC 2822 formatted message and passed through the SpamAssassin engine, SpamAssassin will trigger with a weight of 1000.
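
A test message of this kind can be assembled with Python's standard email package, as in the sketch below. The addresses and subject line are placeholders, and the helper name build_gtube_message is hypothetical; only the GTUBE string itself is fixed.

```python
from email.message import EmailMessage

# The 68-byte GTUBE test string.
GTUBE = "XJS*C4JDBQADN1.NSBN3*2IDNEN*GTUBE-STANDARD-ANTI-UBE-TEST-EMAIL*C.34X"


def build_gtube_message(sender="sender@example.com",
                        recipient="recipient@example.net"):
    """Return an RFC 2822 style message whose body contains the GTUBE string."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "GTUBE filter test"
    msg.set_content("This is a filter test message.\n\n" + GTUBE + "\n")
    return msg
```

Passing build_gtube_message().as_bytes() to the filter under test should result in the message being marked as spam if the installation is working.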

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 