Spam in blogs
Spam in blogs is a form of spamdexing. (Note that blogspam has another, more common meaning, namely posts with no added value that a blogger creates in order to submit them to other sites.) It is done by automatically posting random comments, or comments promoting commercial services, to blogs, wikis, guestbooks, or other publicly accessible online discussion boards. Any web application that accepts and displays hyperlinks submitted by visitors may be a target.

Adding links that point to the spammer's web site artificially increases the site's search engine ranking. An increased ranking often results in the spammer's commercial site being listed ahead of other sites for certain searches, increasing the number of potential visitors and paying customers.

History

This type of spam originally appeared in internet guestbooks, where spammers repeatedly fill a guestbook with links to their own site and no relevant comment, to increase search engine rankings. If an actual comment is given, it is often just "cool page", "nice website", or keywords of the spammed link.

In 2003, spammers began to take advantage of the open nature of comments in blogging software such as Movable Type by repeatedly posting comments on various blog posts that provided nothing more than a link to the spammer's commercial web site. Jay Allen created a free plugin, called MT-BlackList, for the Movable Type weblog tool (versions prior to 3.2) that attempted to alleviate this problem. Many blogging packages now have methods of preventing or reducing the effect of blog spam, although spammers have developed tools to circumvent them. Many spammers use special blog spamming tools such as Trackback Submitter to bypass comment spam protection on popular blogging systems such as Movable Type, WordPress, and others.

Disallowing multiple consecutive submissions

It is rare for a user to reply to their own comment, yet spammers typically do so. Checking that a comment does not directly reply to one posted from the same IP address will significantly reduce flooding. This, however, proves problematic when multiple users behind the same proxy wish to comment on the same entry. Blog spam software may get around the check by faking IP addresses, posting similar blog spam from many IP addresses.
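The check described above can be sketched as follows. This is a minimal illustration, not the method of any particular blogging package; the Comment record and the function name are hypothetical, and the IP addresses come from documentation-reserved ranges.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Comment:
    author_ip: str  # IP address the comment was posted from
    text: str


def is_consecutive_submission(thread: List[Comment], new_ip: str) -> bool:
    """Return True if the new comment would directly follow one from the same IP."""
    # An empty thread has no previous comment, so nothing can be consecutive.
    return bool(thread) and thread[-1].author_ip == new_ip


thread = [Comment("203.0.113.5", "Nice article.")]
print(is_consecutive_submission(thread, "203.0.113.5"))
print(is_consecutive_submission(thread, "198.51.100.7"))
```

As the text notes, a rule like this also rejects legitimate users who share a proxy, so a site might prefer to flag such comments for review rather than discard them.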

Blocking by keyword

Blocking specific words from posts is one of the simplest and most effective ways to reduce spam. Much spam can be blocked simply by banning names of popular pharmaceuticals and casino games.

This is a good long-term solution: spammers gain nothing by changing keywords to "vi@gra" or similar, because keywords must be readable and indexed by search engine bots to be effective.
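A keyword filter of this kind might look like the sketch below. The word list is a made-up example, and real filters usually maintain much longer lists; the function name is hypothetical.

```python
import re

# Hypothetical block list; a real deployment would maintain its own.
BLOCKED_WORDS = {"viagra", "casino", "poker"}


def contains_blocked_word(text: str) -> bool:
    """Return True if any whole word in the text is on the block list."""
    # Split the lowercased text into alphanumeric words before matching.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return any(word in BLOCKED_WORDS for word in words)


print(contains_blocked_word("Cheap VIAGRA here"))
print(contains_blocked_word("Great post about gardening"))
```

An obfuscated spelling such as "vi@gra" passes this filter, but, as noted above, it is also of little use to the spammer because search engines will not index it as the intended keyword.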

nofollow

Google announced in early 2005 that hyperlinks with the rel="nofollow" attribute would not be crawled or influence the link target's ranking in the search engine's index. The Yahoo and MSN search engines also respect this attribute.

Using rel="nofollow" is a much easier solution that makes the improvised techniques above irrelevant. Most weblog software now marks reader-submitted links this way by default (with no option to disable it without code modification). More sophisticated server software could omit the nofollow for links submitted by trusted users, such as those registered for a long time, on a whitelist, or with high karma. Some server software adds rel="nofollow" to pages that have been recently edited but omits it from stable pages, under the theory that stable pages will have had offending links removed by human editors.
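The idea of sparing trusted users can be illustrated with a short sketch. How a site decides that a user is trusted (account age, whitelist, karma) is left out here and represented by a single boolean; the function name is hypothetical.

```python
from html import escape


def render_user_link(url: str, label: str, trusted: bool) -> str:
    """Render a visitor-submitted link, adding rel="nofollow" unless the user is trusted."""
    rel = "" if trusted else ' rel="nofollow"'
    # Escape both the URL and the label so submitted text cannot inject markup.
    return f'<a href="{escape(url, quote=True)}"{rel}>{escape(label)}</a>'


print(render_user_link("http://example.com/", "my site", trusted=False))
print(render_user_link("http://example.com/", "my site", trusted=True))
```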

Some weblog authors object to the use of rel="nofollow", arguing, for example, that:
  • Link spammers will continue to spam everyone to reach the sites that do not use rel="nofollow".
  • Link spammers will continue to place links for clicking (by surfers) even if those links are ignored by search engines.
  • Google is advocating the use of rel="nofollow" in order to reduce the effect of heavy inter-blog linking on page ranking.
  • Google is advocating the use of rel="nofollow" only to minimize its own filtering efforts and to deflect attention from the view that the attribute should really have been called rel="nopagerank".
  • Nofollow may reduce the value of legitimate comments.


Other websites with high user participation, such as Slashdot, use improvised nofollow implementations, such as adding rel="nofollow" only for potentially misbehaving users. Potential spammers posting as users can be identified through various heuristics, such as the age of the registered account and other factors. Slashdot also uses the poster's karma as a determinant in attaching a nofollow attribute to user-submitted links.

rel="nofollow" has come to be regarded as a microformat.

Validation (reverse Turing test)

One method of blocking automated spam comments is to require validation before the contents of the reply form are published. The goal is to verify that the form is being submitted by a real human being and not by a spam tool; the technique has therefore been described as a reverse Turing test. The test should be of such a nature that a human being can easily pass it and an automated tool would most likely fail.

Many forms on websites take advantage of the CAPTCHA technique, displaying a combination of numbers and letters embedded in an image, which must be entered literally into the reply form to pass the test. To keep out spam tools with built-in text recognition, the characters in the images are customarily misaligned, distorted, and noisy. A drawback of many older CAPTCHAs is that the passwords are usually case-sensitive while the corresponding images often do not allow capital and small letters to be distinguished. This should be taken into account when devising a list of CAPTCHAs. Such systems can also prove problematic for blind people who rely on screen readers. Some more recent systems allow for this by providing an audio version of the characters. A simple alternative to CAPTCHAs is validation in the form of a password question, which hints to human visitors that the password is the answer to a simple question like "The Earth revolves around the... [Sun]".
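A password-question check can be sketched in a few lines. The example below assumes the question from the text and compares answers without regard to case or surrounding whitespace, which sidesteps the case-sensitivity problem mentioned above; the function name is hypothetical.

```python
def check_answer(submitted: str, expected: str = "sun") -> bool:
    """Compare a visitor's answer with the expected one, ignoring case and outer spaces."""
    return submitted.strip().casefold() == expected.casefold()


print(check_answer(" Sun "))
print(check_answer("moon"))
```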

One drawback to be taken into consideration is that any validation required in the form of an additional form field may become a nuisance especially to regular posters. Many bloggers and guestbook owners notice a significant decrease in the number of comments once such a validation is in place.

Disallowing links in posts

There is negligible gain from spam that does not contain links, so currently all spam posts contain (an excessive number of) links. It is therefore safe to require passing a Turing test only if a post contains links, and to let all other posts through. While this is highly effective, spammers do frequently send gibberish posts (such as "ajliabisadf ljibia aeriqoj") to test the spam filter. These gibberish posts will not be labeled as spam. They do the spammer no good, but they still clog up comments sections.

Garbage submissions might, however, also result from level 0 spambots, which do not parse the form fields of the attacked HTML page first but send generic POST requests against pages. A "content" or "forum_post" POST variable may then be set and received by the blog or forum software, while the "uri" or other wrongly named URL field is not accepted and thus not saved as a spam link.
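The rule of challenging only posts that contain links can be sketched as below. The pattern is a deliberately rough assumption about what counts as a link (an http or https URL, a "www." prefix, or an HTML anchor tag); a real filter would need to match whatever markup the site actually accepts.

```python
import re

# Rough, assumed notion of "contains a link".
LINK_PATTERN = re.compile(r"(https?://|www\.|<a\s)", re.IGNORECASE)


def needs_validation(post_text: str) -> bool:
    """Return True if the post contains something that looks like a link."""
    return LINK_PATTERN.search(post_text) is not None


print(needs_validation("Visit http://example.com for deals"))
print(needs_validation("ajliabisadf ljibia aeriqoj"))
```

The second call shows the limitation discussed above: a gibberish post without links is let through unchallenged.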

Redirects

Instead of displaying a direct hyperlink submitted by a visitor, a web application could display a link to a script on its own website that redirects to the correct URL. This will not prevent all spam, since spammers do not always check for link redirection, but it effectively prevents the links from increasing their PageRank, just as rel="nofollow" does. An added benefit is that the redirection script can count how many people visit external URLs, although it will increase the load on the site.

Redirects should be server-side to avoid accessibility issues related to client-side redirects. This can be done via the .htaccess file in Apache.
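As an alternative illustration to an Apache configuration, the logic of a server-side redirect script can be sketched in Python. The "/out" path and the "url" query parameter are hypothetical names, and the functions only compute the link and the HTTP response; wiring them into a web server is left out.

```python
from urllib.parse import parse_qs, quote, urlsplit

REDIRECT_PATH = "/out"  # hypothetical path of the site's redirect script


def make_redirect_link(target_url: str) -> str:
    """Build the on-site link that is displayed instead of the submitted URL."""
    return f"{REDIRECT_PATH}?url={quote(target_url, safe='')}"


def resolve_redirect(request_path: str):
    """Return (status, headers) for a request to the redirect script."""
    query = parse_qs(urlsplit(request_path).query)
    targets = query.get("url", [])
    # Refuse missing targets and schemes other than http/https.
    if not targets or urlsplit(targets[0]).scheme not in ("http", "https"):
        return 400, {}
    return 302, {"Location": targets[0]}


link = make_redirect_link("http://example.com/page?id=1")
print(link)
print(resolve_redirect(link))
```

Because the redirect is issued by the server as an HTTP response, it works without client-side scripting.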

Another way of preventing PageRank leakage is to make use of public redirection or dereferral services such as TinyURL. For example,

<a href="http://tinyurl.com/alias_of_target">Link</a>

where 'alias_of_target' is the alias of the target address.

Note, however, that this prevents users from viewing the target of a link before clicking it, thus interfering with their ability to ignore websites they know to be spam. TinyURL now offers a preview feature to help avoid this situation.

Distributed approaches

This is a very new approach to addressing link spam. One of the shortcomings of link spam filters is that most sites receive only one link from each domain that is running a spam campaign. If the spammer varies IP addresses, there is little to no distinguishable pattern left on the vandalized site. A pattern is left, however, across the thousands of sites that were hit quickly with the same links.

A distributed approach, such as the free LinkSleeve, uses XML-RPC to communicate between the various server applications (such as blogs, guestbooks, forums, and wikis) and the filter server, in this case LinkSleeve. The URLs are extracted from the posted data, and each URL is checked against URLs recently submitted across the web. If a threshold is exceeded, a "reject" response is returned, and the comment, message, or posting is deleted. Otherwise, an "accept" message is sent.
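The filter-server side of such a scheme can be sketched as below. This is not LinkSleeve's actual implementation: the class name, the threshold, and the time window are assumptions, and the XML-RPC transport between the sites and the filter server is omitted so that only the counting logic is shown.

```python
import re
import time
from collections import defaultdict, deque
from typing import Optional

URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


class LinkFrequencyFilter:
    """Counts how often each URL was submitted recently, across all client sites."""

    def __init__(self, threshold: int = 3, window_seconds: float = 3600.0):
        self.threshold = threshold    # assumed limit per URL within the window
        self.window = window_seconds  # assumed length of "recently"
        self.seen = defaultdict(deque)  # URL -> timestamps of recent submissions

    def check(self, text: str, now: Optional[float] = None) -> str:
        now = time.time() if now is None else now
        verdict = "accept"
        for url in set(URL_PATTERN.findall(text)):
            stamps = self.seen[url]
            # Forget submissions that have fallen out of the time window.
            while stamps and now - stamps[0] > self.window:
                stamps.popleft()
            stamps.append(now)
            if len(stamps) > self.threshold:
                verdict = "reject"
        return verdict


spam_filter = LinkFrequencyFilter(threshold=2, window_seconds=60)
for second in range(3):
    print(spam_filter.check("buy now http://spam.example/x", now=second))
```

In this run the same URL arrives three times within the window, so the third submission exceeds the threshold of two and is rejected.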

A more robust distributed approach is Akismet, which uses a similar approach to LinkSleeve but uses API keys to assign trust to nodes and also has wider distribution as a result of being bundled with the 2.0 release of WordPress. Its developers claim that over 140,000 blogs contribute to the system. Akismet libraries have been implemented for Java, Python, Ruby, and PHP, but its adoption may be hindered by its commercial use restrictions. In 2008, Six Apart therefore released a beta version of their TypePad AntiSpam software, which is compatible with Akismet but free of the latter's commercial use restrictions.

Project Honey Pot has also begun tracking comment spammers. The Project uses its network of thousands of traps installed in over one hundred countries around the world to watch what comment-spamming web robots are posting to blogs and forums. Data is then published on the top countries for comment spamming, as well as the top keywords and URLs being promoted. The Project's data is made available for blocking known comment spammers through http:BL. Various plugins have been developed to take advantage of the http:BL API.

Application-specific anti-spam methods

Particularly popular software products such as Movable Type and MediaWiki have developed their own custom anti-spam measures, as spammers focus more attention on targeting those platforms. More advanced access control lists require various forms of validation before users can contribute anything like linkspam.

The goal in every case is to allow good users to continue to add links to their comments, as that is considered by some to be a valuable aspect of any comments section.

RSS feed monitoring

Some wikis allow you to access an RSS feed of recent changes or comments. If you add that to your news reader and set up a smart search for common spam terms (usually viagra and other drug names) you can quickly identify and remove the offending spam.
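The same kind of search can be automated with a short script. The sketch below assumes a plain RSS 2.0 feed whose items carry title, description, and link elements; the term list and the sample feed are made up for illustration.

```python
import xml.etree.ElementTree as ET

SPAM_TERMS = ("viagra", "casino")  # hypothetical search terms

SAMPLE_FEED = """<rss version="2.0"><channel>
<item><title>Edit to Main Page</title><description>fixed typo</description><link>http://wiki.example/1</link></item>
<item><title>New comment</title><description>Cheap VIAGRA online</description><link>http://wiki.example/2</link></item>
</channel></rss>"""


def find_suspect_items(rss_xml: str):
    """Return the links of feed items whose title or description contains a spam term."""
    root = ET.fromstring(rss_xml)
    suspects = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        description = item.findtext("description", default="")
        haystack = f"{title} {description}".lower()
        if any(term in haystack for term in SPAM_TERMS):
            suspects.append(item.findtext("link", default=""))
    return suspects


print(find_suspect_items(SAMPLE_FEED))
```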

Response tokens

Another filter available to webmasters is to add a hidden variable to their comment form containing a session token which uniquely identifies the instance of the form. The primary protection afforded by this mechanism comes from enforcing a one-to-one correspondence between each request to get the form and each request to submit it. This is impossible to do with IP addresses, since they are shared by users behind a proxy, firewall, or NAT (e.g., multiple users sitting in the same internet cafe, library, senior citizens' center, managed care home, club, etc.) and they may change frequently, even between related requests (e.g., AOL and other enterprise-scale proxies, or anonymizing services such as Tor). When the form is eventually submitted, the server can use the token to validate the post. If the token is unrecognized, the server can send back the form, along with a new token, requiring the user to resubmit. A duplicate token with duplicate content can safely be silently discarded. Additionally, spammers may not actually load the comment form for an entry; inserting a unique code for each request into the comment form and verifying it on receipt of the HTTP POST will significantly increase the number of steps required to spam multiple entries.

Given a valid token, the server can then flag as suspicious, for example, postings that use different IP addresses for loading the comment form and posting the comment form, many postings all using the same IP address, or postings that took unusually short or long periods of time to compose. These can then be subjected to additional scrutiny, such as challenging the poster with a CAPTCHA or queuing for human review, or they can be rejected outright.

This method is effective against spammers who spoof their IP address in an attempt to conceal their identities or to appear to be many more distinct users than the number of IP addresses simultaneously under their control, since the token can only be returned if it was received by the spammer in the first place. It has been suggested that flagging posts based on changing IP addresses is effective against spammers abusing the distributed anonymous proxy Tor.
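The token mechanism described in this section can be sketched as follows. The class, the flag names, and the five-second minimum composition time are assumptions made for illustration. The sketch keeps tokens in memory and treats a reused token as unknown, which is a simplification of the duplicate handling described above.

```python
import secrets
import time
from typing import Optional

MIN_COMPOSE_SECONDS = 5  # assumed lower bound for a human-written comment


class FormTokenStore:
    """Issues one token per rendered form and validates it on submission."""

    def __init__(self):
        self._issued = {}  # token -> (IP that loaded the form, issue time)

    def issue(self, client_ip: str, now: Optional[float] = None) -> str:
        token = secrets.token_urlsafe(16)
        self._issued[token] = (client_ip, time.time() if now is None else now)
        return token

    def redeem(self, token: str, client_ip: str, now: Optional[float] = None):
        """Return (valid, flags); each token can be redeemed only once."""
        record = self._issued.pop(token, None)
        if record is None:
            return False, ["unknown-token"]
        issued_ip, issued_at = record
        elapsed = (time.time() if now is None else now) - issued_at
        flags = []
        if issued_ip != client_ip:
            flags.append("ip-changed")  # form loaded and posted from different IPs
        if elapsed < MIN_COMPOSE_SECONDS:
            flags.append("too-fast")    # composed implausibly quickly
        return True, flags


store = FormTokenStore()
token = store.issue("203.0.113.5", now=0)
print(store.redeem(token, "203.0.113.5", now=30))
print(store.redeem(token, "203.0.113.5", now=31))
```

A flagged but valid submission is not rejected by this sketch; the flags are returned so that the application can decide whether to challenge the poster or queue the comment for review.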

Ajax

Some blog software, such as Typo, lets the blog administrator accept only comments submitted via Ajax XMLHttpRequests and discard regular form POST requests. This causes the accessibility problems typical of Ajax-only applications.

Although this technique has prevented spam so far, it is a form of security by obscurity and will probably be defeated if it becomes popular enough.


The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 