Spamdexing
In computing, spamdexing (also known as search spam, search engine spam, web spam or search engine poisoning) is the deliberate manipulation of search engine indexes. It involves a number of methods, such as repeating unrelated phrases, to manipulate the relevance or prominence of resources indexed in a manner inconsistent with the purpose of the indexing system.

Some consider it to be a part of search engine optimization, though there are many search engine optimization methods that improve the quality and appearance of the content of web sites and serve content useful to many users. Search engines use a variety of algorithms to determine relevancy ranking. Some of these include determining whether the search term appears in the META keywords tag, others whether the search term appears in the body text or URL of a web page. Many search engines check for instances of spamdexing and will remove suspect pages from their indexes. Also, people working for a search-engine organization can quickly block the results listing from entire websites that use spamdexing, perhaps alerted by user complaints of false matches. The rise of spamdexing in the mid-1990s made the leading search engines of the time less useful.

Common spamdexing techniques can be classified into two broad classes: content spam (or term spam) and link spam.

History

The earliest known reference to the term spamdexing is by Eric Convey in his article "Porn sneaks way back on Web," The Boston Herald, May 22, 1996, where he said:

The problem arises when site operators load their Web pages with hundreds of extraneous terms so search engines will list them among legitimate addresses.

The process is called "spamdexing," a combination of spamming — the Internet term for sending users unsolicited information — and "indexing."

Content spam

These techniques involve altering the logical view that a search engine has of a page's contents. They all aim at variants of the vector space model for information retrieval on text collections.
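
In the vector space model, a document and a query are each represented as term-weight vectors and compared by cosine similarity, which is exactly the score content spam tries to inflate. Below is a minimal sketch using raw term counts; real systems use tf-idf weighting and many other signals.

    import math
    from collections import Counter

    def cosine_similarity(doc: str, query: str) -> float:
        """Cosine similarity between raw term-count vectors."""
        d = Counter(doc.lower().split())
        q = Counter(query.lower().split())
        dot = sum(d[t] * q[t] for t in q)
        norm = (math.sqrt(sum(v * v for v in d.values()))
                * math.sqrt(sum(v * v for v in q.values())))
        return dot / norm if norm else 0.0

    # Stuffing extra copies of "cheap" inflates the score for this query.
    print(cosine_similarity("buy cheap tickets cheap cheap", "cheap tickets"))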

Keyword stuffing

Keyword stuffing involves the calculated placement of keywords within a page to raise the keyword count, variety, and density of the page, making the page appear relevant to a web crawler and therefore more likely to be found. Example: a promoter of a Ponzi scheme wants to attract web surfers to a site where he advertises his scam. He places hidden text appropriate for a fan page of a popular music group on his page, hoping that the page will be listed as a fan site and receive many visits from music lovers. Older versions of indexing programs simply counted how often a keyword appeared and used that count to determine relevance levels. Most modern search engines can analyze a page for keyword stuffing and determine whether the frequency is consistent with other sites created specifically to attract search engine traffic. Also, large web pages are truncated, so that massive dictionary lists cannot be indexed on a single web page.
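
As a rough illustration of the counting-based scoring that keyword stuffing exploits, the sketch below computes each term's share of a page's tokens and flags implausibly dense terms. The tokenizer and the 25% threshold are illustrative assumptions, not any search engine's actual rule.

    import re
    from collections import Counter

    def keyword_density(text: str) -> dict:
        """Return each term's share of the total token count."""
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        total = len(tokens) or 1
        return {term: n / total for term, n in Counter(tokens).items()}

    def flag_stuffed_terms(text: str, threshold: float = 0.25) -> list:
        """Flag terms whose density exceeds an assumed threshold."""
        return [t for t, d in keyword_density(text).items() if d > threshold]

    page = "cheap tickets cheap tickets buy cheap tickets now cheap"
    print(flag_stuffed_terms(page))  # ['cheap', 'tickets']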

Hidden or invisible text

Unrelated hidden text is disguised by making it the same color as the background, using a tiny font size, or hiding it within HTML code such as "no frame" sections, alt attributes, zero-sized DIVs, and "no script" sections. People screening websites for a search-engine company might temporarily or permanently block an entire website for having invisible text on some of its pages. However, hidden text is not always spamdexing: it can also be used to enhance accessibility.
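
A naive check for one common form of hidden text, same-color foreground and background, can be sketched as below. Real crawlers evaluate rendered CSS; this toy version only inspects inline style attributes, and the regular expressions are simplifying assumptions.

    import re

    # Match inline styles such as style="color:#fff;background-color:#fff"
    STYLE_RE = re.compile(r'style="([^"]*)"', re.IGNORECASE)

    def inline_colors(style: str):
        """Extract foreground and background colors from an inline style."""
        fg = re.search(r'(?<![-\w])color\s*:\s*([^;]+)', style)
        bg = re.search(r'background(?:-color)?\s*:\s*([^;]+)', style)
        return (fg.group(1).strip() if fg else None,
                bg.group(1).strip() if bg else None)

    def has_same_color_text(html: str) -> bool:
        """Flag elements whose text color equals their own background color."""
        for match in STYLE_RE.finditer(html):
            fg, bg = inline_colors(match.group(1))
            if fg and bg and fg.lower() == bg.lower():
                return True
        return False

    sample = '<div style="color:#fff;background-color:#fff">spam</div>'
    print(has_same_color_text(sample))  # True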

Meta-tag stuffing

This involves repeating keywords in the meta tags and using meta keywords that are unrelated to the site's content. This tactic has been ineffective since 2005.

Doorway pages

"Gateway" or doorway page
Doorway page
Doorway pages are web pages that are created for spamdexing, this is, for spamming the index of a search engine by inserting results for particular phrases with the purpose of sending visitors to a different page. They are also known as bridge pages, portal pages, jump pages, gateway pages, entry...

s are low-quality web pages created with very little content but are instead stuffed with very similar keywords and phrases. They are designed to rank highly within the search results, but serve no purpose to visitors looking for information. A doorway page will generally have "click here to enter" on the page.

Scraper sites

Scraper sites are created using various programs designed to "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission. Such websites are generally full of advertising (such as pay-per-click ads), or they redirect the user to other sites. It is even feasible for scraper sites to outrank original websites for their own information and organization names.

Article spinning

Article spinning involves rewriting existing articles, as opposed to merely scraping content from other sites, to avoid penalties imposed by search engines for duplicate content. This process is undertaken by hired writers or automated using a thesaurus database or a neural network.
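
Duplicate-content detection, which spinning tries to evade, is often described in terms of comparing overlapping word n-grams ("shingles") between documents. The sketch below computes a Jaccard similarity over 3-word shingles; the shingle length is an illustrative assumption, not any search engine's published parameter.

    def shingles(text: str, k: int = 3) -> set:
        """Return the set of overlapping k-word shingles in the text."""
        words = text.lower().split()
        return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

    def jaccard(a: str, b: str) -> float:
        """Jaccard similarity between the shingle sets of two documents."""
        sa, sb = shingles(a), shingles(b)
        if not sa or not sb:
            return 0.0
        return len(sa & sb) / len(sa | sb)

    original = "the quick brown fox jumps over the lazy dog"
    spun = "the quick brown fox leaps over the lazy dog"
    print(round(jaccard(original, spun), 2))  # 0.4: substantial overlap from one swapped word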

Link spam

Link spam is defined as links between pages that are present for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give a website a higher ranking the more highly ranked websites link to it. These techniques also aim at influencing other link-based ranking techniques such as the HITS algorithm.
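
The link-based ranking that link spam targets can be illustrated with a bare-bones PageRank-style power iteration over a toy link graph. This is a textbook simplification (uniform teleportation, dangling pages simply leak rank), not Google's production algorithm.

    def pagerank(links: dict, damping: float = 0.85, iterations: int = 50) -> dict:
        """Toy power iteration: rank flows along outbound links."""
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iterations):
            new = {p: (1 - damping) / n for p in pages}
            for page, outlinks in links.items():
                if not outlinks:
                    continue  # dangling pages leak rank in this toy version
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new[target] += share
            rank = new
        return rank

    # Two pages linking only to each other hoard the rank that "c" sends in.
    graph = {"a": ["b"], "b": ["a"], "c": ["a"]}
    print(pagerank(graph))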

Link-building software

A common form of link spam is the use of link-building software to automate the search engine optimization process.

Link farms

Link farms are tightly knit communities of pages referencing each other, also known humorously as mutual admiration societies.
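
One heuristic sometimes used to spot such a group is the fraction of reciprocal links inside it, since an honest link graph is rarely a near-complete clique. The sketch below computes that fraction over a toy adjacency map; treating anything near 1.0 as farm-like is an illustrative assumption.

    def reciprocal_fraction(links: dict) -> float:
        """Fraction of within-group links that are reciprocated."""
        total = reciprocal = 0
        for page, outlinks in links.items():
            for target in outlinks:
                if target in links:  # only count links inside the group
                    total += 1
                    if page in links.get(target, set()):
                        reciprocal += 1
        return reciprocal / total if total else 0.0

    farm = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
    print(reciprocal_fraction(farm))  # 1.0: every link is mutual, a clique-like group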

Hidden links

Hidden links are hyperlinks placed where visitors will not see them, in order to increase link popularity. Highlighted link text can help rank a web page higher for matching that phrase.

Sybil attack

A Sybil attack is the forging of multiple identities for malicious intent, named after the famous multiple personality disorder patient "Sybil" (Shirley Ardell Mason). A spammer may create multiple web sites at different domain names that all link to each other, such as fake blogs (known as spam blogs).

Spam blogs

Spam blogs are blogs created solely for commercial promotion and the passage of link authority to target sites. Often these "splogs" are designed in a misleading manner that gives the effect of a legitimate website, but on close inspection they turn out to be written with spinning software or to consist of poorly written, barely readable content. They are similar in nature to link farms.

Page hijacking

Page hijacking is achieved by creating a rogue copy of a popular website which shows contents similar to the original to a web crawler, but redirects web surfers to unrelated or malicious websites.

Buying expired domains

Some link spammers monitor DNS records for domains that will expire soon, then buy them when they expire and replace the pages with links to their pages. See Domaining. However, Google resets the link data on expired domains.

Some of these techniques may be applied for creating a Google bomb, that is, cooperating with other users to boost the ranking of a particular page for a particular query.

Cookie stuffing

Cookie stuffing involves placing an affiliate tracking cookie on a website visitor's computer without their knowledge, which will then generate revenue for the person doing the cookie stuffing. This not only generates fraudulent affiliate sales but also has the potential to overwrite other affiliates' cookies, essentially stealing their legitimately earned commissions.

Using world-writable pages

Web sites that users can edit can be exploited by spamdexers to insert links to spam sites if appropriate anti-spam measures are not taken.

Automated spambots can rapidly make the user-editable portion of a site unusable. Programmers have developed a variety of automated spam prevention techniques to block or at least slow down spambots.

Spam in blogs

Spam in blogs is the placing or solicitation of links randomly on other sites, placing a desired keyword into the hyperlinked text of the inbound link. Guest books, forums, blogs, and any site that accepts visitors' comments are particular targets and are often victims of drive-by spamming, where automated software creates nonsense posts with links that are usually irrelevant and unwanted.

Comment spam

Comment spam is a form of link spam that has arisen in web pages that allow dynamic user editing, such as wikis, blogs, and guestbooks. It can be problematic because agents can be written that automatically and randomly select a user-edited web page, such as a Wikipedia article, and add spamming links.

Wiki spam

Wiki spam is a form of link spam on wiki pages. The spammer uses the open editability of wiki systems to place links from the wiki site to the spam site. The subject of the spam site is often unrelated to the wiki page where the link is added. In early 2005, Wikipedia implemented a default "nofollow" value for the "rel" HTML attribute. Links with this attribute are ignored by Google's PageRank algorithm. Forum and wiki admins can use these to discourage wiki spam.
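
A common countermeasure is to rewrite user-submitted links so that they carry rel="nofollow" before the page is stored. The regex-based sketch below is a minimal illustration of that idea; production wikis and forums typically use a full HTML sanitizer instead.

    import re

    # Match opening anchor tags that do not already carry a rel attribute.
    ANCHOR_RE = re.compile(r"<a\s+(?![^>]*\brel=)", re.IGNORECASE)

    def add_nofollow(html: str) -> str:
        """Insert rel="nofollow" into anchor tags that lack a rel attribute."""
        return ANCHOR_RE.sub('<a rel="nofollow" ', html)

    comment = 'Great post! <a href="http://spam.example/">cheap pills</a>'
    print(add_nofollow(comment))
    # Great post! <a rel="nofollow" href="http://spam.example/">cheap pills</a>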

Referrer log spamming

Referrer spam takes place when a spam perpetrator or facilitator accesses a web page (the referee) by following a link from another web page (the referrer), so that the referee is given the address of the referrer by the person's Internet browser. Some websites have a referrer log which shows which pages link to that site. By having a robot randomly access many sites enough times, with a message or specific address given as the referrer, that message or Internet address then appears in the referrer log of those sites that have referrer logs. Since some web search engines base the importance of sites on the number of different sites linking to them, referrer-log spam may increase the search engine rankings of the spammer's sites. Also, site administrators who notice the referrer log entries in their logs may follow the link back to the spammer's referrer page.
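
The mechanism relies on nothing more than the HTTP Referer header, which a client may set to any value. The sketch below shows how such a bot might forge it using the third-party requests library; both URLs are placeholders.

    import requests  # third-party: pip install requests

    # A referrer spammer's bot forges the Referer header on each request,
    # so the spam URL shows up in the target site's referrer logs.
    target = "http://victim.example/some-page"       # placeholder URL
    spam_referrer = "http://spammer.example/pills"   # placeholder URL

    response = requests.get(target, headers={"Referer": spam_referrer})
    print(response.status_code)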

Mirror websites

A mirror site is the hosting of multiple websites with conceptually similar content but using different URLs. Some search engines give a higher rank to results where the keyword searched for appears in the URL.

URL redirection

URL redirection is the taking of the user to another page without his or her intervention, e.g., using META refresh tags, Flash, JavaScript, Java, or server-side redirects.
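
Of these, the META refresh tag is the simplest to show concretely: a page can present indexable content to a crawler while sending human visitors elsewhere after zero seconds. The detector below is a naive regex sketch; real crawlers parse the full DOM and may execute JavaScript.

    import re

    META_REFRESH_RE = re.compile(
        r'<meta\s+http-equiv=["\']refresh["\']\s+content=["\'](\d+);\s*url=([^"\']+)',
        re.IGNORECASE,
    )

    def find_meta_refresh(html: str):
        """Return (delay_seconds, target_url) if the page auto-redirects."""
        m = META_REFRESH_RE.search(html)
        return (int(m.group(1)), m.group(2)) if m else None

    page = '<meta http-equiv="refresh" content="0; url=http://other.example/">'
    print(find_meta_refresh(page))  # (0, 'http://other.example/')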

Cloaking

Cloaking refers to any of several means to serve a page to the search-engine spider that is different from that seen by human users. It can be an attempt to mislead search engines regarding the content on a particular web site. Cloaking, however, can also be used to ethically increase accessibility of a site to users with disabilities, or to provide human users with content that search engines aren't able to process or parse. It is also used to deliver content based on a user's location; Google itself uses IP delivery, a form of cloaking, to deliver results. Another form of cloaking is code swapping, i.e., optimizing a page for top ranking and then swapping another page in its place once a top ranking is achieved.
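
In its simplest form, cloaking is just a server branching on the User-Agent header (or on the client IP). The standard-library sketch below serves keyword-stuffed markup to anything identifying itself as "Googlebot" and the real page to everyone else; the strings and port are illustrative assumptions.

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class CloakingHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            user_agent = self.headers.get("User-Agent", "")
            if "Googlebot" in user_agent:      # crude crawler check
                body = b"<html>keyword keyword keyword ...</html>"
            else:                              # what human visitors see
                body = b"<html>Buy our product!</html>"
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("localhost", 8000), CloakingHandler).serve_forever()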

See also

  • Adversarial information retrieval
  • Web scraping
  • TrustRank
  • Index (search engine) — overview of search engine indexing technology
