Scraper site

A scraper site is a spam website that copies all of its content from other websites using web scraping.

In recent years, scraper sites have proliferated rapidly as a means of spamming search engines. Open content is a common source of material for scraper sites.

A search engine is not itself a scraper site: sites such as Yahoo and Google gather content from other websites and index it so that the index can be searched by keyword. Search engines then display snippets of the original site content in response to a user's search.
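
The indexing and snippet behaviour described above can be illustrated with a minimal sketch. It is not how any real search engine is implemented; the page texts, URLs, and function names (`build_index`, `snippet`, `search`) are hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each keyword to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def snippet(text, keyword, width=30):
    """Return a short excerpt of the text around the first keyword match."""
    match = re.search(re.escape(keyword), text, re.IGNORECASE)
    if match is None:
        return ""
    start = max(match.start() - width, 0)
    end = min(match.end() + width, len(text))
    return text[start:end].strip()


def search(index, pages, keyword):
    """Return (url, snippet) pairs for the pages that contain the keyword."""
    urls = sorted(index.get(keyword.lower(), set()))
    return [(url, snippet(pages[url], keyword)) for url in urls]


# Hypothetical pages standing in for crawled content.
pages = {
    "http://example.com/a": "Tips for landscape photography at sunrise.",
    "http://example.com/b": "A recipe for sourdough bread.",
}
index = build_index(pages)
results = search(index, pages, "photography")
```

The point of the sketch is the difference in output: the result is a pointer to the original page plus a short excerpt, not a full copy of the page's content.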

Made for advertising


Some scraper sites are created to make money through advertising programs. In such cases, they are called Made for AdSense (MFA) sites. This derogatory term refers to websites that have no redeeming value and exist solely to lure visitors into clicking on advertisements.

Made for AdSense sites are considered to be spamming search engines and diluting the search results, leaving users with less-than-satisfactory results. The scraped content is considered redundant, because it duplicates what the search engine would have shown under normal circumstances had no MFA website appeared in the listings.
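
One way to make the notion of redundant content concrete is to measure how much two texts overlap. The sketch below uses word shingles and Jaccard similarity, a common near-duplicate heuristic; the source does not say which method search engines use, so this technique and the function names are assumptions made for illustration.

```python
def shingles(text, size=3):
    """Return the set of overlapping word sequences of the given size."""
    words = text.lower().split()
    if len(words) < size:
        # Texts shorter than one shingle are treated as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b, size=3):
    """Jaccard similarity of two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


# Hypothetical texts: an original, a scraped copy with an ad line
# appended, and an unrelated page.
original = "a scraper site copies all of its content from other websites"
scraped = original + " click here"
unrelated = "fresh bread needs flour water and salt"

copy_score = jaccard(original, scraped)
other_score = jaccard(original, unrelated)
```

A score near 1.0 suggests that a page adds little beyond its source, while a score near 0.0 suggests independent content. Any threshold for treating a page as redundant would be a judgment call.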

Legality


Scraper sites may violate copyright law. Even taking content from an open content site can be a copyright violation if it is done in a way that does not respect the license. For instance, the GNU Free Documentation License (GFDL) and Creative Commons ShareAlike (CC-BY-SA) licenses require that a republisher inform readers of the license conditions and give credit to the original author.

Techniques


Many scrapers pull snippets and text from websites that rank highly for the keywords they have targeted, hoping in turn to rank highly in the search engine results pages (SERPs). RSS feeds are vulnerable to scrapers.
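
The reason feeds are so exposed is that they publish content in a structured, machine-readable form. As a minimal sketch, the following reads the items of a small RSS 2.0 document with Python's standard `xml.etree.ElementTree` module; the feed text and the `read_items` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed, embedded as a string so the example is self-contained.
FEED = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
      <description>Full text of the first post.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
      <description>Full text of the second post.</description>
    </item>
  </channel>
</rss>"""


def read_items(feed_xml):
    """Return the title, link, and description of every item in the feed."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return items


items = read_items(FEED)
```

Because every field is already labelled, no page-specific parsing is needed, which is what makes feeds easy for both legitimate readers and scrapers to consume.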

Some scraper sites consist of advertisements and paragraphs of words randomly selected from a dictionary. Often a visitor will click on a pay-per-click advertisement because it is the only comprehensible text on the page, and the operators of these sites gain financially from those clicks. Ad networks claim to be constantly working to remove these sites from their programs, although this is disputed, since the networks benefit directly from the clicks generated at these kinds of sites. From the advertiser's point of view, the networks do not appear to be making enough effort to stop the problem.

Scrapers tend to be associated with link farms and are sometimes perceived as the same thing when multiple scrapers link to the same target site. A site that is frequently targeted in this way might be accused of link-farm participation because of the artificial pattern of incoming links from multiple scraper sites.

Domain hijacking



Some spammers who create scraper sites may hijack a recently expired domain name. Doing so allows them to exploit the search rankings and incoming links already established for that domain. Some spammers may even try to match the topic of the expired site in order to exploit its search rankings for those keywords. For example, an expired website for a photographer may be hijacked by a spammer who then generates a scraper site about photography tips.