Website correlation
Website correlation, or website matching, is a process used to identify websites that are similar or related. Websites are inherently easy to duplicate, which has led to a proliferation of identical or very similar websites for purposes ranging from translation to Internet marketing (especially affiliate marketing) to Internet crime. Locating similar websites is inherently problematic because they may be in different languages, on different servers, or in different countries (under different top-level domains).

Uses

Website correlation is used in:
  • Internet investigations, to determine the overall scope of an investigation
  • Market research, to locate competitors, determine the market reach of competing companies, or support cluster sampling
  • Web filtering systems, to ensure that all websites of a specific type are blocked from view
  • Data mining systems, to maximize input or output data
  • Risk management programs, to ensure websites are being monitored for problems that introduce fiscal risk
  • Compliance monitoring, as part of a compliance and ethics program or policy, to ensure websites follow established guidelines

Correlation types

There are several known types of correlation, each demonstrating different strengths and weaknesses. A practical website correlation process may require combining two or more of these methods.

Similar structure

To save time and effort, website owners duplicate major portions of website code across many domains. Similarity of code structure can provide enough information for correlation. Organizations known to have publicly searchable databases for this kind of correlation include:
  • http://www.delineal.com


note: Websites can sometimes utilize the same structure but have no relationship to each other (as when websites coincidentally utilize the same content management system).
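
As a rough illustration of structural comparison, the sketch below reduces each page to its sequence of opening HTML tags and compares the two sequences. This is only one possible approach, not the method used by any service listed above; the helper names (`tag_sequence`, `structure_similarity`) and the sample pages are hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagSequenceParser(HTMLParser):
    """Collects opening tag names in document order, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Return the list of opening tag names found in an HTML string."""
    parser = TagSequenceParser()
    parser.feed(html)
    parser.close()
    return parser.tags


def structure_similarity(html_a, html_b):
    """Return a ratio in [0, 1]; 1.0 means the tag sequences are identical."""
    return SequenceMatcher(None, tag_sequence(html_a), tag_sequence(html_b)).ratio()


# Hypothetical pages: page_b is a translated copy of page_a, page_c is unrelated.
page_a = "<html><body><div><h1>Shoes</h1><p>Buy now</p></div></body></html>"
page_b = "<html><body><div><h1>Zapatos</h1><p>Compre ahora</p></div></body></html>"
page_c = "<html><body><table><tr><td>News</td></tr></table></body></html>"

print(structure_similarity(page_a, page_b))
print(structure_similarity(page_a, page_c))
```

Because only tag names are compared, a translated copy scores as high as an exact duplicate. Two unrelated sites built on the same content management system would also score high, which is the false positive described in the note above.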

Same server or subnet

Also known as correlated reverse DNS lookup. Websites may be served from the same server, on one or more IP addresses, or on one or more subnets. Several organizations retain archives of IP address data and correlate the data. Examples include:
  • http://webboar.com
  • http://www.domaintools.com

note: Correlation via this method may be misleading because unrelated websites frequently exist on the same server (shared hosting).
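
A minimal sketch of subnet-based grouping follows. It assumes the domain-to-IP mapping has already been collected (for example by DNS resolution, which is omitted here). The function name `group_by_subnet`, the /24 default prefix, and the sample domains and addresses are illustrative assumptions.

```python
import ipaddress
from collections import defaultdict


def group_by_subnet(domain_ips, prefix_len=24):
    """Map each IPv4 subnet of the given prefix length to the domains found in it."""
    groups = defaultdict(set)
    for domain, ip in domain_ips.items():
        # strict=False lets a host address stand in for its enclosing network
        network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        groups[str(network)].add(domain)
    # Keep only subnets shared by two or more domains
    return {net: sorted(doms) for net, doms in groups.items() if len(doms) > 1}


# Hypothetical resolution results, using documentation-reserved address ranges
resolved = {
    "example-shop.test": "192.0.2.10",
    "example-store.test": "192.0.2.77",
    "unrelated.test": "198.51.100.5",
}
shared = group_by_subnet(resolved)
print(shared)
```

Setting `prefix_len=32` groups by exact IP address instead of by subnet. In both cases a shared group is only a lead to investigate, since shared hosting places unrelated sites on the same addresses.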

Same owner

Websites may be authored by the same person or organization. Website owners are required to provide contact information to a registrar to obtain a domain name. Domain ownership can be determined via the WHOIS protocol, which provides no mechanism for searching or correlating ownership. Several organizations retain archives of WHOIS information and provide searching and correlation services. Examples include:
  • http://www.webboar.com
  • http://www.domaintools.com


note: Website ownership information can be falsified, outdated, or hidden from public view. Correlation via this method can be accurate, misleading, or impossible depending on the information contained in WHOIS records.
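
Since WHOIS itself offers no search, ownership correlation has to be done over an archive of previously collected records. The sketch below assumes the records have already been parsed into dictionaries. The field names (`domain`, `registrant_email`) and the sample data are hypothetical, as WHOIS output is not standardized across registrars.

```python
from collections import defaultdict


def normalize(value):
    """Lowercase and collapse whitespace so trivially different entries match."""
    return " ".join(value.lower().split())


def group_by_registrant(records, field="registrant_email"):
    """Group domains whose WHOIS records share the same value for `field`."""
    groups = defaultdict(list)
    for record in records:
        value = record.get(field)
        if not value:
            continue  # hidden or missing contact data cannot be correlated
        groups[normalize(value)].append(record["domain"])
    # Keep only registrant values shared by two or more domains
    return {key: sorted(doms) for key, doms in groups.items() if len(doms) > 1}


# Hypothetical parsed WHOIS records
records = [
    {"domain": "alpha.test", "registrant_email": "Owner@Example.test"},
    {"domain": "beta.test", "registrant_email": "owner@example.test "},
    {"domain": "gamma.test", "registrant_email": None},
    {"domain": "delta.test", "registrant_email": "other@example.test"},
]
same_owner = group_by_registrant(records)
print(same_owner)
```

Records with hidden contact data are skipped rather than grouped together, and falsified entries will still produce misleading groups, as the note above warns.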

Similar content

Search engines provide searchable databases of indexed website content. Search engine result lists are correlated by content similarity.

Google

  • on Google.com, type 'related:website_name_here.com' to find websites related by name or phrases
  • find a unique-sounding phrase on the website, then use one or more search engines to locate that exact phrase on other websites
    • in the search box, place quotes around the phrase to do a literal phrase search
    • for example, instead of copyright 2010 xyzcompany use "copyright 2010 xyzcompany"

note: This method of correlation is inherently slow because one must guess which phrases to search for. Also, related websites may not contain literally similar content (as when a site is translated into another language).
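
Once candidate pages have been found, literal overlap between two texts can be measured instead of judged by eye. One common measure, shown here as a sketch and not as what any search engine actually uses, is Jaccard similarity over word shingles (runs of consecutive words). The function names and the shingle size of 3 are assumptions.

```python
import re


def shingles(text, size=3):
    """Return the set of consecutive word runs of length `size`, lowercased."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(text_a, text_b, size=3):
    """Shared shingles divided by all distinct shingles, in [0, 1]."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    if not a and not b:
        return 0.0  # neither text is long enough to compare
    return len(a & b) / len(a | b)


# Hypothetical page footers
footer_a = "copyright 2010 xyzcompany all rights reserved"
footer_b = "Copyright 2010 XYZCompany. Contact us today"

print(jaccard(footer_a, footer_a))
print(jaccard(footer_a, footer_b))
```

Like phrase searching, this measure only detects literal overlap, so a translated copy of a site would score near zero.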

Same category

Websites are frequently categorized or tagged similarly via automated or manual means. Examples of publicly accessible website categorization databases include:
  • http://www.similarsitesearch.com/
  • http://similarsites.com
  • Open Directory Project


note: Manual categorization and tagging methods are inherently subjective. Automated categorization and tagging methods are subject to the varying weaknesses and strengths of the underlying categorization algorithms.

Same tracking ID

Tracking IDs, used for analytics or affiliate identification, are frequently embedded in website code. These IDs can be used for correlation because they imply common management of websites. Publicly available websites for correlating by tracking ID include:
  • http://w3who.net
  • http://www.webboar.com/tools/id-lookup/
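
The idea can be sketched as follows: extract tracking IDs from each page's source, then group the domains that share one. The pattern below targets IDs of the form `UA-<account>-<property>`, as used by classic Google Analytics. The exact digit ranges, the function names, and the sample pages are assumptions, and other analytics or affiliate networks would need their own patterns.

```python
import re
from collections import defaultdict

# Assumed pattern for classic Google Analytics IDs such as UA-1234567-1
TRACKING_ID = re.compile(r"\bUA-\d{4,10}-\d{1,4}\b")


def extract_ids(html):
    """Return the set of tracking IDs found in a page's source."""
    return set(TRACKING_ID.findall(html))


def correlate_by_tracking_id(pages):
    """Map each tracking ID to the domains that embed it, keeping shared IDs only."""
    by_id = defaultdict(set)
    for domain, html in pages.items():
        for tracking_id in extract_ids(html):
            by_id[tracking_id].add(domain)
    return {tid: sorted(doms) for tid, doms in by_id.items() if len(doms) > 1}


# Hypothetical page sources keyed by domain
pages = {
    "site-a.test": "<script>_gaq.push(['_setAccount', 'UA-1234567-1']);</script>",
    "site-b.test": "<script>_gaq.push(['_setAccount', 'UA-1234567-1']);</script>",
    "site-c.test": "<script>_gaq.push(['_setAccount', 'UA-7654321-1']);</script>",
}
related = correlate_by_tracking_id(pages)
print(related)
```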
The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 