Google Sitemaps
The Sitemaps protocol allows a webmaster to inform search engines about URLs on a website that are available for crawling. A Sitemap is an XML file that lists the URLs for a site. It allows webmasters to include additional information about each URL: when it was last updated, how often it changes, and how important it is in relation to other URLs in the site. This allows search engines to crawl the site more intelligently. Sitemaps are a URL inclusion protocol and complement robots.txt, a URL exclusion protocol.

Sitemaps are particularly beneficial on websites where:
  • some areas of the website are not available through the browsable interface, or
  • webmasters use rich Ajax, Silverlight, or Flash content that is not normally processed by search engines.


The webmaster can generate a Sitemap containing all accessible URLs on the site and submit it to search engines. Since Google, Bing, Yahoo!, and Ask now use the same protocol, a single Sitemap lets the major search engines receive up-to-date information about a site's pages.

Sitemaps supplement and do not replace the existing crawl-based mechanisms that search engines already use to discover URLs. Using this protocol does not guarantee that web pages will be included in search indexes, nor does it influence the way that pages are ranked in search results.

History

Google first introduced Sitemaps 0.84 in June 2005 so web developers could publish lists of links from across their sites. Google, MSN and Yahoo announced joint support for the Sitemaps protocol in November 2006. The schema version was changed to "Sitemap 0.90", but no other changes were made.

In April 2007, Ask.com and IBM announced support for Sitemaps. Google, Yahoo! and Microsoft also announced auto-discovery of Sitemaps through robots.txt. In May 2007, the state governments of Arizona, California, Utah and Virginia announced they would use Sitemaps on their web sites.

The Sitemaps protocol is based on ideas from "Crawler-friendly Web Servers".

File format

The Sitemap Protocol format consists of XML tags. The file itself must be UTF-8 encoded. Sitemaps can also be just a plain text list of URLs. They can also be compressed in .gz format.

A sample Sitemap that contains just one URL and uses all optional tags is shown below.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://example.com/</loc>
      <lastmod>2006-11-18</lastmod>
      <changefreq>daily</changefreq>
      <priority>0.8</priority>
   </url>
</urlset>

Element definitions

The definitions for the elements are shown below:
Element       Required?  Description
<urlset>      Yes        The document-level element for the Sitemap. The rest of the document after the '<?xml>' declaration must be contained in this.
<url>         Yes        Parent element for each entry. The remaining elements are children of this.
<loc>         Yes        Provides the full URL of the page, including the protocol (e.g. http, https) and a trailing slash, if required by the site's hosting server. This value must be less than 2,048 characters.
<lastmod>     No         The date that the file was last modified, in ISO 8601 format. This can display the full date and time or, if desired, may simply be the date in the format YYYY-MM-DD.
<changefreq>  No         How frequently the page may change:
  • always
  • hourly
  • daily
  • weekly
  • monthly
  • yearly
  • never

'Always' is used to denote documents that change each time that they are accessed. 'Never' is used to denote archived URLs (i.e. files that will not be changed again).

This is used only as a guide for crawlers, and is not used to determine how frequently pages are indexed.
<priority>    No         The priority of that URL relative to other URLs on the site. This allows webmasters to suggest to crawlers which pages are considered more important.

The valid range is from 0.0 to 1.0, with 1.0 being the most important. The default value is 0.5.

Rating all pages on a site with a high priority does not affect search listings, as it is only used to suggest to the crawlers how important pages in the site are to one another.


Support for the elements that are not required can vary from one search engine to another.

Sitemap index

The Sitemap XML protocol is also extended to provide a way of listing multiple Sitemaps in a 'Sitemap index' file. The maximum Sitemap size of 10 MB or 50,000 URLs makes this necessary for large sites. As a Sitemap may only list URLs located in or below its own directory, Sitemap indexes are also useful for websites with multiple subdomains, allowing the Sitemaps of each subdomain to be referenced through the Sitemap index file and robots.txt.
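
A minimal Sitemap index following the sitemaps.org schema might look like the sketch below (both sitemap locations are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <sitemap>
      <loc>http://example.com/sitemap1.xml.gz</loc>
      <lastmod>2006-11-18</lastmod>
   </sitemap>
   <sitemap>
      <loc>http://example.com/sitemap2.xml.gz</loc>
   </sitemap>
</sitemapindex>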

Text file

The Sitemaps protocol allows the Sitemap to be a simple list of URLs in a text file. The file specifications of XML Sitemaps apply to text Sitemaps as well; the file must be UTF-8 encoded, may not be larger than 10 MB or contain more than 50,000 URLs, but can be compressed as a gzip file.
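
For example, a text Sitemap is simply one fully qualified URL per line (the URLs below are placeholders):

http://example.com/
http://example.com/about.html
http://example.com/catalog/item1.html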

Syndication feed

A syndication feed is a permitted method of submitting URLs to crawlers; this is advised mainly for sites that already have syndication feeds. One stated drawback is that this method may only provide crawlers with recently created URLs; however, other URLs can still be discovered during normal crawling.
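
For instance, a stripped-down RSS 2.0 feed that crawlers could consume as a URL source might look like this (all titles, URLs and dates are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example site updates</title>
    <link>http://example.com/</link>
    <description>Recently changed pages</description>
    <item>
      <link>http://example.com/new-page.html</link>
      <pubDate>Sat, 18 Nov 2006 00:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>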

Search engine submission

If a Sitemap is submitted directly to a search engine (pinged), the engine will return status information and any processing errors. The details of submission vary between search engines. The location of the sitemap can also be included in the robots.txt file by adding the following line:


Sitemap: <sitemap_location>


The <sitemap_location> should be the complete URL to the sitemap, such as http://www.example.org/sitemap.xml. This directive is independent of the user-agent line, so it doesn't matter where it is placed in the file. If the website has several sitemaps, this URL can simply point to the main sitemap index file.
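
As a sketch, a robots.txt that both restricts crawling of a directory and advertises a sitemap (the paths are hypothetical) could read:

User-agent: *
Disallow: /private/

Sitemap: http://www.example.org/sitemap.xml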

The following table lists the sitemap submission URLs for several major search engines:
Search engine Submission URL Help page
Google http://www.google.com/webmasters/tools/ping?sitemap= Submitting a Sitemap
Yahoo! http://search.yahooapis.com/SiteExplorerService/V1/updateNotification?appid=SitemapWriter&url= Does Yahoo! support Sitemaps?
Ask.com http://submissions.ask.com/ping?sitemap= Q: Does Ask.com support sitemaps?
Bing (Live Search) http://www.bing.com/webmaster/ping.aspx?siteMap= Bing Webmaster Tools
Yandex n/a Sitemaps files (in Russian)
Didikle (Turkish mobile search engine) http://www.didikle.com/ping?sitemap= Sitemap and robots.txt info (in Turkish)


Sitemap URLs submitted through these endpoints need to be URL-encoded, replacing : with %3A, / with %2F, and so on.
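
A minimal sketch of URL-encoding a sitemap location and pinging Google's submission endpoint from the table above, in Python (the sitemap URL is a placeholder):

import urllib.parse
import urllib.request

# Hypothetical sitemap location; substitute a real, publicly reachable URL.
sitemap_url = "http://www.example.org/sitemap.xml"

# Percent-encode everything, so ':' becomes '%3A' and '/' becomes '%2F'.
encoded = urllib.parse.quote(sitemap_url, safe="")

# Ping the submission endpoint; HTTP 200 only means the ping was received,
# not that the sitemap's URLs will be indexed.
with urllib.request.urlopen(
        "http://www.google.com/webmasters/tools/ping?sitemap=" + encoded) as response:
    print(response.status)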

Sitemap limits

Sitemap files have a limit of 50,000 URLs and 10 megabytes (10,485,760 bytes) per sitemap. Sitemaps can be compressed using gzip, reducing bandwidth consumption. Multiple sitemap files are supported, with a Sitemap index file serving as an entry point. A Sitemap index file may not list more than 50,000 Sitemaps, must be no larger than 10 MB (10,485,760 bytes), and can likewise be compressed. A site can have more than one Sitemap index file.
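
Compressing a sitemap is plain gzip compression of the XML file; a minimal sketch in Python, assuming a sitemap.xml already exists in the working directory:

import gzip

# Hypothetical file names: compress sitemap.xml into sitemap.xml.gz,
# which crawlers fetch and decompress transparently.
with open("sitemap.xml", "rb") as source, gzip.open("sitemap.xml.gz", "wb") as target:
    target.write(source.read())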

As with all XML files, any data values (including URLs) must use entity escape codes for the characters ampersand (&), single quote ('), double quote ("), less than (<), and greater than (>).
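
As a sketch of what that escaping looks like in practice, Python's standard library can escape a raw URL before it is written into a <loc> element (the query-string URL here is made up):

from xml.sax.saxutils import escape

# Hypothetical URL containing characters that may not appear raw in XML.
raw_url = "http://example.com/catalog?item=1&desc=<new>"

# escape() converts & < > by default; the extra mapping covers both quote
# characters, producing &amp; &lt; &gt; &quot; &apos; as the protocol requires.
escaped = escape(raw_url, {'"': "&quot;", "'": "&apos;"})

print("<loc>" + escaped + "</loc>")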

See also

  • Biositemap
  • Metadata
  • Resources of a Resource (ROR)
  • Yahoo! Site Explorer
  • Google Webmaster Tools
