Sphinx (search engine)
Sphinx is a free software search engine designed with indexing database content in mind. It currently supports MySQL, PostgreSQL, and ODBC-compliant databases as data sources natively. Other data sources can be indexed via a pipe in a custom XML format. It is distributed under the terms of the GNU General Public License version 2 or a proprietary license.
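
The XML pipe route for otherwise unsupported data sources can be sketched as a small script whose output is fed to the indexer. The element names below are based on the xmlpipe2 format described in the Sphinx documentation and should be checked against the version in use; the helper name build_docset, the field and attribute names, and the sample row are hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr


def build_docset(rows):
    """Return an xmlpipe2-style document stream as a string.

    rows: iterable of (doc_id, title, body, published_ts) tuples.
    This four-column schema is an assumption made for illustration.
    """
    parts = [
        '<?xml version="1.0" encoding="utf-8"?>',
        "<sphinx:docset>",
        # Schema: two full-text fields and one timestamp attribute.
        "<sphinx:schema>",
        '<sphinx:field name="title"/>',
        '<sphinx:field name="body"/>',
        '<sphinx:attr name="published" type="timestamp"/>',
        "</sphinx:schema>",
    ]
    for doc_id, title, body, published in rows:
        parts.append(f"<sphinx:document id={quoteattr(str(doc_id))}>")
        # escape() protects &, < and > in the text content.
        parts.append(f"<title>{escape(title)}</title>")
        parts.append(f"<body>{escape(body)}</body>")
        parts.append(f"<published>{int(published)}</published>")
        parts.append("</sphinx:document>")
    parts.append("</sphinx:docset>")
    return "\n".join(parts)
```

Printing the result of build_docset(...) to standard output is enough for a pipe-based setup; building the stream from strings rather than an XML library keeps the sketch short, at the cost of not validating the output.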

Starting from version 0.9.9, querying is possible using SphinxQL, a subset of SQL. Starting from version 1.10-beta, both incremental (via the Real-Time backend) and batch indexing are supported.
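
Because SphinxQL is a subset of SQL, a full-text query reads like an ordinary SELECT with a MATCH() clause. The following minimal sketch only builds the statement and its parameters; the helper name build_search and the index name "articles" are hypothetical, and actually sending the statement requires a MySQL-protocol client connected to the Sphinx daemon, which is not shown.

```python
def build_search(index, text, limit=10):
    """Return (statement, params) for a SphinxQL full-text query.

    The %s placeholder follows the "format" parameter style used by
    common Python MySQL client libraries, which substitute and quote
    the value on the client side.
    """
    # Index names cannot be passed as parameters, so validate them.
    if not index.isidentifier():
        raise ValueError(f"unsafe index name: {index!r}")
    statement = f"SELECT * FROM {index} WHERE MATCH(%s) LIMIT {int(limit)}"
    return statement, (text,)


statement, params = build_search("articles", "free software", limit=5)
# statement == "SELECT * FROM articles WHERE MATCH(%s) LIMIT 5"
# params == ("free software",)
```

Keeping the search text as a parameter rather than formatting it into the string leaves quoting to the client library.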

Sphinx is used by more than 100 web sites and services, including Craigslist.

Features

  • Batch and incremental (soft real-time) full-text indexing.
  • Support for non-text attributes (scalars, strings, sets).
  • Direct indexing of SQL databases, with native support for MySQL, PostgreSQL, and MSSQL, plus ODBC connectivity.
  • XML document indexing support.
  • Distributed searching support out of the box.
  • Integration via access APIs.
  • SQL-like syntax support via the MySQL protocol (since 0.9.9).
  • Full-text searching syntax.
  • Database-like result set processing.
  • Relevance ranking utilizing additional factors besides standard BM25.
  • Text processing support for SBCS and UTF-8 encodings, stopwords, "hitless" words (indexed without stored keyword positions), stemming, word forms, tokenizing exceptions, and "blended characters" (dual-indexing as both a real character and a word separator).
  • User-defined function (UDF) support (since 2.0.1).
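
The BM25 function named in the relevance ranking item can be illustrated with the textbook Okapi BM25 formula. This is a sketch of the standard formula only, not Sphinx's own ranker, which combines BM25 with other factors; the function name bm25_scores, the defaults k1 = 1.2 and b = 0.75, and the toy documents are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in docs against query_terms.

    docs must be a non-empty list of token lists.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Smoothed IDF variant that never goes negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates with k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["free", "search", "engine"],
    ["database", "index"],
    ["search", "search", "index", "engine"],
]
scores = bm25_scores(["search"], docs)
# The third document ranks highest, the second scores 0.0.
```

A document that repeats the query term scores higher, but with diminishing returns, and longer documents are penalized relative to the average length.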

Performance and scalability

  • Indexing speed of up to 10-15 MB/sec per core and HDD.
  • Searching speed of up to 200-300 queries/sec against a 1,000,000-document, 1.2 GB collection.
  • The biggest known production instance indexes 8.1 billion documents; the busiest known one (Craigslist) serves over 50,000,000 queries/day.
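
To put these figures on a common scale, the quoted daily query volume can be converted to an average rate, and the quoted indexing speed to a build time for the 1.2 GB collection. This is back-of-the-envelope arithmetic only; it assumes load spread evenly over the day, 1 GB = 1000 MB, and a single core and disk, none of which the figures above state.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_qps(queries_per_day):
    """Average queries per second, assuming an evenly spread load."""
    return queries_per_day / SECONDS_PER_DAY


def indexing_seconds(size_mb, mb_per_sec):
    """Time to index size_mb megabytes at a constant mb_per_sec."""
    return size_mb / mb_per_sec


print(round(average_qps(50_000_000)))  # 579
print(indexing_seconds(1200, 15))      # 80.0
print(indexing_seconds(1200, 10))      # 120.0
```

Under these assumptions, 50,000,000 queries/day averages about 579 queries/sec, and a 1.2 GB collection would take roughly 80 to 120 seconds to index at 10-15 MB/sec.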
The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 