FileQuirks
FileQuirks is a bioinformatics web server for recognizing biological data types, developed in the Laboratory of Bioinformatics and Protein Engineering at the International Institute of Molecular and Cell Biology (IIMCB) in Warsaw (GeneSilico). It lets users quickly check the format of a file containing biological data.

Background

Publicly available bioinformatics tools and data are growing explosively, and the number of data formats in use is rising steadily alongside them. Examples include the FASTA format, mass spectrometry data formats, the European Data Format and the Protein Data Bank (PDB) file format. Despite several unification attempts, more than 20 formats for biological sequences are in use, and the number is still growing. Although various initiatives promote standardized XML, CSV or tabular formats, most commonly used file formats are raw text files with no characteristic features that could be used to identify or distinguish them. As a result, users of bioinformatics software spend a significant amount of time working out what format their files are in and whether those formats are compatible with the input or output formats of the tools they want to use.

Algorithm

FileQuirks checks the format of the data file using an extremely simple and data-driven algorithm.

Example files for each file format are stored in a database. To add support for a new format, it is enough to provide example files in that format.

The system calculates a set of descriptors (hundreds or more), each with a value of 0 or 1, which are evaluated for each of the stored files. The descriptors currently in use are regular expressions, designed to recognize patterns common in biological data, such as the word "BLAST", present in every BLAST report, or the ">" sign at the beginning of a line in sequence formats. If a regular expression matches a given file, the value of the descriptor is 1; otherwise it is 0. Matching is performed with the Python re module, with the multiline flag enabled.
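The descriptor step can be illustrated with a minimal Python sketch. The three patterns, the function name describe and the sample texts are assumptions made for illustration; the actual FileQuirks descriptor set is not listed in this article, and the use of re.search (rather than re.match) is also assumed.

```python
import re

# Hypothetical descriptor patterns; the real FileQuirks set is not given here.
DESCRIPTOR_PATTERNS = [
    r"BLAST",        # the word "BLAST", as found in BLAST reports
    r"^>",           # a line starting with ">", typical of sequence formats
    r"^ATOM\s+\d+",  # assumed extra example: a PDB-style ATOM record
]

# The multiline flag makes "^" match at the start of every line.
COMPILED = [re.compile(p, re.MULTILINE) for p in DESCRIPTOR_PATTERNS]


def describe(text):
    """Return a tuple of 0/1 descriptors, one per pattern."""
    return tuple(1 if rx.search(text) else 0 for rx in COMPILED)


fasta_text = ">seq1 example\nACGTACGTACGT\n"
blast_text = "BLASTP 2.2.18\nQuery= example\n"

print(describe(fasta_text))
print(describe(blast_text))
```

Each file, whether a stored example or a user query, is reduced in this way to a fixed-length vector of zeros and ones that can be compared with other vectors.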

The user's query is evaluated against all regular expressions in the database. The data formats whose example files match a similar set of regular expressions are then presented to the user.
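The article does not say how similarity between descriptor sets is measured. As one possible reading, the sketch below scores each format by the fraction of descriptors on which the query agrees with the closest stored example. The similarity measure, the names hamming_similarity, EXAMPLES and rank_formats, and the stored vectors are all assumptions for illustration.

```python
def hamming_similarity(a, b):
    """Fraction of positions at which two equal-length 0/1 vectors agree."""
    if len(a) != len(b):
        raise ValueError("descriptor vectors must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical stored descriptors: format name -> vectors of its example files.
EXAMPLES = {
    "FASTA": [(0, 1, 0), (0, 1, 0)],
    "BLAST report": [(1, 0, 0), (1, 1, 0)],
    "PDB": [(0, 0, 1)],
}


def rank_formats(query_vector, examples=EXAMPLES):
    """Score each format by its best-matching example, highest score first."""
    scores = {
        name: max(hamming_similarity(query_vector, v) for v in vectors)
        for name, vectors in examples.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_formats((0, 1, 0))
print(ranking)
```

With hundreds of descriptors, a different measure or a score threshold might be more appropriate; the choice here is only meant to make the idea concrete.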

To improve the results, the database also contains a set of "expert" regular expressions, each designed to recognize only one specific format. One example is "(>([^\t\n\r\f\v]*)\r?\n\r?([ANCTGUanctgu\n\r]{20,})){2,}", which (believe it or not) matches only files with more than one nucleic acid sequence in FASTA format. Expert expressions are evaluated against every user query, and the matching data types are presented.
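The expert expression quoted above can be tried directly with the Python re module. In this sketch the pattern is copied from the text; the function name is_multi_nucleic_fasta, the sample inputs and the use of re.search are assumptions for illustration.

```python
import re

# The "expert" pattern quoted in the text, written as a raw string.
EXPERT_NUCLEIC_FASTA = re.compile(
    r"(>([^\t\n\r\f\v]*)\r?\n\r?([ANCTGUanctgu\n\r]{20,})){2,}",
    re.MULTILINE,
)


def is_multi_nucleic_fasta(text):
    """True if the text contains two or more consecutive nucleic acid FASTA records."""
    return EXPERT_NUCLEIC_FASTA.search(text) is not None


two_records = (
    ">seq1 sample\nACGTACGTACGTACGTACGTACGT\n"
    ">seq2 sample\nTTGACCATGGACCATGGACCATGG\n"
)
one_record = ">seq1 sample\nACGTACGTACGTACGTACGTACGT\n"
protein = ">p1\nMKVLAAGIVGLLLAQWERTYHH\n>p2\nMKVLAAGIVGLLLAQWERTYHH\n"

print(is_multi_nucleic_fasta(two_records))
print(is_multi_nucleic_fasta(one_record))
print(is_multi_nucleic_fasta(protein))
```

Note that the pattern requires at least 20 sequence characters (line breaks included) after each header, so files containing only very short sequences would not be recognized by it.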

The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 