FASTA format
In bioinformatics, FASTA format is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows for sequence names and comments to precede the sequences. The format originates from the FASTA software package, but has now become a standard in the field of bioinformatics.

The simplicity of FASTA format makes it easy to manipulate and parse sequences using text-processing tools and scripting languages like Python, Ruby, and Perl.

Format

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. The word following the ">" symbol is the identifier of the sequence, and the rest of the line is the description (both are optional). There should be no space between the ">" and the first letter of the identifier. It is recommended that all lines of text be shorter than 80 characters. The sequence ends if another line starting with a ">" appears; this indicates the start of another sequence. A simple example of one sequence in FASTA format:

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY
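
The rules described in this section can be sketched as a small Python parser. This is only an illustration under simple assumptions, not a reference implementation: the function names `parse_fasta` and `_record` are made up for this example, blank lines are skipped, and any text before the first ">" line is ignored.

```python
def parse_fasta(lines):
    """Yield (identifier, description, sequence) tuples from FASTA lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new ">" line ends the previous sequence, if there was one.
            if header is not None:
                yield _record(header, chunks)
            header = line[1:]
            chunks = []
        elif header is not None and line.strip():
            # Sequence data may span several lines; collect the pieces.
            chunks.append(line.strip())
    if header is not None:
        yield _record(header, chunks)


def _record(header, chunks):
    # The first word is the identifier, the rest of the line the description.
    parts = header.split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description, "".join(chunks)


example = """>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
IENY
"""
records = list(parse_fasta(example.splitlines()))
```

Because the parser is a generator, a large file can be processed one sequence at a time by passing an open file object instead of a list of lines.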

History

The original FASTA/Pearson format is described in the documentation for the FASTA suite of programs. It can be downloaded with any free distribution of FASTA (see fasta20.doc, fastaVN.doc or fastaVN.me, where VN is the version number).

A sequence in FASTA format is represented as a series of lines, which should be no longer than 120 characters and usually do not exceed 80 characters. This was probably to allow for preallocation of fixed line sizes in software: at the time, most users relied on DEC VT (or compatible) terminals, which could display 80 or 132 characters per line. Most people preferred the bigger font in 80-character mode, and so it became the recommended practice to use 80 characters or fewer (often 70) in FASTA lines.

The first line in a FASTA file starts with either a ">" (greater-than) symbol or a ";" (semicolon) and was taken as a comment. Subsequent lines starting with a semicolon would be ignored by software. Since the only comment used was the first, it quickly became used to hold a summary description of the sequence, often starting with a unique library accession number. Over time it has become commonplace to always use ">" for the first line and not to use ";" comments (which would otherwise be ignored).

Following the initial line (used for a unique description of the sequence) is the actual sequence itself in standard one-letter code. Anything other than a valid code would be ignored (including spaces, tabs, asterisks, etc.). Originally it was also common to end the sequence with an "*" (asterisk) character (in analogy with use in PIR formatted sequences) and, for the same reason, to leave a blank line between the description and the sequence.

A few sample sequences:

;LCBO - Prolactin precursor - Bovine
; a sample sequence in FASTA format
MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSS
EMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPLYHL
VTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDED
ARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC*

>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY

A multiple-sequence FASTA file is obtained by concatenating several single-sequence FASTA files. This does not contradict the format, because only the first line in a FASTA file may start with either a ";" or a ">"; every subsequent sequence must start with a ">" in order to be taken as a separate sequence (which in turn reserves ">" exclusively for the sequence definition line). Thus, the examples above can also be read as one multisequence file if taken together.
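
The historical conventions described in this section can be sketched in Python as follows. This is a rough illustration, not the behaviour of the FASTA programs themselves: the name `parse_pearson` is hypothetical, and the sketch assumes that only letters count as valid codes, so asterisks, spaces and digits are all dropped.

```python
def parse_pearson(lines):
    """Return (title, sequence) pairs using the original Pearson conventions."""
    records = []
    title, residues = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines between description and sequence are allowed
        if line.startswith(">") or (line.startswith(";") and title is None):
            # ">" always starts a sequence; ";" does so only on the first line.
            if title is not None:
                records.append((title, "".join(residues)))
            title, residues = line[1:].strip(), []
        elif line.startswith(";"):
            continue  # later semicolon lines are comments and are ignored
        elif title is not None:
            # Assumption: keep letters only, which drops "*", spaces and digits.
            residues.extend(ch for ch in line if ch.isalpha())
    if title is not None:
        records.append((title, "".join(residues)))
    return records


sample = [
    ";LCBO - Prolactin precursor - Bovine",
    "; a sample sequence in FASTA format",
    "MDSKGSSQKG SRLL*",
    "",
    ">MCHU - Calmodulin - Human",
    "ADQLTEEQIAEF",
    "DIDGDGQVNYEEFVQMMTAK*",
]
parsed = parse_pearson(sample)
```

Here the sample lines are shortened versions of the sequences shown above, used only to keep the example small.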

Format converters

FASTA files can be batch converted to or from MultiFASTA format using tools, some of which are available as freeware. Tools are also available for batch conversion from chromatogram formats (ABI/SCF) to FASTA.

Header line

The header line, which begins with '>', gives a name and/or a unique identifier for the sequence, and often other information as well. Many different sequence databases use standardized headers, which helps when automatically extracting information from the header. The header line may contain more than one header, separated by a ^A (Control-A) character.
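
As a minimal sketch of the Control-A convention, the following Python function splits a header line into its individual headers. The function name `split_deflines` and the identifiers in the usage line are invented for illustration; Control-A is the character with code 1, written `"\x01"` in Python.

```python
def split_deflines(header_line):
    """Split a FASTA header line into its Control-A separated headers."""
    # Drop the leading ">" if it is present.
    text = header_line[1:] if header_line.startswith(">") else header_line
    # Control-A is ASCII code 1; empty pieces are discarded.
    return [part.strip() for part in text.split("\x01") if part.strip()]


# Hypothetical header carrying two definitions for the same sequence.
headers = split_deflines(">gi|1|gb|AAA.1| protein one\x01gi|2|gb|BBB.1| protein two")
```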

In the original Pearson FASTA format, one or more comments, distinguished by a semicolon at the beginning of the line, may occur after the header. Most databases and bioinformatics applications do not recognize these comments and instead follow the NCBI FASTA specification. An example of a multiple sequence FASTA file follows:


>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

Sequence representation

After the header line and comments, one or more lines may follow describing the sequence: each line of a sequence should have fewer than 80 characters. Sequences may be protein sequences or nucleic acid sequences, and they can contain gaps or alignment characters (see sequence alignment). Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions: lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap character; and in amino acid sequences, U and * are acceptable letters (see below). Numerical digits are not allowed but are used in some databases to indicate the position in the sequence.
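
The conventions in the paragraph above can be illustrated with a short Python sketch that cleans up raw sequence text. The function name `normalize_sequence` and the default width of 70 are assumptions made for this example; the sketch maps lower-case letters to upper-case, strips whitespace and position digits, keeps gap characters, and cuts the result into lines shorter than 80 characters.

```python
def normalize_sequence(raw, width=70):
    """Return the sequence as a list of upper-case lines of at most `width` characters."""
    # Drop whitespace and the position digits that some databases include.
    cleaned = "".join(
        ch.upper() for ch in raw if not ch.isspace() and not ch.isdigit()
    )
    # Slice by hand so that "-" gap characters never influence line breaks.
    return [cleaned[i:i + width] for i in range(0, len(cleaned), width)]


lines = normalize_sequence("  1 acgt acgt-n\n 11 ttga", width=8)
```

Slicing is used instead of a general text-wrapping routine because such routines may treat hyphens as preferred break points, which would give uneven line lengths in gapped sequences.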

The nucleic acid codes supported are:

Nucleic Acid Code   Meaning
A                   Adenine
C                   Cytosine
G                   Guanine
T                   Thymine
U                   Uracil
R                   G A (puRine)
Y                   T U C (pYrimidine)
K                   G T U (Ketone)
M                   A C (aMino group)
S                   G C (Strong interaction)
W                   A T U (Weak interaction)
B                   G T U C (not A) (B comes after A)
D                   G A T U (not C) (D comes after C)
H                   A C T U (not G) (H comes after G)
V                   G C A (not T, not U) (V comes after U)
N                   A G C T U (aNy)
X                   masked
-                   gap of indeterminate length
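
The table above translates naturally into a lookup structure. The following Python sketch is one possible encoding, with the names `IUPAC_NUCLEOTIDES` and `codes_compatible` chosen for this example. It assumes that T and U should be treated as equivalent when comparing codes, and it leaves out the masking code X and the gap character, which do not denote a set of bases.

```python
# Each ambiguity code maps to the bases it can stand for (from the table above).
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "AG", "Y": "CTU", "K": "GTU", "M": "AC",
    "S": "CG", "W": "ATU",
    "B": "CGTU", "D": "AGTU", "H": "ACTU", "V": "ACG",
    "N": "ACGTU",
}


def codes_compatible(a, b):
    """Return True if two codes can denote the same base."""
    def bases(code):
        # Assumption: T and U are interchangeable for comparison purposes.
        return set(IUPAC_NUCLEOTIDES[code.upper()].replace("U", "T"))
    return bool(bases(a) & bases(b))
```

For instance, `codes_compatible("R", "A")` holds because R covers A and G, whereas R and Y share no base. Codes outside the dictionary raise a `KeyError`, which a caller may prefer to handle explicitly.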


The amino acid codes supported (24 amino acid codes and 3 special codes) are:

Amino Acid Code   Meaning
A                 Alanine
B                 Aspartic acid or Asparagine
C                 Cysteine
D                 Aspartic acid
E                 Glutamic acid
F                 Phenylalanine
G                 Glycine
H                 Histidine
I                 Isoleucine
K                 Lysine
L                 Leucine
M                 Methionine
N                 Asparagine
O                 Pyrrolysine
P                 Proline
Q                 Glutamine
R                 Arginine
S                 Serine
T                 Threonine
U                 Selenocysteine
V                 Valine
W                 Tryptophan
Y                 Tyrosine
Z                 Glutamic acid or Glutamine
X                 any
*                 translation stop
-                 gap of indeterminate length
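
A simple check of a protein sequence against the table above might look like the following Python sketch. The names are hypothetical, and the sketch accepts exactly the codes listed in the table, so a letter such as J, which the table does not list, is reported as invalid.

```python
# The 24 letter codes and 3 special codes from the table above.
AMINO_ACID_CODES = set("ABCDEFGHIKLMNOPQRSTUVWYZ")
SPECIAL_CODES = set("X*-")


def invalid_protein_symbols(sequence):
    """Return the sorted list of symbols that are not in the table."""
    allowed = AMINO_ACID_CODES | SPECIAL_CODES
    # Lower-case letters are accepted and mapped into upper-case.
    return sorted({ch for ch in sequence.upper() if ch not in allowed})
```

An empty result means every symbol is one of the supported codes; it says nothing about whether the sequence is biologically plausible.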

Sequence identifiers

The NCBI defined a standard for the unique identifier used for the sequence (SeqID) in the header line. The formatdb man page has this to say on the subject: "formatdb will automatically parse the SeqID and create indexes, but the database identifiers in the FASTA definition line must follow the conventions of the FASTA Defline Format."

However, they do not give a definitive description of the FASTA defline format. An attempt to describe such a format is given below (see also "The NCBI Handbook", Chapter 16, The BLAST Sequence Analysis Tool).

Database                          Identifier syntax
GenBank                           gi|gi-number|gb|accession|locus
EMBL Data Library                 gi|gi-number|emb|accession|locus
DDBJ, DNA Database of Japan       gi|gi-number|dbj|accession|locus
NBRF PIR                          pir||entry
Protein Research Foundation       prf||name
SWISS-PROT                        sp|accession|name
Brookhaven Protein Data Bank (1)  pdb|entry|chain
Brookhaven Protein Data Bank (2)  entry:chain|PDBID|CHAIN|SEQUENCE
Patents                           pat|country|number
GenInfo Backbone Id               bbs|number
General database identifier       gnl|database|identifier
NCBI Reference Sequence           ref|accession|locus
Local Sequence identifier         lcl|identifier

The vertical bars in the above list are not separators in the sense of the Backus-Naur form, but are part of the format.
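
Given the list above, an identifier can be taken apart by reading a tag and then the number of fields that tag is expected to carry. The Python sketch below is an illustration based only on that list, not an implementation of any NCBI tool: `SEQID_FIELDS` and `parse_seqid` are names invented here, the empty first field of the pir and prf forms is skipped, and the second Protein Data Bank form (entry:chain|...) is not handled.

```python
# Field names per tag, following the list above. None marks a field that the
# list leaves empty (as in "pir||entry").
SEQID_FIELDS = {
    "gi": ("gi-number",),
    "gb": ("accession", "locus"),
    "emb": ("accession", "locus"),
    "dbj": ("accession", "locus"),
    "pir": (None, "entry"),
    "prf": (None, "name"),
    "sp": ("accession", "name"),
    "pdb": ("entry", "chain"),
    "pat": ("country", "number"),
    "bbs": ("number",),
    "gnl": ("database", "identifier"),
    "ref": ("accession", "locus"),
    "lcl": ("identifier",),
}


def parse_seqid(identifier):
    """Split an identifier into (tag, fields) pairs; stop at an unknown tag."""
    tokens = identifier.split("|")
    parsed = []
    i = 0
    while i < len(tokens):
        names = SEQID_FIELDS.get(tokens[i])
        if names is None:
            break  # unknown tag or trailing empty token
        values = tokens[i + 1:i + 1 + len(names)]
        fields = {n: v for n, v in zip(names, values) if n is not None}
        parsed.append((tokens[i], fields))
        i += 1 + len(names)
    return parsed


ids = parse_seqid("gi|5524211|gb|AAD44166.1|")
```

In this example the locus field comes back as an empty string, because the identifier ends with a vertical bar and nothing follows it.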

File extension

There is no standard file extension for a text file containing FASTA formatted sequences. The table below lists commonly used extensions and their respective meanings.

Extension   Meaning                           Notes
fasta       generic FASTA                     Any generic FASTA file. Other extensions can be fa, seq, fsa.
fna         FASTA nucleic acid                For coding regions of a specific genome, use ffn, but otherwise fna is useful for generically specifying nucleic acids.
ffn         FASTA nucleotide coding regions   Contains coding regions for a genome.
faa         FASTA amino acid                  Contains amino acids. A multiple protein FASTA file can have the more specific extension mpfa.
frn         FASTA non-coding RNA              Contains non-coding RNA regions for a genome, in DNA alphabet, e.g. tRNA, rRNA.
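
Since the extension is only a convention, a program can at most use it as a hint. The following Python sketch maps the extensions from the table to a short label. The names `FASTA_EXTENSIONS` and `guess_fasta_kind` and the file names in the comments are made up for this example.

```python
import os

# Extension-to-content hints taken from the table above.
FASTA_EXTENSIONS = {
    ".fasta": "generic", ".fa": "generic", ".seq": "generic", ".fsa": "generic",
    ".fna": "nucleic acid",
    ".ffn": "nucleotide coding regions",
    ".faa": "amino acid", ".mpfa": "amino acid",
    ".frn": "non-coding RNA",
}


def guess_fasta_kind(path):
    """Return a content hint for a file name, or None if the extension is unknown."""
    ext = os.path.splitext(path)[1].lower()
    return FASTA_EXTENSIONS.get(ext)


# Hypothetical file names: "genome.FNA" is treated like "genome.fna".
kind = guess_fasta_kind("genome.FNA")
```

A result of None only means that the extension is not in the table; the file may still contain FASTA formatted sequences.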

See also

  • FASTA
  • FASTQ format
  • Stockholm format
  • List of file formats for molecular biology

External links

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 