Stockholm format
Stockholm format is a multiple sequence alignment format used by Pfam and Rfam to disseminate protein and RNA sequence alignments. The alignment editors Ralee and [ftp://ftp.cgb.ki.se/pub/prog/belvu Belvu] support Stockholm format, as do the probabilistic database search tools Infernal and HMMER and the phylogenetic analysis tool Xrate. A simple example of an Rfam alignment (UPSK RNA) with a pseudoknot in Stockholm format is shown below:

# STOCKHOLM 1.0
#=GF ID UPSK
#=GF SE Predicted; Infernal
#=GF SS Published; PMID:9223489
#=GF RN [1]
#=GF RM 9223489
#=GF RT The role of the pseudoknot at the 3' end of turnip yellow mosaic
#=GF RT virus RNA in minus-strand synthesis by the viral RNA-dependent RNA
#=GF RT polymerase.
#=GF RA Deiman BA, Kortlever RM, Pleij CW;
#=GF RL J Virol 1997;71:5990-5996.

AF035635.1/619-641 UGAGUUCUCGAUCUCUAAAAUCG
M24804.1/82-104    UGAGUUCUCUAUCUCUAAAAUCG
J04373.1/6212-6234 UAAGUUCUCGAUCUUUAAAAUCG
M24803.1/1-23      UAAGUUCUCGAUCUCUAAAAUCG
#=GC SS_cons       .AAA....<<<<aaa....>>>>
//


Here is a slightly more complex example showing the Pfam CBS domain:
# STOCKHOLM 1.0
#=GF ID CBS
#=GF AC PF00571
#=GF DE CBS domain
#=GF AU Bateman A
#=GF CC CBS domains are small intracellular modules mostly found
#=GF CC in 2 or four copies within a protein.
#=GF SQ 5
#=GS O31698/18-71 AC O31698
#=GS O83071/192-246 AC O83071
#=GS O83071/259-312 AC O83071
#=GS O31698/88-139 AC O31698
#=GS O31698/88-139 OS Bacillus subtilis
O83071/192-246          MTCRAQLIAVPRASSLAEAIACAQKMRVSRVPVYERS
#=GR O83071/192-246 SA  9998877564535242525515252536463774777
O83071/259-312          MQHVSAPVFVFECTRLAYVQHKLRAHSRAVAIVLDEY
#=GR O83071/259-312 SS  CCCCCHHHHHHHHHHHHHEEEEEEEEEEEEEEEEEEE
O31698/18-71            MIEADKVAHVQVGNNLEHALLVLTKTGYTAIPVLDPS
#=GR O31698/18-71 SS    CCCHHHHHHHHHHHHHHHEEEEEEEEEEEEEEEEHHH
O31698/88-139           EVMLTDIPRLHINDPIMKGFGMVINN..GFVCVENDE
#=GR O31698/88-139 SS   CCCCCCCHHHHHHHHHHHHEEEEEEEEEEEEEEEEEH
#=GC SS_cons            CCCCCHHHHHHHHHHHHHEEEEEEEEEEEEEEEEEEH
O31699/88-139           EVMLTDIPRLHINDPIMKGFGMVINN..GFVCVENDE
#=GR O31699/88-139 AS   ________________*____________________
#=GR O31699/88-139 IN   ____________1____________2______0____
//


A minimal well-formed Stockholm file should contain a header that states the format and version identifier, currently '# STOCKHOLM 1.0', followed by the sequences and their corresponding unique sequence names:

<seqname> <aligned sequence>
<seqname> <aligned sequence>
<seqname> <aligned sequence>
//

<seqname> stands for "sequence name", typically in the form "name/start-end" or just "name". Finally, the "//" line indicates the end of the alignment. Sequence letters may include any characters except whitespace. Gaps may be indicated by "." or "-".
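
The rules above are enough to read a minimal file. The following Python sketch is illustrative only: the function names `parse_minimal_stockholm` and `split_name` are hypothetical, mark-up lines are skipped rather than interpreted, and a repeated sequence name is treated as an error, which is an assumption that would reject interleaved alignments.

```python
def parse_minimal_stockholm(text):
    """Return {seqname: aligned_sequence} from a minimal Stockholm alignment."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "# STOCKHOLM 1.0":
        raise ValueError("missing '# STOCKHOLM 1.0' header")
    sequences = {}
    for line in lines[1:]:
        line = line.strip()
        if line == "//":                      # end of the alignment
            return sequences
        if not line or line.startswith("#"):  # blank line or mark-up, skipped here
            continue
        parts = line.split()
        if len(parts) != 2:                   # sequence letters contain no whitespace
            raise ValueError(f"expected '<name> <sequence>': {line!r}")
        name, aligned = parts
        if name in sequences:                 # assumption: names are unique
            raise ValueError(f"duplicate sequence name: {name}")
        sequences[name] = aligned
    raise ValueError("missing '//' terminator")


def split_name(seqname):
    """Split 'name/start-end' into (name, start, end); a plain name gives (name, None, None)."""
    base, sep, coords = seqname.rpartition("/")
    if sep and "-" in coords:
        start, _, end = coords.partition("-")
        if start.isdigit() and end.isdigit():
            return base, int(start), int(end)
    return seqname, None, None
```

For example, `split_name("AF035635.1/619-641")` yields the accession together with the integer coordinates 619 and 641.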

The alignment mark-up

Mark-up lines may include any characters except whitespace. Use underscore ("_") instead of space.

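#=GF <feature> <Generic per-File annotation, free text>
#=GC <feature> <Generic per-Column annotation, exactly 1 char per column>
#=GS <seqname> <feature> <Generic per-Sequence annotation, free text>
#=GR <seqname> <feature> <Generic per-Residue annotation, exactly 1 char per residue>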
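
The four mark-up line types can be told apart by their prefix. The sketch below assumes the usual layout, in which #=GF and #=GC lines carry a feature followed by its text, while #=GS and #=GR lines carry a sequence name before the feature. The function name `parse_markup` is hypothetical.

```python
def parse_markup(line):
    """Split a '#=Gx' line into (kind, seqname, feature, text).

    seqname is None for per-file (GF) and per-column (GC) lines.
    """
    parts = line.split()
    kind = parts[0] if parts else ""
    if kind in ("#=GF", "#=GC"):
        fields = line.split(None, 2)   # kind, feature, text
        if len(fields) == 3:
            return kind[2:], None, fields[1], fields[2].strip()
    elif kind in ("#=GS", "#=GR"):
        fields = line.split(None, 3)   # kind, seqname, feature, text
        if len(fields) == 4:
            return kind[2:], fields[1], fields[2], fields[3].strip()
    raise ValueError(f"not a complete mark-up line: {line!r}")
```

Free-text annotations such as an organism name keep their internal spaces, because only the leading fields are split off.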


Recommended features

#=GF

(See the [ftp://ftp.sanger.ac.uk/pub/databases/Pfam/current_release/userman.txt Pfam] and the [ftp://ftp.sanger.ac.uk/pub/databases/Rfam/CURRENT/USERMAN Rfam] documentation under "Description of fields")

Pfam and Rfam may use the following tags:

Compulsory fields:
------------------
AC   Accession number:     Accession number in form PFxxxxx (Pfam) or RFxxxxx (Rfam).
ID   Identification:       One word name for family.
DE   Definition:           Short description of family.
AU   Author:               Authors of the entry.
SE   Source of seed:       The source suggesting the seed members belong to one family.
SS   Source of structure:  The source (prediction or publication) of the consensus RNA secondary structure used by Rfam.
BM   Build method:         Command line used to generate the model
SM   Search method:        Command line used to perform the search
GA   Gathering method:     Search threshold to build the full alignment.
TC   Trusted Cutoff:       Lowest sequence score (and domain score for Pfam) of match in the full alignment.
NC   Noise Cutoff:         Highest sequence score (and domain score for Pfam) of match not in full alignment.
TP   Type:                 Type of family -- presently Family, Domain, Motif or Repeat for Pfam.
                           -- a tree with roots Gene, Intron or Cis-reg for Rfam.
SQ   Sequence:             Number of sequences in alignment.

Optional fields:
----------------
DC   Database Comment:     Comment about database reference.
DR   Database Reference:   Reference to external database.
RC   Reference Comment:    Comment about literature reference.
RN   Reference Number:     Reference Number.
RM   Reference Medline:    Eight digit medline UI number.
RT   Reference Title:      Reference Title.
RA   Reference Author:     Reference Author
RL   Reference Location:   Journal location.
PI   Previous identifier:  Record of all previous ID lines.
KW   Keywords:             Keywords.
CC   Comment:              Comments.
NE   Pfam accession:       Indicates a nested domain.
NL   Location:             Location of nested domains - sequence ID, start and end of insert.
WK   Wikipedia link:       Wikipedia page
CL   Clan:                 Clan accession
MB   Membership:           Used for listing Clan membership
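
Free-text fields such as RT and CC are often continued over several #=GF lines, as in the examples earlier in this article. A simple way to read them, sketched here in Python with the hypothetical function name `collect_gf`, is to join the lines that share a tag with a single space. This is only an approximation: tags that repeat once per literature reference (RN, RM and so on) would be merged as well, so a fuller reader would group those by reference.

```python
from collections import defaultdict


def collect_gf(lines):
    """Join the text of repeated #=GF tags, in file order, into one string per tag."""
    fields = defaultdict(list)
    for line in lines:
        if not line.startswith("#=GF"):
            continue                       # sequences and other mark-up are ignored
        parts = line.split(None, 2)        # '#=GF', tag, text
        if len(parts) == 3:
            fields[parts[1]].append(parts[2].strip())
    return {tag: " ".join(values) for tag, values in fields.items()}
```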

For embedding trees:
--------------------
NH   New Hampshire:        A tree in New Hampshire eXtended format.
TN   Tree ID:              A unique identifier for the next tree.

  • Notes: A tree may be stored on multiple #=GF NH lines.
  • If multiple trees are stored in the same file, each tree must be preceded by a #=GF TN line with a unique tree identifier. If only one tree is included, the #=GF TN line may be omitted.
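
The tree rules above can be sketched as a small Python reader. The function name `collect_trees` is hypothetical, and the pieces of a tree split over several #=GF NH lines are assumed to concatenate without any separator. A single tree with no #=GF TN line is stored under the key `None`.

```python
def collect_trees(lines):
    """Return {tree_id: tree_string} assembled from #=GF TN and #=GF NH lines."""
    trees = {}
    current = None                          # identifier of the tree being read
    for line in lines:
        parts = line.split(None, 2)         # '#=GF', tag, value
        if len(parts) < 3 or parts[0] != "#=GF":
            continue
        tag, value = parts[1], parts[2].strip()
        if tag == "TN":                     # a new tree starts here
            current = value
            trees[current] = ""
        elif tag == "NH":                   # one piece of the current tree
            trees[current] = trees.get(current, "") + value
    return trees
```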


#=GS

Rfam and Pfam may use these features:


Feature                 Description
---------------------   -----------
AC                      ACcession number
DE                      DEscription
DR <db>; <accession>;   Database Reference
OS                      OrganiSm (species)
OC                      Organism Classification (clade, etc.)
LO                      Look (Color, etc.)


#=GR


Feature  Description            Markup letters
-------  -----------            --------------
SS       Secondary Structure    For RNA [.,;<>{}[]AaBb...],
                                For protein [HGIEBTSCX]
SA       Surface Accessibility  [0-9X]
                                (0=0%-10%; ...; 9=90%-100%)
TM       TransMembrane          [Mio]
PP       Posterior Probability  [0-9*]
                                (0=0.00-0.05; 1=0.05-0.15; *=0.95-1.00)
LI       LIgand binding         [*]
AS       Active Site            [*]
pAS      AS - Pfam predicted    [*]
sAS      AS - from SwissProt    [*]
IN       INtron (in or after)   [0-2]
RF       ReFerence annotation   Often the consensus RNA or protein sequence is used as a reference
                                Any non-gap character (e.g. x's) can indicate consensus/conserved/match columns
                                .'s or -'s indicate insert columns
                                ~'s indicate unaligned insertions
                                Upper and lower case can be used to discriminate strongly and weakly conserved
                                residues respectively
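
The single-character codes for SA and PP each stand for a numeric range. The following Python sketch decodes them into percentage ranges. The function names are hypothetical, and the PP ranges for the digits 2 to 9 are an extrapolation from the three values listed in the table (each digit n taken as n/10 plus or minus 0.05), not something the table states.

```python
def surface_accessibility_range(ch):
    """Map an SA character to (low, high) in percent; 'X' (unknown) gives None."""
    if ch == "X":
        return None
    if len(ch) != 1 or not ch.isdigit():
        raise ValueError(f"not an SA character: {ch!r}")
    n = int(ch)
    return (10 * n, 10 * n + 10)           # 0 -> 0-10 %, ..., 9 -> 90-100 %


def posterior_probability_range(ch):
    """Map a PP character to (low, high) in percent."""
    if ch == "*":
        return (95, 100)
    if len(ch) != 1 or not ch.isdigit():
        raise ValueError(f"not a PP character: {ch!r}")
    n = int(ch)
    if n == 0:
        return (0, 5)
    return (10 * n - 5, 10 * n + 5)        # assumed pattern for digits 1-9
```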


#=GC

The same features as for #=GR with "_cons" appended, meaning "consensus". Example: "SS_cons".
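
A reader can recover the underlying feature by stripping the suffix. The sketch below, with the hypothetical function name `check_gc`, also checks that the annotation has one character per alignment column, which is the usual expectation for per-column mark-up.

```python
def check_gc(feature, annotation, alignment_width):
    """Validate a #=GC feature and return the base feature name (e.g. 'SS')."""
    suffix = "_cons"
    if not feature.endswith(suffix):
        raise ValueError(f"#=GC feature should end in '_cons': {feature!r}")
    if len(annotation) != alignment_width:  # one character per column
        raise ValueError("annotation length differs from alignment width")
    return feature[:-len(suffix)]
```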

Recommended placements

  • #=GF Above the alignment
  • #=GC Below the alignment
  • #=GS Above the alignment or just below the corresponding sequence
  • #=GR Just below the corresponding sequence
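
A writer that follows these placements might look like the Python sketch below. The function name `write_stockholm` and its argument layout are hypothetical. #=GS lines are written above the alignment, which is one of the two placements allowed for them, and columns are separated by a single space rather than padded.

```python
def write_stockholm(gf, sequences, gs=None, gr=None, gc=None):
    """Serialise an alignment with mark-up in the recommended positions.

    gf: list of (feature, text); gs: list of (seqname, feature, text);
    sequences: list of (seqname, aligned_sequence);
    gr: dict seqname -> list of (feature, text); gc: list of (feature, text).
    """
    gs, gr, gc = gs or [], gr or {}, gc or []
    out = ["# STOCKHOLM 1.0"]
    for feature, text in gf:                       # #=GF above the alignment
        out.append(f"#=GF {feature} {text}")
    for name, feature, text in gs:                 # #=GS above the alignment
        out.append(f"#=GS {name} {feature} {text}")
    for name, aligned in sequences:
        out.append(f"{name} {aligned}")
        for feature, text in gr.get(name, []):     # #=GR just below its sequence
            out.append(f"#=GR {name} {feature} {text}")
    for feature, text in gc:                       # #=GC below the alignment
        out.append(f"#=GC {feature} {text}")
    out.append("//")
    return "\n".join(out) + "\n"
```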

Size limits

  • There are no explicit size limits on any field. However, a simple parser that uses fixed field sizes should work safely on Pfam and Rfam alignments with these limits:

    • Line length: 10000.
    • <seqname>: 255.
    • <feature>: 255.
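
These limits can be checked before handing a file to a fixed-size parser. The Python sketch below is illustrative: the function name `size_violations` is hypothetical, and it assumes that the two 255-character limits apply to sequence names and feature names.

```python
MAX_LINE = 10000
MAX_TOKEN = 255   # assumed to apply to sequence names and feature names


def size_violations(text):
    """Return a list of (line_number, 'line' or 'name') for each limit exceeded."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if len(line) > MAX_LINE:
            problems.append((number, "line"))
        tokens = line.split()
        if not tokens or tokens[0] == "//":
            continue
        if tokens[0] in ("#=GF", "#=GC"):
            names = tokens[1:2]            # feature only
        elif tokens[0] in ("#=GS", "#=GR"):
            names = tokens[1:3]            # sequence name and feature
        elif line.startswith("#"):
            continue                       # header or other comment line
        else:
            names = tokens[:1]             # sequence name
        if any(len(token) > MAX_TOKEN for token in names):
            problems.append((number, "name"))
    return problems
```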

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 