Stockholm format - AbsoluteAstronomy.com

Stockholm format is a multiple sequence alignment format used by Pfam and Rfam to disseminate protein and RNA sequence alignments. The alignment editors Ralee and [ftp://ftp.cgb.ki.se/pub/prog/belvu Belvu] support Stockholm format, as do the probabilistic database search tools Infernal and HMMER and the phylogenetic analysis tool Xrate. A simple example of an Rfam alignment (UPSK RNA) with a pseudoknot in Stockholm format is shown below:


# STOCKHOLM 1.0
#=GF ID    UPSK
#=GF SE    Predicted; Infernal
#=GF SS    Published; PMID:9223489
#=GF RN    [1]
#=GF RM    9223489
#=GF RT    The role of the pseudoknot at the 3' end of turnip yellow mosaic
#=GF RT    virus RNA in minus-strand synthesis by the viral RNA-dependent RNA
#=GF RT    polymerase.
#=GF RA    Deiman BA, Kortlever RM, Pleij CW;
#=GF RL    J Virol 1997;71:5990-5996.

AF035635.1/619-641             UGAGUUCUCGAUCUCUAAAAUCG
M24804.1/82-104                UGAGUUCUCUAUCUCUAAAAUCG
J04373.1/6212-6234             UAAGUUCUCGAUCUUUAAAAUCG
M24803.1/1-23                  UAAGUUCUCGAUCUCUAAAAUCG
#=GC SS_cons                   .AAA....<<<<aaa....>>>>
//

Here is a slightly more complex example showing the Pfam CBS domain:


# STOCKHOLM 1.0
#=GF ID CBS
#=GF AC PF00571
#=GF DE CBS domain
#=GF AU Bateman A
#=GF CC CBS domains are small intracellular modules mostly found
#=GF CC in 2 or four copies within a protein.
#=GF SQ 5
#=GS O31698/18-71 AC O31698
#=GS O83071/192-246 AC O83071
#=GS O83071/259-312 AC O83071
#=GS O31698/88-139 AC O31698
#=GS O31698/88-139 OS Bacillus subtilis
O83071/192-246          MTCRAQLIAVPRASSLAEAIACAQKMRVSRVPVYERS
#=GR O83071/192-246 SA  9998877564535242525515252536463774777
O83071/259-312          MQHVSAPVFVFECTRLAYVQHKLRAHSRAVAIVLDEY
#=GR O83071/259-312 SS  CCCCCHHHHHHHHHHHHHEEEEEEEEEEEEEEEEEEE
O31698/18-71            MIEADKVAHVQVGNNLEHALLVLTKTGYTAIPVLDPS
#=GR O31698/18-71 SS    CCCHHHHHHHHHHHHHHHEEEEEEEEEEEEEEEEHHH
O31698/88-139           EVMLTDIPRLHINDPIMKGFGMVINN..GFVCVENDE
#=GR O31698/88-139 SS   CCCCCCCHHHHHHHHHHHHEEEEEEEEEEEEEEEEEH
#=GC SS_cons            CCCCCHHHHHHHHHHHHHEEEEEEEEEEEEEEEEEEH
O31699/88-139           EVMLTDIPRLHINDPIMKGFGMVINN..GFVCVENDE
#=GR O31699/88-139 AS   ________________*____________________
#=GR O31699/88-139 IN   ____________1____________2______0____
//

A minimal well-formed Stockholm file should contain a header stating the format and version identifier, currently '# STOCKHOLM 1.0', followed by the sequences and their corresponding unique sequence names:

<seqname> <aligned sequence>

<seqname> stands for "sequence name", typically in the form "name/start-end" or just "name". Finally, the "//" line indicates the end of the alignment. Sequence letters may include any characters except whitespace. Gaps may be indicated by "." or "-".
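The minimal structure described above can be sketched as a small reader. This is an illustrative sketch, not a reference parser: the function name `read_minimal_stockholm` is hypothetical, mark-up lines are simply skipped, and a repeated sequence name is treated as an error (an assumption of this sketch, so files that split sequences over several blocks are not handled).

```python
def read_minimal_stockholm(text):
    """Return {seqname: aligned sequence} for one Stockholm alignment."""
    # Drop blank lines and surrounding whitespace before inspecting content.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0] != "# STOCKHOLM 1.0":
        raise ValueError("missing '# STOCKHOLM 1.0' header")

    sequences = {}
    for line in lines[1:]:
        if line == "//":               # end of the alignment
            return sequences
        if line.startswith("#"):       # mark-up line, ignored in this sketch
            continue
        fields = line.split()
        if len(fields) != 2:           # expect exactly: name, aligned sequence
            raise ValueError("malformed sequence line: " + line)
        name, aligned = fields
        if name in sequences:          # assumption: names must be unique
            raise ValueError("duplicate sequence name: " + name)
        sequences[name] = aligned
    raise ValueError("missing '//' terminator")
```

Because sequence letters may not contain whitespace, splitting each sequence line on whitespace is enough to separate the name from the aligned sequence.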

The alignment mark-up

Mark-up lines may include any characters except whitespace. Use underscore ("_") instead of space.


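#=GF <feature> <Generic per-File annotation, free text>
#=GC <feature> <Generic per-Column annotation, exactly 1 char per column>
#=GS <seqname> <feature> <Generic per-Sequence annotation, free text>
#=GR <seqname> <feature> <Generic per-Residue annotation, exactly 1 char per residue>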
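As a rough illustration of how the four mark-up line types differ, the sketch below splits a mark-up line into its parts. It assumes the usual layout in which #=GF and #=GC lines carry a feature name followed by a value, while #=GS and #=GR lines carry a sequence name before the feature name. The function name `parse_markup` and the returned tuple shape are hypothetical.

```python
# Number of whitespace-separated fields between the tag and the value.
FIELDS_BEFORE_VALUE = {"#=GF": 1, "#=GC": 1, "#=GS": 2, "#=GR": 2}

def parse_markup(line):
    """Split a mark-up line into (kind, seqname, feature, value)."""
    head = line.split(None, 1)
    if not head or head[0] not in FIELDS_BEFORE_VALUE:
        raise ValueError("not a mark-up line: " + line)
    tag = head[0]
    n = FIELDS_BEFORE_VALUE[tag]
    # Split off the tag and the n leading fields; the rest is the value,
    # which may contain spaces for the free-text types (GF and GS).
    fields = line.split(None, n + 1)
    if len(fields) != n + 2:
        raise ValueError("incomplete mark-up line: " + line)
    value = fields[-1].rstrip()
    if n == 1:
        seqname, feature = None, fields[1]
    else:
        seqname, feature = fields[1], fields[2]
    return tag[2:], seqname, feature, value
```

For example, `parse_markup("#=GS O31698/88-139 OS Bacillus subtilis")` yields the kind `"GS"`, the sequence name, the feature `"OS"` and the free-text value with its internal space preserved.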

Recommended features

#=GF

(See the [ftp://ftp.sanger.ac.uk/pub/databases/Pfam/current_release/userman.txt Pfam] and the [ftp://ftp.sanger.ac.uk/pub/databases/Rfam/CURRENT/USERMAN Rfam] documentation under "Description of fields")

Pfam and Rfam may use the following tags:

Compulsory fields:

------------------

AC Accession number: Accession number in form PFxxxxx (Pfam) or RFxxxxx (Rfam).

ID Identification: One word name for family.

DE Definition: Short description of family.

AU Author: Authors of the entry.

SE Source of seed: The source suggesting the seed members belong to one family.

SS Source of structure: The source (prediction or publication) of the consensus RNA secondary structure used by Rfam.

BM Build method: Command line used to generate the model

SM Search method: Command line used to perform the search

GA Gathering method: Search threshold to build the full alignment.

TC Trusted Cutoff: Lowest sequence score (and domain score for Pfam) of match in the full alignment.

NC Noise Cutoff: Highest sequence score (and domain score for Pfam) of match not in full alignment.

TP Type: Type of family -- presently Family, Domain, Motif or Repeat for Pfam; a tree with roots Gene, Intron or Cis-reg for Rfam.

SQ Sequence: Number of sequences in alignment.

Optional fields:

----------------

DC Database Comment: Comment about database reference.

DR Database Reference: Reference to external database.

RC Reference Comment: Comment about literature reference.

RN Reference Number: Reference Number.

RM Reference Medline: Eight-digit MEDLINE UI number.

RT Reference Title: Reference Title.

RA Reference Author: Reference Author.

RL Reference Location: Journal location.

PI Previous identifier: Record of all previous ID lines.

KW Keywords: Keywords.

CC Comment: Comments.

NE Pfam accession: Indicates a nested domain.

NL Location: Location of nested domains - sequence ID, start and end of insert.

WK Wikipedia link: Wikipedia page

CL Clan: Clan accession

MB Membership: Used for listing Clan membership

For embedding trees:

----------------

NH New Hampshire: A tree in New Hampshire eXtended format.

TN Tree ID: A unique identifier for the next tree.

Notes: A tree may be stored on multiple #=GF NH lines. If multiple trees are stored in the same file, each tree must be preceded by a #=GF TN line with a unique tree identifier. If only one tree is included, the #=GF TN line may be omitted.
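The tree rules in the notes above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `collect_trees` is hypothetical, consecutive #=GF NH values are joined without a separator, and a single tree with no #=GF TN line is stored under the key `None`.

```python
def collect_trees(lines):
    """Return {tree_id: tree string} gathered from #=GF TN / NH lines."""
    trees = {}
    current_id = None   # stays None when a lone tree has no TN line
    for line in lines:
        fields = line.split(None, 2)
        if len(fields) < 3 or fields[0] != "#=GF":
            continue                    # not a per-file annotation with a value
        feature, value = fields[1], fields[2].strip()
        if feature == "TN":
            if value in trees:
                raise ValueError("tree identifier is not unique: " + value)
            current_id = value
            trees[current_id] = ""
        elif feature == "NH":
            # A tree may span several NH lines; append each piece in order.
            trees[current_id] = trees.get(current_id, "") + value
    return trees
```

Keying the result on the TN value makes the uniqueness requirement easy to enforce while the file is being read.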

#=GS

Rfam and Pfam may use these features:



      Feature                    Description
      ---------------------      -----------
      AC <accession>             ACcession number
      DE <freetext>              DEscription
      DR <db>; <accession>;      Database Reference
      OS <organism>              OrganiSm (species)
      OC <clade>                 Organism Classification (clade, etc.)
      LO <look>                  Look (Color, etc.)

#=GR



      Feature   Description            Markup letters
      -------   -----------            --------------
      SS        Secondary Structure    For RNA [.,;<>{}[]AaBb...],
                                       For protein [HGIEBTSCX]
      SA        Surface Accessibility  [0-9X]
                                       (0=0%-10%; ...; 9=90%-100%)
      TM        TransMembrane          [Mio]
      PP        Posterior Probability  [0-9*]
                                       (0=0.00-0.05; 1=0.05-0.15; *=0.95-1.00)
      LI        LIgand binding         [*]
      AS        Active Site            [*]
      pAS       AS - Pfam predicted    [*]
      sAS       AS - from SwissProt    [*]
      IN        INtron (in or after)   [0-2]
      RF        ReFerence annotation   Often the consensus RNA or protein sequence is used as a reference.
                                       Any non-gap character (e.g. x's) can indicate consensus/conserved/match columns.
                                       .'s or -'s indicate insert columns.
                                       ~'s indicate unaligned insertions.
                                       Upper and lower case can be used to discriminate strongly and weakly
                                       conserved residues respectively.
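For RNA, the SS mark-up letters encode base pairs: an opening bracket pairs with the matching closing bracket, and an uppercase letter pairs with the same letter in lowercase, which allows pseudoknots to be written alongside ordinary nested stems. The sketch below recovers the paired columns from such a string. It is an illustration only: the function name `base_pairs` is hypothetical, positions are 0-based, only the bracket types listed in the table are handled, and any other character is treated as unpaired.

```python
OPEN_TO_CLOSE = {"<": ">", "[": "]", "{": "}"}
CLOSE_TO_OPEN = {close: opener for opener, close in OPEN_TO_CLOSE.items()}

def base_pairs(ss):
    """Return sorted (i, j) column pairs from an RNA SS mark-up string."""
    stacks = {}    # one stack of open positions per opening symbol
    pairs = []
    for i, ch in enumerate(ss):
        if ch in OPEN_TO_CLOSE or ch.isupper():
            stacks.setdefault(ch, []).append(i)
        elif ch in CLOSE_TO_OPEN or ch.islower():
            # A lowercase letter closes the matching uppercase letter.
            opener = CLOSE_TO_OPEN.get(ch, ch.upper())
            if not stacks.get(opener):
                raise ValueError("unmatched %r at column %d" % (ch, i))
            pairs.append((stacks[opener].pop(), i))
    if any(stacks.values()):
        raise ValueError("unclosed opening symbol in mark-up")
    return sorted(pairs)

# Hypothetical 23-column string with a stem (<>) and a pseudoknot (Aa).
print(base_pairs(".AAA....<<<<aaa....>>>>"))
```

Keeping a separate stack per symbol type is what lets the letter pairs cross the bracket pairs, which a single shared stack would reject.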

#=GC

The same features as for #=GR with "_cons" appended, meaning "consensus". Example: "SS_cons".

Recommended placements

#=GF Above the alignment
#=GC Below the alignment
#=GS Above the alignment or just below the corresponding sequence
#=GR Just below the corresponding sequence

Size limits

There are no explicit size limits on any field. However, a simple parser that uses fixed field sizes should work safely on Pfam and Rfam alignments with these limits:

- Line length: 10000.
- <seqname>: 255.
- <feature>: 255.
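A fixed-size parser could check these limits before copying fields into its buffers. The sketch below assumes that the two 255-character limits apply to sequence names and feature names respectively. The function name `limit_violations` and the returned labels are hypothetical.

```python
MAX_LINE = 10000
MAX_SEQNAME = 255
MAX_FEATURE = 255

def limit_violations(line):
    """Return the names of the size limits that this line exceeds."""
    violations = []
    if len(line.rstrip("\n")) > MAX_LINE:
        violations.append("line")

    fields = line.split()
    seqname = feature = None
    if not fields or fields[0] == "//":
        pass                                   # blank line or terminator
    elif fields[0] in ("#=GS", "#=GR") and len(fields) >= 3:
        seqname, feature = fields[1], fields[2]
    elif fields[0] in ("#=GF", "#=GC") and len(fields) >= 2:
        feature = fields[1]
    elif not fields[0].startswith("#"):
        seqname = fields[0]                    # ordinary sequence line

    if seqname is not None and len(seqname) > MAX_SEQNAME:
        violations.append("seqname")
    if feature is not None and len(feature) > MAX_FEATURE:
        violations.append("feature")
    return violations
```

An empty list means the line fits within all three limits.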

External links

Erik Sonnhammer's definition of Stockholm format

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.