FASTA PARSABLE OUTPUT


Introduction

This document may be useful for programmers and script writers, but can be skipped by most users of the FastA program family (FastA, FastX, TFastA, TFastX, and SSearch).

The standard alignment formats of the FastA program family are difficult to parse, and so it has been hard to extract the alignment information from the output file for further processing. A new command-line parameter, -MARKx=10, saves the alignments in a format which is easily parsed. The following is a description of the parsable output file.

Records

The output file has three types of records. The header record starts with >>>. It contains information about the search as a whole: which version of the program was used, which analysis parameters were used, and so on. There is only one header record per output file.

An alignment record contains information pertaining to a pairwise alignment, such as the scores for the alignment. It starts with >>. There will be one alignment record for each alignment that was saved.

Following each alignment record are two aligned sequence records, each of which starts with >. Each of these records contains the information for one of the sequences in the alignment: the length of the sequence, the beginning and end of the alignment in that sequence's coordinates, etc.

The end of the parsable records is denoted with >>><<<.
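
Because the record markers share a common prefix, a reader has to test the longest marker first. The following Python sketch illustrates this. The function name classify_line and the label strings are illustrative only and are not part of the FastA programs.

```python
def classify_line(line: str) -> str:
    """Return the kind of record a line opens, or 'body' for any other line."""
    # Test the longest marker first: ">>><<<" also starts with ">>>", ">>", and ">".
    if line.startswith(">>><<<"):
        return "end"
    if line.startswith(">>>"):
        return "header"
    if line.startswith(">>"):
        return "alignment"
    if line.startswith(">"):
        return "sequence"
    return "body"  # parameter lines and sequence data


# A few lines taken from the example output in this document.
sample = [
    ">>>A41264, 496 aa vs @GLUT4.LIST library",
    "; mp_name: FASTA",
    ">>Pir2:A49158",
    "; fa_opt: 1915",
    ">A41264 ..",
    "; sq_len: 496",
    ">>><<<",
]
kinds = [classify_line(line) for line in sample]
print(kinds)
```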

Record Parameters

Information in each record consists of parameters and their values in a specific format. Parameters consist of a parameter tag, followed by an underscore, followed by the parameter's name. The complete format is:


; tag_name: value(s)

Parameters originating in William Pearson's FASTA package always have a two-character tag. Current FASTA tags are:

mp - main program information: name, version, statistical info, etc.
pg - program function information: function name and version, matrix used, etc.
fa - FastA results: scores, expect values, etc.
sw - Smith-Waterman results: scores, overlap values, etc.
sq - sequence information: length, type, etc.
al - alignment information: start, stop, display offset, etc.
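
A parameter line can be split into its tag, name, and value with a short regular expression. The sketch below is one possible approach in Python, not code from the FASTA package; the names PARAM_RE, parse_parameter, and is_pearson_tag are hypothetical. It assumes one parameter per line and splits the tag from the name at the first underscore, so that a name such as display_start stays whole.

```python
import re

# Two-character tags listed above, with their descriptions.
FASTA_TAGS = {
    "mp": "main program information",
    "pg": "program function information",
    "fa": "FastA results",
    "sw": "Smith-Waterman results",
    "sq": "sequence information",
    "al": "alignment information",
}

# "; tag_name: value(s)" -- the tag ends at the first underscore.
PARAM_RE = re.compile(r"^;\s*([A-Za-z0-9]+)_([^:\s]+):\s*(.*)$")


def parse_parameter(line):
    """Return (tag, name, value) for a parameter line, or None for other lines."""
    match = PARAM_RE.match(line)
    if match is None:
        return None
    tag, name, value = match.groups()
    return tag, name, value.strip()


def is_pearson_tag(tag):
    """True for the two-character tags listed in this document."""
    return tag in FASTA_TAGS


for text in ("; pg_gap-pen: -12 -2", "; al_display_start: 1", "; ebi_access: M61687"):
    tag, name, value = parse_parameter(text)
    origin = FASTA_TAGS.get(tag, "redistributor parameter")
    print(f"{tag:4} {name:14} {value:8} ({origin})")
```

Values are returned as strings; converting them to numbers is left to the caller, since some values (for example the two gap penalties in pg_gap-pen) hold more than one item.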

Redistributors of the FASTA package may create their own parameters. If they do, they must use a tag with more than two characters, for example:


; ebi_access: M61687
; gcg_ver: 9.0

GCG currently has no Wisconsin Package-specific parameters.

Interpreting Aligned Sequence Records

Most of the parameters specified by two-character tags correspond to values that are presented in other FastA output formats. A notable exception is parameters with the al tag:

al_start gives the location of the alignment start in the original sequence

al_stop gives the location of the end of the alignment in the original sequence

al_display_start gives the location of the first displayed residue in the original sequence. (This may not be the same as the first residue in the aligned region, because FastA provides some context for an alignment; even if the -SHOWall parameter is not used, FastA will try to provide about 30 residues on either side of the actual aligned region if the alignment is in the middle of one or the other sequence.)

Sequences may be padded with leading hyphens, if necessary. For example, if the beginning of the query sequence aligns with the tenth residue of the library sequence, then the query sequence will be padded with nine leading hyphens (-) to produce the alignment. The leading hyphens are a formatting convenience only; they are not considered in the numbering system for al_display_start, al_start, or al_stop.

As an example, here is a pair of aligned sequence records:


>gtm1_mouse ..
; sq_len: 217
; sq_offset: 1
; sq_type: p
; al_start: 3
; al_stop: 180
; al_display_start: 1
---PMILGYWNVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLN
EKFKLGLDFPNLPYLIDGSHKITQSNAILRYLARKHH---LDGETEEERI
RADIVENQVMDTRMQLIMLCYNPDFEKQKPEFLKTIPEKMKLYSEFLGKR
PWFAGDKVTYVDFLAYDILDQYRMFEPKCLDA------FPNLRDFLARFE
GLKKISAYMKSSRYIATPIFSKMAHWSNK
>GTX2_TOBAC ..
; sq_len: 223
; sq_type: p
; al_start: 6
; al_stop: 181
; al_display_start: 1
MAEVKLLGFW-YSPFSHRVEWALKIKGVKYE---YIEEDRDN--KSSLLL
QSNPV---YKKVPVLIHNGKPIVESMIILEYIDETFEGPSILPKDPYDRA
LARFWAKFLDDKVAAVVNTFFRKGEEQEKGK--EEVYEMLKVLDNELKDK
KFFAGDKFGFADIAANLVGFWLGVFEEGYGDVLVKSEKFPNFSKWRDEYI
NCSQVNESLPPRDELLAFFRARFQAVVASRSAPK

To properly display this alignment, the first P of gtm1_mouse must line up with the first V in GTX2_TOBAC, and the actual aligned region (the region that scores as the best local alignment) starts with the first I in gtm1_mouse (amino acid 3) and the first L (amino acid 6) in GTX2_TOBAC.
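
The relationship between sequence coordinates and display columns can be checked with a few lines of Python. This is a sketch under the assumption, taken from the description above, that hyphens never consume a sequence position and that the first non-hyphen character is the residue numbered al_display_start. The function name column_of_residue is hypothetical. The two strings are the first lines of the gtm1_mouse and GTX2_TOBAC records.

```python
def column_of_residue(aligned, display_start, residue):
    """Return the 1-based display column that holds sequence position `residue`."""
    position = display_start - 1
    for column, char in enumerate(aligned, start=1):
        if char == "-":
            continue  # padding and gap hyphens are not numbered
        position += 1
        if position == residue:
            return column
    raise ValueError("residue lies outside the displayed region")


query = "---PMILGYWNVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLN"    # gtm1_mouse
library = "MAEVKLLGFW-YSPFSHRVEWALKIKGVKYE---YIEEDRDN--KSSLLL"  # GTX2_TOBAC

# al_display_start is 1 for both records; al_start is 3 and 6 respectively.
q_col = column_of_residue(query, 1, 3)
l_col = column_of_residue(library, 1, 6)
print(q_col, query[q_col - 1])    # column and residue where the query alignment starts
print(l_col, library[l_col - 1])  # column and residue where the library alignment starts
```

Both al_start positions fall in the same display column, which is what a correctly padded pair of records should show.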

An Example

Here is a printout of a complete parsable output file containing three alignment records, followed by a printout of the first alignment as it is output by FastA when the default parameter -MARKx=3 is used.


>>>A41264, 496 aa vs @GLUT4.LIST library
; mp_name: FASTA
; mp_ver: Wisconsin Package 10.0 implementation of FASTA 3.1t12

; pg_name: FASTA
; pg_ver: 3.15 August, 1998
; pg_matrix: GenRunData:Blosum50.Cmp
; pg_gap-pen: -12 -2
; pg_ktup: 2
; pg_optcut: 25
; pg_cgap: 37
>>Pir2:A49158
; fa_initn: 1844
; fa_init1: 1201
; fa_opt: 1915
; sw_score: 1915
; sw_ident: 0.593
; sw_overlap: 496
>A41264 ..
; sq_len: 496
; sq_offset: 1
; sq_type: p
; al_start: 4
; al_stop: 493
; al_display_start: 1
-------------MADKKKITASLIYAVSVAAIGSLQFGYNTGVINAPEK
IIQAFYNRTLSQRSG----ETISPELLTSLWSLSVAIFSVGGMIGSFSVS
LFVNRFGRRNSMLLVNVLAFAGGALMALSKIAKAVEMLIIGRFIIGLFCG
LCTGFVPMYISEVSPTSLRGAFGTLNQLGIVVGILVAQIFGLEGIMGTEA
LWPLLLGFTIVPAVLQCVALLFCPESPRFLLINKMEEEKAQTVLQKLRGT
QDVSQDISEMKEESAKMSQEKKATVLELFRSPNYRQPIIISITLQLSQQL
SGINAVFYYSTGIFERAGITQPVYATIGAGVVNTVFTVVSLFLVERAGRR
TLHLVGLGGMAVCAAVMTIALALKEK--WIRYISIVATFGFVALFEIGPG
PIPWFIVAELFSQGPRPAAMAVAGCSNWTSNFLVGMLFPYAEKLCGPYVF
LIFLVFLLIFFIFTYFKVPETKGRTFEDISRGFEEQVETSSPSSPPIEKN
PMVEMNSIEPDKEVA
>A49158 ..
; sq_len: 509
; sq_type: p
; al_start: 17
; al_stop: 507
; al_display_start: 1
MPSGFQQIGSEDGEPPQQRVTGTLVLAVFSAVLGSLQFGYNIGVINAPQK
VIEQSYNETWLGRQGPEGPSSIPPGTLTTLWALSVAIFSVGGMISSFLIG
IISQWLGRKRAMLVNNVLAVLGGSLMGLANAAASYEMLILGRFLIGAYSG
LTSGLVPMYVGEIAPTHLRGALGTLNQLAIVIGILIAQVLGLESLLGTAS
LWPLLLGLTVLPALLQLVLLPFCPESPRYLYIIQNLEGPARKSLKRLTGW
ADVSGVLAELKDEKRKLERERPLSLLQLLGSRTHRQPLIIAVVLQLSQQL
SGINAVFYYSTSIFETAGVGQPAYATIGAGVVNTVFTLVSVLLVERAGRR
TLHLLGLAGMCGCAILMTVALLLLERVPAMSYVSIVAIFGFVAFFEIGPG
PIPWFIVAELFSQGPRPAAMAVAGFSNWTSNFIIGMGFQYVAEAMGPYVF
LLFAVLLLGFFIFTFLRVPETRGRTFDQISAAFHR-----TPSLLEQEVK
PSTELEYLGPDEND
>>Pir2:A32101
; fa_initn: 1822
; fa_init1: 1188
; fa_opt: 1883
; sw_score: 1883
; sw_ident: 0.589
; sw_overlap: 496
>A41264 ..
; sq_len: 496
; sq_offset: 1
; sq_type: p
; al_start: 4
; al_stop: 493
; al_display_start: 1
-------------MADKKKITASLIYAVSVAAIGSLQFGYNTGVINAPEK
IIQAFYNRTLSQRSG----ETISPELLTSLWSLSVAIFSVGGMIGSFSVS
LFVNRFGRRNSMLLVNVLAFAGGALMALSKIAKAVEMLIIGRFIIGLFCG
LCTGFVPMYISEVSPTSLRGAFGTLNQLGIVVGILVAQIFGLEGIMGTEA
LWPLLLGFTIVPAVLQCVALLFCPESPRFLLINKMEEEKAQTVLQKLRGT
QDVSQDISEMKEESAKMSQEKKATVLELFRSPNYRQPIIISITLQLSQQL
SGINAVFYYSTGIFERAGITQPVYATIGAGVVNTVFTVVSLFLVERAGRR
TLHLVGLGGMAVCAAVMTIALALKEKW--IRYISIVATFGFVALFEIGPG
PIPWFIVAELFSQGPRPAAMAVAGCSNWTSNFLVGMLFPYAEKLCGPYVF
LIFLVFLLIFFIFTYFKVPETKGRTFEDISRGFEEQVETSSPSSPPIEKN
PMVEMNSIEPDKEVA
>A32101 ..
; sq_len: 509
; sq_type: p
; al_start: 17
; al_stop: 507
; al_display_start: 1
MPSGFQQIGSEDGEPPQQRVTGTLVLAVFSAVLGSLQFGYNIGVINAPQK
VIEQSYNATWLGRQGPGGPDSIPQGTLTTLWALSVAIFSVGGMISSFLIG
IISQWLGRKRAMLANNVLAVLGGALMGLANAAASYEILILGRFLIGAYSG
LTSGLVPMYVGEIAPTHLRGALGTLNQLAIVIGILVAQVLGLESMLGTAT
LWPLLLAITVLPALLQLLLLPFCPESPRYLYIIRNLEGPARKSLKRLTGW
ADVSDALAELKDEKRKLERERPLSLLQLLGSRTHRQPLIIAVVLQLSQQL
SGINAVFYYSTSIFELAGVEQPAYATIGAGVVNTVFTLVSVLLVERAGRR
TLHLLGLAGMCGCAILMTVALLLLERVPSMSYVSIVAIFGFVAFFEIGPG
PIPWFIVAELFSQGPRPAAMAVAGFSNWTCNFIVGMGFQYVADAMGPYVF
LLFAVLLLGFFIFTFLRVPETRGRTFDQISATFRR-----TPSLLEQEVK
PSTELEYLGPDEND
>>Pir2:B30310
; fa_initn: 1796
; fa_init1: 1179
; fa_opt: 1862
; sw_score: 1862
; sw_ident: 0.585
; sw_overlap: 496
>A41264 ..
; sq_len: 496
; sq_offset: 1
; sq_type: p
; al_start: 4
; al_stop: 493
; al_display_start: 1
-------------MADKKKITASLIYAVSVAAIGSLQFGYNTGVINAPEK
IIQAFYNRTLSQRSG----ETISPELLTSLWSLSVAIFSVGGMIGSFSVS
LFVNRFGRRNSMLLVNVLAFAGGALMALSKIAKAVEMLIIGRFIIGLFCG
LCTGFVPMYISEVSPTSLRGAFGTLNQLGIVVGILVAQIFGLEGIMGTEA
LWPLLLGFTIVPAVLQCVALLFCPESPRFLLINKMEEEKAQTVLQKLRGT
QDVSQDISEMKEESAKMSQEKKATVLELFRSPNYRQPIIISITLQLSQQL
SGINAVFYYSTGIFERAGITQPVYATIGAGVVNTVFTVVSLFLVERAGRR
TLHLVGLGGMAVCAAVMTIALALKEKW--IRYISIVATFGFVALFEIGPG
PIPWFIVAELFSQGPRPAAMAVAGCSNWTSNFLVGMLFPYAEKLCGPYVF
LIFLVFLLIFFIFTYFKVPETKGRTFEDISRGFEEQVETSSPSSPPIEKN
PMVEMNSIEPDKEVA
>B30310 ..
; sq_len: 508
; sq_type: p
; al_start: 17
; al_stop: 506
; al_display_start: 1
MPSGFQQIGSDDGEPPRQRVTGTLVLAVFSAVLGSLQFGYNIGVINAPQK
VIEQSYNATWLGRQGPGGPDSIPQGTLTTLWALSVAIFSVGGMISSFLIG
IISQWLGRKRAMLANNVLAVLGGALMGLANAVASYEILILGRFLIGAYSG
LTSGLVPMYVGEIAPTHLRGALGTLNRLAIVIGILVAQVLGLESMLGTAT
LWPLLLALTVLPALLQLILLPFCPESPRYLYIIRNLEGPARKSLKPLTGW
ADVSDALAELKDEKRKLERERPMSLLQLLGSRTHRQPLIIAVVLQLSQQL
SGINAVFYYSTSIFESAGVGQPAYATIGAGVVNTVFTLVSVLLVERAGRR
TLHLLGLAGMCGCAILMTVALLLLERVPAMSYVSIVAIFGFVAFFEIGPG
PIPWF-VAELFSQGPRPAAMAVAGFSNWTCNFIVGMGFQYVADRMGPYVF
LLFAVLLLGFFIFTFLKVPETRGRTFDQISAAFRR-----TPSLLEQEVK
PSTELEYLGPDEND

>>><<<

------------------------------------------------------------------------------

SCORES Init1: 1201 Initn: 1844 Opt: 1915
Smith-Waterman score: 1915; 59.3% identity in 496 aa overlap

                            10        20        30        40
A41264              MADKKKITASLIYAVSVAAIGSLQFGYNTGVINAPEKIIQAFYNRTL
                       ::::|::|: ||  |::|||||||| ||||||:|:|:  ||:|
A49158 MPSGFQQIGSEDGEPPQQRVTGTLVLAVFSAVLGSLQFGYNIGVINAPQKVIEQSYNETW
               10        20        30        40        50        60

        50            60        70        80        90       100
A41264 SQRSG----ETISPELLTSLWSLSVAIFSVGGMIGSFSVSLFVNRFGRRNSMLLVNVLAF
         |:|     :| |  ||:||:||||||||||||:|| :::: : :||: :||: ||||
A49158 LGRQGPEGPSSIPPGTLTTLWALSVAIFSVGGMISSFLIGIISQWLGRKRAMLVNNVLAV
               70        80        90       100       110       120

           110       120       130       140       150       160
A41264 AGGALMALSKIAKAVEMLIIGRFIIGLFCGLCTGFVPMYISEVSPTSLRGAFGTLNQLGI
        ||:||:|:: | : ||||:|||:|| : || :|:||||::|::|| ||||:||||||:|
A49158 LGGSLMGLANAAASYEMLILGRFLIGAYSGLTSGLVPMYVGEIAPTHLRGALGTLNQLAI
              130       140       150       160       170       180

           170       180       190       200       210       220
A41264 VVGILVAQIFGLEGIMGTEALWPLLLGFTIVPAVLQCVALLFCPESPRFLLINKMEEEKA
       |:|||:||::|||:::|| :|||||||:|::||:|| | | |||||||:| | :  |  |
A49158 VIGILIAQVLGLESLLGTASLWPLLLGLTVLPALLQLVLLPFCPESPRYLYIIQNLEGPA
              190       200       210       220       230       240

           230       240       250       260       270       280
A41264 QTVLQKLRGTQDVSQDISEMKEESAKMSQEKKATVLELFRSPNYRQPIIISITLQLSQQL
       :  |::| |  |||  ::|:|:|: |: :|:  ::|:|: | ::|||:||:::|||||||
A49158 RKSLKRLTGWADVSGVLAELKDEKRKLERERPLSLLQLLGSRTHRQPLIIAVVLQLSQQL
              250       260       270       280       290       300

           290       300       310       320       330       340
A41264 SGINAVFYYSTGIFERAGITQPVYATIGAGVVNTVFTVVSLFLVERAGRRTLHLVGLGGM
       |||||||||||:||| ||: ||:||||||||||||||:||::||||||||||||:||:||
A49158 SGINAVFYYSTSIFETAGVGQPAYATIGAGVVNTVFTLVSVLLVERAGRRTLHLLGLAGM
              310       320       330       340       350       360

           350         360       370       380       390       400
A41264 AVCAAVMTIALALKEK--WIRYISIVATFGFVALFEIGPGPIPWFIVAELFSQGPRPAAM
         || :||:|| | |:   : |:|||| |||||:||||||||||||||||||||||||||
A49158 CGCAILMTVALLLLERVPAMSYVSIVAIFGFVAFFEIGPGPIPWFIVAELFSQGPRPAAM
              370       380       390       400       410       420

             410       420       430       440       450       460
A41264 AVAGCSNWTSNFLVGMLFPYAEKLCGPYVFLIFLVFLLIFFIFTYFKVPETKGRTFEDIS
       |||| |||||||::|| | |: :  ||||||:| |:|| |||||:::||||:||||::||
A49158 AVAGFSNWTSNFIIGMGFQYVAEAMGPYVFLLFAVLLLGFFIFTFLRVPETRGRTFDQIS
              430       440       450       460       470       480

             470       480       490
A41264 RGFEEQVETSSPSSPPIEKNPMVEMNSIEPDKEVA
        :|::     :||    | :| :|:: : ||::
A49158 AAFHR-----TPSLLEQEVKPSTELEYLGPDEND
                   490       500
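
A script that reads the parsable output shown earlier in this example might look like the following Python sketch. It is an illustration, not a tool supplied with the Wisconsin Package: the function name parse_markx10 and the dictionary layout are hypothetical, parameter values are kept as strings, and the trailing ".." after a sequence name is assumed to be a marker that can be dropped. The sample input is abbreviated from the example, with the sequences truncated for brevity.

```python
def parse_markx10(text):
    """Parse -MARKx=10 style output into nested dictionaries (illustrative layout)."""
    result = {"title": None, "params": {}, "alignments": []}
    target = None    # dictionary that receives the next "; tag_name: value" lines
    sequence = None  # aligned sequence record that receives residue lines
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">>><<<"):      # end of the parsable records
            break
        if line.startswith(">>>"):         # header record
            result["title"] = line[3:].strip()
            target, sequence = result["params"], None
        elif line.startswith(">>"):        # alignment record
            alignment = {"name": line[2:].strip(), "params": {}, "sequences": []}
            result["alignments"].append(alignment)
            target, sequence = alignment["params"], None
        elif line.startswith(">"):         # aligned sequence record
            if not result["alignments"]:
                raise ValueError("sequence record found before any alignment record")
            name = line[1:].split("..")[0].strip()
            sequence = {"name": name, "params": {}, "aligned": ""}
            result["alignments"][-1]["sequences"].append(sequence)
            target = sequence["params"]
        elif line.startswith(";"):         # parameter line
            key, _, value = line[1:].partition(":")
            if target is not None:
                target[key.strip()] = value.strip()
        elif sequence is not None:         # residues, possibly with hyphens
            sequence["aligned"] += line
    return result


# Abbreviated sample; the sequence lines are truncated for illustration.
sample = """>>>A41264, 496 aa vs @GLUT4.LIST library
; mp_name: FASTA
; pg_ktup: 2
>>Pir2:A49158
; fa_opt: 1915
; sw_ident: 0.593
>A41264 ..
; sq_len: 496
; al_start: 4
-------------MADKKK
ITASLI
>A49158 ..
; sq_len: 509
; al_start: 17
MPSGFQQIGSEDGEPPQQR
VTGTLV
>>><<<
"""

parsed = parse_markx10(sample)
first = parsed["alignments"][0]
print(parsed["title"])
print(first["name"], first["params"]["sw_ident"])
for record in first["sequences"]:
    print(record["name"], record["params"]["al_start"], record["aligned"])
```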

Printed: December 9, 1998 16:24 (1162)



Documentation Comments: doc-comments@gcg.com
Technical Support: help@gcg.com

Copyright (c) 1982-2001 Genetics Computer Group, Inc. A subsidiary of Pharmacopeia, Inc. All rights reserved.

Licenses and Trademarks: Wisconsin Package is a trademark of Genetics Computer Group, Inc. GCG and the GCG logo are registered trademarks of Genetics Computer Group, Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.

Genetics Computer Group

www.gcg.com