Uppsala Software Factory

Uppsala Software Factory - SEQMAN Manual

1 SEQMAN - GENERAL INFORMATION
2 REFERENCES
3 VERSION HISTORY
4 START-UP MACRO
5 SEQUENCE LENGTH AND NUMBER OF SEQUENCES
6 INTRODUCTION
7 FORMATS

7.1 PIR

7.2 FASTA

7.3 SWISSPROT

7.4 ONO datablock

7.5 ALSCRIPT

7.6 CLUSTAL W

7.7 SEQRES records PDB
8 STARTUP
9 GENERAL COMMANDS

9.1 ?

9.2 !

9.3 #

9.4 @

9.5 &

9.6 QUit

9.7 ZP_restart

9.8 ECho

9.9 $
10 PROGRAM-SPECIFIC COMMANDS

10.1 SRead

10.2 MRead

10.3 ARead

10.4 SWrite

10.5 MWrite

10.6 AWrite

10.7 FOrmats

10.8 RAndom

10.9 SEed

10.10 TYpe

10.11 LIst

10.12 CLean

10.13 DElete

10.14 TRanslate

10.15 CHou_fasman

10.16 UNitary_matrix

10.17 SUbstitution_matrix file

10.18 DUmp_matrix

10.19 MAtrix

10.20 DOt_plot

10.21 NWunschs

10.22 SMithwat
11 EXAMPLE
12 KNOWN BUGS

1 SEQMAN - GENERAL INFORMATION

Program : SEQMAN
Version : 070222
Author : Gerard J. Kleywegt, Dept. of Cell and Molecular Biology, Uppsala University, Biomedical Centre, Box 596, SE-751 24 Uppsala, SWEDEN
E-mail : gerard@xray.bmc.uu.se
Purpose : sequence manipulation and alignment
Package : SBIN

2 REFERENCES

Reference(s) for this program:

* 1 * G.J. Kleywegt (1992-2005). Uppsala University, Uppsala, Sweden. Unpublished program.

* 2 * Kleywegt, G.J., Zou, J.Y., Kjeldgaard, M. & Jones, T.A. (2001). Around O. In: "International Tables for Crystallography, Vol. F. Crystallography of Biological Macromolecules" (Rossmann, M.G. & Arnold, E., Editors). Chapter 17.1, pp. 353-356, 366-367. Dordrecht: Kluwer Academic Publishers, The Netherlands.

3 VERSION HISTORY

0.1.0 @ 990318 - first version
0.2.0 @ 990406 - continued implementing READ formats
0.3.0 @ 990407 - continued implementing READ formats
0.4.0 @ 000505 - continued (at last); SUbstitution_matrix DOt_plot and NWunsch commands
0.4.1 @ 001110 - some bugs removed
0.4.2 @ 010326 - NWunsch: include any trailing N-terminal residues in the alignment
0.5.0 @ 050324 - continued (use as tool for making exam questions); proper Needleman-Wunsch-Sellers implementation (with affine gap penalty and optional overhangs); added Smith-Waterman as well; UNit_matrix option
1.0 @ 050329 - first production version
1.1 @ 050330 - written manual; many small changes; allocate memory for dynamic-programming matrices ... dynamically; new LIst, CLean, DElete, REname commands; better dot-plots
1.2 @ 060116 - write dotplots to terminal as well if both sequences contain no more than 30 residues
1.3 @ 060130 - new SEed and RAndom commands to generate random protein and DNA sequences; new TRanslate command to translate DNA into protein sequence; new TYpe command to show sequences in memory; new MAtrix command to set individual substitution matrix elements; new DUmp_matrix command to show the current substitution matrix
1.3.1 @ 060203 - optional parameter to NW command to enforce 'classic' Needleman-Wunsch algorithm; the pointer matrix now also shows elements where multiple paths lead to the same score
1.3.2 @ 060206 - minor bug fix
1.4 @ 070202 - implemented simplistic CHou_fasman command
1.4.1 @ 070222 - show border residues in CHou_fasman command in lowercase; NWunsch now prints if the alignment is unique or not

4 START-UP MACRO

SEQMAN can execute a macro at start-up (whether it is run interactively or in batch mode). This can be used to execute commands which you (almost) always want to have executed (e.g., to read your favourite substitution matrix). To use this feature, set the environment variable GKSEQMAN to point to a SEQMAN macro file, e.g.:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 setenv GKSEQMAN /home/gerard/seqman.init
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

5 SEQUENCE LENGTH AND NUMBER OF SEQUENCES

SEQMAN allocates memory for data sets dynamically. This means that you can increase the size and number of data sets that the program can handle on the fly:

1 - through the environment variables SEQLENG and NUMSEQS (must be in capital letters !), for example put the following in your .cshrc file:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 setenv SEQLENG 10000
 setenv NUMSEQS 1000
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

2 - by using command-line arguments SEQLENG and NUMSEQS (need not be in capitals), for example:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 run seqman seqleng 5000 numseqs 200
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

Note that command-line arguments take precedence over environment variables. So you can set the environment variables in your .cshrc file to "typical" values, and if you have to deal with a data set which is bigger than that, you can use the command-line argument(s).

You can also use the ZP_restart command from within the program itself to increase memory allocation. WARNING : all memory is reset, so any previously read sequence data will be lost !!!

If sufficient memory cannot be allocated, the program will print a message and quit. In that case, increase the amount of virtual memory or reduce the size requirements. (More virtual memory will not help, of course, if you try to allocate more memory than your machine can address; for 32-bit machines that limit is roughly 2^32 bytes, i.e. 4 GB.)

SEQMAN needs 2 * (SEQLENG + 1) ^ 2 words for its major arrays (two dynamic programming matrices).
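
To get a feeling for what this formula means in practice, here is a minimal Python sketch (not part of SEQMAN). The function name is made up for illustration, and the word size of 4 bytes is an assumption; the manual does not state it.

```python
# Rough memory estimate for the two dynamic-programming matrices.
# BYTES_PER_WORD is an assumption (4-byte words); SEQMAN's actual
# word size is not documented here.
BYTES_PER_WORD = 4


def dp_matrix_words(seqleng: int) -> int:
    """Words needed for the major arrays: 2 * (SEQLENG + 1) ^ 2."""
    return 2 * (seqleng + 1) ** 2


if __name__ == "__main__":
    for seqleng in (1000, 5000, 10000):
        words = dp_matrix_words(seqleng)
        megabytes = words * BYTES_PER_WORD / 1e6
        print(f"SEQLENG = {seqleng:6d} : {words:12d} words, about {megabytes:.0f} MB")
```

The requirement grows quadratically with SEQLENG, so doubling the maximum sequence length roughly quadruples the memory needed.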

6 INTRODUCTION

When I wrote the first bits of code for this program, it was intended to become the sequence-equivalent of MOLEMAN2 and LSQMAN and to become part of a set of programs (called SEQSYST) that do "stuff" with sequences. However, as most of the things I wanted to do with sequences (and much more) have since been implemented in our Indonesia program by Dennis Madsen, Patrik Johansson, Susan Arent, and others, there is no longer a need for the SEQSYST suite. Recently, however, I decided to dust off the SEQMAN program a bit so that I could use it for teaching sequence-alignment algorithms in bioinformatics courses (and, in particular, to generate dynamic-programming matrices that can be used as exam questions :-).

The program can now read and write sequences in a variety of formats, produce simple dot-plots, and do Needleman-Wunsch-Sellers (global) and Smith-Waterman (local) alignment. All the substitution matrices of the SBIN package can be read, or simple unitary matrices can be defined (e.g., for DNA sequences in exam questions). The program only knows the symbols of the 20 standard amino acids. This means that DNA sequences can be handled easily (A, C, G and T are all valid amino-acid symbols, so use a unitary substitution matrix), but not RNA (U is not an amino-acid symbol). To align RNA sequences, simply replace all the Us by something else (e.g., Ts), read them in, align them, and replace the Ts by Us in the output.
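
The RNA workaround can be scripted outside SEQMAN. The following is a minimal Python sketch; the function names are made up for illustration, and it assumes the RNA input contains no Ts of its own (otherwise the back-conversion would turn those into Us as well).

```python
# Convert RNA to a DNA-like alphabet before feeding it to SEQMAN,
# and convert aligned output back afterwards. Gap characters and
# all other symbols are passed through unchanged.


def rna_to_dna(seq: str) -> str:
    """Replace every U by T so that only known symbols remain."""
    return seq.upper().replace("U", "T")


def dna_to_rna(seq: str) -> str:
    """Undo the replacement on the (possibly gapped) output."""
    return seq.upper().replace("T", "U")


if __name__ == "__main__":
    rna = "AUGGCUUAA"
    dna = rna_to_dna(rna)
    print(dna)
    print(dna_to_rna(dna))
```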

7 FORMATS

Use the FOrmats command to get a list of formats supported by the version of the program you are using. For instance, for version 1.0 of SEQMAN the following formats are supported:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 SEQMAN FORMATS :  INPUT MODES :  OUTPUT MODES :
 ----------------  -------------  --------------
 PIR                 S  M  A        S  M  A
 FASta               S  M  A        S  M  A
 SWIssprot           S  M  A        S  M  A
 ONO datablock       S  M  -        S  M  A
 ALScript            -  -  A        -  -  A
 CLUstal W           -  -  A        -  -  A
 SEQres records PDB  S  M  -        -  -  -
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

The three modes S, M and A stand for "Single sequence in a file", "Multiple sequences in a file", and "Alignment in a file". The difference between M and A is that A-type files may contain gap characters. Each of the formats is discussed below.

Note: you only need to type the first three letters of any format type with the various read and write commands in SEQMAN.

7.1 PIR

A single sequence looks like this:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
>P1;PIRSEQ
No title
NQKCSGNPRR YNGKSCASTT NYHDSHKGAC GCGPASGDAQ FGWNAGSFVA AASQMYFDSG
NKGWCGQHCG QCIKLTTTGG YVPGQGGPVR EGLSKTFMIT NLCPNIYPNQ DWCNQGSQYG
GHNKYGYELH LDLENGRSQV TGMGWNNPET TWEVVNCDSE HNHDHRTPSN SMYGQCQCAH
*
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

In other words:
- an identifier line beginning with ">"
- a title line (free text)
- one or more lines containing the sequence (spaces are ignored)
- a termination line containing "*"

7.2 FASTA

A single sequence looks like this:

 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
>AOFA_HUMAN
MENQEKASIA GHMFDVVVIG GGISGLSAAK LLTEYGVSVL VLEARDRVGG RTYTIRNEHV
DYVDVGGAYV GPTQNRILRL SKELGIETYK VNVSERLVQY VKGKTYPFRG AFPPVWNPIA
YLDYNNLWRT IDNMGKEIPT DAPWEAQHAD KWDKMTMKEL IDKICWTKTA RRFAYLFVNI
NVTSEPHEVS ALWFLWYVKQ CGGTTRIFSV TNGGQERKFV GGSGQVSERI MDLLGDQVKL
NHPVTHVDQS SDNIIIETLN HEHYECKYVI NAIPPTLTAK IHFRPELPAE RNQLIQRLPM
GAVIKCMMYY KEAFWKKKDY CGCMIIEDED APISITLDDT KPDGSLPAIM GFILARKADR
LAKLHKEIRK KKICELYAKV LGSQEALHPV HYEEKNWCEE QYSGGCYTAY FPPGIMTQYG
RVIRQPVGRI FFAGTETATK WSGYMEGAVE AGERAAREVL NGLGKVTEKD IWVQEPESKD
VPAVEITHTF WERNLPSVSG LLKIIGFSTS VTALGFVLYK YKLLPRS

 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

In other words:
- an identifier line beginning with ">"
- one or more lines containing the sequence (spaces are ignored)
- one or more empty lines to signal the end of the sequence
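
The three rules above are simple enough to sketch in a few lines of Python. This is only an illustration of the format as described here, not SEQMAN's own reader; the function name `read_fasta` is made up, and treating end-of-input as an end-of-sequence marker is an assumption.

```python
# Minimal reader for the FASTA layout described above:
# ">" identifier line, sequence lines (spaces ignored),
# empty line(s) or a new ">" line ending the sequence.


def read_fasta(lines):
    """Return a list of (identifier, sequence) tuples."""
    records = []
    name, chunks = None, []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            # A new identifier also closes any sequence still open.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif not line.strip():
            # An empty line signals the end of the current sequence.
            if name is not None:
                records.append((name, "".join(chunks)))
                name, chunks = None, []
        elif name is not None:
            chunks.append(line.replace(" ", ""))
    if name is not None:  # assumption: end of input also ends a sequence
        records.append((name, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = [">AOFA_HUMAN", "MENQEKASIA GHMFDVVVIG", "GGISGLSAAK", ""]
    for ident, seq in read_fasta(example):
        print(ident, len(seq), seq)
```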

7.3 SWISSPROT

A single sequence looks like this:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
SQ   SEQUENCE   75 AA;  8523 MW;  AFF911AB CRC32;
     MEKKSIAGLC FLFLVLFVAQ EVVVQSEAKT CENLVDTYRG PCFTTGSCDD HCKNKEHLLS
     GRCRDDVRCW CTRNC
//
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

In other words:
- an identifier line beginning with "SQ "
- one or more lines containing the sequence (spaces are ignored)
- one or more empty lines or a line beginning with "// " to signal the end of the sequence

7.4 ONO datablock

A single sequence looks like this:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
LEFTH_RESIDUE_TYPE        C          4 (1x,5a)
 ALA   ALA   ALA   ALA
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

In other words:
- an identifier line beginning with "XXX_RESIDUE_TYPE C NNN (fffff)" where "XXX" is a name (max. 6 characters), "NNN" is the number of residues, and "(fffff)" is the Fortran format in which the residue types are to be read
- one or more lines containing the sequence in three-letter code and formatted according to the identifier line

7.5 ALSCRIPT

This is the type of file required as input to ALSCRIPT, e.g.:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
   
Conversion of CLUSTAL NBRF-PIR file to AMPS BLOCKFILE format
clus2blc:  Geoffrey J. Barton (1992)
   
>P1;albp mature rbpjr
>P1;rbp rbp
>P1;gbp gbp
>P1;abp abp
* iteration 1
AK-N
ADTL
ETRK
YIIL
AAGG
[...]
--EL
--FG
--TG
---K
*
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

7.6 CLUSTAL W

This is the type of multiple-sequence alignment file that is produced by CLUSTAL W, e.g.:

 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
CLUSTAL W(1.60) Created by SEQMAN V. 990407/0.3 at Wed Apr 7 22:59:04 1999 for gerard

SEQ0001   AAEYAVVLKTLSNPFWVDMKKGIEDEAKTLGVSVDIFASPSEGDFQSQLQLFEDLSNKNY
SEQ0002   KDTIALVVSTLNNPFFVSLKDGAQKEADKLGYNLVVLD--SQNNPAKELANVQDLTVRGT
SEQ0003   -TRIGVTIYKYDDNFMSVVRKAIEKDGKSAPDVQLLMND-SQNDQSKQNDQIDVLLAKGV
SEQ0004   NLKLGFLVKQPEEPWFQTEWKFADKAGKDLGFEVIKIA---VPDGEKTLNAIDSLAASGA

SEQ0001   KGIAFAPLSSVNLVMPVARAWKKGIYLVNLDEKIDMDNLKKAGGNVEAFVTTDNVAVGAK
SEQ0002   KILLINPTDSDAVGNAVKMANQANIPVITLDR-------QATKGEVVSHIASDNVLGGKI
SEQ0003   KALAINLVDPAAAGTVIEKARGQNVPVVFFNKEPSRK--ALDSYDKAYYVGTDSKESGVI
SEQ0004   KGFVICTPDPKLGSAIVAKARGYDMKVIAVDDQFVNA--KGKPMDTVPLVMMAATKIGER

[...]

SEQ0001   KLVDSILVTQ----------
SEQ0002   KLV----VKQ----------
SEQ0003   KIVRVPYVGVDKDNLSEFT-
SEQ0004   VLITRDNFKEELEKKGLGGK

 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

7.7 SEQRES records PDB

These are the SEQRES records as found in PDB files, e.g.:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
SEQRES   1 B   61  THR MET CYS TYR SER HIS THR THR THR SER ARG ALA ILE  1FSS 153
SEQRES   2 B   61  LEU THR ASN CYS GLY GLU ASN SER CYS TYR ARG LYS SER  1FSS 154
SEQRES   3 B   61  ARG ARG HIS PRO PRO LYS MET VAL LEU GLY ARG GLY CYS  1FSS 155
SEQRES   4 B   61  GLY CYS PRO PRO GLY ASP ASP ASN LEU GLU VAL LYS CYS  1FSS 156
SEQRES   5 B   61  CYS THR SER PRO ASP LYS CYS ASN TYR                  1FSS 157
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

Note that you can feed SEQMAN a complete PDB file and it will extract any and all sequences it encounters on the SEQRES cards.
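
For readers who want to do the same extraction outside SEQMAN, here is a minimal Python sketch. It is not SEQMAN's implementation: the function name is made up, the column positions follow the fixed-column layout of the SEQRES records shown above (chain identifier in column 12, residue names from column 20 to 70), and mapping unknown residue names to "X" is an assumption.

```python
# Collect SEQRES residues per chain from the lines of a PDB file
# and convert them to one-letter code.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLU": "E", "GLN": "Q", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def seqres_sequences(pdb_lines):
    """Return {chain_id: one-letter sequence} for all SEQRES records."""
    chains = {}
    for line in pdb_lines:
        if not line.startswith("SEQRES"):
            continue  # skip every other record type
        chain_id = line[11]            # column 12
        residues = line[19:70].split()  # columns 20-70, up to 13 names
        chains.setdefault(chain_id, []).extend(residues)
    return {
        chain: "".join(THREE_TO_ONE.get(res, "X") for res in names)
        for chain, names in chains.items()
    }


if __name__ == "__main__":
    demo = [
        "SEQRES   1 B   61  THR MET CYS TYR SER HIS THR THR THR SER ARG ALA ILE  1FSS 153",
        "SEQRES   5 B   61  CYS THR SER PRO ASP LYS CYS ASN TYR",
    ]
    print(seqres_sequences(demo))
```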

8 STARTUP

When you start SEQMAN, it welcomes you with a list of available commands and options. It also loads the BLOSUM45 substitution matrix (stored internally in the program).

 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 *** SEQMAN *** SEQMAN *** SEQMAN *** SEQMAN *** SEQMAN *** SEQMAN ***

 Version  - 060116/1.2
 (c) 1992-2005 Gerard J. Kleywegt, Dept. Cell Mol. Biol., Uppsala (SE)
 User I/O - routines courtesy of Rolf Boelens, Univ. of Utrecht (NL)
 Others   - T.A. Jones, G. Bricogne, Rams, W.A. Hendrickson
 Others   - W. Kabsch, CCP4, PROTEIN, E. Dodson, etc. etc.

 Started  - Mon Jan 30 18:14:20 2006
 User     - gerard
 Mode     - interactive
 Host     - sarek (Irix/SGI)
 ProcID   - 11726
 Tty      - /dev/ttyq16

 *** SEQMAN *** SEQMAN *** SEQMAN *** SEQMAN *** SEQMAN *** SEQMAN ***

 Reference(s) for this program:

 * 1 * G.J. Kleywegt (1992-2005). Uppsala University, Uppsala, Sweden.
       Unpublished program.

 * 2 * Kleywegt, G.J., Zou, J.Y., Kjeldgaard, M. & Jones, T.A. (2001).
       Around O. In: "International Tables for Crystallography, Vol. F.
       Crystallography of Biological Macromolecules" (Rossmann, M.G.
       & Arnold, E., Editors). Chapter 17.1, pp. 353-356, 366-367.
       Dordrecht: Kluwer Academic Publishers, The Netherlands.

 ==> For manuals and up-to-date references, visit:
 ==> http://xray.bmc.uu.se/usf
 ==> For reprints, visit:
 ==> http://xray.bmc.uu.se/gerard
 ==> For downloading up-to-date versions, visit:
 ==> ftp://xray.bmc.uu.se/pub/gerard

 *** SEQMAN *** SEQMAN *** SEQMAN *** SEQMAN *** SEQMAN *** SEQMAN ***

 Allocate sequences of length : ( 1000)
 Maximum allowed value : ( 10000)
 Max number of sequences : ( 100)
 Maximum allowed value : ( 10000)
 Max nr of sequences : ( 100)
 Max nr of residues per sequence : ( 1000)
 Random-number seed : ( 123456)
 => Random number generator initialised with seed : 123456

 *** BLOSUM-45 substitution matrix loaded ***
 Amino-acid frequencies : ( 0.081 0.044 0.046 0.058 0.019 0.058 0.037
   0.080 0.022 0.053 0.081 0.059 0.020 0.040 0.047 0.068 0.063 0.016
   0.038 0.071)
 Symbol START_TIME : (Mon Jan 30 18:14:20 2006)
 Symbol USERNAME : (gerard)

 SEQMAN options :

 ? (list options)
 ! (comment)
 QUit
 $ shell_command
 & symbol value
 & ? (list symbols)
 @ macro_file
 ZP_restart seqsize numseqs
 ECho on_off
 # parameter(s) (command history)

 SRead seq file format
 SWrite seq file format
 MRead seq_base file format
 MWrite seq file format
 ARead seq_base file format
 AWrite seq file format
 RAndom seq length [type] [freq]
 SEed iseed
 LIst seq
 CLean seq
 DElete seq
 REname seq new_name
 FOrmats
 SUbstitution_matrix file
 UNitary_matrix diagonal off_diagonal
 DOt_plot seq1 seq2 file [window_length] [min_count]
 NWunschs seq1 seq2 gap_open gap_ext overhang_penalty
 SMithwat seq1 seq2 gap_open gap_ext

 Max nr of sequences : ( 100)
 Max nr of residues per sequence : ( 1000)

 SEQMAN >
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

9 GENERAL COMMANDS

9.1 ?

gives a list of the options and the current dimensioning of the program

9.2 !

does nothing (use this for comments in input scripts)

9.3 #

Command history. Possible uses (blank spaces are optional):
- # ? => list history of commands
- # ! => ditto, but without numbers (handy for copying into macros)
- # ON => switch command history on
- # OFf => switch command history off
- # # => repeat previous command
- # 14 => repeat command number 14 from the list
- # 0 => repeat previous command
- # -1 => repeat penultimate command, etc.
- # 7 more => repeat command number 7, but add "more" to it (e.g., if command 7 was "$ ls" you could type "#7 -FartCos" to get "$ ls -FartCos")

9.4 @

execute a macro

9.5 &

This command can be used to manipulate symbols. These are probably only useful for advanced users who want to write fancier macros. The command can be used in three ways:
(1) & ? -> lists currently defined symbols
(2) & symbol value -> sets "SYMBOL" to "value"
(3) & symbol -> prompts the user to supply a value for "SYMBOL" (even if the program is executing a macro)

A few symbols are predefined:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 SEQMAN > & ?
 Nr of defined symbols : (       2)
 Symbol START_TIME : (Wed Mar 30 18:44:55 2005)
 Symbol USERNAME : (gerard)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

The symbol mechanism is fairly simplistic and has some limitations:
- max length of a symbol name is 20 characters
- max length of a symbol value is 256 characters
- max number of symbols is 100
- symbols can not be deleted, but they can be redefined
- symbol values are accessed by supplying $SYMBOL_NAME as an argument on the command line; the line that you type on the terminal (or in a macro) is parsed once; if there are additional parameters which the program prompts you for, you cannot use symbols for those
- only one substitution per argument (e.g., "$file1 $file2" will lead to a substitution of the entire argument by the value of symbol FILE1 only !)
- command names (first argument on any command line) cannot be replaced by a symbol (e.g.: "$command $arg1 $arg2" is not valid)
- symbols may be equated to each other, e.g. "& file2 $file1" will give FILE2 the same value as FILE1
- symbol substitution is not recursive (e.g., if you set the value of FILE2 to be "$file1", any reference to $FILE2 will be replaced by "$file1", not by the value of FILE1)
- symbols on comment lines (starting with "!") are not expanded
- symbols on system command lines (starting with "$") are not expanded

9.6 QUit

stop working with SEQMAN

9.7 ZP_restart

if you have started the program with too little memory allocated, you can restart it with this command. Provide new values for SEQLENG and NUMSEQS. (The mnemonic "ZP" may be counter-intuitive, but the Z and P keys are far apart on a QWERTY keyboard so the chances of accidentally typing this command are reduced.)

WARNING : all memory is reset, so any previously read sequence data will be lost !!!

9.8 ECho

if you run the program with scripts, it is sometimes useful to see input commands echoed. The parameter to the ECho command may be ON, OFf, or ? (to list the echo status).

9.9 $

execute a shell command (does not necessarily work on all machines !)

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 SEQMAN > $ xterm &
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

10 PROGRAM-SPECIFIC COMMANDS

10.1 SRead

Read a single sequence from a file. Provide a name under which the sequence will be stored. The format should be one of the options listed by the FOrmat command (the first three characters suffice).

 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 SEQMAN > sread xxx sequence.pir pir
 Opening file : (sequence.pir)
 Reading a single sequence
 File format : (PIR)

 Sequence : (XXX)
 Residues : (        180)
 Comment  : (>P1;PIRSEQ ; No title)
 NQKCSGNPRR YNGKSCASTT NYHDSHKGAC GCGPASGDAQ FGWNAGSFVA AASQMYFDSG
 NKGWCGQHCG QCIKLTTTGG YVPGQGGPVR EGLSKTFMIT NLCPNIYPNQ DWCNQGSQYG
 GHNKYGYELH LDLENGRSQV TGMGWNNPET TWEVVNCDSE HNHDHRTPSN SMYGQCQCAH
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

10.2 MRead

Read multiple sequences from a file. Provide a basename from which the names of the individual sequences will be derived by adding "0001", "0002", etc. to them. The format should be one of the options listed by the FOrmat command (the first three characters suffice).

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 SEQMAN > mread zzz q.fas fasta
 Opening file : (q.fas)
 Reading multiple sequences
 File format : (FASTA)
   
 Sequence : (ZZZ0001)
 Residues : (          5)
 Comment  : (>ALPHA_RESIDUE_TYPE        C          5 (1X,5A))
 SVVSQ
   
 Sequence : (ZZZ0002)
 Residues : (          5)
 Comment  : (>BETA_RESIDUE_TYPE         C          5 (1X,5A))
 EMVYG
   
 [...]
   
 Sequence : (ZZZ0017)
 Residues : (        180)
 Comment  : (>M14_RESIDUE_TYPE          C        382 (1X,5A))
 NQKCSGNPRR YNGKSCASTT NYHDSHKGAC GCGPASGDAQ FGWNAGSFVA AASQMYFDSG
 NKGWCGQHCG QCIKLTTTGG YVPGQGGPVR EGLSKTFMIT NLCPNIYPNQ DWCNQGSQYG
 GHNKYGYELH LDLENGRSQV TGMGWNNPET TWEVVNCDSE HNHDHRTPSN SMYGQCQCAH
   
 Nr of sequences read : (         17)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

10.3 ARead

Read multiple aligned sequences from a file. Provide a basename from which the names of the individual sequences will be derived by adding "0001", "0002", etc. to them. After the sequences have been read and listed, the number and percentage of sequence identities (according to the alignment read from the file !) are listed. The format should be one of the options listed by the FOrmat command (the first three characters suffice).

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 SEQMAN > aread aaa mali.alsc alscript
 Opening file : (mali.alsc)
 Reading multiple aligned sequences
 File format : (ALSCRIPT)
   
 Sequence : (AAA0001)
 Residues : (        320)
 Comment  : (>P1;albp mature rbpjr)
 AAEYAVVLKT LSNPFWVDMK KGIEDEAKTL GVSVDIFASP SEGDFQSQLQ LFEDLSNKNY
 KGIAFAPLSS VNLVMPVARA WKKGIYLVNL DEKIDMDNLK KAGGNVEAFV TTDNVAVGAK
 GASFIIDKLG A--------E G---GEVAII EGKA-GNASG EARRNGATEA FKK-ASQIKL
 VASQPADWDR IK-ALDVATN VLQRNP---N IKAIYCANDT MAMGVAQAVA NAGKTG----
 KVLVVGTDGI PEARKMVEAG QMTATVAQNP ADIGATGLKL MVDAEKS-GK VIPLDKAPEF
 KLVDSILVTQ ----------
   
 [...]
   
 Nr of sequences read : (          4)
 AAA0001    (N1=   288) <-> AAA0002    (N2=   271) ID=    96 ( 35.42 %)
 AAA0001    (N1=   288) <-> AAA0003    (N2=   305) ID=    62 ( 21.53 %)
 AAA0001    (N1=   288) <-> AAA0004    (N2=   305) ID=    50 ( 17.36 %)
 AAA0002    (N1=   271) <-> AAA0003    (N2=   305) ID=    66 ( 24.35 %)
 AAA0002    (N1=   271) <-> AAA0004    (N2=   305) ID=    56 ( 20.66 %)
 AAA0003    (N1=   305) <-> AAA0004    (N2=   305) ID=    56 ( 18.36 %)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
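
The percentages in the listing above are consistent with dividing the number of identities by the residue count of the shorter of the two sequences (for instance, 96 / 271 = 35.42 % for the first pair). The following Python sketch reproduces that calculation under this assumption; it is not SEQMAN's code, the function name is made up, and "-" is assumed to be the only gap character.

```python
# Pairwise identity of two sequences taken from one alignment.
# Assumptions: "-" marks a gap, and the percentage is relative to
# the number of residues in the shorter sequence (inferred from
# the example output, not from the program source).
GAP = "-"


def alignment_identity(seq1: str, seq2: str):
    """Return (n1, n2, identities, percentage) for two aligned rows."""
    n1 = sum(1 for ch in seq1 if ch != GAP)
    n2 = sum(1 for ch in seq2 if ch != GAP)
    identities = sum(
        1 for a, b in zip(seq1, seq2) if a == b and a != GAP
    )
    shortest = min(n1, n2)
    percentage = 100.0 * identities / shortest if shortest else 0.0
    return n1, n2, identities, percentage


if __name__ == "__main__":
    n1, n2, ident, pct = alignment_identity("ACDE-G", "AC-EFG")
    print(f"N1={n1} N2={n2} ID={ident} ({pct:.2f} %)")
```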

10.4 SWrite

Write a single sequence to a file. The format should be one of the options listed by the FOrmat command (the first three characters suffice).

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 SEQMAN > swrite AAA0001 zzz.odb ono
 Opening file : (zzz.odb)
 Writing a single sequence
 File format : (ONO)
 Sequence : (AAA0001)
 Nr of sequences written : (          1)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

10.5 MWrite

Write multiple sequences to a file. Any sequence whose name contains the string you provide as the first argument to this command will be written. The format should be one of the options listed by the FOrmat command (the first three characters suffice).

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 SEQMAN > mwrit qqq qqq.pir pir
 Sequences selected : (         17)
 Opening file : (qqq.pir)
 Writing multiple sequences
 File format : (PIR)
 Sequence : (QQQ0001)
 Sequence : (QQQ0002)
 Sequence : (QQQ0003)
 [...]
 Sequence : (QQQ0016)
 Sequence : (QQQ0017)
 Nr of sequences written : (         17)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

10.6 AWrite

Write multiple sequences (preserving any gap characters) to a file. Any sequence whose name contains the string you provide as the first argument to this command will be written. The format should be one of the options listed by the FOrmat command (the first three characters suffice).

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 SEQMAN > awr qqq qqq.alsc alscr
 Sequences selected : (         17)
 Opening file : (qqq.alsc)
 Writing multiple aligned sequences
 File format : (ALSCR)
 Sequence : (QQQ0001)
 Sequence : (QQQ0002)
 Sequence : (QQQ0003)
 [...]
 Sequence : (QQQ0016)
 Sequence : (QQQ0017)
 Nr of sequences written : (         17)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

10.7 FOrmats

Use this command to get a brief overview of the implemented read and write formats of sequence files.

 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 SEQMAN > form

 SEQMAN FORMATS :  INPUT MODES :  OUTPUT MODES :
 ----------------  -------------  --------------
 PIR                 S  M  A        S  M  A
 FASta               S  M  A        S  M  A
 SWIssprot           S  M  A        S  M  A
 ONO datablock       S  M  -        S  M  A
 ALScript            -  -  A        -  -  A
 CLUstal W           -  -  A        -  -  A
 SEQ records PDB     S  M  -        -
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

10.8 RAndom

Generate a random protein or DNA sequence. Arguments are:

- name of the new sequence
- number of amino acids/bases
- type (P for protein, D for DNA)
- frequency-based (F) or even (flat) (E) distribution (always flat for DNA)

 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 SEQMAN > rand s9 200 prot
 Generate sequence of length : (        200)
 Of type : Protein
 Distribution : ( 0.081 0.125 0.171 0.229 0.248 0.306 0.343 0.423 0.445
   0.498 0.579 0.639 0.658 0.698 0.745 0.813 0.876 0.891 0.929 1.000)

 Sequence : (S9)
 Residues : (        200)
 Comment  : (Randomly generated sequence)
 PRKEPVFALS ATSFYALSIE EIRSGIKDFT DRKSKEVDAL DHAVASDRND QVYVPKPVLP
 FERDFTPPEE PFCSNFKEQC EEGKPEKVSG PPKWVCYKLQ LLDYGLLEQV VLANQYGRVD
 TFSNTGKMFP LYEELLPALG TLRWVDSYGQ TLSGKNGGLK KLKKAFLKKV LFPGLSAHIS
 NLKTKGRRDG PVSKLTVVIS
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

10.9 SEed

Set a new seed for the random-number generator.

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 SEQMAN > seed
 Random-number seed ? (123456) 987654
 Random-number seed : (  987654)
 => Random number generator initialised with seed :     987654
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

10.10 TYpe

Similar to the LIst command, but also shows the actual sequences themselves. Any sequence whose name contains the string you provide as the argument to this command will be listed.

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 SEQMAN > ty p
 Sequences selected : (          2)
   
 Sequence : (P1)
 Residues : (         31)
 Comment  : (Translation of S1)
 QNMRKLRPWP FLTVKAKYQH LHRSVGSIAS E
   
 Sequence : (P2)
 Residues : (         31)
 Comment  : (Translation of S2)
 RPRHSGRGID SAHHGTRELI TGCHITTVYG M
 Nr of sequences listed : (          2)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

10.11 LIst

List name, length, number of residues and comment strings for any or all sequences. Any sequence whose name contains the string you provide as the argument to this command will be listed.

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 SEQMAN > list real
 Sequences selected : (          5)
     # Seq. name Length   #Res Comment
     1 REAL01        707    526 D00688
     2 REAL0001      527    527 >AOFA_HUMAN
     3 REAL0002      520    520 >AOFB_HUMAN
     4 REAL0003      495    495 >AOFN_ASPNG
     5 REAL0004      478    478 >PUO_MICRU
 Nr of sequences listed : (          5)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

10.12 CLean

Clean any or all sequences. (This means removing any non-amino-acid characters from the sequence, e.g. question marks, insertion characters, etc.).

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 SEQMAN > list QQQ0017
     # Seq. name Length   #Res Comment
     1 QQQ0017       707    557 P06617
 Nr of sequences listed : (          1)
 SEQMAN > clean QQQ0017
 Cleaning sequence : (QQQ0017)
 Nr of sequences cleaned : (          1)
 SEQMAN > list QQQ0017
     # Seq. name Length   #Res Comment
     1 QQQ0017       557    557 P06617
 Nr of sequences listed : (          1)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

10.13 DElete

Delete any or all sequences. Warning - every sequence whose name contains the string you provide as the argument to this command will be deleted ! Use the LIst command first with the same argument to see which sequences will be deleted if you use the same argument with the DElete command.

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 SEQMAN > delete real00
 Sequences selected : (          4)
 Deleting sequence : (REAL0001)
 Deleting sequence : (REAL0002)
 Deleting sequence : (REAL0003)
 Deleting sequence : (REAL0004)
 Nr of sequences deleted : (          4)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

10.14 TRanslate

Translate a DNA sequence into a protein sequence. Provide the name of the (existing) DNA sequence, the name of the new protein sequence (which will be created), and the value of the frame shift (0, 1 or 2).

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 SEQMAN > tr SEQ0041
 New name for protein sequence ? (PROT)
 Frame shift : (       0)
   
 Sequence : (SEQ0041)
 Residues : (         10)
 Comment  : (>41 exam 060301)
 GCCAGGTAAG
   
 Sequence : (PROT)
 Residues : (          3)
 Comment  : (Translation of SEQ0041)
 AR*
   
 Use the CLean command to remove any
 stop codons or unknown residue types.
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
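
The translation in the example above can be reproduced with a few lines of Python. This sketch is not SEQMAN's code. It assumes the standard genetic code, interprets the frame shift as the number of leading bases to skip, drops any incomplete trailing codon, and writes "X" for codons containing unknown bases; the function name and the "X" convention are assumptions.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str, frame: int = 0) -> str:
    """Translate DNA to protein; '*' marks a stop codon."""
    dna = dna.upper()
    # Step through complete codons only, starting at the frame shift.
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


if __name__ == "__main__":
    print(translate("GCCAGGTAAG"))  # prints AR*, as in the example above
```

As in SEQMAN, the stop codon is kept as "*" in the result and has to be removed separately if it is not wanted.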

10.15 CHou_fasman

This command implements a simplified version of the Chou-Fasman method for secondary structure prediction that is suitable for teaching to undergraduates as they can do the calculations themselves with a piece of paper and a pen.

The amino acids are classified as HA, hA, IA, iA, bA, BA for predicting helices and as HB, hB, IB, iB, bB, BB for predicting strands as follows:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
               Amino acids  A R N D C E Q G H I L K M F P S T W Y V
          Helix prediction  H i b I i H h B I h H h H h B i i h b h
         Strand prediction  i i i B h B h b i H h b h h B b h h H H
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

The command has two arguments, namely the name of the sequence and (optionally) the mode. The default mode is 'S' in which all H and h residues get assigned a numerical value of +1, all I and i a value of 0, and all b and B a value of -1. If the mode is 'A', the residues with 'h' become +0.5 and those with 'b' become -0.5.
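
The per-residue scores for both modes follow directly from the classification table. Below is a minimal Python sketch of that step only (function and variable names are made up for illustration); it does not attempt the nucleation and propagation windows, whose exact positioning is not specified in this manual.

```python
# Per-residue Chou-Fasman scores as described above.
AMINO_ACIDS = "ARNDCEQGHILKMFPSTWYV"
HELIX_CLASS = "HibIiHhBIhHhHhBiihbh"
STRAND_CLASS = "iiiBhBhbiHhbhhBbhhHH"

# Mode 'S': H/h = +1, I/i = 0, b/B = -1.
# Mode 'A': as 'S', but h = +0.5 and b = -0.5.
CLASS_SCORES = {
    "S": {"H": 1, "h": 1, "I": 0, "i": 0, "b": -1, "B": -1},
    "A": {"H": 1, "h": 0.5, "I": 0, "i": 0, "b": -0.5, "B": -1},
}


def residue_scores(sequence: str, classes: str, mode: str = "S"):
    """Return the list of scores for each residue of the sequence."""
    class_of = dict(zip(AMINO_ACIDS, classes))
    scores = CLASS_SCORES[mode.upper()]
    return [scores[class_of[res]] for res in sequence.upper()]


if __name__ == "__main__":
    seq = "NRMVNHFIAEFRSG"
    print("A-score", residue_scores(seq, HELIX_CLASS))
    print("B-score", residue_scores(seq, STRAND_CLASS))
```

Applied to the 14-residue sequence NRMVNHFIAEFRSG, this gives the A-score and B-score rows shown in the examples below.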

The nucleation scores are calculated in a window of 6 residues (helix) and 5 residues (strand), and the propagation scores are calculated in a window of 4 residues for both. Nucleation occurs if a window scores 4 or more for helix, or 3 or more for strand. Propagation continues until the propagation score goes below +1.

The following example uses a bit of sequence that contains a chameleon sequence (VNHFIAEF; helix in 1ATR, strand in 1IEB):

 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 SEQMAN > ch SEQ0002
               Amino acids  A R N D C E Q G H I L K M F P S T W Y V
          Helix prediction  H i b I i H h B I h H h H h B i i h b h
         Strand prediction  i i i B h B h b i H h b h h B b h h H H

 Residue      1   2   3   4   5   6   7   8   9  10  11  12  13  14
 Type         N   R   M   V   N   H   F   I   A   E   F   R   S   G
 A-score     -1   0   1   1  -1   0   1   1   1   1   1   0   0  -1
 A 6 Nucl     0   0   0   2   3   3   3   5   5   4   2   0   0   0
 A 4 Prop     0   1   1   1   1   1   3   4   4   3   2   0   0   0
 A predict    -   a   A   A   A   A   A   A   A   A   A   a   -   -

 Residue      1   2   3   4   5   6   7   8   9  10  11  12  13  14
 Type         N   R   M   V   N   H   F   I   A   E   F   R   S   G
 B-score      0   0   1   1   0   0   1   1   0  -1   1   0  -1  -1
 B 5 Nucl     0   0   2   2   3   3   2   1   2   1  -1  -2   0   0
 B 4 Prop     0   2   2   2   2   2   2   1   1   0  -1  -1   0   0
 B predict    -   b   B   B   B   B   B   B   B   b   -   -   -   -
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

Interestingly, when the same sequence is run with the 'A' option, no secondary structure at all is predicted (because there are not enough strong helix or strand formers in this bit of sequence):

 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 SEQMAN > ch SEQ0002 a
               Amino acids  A R N D C E Q G H I L K M F P S T W Y V
          Helix prediction  H i b I i H h B I h H h H h B i i h b h
         Strand prediction  i i i B h B h b i H h b h h B b h h H H
   
 Residue       1    2    3    4    5    6    7    8    9   10   11   12   13   14
 Type          N    R    M    V    N    H    F    I    A    E    F    R    S    G
 A-score    -0.5  0.0  1.0  0.5 -0.5  0.0  0.5  0.5  1.0  1.0  0.5  0.0  0.0 -1.0
 A 6 Nucl    0.0  0.0  0.5  1.5  2.0  2.0  2.5  3.5  3.5  3.0  1.5  0.0  0.0  0.0
 A 4 Prop    0.0  1.0  1.0  1.0  0.5  0.5  2.0  3.0  3.0  2.5  1.5 -0.5  0.0  0.0
 A predict     -    -    -    -    -    -    -    -    -    -    -    -    -    -
   
 Residue       1    2    3    4    5    6    7    8    9   10   11   12   13   14
 Type          N    R    M    V    N    H    F    I    A    E    F    R    S    G
 B-score     0.0  0.0  0.5  1.0  0.0  0.0  0.5  1.0  0.0 -1.0  0.5  0.0 -0.5 -0.5
 B 5 Nucl    0.0  0.0  1.5  1.5  2.0  2.5  1.5  0.5  1.0  0.5 -1.0 -1.5  0.0  0.0
 B 4 Prop    0.0  1.5  1.5  1.5  1.5  1.5  1.5  0.5  0.5 -0.5 -1.0 -0.5  0.0  0.0
 B predict     -    -    -    -    -    -    -    -    -    -    -    -    -    -
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
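
The per-residue scores and the window sums in the two examples above can be checked with a few lines of Python. This is only a sketch, not SEQMAN's own code: the function names are made up for illustration, and the sketch stops at the nucleation test because the manual does not spell out exactly how the predicted segment is extended and labelled. The class table, the window lengths (6 for helix, 5 for strand) and the thresholds (4 and 3) are taken from the text.

```python
# Classification table copied from this section of the manual.
AMINO = "ARNDCEQGHILKMFPSTWYV"
HELIX = "HibIiHhBIhHhHhBiihbh"
STRAND = "iiiBhBhbiHhbhhBbhhHH"


def class_value(cls, mode="S"):
    """Numerical value of one class letter in mode 'S' or 'A'."""
    if cls in "Ii":
        return 0.0
    sign = 1.0 if cls in "Hh" else -1.0
    # In mode 'A' the weak classes h and b count only half.
    weak = mode.upper() == "A" and cls in "hb"
    return sign * (0.5 if weak else 1.0)


def residue_scores(seq, table, mode="S"):
    """Score every residue of seq using the HELIX or STRAND table."""
    lookup = dict(zip(AMINO, table))
    return [class_value(lookup[res], mode) for res in seq]


def window_sums(values, width):
    """Sum of every run of `width` consecutive values."""
    return [sum(values[i:i + width]) for i in range(len(values) - width + 1)]


def nucleates(seq, table, width, threshold, mode="S"):
    """True if at least one window reaches the nucleation threshold."""
    sums = window_sums(residue_scores(seq, table, mode), width)
    return any(s >= threshold for s in sums)


seq = "NRMVNHFIAEFRSG"
helix_nucl = window_sums(residue_scores(seq, HELIX), 6)
strand_nucl = window_sums(residue_scores(seq, STRAND), 5)
print(helix_nucl)
print(strand_nucl)
print(nucleates(seq, HELIX, 6, 4, "S"), nucleates(seq, HELIX, 6, 4, "A"))
```

The nine helix sums and ten strand sums correspond to the values printed for residues 3-11 and 3-12 in the "A 6 Nucl" and "B 5 Nucl" rows; how SEQMAN assigns a window to a residue number is inferred from that output.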

10.16 UNitary_matrix

Define a simple unitary substitution matrix, where any match is given a fixed score, and any mismatch is given another fixed score. This will work fine for DNA sequences.

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 SEQMAN > unit 3 -1
 Diagonal term : (   3.000)
 Off-iagonal term : (  -1.000)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
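
As an illustration of what such a matrix contains, here is a minimal Python sketch. The function name and the dictionary representation are assumptions made for this example and say nothing about how SEQMAN stores the matrix internally.

```python
def unitary_matrix(alphabet, match, mismatch):
    """Return a dict mapping every residue pair to the match or mismatch score."""
    return {
        (a, b): (match if a == b else mismatch)
        for a in alphabet
        for b in alphabet
    }


# Same values as in the example above: diagonal 3, off-diagonal -1.
dna = unitary_matrix("ACGT", 3.0, -1.0)
print(dna[("A", "A")], dna[("A", "G")])
```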

10.17 SUbstitution_matrix file

Read a substitution matrix file (in SBIN format). If you have defined the environment variable GKLIB, then the program will look for the file whose name you provide in your GKLIB directory.

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 SEQMAN > subst sbin_blosum62.lib
 Comment : (! BLOSUM 62 matrix made from BLOCKS v. 5.0 and scaled in half-
  bits.)
 Comment : (! ARNDCQEGHILKMFPSTWYVBZX)
 Comment : (! integer matrix)
 Read INTR matrix with format : ((I2,30I3))
 Average matrix value : (  -1.065)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

10.18 DUmp_matrix

Show the current substitution matrix.

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 SEQMAN > du
     A  R  N  D  C  E  Q  G  H  I  L  K  M  F  P  S  T  W  Y  V
 A   5 -2 -1 -2 -1 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -2 -2  0
 R  -2  7  0 -1 -3  1  0 -2  0 -3 -2  3 -1 -2 -2 -1 -1 -2 -1 -2
 N  -1  0  6  2 -2  0  0  0  1 -2 -3  0 -2 -2 -2  1  0 -4 -2 -3
 D  -2 -1  2  7 -3  0  2 -1  0 -4 -3  0 -3 -4 -1  0 -1 -4 -2 -3
 C  -1 -3 -2 -3 12 -3 -3 -3 -3 -3 -2 -3 -2 -2 -4 -1 -1 -5 -3 -1
 E  -1  1  0  0 -3  6  2 -2  1 -2 -2  1  0 -4 -1  0 -1 -2 -1 -3
 Q  -1  0  0  2 -3  2  6 -2  0 -3 -2  1 -2 -3  0  0 -1 -3 -2 -3
 G   0 -2  0 -1 -3 -2 -2  7 -2 -4 -3 -2 -2 -3 -2  0 -2 -2 -3 -3
 H  -2  0  1  0 -3  1  0 -2 10 -3 -2 -1  0 -2 -2 -1 -2 -3  2 -3
 I  -1 -3 -2 -4 -3 -2 -3 -4 -3  5  2 -3  2  0 -2 -2 -1 -2  0  3
 L  -1 -2 -3 -3 -2 -2 -2 -3 -2  2  5 -3  2  1 -3 -3 -1 -2  0  1
 K  -1  3  0  0 -3  1  1 -2 -1 -3 -3  5 -1 -3 -1 -1 -1 -2 -1 -2
 M  -1 -1 -2 -3 -2  0 -2 -2  0  2  2 -1  6  0 -2 -2 -1 -2  0  1
 F  -2 -2 -2 -4 -2 -4 -3 -3 -2  0  1 -3  0  8 -3 -2 -1  1  3  0
 P  -1 -2 -2 -1 -4 -1  0 -2 -2 -2 -3 -1 -2 -3  9 -1 -1 -3 -3 -3
 S   1 -1  1  0 -1  0  0  0 -1 -2 -3 -1 -2 -2 -1  4  2 -4 -2 -1
 T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -1 -1  2  5 -3 -1  0
 W  -2 -2 -4 -4 -5 -2 -3 -2 -3 -2 -2 -2 -2  1 -3 -4 -3 15  3 -3
 Y  -2 -1 -2 -2 -3 -1 -2 -3  2  0  0 -1  0  3 -3 -2 -1  3  8 -1
 V   0 -2 -3 -3 -1 -3 -3 -3 -3  3  1 -2  1  0 -3 -1  0 -3 -1  5
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

10.19 MAtrix

Set individual elements of the substitution matrix. Provide the two residue types (one-letter code) and the value of the element in the matrix, and both entries (type-1/type-2 and type-2/type-1) will be set to that value.

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 SEQMAN > ma g t 2
 Before : S(G,T) =  -1 = S(G,T)
 After  : S(G,T) =   2 = S(G,T)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
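
The symmetric update amounts to the following (a hypothetical Python sketch; the helper name and the dictionary layout are assumptions, and the upper-casing mirrors the example above, where "g t" is reported as S(G,T)):

```python
def set_element(matrix, res1, res2, value):
    """Set both (res1, res2) and (res2, res1) so the matrix stays symmetric."""
    a, b = res1.upper(), res2.upper()
    matrix[(a, b)] = value
    matrix[(b, a)] = value


matrix = {("G", "T"): -1, ("T", "G"): -1}
set_element(matrix, "g", "t", 2)
print(matrix[("G", "T")], matrix[("T", "G")])
```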

10.20 DOt_plot

With this command you can make a simple type of dot-plot. You provide the names of the two sequences and the name of the PostScript plot file that will be produced. Optionally, you can supply the length of the window that should be taken into account (an odd number) and the minimum score for a dot to be plotted. For every pair of residues, I in sequence 1 and J in sequence 2, the program counts how many identities occur in a small window of both sequences around these two residues. If this number exceeds the minimum count, a dot is plotted at position (I,J).

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 SEQMAN > dot real0001 real0002 dotplot.ps 7 3
 Cleaning sequence 1
 Cleaning sequence 2
 Dot-plot
 Window size : (          7)
 Half-window : (          3)
 Min count   : (          3)
 Max count   : (          7)
 => XPS_GRAF - GJK (19981216/3.1.2)
 Opened PostScript file : (dotplot.ps)
 Date    : (Wed Mar 30 23:02:34 2005)
 User    : (gerard)
 Program : (SEQMAN)
 Nr of pixels in plot : (     266760)
 Nr of dots plotted   : (       2049)
 Percentage plotted   : (   0.768)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
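
The counting step can be sketched in Python as follows. This is an illustration under stated assumptions, not the program's code: the function name is made up, indices are 0-based, window positions that fall outside either sequence are simply skipped, and a dot is recorded when the count is at least the minimum (whether SEQMAN uses "at least" or "more than" is not clear from the text). The default values are those of the example above, not necessarily the program defaults.

```python
def dot_plot(seq1, seq2, window=7, min_count=3):
    """Return the (i, j) pairs whose diagonal window holds enough identities."""
    half = window // 2
    dots = []
    for i in range(len(seq1)):
        for j in range(len(seq2)):
            count = 0
            # Walk along the diagonal through (i, j), half a window each way.
            for k in range(-half, half + 1):
                a, b = i + k, j + k
                if 0 <= a < len(seq1) and 0 <= b < len(seq2) and seq1[a] == seq2[b]:
                    count += 1
            if count >= min_count:
                dots.append((i, j))
    return dots


dots = dot_plot("GATTACA", "GATTACA", window=3, min_count=3)
print(dots)
```

With a window of 3 and a minimum of 3, only the interior of the main diagonal survives in this self-comparison, because no three-residue stretch of GATTACA occurs twice.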

10.21 NWunschs

With this command you can do a global alignment of two sequences, using the Needleman-Wunsch-Sellers algorithm. As in the examples below, the arguments are the names of the two sequences, the gap-opening penalty, the gap-extension penalty and the overhang flag. The program uses an affine gap penalty, i.e. of the form: Penalty = gap_open + L * gap_extend, where gap_open is the gap-opening penalty, gap_extend is the gap-extension penalty, and L is the length of the gap. You may choose whether or not to apply an overhang penalty (if Y, then "gaps" at the termini are penalised as heavily as internal gaps; if N, then such gaps are not penalised at all). Note that both penalties must be provided as non-negative numbers ! You may use a value of zero for either or both penalties, of course. If gap_extend is zero, then a gap of one residue gets the same penalty as one of 100 residues.

An optional parameter also allows you to use the classic Needleman-Wunsch algorithm (i.e., in every matrix element you only consider the possibility of adding one gap to the element above or to the left - this tends to lead to fewer and shorter gaps). The default is to use the Sellers modification, i.e., looking back in the entire row and column to see if any element, with X gaps added, would give the maximum score.
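
To make the recurrence concrete, here is a minimal Python sketch of the score-matrix calculation with the affine penalty and the Sellers look-back. It is not SEQMAN's code. The function names are invented, a simple match/mismatch score stands in for the substitution matrix, the classic one-gap variant is not covered, and for the case without overhang penalty it is assumed that the alignment score is the highest value in the last row or last column. Rows follow the second sequence and columns the first, as in the printed matrices.

```python
def gap_penalty(length, gap_open, gap_extend):
    """Affine penalty for a gap of `length` residues, as defined in the text."""
    return gap_open + length * gap_extend


def nw_sellers(seq1, seq2, match, mismatch, gap_open, gap_extend, overhang=True):
    """Return (score matrix, alignment score) for a global alignment."""
    n, m = len(seq1), len(seq2)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    if overhang:
        # Terminal gaps cost as much as internal ones.
        for j in range(1, n + 1):
            score[0][j] = -gap_penalty(j, gap_open, gap_extend)
        for i in range(1, m + 1):
            score[i][0] = -gap_penalty(i, gap_open, gap_extend)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if seq2[i - 1] == seq1[j - 1] else mismatch
            best = score[i - 1][j - 1] + sub
            # Sellers: look back along the whole row and the whole column.
            for k in range(1, j + 1):
                best = max(best, score[i][j - k] - gap_penalty(k, gap_open, gap_extend))
            for k in range(1, i + 1):
                best = max(best, score[i - k][j] - gap_penalty(k, gap_open, gap_extend))
            score[i][j] = best
    if overhang:
        final = score[m][n]
    else:
        # Assumption: free end gaps mean the best cell on the lower or right edge wins.
        final = max(max(score[m]), max(row[n] for row in score))
    return score, final


_, free_ends = nw_sellers("GATTACATA", "GAGACTTAG", 3, -1, 3, 1, overhang=False)
matrix, penalised = nw_sellers("GATTACATA", "GAGACTTAG", 3, -1, 3, 1, overhang=True)
print(free_ends, penalised)  # 12 8, the alignment scores in the examples of this section
```

The sketch only fills the score matrix; it keeps no pointer matrix and does no traceback.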

If both sequences contain 15 or fewer residues, then the two dynamic-programming matrices are printed. This is great for making exam questions where students are given the score matrix (but not the pointer matrix) and asked to derive the optimal sequence alignment from it. The pointer matrix encodes for each cell of the score matrix how its score was obtained. A value of 0 in the pointer matrix indicates a diagonal (alignment) move. A positive value +N means that the best move came from a cell N columns to the left, and a negative value -N means that the best move came from a cell N rows above the current cell. In case of a tie, only one result is retained (diagonal before left, left before up). However, in this case, a character is printed behind the pointer value to indicate that one or both alternative moves yielded the same score. For instance:

- "0<" means: we took the diagonal, but some move to the left scored the same
- "0^" means: we took the diagonal, but some move upwards scored the same
- "0*" means: we took the diagonal, but both to the left and upwards there is at least one move that scored the same
- "1^" means: we went one element to the left, but some move upwards scored the same
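
The notation above can be decoded mechanically. The following Python helper is a hypothetical illustration (its name and return format are not part of SEQMAN); it covers only the combinations described in the text.

```python
def decode_pointer(cell):
    """Split a pointer-matrix entry into (direction, steps, tied alternatives)."""
    tie_flags = {"<": ["left"], "^": ["up"], "*": ["left", "up"]}
    ties = tie_flags.get(cell[-1], [])
    value = int(cell.rstrip("<^*"))
    if value == 0:
        return "diagonal", 0, ties
    if value > 0:
        # Positive: the best move came from `value` columns to the left.
        return "left", value, ties
    # Negative: the best move came from `-value` rows above.
    return "up", -value, ties


for entry in ["0", "0<", "0*", "1^", "-3"]:
    print(entry, decode_pointer(entry))
```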

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 SEQMAN > mread seq exam_multi.fasta fasta
 Opening file : (exam_multi.fasta)
 Reading multiple sequences
 File format : (FASTA)
   
 Sequence : (SEQ0001)
 Residues : (          9)
 Comment  : (>DNA1)
 GATTACATA
   
 Sequence : (SEQ0002)
 Residues : (          9)
 Comment  : (>DNA2)
 GAGACTTAG
   
 [...]
   
 Nr of sequences read : (         35)
 SEQMAN > unit_matrix 3 -1
 Diagonal term : (   3.000)
 Off-iagonal term : (  -1.000)
 SEQMAN > nwunsch SEQ0001 SEQ0002 3 1 no
 Needleman-Wunsch-Sellers alignment
 Gap open penalty   : (   3.000)
 Gap extend penalty : (   1.000)
 Overhang penalty   : (N)
 Classic NW method  : (N)
   
 Score matrix
    .    -    G    A    T    T    A    C    A    T    A
    -    0    0    0    0    0    0    0    0    0    0
    G    0    3   -1   -1   -1   -1   -1   -1   -1   -1
    A    0   -1    6    2    1    2   -1    2   -2    2
    G    0    3    2    5    1    0    1   -2    1   -2
    A    0   -1    6    2    4    4    0    4    0    4
    C    0   -1    2    5    1    3    7    3    3    1
    T    0   -1    1    5    8    4    3    6    6    2
    T    0   -1    0    4    8    7    3    2    9    5
    A    0   -1    2    0    4   11    7    6    5   12
    G    0    3   -1    1    3    7   10    6    5    8
   
 Pointer matrix
    .    -    G    A    T    T    A    C    A    T    A
    -    0    0    0    0    0    0    0    0    0    0
    G    0    0    0<   0    0    0    0    0    0    0
    A    0    0^   0    1    2    0    4    0    0<   0
    G    0    0   -1    0    0<   0<   0    0*   0   -1
    A    0    0^   0    1    0    0    1    0    1    0
    C    0    0   -1    0    0<   0    0    1    0    3
    T    0    0   -2    0    0    1    2^   0    0    0<
    T    0    0   -3    0    0    0    0<   0*   0    0<
    A    0    0    0   -1   -1    0    1    0<   3^   0
    G    0    0    1    0   -2   -1    0    0<   0<  -1
   
 Alignment score : (  12.000)
 Highest right row : (       8)
   
 Alignment length   : (      10)
 Number of gaps     : (       1)
 Length sequence 1  : (       9)
 Length sequence 2  : (       9)
 Nr of identities   : (       6)
 Perc identities    : (  66.667)
 Nr of similarities : (       6)
 Perc similarities  : (  66.667)
 Note: similarities include identities !
   
 SEQ0001    GATTACATA-
 |ID +SIM   ||  || ||
 SEQ0002    GA-GACTTAG
   
 The alignment is NOT unique!
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

Doing the same but with the overhang penalty switched on:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 SEQMAN > nwunsch SEQ0001 SEQ0002 3 1 yes
 Needleman-Wunsch-Sellers alignment
 Gap open penalty   : (   3.000)
 Gap extend penalty : (   1.000)
 Overhang penalty   : (Y)
 Classic NW method  : (N)
   
 Score matrix
    .    -    G    A    T    T    A    C    A    T    A
    -    0   -4   -5   -6   -7   -8   -9  -10  -11  -12
    G   -4    3   -1   -2   -3   -4   -5   -6   -7   -8
    A   -5   -1    6    2    1    0   -1   -2   -3   -4
    G   -6   -2    2    5    1    0   -1   -2   -3   -4
    A   -7   -3    1    1    4    4    0    2   -2    0
    C   -8   -4    0    0    0    3    7    3    2    1
    T   -9   -5   -1    3    3   -1    3    6    6    2
    T  -10   -6   -2    2    6    2    2    2    9    5
    A  -11   -7   -3   -2    2    9    5    5    5   12
    G  -12   -8   -4   -3    1    5    8    4    4    8
   
 Pointer matrix
    .    -    G    A    T    T    A    C    A    T    A
    -    0    0    0    0    0    0    0    0    0    0
    G    0    0    1    2    3    4    5    6    7    8
    A    0   -1    0    1    2    0<   4    0<   6    0<
    G    0    0^  -1    0    0<   0<   0<   0<   0<   0<
    A    0   -3    0^   0^   0    0    1    0    1    0
    C    0   -4   -3    0^   0^   0    0    1    2    3
    T    0   -5   -4    0    0    0*  -1    0    0    1
    T    0   -6   -5    0    0    0<  -2    0^   0    0<
    A    0   -7    0^  -1   -1    0    1    0   -1    0
    G    0    0^  -7   -2   -2   -1    0    0<   0^  -1
   
 Alignment score : (   8.000)
   
 Alignment length   : (      10)
 Number of gaps     : (       2)
 Length sequence 1  : (       9)
 Length sequence 2  : (       9)
 Nr of identities   : (       6)
 Perc identities    : (  66.667)
 Nr of similarities : (       6)
 Perc similarities  : (  66.667)
 Note: similarities include identities !
   
 SEQ0001    GATTACATA-
 |ID +SIM   ||  || ||
 SEQ0002    GA-GACTTAG
   
 The alignment is NOT unique!
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

10.22 SMithwat

This command finds the single best local alignment of two sequences using the Smith-Waterman algorithm. See the NWunschs command for details.

 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 SEQMAN > smithw SEQ0001 SEQ0002 3 1
 Smith-Waterman alignment
 Gap open penalty   : (   3.000)
 Gap extend penalty : (   1.000)
   
 Score matrix
    .    -    G    A    T    T    A    C    A    T    A
    -    0    0    0    0    0    0    0    0    0    0
    G    0    3    0    0    0    0    0    0    0    0
    A    0    0    6    2    1    3    0    3    0    3
    G    0    3    2    5    1    0    2    0    2    0
    A    0    0    6    2    4    4    0    5    1    5
    C    0    0    2    5    1    3    7    3    4    1
    T    0    0    1    5    8    4    3    6    6    3
    T    0    0    0    4    8    7    3    2    9    5
    A    0    0    3    0    4   11    7    6    5   12
    G    0    3    0    2    3    7   10    6    5    8
   
 Pointer matrix
    .    -    G    A    T    T    A    C    A    T    A
    -    0    0    0    0    0    0    0    0    0    0
    G    0    0    0    0    0    0    0    0    0    0
    A    0    0    0    1    2    0    0    0    0    0
    G    0    0   -1    0    0<   0    0    0    0    0
    A    0    0    0    1    0    0    0    0    1    0
    C    0    0   -1    0    0<   0    0    1    0    3^
    T    0    0   -2    0    0    1    2^   0    0    0
    T    0    0    0    0    0    0    0<   0*   0    0<
    A    0    0    0    0   -1    0    1    0<   3^   0
    G    0    0    0    0   -2   -1    0    0<   0<  -1
   
 Alignment score : (  12.000)
 Column : (       9)
 Row : (       8)
 End of alignment reached
 Column : (       0)
 Row : (       0)
   
 Alignment length   : (       9)
 Number of gaps     : (       1)
 Length sequence 1  : (       9)
 Length sequence 2  : (       9)
 Nr of identities   : (       6)
 Nr of similarities : (       6)
 Note: similarities include identities !
   
 SEQ0001    GATTACATA
 |ID +SIM   ||  || ||
 SEQ0002    GA-GACTTA
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
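
The score matrix in the example above is consistent with the usual local-alignment recurrence: the same affine penalty and look-back along row and column as for the global case, but with every cell clamped at zero and the alignment starting from the highest-scoring cell. The Python sketch below rests on that reading. It is not SEQMAN's code, the function name is invented, and a simple match/mismatch score (3 and -1, as in the NWunschs examples) stands in for the substitution matrix.

```python
def smith_waterman(seq1, seq2, match, mismatch, gap_open, gap_extend):
    """Return (score matrix, best score, (row, column) of the best cell)."""
    n, m = len(seq1), len(seq2)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if seq2[i - 1] == seq1[j - 1] else mismatch
            # A local alignment may restart anywhere, so no cell drops below zero.
            cell = max(0, score[i - 1][j - 1] + sub)
            for k in range(1, j + 1):
                cell = max(cell, score[i][j - k] - (gap_open + k * gap_extend))
            for k in range(1, i + 1):
                cell = max(cell, score[i - k][j] - (gap_open + k * gap_extend))
            score[i][j] = cell
            if cell > best:
                best, best_pos = cell, (i, j)
    return score, best, best_pos


sw_matrix, sw_score, sw_pos = smith_waterman("GATTACATA", "GAGACTTAG", 3, -1, 3, 1)
print(sw_score, sw_pos)  # 12 (8, 9): row 8, column 9, as reported in the example
```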

11 EXAMPLE

Here's an example of an exam question one could use:

Given is the following dynamic-programming matrix:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
    .    -    G    C    C    A    A    G    T    A    G    G
    -    0   -4   -5   -6   -7   -8   -9  -10  -11  -12  -13
    A   -4   -1   -5   -6   -3   -4   -8   -9   -7  -11  -12
    C   -5   -5    2   -2   -3   -4   -5   -6*  -7   -8   -9
    G   -6   -2   -2    1   -3   -4   -1   -5*  -6   -4   -5
    A   -7   -6   -3   -3    4    0   -1   -2*  -2   -4   -5
    G   -8   -4   -4   -4    0    3    3   -1   -2    1   -1
    C   -9   -8   -1   -1   -1   -1    2    2   -2   -3    0
    G  -10   -6   -5   -2   -2   -2    2    1    1    1    0
    T  -11  -10   -6   -6   -3   -3   -2    5    1    0    0
    A  -12  -11   -7   -7   -3    0   -3    1*   8    4    3
    T  -13  -12   -8   -8   -5   -4   -1    0*   4    7    3
    G  -14  -10   -9   -9   -6   -5   -1   -1*   3    7   10
    A  -15  -14  -10  -10   -6   -3   -5   -2    2    3    6
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

(a) What can you say about the algorithm that was used to produce this matrix ?

(b) To construct the matrix, a match was scored A, a mismatch B, the gap-open penalty was C and the gap-extension penalty was D. Determine the values of A, B, C and D (and show how you did it) !

(c) Derive the sequence alignment. Is the solution unique ?

12 KNOWN BUGS

None, at present.

Created at Thu Feb 22 15:41:05 2007 by MAN2HTML version 070111/2.0.8 . This manual describes SEQMAN, a program of the Uppsala Software Factory (USF), written and maintained by Gerard Kleywegt. © 1992-2007.