Uppsala Software Factory - STRUPAT Manual

1 STRUPAT - GENERAL INFORMATION
2 REFERENCES
3 VERSION HISTORY
4 INTRODUCTION
5 INPUT TO THE PROGRAM

5.1 Start-up

5.2 Random-number seed

5.3 Random sequence

5.4 Cut-off distances and frameshifts

5.5 Minimum pattern length

5.6 Substitution-group mode

5.7 Little variation

5.8 PDB file
6 OUTPUT
7 RESULTS
8 PATTERN REDUCTION
9 KNOWN BUGS
10 UNKNOWN BUGS

1 STRUPAT - GENERAL INFORMATION

Program : STRUPAT
Version : 041001
Author : Gerard J. Kleywegt, Dept. of Cell and Molecular Biology, Uppsala University, Biomedical Centre, Box 596, SE-751 24 Uppsala, SWEDEN
E-mail : gerard@xray.bmc.uu.se
Purpose : generate PROSITE patterns from aligned 3D protein structures
Package : SBIN

2 REFERENCES

Reference(s) for this program:

* 1 * G.J. Kleywegt & T.A. Jones (1998). Databases in protein crystallography. Acta Cryst D54, 1119-1131. [http://xray.bmc.uu.se/gerard/papers/databases.html] [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed&cmd=Retrieve&list_uids=10089488&dopt=Citation] [http://scripts.iucr.org/cgi-bin/paper?ba0001]

* 2 * Kleywegt, G.J., Zou, J.Y., Kjeldgaard, M. & Jones, T.A. (2001). Around O. In: "International Tables for Crystallography, Vol. F. Crystallography of Biological Macromolecules" (Rossmann, M.G. & Arnold, E., Editors). Chapter 17.1, pp. 353-356, 366-367. Dordrecht: Kluwer Academic Publishers, The Netherlands.

3 VERSION HISTORY

970512 - 0.1 - first version
970804 - 0.5 - first documented version
970805 - 0.6 - try to extend alignments backwards as well; minor changes
971030 - 1.0 - cleaned up code and manual
010227 - 1.1 - calculated empirical estimate for the probability of each pattern using the formula used in EMOTIF (CG Nevill-Manning, TD Wu, DL Brutlag, PNAS 95, 5865-5871 (1998)); also implemented EMOTIF's substitution-group mode and made it the default
020823 - 1.2 - skip alt. conf. (B, C, ...) when reading PDB files
041001 - 1.3 - replaced Kabsch' routine U3BEST by quaternion-based routine (U3QION) to do least-squares superpositioning

4 INTRODUCTION

This program generates PROSITE patterns from a set of aligned three-dimensional protein structures in PDB format.

Suppose that you solve a new protein structure which turns out to contain a fold that is (partly) similar to that of one or more other proteins (as found, e.g., using DEJAVU or SPASM). If you align the structures (e.g., using LSQMAN), you can feed them into the program STRUPAT, which will look for more or less conserved residues in structurally conserved regions. It will use these to generate PROSITE-type sequence patterns (a.k.a. footprints, fingerprints, motifs, ...).

Such a pattern may look as follows: G-x(3)-C-x(2)-[ILV]. This means: glycine - three residues of any type - cysteine - two residues of any type - one residue of type Ile/Leu/Val. A protein which contains the peptide GYAVCPSV would fit this pattern.
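To make the notation concrete, here is a minimal Python sketch that translates such a pattern into a regular expression and tests it against the peptide above. It is not part of STRUPAT; the function name prosite_to_regex is made up for this illustration, and only the pattern elements shown in this manual (single residues, x(n), and [..] sets) are handled.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression.

    Handles only the elements used in this manual: a residue letter,
    x or X (any residue), a [..] set, and an optional repeat count (n).
    """
    parts = []
    for element in pattern.replace(" ", "").split("-"):
        m = re.fullmatch(r"(\[[A-Z]+\]|[A-Z]|x)(?:\((\d+)\))?", element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, count = m.groups()
        if residue in ("x", "X"):
            residue = "."          # any residue type
        parts.append(residue + (f"{{{count}}}" if count else ""))
    return "".join(parts)

regex = prosite_to_regex("G-x(3)-C-x(2)-[ILV]")
print(regex)
print(bool(re.search(regex, "GYAVCPSV")))
```

The same function accepts the spaced, upper-case form that STRUPAT prints (e.g. "[DN] - F - [DN] - X(3)").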

If you want to scan PROSITE ( http://www.expasy.ch/sprot/prosite.html ) patterns against the SWISS-PROT (and TREMBL) database, you can use the WWW-based PROSITE server ( http://www.expasy.ch/sprot/scnpsit2.html ) at ExPASy in Geneva.

5 INPUT TO THE PROGRAM

5.1 Start-up

When you start the program, it prints some information:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 *** STRUPAT *** STRUPAT *** STRUPAT *** STRUPAT *** STRUPAT *** STRUPAT ***
   
 Version  - 010227/1.1
 (C) 1992-2001 Gerard J. Kleywegt, Dept. Cell Mol. Biol., Uppsala (S)
 User I/O - routines courtesy of Rolf Boelens, Univ. of Utrecht (NL)
 Others   - T.A. Jones, G. Bricogne, Rams, W.A. Hendrickson
 Others   - W. Kabsch, CCP4, PROTEIN, E. Dodson, etc. etc.
   
 Started  - Wed Feb 28 01:50:07 2001
 User     - gerard
 Mode     - interactive
 Host     - sarek
 ProcID   - 14618
 Tty      - /dev/ttyq14
   
 *** STRUPAT *** STRUPAT *** STRUPAT *** STRUPAT *** STRUPAT *** STRUPAT ***
   
 Reference(s) for this program:
   
 *  1 * G.J. Kleywegt & T.A. Jones (1998). Databases in protein
        crystallography. Acta Cryst D54, 1119-1131.
        [http://xray.bmc.uu.se/gerard/papers/databases.html]
        [http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?uid=10089488&form=6&db=m&Dopt=b]
        [http://www.iucr.org/iucr-top/journals/acta/tocs/actad/1998/actad5406_1.html]
   
 *  2 * G.J. Kleywegt, J.Y. Zou, M. Kjeldgaard & T.A. Jones (2001).
        Chapter 17.1.  Around O. Int. Tables for Crystallography,
        Vol. F. (In press.)
   
 ==> For manuals and up-to-date references, visit:
 ==>     http://xray.bmc.uu.se/usf
 ==> For downloading up-to-date versions, visit:
 ==>     ftp://xray.bmc.uu.se/pub/gerard
   
 *** STRUPAT *** STRUPAT *** STRUPAT *** STRUPAT *** STRUPAT *** STRUPAT ***
   
 Max nr of atoms/residues       : (      50000)
 Max nr of molecules            : (        500)
 Max nr of residues in sequence : (       2000)
 Max nr of PROSITE patterns     : (        500)
 Random sequence length         : (    2000000)
 One-letter codes   : ( A R N D C E Q G H I L K M F P S T W Y V)
 Three-letter codes : ( ALA ARG ASN ASP CYS GLU GLN GLY HIS ILE LEU LYS
  MET PHE PRO SER THR TRP TYR VAL)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

5.2 Random-number seed

The first bit of input is an integer seed for the random-number generator. This will be used to generate a random amino-acid sequence. If you repeat a run of the program on the same machine with the same seed, you should get identical results.

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 Random-number seed ? (  123456)
 Random-number seed : (  123456)
 => Random number generator initialised with seed :     123456
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

5.3 Random sequence

The program will now generate a random amino-acid sequence of (at present) 2,000,000 residues. This sequence has an amino-acid distribution similar to that found in proteins in the PDB (GJK, unpublished results). It will be used later to test how often generated PROSITE patterns occur in this sequence, which gives you some idea of how likely a pattern is to occur by chance. Of course, a random sequence is unlikely to be "protein-like", but if a pattern matches the random sequence more than, say, 5 or 10 times, it is unlikely to be a very discriminating one.

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 Generating random sequence ...
 Target composition    : (   0.081    0.044    0.046    0.058    0.019
  0.058    0.037    0.080    0.022    0.053    0.081    0.059    0.020
  0.040    0.047    0.068    0.063    0.016    0.038    0.071)
 Working ...
 Actual composition    : (   0.081    0.044    0.046    0.058    0.019
  0.057    0.037    0.080    0.022    0.053    0.081    0.060    0.020
  0.040    0.046    0.068    0.063    0.015    0.038    0.070)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
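The idea of a seeded random sequence with a given composition can be sketched in a few lines of Python. This is an illustration only, not STRUPAT's own code: it assumes that the target composition values are listed in the same order as the one-letter codes printed at start-up, it uses Python's random-number generator (so the sequence will differ from the one STRUPAT produces for the same seed), and it generates a shorter sequence to keep the run quick.

```python
import random
from collections import Counter

# One-letter codes as printed at start-up; the pairing with the
# target composition values below is an assumption.
ONE_LETTER = "ARNDCEQGHILKMFPSTWYV"
TARGET = [0.081, 0.044, 0.046, 0.058, 0.019, 0.058, 0.037, 0.080,
          0.022, 0.053, 0.081, 0.059, 0.020, 0.040, 0.047, 0.068,
          0.063, 0.016, 0.038, 0.071]

def random_sequence(length, seed):
    """Draw residues independently, weighted by the target composition."""
    rng = random.Random(seed)      # same seed gives the same sequence
    return "".join(rng.choices(ONE_LETTER, weights=TARGET, k=length))

def composition(seq):
    """Observed fraction of each residue type, in ONE_LETTER order."""
    counts = Counter(seq)
    return [counts[aa] / len(seq) for aa in ONE_LETTER]

seq = random_sequence(200_000, seed=123456)
for aa, want, got in zip(ONE_LETTER, TARGET, composition(seq)):
    print(f"{aa}  target {want:.3f}  actual {got:.3f}")
```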

5.4 Cut-off distances and frameshifts

You are to provide a cut-off distance (in Å) for CA atoms of different molecules to be considered equivalent. If this number is very high, frameshifts may occur in the structural alignments, although the program can be instructed to try and correct for these. Another cut-off distance determines how bits of equivalent structure are extended at their ends.

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 Equivalent CA distance ? (   3.500) 5
 Equivalent CA distance : (   5.000)
   
 Extension CA distance ? (   5.000) 8
 Extension CA distance : (   8.000)
   
 Try to correct frame-shifts (Y/N) ? (Y)
 Try to correct frame-shifts (Y/N) : (Y)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

5.5 Minimum pattern length

Very short patterns are unlikely to be very specific. Also, for calculating RMSDs between aligned stretches, at least 3 CA atoms are required.

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 Min pattern length ? (      10) 5
 Min pattern length : (       5)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

5.6 Substitution-group mode

You can use either the old STRUPAT algorithm for grouping residues, or the set that is used by EMOTIF (recommended).

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 Choose substitution-group mode:
   E = EMOTIF (recommended)
   S = STRUPAT
 Substitution-group mode (E/S) ? (E)
 Substitution-group mode (E/S) : (E)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

5.7 Little variation

This block of input is only required when you use the old STRUPAT substitution-group mode (not recommended !).

If only 2, 3 or 4 different residue types occur in a certain position of all structures/sequences, this can be included in the pattern. But this only makes sense if you have a reasonable number of structures. For instance, if you only have three structures, and you observe residue types Arg, Lys, and Gln in a certain position, you probably would not want to conclude that this residue is always Arg, Lys or Gln. However, if you have 30 aligned structures, you might.

There are a few exceptions to this, namely if the various observed residue types are similar, such as: D/E, R/K, F/Y, F/Y/W, N/Q, S/T, A/G, and I/L/V.

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 If only 2, 3, or 4 different residue types
 occur in at least NMIN2, NMIN3, NMIN4 of
 your sequences, an entry will be generated
 (e.g., [SE], [TGW], [KILM]).  By setting
 NMIN2/3/4 greater than the number of sequences
 you can prevent that such entries are used.
 Value for MIN2 (>2) ? (       6)
 Value for MIN3 (>3) ? (      15)
 Value for MIN4 (>4) ? (     100)
 Value for MIN2 : (       6)
 Value for MIN3 : (      15)
 Value for MIN4 : (     100)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

5.8 PDB file

Provide the name of the PDB file which contains ALL molecules. Note that the molecules must have been superimposed previously (e.g., with O or LSQMAN; LSQMAN contains a BRute_force command to find structural alignments "ab initio"). Any two consecutive molecules in the file must have different chain identifiers. However, the identifiers do not all have to be unique (which would otherwise limit you to a maximum of 26 molecules); e.g., you could alternate chain identifiers A and B. Note that the program *ONLY* reads the CA atoms, so you can make your files considerably smaller by only including these (e.g.: grep ^ATOM myfile.pdb | grep ' CA ' > new.pdb).

The example below is for a PDB file which contains a number of superimposed glutathione S-transferase structures.

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 Name of PDB file ? (aligned.pdb) 1rbp/aligned.pdb
 Name of PDB file : (1rbp/aligned.pdb)
 Nr of CA atoms : ( 840)
 Nr of molecules : ( 5)
   
 Mol # 1 Atoms 1 to 174
 Mol # 2 Atoms 175 to 350
 Mol # 3 Atoms 351 to 510
 Mol # 4 Atoms 511 to 683
 Mol # 5 Atoms 684 to 840
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
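If you prefer not to rely on grep, the CA-only filtering can be sketched in Python using the fixed columns of the PDB format (atom name in columns 13-16, alternate-location indicator in column 17). This is a hypothetical helper, not part of STRUPAT; keeping only blank and "A" alternate locations mirrors the "skip alt. conf. (B, C, ...)" behaviour mentioned in the version history, but the program's exact rule is an assumption here.

```python
def ca_only(pdb_lines):
    """Keep ATOM records of CA atoms; drop alternate conformations B, C, ..."""
    kept = []
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue                      # skips HETATM, REMARK, TER, ...
        if line[12:16] != " CA ":
            continue                      # atom name, columns 13-16
        if line[16] not in (" ", "A"):
            continue                      # alt. location, column 17
        kept.append(line)
    return kept

demo = [
    "ATOM      1  N   LYS A  12      11.104  13.207   2.100\n",
    "ATOM      2  CA  LYS A  12      12.560  13.329   2.266\n",
    "ATOM      3  CA BLYS A  12      12.611  13.402   2.301\n",
    "HETATM    4 CA    CA A 201      20.000  20.000  20.000\n",
]
print("".join(ca_only(demo)), end="")
```

Unlike a plain grep for ' CA ', the column test cannot be fooled by calcium ions, although the ^ATOM filter in the shell command already excludes those.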

6 OUTPUT

STRUPAT will now start looking for residues that are structurally equivalent in all aligned structures (i.e., a residue in the first protein has a partner in each of the other structures within the cut-off distance). When it encounters such a residue, it checks to see if neighbouring residues (on either side) also have partners in all the other structures (now using the second distance cut-off).

In this way, a set of residues is equivalenced between all structures. However, the structural superposition may not always be optimal, so the program will try to detect and fix any frameshift errors. It does this simply by checking for each structure if shifting the alignment to the first structure by one residue forward or backward would improve the superpositioning RMSD. If so, the equivalenced residues are altered accordingly, and the frameshift test is carried out again, until no more frameshifts occur.

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 ----------------------------------------------------------------------
 Shift mol 5 by +1 (RMSD -1/0/+1 : 6.2 4.2 2.9 A)
   
 ----------------------------------------------------------------------
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

At that stage, the program will again try to extend the alignments in both directions using the extension distance cut-off. If the resulting conserved set of residues contains at least the minimum number of residues defined by the user, a potential pattern has been found.
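The first step of this procedure (finding residues of the first molecule that have a partner in every other molecule) can be sketched as follows. This is a simplified illustration and not STRUPAT's actual code: it takes the nearest CA atom within the cut-off as the partner, which is an assumption, and it leaves out the extension and frameshift steps. Coordinates are plain (x, y, z) tuples and the function names are made up.

```python
import math

def nearest_within(ca, others, cutoff):
    """Index of the closest CA in `others` within `cutoff` (A), or None."""
    best, best_d = None, cutoff
    for j, other in enumerate(others):
        d = math.dist(ca, other)
        if d <= best_d:
            best, best_d = j, d
    return best

def equivalenced(molecules, cutoff=3.5):
    """Map each residue index of the first molecule to its partners.

    A residue is kept only if every other molecule has a CA atom
    within the cut-off distance of it.
    """
    reference, rest = molecules[0], molecules[1:]
    result = {}
    for i, ca in enumerate(reference):
        partners = [nearest_within(ca, mol, cutoff) for mol in rest]
        if all(p is not None for p in partners):
            result[i] = partners
    return result

mols = [
    [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)],    # molecule 1
    [(0.5, 0.0, 0.0), (4.0, 0.0, 0.0)],    # molecule 2
    [(0.0, 1.0, 0.0), (20.0, 0.0, 0.0)],   # molecule 3
]
print(equivalenced(mols))
```

In this toy input, the second residue of molecule 1 has no partner in molecule 3 within 3.5 A, so only the first residue is equivalenced.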

For every (potential) pattern that the program discovers, the output includes:

- a listing of the first residue of the stretch of structurally conserved residues in every molecule

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 New structurally conserved stretch !
 Starts at residue LYS -   12
   molecule    2 @ LYS -   12
   molecule    3 @ VAL -    3
   molecule    4 @ VAL -   15
   molecule    5 @ GLY -   11
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

- for every residue, the amino-acid type in each molecule, and the program's "reduction" of it in terms of PROSITE pattern elements. For instance, a strictly conserved glycine will be "reduced" to "G", whereas "|YFFY|" would yield "[FY]". (The precise reduction depends on the substitution-group mode chosen !)

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 |KKVVG| ==>  X -
 |EEKDR| ==>  X -
 |NNDNN| ==>  [DN] -
 |FFFFF| ==>  F -
 |DDDDN| ==>  [DN] -
 |KKIWV| ==>  X -
 |AASSE| ==>  X -
 |RRKNK| ==>  X -
 |FFFYI| ==>  [FILVY] -
 |SALHN| ==>  X -
 |GGGGG| ==>  G -
 |TTFKE| ==>  X -
 |WWWWW| ==>  W -
 |YYYWH| ==>  X -
 |AAEET| ==>  X -
 |MMIVI| ==>  [ILMV] -
 |AAAAI| ==>  X -
 |KKFKL| ==>  X -
 |KKAYA| ==>  X -
 |DDSPS| ==>  X -
 |PPKND| ==>  X -
 |EEMSK| ==>  X -
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

- some information about the aligned set of residues, namely their number, and the RMS (RMSD) value (in Å). This is calculated from all Nmol*(Nmol-1)/2 possible pair-wise superpositionings of this stretch of residues.

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 Nr of residues  : (      22)
 RMS (RMSD) (A)  : (   1.640)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

- now the program goes to work and "reduces" the partial PROSITE patterns, by collecting sequential "X"s and stripping any "X"s from the start and end

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 PROSITE pattern : ([DN] - F - [DN] - X(3) - [FILVY] - X - G - X - W - X(2)
   - [ILMV])
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

- the program calculates the probability of finding this pattern by chance (using the same formula as EMOTIF). The number of false positives is roughly this probability times the number of amino-acid residues in the database against which the scan is performed (2,000,000 for STRUPAT's internal testing)

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 Probability     : (  3.711E-08)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

- the program calculates two scores to help you judge the value of the pattern.

"Score 1" is calculated as SUM 10LOG(Ntotal/Nposs), i.e. a sum of base-10 logarithms, where the sum extends over all residues in the pattern, Ntotal is the number of aligned structures, and Nposs is the number of different residue types that occur in that position IF (and only if) the residue resulted in a non-"X" partial pattern. E.g., if there are four different residue types for four sequences, (Ntotal/Nposs) will be 1, and the contribution to the sum of logs will be zero. If only two different residue types are observed in 30 different structures, the contribution will be 10LOG(20/2): Ntotal is capped at 20, since that is the maximum number of possible different residue types. The higher the total sum, the more specific information the pattern contains. Usually, this is strongly correlated to the length of the pattern.

"Score 2" is an integer calculated as the sum, over all residues of the pattern, of a subjective score for the quality of each pattern element. The subjective score ranges from 0 (for an "X" entry) to 10 (for a strictly conserved residue type).

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 Score 1         : (   6.884)
 Score 2         : (      52)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

- finally, the program will check the random sequence it prepared earlier to see how often the pattern occurs in it. If it occurs more than a few times, the pattern is probably not suitable for searching against a database, since it is likely to result in many false positives.

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 Nr of matches to random sequence : (          0)
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
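The "reduction" of partial patterns described above (collecting sequential "X"s and stripping them from both ends) can be sketched in Python as shown below. This is an illustration, not the program's own code; the function name collapse is made up. The input is the column of partial patterns from the example listing above.

```python
from itertools import groupby

def collapse(elements):
    """Strip leading/trailing X's and merge runs of X into X(n)."""
    items = list(elements)
    while items and items[0] == "X":
        items.pop(0)
    while items and items[-1] == "X":
        items.pop()
    out = []
    for key, run in groupby(items):        # consecutive equal elements
        n = len(list(run))
        if key == "X":
            out.append("X" if n == 1 else f"X({n})")
        else:
            out.extend([key] * n)          # conserved elements stay separate
    return "-".join(out)

partial = ["X", "X", "[DN]", "F", "[DN]", "X", "X", "X", "[FILVY]", "X",
           "G", "X", "W", "X", "X", "[ILMV]", "X", "X", "X", "X", "X", "X"]
print(collapse(partial))
```

For this input the sketch gives the same pattern as in the example output, [DN]-F-[DN]-X(3)-[FILVY]-X-G-X-W-X(2)-[ILMV], which spans 14 residues.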

7 RESULTS

When the program has finished, it will print a summary of the PROSITE patterns of sufficient length. For every pattern the following is listed:

- nr of the pattern
- length (nr of residues that it spans)
- probability of matching the pattern by chance
- score (the "Score 1" value described above, rounded to one decimal)
- number of chance matches found in the program's random sequence
- number of chance matches expected in the program's random sequence

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 Nr of PROSITE patterns found : (       3)
   
    # Leng  Probability  Score      #Random    #Expected
    1   14   3.7111E-08    6.9            0   7.4223E-02
 [DN]-F-[DN]-X(3)-[FILVY]-X-G-X-W-X(2)-[ILMV]
   
    # Leng  Probability  Score      #Random    #Expected
    2   11   3.8500E-06    5.3           10   7.7001E+00
 [IV]-X(2)-T-D-X(3)-[FY]-X-[ILMV]
   
    # Leng  Probability  Score      #Random    #Expected
    3   33   3.1830E-05    2.6          217   6.3660E+01
 [ILV]-[FLY]-X-R-X(11)-[FILMV]-X(16)-[FILVY]
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

You can now take these patterns and scan them against SWISSPROT/TrEMBL.
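The #Expected column in the summary is simply the chance probability multiplied by the length of the sequence that is scanned. The short Python sketch below (hypothetical names, not part of STRUPAT) reproduces the values in the example and shows how you could estimate the number of false positives for a database of a different size.

```python
RANDOM_LENGTH = 2_000_000   # length of STRUPAT's internal random sequence

def expected_matches(probability, database_length=RANDOM_LENGTH):
    """Rough number of chance matches: probability times residues scanned."""
    return probability * database_length

# Probabilities taken from the three patterns in the example summary.
for nr, prob in enumerate([3.7111e-08, 3.8500e-06, 3.1830e-05], start=1):
    print(f"pattern {nr}: expected {expected_matches(prob):.4e}")
```

Comparing this number with #Random is instructive: for pattern 3 in the example, 217 matches were found where about 64 were expected, so the probability estimate and the observed count need not agree closely.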

8 PATTERN REDUCTION

The program uses a simple algorithm to reduce a string of residue types to a PROSITE sub-pattern. The original STRUPAT algorithm uses the following cases:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 CASE #     "Score 2"   Description
 ------     ---------   ---------------------------------------------------------
    1          10       absolutely conserved residue type
    2           8       (potentially) negatively charged residue [DE]
    3           8       (potentially) positively charged residue [RK]
    4           8       only Phe and Tyr occur
    5           6       only Phe and Tyr and Trp occur
    6           5       only Asn and Gln occur
    7           5       only Ser and Thr occur
    8           5       only Ala and Gly occur
    9           6       only Ile and Leu and Val occur
   10           3       only 2 different types occur and at least NMIN2 sequences
   11           2       only 3 different types occur and at least NMIN3 sequences
   12           1       only 4 different types occur and at least NMIN4 sequences
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----

The recommended EMOTIF substitution-group mode uses the following cases:

      
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
 CASE #     "Score 2"   Description
 ------     ---------   ---------------------------------------------------------
    1          10       absolutely conserved residue type
    2           8       sets of 2: IV, FY, HY, KR, EQ, DE, DN, ST, AS
    3           6       sets of 3: ILV, FLY, FWY, KQR, EKQ, AST
    4           4       sets of 4: ILMV, EKQR
    5           2       sets of 5: FILMV, FILVY
    6           1       sets of 6: FILMVY
 ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE ----- EXAMPLE -----
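The EMOTIF table can be turned into a small Python sketch that reduces one alignment column to a pattern element and its "Score 2" contribution. This is an illustration under assumptions, not the program's code: it picks the smallest listed set that contains all observed residue types, and when two sets of equal size qualify it takes the one listed first (how STRUPAT breaks such ties is not documented here).

```python
# Substitution groups and scores copied from the EMOTIF table above.
EMOTIF_GROUPS = ["IV", "FY", "HY", "KR", "EQ", "DE", "DN", "ST", "AS",
                 "ILV", "FLY", "FWY", "KQR", "EKQ", "AST",
                 "ILMV", "EKQR", "FILMV", "FILVY", "FILMVY"]
SCORE2 = {1: 10, 2: 8, 3: 6, 4: 4, 5: 2, 6: 1}

def reduce_column(column):
    """Return (pattern element, Score 2 contribution) for one column."""
    observed = set(column)
    if len(observed) == 1:
        return column[0], SCORE2[1]        # absolutely conserved
    candidates = [g for g in EMOTIF_GROUPS if observed <= set(g)]
    if not candidates:
        return "X", 0
    group = min(candidates, key=len)       # smallest set; first one on ties
    return f"[{group}]", SCORE2[len(group)]

columns = ["KKVVG", "EEKDR", "NNDNN", "FFFFF", "DDDDN", "KKIWV", "AASSE",
           "RRKNK", "FFFYI", "SALHN", "GGGGG", "TTFKE", "WWWWW", "YYYWH",
           "AAEET", "MMIVI", "AAAAI", "KKFKL", "KKAYA", "DDSPS", "PPKND",
           "EEMSK"]
reduced = [reduce_column(c) for c in columns]
print([element for element, _ in reduced])
print("Score 2:", sum(score for _, score in reduced))
```

Applied to the 22 columns of the example in section 6, this gives the same pattern elements as the program's listing and a "Score 2" of 52, in agreement with the example output.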

9 KNOWN BUGS

None, at present ("peppar, peppar").

10 UNKNOWN BUGS

Does not compute.

Created at Fri Jan 14 20:12:42 2005 by MAN2HTML version 050114/2.0.6 . This manual describes STRUPAT, a program of the Uppsala Software Factory (USF), written and maintained by Gerard Kleywegt. © 1992-2005.