Scanning Protein with DNA sequence

Next: APPENDIX I - Revision Up: Getting more from SCANPS Previous: The scanps_gapdefs.dat file

Scanning Protein with DNA sequence

Note: This is not availble in Version 2.4.x, but may come back in a later version.

Version 2.3 allows a DNA sequence to be compared to a protein database (MODE 22). Scanps first translates the DNA in three forward reading frames to create a pseudo sequence of amino acids where every base is an amino acid residue or STOP. This sequence is then compared to each protein sequence using a dynamic programming algorithm. The result is an alignment of the DNA with a protein that can include frameshifts.

Additional adjustable parameters are:

FS_PEN and STOP_WEIGHT. These are defined in the command summary APPENDIX II.

Work is in progress to establish suitable values for these parameters. I have found that a large and negative value for STOP_WEIGHT is appropriate. FS_PEN should be set larger than LI_PEN and LD_PEN.

Here is an example of the comparison of the primary transcript of human alpha haemoglobin versus the corresponding protein sequence. The DNA has been modified to introduce a single frameshift error, an additional G at position 356.

Here is the command line:

scanps -s hahunucerror.fasta -d hahu.fasta -db_format 1 -mode 22 \
-ugd 0 -m PAM250 -ld_pen 0.2 -li_pen 12 -fs_pen 24 -qseq_format 1

we are scanning a database of only one sequence (-d hahu.fasta), overriding the gap-penalty defaults for this MODE and matrix (-ugd 0) and supplying our own gap-penalties (-ld_pen, -li_pen, -fs_pen). The query sequence is in FASTA format, but we have set the default sequence format to 0 (PIR) in the scanps_defaults.dat file. Accordingly, we need to override the default on the command line (-qseq_format).

Here is the output:

# SCANPS Scan: Mon Aug 11 15:36:04 1997
# SCANPS Version: 2.3 (Fri Jul  4 16:16:41 BST 1997)
# Author:  G. J. Barton,  University of Oxford, UK
# Query file: hahunucerror.fasta
# Query ID: HSHBA4
# Query Title: Dummy title inserted by gseq_fasta                              
# Query Length: 835
# SCANPS MODE: 22
# Matrix file: /gjb/delly/gjb/c/scanps/gjscanps/mat//PAM250
# Matrix title: NCBI Matrix - see file for type
# LD penalty:   0.20
# LI penalty:  12.00
# FS  penalty:  24.00
# MAX number of scores output: 1
# MAX number of alignments output: 10000
# MIN score to output: 769.70
# MIN score of alignments to  output: 769.70
# Number of residues/bases in database: 141
# Score type: ln() scores
# Start_Scores: ----------------------------------------------------
   1 769 HAHU Hemoglobin alpha chain - Human and chimpanzees    
# End_Scores: ------------------------------------------------------
# -------------------------------------------------------------------
# Start_Alignments_Rank: 1
# Query: HSHBA4 Target: HAHU Number_of_Alignments: 1
# Target_Title: Hemoglobin alpha chain - Human and chimpanzees                                                                                                                                                          
# N: 1 Raw_Score: 566 Query_Length: 835 Target_Length: 141 ln()_Score: 769
     41 gctcggaaagaggtgagggcgggtggggcgatgtccttgcgttcgcacac
        ttcccaacataccggatgcacgaagcactagggctcgcacgccccgcagc
        ggttccgcccgccgtgccgctcgttggcgggacccccccgccgccgccga     190
     41 VLSPADKTNVKAAWGKVGAHAGEYGAEALER!GSLPCSDPGSSPARTHRP      90
        *******************************                   
      1 VLSPADKTNVKAAWGKVGAHAGEYGAEALER                         31

    191 ctagcgcgcacacctgtccaatcttcaaaattcctgcacgtgcgagcgaa
        cccttccacacccaccccggtttctcccacatcatatgagccatagagaa
        caccgcgcaccctctttccggcgcccccgcccgcccgccctcgtgcccgg     340
     91 PSTVLAPDPNPTPHSASPRRMFLSFPTTKTYFPHFDLSHGSAQVKGHGKK     140
                            ******************************
     32                     MFLSFPTTKTYFPHFDLSHGSAQVKGHGKK      61

    341 ggggcGaagggcgggacagctgcagccgcaccggcgataGtgggcgctgt
        tcact cactcataatcactcctgatacaatgtactata gccggagcgc
        gcccg cccggcgccgccggccgccgcgcgtggcgcccg aggcggatgg     486
    141 VADAL^TNAVAHVDDMPNALSALSDLHAHKLRVDPVNFK^!AAGRERSGS     188
        ***** *********************************           
     62 VADAL^TNAVAHVDDMPNALSALSDLHAHKLRVDPVNFK^                99

    487 aggagcttcgagtcgtcggtcaccccgtggcacttgcccactccgacggc
        ggatccccagggcggtgataggggtgcgcctcttccattgagtttctcca
        gcgggtctgcaaacggggggcgggggcgcagccctagcacccgggcgccc     636
    189 RGEMAPSSQGRGSRGLREV!RRRRLRAWAALTLFSAQLLSHCLLVTLAAH     238
                                             *************
    100                                      LLSHCLLVTLAAH     112

    637 ccggtacggcgtcgatcgtgaagcatatc
        tccatccctacctaattcctgcttccaag
        cccgcctggcccgcgcgttgccggccact     723
    239 LPAEFTPAVHASLDKFLASVSTVLTSKYR     267
        *****************************
    113 LPAEFTPAVHASLDKFLASVSTVLTSKYR     141

# End_N: 1
# End_Alignments_Rank: 1
# ------------------------------------------------------
# Scan Completed at: Mon Aug 11 15:36:04 1997
# 
# Times:                              (Seconds)
# Load matrix, query and index:        0.0     0.0     0.0     0.0
# Load database:                       0.0     0.0     0.0     0.0
# Scan database:                       0.1     0.1     0.9     0.9
# Sort results :                       0.0     0.1     0.0     0.9
# Output scores:                       0.0     0.1     0.0     0.9
# Generate and write alignments:       0.1     0.2     0.1     1.0
# Millions of cell updates/sec in scan (CPU):   1.18 (Elapsed):   0.13

The output shows the base triplets arranged VERTICALLY above the corresponding amino acid residue. The protein sequence is shown conventionally below this. Frameshifts are indicated by a caret symbol "" and the bases involved in the frameshift are shown in uppercase. You may see odd effects around frameshifted gaps in alignments. My advice is to look carefully at any frameshifted gaps to see if there might be alternative alignments by taking a different reading frame in the region, or shifting the frameshift gap a base or so to either side. With any dynamic programming method, there may be more than one, equally valid alignment in a region, but the program only reports one solution. There are more possibilities for such alternatives in DNA vs protein comparisons than for DNA v DNA or protein v protein.

As well as MODE 22, there is MODE 20. MODE 20 does not have affine gap penalties and as a consequence it is about a factor fo 3 faster than MODE 22. Penalties for frameshifts are length dependent in MODES 20 and 22, thus if FS_PEN is 8, then a single frameshift costs 8 and a double costs 16.

Next: APPENDIX I - Revision Up: Getting more from SCANPS Previous: The scanps_gapdefs.dat file

Geoff Barton (GJB) 2002-07-23