bcantarel.github.io

Course Material

CSHL Sequence Homology and Alignment Workshop

CD Search and HMM Profiles Searching

>NP_416557.1 GDP-mannose 4,6-dehydratase [Escherichia coli str. K-12 substr. MG1655]
MSKVALITGVTGQDGSYLAEFLLEKGYEVHGIKRRASSFNTERVDHIYQDPHTCNPKFHLHYGDLSDTSN
LTRILREVQPDEVYNLGAMSHVAVSFESPEYTADVDAMGTLRLLEAIRFLGLEKKTRFYQASTSELYGLV
QEIPQKETTPFYPRSPYAVAKLYAYWITVNYRESYGMYACNGILFNHESPRRGETFVTRKITRAIANIAQ
GLESCLYLGNMDSLRDWGHAKDYVKMQWMMLQQEQPEDFVIATGVQYSVRQFVEMAAAQLGIKLRFEGTG
VEEKGIVVSVTGHDAPGVKPGDVIIAVDPRYFRPAEVETLLGDPTKAHEKLGWKPEITLREMVSEMVAND
LEAAKKHSLLKSHGYDVAIALES

Questions:

  1. What domains are identified using each method?
  2. What is the expectation value for these hits?
  3. What percent of the protein is “covered” by these domains?
  4. How many iterations of psi-search are necessary to find the protein ABNA_ASPNG?
  5. What domain(s) match ABNA_ASPNG and P45796? Hint: See the Visual Output

Write a simple global nucleotide alignment program of your own for sequences of similar size (think amplicon sequences) using Python

The inputs to your program should be the match and mismatch scores and two sequences. Include an option to calculate the score for each combination of the sequences and their reverse complements. For example:

  1. What is the score of the alignment using the scores 1,-1 and the sequences:
    • sequence1 = “agtctgtca”
    • sequence2 = “gatctctgc”
  2. What are the scores for each combination of the sequences and their reverse complements, using the scores 2,-2?
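
One possible starting point is sketched below. It is not a reference solution: the exercise only specifies match and mismatch scores, so the linear gap penalty (set equal to the mismatch score by default) is an assumption, as are the function names `reverse_complement`, `global_score`, and `all_orientation_scores`. The score is computed with a standard Needleman-Wunsch recurrence, keeping only one row of the matrix at a time.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (a, c, g, t only)."""
    complement = {"a": "t", "c": "g", "g": "c", "t": "a"}
    return "".join(complement[base] for base in reversed(seq.lower()))


def global_score(seq1, seq2, match=1, mismatch=-1, gap=None):
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    Assumption: when no gap penalty is given, a gap costs the same as a
    mismatch. The exercise itself does not define a gap score.
    """
    if gap is None:
        gap = mismatch
    seq1, seq2 = seq1.lower(), seq2.lower()
    # Row 0: aligning the empty prefix of seq1 against prefixes of seq2.
    prev = [j * gap for j in range(len(seq2) + 1)]
    for i in range(1, len(seq1) + 1):
        curr = [i * gap]
        for j in range(1, len(seq2) + 1):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            curr.append(max(prev[j - 1] + pair,   # align the two bases
                            prev[j] + gap,        # gap in seq2
                            curr[j - 1] + gap))   # gap in seq1
        prev = curr
    return prev[-1]


def all_orientation_scores(seq1, seq2, match=1, mismatch=-1, gap=None):
    """Score every combination of the two sequences and their reverse complements."""
    forms1 = {"seq1": seq1, "rc(seq1)": reverse_complement(seq1)}
    forms2 = {"seq2": seq2, "rc(seq2)": reverse_complement(seq2)}
    return {(name1, name2): global_score(s1, s2, match, mismatch, gap)
            for name1, s1 in forms1.items()
            for name2, s2 in forms2.items()}


sequence1 = "agtctgtca"
sequence2 = "gatctctgc"
print("score (1,-1):", global_score(sequence1, sequence2, 1, -1))
for names, score in all_orientation_scores(sequence1, sequence2, 2, -2).items():
    print(names, score)
```

Because reversing and complementing both sequences leaves the alignment unchanged, only two of the four orientation scores are distinct, which makes a handy sanity check for your own program.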

Write a parser for FASTA program output to determine percent alignment coverage for each hit

The input for your program should be this file

The output should include: Hit Name, Percent Query Coverage (alignment length/Query Length), Bit Score, Alignment Length, Query Length

Query: @
  1>>>sp|P45796.1|XYND_PAEPO RecName: Full=Arabinoxylan arabinofuranohydrolase; Short=AXH; AltName: Full=AXH-m2,3; Short=AXH-m23; AltName: Full=Alpha-L-arabinofuranosidase; Short=AF; Flags: Precursor - 635 aa
Library: UniProtKB/Swiss-Prot
  200544181 residues in 558590 sequences


>>SP:XYNA2_CLOSR P33558 Endo-1,4-beta-xylanase A
OS=Clostridium stercorarium OX=1510 GN=xynA PE=1 SV=2 (512 aa)
 s-w opt: 230  Z-score: 447.3  bits: 92.7 E(558590): 2.3e-17
Smith-Waterman score: 230; 32.9% identity (61.4% similar) in 249 aa overlap (398-618:267-497)

Here is an example: SP:XYNA2_CLOSR 39% 92.7 249 635

The objective of this exercise is to practice regular expressions using the original FASTA program output.
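
A minimal sketch of the regular-expression approach is shown below, using the output excerpt above as test input. The patterns and the names `parse_fasta_output` and `SAMPLE` are illustrative assumptions; they are based only on the lines shown in this excerpt, so the full file may need adjustments (for example, hits without an overlap line are simply skipped here).

```python
import re

# Excerpt of the FASTA program output shown in this exercise.
SAMPLE = """Query: @
  1>>>sp|P45796.1|XYND_PAEPO RecName: Full=Arabinoxylan arabinofuranohydrolase; Short=AXH; AltName: Full=AXH-m2,3; Short=AXH-m23; AltName: Full=Alpha-L-arabinofuranosidase; Short=AF; Flags: Precursor - 635 aa
Library: UniProtKB/Swiss-Prot
  200544181 residues in 558590 sequences


>>SP:XYNA2_CLOSR P33558 Endo-1,4-beta-xylanase A
OS=Clostridium stercorarium OX=1510 GN=xynA PE=1 SV=2 (512 aa)
 s-w opt: 230  Z-score: 447.3  bits: 92.7 E(558590): 2.3e-17
Smith-Waterman score: 230; 32.9% identity (61.4% similar) in 249 aa overlap (398-618:267-497)
"""

# Query line: "<n>>>>name ... - 635 aa"; capture the trailing length.
QUERY_RE = re.compile(r"^\s*\d+>>>.*\s(\d+)\s+aa\s*$", re.MULTILINE)
# Hit header: a line starting with exactly two '>' characters.
HIT_RE = re.compile(r"^>>(?!>)(\S+)", re.MULTILINE)
BITS_RE = re.compile(r"bits:\s*([\d.]+)")
OVERLAP_RE = re.compile(r"in\s+(\d+)\s+aa overlap")


def parse_fasta_output(text):
    """Return (hit, percent coverage, bit score, alignment length, query length) rows."""
    query = QUERY_RE.search(text)
    if query is None:
        raise ValueError("no query line found")
    query_len = int(query.group(1))

    headers = list(HIT_RE.finditer(text))
    rows = []
    for i, header in enumerate(headers):
        # A hit's block runs from its header to the next header (or end of text).
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        block = text[header.end():end]
        bits = BITS_RE.search(block)
        overlap = OVERLAP_RE.search(block)
        if bits is None or overlap is None:
            continue  # incomplete record; skipped in this sketch
        aln_len = int(overlap.group(1))
        coverage = 100.0 * aln_len / query_len
        rows.append((header.group(1), coverage, float(bits.group(1)),
                     aln_len, query_len))
    return rows


rows = parse_fasta_output(SAMPLE)
for name, coverage, bits, aln_len, query_len in rows:
    print(f"{name} {coverage:.0f}% {bits} {aln_len} {query_len}")
```

Run on the excerpt, this prints the same columns as the example line above: hit name, percent query coverage, bit score, alignment length, and query length.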

Hints:

Parse this file and create output with the following columns