Lab2

docx

School

Johns Hopkins University

Course

633

Subject

Biology

Date

Jan 9, 2024

Type

docx

Pages

10

Uploaded by murphydanyael

Intro to Bioinformatics - Lab Two
Name: Danyael Murphy

1. Goal: Examine the effect of search space on E value. Fill in the table and answer questions.
   Given: The protein sequence with accession XP_001741646
   Search #1: BLASTP vs the nr database with no organism limits
   Search #2: BLASTP vs the pdb database, organism limited to Archamoebae (taxid:555406)
   Target: Look for and report the best match from the pdb database (#3OJ7_A)

   a. (0.2) Fill out the table below.

      Search      Search space       Total score (3OJ7_A)   E value
      Search #1   nr, no limit       220                    8e-72
      Search #2   pdb, Archamoebae   220                    1e-75

   b. (0.5) Explain any discrepancies between score and E value of the two searches for the 3OJ7_A result. Specifically, where are the differences and what is the cause of the difference? Please answer in terms of the E value equations that we have discussed in class.
      i. The total score is identical in both searches; only the E value differs. The search space size, N, decreases when the search is narrowed to the pdb database and the organism Archamoebae. Since E = N / 2^S’ (N = search space size, S’ = bit score), a smaller N with an unchanged score gives a smaller E value.

   c. (0.1) Click the link for 3OJ7_A and then follow the Structures link. Name the Super Family.
      Histidine Triad Family Protein
   d. (0.1) When was this structure deposited?
      8/20/2010

   e. (0.1) What two Chemicals and Non-standard biopolymers are listed?
      Zinc Ion
      Sulfate Ion

2. Goal: Calculate the bit score (S’) to two decimal places based on the raw score S.
   Given: The BLASTP query sequence AAA52280.1.
   Given: The equation S’ = (λS - ln K) / ln 2
   Search #1: Use all defaults but limit to reference proteins (refseq_protein) and limit the organism to archaea (taxid:2157).
   Search #2: Same as search #1 except change the matrix to PAM 70 and change the Gap Costs to 7/2.
   Target: Examine the best match (lowest E value) in each search.
   Hint #1: Look at the alignment for the best match. Next to the bit score (S’) is the raw score in parentheses. That is S.
   Hint #2: For each search, click Search Summary to obtain λ and K (use the right column, not the left).

   a. (0.5) Enter the data below. The last column should contain the S’ value you calculated using the equation.

      Matrix      S (from BLAST)   S’ (from BLAST)   λ       K       S’ (calculated to two decimal places)
      BLOSUM 62   231              93.6              0.267   0.041   93.59
      PAM 70      191              82.2              0.286   0.093   82.24

   b. (0.2) How well do the two S’ values match the S’ value in BLAST?
      i. They are the same once rounded to the one decimal place that BLAST reports.

   c. (0.3) Explain why both λ and K changed when the matrix was changed and why that is important in calculating the adjusted score in bits.
      i. The significance of the raw score depends on K and λ, which can change with the scoring matrix and gap costs as well as with the residue composition of the sequences being examined.
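The calculated column in the question 2a table can be checked with a few lines of Python. This is a minimal sketch, assuming the equation S’ = (λS - ln K) / ln 2 given above and the λ, K and S values from the table; the function name `bit_score` and the dictionary layout are illustrative choices, not part of the lab.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Convert a raw score S to a bit score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Values copied from the question 2a table (S, lambda and K as reported by BLAST).
rows = {
    "BLOSUM62": {"S": 231, "lam": 0.267, "K": 0.041},
    "PAM70": {"S": 191, "lam": 0.286, "K": 0.093},
}

# Round to two decimal places, as the question asks.
calculated = {
    name: round(bit_score(r["S"], r["lam"], r["K"]), 2) for name, r in rows.items()
}

for name, value in calculated.items():
    print(f"{name}: S' = {value:.2f}")
```

Both results round to the one-decimal bit scores that BLAST reports (93.6 and 82.2), which is the comparison question 2b asks for.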
3. Goal: Perform pairwise alignment on two seemingly different proteins. Answer some questions about the alignment and the proteins.
   Given: BLASTP pairwise alignment, default values.
   Given: The following two sequences:
      >NP_476761.3 abnormal wing discs, isoform C [Drosophila melanogaster]
      >CAA35621.1 Nm23 protein, partial [Homo sapiens]

   a. (0.2) Based on the bits and the E value, are these two proteins homologous? Briefly explain what led you to that conclusion.
      Yes, the E value is less than 1 x 10^-4.

   b. (0.2) List the % identity and the % positives and briefly explain the difference between identity and similarity.
      % identity = (119/157) 76%
      % positives = (135/157) 85%
      Identity refers to aligned residues that are an exact match, while similarity includes both exact matches and conservative substitutions.

   c. (0.1) What gap-existence penalty was used to create this alignment? What gap extension penalty?

      Existence   Extension
      -11         -1

   d. (0.2) Look at the last (C-terminal) 10 amino acids in the alignment. Using the BLOSUM62 matrix (see lecture notes or use R) to evaluate the matches and mismatches, score the last 10 amino acids in the alignment. Show your work (or the R command and output).
      TSCAQNWIYE
      TPAAKDWIYE
      (5)(-1)(0)(4)(1)(1)(11)(4)(7)(5) = 37 - 11 = 26

   e. (0.05) Go to the human protein record and find the reference by Rosengard et al. What is stated to be the principal cause of death for cancer patients?
      Tumor metastasis

   f. (0.05) What does awd (the name of the fly protein) stand for?
      Abnormal wing discs

   g. (0.2) On the surface, it might seem surprising that two such proteins would align. According to the abstract, how are tumor metastases related to abnormal wing disc development in flies?
      Mutations in awd cause abnormal tissue morphology and necrosis and widespread aberrant differentiation in Drosophila, analogous to changes in malignant progression. The metastatic state may therefore be determined by the loss of genes such as nm23/awd which normally regulate development.
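The per-column scores listed in question 3d can be tallied programmatically. The sketch below hard-codes only the ten BLOSUM62 entries needed for these two C-terminal segments (an illustrative subset, not the full matrix) and reproduces the sum of the pair scores; it does not model the further subtraction of 11 shown in the answer. The names `BLOSUM62_SUBSET`, `score_columns`, `top_tail` and `bottom_tail` are assumptions made for this example.

```python
# Only the BLOSUM62 entries needed for the ten aligned columns in question 3d.
BLOSUM62_SUBSET = {
    ("T", "T"): 5, ("S", "P"): -1, ("C", "A"): 0, ("A", "A"): 4,
    ("Q", "K"): 1, ("N", "D"): 1, ("W", "W"): 11, ("I", "I"): 4,
    ("Y", "Y"): 7, ("E", "E"): 5,
}


def score_columns(seq_a: str, seq_b: str, matrix: dict) -> list:
    """Return the substitution score of each aligned column (ungapped segments only)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("segments must have the same length")
    return [matrix[(a, b)] for a, b in zip(seq_a, seq_b)]


top_tail = "TSCAQNWIYE"      # first row of the alignment as written in 3d
bottom_tail = "TPAAKDWIYE"   # second row of the alignment as written in 3d

pair_scores = score_columns(top_tail, bottom_tail, BLOSUM62_SUBSET)
total = sum(pair_scores)

# Identities are exact matches; positives are columns with a score above zero.
identities = sum(a == b for a, b in zip(top_tail, bottom_tail))
positives = sum(score > 0 for score in pair_scores)

print(pair_scores, total, identities, positives)
```

Counting identities and positives in the same loop mirrors the distinction drawn in question 3b: every identity is a positive, but conservative substitutions such as Q/K and N/D are positives without being identities.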
4. Goal: Compare E values from BLAST and FASTA.
   Query: CAJ14017
   Target: P68368
   Search #1: BLASTP with limits mouse (10088), Swiss-Prot
   Search #2: protein-protein FASTA with database Mouse Uniprot ref
   For FASTA, change matrix to BlastP62 so it matches BLAST.

   a. (0.2) Describe the query protein (what protein, what organism?).
      i. Bacterial tubulin A1 from Prosthecobacter debontii

   b. (0.4) Fill in the table.

      Search   Bits (P68368)   E value (P68368)
      BLASTP   269             5e-74
      FASTA    282.7           3.6e-76

   c. (0.4) What is the primary reason that BLAST and FASTA E values differ?
      i. BLAST is mostly involved in finding ungapped, locally optimal sequence alignments, whereas FASTA is involved in finding similarities between less similar sequences.

5. Goal: Find a cow protein with "NEMO" in the protein name field in the NCBI protein database. Limit to the RefSeq database.
   Goal: Once you have found NEMO, run some translated BLAST searches and answer questions.
   Resources: NCBI protein database, BLAST suite of programs.

   a. (0.1) List the accession number of NEMO.
      NP_776779
      XP_002707239

   b. (0.1) From what breed of cow was this sequence derived?
      Dairy cow

   Do a translated BLAST search (default values) using the protein from part a and search the expressed sequence tags (est) nucleotide database, limiting your output to human ESTs.

   c. (0.1) What version of BLAST must be used?
      tblastn

   d. (0.1) For the best matching EST, list:

      E value            0.0
      Bits               607
      % Identity         85
      % Similarity       90
      Alignment Length   1962
   e. (0.1) Explain why amino acids are shown in the alignment in a search of a nucleotide database.
      The program TBLASTN compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.

   f. (0.1) Open the nucleotide record of the best-matching EST. From what cell type were the cDNAs corresponding to these ESTs developed?
      Keratinocyte

   g. (0.1) What is the name of the contact from the comment field?
      Scarafia LE

   Take that nucleotide EST sequence and do a blastx search against the model organisms (landmark) database. Exclude Models (XM/XP).

   h. (0.1) For the best matching protein sequence with optineurin in the title, list:

      E value            1e-11
      Bits               71.6
      % Identity         30
      % Similarity       53
      Alignment Length   524

   i. (0.1) From what organism is the best matching optineurin sequence?
      Danio rerio

   j. (0.1) Open that optineurin record. Provide the numerical location of the "NEMO" region.
      27..92

6. This problem is based on a paper by Sandesh Acharya and others published in August 2021 (PLoS One 16(8):e0241093).
   Goal: Examine a putative homolog found using PSI-BLAST.
   Given: Query sequence O73947 (the first character is the letter O).
   Target: Sequence OPZ48648 (the first character is the letter O).
   Iteration #1: PSI-BLAST nr, limited to bacteria (taxid:2).
   Iteration #2: Include any sequence with an E value of 0.005 or lower.

   a. (0.1) At what iteration of PSI-BLAST does the OPZ48648 protein appear in the results?
      Iteration #2

   b. (0.1) List the E value from that iteration.
      1e-20

   c. (0.1) List the title of the OPZ48648 record.
      DNA Polymerase Sliding Clamp

   d. (0.2) List the organism for the OPZ48648 record.
      Bacteroidetes bacterium
   e. (0.3) Based on the Iteration 2 E value, can you definitively state that O73947 and OPZ48648 are homologs? Why or why not?
      No, the strong E value in the PSI-BLAST results does not confirm homology. The E value is relative to the PSSM, and if there is any weakness in the PSSM, unusual results occur.

   f. (0.1) Confirm homology between O73947 and OPZ48648 using pairwise BLAST alignment. Are they homologous based on the pairwise alignment E value?
      Yes. The pairwise alignment E value is 2e-09.

   g. (0.1) Why might BLAST have not found all homologs on the first iteration?
      No search program is 100% sensitive. The second iteration is based on the PSSM built from the multiple sequence alignment, so its new hits are putative homologs. They may not come up the first time, when there is no alignment yet.

   h. (0.1) Other than pairwise alignment, describe what else you could have done on your computer with BLAST and alignment tools to try to confirm homology between Q9UQ84.2 and P14629.
      Run blastp on just those sequences or look at the log-odds table.

   i. (0.1) Describe what practical steps you could do in the wet lab to try to confirm homology between Q9UQ84.2 and P14629 (Hint: protein structure determination is NOT practical in a short time period).
      Run a microarray analysis of the two proteins to analyze pathways, clusters, and correlated mutations in the multiple sequence alignment.

7. Goal: Run PHI-BLAST and answer a few questions.
   Given: Pattern is [FYW]-P-[EQH]-[LIV](2)-G-x(2)-[STAGV]-x(2)-A
   Given: Sequence below.
      >Chymotrypsin inhibitor-2 OS=Hordeum vulgare var. distichum GN=CI-2 PE=4 SV=1
      MSSMEKKPEGVNIGAGDRQNQKTEWPELVGKSVEEAKKVILQDKPEAQIIVLPVGTIVTIEYRIDRVRLFVDRLDNIAQVPRVG
   Search: PHI-BLAST, refseq_protein DB, threshold 0.0001, organism limit to asterids (taxid: 71274). Check the box to exclude Models (XM/XP).
   Also same search, same parameters, standard BLASTP.
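The PHI-BLAST pattern above is written in PROSITE-style syntax, so it can be turned into a regular expression to see where it falls in the query. The following is a rough sketch, assuming only the syntax elements that appear in this pattern (single residues, bracketed sets, `x` wildcards and `(n)` repeat counts); `prosite_to_regex` is a hypothetical helper written for this example, not a PHI-BLAST or library function.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regular expression.

    Handles only single residues, [ABC] sets, x wildcards and (n) repeat counts.
    """
    parts = []
    for element in pattern.replace(" ", "").split("-"):
        m = re.fullmatch(r"(\[[A-Z]+\]|[A-Zx])(?:\((\d+)\))?", element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, count = m.groups()
        residue = "." if residue == "x" else residue
        parts.append(residue + (f"{{{count}}}" if count else ""))
    return "".join(parts)


pattern = "[FYW]-P-[EQH]-[LIV](2)-G-x(2)-[STAGV]-x(2)-A"
query = (
    "MSSMEKKPEGVNIGAGDRQNQKTEWPELVGKSVEEAKKVILQDKPEAQIIVLPVG"
    "TIVTIEYRIDRVRLFVDRLDNIAQVPRVG"
)

regex = prosite_to_regex(pattern)
match = re.search(regex, query)

# Convert Python's 0-based offset to the 1-based residue numbering used in the lab.
start_position = match.start() + 1
print(regex, match.group(), start_position)
```

The first match starts at residue 25 (the tryptophan in `WPELVGKSVEEA`), which is the kind of position question 7c asks about.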
   a. (0.1) What are asterids?
      One of the largest subgroups of flowering plants. They are mostly shrubs with opposite leaves and no stipules.

   b. (0.3) List the number of results above threshold (E value < 0.0001) for each search:

      Standard BLASTP         100
      PHI-BLAST iteration 1   136
      PHI-BLAST iteration 2   268

   c. (0.2) At what amino acid position does the pattern begin in the query sequence?
      25

   e. (0.2) Should PHI/PSI-BLAST always continue to find new hits at every iteration? Briefly explain.
      No. PSI-BLAST can repeatedly search the target databases, using a multiple alignment of high scoring sequences found in each search round to generate a new PSSM for use in the next round of searching. PSI-BLAST will iterate until no new sequences are found, or the user-specified maximum number of iterations is reached.

   f. (0.2) At what iteration are no more new sequences added?
      The first

8. Goal: Determine the relatedness of influenza hemagglutinins and neuraminidases.
   Given: The file flu_DNA.txt. The top three sequences are hemagglutinins and the bottom three are neuraminidases.
   Hint: With flu subtypes (H1N1), what does the H refer to and what does the N refer to?
   Resources: blastn, pairwise, a series of alignments must be used

   a. (0.2) Which are the two most closely related hemagglutinins based upon E value from pairwise alignment?
      KX004707.1 and KY522887.1

   b. (0.2) Which are the two most closely related neuraminidases based upon E value from pairwise alignment?
      MT341964.1 and MN225004.1

   c. (0.3) Which is more important in determining the relatedness: geographic location, date of isolation, subtype (e.g. H1N1) or host (human, swine)?
      Subtype

   d. (0.3) Protein comparison often works better than DNA comparison in blast analyses. This is coding DNA. Why is nucleotide blast more appropriate than protein blast in this particular instance?
      Protein sequences are more evolutionarily conserved than nucleotide sequences.

9. (0.6) What advantages does JACKHMMER have over PSI-BLAST? Include the type of profile and that profile's ability to handle gaps.
   a. JACKHMMER uses Hidden Markov Models (HMMs) and PSI-BLAST uses PSSMs. HMMs can include insertions and deletions while PSSMs cannot. HMMs are probabilistic models.

10. (1.4) Goal: Use JACKHMMER to find matches and then assess homology.
   Given: Everyone will be assigned a query protein, a database and a taxonomy limit. You will be looking for your target protein on Iteration 2 of JACKHMMER. You will also be looking for matches that match your assigned domain pattern.

   a. (0.2) Start with a few sentences about your particular query protein and your target protein. If you can find their functions, focus on that.
      i. WP_104828400 - iron sulfur cluster repair protein; participates in the control of gene expression, oxygen/nitrogen sensing, control of the labile iron pool, and DNA damage recognition and repair.

   b. (0.2) What is the E value for your target protein in the iteration two results?
      7.1e-88

   c. (0.2) Based ONLY on the JACKHMMER iteration two results, can you definitively conclude homology between your query and your target? Briefly explain.
      Yes. JACKHMMER is based on the HMM model, and an HMM assesses the likelihood of matches, mismatches, insertions, and deletions at a position in an alignment. It is a statistical model based on known sequences.

   d. (0.2) After iteration two of JACKHMMER, how many sequences have the domain pattern to which you were assigned? List the identifiers for two of those matches (if there are two).
      561
      A0A165L7I9_9EURY
      A0A0B5H4R4_9EURY

   e. (0.3) Using only BLAST and/or pairwise BLAST, assess homology between your query sequence and your target sequence. You can use a series of blastp searches. Describe your strategy and your conclusion.
      Sequences are homologous. E value is 2e-36.

   f. (0.2) List the domains that are shared between your
query protein and your target protein. Based on domain sharing, do you think that provides further evidence of homology between your query and target? Briefly explain.

11. The table shows nucleotide frequencies from a multiple alignment. Build a log-odds PSSM. Use 2*log2(obs/exp) to calculate log-odds values. The following calculator does log base 2 ( https://www.omnicalculator.com/math/log-2 )

         Pos1   Pos2   Pos3   Pos4
      A  0.63   0.05   0.68   0.37
      T  0.16   0.05   0.09   0.16
      G  0.04   0.86   0.08   0.43
      C  0.17   0.04   0.15   0.04

   (0.6) Assume that the expected frequency for C and G is 0.30 while A and T are 0.20. Calculate your log-odds PSSM rounded to one decimal place. Write the PSSM below.

         Pos1   Pos2   Pos3   Pos4
      A   3.3   -4.0    3.5    1.8
      T  -0.6   -4.0   -2.3   -0.6
      G  -5.8    3.0   -3.8    1.0
      C  -1.6   -5.8   -2.0   -5.8

   (0.2) What is the highest possible scoring 4 nt sequence in your PSSM? What is its score?
      Sequence: AGAA
      Score: 11.6

   (0.2) What is the next-highest possible score and its sequence?
      Sequence: AGAG
      Score: 10.8

12. Neighborhood words:

   a. (0.6) Explain how a neighborhood word helps gain back sensitivity lost from using a word length of 3 in BLAST searches. Your answer should include how the length of the word affects sensitivity.
      i. If a query sequence contains QWRTG, the searched words are QWR, WRT, RTG. The algorithm finds all common words between the query sequence and the hit sequences. Only regions with a word hit will be used to build an alignment. For each word in the query sequence, a compilation of neighborhood words is also generated. The compilation of exact words and neighborhood words is then used to match against database sequences. By adjusting word size and neighborhood word threshold, it is possible to limit the search space.

   b. (0.4) If the BLOSUM62 threshold is 11, which of the following three-letter words meet the criteria for a neighborhood word match to LRD? List the scores for each.
      LRD = 4+5+6 = 15

      Sequence (word)   BLOSUM62 score vs LRD
      LKQ               4+2+0 = 6
      IKD               2+2+6 = 10
      MRD               2+5+6 = 13
      IRD               2+5+6 = 13
      LKD               4+2+6 = 12
      MRQ               2+5+0 = 7
      IRQ               2+5+0 = 7
      LRQ               4+5+0 = 9
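The arithmetic in questions 11 and 12b can be re-derived with a short script. This is a sketch under stated assumptions: the frequencies, expected values and threshold are taken from the tables above, the BLOSUM62 entries are only the seven pairs these words need, and a word is treated as a neighborhood match when its score is at least the threshold. All variable names are illustrative.

```python
import math
from itertools import product

# Question 11: log-odds PSSM, 2 * log2(observed / expected), one decimal place.
observed = {
    "A": [0.63, 0.05, 0.68, 0.37],
    "T": [0.16, 0.05, 0.09, 0.16],
    "G": [0.04, 0.86, 0.08, 0.43],
    "C": [0.17, 0.04, 0.15, 0.04],
}
expected = {"A": 0.20, "T": 0.20, "G": 0.30, "C": 0.30}

pssm = {
    nt: [round(2 * math.log2(freq / expected[nt]), 1) for freq in freqs]
    for nt, freqs in observed.items()
}

# Score every possible 4-nt sequence against the rounded PSSM, best first.
scored = [
    ("".join(seq), round(sum(pssm[nt][i] for i, nt in enumerate(seq)), 1))
    for seq in product("ATGC", repeat=4)
]
ranked = sorted(scored, key=lambda item: item[1], reverse=True)
best, runner_up = ranked[0], ranked[1]

# Question 12b: only the BLOSUM62 entries needed to score these words vs LRD.
blosum62_subset = {
    ("L", "L"): 4, ("L", "I"): 2, ("L", "M"): 2,
    ("R", "R"): 5, ("R", "K"): 2,
    ("D", "D"): 6, ("D", "Q"): 0,
}
query_word = "LRD"
threshold = 11
candidates = ["LKQ", "IKD", "MRD", "IRD", "LKD", "MRQ", "IRQ", "LRQ"]

word_scores = {
    word: sum(blosum62_subset[(q, c)] for q, c in zip(query_word, word))
    for word in candidates
}
passing = {word: s for word, s in word_scores.items() if s >= threshold}

print(pssm)
print(best, runner_up)
print(passing)
```

Under these assumptions the script agrees with the PSSM written out above, ranks AGAA first and AGAG second, and keeps MRD, IRD and LKD as the words that reach the threshold of 11.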