5_Project (docx)

School: Northeastern University
Course: 2301
Subject: Biology
Date: Apr 3, 2024
Type: docx
Pages: 9
Uploaded by KidBeeMaster1068
Project Five: Having a BLAST with pairwise alignments (20 points)

In this project, you'll learn the basics of how pairwise alignments are carried out, how changes that have accumulated between two sequences affect an alignment, and how alignments are used to identify gene homologs. You'll have the opportunity to carry out homolog searches using NCBI's BLAST tool, and to understand how features of the resulting pairwise alignments represent actual biological events that occurred as those sequences diverged. In the last section, you'll learn some guidelines for citing research papers.

Pre-class prep:
Read all text.
Watch this video on using BLAST before class.

Pairwise sequence alignments (comparisons of two sequences, position by position) give insights into evolutionary relationships between the sequences. They are used in finding homologs, which are sequences that are related by common ancestry. This week's exercises will familiarize you with various parameters in a BLAST search, how to conduct BLAST searches appropriate to the information you're seeking, and how to read the results.

Imagine you think of a long sentence, such as "The tips of a phylogenetic tree can be living organisms or fossils, and represent the 'end', or the present, in an evolutionary lineage." You whisper it to a few people in one big auditorium filled with all three sections of genetics students, and they begin passing it on to their neighbors. Early in the process, you all separate into three different auditoriums by section. Eventually the final person in each room hears the sentence and repeats it. You will now have three slightly different versions of the original sequence, but it is apparent that all three originated from one source. All three versions are homologs. Homologs share "excess" similarity, meaning they are more similar at the sequence level than would be expected by chance alone.

I. Using BLAST to identify sequences (11 pts)

1. In the section of a BLAST pairwise alignment shown below, what does a '+' sign signify?
A '+' sign indicates that the two aligned amino acids are not identical but are similar: a conservative substitution that scores positively in the substitution matrix and is likely to preserve function.
Jurassic Park, 1990 (pre-internet)
The Lost World, 1995

Crichton's Jurassic Park involves a mosquito preserved in amber from the Jurassic period, when dinosaurs roamed the earth. Scientists are able to recover dinosaur DNA from the mosquito's last blood meal (which they then use to "de-extinct" a dinosaur). To make it more realistic, Jurassic Park even includes a nucleotide sequence that is allegedly from the dinosaur DNA, shown below as "JP". However, the novel was first published in 1990, before the internet was in use by the public, so it wasn't possible to simply "BLAST" the sequence against the NCBI databases to see what it matches. But you can do that easily now...

First, translate the "JP" sequence with ExPASy translate, using "compact" for the output format.

2. Do any of the six reading frames appear to encode a protein? Don't be fooled by stretches of pink. Instead, look for long regions uninterrupted by stop codons.
No, none of them appear to encode a protein.

Now copy the nucleotide sequence for JP into the BLAST search window and choose the "blastn" tab above it. Keep all the default settings and click the BLAST button near the bottom left of the page.

3. What is the name of the first hit returned with coverage of 99% or more?
The first hit with coverage of 99% or more is Cloning vector pUD1074.

4. Does this sequence seem like a credible homolog of a dinosaur sequence?
No, this does not seem like a credible homolog of a dinosaur sequence. The top hit is a cloning vector, an engineered plasmid used in the laboratory, not a sequence from an animal. A homolog was described above as sharing excess similarity because of common ancestry, and there is no reason to expect that between a dinosaur gene and a cloning vector.
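The six-frame translation step used for questions 2 and 5 can be sketched in Python. This is a minimal illustration that assumes the standard genetic code and is not a reproduction of how ExPASy translate works internally; the function names (translate, six_frames, longest_open_stretch) are hypothetical and chosen only for this example.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate DNA codon by codon; '*' marks a stop, 'X' an unknown codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Translate all three offsets on both strands (labels are illustrative)."""
    frames = {}
    for strand, seq in (("5'3'", dna), ("3'5'", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand} Frame {offset + 1}"] = translate(seq[offset:])
    return frames


def longest_open_stretch(protein):
    """Longest run of amino acids not interrupted by a stop codon."""
    return max(protein.split("*"), key=len)
```

To apply it, paste a sequence with the FASTA header and whitespace removed into six_frames and compare longest_open_stretch across the six results. A frame whose longest stretch is much longer than the others is the best candidate for a protein-coding frame.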
The second nucleotide sequence ("LW") is from the novel The Lost World, a sequel to Jurassic Park that was published in 1995. The Lost World also contains a nucleotide sequence, but by this time, Crichton knew that many people were using the internet, and it was possible to access the NCBI databases and do a BLAST search, as you are about to do...

First, translate the "LW" sequence using the ExPASy translate tool.

5. Which reading frame is the most likely to encode a protein (based on the length of uninterrupted sequence of amino acids)?
5'3' Frame 2 appears to be the most likely to encode a protein.

Copy the longest translated sequence (you can choose the longest sequence that is highlighted in pink/red) and paste it into a new BLAST search window. Since this is a presumed protein sequence, choose the tab above the window, "blastp", for searching against protein databases. Again, leave everything else at default values. Run the search...

6. What is the name of the top BLAST hit?
The top BLAST hit is erythroid transcription factor [Gallus gallus].

7. What species is it from (give the common name, not the Latin)?
The common name is the red junglefowl, the species that includes the domestic chicken.

8. What is the closest living relative of Tyrannosaurus rex (Google it if necessary)?
The chicken is the closest living relative.

Click on the link to the top match to be taken to the pairwise alignment and look for mismatches and gaps.

9. How many gaps (indels) do you see in the alignment? (Ignore the values given above the alignment; just use your eyes, looking for strings of dashes in one or the other sequence. Count each run of dashes as one gap.)
I can identify 4 gaps in the alignment.

10. How many substitutions (mismatches) are there in the alignment? (Again, use your eyes. Look at the sequence in the middle of the two being compared. A letter appears wherever the two are perfectly matched. If not matched, there is a space, just as with the gaps.)
I counted 1 substitution in the sequence.

Think about it: If substitutions (represented by mismatches in an alignment) are more common in nature than insertions and deletions (represented by gaps), doesn't it seem unusual that this alignment has more gaps than mismatches? Hmmmm.

11. Look carefully at the gaps. What is unusual about the amino acids matched with the gaps? (Note: you'll know it right away when you see it!)
The amino acids aligned with the gaps spell out "MARK WAS HERE NIH", so they are an inserted message rather than part of the real sequence. Also, the gaps occur immediately before or after the amino acid asparagine.

>JP
gcgttgctggcgtttttccataggctccgcccccctgacgagcatcacaaaaatcgacgc
ggtggcgaaacccgacaggactataaagataccaggcgtttccccctggaagctccctcg
tgttccgaccctgccgcttaccggatacctgtccgcctttctcccttcgggaagcgtggc
tgctcacgctgtaggtatctcagttcggtgtaggtcgttcgctccaagctgggctgtgtg
ccgttcagcccgaccgctgcgccttatccggtaactatcgtcttgagtccaacccggtaa
agtaggacaggtgccggcagcgctctgggtcattttcggcgaggaccgctttcgctggag
atcggcctgtcgcttgcggtattcggaatcttgcacgccctcgctcaagccttcgtcact
ccaaacgtttcggcgagaagcaggccattatcgccggcatggcggccgacgcgctgggct
ggcgttcgcgacgcgaggctggatggccttccccattatgattcttctcgcttccggcgg
cccgcgttgcaggccatgctgtccaggcaggtagatgacgaccatcagggacagcttcaa
cggctcttaccagcctaacttcgatcactggaccgctgatcgtcacggcgatttatgccg
caagtcagaggtggcgaaacccgacaaggactataaagataccaggcgtttcccctggaa
gcgctctcctgttccgaccctgccgcttaccggatacctgtccgcctttctcccttcggg
ctttctcattgctcacgctgtaggtatctcagttcggtgtaggtcgttcgctccaagctg
acgaaccccccgttcagcccgaccgctgcgccttatccggtaactatcgtcttgagtcca
acacgacttaacgggttggcatggattgtaggcgccgccctataccttgtctgcctcccc
gcggtgcatggagccgggccacctcgacctgaatggaagccggcggcacctcgctaacgg
ccaagaattggagccaatcaattcttgcggagaactgtgaatgcgcaaaccaacccttgg
ccatcgcgtccgccatctccagcagccgcacgcggcgcatctcgggcagcgttgggtcct
gcgcatgatcgtgctagcctgtcgttgaggacccggctaggctggcggggttgccttact
atgaatcaccgatacgcgagcgaacgtgaagcgactgctgctgcaaaacgtctgcgacct
atgaatggtcttcggtttccgtgtttcgtaaagtctggaaacgcggaagtcagcgccctg

>LW
agaattccggaagcgagcaagagataagtcctggcatcagatacagttggagataaggac
ggacgtgtggcagctcccgcagaggattcactggaagtgcattacctatcccatgggagc
catggagttcgtggcgctgggggggccggatgcgggctcccccactccgttccctgatga
agccggagccttcctggggctgggggggggcgagaggacggaggcgggggggctgctggc
ctcctaccccccctcaggccgcgtgtccctggtgccgtgggcagacacgggtactttggg
gaccccccagtgggtgccgcccgccacccaaatggagcccccccactacctggagctgct
gcaacccccccggggcagccccccccatccctcctccgggcccctactgccactcagcag
cgggcccccaccctgcgaggcccgtgagtgcgtcatggccaggaagaactgcggagcgac
ggcaacgccgctgtggcgccgggacggcaccgggcattacctgtgcaactgggcctcagc
ctgcgggctctaccaccgcctcaacggccagaaccgcccgctcatccgccccaaaaagcg
cctgcgggtgagtaagcgcgcaggcacagtgtgcagccacgagcgtgaaaactgccagac
atccaccaccactctgtggcgtcgcagccccatgggggaccccgtctgcaacaacattca
cgcctgcggcctctactacaaactgcaccaagtgaaccgccccctcacgatgcgcaaaga
cggaatccaaacccgaaaccgcaaagtttcctccaagggtaaaaagcggcgccccccggg
ggggggaaacccctccgccaccgcgggagggggcgctcctatggggggagggggggaccc
ctctatgccccccccgccgccccccccggccgccgccccccctcaaagcgacgctctgta
cgctctcggccccgtggtcctttcgggccattttctgccctttggaaactccggagggtt
ttttggggggggggcggggggttacacggcccccccggggctgagcccgcagatttaaat
aataactctgacgtgggcaagtgggccttgctgagaagacagtgtaacataataatttgc
acctcggcaattgcagagggtcgatctccactttggacacaacagggctactcggtagga
ccagataagcactttgctccctggactgaaaaagaaaggatttatctgtttgcttcttgc
tgacaaatccctgtgaaaggtaaaagtcggacacagcaatcgattatttctcgcctgtgt
gaaattactgtgaatattgtaaatatatatatatatatatatatatctgtatagaacagc
ctcggaggcggcatggacccagcgtagatcatgctggatttgtactgccggaattc
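The counting rules from questions 9 and 10 (each run of dashes is one gap; a mismatch is an aligned pair of different letters) can be sketched as follows. This is a minimal sketch that assumes the two rows of the alignment have already been copied out as equal-length strings with '-' for gap positions. The function name alignment_stats and the toy alignment in the usage note are hypothetical and are not taken from the real BLAST result.

```python
import re


def alignment_stats(query, subject):
    """Count gap runs, mismatches and identities in a pairwise alignment.

    Both rows must have the same length, with '-' marking gap positions.
    Each uninterrupted run of dashes in either row counts as ONE gap.
    """
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")

    gap_runs = len(re.findall(r"-+", query)) + len(re.findall(r"-+", subject))

    identities = 0
    mismatches = 0
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            continue  # gap column: neither an identity nor a mismatch
        if q == s:
            identities += 1
        else:
            mismatches += 1

    return {"gap_runs": gap_runs, "mismatches": mismatches, "identities": identities}
```

For a made-up pair of rows such as "MKTAY---LLVE" and "MKSAYWASLL-E", the function reports 2 gap runs, 1 mismatch and 7 identities, which matches a count by eye.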
II. Understanding "excess similarity", used to identify homologs (7 pts)

12. Imagine generating two random sequences, each 200 nucleotides in length. If you align them pairwise, at how many positions are they expected to match just by chance? To help you answer this question, think about the chance a single randomly chosen nucleotide will match another randomly chosen nucleotide.
50 positions would be expected to match just by chance (200 × 1/4).

13. Now imagine generating two random sequences, each 200 amino acids in length. In a pairwise alignment, at how many positions are they expected to match by chance?
10 positions would be expected to match by chance (200 × 1/20).

Read the excerpt below, from a very heavily cited paper by one of the leading figures in sequence analysis, William Pearson.

In light of the information above, try doing a BLAST search with the nucleotide sequence for a gene, and then with the amino acid (translated DNA) sequence, and see how the results differ. We'll try this with an exon of the human LMNA gene.

Investigate nucleotide vs amino acid homology

To explore the dynamics of nucleotide homology or amino acid homology matching, go back to the BLAST website and select blastn (nucleotide BLAST). Copy and paste the hu_LMNA_exon_nt sequence, including the header information, into the empty BLAST query field. Click in the "job title" box, and a description title matching your FASTA header appears. Now, let's restrict the taxonomy that we will search for homologs. In the "organism" box, select "bony fishes". Finally, run your BLAST search.

Once the hits are returned, click on the Taxonomy link above the graphics window, then select "Taxonomy" next to Reports. The first line will show the number of hits and the number of organisms that are included in the homologs that resulted from the search.

>hu_LMNA_exon_nt
gctacgcctgtcccccagccctacctcgcagcgcagccgtggccgtgcttcctctcactc
atcccagacacagggtgggggcagcgtcaccaaaaagcgcaaactggagtccactgagag
ccgcagcagcttctcacagcacgcacgcactagcgggcgcgtggccgtggaggaggtgga
tgaggagggcaagtttgtccggctgcgcaacaagtccaatgag

>hu_LMNA_exon_aa
LRLSPSPTSQRSRGRASSHSSQTQGGGSVTKKRKLESTESRSSFSQHARTSGRVAVEEVD
EEGKFVRLRNKSNE

14. How many hits and number of organisms are returned for the nucleotide search?
There were 806 hits and 318 organisms for the nucleotide search.

Open another BLAST webpage in a new tab on your web browser. Select blastp for the protein search and enter the amino acid FASTA sequence (hu_LMNA_exon_aa) in the empty query field. Make sure your "job title" matches the header information of your FASTA sequence. Restrict the taxonomy to bony fishes and then run your BLAST search. Once the hits are returned, repeat the steps from your previous search to compare the numbers of hits and organisms.

15. How many hits and number of organisms are returned for the amino acid search? Be sure you're looking at the Taxonomy Report section!
There were 1,426 hits and 442 organisms.

16. Which search produced hits in a wider range of organisms?
The second search (amino acid) produced hits in a wider range of organisms.

When comparing two nucleotide sequences, the chance of a random match at any position is 1/4, since there are only 4 possible nucleotides. In contrast, when comparing two amino acid sequences, the chance of a random match at any position is only 1/20, since there are 20 possible amino acids. Therefore, an amino acid match is more significant, since it is less likely to have occurred only by chance. For example, what if a nucleotide alignment matches at 25% of the positions? That's the percentage expected to match just by chance. But consider an amino acid alignment that matches at 25% of the positions. That level of identity is highly significant, since only 5% (1/20) of positions should match just by chance.

17. Which type of search (nucleotide or amino acid) would you use if you wanted to find homologs to your sequence in distantly related organisms, meaning those that shared a common ancestor long ago rather than more recently? Explain why. (2 pts)
It would be most effective to use amino acid sequences because they are more conserved and less affected by silent mutations, so they give a more accurate reflection of evolutionary relationships than nucleotide sequences do. Also, nucleotide sequences accumulate changes more quickly, which makes them less reliable for distant homology searches.
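The chance-match arithmetic behind questions 12 and 13 can be checked with a short Python sketch. It assumes, as the worksheet does, that every letter in the alphabet is equally likely and that positions are independent; real sequences have skewed composition, so this is only an approximation. The function names and the simulation settings (number of trials, seed) are hypothetical choices for this example.

```python
import random

NUCLEOTIDES = "ACGT"
AMINO_ACID_LETTERS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def expected_chance_matches(length, alphabet_size):
    """Expected number of matching positions between two random sequences.

    Each position matches with probability 1 / alphabet_size.
    """
    return length / alphabet_size


def simulate_chance_matches(length, alphabet, trials=1000, seed=0):
    """Average number of matches over many pairs of random sequences."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        a = rng.choices(alphabet, k=length)
        b = rng.choices(alphabet, k=length)
        total += sum(x == y for x, y in zip(a, b))
    return total / trials
```

expected_chance_matches(200, 4) gives 50.0 and expected_chance_matches(200, 20) gives 10.0, the values used in the answers above. The simulation should land close to those figures, which is one way to see why 25% identity is unremarkable for nucleotides but striking for amino acids.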
Note about LW sequence: Mark Boguski, a scientist at the NIH (National Institutes of Health), noticed that the sequence in Jurassic Park was not a plausible dinosaur sequence, so he provided the "LW" sequence to Crichton for use in the sequel. However, he very cleverly embedded his name into the sequence!

III. Finding and Citing Academic Research (3 pts)

Here are some guidelines for how you cite in biology. After reading through them, answer the three questions at the end of this section.

Provide a citation for the discoverers of the information
  o Do: Find the original source!
  o Don't: Cite the first paper you come across
Provide a citation for any new information contributed
  o Cite the people with the idea, AND the people that improved upon it
Do not use quotations; put ideas into your own words.
  o Remember, if the ideas are not yours, they still need to be cited!
Common knowledge does not need citation (but what knowledge is common may differ based on the audience)
  o Assume the types of things you learned in AP/intro biology are common knowledge
Different journals use different styles, but most are similar to APA
In-text citations:
  o Appear within the text, usually at the ends of sentences, to show where the underlying information was originally discovered
  o We will use the parenthetical style (Author's Last Name, Year)
  o When a paper has two authors, write (Last Name and Last Name, Year); when there are more than two authors, write (Last Name et al., Year)
References list at the end of the document:
  o Full citation format; we will use APA 6th/7th edition format for journal articles
  o Only cite sources with information that actually made it into the final product, not every paper you read during your research
  o Alphabetical order
  o Citing fake sources or sources you did not actually use is academic misconduct and will receive a harsh penalty

Generating Citations
Using scholar.google.com

Citation Managers
Help you keep track of papers you read and automatically generate citations
Integrate with Word and other products
Paid: EndNote (free for current students), RefWorks
Free: Zotero, Mendeley

Paraphrasing
In academic biology writing we do NOT use quotes
That means you need to put things in your own words (while still citing where the information came from)
There is a fine line between proper paraphrasing and plagiarism
All of the following are considered plagiarism:
  Using text that is directly taken from another source (remember, we don't use quotes in biology)
  Changing words but copying the sentence structure of a source
  Turning in someone else's work as your own
  Copying ideas from someone else without giving credit
  Giving incorrect information about a source

18. After reading the Original Source and the citation, decide whether A, B, or C (below) is more appropriate as an example of content included in a paper you write. (1 pt)
C

19. Why are the other two not appropriate? (2 pts)
A is not appropriate because it quotes the source directly, which is not done in biology writing. B is not appropriate because it does not cite the original source.

Original Source: "The manner in which a newly synthesized chain of amino acids transforms itself into a perfectly folded protein depends both on the intrinsic properties of the amino-acid sequence and on multiple contributing influences from the crowded cellular milieu."
Dobson, C. M. (2003). Protein folding and misfolding. Nature, 426(6968), 884-890.

A. "The manner in which a newly synthesized chain of amino acids transforms itself into a perfectly folded protein depends both on the intrinsic properties of the amino-acid sequence and on multiple contributing influences from the crowded cellular milieu (Dobson, 2003)."

B. The way in which a protein folds depends both on the intrinsic properties of the amino acid sequence and on other contributing influences from the crowded cell.

C. The way a protein folds is influenced by the sequence of amino acids in the protein and its cellular environment (Dobson, 2003).
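The parenthetical in-text citation rule from the guidelines above (one author, two authors, more than two authors) can be sketched as a small Python helper. This is only an illustration of the rule as this worksheet states it, not a full APA implementation; the function name in_text_citation is hypothetical, and the author names "Smith", "Jones" and "Lee" in the usage note are placeholders rather than real references.

```python
def in_text_citation(last_names, year):
    """Build a parenthetical citation following the worksheet's rule.

    One author:          (Last Name, Year)
    Two authors:         (Last Name and Last Name, Year)
    More than two:       (Last Name et al., Year)
    """
    if not last_names:
        raise ValueError("at least one author is required")

    if len(last_names) == 1:
        authors = last_names[0]
    elif len(last_names) == 2:
        authors = f"{last_names[0]} and {last_names[1]}"
    else:
        authors = f"{last_names[0]} et al."

    return f"({authors}, {year})"
```

in_text_citation(["Dobson"], 2003) returns "(Dobson, 2003)", the form used in option C. With placeholder names, in_text_citation(["Smith", "Jones", "Lee"], 2020) returns "(Smith et al., 2020)".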