School: Northeastern University
Course: 2301
Subject: Biology
Date: Apr 3, 2024
Type: docx
Pages: 9
Uploaded by KidBeeMaster1068
Project Five: Having a BLAST with pairwise alignments (20 points)
In this project, you'll learn the basics of how pairwise alignments are carried out, how changes that have accumulated between two sequences affect an alignment, and how alignments are used to identify gene homologs. You'll have the opportunity to carry out homolog searches using NCBI's BLAST tool, and to understand how features of the resulting pairwise alignments represent actual biological events that occurred as those sequences diverged. In the last section, you’ll learn some guidelines for citing research papers.
Pre-class prep: Read all text and watch this video on using BLAST before class.
Pairwise sequence alignments (comparisons of two sequences, position by position) give insights into evolutionary relationships between the sequences. They are used in finding homologs, which are sequences that are related by common ancestry. This week's exercises will familiarize you with various parameters in a BLAST search, how to conduct BLAST searches appropriate to the information you're seeking, and how to read the results.
Imagine you think of a long sentence, such as "The tips of a phylogenetic tree can be living organisms or fossils, and represent the 'end', or the present, in an evolutionary lineage."
You whisper it to a few people in one big auditorium filled with all three sections of genetics students, and they begin passing it on to their neighbors. Early in the process, you all separate into three different auditoriums by section. Eventually the final person in each room hears the sentence and repeats it. You will now have three slightly different versions of the original sentence, but it is apparent that all three originated from one source. All three versions are homologs. Homologs share "excess" similarity, meaning they are more similar at the sequence level than would be expected by chance alone.
I. Using BLAST to identify sequences (11 pts)
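1. In the section of a BLAST pairwise alignment shown below, what does a '+' sign signify? A ‘+’ sign marks a position where the two aligned amino acids are not identical but are similar: the substitution is conservative and scores positively in the scoring matrix (BLOSUM62 by default in blastp), so the two residues are likely to be functionally equivalent.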
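The middle line of a protein alignment can be sketched in a few lines of Python. This is only an illustration of the idea, not BLAST's own code: the function name `midline`, the toy sequences, and the short `POSITIVE_PAIRS` list (a small hand-picked subset of residue pairs that score above zero in BLOSUM62) are assumptions made for the example.

```python
# Illustrative subset of residue pairs that score above zero in BLOSUM62.
# The real matrix has many more entries; this short list is an assumption
# made to keep the sketch small.
POSITIVE_PAIRS = {
    frozenset(pair)
    for pair in [("I", "V"), ("I", "L"), ("K", "R"), ("D", "E"),
                 ("S", "T"), ("F", "Y"), ("E", "Q")]
}


def midline(query, subject):
    """Build a BLAST-style middle line for two aligned protein strings."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    symbols = []
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            symbols.append(" ")   # gap: nothing is shown
        elif q == s:
            symbols.append(q)     # identical residues: the letter itself
        elif frozenset((q, s)) in POSITIVE_PAIRS:
            symbols.append("+")   # different but similar residues
        else:
            symbols.append(" ")   # dissimilar residues: a mismatch
    return "".join(symbols)


# Hypothetical aligned pair, made up for this example.
query = "MKVLD-G"
subject = "MRILEAW"
print(query)
print(midline(query, subject))
print(subject)
```

In this toy pair, K/R, V/I and D/E are shown as '+', while the gap and the G/W mismatch are left blank.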
[Figure: Jurassic Park, 1990 (pre-internet); The Lost World, 1995]
Crichton's Jurassic Park involves a mosquito preserved in amber from the Jurassic period, when dinosaurs roamed the earth. Scientists are able to recover dinosaur DNA from the mosquito's last blood meal (which they then use to "de-extinct" a dinosaur). To make it more realistic, Jurassic Park even includes a nucleotide sequence that is allegedly from the dinosaur DNA, shown below as "JP". However, the novel was first published in 1990, before the internet was in use by the public, so it wasn't possible to simply "BLAST" the sequence against the NCBI databases to see what it matches. But you can do that easily now . . .
First, translate the "JP" sequence with ExPASy translate, using "compact" for the output format.
2. Do any of the six reading frames appear to encode a protein? Don't be fooled by stretches of pink. Instead, look for long regions uninterrupted by stop codons. No, none of them appear to encode a protein.
Now copy the nucleotide sequence for JP into the BLAST search window and choose the "blastn" tab above it. Keep all the default settings and click the BLAST button near the bottom left of the page.
3. What is the name of the first hit returned with coverage of 99% or more? The first hit with coverage of 99% or more is Cloning vector pUD1074.
4. Does this sequence seem like a credible homolog of a dinosaur sequence? No, this does not seem like a credible homolog. The top hit is a cloning vector, a plasmid constructed in the laboratory, not a gene from a relative of the dinosaurs such as a bird or reptile. Homologs were described above as sharing excess similarity because of common ancestry, and a match to a vector across 99% of the query suggests the JP sequence is vector DNA rather than dinosaur DNA.
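What a six-frame translation involves can be sketched as follows. This is a minimal illustration under stated assumptions, not ExPASy's implementation: it assumes the standard genetic code, the function names are hypothetical, and the short test sequence is made up rather than taken from JP or LW. The frame labels imitate the "5'3' Frame 2" style used in the answers.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate codon by codon; '*' is a stop, 'X' an unknown codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Translate three frames on each strand."""
    frames = {}
    for strand, seq in (("5'3'", dna), ("3'5'", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand} Frame {offset + 1}"] = translate(seq[offset:])
    return frames


def longest_open_stretch(protein):
    """Longest run of amino acids not interrupted by a stop codon."""
    return max(protein.split("*"), key=len)


# Made-up sequence for illustration only.
toy = "CATGAAATTTGGGTAGC"
for name, protein in six_frames(toy).items():
    print(name, protein, "longest open stretch:", longest_open_stretch(protein))
```

A frame that encodes a real protein usually stands out because its longest open stretch is much longer than those in the other five frames.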
The second nucleotide sequence ("LW") is from the novel The Lost World, a sequel to Jurassic Park that was published in 1995. The Lost World also contains a nucleotide sequence, but by this time, Crichton knew that many people were using the internet, and it was possible to access the NCBI databases and do a BLAST search, as you are about to do . . .
First, translate the "LW" sequence using the ExPASy translate tool.
5. Which reading frame is the most likely to encode a protein (based on the length of uninterrupted sequence of amino acids)? 5’3’ Frame 2 appears to be most likely to encode a protein.
You want to copy the longest translated sequence (you can choose the longest sequence that is highlighted in pink/red) and paste it into a new BLAST search window. Since this is a presumed protein sequence, choose the tab above the window, "blastp", for searching against protein databases. Again, leave everything else at default values. Run the search...
6. What is the name of the top BLAST hit? The top BLAST hit is erythroid transcription factor [Gallus gallus].
7. What species is it from (give the common name, not the Latin)? The common name is the red junglefowl.
8. What is the closest living relative of Tyrannosaurus rex (Google it if necessary)? The chicken is the closest living relative.
Click on the link to the top match to be taken to the pairwise alignment and look for mismatches and gaps.
9. How many gaps (indels) do you see in the alignment? (Ignore the values given above the alignment – just use your eyes, looking for strings of dashes in one or the other sequence. Count each run of dashes as one gap.) I can identify 4 gaps in the alignment.
10. How many substitutions (mismatches) are there in the alignment? (Again, use your eyes. Look at the sequence in the middle of the two being compared. A letter appears wherever the two are perfectly matched. If not matched, there is a space, just as with the gaps.) I counted 1 substitution in the sequence.
Think about it: If substitutions (represented by mismatches in an alignment) are more common in nature than insertions and deletions (represented by gaps), doesn't it seem unusual that this alignment has more gaps than mismatches? Hmmmm.
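The counting rules in questions 9 and 10 can be written down as a short sketch. The function name and the aligned strings below are hypothetical; they are not the real alignment from the BLAST result. Each run of dashes counts as one gap, and a substitution is any column where both sequences have a residue and the residues differ (which, in this simple version, also counts conservative '+' positions).

```python
import re


def count_gaps_and_mismatches(query, subject):
    """Count gap runs and substitutions in a pairwise alignment.

    Both strings must already be aligned (same length, '-' for gaps).
    """
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    # Each uninterrupted run of dashes in either sequence is one gap.
    gaps = len(re.findall(r"-+", query)) + len(re.findall(r"-+", subject))
    # A substitution is a column with two different residues and no dash.
    mismatches = sum(
        1 for q, s in zip(query, subject)
        if q != "-" and s != "-" and q != s
    )
    return gaps, mismatches


# Made-up alignment for illustration only.
gaps, mismatches = count_gaps_and_mismatches("MKT--LLVAGH", "MKTAALLIAG-")
print("gaps:", gaps, "substitutions:", mismatches)
```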
11. Look carefully at the gaps. What is unusual about the amino acids matched with the gaps? (Note: you'll know it right away when you see it!) The amino acids aligned with the gaps spell out "MARK WAS HERE NIH" in one-letter amino acid codes, so they are an inserted message rather than part of the natural protein sequence. Also, the gaps occur before or after the amino acid asparagine.
>JP
gcgttgctggcgtttttccataggctccgcccccctgacgagcatcacaaaaatcgacgc
ggtggcgaaacccgacaggactataaagataccaggcgtttccccctggaagctccctcg
tgttccgaccctgccgcttaccggatacctgtccgcctttctcccttcgggaagcgtggc
tgctcacgctgtaggtatctcagttcggtgtaggtcgttcgctccaagctgggctgtgtg
ccgttcagcccgaccgctgcgccttatccggtaactatcgtcttgagtccaacccggtaa
agtaggacaggtgccggcagcgctctgggtcattttcggcgaggaccgctttcgctggag
atcggcctgtcgcttgcggtattcggaatcttgcacgccctcgctcaagccttcgtcact
ccaaacgtttcggcgagaagcaggccattatcgccggcatggcggccgacgcgctgggct
ggcgttcgcgacgcgaggctggatggccttccccattatgattcttctcgcttccggcgg
cccgcgttgcaggccatgctgtccaggcaggtagatgacgaccatcagggacagcttcaa
cggctcttaccagcctaacttcgatcactggaccgctgatcgtcacggcgatttatgccg
caagtcagaggtggcgaaacccgacaaggactataaagataccaggcgtttcccctggaa
gcgctctcctgttccgaccctgccgcttaccggatacctgtccgcctttctcccttcggg
ctttctcattgctcacgctgtaggtatctcagttcggtgtaggtcgttcgctccaagctg
acgaaccccccgttcagcccgaccgctgcgccttatccggtaactatcgtcttgagtcca
acacgacttaacgggttggcatggattgtaggcgccgccctataccttgtctgcctcccc
gcggtgcatggagccgggccacctcgacctgaatggaagccggcggcacctcgctaacgg
ccaagaattggagccaatcaattcttgcggagaactgtgaatgcgcaaaccaacccttgg
ccatcgcgtccgccatctccagcagccgcacgcggcgcatctcgggcagcgttgggtcct
gcgcatgatcgtgctagcctgtcgttgaggacccggctaggctggcggggttgccttact
atgaatcaccgatacgcgagcgaacgtgaagcgactgctgctgcaaaacgtctgcgacct
atgaatggtcttcggtttccgtgtttcgtaaagtctggaaacgcggaagtcagcgccctg
>LW
agaattccggaagcgagcaagagataagtcctggcatcagatacagttggagataaggac
ggacgtgtggcagctcccgcagaggattcactggaagtgcattacctatcccatgggagc
catggagttcgtggcgctgggggggccggatgcgggctcccccactccgttccctgatga
agccggagccttcctggggctgggggggggcgagaggacggaggcgggggggctgctggc
ctcctaccccccctcaggccgcgtgtccctggtgccgtgggcagacacgggtactttggg
gaccccccagtgggtgccgcccgccacccaaatggagcccccccactacctggagctgct
gcaacccccccggggcagccccccccatccctcctccgggcccctactgccactcagcag
cgggcccccaccctgcgaggcccgtgagtgcgtcatggccaggaagaactgcggagcgac
ggcaacgccgctgtggcgccgggacggcaccgggcattacctgtgcaactgggcctcagc
ctgcgggctctaccaccgcctcaacggccagaaccgcccgctcatccgccccaaaaagcg
cctgcgggtgagtaagcgcgcaggcacagtgtgcagccacgagcgtgaaaactgccagac
atccaccaccactctgtggcgtcgcagccccatgggggaccccgtctgcaacaacattca
cgcctgcggcctctactacaaactgcaccaagtgaaccgccccctcacgatgcgcaaaga
cggaatccaaacccgaaaccgcaaagtttcctccaagggtaaaaagcggcgccccccggg
ggggggaaacccctccgccaccgcgggagggggcgctcctatggggggagggggggaccc
ctctatgccccccccgccgccccccccggccgccgccccccctcaaagcgacgctctgta
cgctctcggccccgtggtcctttcgggccattttctgccctttggaaactccggagggtt
ttttggggggggggcggggggttacacggcccccccggggctgagcccgcagatttaaat
aataactctgacgtgggcaagtgggccttgctgagaagacagtgtaacataataatttgc
acctcggcaattgcagagggtcgatctccactttggacacaacagggctactcggtagga
ccagataagcactttgctccctggactgaaaaagaaaggatttatctgtttgcttcttgc
tgacaaatccctgtgaaaggtaaaagtcggacacagcaatcgattatttctcgcctgtgt
gaaattactgtgaatattgtaaatatatatatatatatatatatatctgtatagaacagc
ctcggaggcggcatggacccagcgtagatcatgctggatttgtactgccggaattc
II. Understanding "excess similarity", used to identify homologs (7 pts)
12. Imagine generating two random sequences, each 200 nucleotides in length. If you align them pairwise, at how many positions are they expected to match just by chance? 50 would be expected to match just by chance. To help you answer this question, think about the chance a single randomly chosen nucleotide will match another randomly chosen nucleotide.
13. Now imagine generating two random sequences, each 200 amino acids in length. In a pairwise alignment, at how many positions are they expected to match by chance? 10 would be expected to match by chance.
Read the excerpt below, from a very heavily cited paper by one of the leading figures in sequence analysis, William Pearson.
In light of the information above, try doing a BLAST search with the nucleotide sequence for a gene, and then with the amino acid (translated DNA) sequence, and see how the results differ. We'll try this with an exon of the human LMNA gene.
Investigate nucleotide vs amino acid homology
To explore the dynamics of nucleotide homology or amino acid homology matching, go back to the BLAST website and select blastn (nucleotide BLAST). Copy and paste the hu_LMNA_exon_nt sequence, including the header information, into the empty BLAST query field. Click in the “job title” box, and a description title matching your FASTA header appears. Now, let’s restrict the taxonomy that we will search for homologs. In the “organism” box, select “bony fishes”. Finally, run your BLAST search. Once the hits are returned, click on the Taxonomy link above the graphics window, then select "Taxonomy" next to Reports. The first line will show the number of hits and the number of organisms represented among the homologs returned by the search.
>hu_LMNA_exon_nt
gctacgcctgtcccccagccctacctcgcagcgcagccgtggccgtgcttcctctcactc
atcccagacacagggtgggggcagcgtcaccaaaaagcgcaaactggagtccactgagag
ccgcagcagcttctcacagcacgcacgcactagcgggcgcgtggccgtggaggaggtgga
tgaggagggcaagtttgtccggctgcgcaacaagtccaatgag
>hu_LMNA_exon_aa
LRLSPSPTSQRSRGRASSHSSQTQGGGSVTKKRKLESTESRSSFSQHARTSGRVAVEEVD
EEGKFVRLRNKSNE
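The two FASTA records above are related in a simple way that can be checked with a short sketch: translating the nucleotide record in its second forward reading frame (skipping the first base) with the standard genetic code gives the amino acid record. The helper names `parse_fasta` and `translate` are hypothetical, and the parser is a bare-bones illustration rather than a full FASTA reader.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

FASTA = """>hu_LMNA_exon_nt
gctacgcctgtcccccagccctacctcgcagcgcagccgtggccgtgcttcctctcactc
atcccagacacagggtgggggcagcgtcaccaaaaagcgcaaactggagtccactgagag
ccgcagcagcttctcacagcacgcacgcactagcgggcgcgtggccgtggaggaggtgga
tgaggagggcaagtttgtccggctgcgcaacaagtccaatgag
>hu_LMNA_exon_aa
LRLSPSPTSQRSRGRASSHSSQTQGGGSVTKKRKLESTESRSSFSQHARTSGRVAVEEVD
EEGKFVRLRNKSNE
"""


def parse_fasta(text):
    """Return {header: sequence}; the header is the text after '>'."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line  # sequence lines are joined together
    return records


def translate(dna):
    """Translate codon by codon; '*' is a stop, 'X' an unknown codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


records = parse_fasta(FASTA)
nt = records["hu_LMNA_exon_nt"]
aa = records["hu_LMNA_exon_aa"]
# Skip the first base so that translation starts in forward frame 2.
print("nucleotides:", len(nt), "amino acids:", len(aa))
print("frame 2 translation matches the aa record:", translate(nt[1:]) == aa)
```

The header line is what BLAST uses to fill in the job title, which is why the instructions ask you to paste it along with the sequence.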
14. How many hits and number of organisms are returned for the nucleotide search? There were 806 hits and 318 organisms for the nucleotide search.
Open another BLAST webpage in a new tab of your web browser. Select blastp for the protein search and enter the amino acid FASTA sequence (hu_LMNA_exon_aa) in the empty query field. Make sure your “job title” matches the header information of your FASTA sequence. Restrict the taxonomy to bony fishes and then run your BLAST search. Once the hits are returned, repeat the steps from your previous search to compare the numbers of hits and organisms.
15. How many hits and number of organisms are returned for the amino acid search? Be sure you're looking at the Taxonomy Report section! There were 1,426 hits and 442 organisms.
16. Which search produced hits in a wider range of organisms? The second search (amino acid) produced hits in a wider range of organisms.
When comparing two nucleotide sequences, the chance of a random match at any position is 1/4, since there are only 4 possible nucleotides. In contrast, when comparing two amino acid sequences, the chance of a random match at any position is only 1/20, since there are 20 possible amino acids. Therefore, an amino acid match is more significant, since it is less likely to have occurred only by chance. For example, what if a nucleotide alignment matches at 25% of the positions? That's the % expected to match just by chance. But consider an amino acid alignment that matches at 25% of the positions. That level of identity is highly significant, since only 5% (1/20) of the positions should match just by chance.
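The arithmetic behind questions 12 and 13 can be checked with a small sketch. It assumes, as the text does, that every residue is equally likely and that positions are independent; real sequences have uneven compositions, so this is an idealization. The function names and the number of trials are arbitrary choices for the example.

```python
import random


def expected_matches(length, alphabet_size):
    """Expected chance matches: each position matches with p = 1/alphabet_size."""
    return length / alphabet_size


def simulate_matches(length, alphabet, trials=2000, seed=0):
    """Average number of matching positions between pairs of random sequences."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        a = rng.choices(alphabet, k=length)
        b = rng.choices(alphabet, k=length)
        total += sum(x == y for x, y in zip(a, b))
    return total / trials


NUCLEOTIDES = "ACGT"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

print("nucleotides, expected:", expected_matches(200, len(NUCLEOTIDES)))
print("nucleotides, simulated:", simulate_matches(200, NUCLEOTIDES))
print("amino acids, expected:", expected_matches(200, len(AMINO_ACIDS)))
print("amino acids, simulated:", simulate_matches(200, AMINO_ACIDS))
```

The simulated averages should land close to 200 × 1/4 = 50 for nucleotides and 200 × 1/20 = 10 for amino acids.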
17. Which type of search (nucleotide or amino acid) would you use if you wanted to find homologs to your sequence in distantly related organisms, meaning, those that shared a common ancestor long ago rather than more recently? Explain why. (2 pts) It would be most effective to use amino acid sequences because they are more conserved and less affected by silent mutations, which gives a more accurate reflection of evolutionary relationships than nucleotide sequences do. In addition, a chance match is less likely between amino acids (1/20) than between nucleotides (1/4), so a given level of identity is more significant. Nucleotide sequences also diverge faster, since silent changes accumulate without altering the protein, making them less reliable for distant homology searches.
Note about LW sequence: Mark Boguski, a scientist at the NIH (National Institutes of Health), noticed that the sequence in Jurassic Park wasn’t a plausible dinosaur sequence, so he provided the “LW” sequence to Crichton for use in the sequel. However, he very cleverly embedded his name into the sequence!
III. Finding and Citing Academic Research (3 pts)
Here are some guidelines for how you cite in biology. After reading through them, answer the three questions at the end of this section.
- Provide a citation for the discoverers of the information
  o Do: Find the original source!
  o Don’t: Cite the first paper you come across
- Provide a citation for any new information contributed
  o Cite the people with the idea, AND the people that improved upon it
- Do not use quotations, put ideas into your own words.
  o Remember, if the ideas are not yours, they still need to be cited!
- Common knowledge does not need citation (but what knowledge is common may differ based on the audience)
  o Assume the types of things you learned in AP/intro biology are common knowledge
- Different journals use different styles, but most are similar to APA
- In-text citations:
  o Appear within the text, usually at the ends of sentences, to display where the underlying information originally was discovered
  o We will use the parenthetical style (Author’s Last Name, Year)
  o When one paper has two authors (Last Name and Last Name, Year), when there are more than two authors, write as (Last Name et al., Year)
- References list at the end of the document:
  o Full citation format, we will use APA 6th/7th edition format for journal articles
  o Only cite sources with information that actually made it into the final product, not every paper you read during your research
  o Alphabetical order
  o Citing fake sources or sources you did not actually use is academic misconduct and will receive a harsh penalty
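The parenthetical in-text rule described in the guidelines can be sketched as a small helper. The function name is hypothetical, and the sketch follows only the rule as stated here (one author, two authors, more than two authors); it does not cover the full APA manual. The three-author example names are placeholders.

```python
def in_text_citation(last_names, year):
    """Format a parenthetical in-text citation from author last names."""
    if not last_names:
        raise ValueError("at least one author is required")
    if len(last_names) == 1:
        who = last_names[0]
    elif len(last_names) == 2:
        who = f"{last_names[0]} and {last_names[1]}"
    else:
        who = f"{last_names[0]} et al."  # more than two authors
    return f"({who}, {year})"


print(in_text_citation(["Dobson"], 2003))
print(in_text_citation(["Garrett", "Grisham"], 2016))
print(in_text_citation(["AuthorA", "AuthorB", "AuthorC"], 2020))
```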
Generating Citations
- Using scholar.google.com
- Citation Managers
  o Help you keep track of papers you read and automatically generate citations
  o Integrate with Word and other products
  o Paid: EndNote (free for current students), RefWorks
  o Free: Zotero, Mendeley
Paraphrasing
- In academic Biology writing we do NOT use quotes
- That means you need to put things in your own words (while still citing where the information came from)
- There is a fine line between proper paraphrasing and plagiarism
- All of the following are considered plagiarism:
  o Using text that is directly taken from another source (remember, we don’t use quotes in biology)
  o Changing words but copying the sentence structure of a source
  o Turning in someone else’s work as your own
  o Copying ideas from someone else without giving credit
  o Giving incorrect information about a source
18. After reading the Original Source and the citation, decide whether A, B, or C (below) is more appropriate as an example of content included in a paper you write. _C_ (1 pt)
19. Why are the other two not appropriate? (2 pts) The other two are not appropriate because A uses a direct quote from the source, which you never do when writing in biology, and B does not cite the original source.
Original Source: “The manner in which a newly synthesized chain of amino acids transforms itself into a perfectly folded protein depends both on the intrinsic properties of the amino-acid sequence and on multiple contributing influences from the crowded cellular milieu.”
Dobson, C. M. (2003). Protein folding and misfolding. Nature, 426(6968), 884-890.
A. “The manner in which a newly synthesized chain of amino acids transforms itself into a perfectly folded protein depends both on the intrinsic properties of the amino-acid sequence and on multiple contributing influences from the crowded cellular milieu (Dobson, 2003).”
B. The way in which a protein folds depends both on the intrinsic properties of the amino acid sequence and on other contributing influences from the crowded cell.
C. The way a protein folds is influenced by the sequence of amino acids in the protein and its cellular environment (Dobson, 2003).