Lab 3

School

Johns Hopkins University

Course

633

Subject

Biology

Date

Jan 9, 2024

Type

docx

Pages

14

Uploaded by murphydanyael

Intro to Bioinformatics - Lab 3
Name: Danyael Murphy

1. Goal: Align the histone H3 variants using MUSCLE and MAFFT (default settings) and answer some questions.
Given: The histone H3 protein sequences in the table below.
MUSCLE: http://www.ebi.ac.uk/Tools/msa/muscle/
MAFFT: http://mafft.cbrc.jp/alignment/server/

seq #   Accession   Histone variant (description)
1       AAN10051    H3.1
2       AAN39283    H3.2
3       P84243      H3.3
4       NP_003484   H3t
5       P49450      CENP-A

a. (0.2) What can you say about the protein regions with more asterisks (regarding sequence conservation)?
Regions with more asterisks are more conserved: each asterisk marks a column in which all sequences have the same residue.

b. (0.1) Looking at the N-terminal 40 amino acids, which sequence differs most from the others?
P49450 (CENP-A)

c. (0.2) Find the discrepancies between the MUSCLE and MAFFT results. Specifically, where is the alignment different?
The alignments differ in gap placement. MAFFT places a gap within the first 40 amino acids, and its gap at the C-terminal end is shifted by one position relative to MUSCLE.

d. (0.1) In PubMed, find a 2005 paper by Govin et al. with “histone H3” in the title. Open the full text. What figure in that paper resembles your alignment?
Figure 2

e. (0.2) Does the alignment more closely match MAFFT or MUSCLE?
MUSCLE

f. (0.1) Which of the original five proteins is the protein mentioned in the title of the Govin paper (look in figure legend)?
H3t (NP_003484)

g. (0.1) What is CENP-A? With what DNA sequences is CENP-A associated?
Centromere protein A (CENP-A) is a histone H3 variant that is specifically associated with centromeric DNA.

2. Goal: Perform UPGMA phylogeny on the neuraminidases_21 sequences
Given: Align with MAFFT, then click Phylogenetic tree. Run NJ but turn on bootstrap, 100 replicates.
a. (0.1) In the MAFFT alignment, look for a string of six identical amino acids across all sequences. What are those six consecutive letters?
ILRTQES

b. (0.2) Looking at the phylogenetic tree, list all sister taxa.
QJA10732.1 & QDQ70145.1
QEO32938.1 & QEO32938.1

c. (0.2) Do you think it’s wise to try several multiple alignment programs? Briefly explain.
Yes. Different programs use different MSA algorithms, and gaps for insertions and deletions are difficult to place accurately, so it is best to compare the output of several programs and choose the best alignment.

d. (0.2) One sequence was isolated in 1934. Does it appear in the tree where you would expect and why?
Yes

e. (0.2) Looking at the tree, which sequence represents the outgroup and from what virus is that sequence?
QZP79717.1, Influenza B virus

f. (0.1) Submit the tree image as a separate file.

3. Goal: Compare Genscan and FGENESH results to annotation in NCBI GenBank.
Given: NCBI Genomic sequence X87333.1, Genscan, FGENESH, Pairwise BLAST.
Hint: Use raw format (NOT FASTA) for Genscan. Select Arabidopsis for organism.

a. (0.4) Fill in locations from GenBank record and the two prediction outputs.
GenBank record (NCBI)   GENSCAN prediction   FGENESH prediction
266..517                169..346             266..517
826..1011               371..449             826..1011
1212..1382              843..1028            1212..1382
1559..1675              1229..1399           1559..1675

Do a BLAST pairwise alignment between the predicted peptide from Genscan and the corresponding protein sequence from GenBank (NCBI).

b. (0.3) Is the alignment complete (start to finish) or do only part of the protein sequences align?
The sequences are 95% identical. Most of the protein sequences align, but it is not a complete alignment.

c. (0.3) Explain why these sequences should either partially align or fully align, based on the accuracy of the prediction.
The sequences should only partially align because the prediction is not 100% accurate: only the regions translated from correctly predicted exons match the annotated protein.

4. Goal: Use ORF Finder to find open reading frames in archaeal genomic DNA.
Given: attached sequence from Halosiccatus sp. LT50, FASTA and GenBank formats
ORF Finder at NCBI (https://www.ncbi.nlm.nih.gov/orffinder/)

a. (0.1) At what nucleotide position is the longest ORF?
2482..4437

b. (0.1) Look at the locations of the three longest predicted ORFs. List discrepancies on the GenBank record that were not exactly predicted by ORF Finder (Hint: for the longest ORF, look for a CDS that starts at 3988 in the GenBank record).

ORF           ORF Finder    GenBank
1st longest   2482..4437    3988..4437
2nd longest   5949..7181    5949..7181
3rd longest   2798..3946    3122..3946

c. (0.1) In the GenBank record, what is the protein accession number for the CDS that starts at 3988?
WP_226011558.1

d. (0.1) Try a BLAST search off this page with the longest ORF. Change to the nr database, no organism limits. List the protein lengths for the
GenBank protein, the ORF Finder prediction, and the BLAST hit from Halosiccatus urmianus.

Sequence                Length (protein)
GenBank record          149
ORF Finder prediction   651
BLAST best hit          149

e. (0.2) Try a pairwise alignment between the ORF Finder prediction and the BLAST hit. Examine the protein alignment. Pay close attention to the N-terminus and C-terminus of each sequence. Where do the sequences differ?
They differ in the positions of the N-terminus and C-terminus. In the ORF Finder sequence, the alignment starts at position 503 and ends at position 651, so its first 502 residues have no counterpart in the BLAST hit.

f. (0.2) Run FGENESB (ARCHAE generic) with the sequence that you started with (halosiccatus.txt). Based on ORF Finder, BLAST, the GenBank record and FGENESB, what is the likely start position for the CDS that ends at 4437?
3988

g. (0.2) Based on the BLAST hit and the species from which this came, do you think that this is a gene unique to this species or a gene prediction error? Briefly explain. Either answer is acceptable; your explanation should support it.
This is a gene unique to this species. In the BLAST search, the gene was found among halophiles.

5. Goal: Compare Clustal Omega, MUSCLE, T-Coffee
Given: gluperox.txt (sequences), gluperox_x.txt (sequences for MUSCLE).
Note: Use the “x” file for MUSCLE because the sequences contain selenocysteine (u). The MUSCLE algorithm cannot process selenocysteine, so the “x” file replaces every “u” with “x”.
Run multiple alignment programs using the attached glutathione peroxidases (gluperox). Start with Clustal Omega at EBI. Select Clustal w/ numbers as the output.

a. (0.1) Which sequences appear to have a longer N-terminus than the others?
ABH10623.1 & ABP73388.1

b. (0.3) An asterisk represents a conserved position that is identical in all seven sequences. How many such positions are present using all three methods?

Clustal Omega   34
MUSCLE          34
T-Coffee        34
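The asterisk count in 5b can be cross-checked by hand or with a few lines of code. The following is a minimal sketch, assuming the alignment has already been read into equal-length strings with "-" for gaps. The function name and the toy sequences are made up for illustration and are not the gluperox data.

```python
def count_conserved_columns(aligned):
    """Count alignment columns in which every sequence has the same residue.

    Columns containing a gap ("-") are never counted, which mirrors how an
    asterisk is only printed under fully identical columns.
    """
    if not aligned:
        return 0
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")

    conserved = 0
    for column in zip(*aligned):  # iterate column by column
        first = column[0]
        if first != "-" and all(residue == first for residue in column):
            conserved += 1
    return conserved


# Toy alignment, invented for this example only.
toy = ["MKU-GA", "MKUAGA", "MRU-GA"]
print(count_conserved_columns(toy))  # 4 (columns M, U, G, A)
```

Running the same function on the output of each program gives a quick check that the three asterisk counts agree.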
c. (0.1) Compare the Clustal Omega output to the gluperox image. The researchers used ClustalW to produce that figure. How close is each output to what is seen in the figure?
The outputs are similar to the image, with some subtle differences in gap placement.

d. (0.1) Can you find any major differences (other than order of sequences)?
The N-terminus and C-terminus are different.

e. (0.1) The sequences all have one selenocysteine residue (u). At what amino acid position is selenocysteine in the sequence ADC35417.1?
51

f. (0.2) Which programs successfully aligned selenocysteine?
Clustal Omega

g. (0.1) Name the amino acid that comes immediately before selenocysteine in most sequences. Then name the amino acid that follows selenocysteine in most sequences.
Lysine (before) and glycine (after)

6. Goal: Compare MP and ML phylogeny using MEGA.
Given: dpp4.fst. It contains mRNA sequences of dipeptidyl-peptidase 4 from various organisms.
MEGA, starting with MUSCLE and continuing to MP.

a. (0.3) Submit a maximum parsimony tree with bootstrap numbers (separate file)

b. (0.2) A specific sequence phylogeny does not always exactly match species phylogeny. Give two reasons why that could happen.
Speciation events can occur very quickly, and the coalescence of specific genes can occur before speciation events.
c. (0.2) The dppimage file is from a research paper using these sequences. Looking specifically at bat and flying fox species, how closely does your tree match the image?
My tree is similar but not a 100% match.

d. (0.3) The paper from which the image was selected suggests that the dpp4 bat sequences are positively selected genes. If that is the case, which would you expect to be higher: the non-synonymous substitution rate or the synonymous substitution rate? Briefly explain.
If the gene is positively selected, I would expect the non-synonymous substitution rate to be higher. Positive selection favors amino acid changes that confer a fitness benefit, so non-synonymous substitutions are fixed faster than synonymous substitutions, which leave the protein unchanged.

7. (2 points) Goal: Evaluate gene prediction programs using benchmark data
Given: The file benchmark.txt, FGENESH (http://www.softberry.com/berry.phtml?topic=fgenesh&group=programs&subgroup=gfind), Genscan (http://hollywood.mit.edu/GENSCAN.html), AUGUSTUS (https://usegalaxy.org/)
Hint: Use Toxoplasma gondii as the model for Plasmodium in AUGUSTUS. In Genscan, use vertebrate. Plasmodium is available in FGENESH.

a. (0.8) Fill in the tables for human (top table) and Plasmodium (bottom table) benchmark sequences. There are more lines in the table than needed.
Human benchmark sequence (exon locations):

Exon   FGENESH   Genscan   AUGUSTUS
1
2
3
4
5
6
7

Plasmodium benchmark sequence (exon locations):

Exon   FGENESH   Genscan   AUGUSTUS
1
2
3
4
5
6
7

b. (0.6) Devise a strategy to determine which is the most accurate prediction for Plasmodium. You can use any computer tool we have covered in the course so far, other than looking up the correct answers in a GenBank record or elsewhere. Describe the strategy in detail.
Use a blastp pairwise alignment to compare the protein encoded by each predicted gene with the protein product encoded by the actual gene in the test sequence.

c. (0.4) Present the CDS location for each sequence as you believe it to be correct. Briefly (a sentence each) explain why you chose those locations.

d. (0.2) Ultimately what is the best way to verify that prediction, assuming you have access to nucleotide sequencing technology?
The validity of a gene prediction can be verified with RNA-seq data.

8. (1) Goal: Take a predicted protein, find orthologues & run phylogeny.
Given: augustus.txt, the predicted protein from Taphrina deformans
Find the orthologous protein sequences for human, rat, mouse, horse, chicken and Tasmanian devil using BLAST. Build a phylogenetic tree using any method from those six sequences plus the T. deformans protein prediction AND the ortholog from the organism listed next to your name in the table at the end of this document (eight total protein sequences). Submit the tree (separate file) and state your methodology.
1. Run blastp to see whether the protein is present in each organism of interest.
2. Align the sequences using MUSCLE.
3. From the alignment, generate a maximum parsimony tree with bootstrap values using MEGA.

9. (0.5 points) Given the tree below, answer a few questions.
--Reis et al. (2014) Int J Dev Biol. 58(5):355-362

a. (0.25) Which rabbit gene is a paralog of Rabbit Tiki1?
Rabbit Tiki2

b. (0.25) Which human gene is an ortholog of Rabbit Tiki2?
Human Tiki2

c. (0.25) Which is more closely related to Rabbit Tiki1, the rabbit paralog or the chicken ortholog?
Chicken ortholog

d. (0.25) Which (frog paralog/human ortholog) has a more recent common ancestor with Frog Tiki1?
Human Tiki1

10. Goal: Run gene prediction on a bacterial genomic sequence of an operon.
Given: echinicola.txt, a genomic sequence from Echinicola sp. SCS 3-6
Use FGENESB (http://www.softberry.com/berry.phtml?topic=fgenesb&group=programs&subgroup=gfindb) BACTERIAL generic

(0.5) List all predicted ORF locations.
10..546
579..1247
1225..1686

(0.5) Assume that the entire genomic sequence represents one full-length mRNA. Draw the gene structure below (from FGENESB), making each CDS a rectangle and marking the numerical locations of the start and stop of each CDS. The sequence length is 1690 bp.
Positions marked on the drawing: 1, 10, 546, 579, 1225, 1247, 1686

11. (0.5 points) The following mouse gene variants are three differentially spliced transcripts of the mouse gene Mttp.
--Suzuki et al. (2016). PLoS One. 11(1):e0147252

The black represents coding regions (CDS) and the blank parts of boxes represent the 3’ UTR. Exons 3 through 18 are not shown in the diagram.

a. (0.2) If MTP-A and MTP-B mRNAs each have 18 exons, how many exons make up MTP-C mRNA?
MTP-C must also have 18 exons.

b. (0.3) Assume that the start codon in Ex1A is the same for MTP-A and MTP-C. Which of the three mRNAs produces a protein sequence that is different from the other two?
MTP-B

12. Goal: Attempt to verify results from a research paper.
Given: the file FtMTP.fasta, which contains the sequences used in phylogenetic analysis in the paper.
The 2022 paper from Plants on the MTP gene family (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9003181/#app1-plants-11-00850)

a. (0.3) The authors used Neighbor-Joining with the following options:
Bootstrap: 1000 replicates
They aligned the sequences using ClustalW(?!)
Use MUSCLE to align. Submit your neighbor-joining tree with bootstrap numbers (separate file).

b. (0.3) Does the topology match? Note that the top to bottom order might be different, but are the branches essentially the same?
The topology is similar but not an exact match.

c. (0.2) Do the bootstrap numbers match? Would you expect them to match exactly?
No, and I would not expect them to match exactly. The authors aligned with ClustalW rather than MUSCLE, and they likely made modifications to the alignment before running the phylogeny program. They may also have combined results from several alignment and phylogeny tools to produce the NJ tree with bootstrap numbers.

d. (0.2) Explain why it might be helpful to edit a multiple sequence alignment before running a phylogeny program.
The phylogenetic tree is built from the alignment, so errors in the alignment carry over into the tree. It is best to start from the most accurate alignment to produce the most accurate tree.
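One further reason the bootstrap numbers in 12c rarely match exactly is that bootstrapping resamples alignment columns at random. The sketch below shows only that resampling step on a toy alignment. It assumes equal-length aligned strings, uses hypothetical names, and is not how MEGA or any other package implements it.

```python
import random


def bootstrap_alignment(aligned, rng):
    """Build one bootstrap pseudo-replicate of an alignment.

    Columns are drawn with replacement until the replicate is as long as
    the original, and the same draws are applied to every sequence so that
    each column stays intact.
    """
    n_cols = len(aligned[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(seq[i] for i in picks) for seq in aligned]


# Toy nucleotide alignment, invented for this example only.
toy = ["ACGTAC", "ACGTTC", "ACCTAC"]

# Two differently seeded generators stand in for two separate runs.
rep_a = bootstrap_alignment(toy, random.Random(1))
rep_b = bootstrap_alignment(toy, random.Random(2))
print(rep_a)
print(rep_b)
```

A tree is built from each replicate, and the support for a branch is the percentage of replicates in which that branch appears. Different random draws, or a different starting alignment, therefore shift the percentages.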
Name                Organism
Sam Andem           Oryctolagus cuniculus
Josephine Buclez    Oryctolagus cuniculus
Jake Carter         Salmo salar
Wei-Shih Chen       Bombyx mori
Blessing Ekpo       Oryctolagus cuniculus
Congyu Liu          Bombyx mori
Yi Liu              Xenopus tropicalis
Lincoln Moore       Salmo salar
Danyael Murphy      Xenopus tropicalis
Sanjana Ramanujam   Oryctolagus cuniculus
Cruz Rivera         Danio rerio
Makenna Roof        Ovis aries
Joshua Rudolph      Xenopus tropicalis
Elizabeth Saurage   Salmo salar
Sabina Sendek       Danio rerio
Sam Tennison        Oryctolagus cuniculus