Figure 1 - Polymorphism density plot created using a window size of 1,000,000 with an increment of 100,000. (Plot title: Chromosome 6 Polymorphism Density Plot. x-axis: Chromosome Position, ticks from 0e+00 to 5e+07. y-axis: Number of Polymorphisms, ticks at 10, 100, 1000 and 10000.)

HINTS:
1. The VCF file is tab delimited.
2. You don't have to use Biopython to parse the file.
3. Use the library matplotlib to create the image.
4. In the VCF file, the polymorphism data for each individual starts with the genotype call. 0/0 means the individual does NOT have a polymorphism at that location and 1/1 means that it does have a polymorphism at that location.

When resequencing a genome, a researcher is often interested in how the polymorphisms in a genome are positioned relative to the reference. They may have questions like "are the polymorphisms evenly distributed or are they concentrated in particular regions of the genome?". For this assignment, you will write a Python program to parse a Variant Call Format (VCF) file and then create a polymorphism density plot from the data extracted from the file.

Briefly, a VCF file is a standard text file format for recording information about polymorphisms found in a genome. The file begins with a header section (lines beginning with the '#' symbol) followed by a title line, with the polymorphism records appearing after that. There is one record per line, and each record captures information such as where in the genome the polymorphism was found, what the polymorphism is relative to the reference, and what kind of data is present to support the 'calling' of the polymorphism. This could be information like the number of sequencing reads supporting the call and the quality score assigned to the call. The variant call information for more than one individual can be present in a record.

To calculate the polymorphism density, you do the following for each individual:
1. Establish a window X bases wide and count the number of polymorphisms in that window.
2. Record the polymorphism count and the start position of the window.
3. Shift the window down the chromosome by Y bases and count the number of polymorphisms in the window. You will be counting many of the same polymorphisms you counted in the previous window.
4. Record the polymorphism count and the current start position of the window.
5. Continue moving the window down the chromosome by Y bases, counting the polymorphisms, and recording the count and position data until your window reaches the end of the chromosome.
6. Do this for all of the individuals in the VCF file.
7. Create a line graph of the (position, count) data for each individual. The graph should present one line for each individual (parent) in the file.

NOTE: Assume the VCF file will only have data for 1 chromosome.

The program will prompt the user for the name of the VCF file, the window size, and the increment value.
The program will create a polymorphism density plot similar to the example given.

This assignment will be marked on the following:
1. Correctness of function
2. Clearly written, formatted and documented code
3. Proper error handling
4. Formatting of the polymorphism density image
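The windowing steps in the assignment can be sketched as follows. This is a minimal illustration, not a full solution: the VCF lines, the sample names `parentA` and `parentB`, and the tiny window values are made up so the example runs on its own. It assumes only `1/1` calls count as polymorphisms (per hint 4), that the chromosome ends at the last record position, and that windows are half-open intervals starting at position 0. It is written without `def`, and uses the standard-library `bisect` module to count positions inside each window.

```python
from bisect import bisect_left

# Made-up VCF lines for illustration: one meta line, the title line, four records.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tparentA\tparentB",
    "chr6\t5\t.\tA\tG\t50\tPASS\t.\tGT:DP\t1/1:12\t0/0:9",
    "chr6\t12\t.\tC\tT\t50\tPASS\t.\tGT:DP\t1/1:10\t1/1:8",
    "chr6\t18\t.\tG\tA\t50\tPASS\t.\tGT:DP\t0/0:11\t1/1:7",
    "chr6\t31\t.\tT\tC\t50\tPASS\t.\tGT:DP\t1/1:14\t0/0:6",
]
window_size = 20  # X in the assignment (tiny here; the figure uses 1,000,000)
increment = 10    # Y in the assignment (the figure uses 100,000)

sample_names = []
positions = {}    # sample name -> positions where that sample has a polymorphism
last_pos = 0      # assumed chromosome end: the largest record position seen

for line in vcf_lines:
    if line.startswith("##"):
        continue  # meta-information header line
    fields = line.rstrip("\n").split("\t")
    if line.startswith("#CHROM"):
        # Title line: sample names start at the tenth column.
        sample_names = fields[9:]
        positions = {name: [] for name in sample_names}
        continue
    pos = int(fields[1])
    last_pos = max(last_pos, pos)
    for name, call in zip(sample_names, fields[9:]):
        genotype = call.split(":")[0]  # the genotype call comes first
        if genotype == "1/1":          # assumption: only 1/1 is counted
            positions[name].append(pos)

# Slide the window: each window covers start <= position < start + window_size.
window_starts = list(range(0, last_pos + 1, increment))
density = {}
for name in sample_names:
    ordered = sorted(positions[name])
    density[name] = [
        bisect_left(ordered, start + window_size) - bisect_left(ordered, start)
        for start in window_starts
    ]

for name in sample_names:
    print(name, list(zip(window_starts, density[name])))
```

The `window_starts` and `density[name]` lists are the x and y values for one line per individual. The tick labels in Figure 1 suggest a logarithmic count axis; if so, windows with a count of zero would need separate handling when plotting.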
P.S. I cannot attach the chr02.vcf.gz file, so here is a link where you can access it: https://easyupload.io/261jzv
Please make sure that your graph matches the one attached to this. Do NOT define functions with def.
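Since the data file is gzip-compressed, one possible way to read it directly is the standard-library `gzip` module in text mode. The sketch below is an assumption about how that could look, not the expected solution: it writes a tiny made-up file named `example.vcf.gz` to a temporary directory so it can run on its own, whereas the real program would take the file name from `input()`. It avoids `def` and wraps the read in a `try`/`except` as a basic form of error handling.

```python
import gzip
import os
import tempfile

# Create a tiny made-up gzipped VCF so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.vcf.gz")  # hypothetical file name
with gzip.open(path, "wt") as handle:
    handle.write("##fileformat=VCFv4.2\n")
    handle.write("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tparentA\n")
    handle.write("chr2\t100\t.\tA\tG\t50\tPASS\t.\tGT\t1/1\n")
    handle.write("chr2\t250\t.\tC\tT\t50\tPASS\t.\tGT\t0/0\n")

# Pick the opener from the extension so plain .vcf files also work.
opener = gzip.open if path.endswith(".gz") else open

header_fields = []
record_count = 0
try:
    with opener(path, "rt") as handle:  # "rt" yields text lines, not bytes
        for line in handle:
            if line.startswith("##"):
                continue  # meta-information header line
            fields = line.rstrip("\n").split("\t")
            if line.startswith("#CHROM"):
                header_fields = fields  # title line with sample names
            else:
                record_count += 1
except OSError as err:
    # Covers a missing file, permission problems, and invalid gzip data.
    print(f"Could not read {path}: {err}")

print("samples:", header_fields[9:], "records:", record_count)
```

Reading line by line keeps memory use low, which matters for a whole-chromosome file.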