lab10_rnaseqIII_280

Lab10: RNAseqII Answers
Mehul Goyal
11/12/2023

Q1. As noted above, there are two steps in QC analysis, which includes sample-level and gene-level steps. Why would a scientist want to ensure that samples/replicates are consistent with each other? Why would we want to investigate the genes and filter some genes out of the analysis?

Answer: Consistency between samples helps identify outliers or technical issues that may have been introduced during processing, preparation, or sequencing. Identifying outliers helps ensure that the data are high quality and that any observed differences can be attributed to biological rather than technical factors. We investigate and filter genes to help reduce false positives (by removing genes that are not highly expressed) and false negatives (by removing genes that may not be relevant to what is being studied), which can also yield a more statistically meaningful and computationally efficient gene set.

Load the data

raw_counts <- read.csv('lmajor_counts.csv', row.names = 1)

Q2: Review the lab06: basic linear regression lab if necessary. a) What is the range of values a correlation coefficient can be? b) What would a coefficient of 0 indicate? A score of -.7? A score of .8?

Answer: The range of values is -1 to 1. A coefficient of 0 indicates no linear correlation: x is not a good linear predictor of y. A score of -0.7 indicates a moderately strong negative correlation: as x increases, we expect y to decrease. A score of 0.8 indicates a strong positive correlation: as x increases, y tends to increase as well.

Correlation Matrix

cor_matrix <- cor(raw_counts)
cor_matrix

##              procyclic_1 metacyclic_1 procyclic_2 metacyclic_2 procyclic_3
## procyclic_1    1.0000000    0.6935682   0.9414528    0.6067164   0.8940239
## metacyclic_1   0.6935682    1.0000000   0.8055050    0.9250951   0.8180058
## procyclic_2    0.9414528    0.8055050   1.0000000    0.6824871   0.9794223
## metacyclic_2   0.6067164    0.9250951   0.6824871    1.0000000   0.6814958
## procyclic_3    0.8940239    0.8180058   0.9794223    0.6814958   1.0000000
## metacyclic_3   0.5855459    0.8866617   0.6754010    0.9493425   0.6975117
## procyclic_4    0.9571108    0.7195456   0.9472523    0.6405438   0.9369850
## metacyclic_4   0.5953419    0.8867047   0.6887470    0.9290594   0.7208074
## procyclic_5    0.9646128    0.7223971   0.9361184    0.6635231   0.9167637
## metacyclic_5   0.5311189    0.8698172   0.6140373    0.8995406   0.6375646
##              metacyclic_3 procyclic_4 metacyclic_4 procyclic_5 metacyclic_5
## procyclic_1     0.5855459   0.9571108    0.5953419   0.9646128    0.5311189
## metacyclic_1    0.8866617   0.7195456    0.8867047   0.7223971    0.8698172
## procyclic_2     0.6754010   0.9472523    0.6887470   0.9361184    0.6140373
## metacyclic_2    0.9493425   0.6405438    0.9290594   0.6635231    0.8995406
## procyclic_3     0.6975117   0.9369850    0.7208074   0.9167637    0.6375646
## metacyclic_3    1.0000000   0.6641136    0.9715132   0.6794432    0.9460085
## procyclic_4     0.6641136   1.0000000    0.6813375   0.9909555    0.6037955
## metacyclic_4    0.9715132   0.6813375    1.0000000   0.6887118    0.9621586
## procyclic_5     0.6794432   0.9909555    0.6887118   1.0000000    0.6268065
## metacyclic_5    0.9460085   0.6037955    0.9621586   0.6268065    1.0000000

Q3: Review the output of the cor() function by inspecting the cor_matrix variable. What do we mean by pairwise correlations?

Answer: A pairwise correlation is the correlation between two variables (in our case, two samples) at a time. The cor() function generates a matrix of these pairwise correlations, which can be used to identify strong relationships between variables/samples.

??cor_matrix

## No vignettes or demos or help files found with alias or concept or
## title matching 'cor_matrix' using regular expression matching.

Retrieve correlation

cor_matrix[1, 2]

## [1] 0.6935682

Q4. Notice that the values along the diagonal of the matrix are all 1's. What is the reason for this?

Answer: The values along the diagonal are all 1's because each one is the pairwise correlation of a sample with itself. A variable compared with itself is identical, so it is perfectly correlated and the score is 1.

Hierarchical Clustering

hc <- hclust(as.dist(1 - cor_matrix))

Plot Hierarchical Clustering

plot(hc)
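The correlation-then-clustering steps above can be sketched on a small made-up matrix. This is only an illustration: the object names (`toy_counts`, `toy_cor`, `toy_hc`) and all values are invented and are not part of the lab data. It checks that each matrix entry is the correlation of one pair of columns, that the diagonal is 1, and that using `1 - correlation` as a distance groups similar samples together.

```r
# Toy matrix: 6 genes (rows) x 4 samples (columns). All values are invented.
set.seed(1)
base_a <- c(100, 250, 30, 800, 5, 60)   # hypothetical profile for condition "a"
base_b <- c(20, 400, 300, 90, 50, 10)   # hypothetical profile for condition "b"
toy_counts <- cbind(
  a_1 = base_a + rnorm(6, sd = 5),
  b_1 = base_b + rnorm(6, sd = 5),
  a_2 = base_a + rnorm(6, sd = 5),
  b_2 = base_b + rnorm(6, sd = 5)
)

# cor() on a matrix correlates every pair of columns (samples).
toy_cor <- cor(toy_counts)

# One entry of the matrix is the correlation of that one pair of columns.
pair_a1_b1 <- cor(toy_counts[, "a_1"], toy_counts[, "b_1"])
same_value <- isTRUE(all.equal(toy_cor["a_1", "b_1"], pair_a1_b1))

# Turn similarity into dissimilarity: highly correlated samples get a
# distance near 0, so hclust() merges them first.
toy_hc <- hclust(as.dist(1 - toy_cor))
groups <- cutree(toy_hc, k = 2)   # cut the tree into two clusters
print(groups)
```

Calling `plot(toy_hc)` afterwards would draw the corresponding dendrogram, in the same way as `plot(hc)` in the lab.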
Q5. Create a heatmap as described above. The dendrogram should be clustering the columns (samples). Use margins = c(10,10), which zooms out the heatmap so you can see it. (Hint #1: use the help function to load the documentation). (Hint #2: There should be an input and two more arguments for the function).

Answer:

library(gplots)

##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
##     lowess

raw_counts_matrix <- as.matrix(raw_counts)
heatmap.2(raw_counts_matrix, dendrogram = "column", margins = c(10, 10))

Convert raw counts to log scale

log_counts <- log2(raw_counts + 1)

Q6: Create another heatmap using the logged data. Same rules apply as above. To clean up the output, we want to remove the "trace". Add a fourth argument to the function that removes the "trace" from the heatmap. (Hint: once again, look at the documentation!)

Answer:

log_counts_matrix <- as.matrix(log_counts)
heatmap.2(log_counts_matrix, dendrogram = "column", margins = c(10, 10), trace = "none")

Q7: Make a heatmap that clusters the correlation matrix. Remember that the correlation matrix is already a matrix! We still need to remove the trace and keep the same margin. However, this time we don't need to specify the dendrogram.

Answer:

heatmap.2(cor_matrix, margins = c(10, 10), trace = "none")
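One likely reason for drawing the second heatmap on `log2(raw_counts + 1)` is that raw counts span several orders of magnitude, so a few highly expressed genes dominate the color scale. The sketch below uses invented counts (`toy_raw` is a hypothetical vector, not lab data) to show how the transform compresses that range and why the `+ 1` is added.

```r
# Invented counts spanning several orders of magnitude.
toy_raw <- c(0, 1, 10, 100, 1000, 100000)

# Same transform as in the lab: add a pseudocount of 1, then take log2.
toy_log <- log2(toy_raw + 1)

# The raw range is dominated by the largest value; the logged range is compact.
print(range(toy_raw))
print(range(toy_log))

# The pseudocount keeps zero counts finite: log2(0) is -Inf, log2(0 + 1) is 0.
print(log2(0))
print(log2(0 + 1))
```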
Q8. Why is the correlation heatmap symmetric?

Answer: For the same reason that the correlation matrix has a diagonal of 1's, the heatmap is symmetric because the same samples appear on both axes. The correlation of sample A with sample B equals the correlation of sample B with sample A, so cells mirrored across the diagonal hold exactly the same value.

Calculate variance of all rows

Note: called on the whole data frame, var() returns the covariance matrix between samples (columns) rather than one variance per gene (row), which is why the row-wise helper function further down is needed.

var(raw_counts)

##              procyclic_1 metacyclic_1 procyclic_2 metacyclic_2 procyclic_3
## procyclic_1    191039681     69137405   162836927     60149036   206552862
## metacyclic_1    69137405     52014470    72698074     47855254    98614052
## procyclic_2    162836927     72698074   156597600     61258748   204872111
## metacyclic_2    60149036     47855254    61258748     51447263    81708007
## procyclic_3    206552862     98614052   204872111     81708007   279409143
## metacyclic_3    66809947     52788386    69770623     56211143    96247654
## procyclic_4    136493110     53543546   122305200     47404186   161599362
## metacyclic_4    77850451     60502591    81542712     63045993   113991481
## procyclic_5    153025027     59797876   134453155     54624172   175883615
## metacyclic_5    65751672     56187972    68824107     57790332    95454804
##              metacyclic_3 procyclic_4 metacyclic_4 procyclic_5 metacyclic_5
## procyclic_1      66809947   136493110     77850451   153025027     65751672
## metacyclic_1     52788386    53543546     60502591    59797876     56187972
## procyclic_2      69770623   122305200     81542712   134453155     68824107
## metacyclic_2     56211143    47404186     63045993    54624172     57790332
## procyclic_3      96247654   161599362    113991481   175883615     95454804
## metacyclic_3     68145430    56564958     75875214    64375287     69946614
## procyclic_4      56564958   106456798     66509225   117351468     55799439
## metacyclic_4     75875214    66509225     89508800    74785611     81532888
## procyclic_5      64375287   117351468     74785611   131733232     64436869
## metacyclic_5     69946614    55799439     81532888    64436869     80224418

Create function

rowVars <- function(x) {
  apply(x, 1, var)
}

Filter genes with variance = 0

raw_counts_filtered <- raw_counts[rowVars(raw_counts) != 0, ]

Q9. a) Explain how the above line of code filters the data. (Hint: start with the innermost function/operation and work your way outwards). b) Why is it necessary to filter out genes with zero variance before performing PCA analysis?

Answer: a) rowVars(raw_counts) first computes the variance of each row (gene) across the samples. The != 0 operator then creates a logical vector that is TRUE for rows with non-zero variance and FALSE for rows with zero variance. Finally, that vector is used to subset raw_counts, and the rows with non-zero variance are stored in the variable raw_counts_filtered. b) Filtering out genes with zero variance is necessary because they work against the purpose of PCA, which is to retain as much of the variation as possible when identifying the principal components. A gene with no variation contributes nothing to that, and with scale = TRUE it cannot be rescaled to unit variance at all, so prcomp() stops with an error. Removing these genes lets the PCA run and keeps it focused on genes that actually vary.

Q10. How many genes were filtered because of low variance? Use code to find the answer!

Answer:

numfil <- nrow(raw_counts) - nrow(raw_counts_filtered)
numfil

## [1] 24

Compute PCA

fit <- prcomp(t(raw_counts_filtered), scale = TRUE)

Plot PCA

plot(fit$x[, 1], fit$x[, 2], col = rep(c('red', 'blue'), 5))
text(fit$x[, 1], fit$x[, 2], colnames(raw_counts_filtered), pos = 3)
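The filtering logic from Q9 and Q10 can be traced step by step on a tiny made-up table. In this sketch the data frame `toy` and its gene and sample names are hypothetical; `rowVars` is the same helper defined in the lab. It also shows what happens when a zero-variance gene is left in before a scaled PCA. The argument is spelled `scale.` here, which is the full name that the lab's `scale = TRUE` partially matches.

```r
# Same helper as in the lab: variance of each row (gene) across samples.
rowVars <- function(x) {
  apply(x, 1, var)
}

# Invented counts: 4 genes x 3 samples. geneB and geneC never change.
toy <- data.frame(
  s1 = c(5, 0, 12, 7),
  s2 = c(9, 0, 12, 3),
  s3 = c(2, 0, 12, 8),
  row.names = c("geneA", "geneB", "geneC", "geneD")
)

# Innermost step first: per-gene variance, then the logical test, then the subset.
gene_var <- rowVars(toy)
keep <- gene_var != 0
toy_filtered <- toy[keep, ]
n_removed <- nrow(toy) - nrow(toy_filtered)
print(n_removed)

# With scaling on, a constant gene cannot be rescaled to unit variance,
# so prcomp() raises an error; tryCatch() captures the message instead.
unfiltered_result <- tryCatch(
  prcomp(t(toy), scale. = TRUE),
  error = function(e) conditionMessage(e)
)
print(unfiltered_result)

# After filtering, the same call succeeds.
filtered_fit <- prcomp(t(toy_filtered), scale. = TRUE)
```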
Q11. In the PCA plot created above, which sample appears to differ the most from the other samples in its condition?

Answer: procyclic_3, because it lies far away from the other procyclic samples, which are clustered together.

Calculate Total Variance

round(fit$sdev ** 2 / sum(fit$sdev ** 2), 2) * 100

## [1] 48 23 8 7 4 4 3 2 1 0

PCA Plot with Variance Explained

var_explained <- round(fit$sdev ** 2 / sum(fit$sdev ** 2), 2) * 100
plot(fit$x[, 1], fit$x[, 2], col = rep(c('red', 'blue'), 5),
     xlab = sprintf("PC1 (%d%% variance)", var_explained[1]),
     ylab = sprintf("PC2 (%d%% variance)", var_explained[2]))
text(fit$x[, 1], fit$x[, 2], colnames(raw_counts_filtered), pos = 3)

Q12. Which aspect of the samples does PC1 appear to correspond to in the plot above?
Answer: PC1 appears to correspond to the condition (life-cycle stage) of the samples: it separates the procyclic samples from the metacyclic samples. PC1 also explains more of the variance than PC2 (48% versus 23%).
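The variance-explained calculation used for the axis labels can be checked against R's own PCA summary. The sketch below runs on simulated data, so `toy_expr`, `group_shift`, and the sample names are all invented, and the numbers it produces are unrelated to the lab results. It assumes two conditions that differ by a per-gene shift, which is the kind of structure that would make PC1 line up with condition.

```r
# Simulated expression: 50 genes x 6 samples, alternating conditions "a" and "b".
set.seed(42)
n_genes <- 50
group_shift <- rnorm(n_genes, sd = 3)   # invented per-gene difference between conditions

toy_expr <- sapply(1:6, function(i) {
  noise <- rnorm(n_genes)
  if (i %% 2 == 0) noise + group_shift else noise   # even columns are condition "b"
})
colnames(toy_expr) <- paste0(rep(c("a", "b"), 3), "_", rep(1:3, each = 2))

# Samples must be rows for prcomp(), hence the transpose.
toy_fit <- prcomp(t(toy_expr), scale. = TRUE)

# Proportion of variance per component: squared sdev over the total.
prop_var <- toy_fit$sdev^2 / sum(toy_fit$sdev^2)

# summary() reports the same proportions (rounded to 5 decimal places).
from_summary <- summary(toy_fit)$importance["Proportion of Variance", ]

print(round(prop_var, 2) * 100)

# PC1 scores: the two conditions should fall on opposite sides of zero.
pc1 <- toy_fit$x[, 1]
print(pc1)
```

Because the data are centered before the decomposition, the last component carries essentially no variance, which is consistent with the final 0 in the lab's own output.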