lab06_simulation

School: Temple University
Course: 1013
Subject: Computer Science
Date: Dec 6, 2023
Type: html
Pages: 17
Uploaded by samzahroun
Lab 6: Simulation Elements of Data Science Welcome to Module 2 and lab 6! This week, we will go over conditionals and iteration, and introduce the concept of randomness. All of this material is covered in Chapter 9 and Chapter 10 of the online Inferential Thinking textbook. First, set up the tests and imports by running the cell below. In [343]: name = "Sam" In [344]: import numpy as np from datascience import * %matplotlib inline import matplotlib.pyplot as plt plt.style.use('ggplot') import os user = os.getenv('JUPYTERHUB_USER') from gofer.ok import check 1. Sampling 1.1 Dungeons and Dragons and Sampling In the game Dungeons & Dragons, each player plays the role of a fantasy character. A player performs actions by rolling a 20-sided die, adding a "modifier" number to the roll, and comparing the total to a threshold for success. The modifier depends on her character's competence in performing the action. For example, suppose Alice's character, a barbarian warrior named Roga, is trying to knock down a heavy door. She rolls a 20-sided die, adds a modifier of 11 to the result (because her character is good at knocking down doors), and succeeds if the total is greater than 15. A Medium posting discusses probability in the context of Dungeons and Dragons: https://towardsdatascience.com/understanding-probability-theory-with-dungeons-and-dragons-a36bc69aec88 Question 1.1 Write code that simulates that procedure. Compute three values: the result of Alice's roll (roll_result), the result of her roll plus Roga's modifier (modified_result), and a boolean value indicating whether the action succeeded (action_succeeded). Do not fill in any of the results manually; the entire simulation should happen in code. Hint: A roll of a 20-sided die is a number chosen uniformly from the array make_array(1, 2, 3, 4, ..., 20). So a roll of a 20-sided die plus 11 is a number chosen uniformly from that array, plus 11.
In [345]: possible_rolls = np.arange(1,21,1) roll_result = np.random.choice(possible_rolls)
modified_result = roll_result + 11 action_succeeded = modified_result > 15 # The next line just prints out your results in a nice way # once you're done. You can delete it if you want. print(f"On a modified roll of {modified_result}, Alice's action {'succeeded' if action_succeeded else 'failed'}") On a modified roll of 24, Alice's action succeeded In [346]: check('tests/q1.1.py') Out[346]: All tests passed! Question 1.2 Run your cell 7 times to manually estimate the chance that Alice succeeds at this action. (Don't use math or an extended simulation.) Your answer should be a fraction. In [347]: rough_success_chance = 7/7 In [348]: check('tests/q1.2.py') Out[348]: All tests passed! Suppose we don't know that Roga has a modifier of 11 for this action. Instead, we observe the modified roll (that is, the die roll plus the modifier of 11) from each of 7 of her attempts to knock down doors. We would like to estimate her modifier from these 7 numbers. Question 1.3 Write a Python function called simulate_observations. It should take two arguments, modifier and num_observations, and it should return an array of num_observations numbers. Each of the numbers should be the modified roll from one simulation. Then, call your function once to compute an array of 7 simulated modified rolls. Name that array observations. In [349]: modifier = 11 num_observations = 7 def simulate_observations(modifier, num_observations): """Produces an array of num_observations simulated modified die rolls""" possible_rolls = np.arange(1,21) obs = make_array() for num in np.arange(num_observations): obs = np.append(obs, (np.random.choice(possible_rolls) + modifier)) return obs observations = simulate_observations(modifier, num_observations) observations Out[349]: array([ 14., 20., 30., 27., 31., 29., 18.]) In [350]: check('tests/q1.3.py') Out[350]: All tests passed! Question 1.4 Draw a histogram to display the probability distribution of the modified
rolls we might see. Check with a neighbor or a CA to make sure you have the right histogram. Carry this out again using 100 rolls. In [351]: num_observations = 100 def roll_sim(mod, num_observations): """Produces the probability distribution of the observations from the previous code cell.""" possible_rolls = np.arange(1,21) modrolls = np.random.choice(possible_rolls, num_observations) + mod return modrolls In [352]: # We suggest using these bins. roll_bins = np.arange(1, modifier+2+20, 1) roll_bins plt.hist(roll_sim(11,100)) Out[352]: (array([ 12., 14., 10., 9., 9., 12., 6., 7., 13., 8.]), array([ 12. , 13.9, 15.8, 17.7, 19.6, 21.5, 23.4, 25.3, 27.2, 29.1, 31. ]), <BarContainer object of 10 artists>) Estimate the modifier Now let's imagine we don't know the modifier and try to estimate it from observations. One straightforward (but clearly suboptimal) way to do that is to find the smallest total roll, since the smallest roll on a 20-sided die is 1, which is roughly 0. Use a random number for modifier to start and keep this value through the next questions. We will also generate 100 rolls based on the below unknown modifier. Question 1.5 Using that method, estimate modifier from observations. Name your
estimate min_estimate. In [353]: modifier = np.random.randint(1,20) # Generates a random integer modifier from 1 to 19 inclusive observations = simulate_observations(modifier, num_observations) min_roll = min(observations) min_estimate = min_roll - 1 min_estimate Out[353]: 1.0 In [354]: check('tests/q1.5.py') Out[354]: All tests passed! Estimate the modifier based on the mean of observations. Question 1.6 Figure out a good estimate based on that quantity. Then, write a function named mean_based_estimator that computes your estimate. It should take an array of modified rolls (like the array observations) as its argument and return an estimate (a single number) of the modifier based on the numbers contained in the array. In [355]: def mean_based_estimator(obs): """Estimate the roll modifier based on observed modified rolls in the array obs.""" return int(round(np.mean(obs) - 11)) # Here is an example call to your function. It computes an estimate # of the modifier from our observations. mean_based_estimate = mean_based_estimator(observations) print(mean_based_estimate) 1 In [356]: check('tests/q1.6.py') Out[356]: All tests passed! Question 1.7 Construct a histogram and compare to the above estimates; are they consistent? What is your best estimate of the random modifier based on the above, without examining the value? In [357]: plt.hist(observations, bins = roll_bins) # Use to plot histogram of an array of 100 modified rolls estimated_modifier = mean_based_estimator(observations)
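Aside: the mean-based approach can be sanity-checked by simulation. A fair 20-sided die has expected value 10.5, so subtracting 10.5 from the average modified roll gives an unbiased estimate of the modifier. A minimal, self-contained sketch (the names and the choice of 10.5 here are this aside's own, not part of the lab's required code):

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_estimate(rolls):
    # A fair d20 averages 10.5, so the mean modified roll minus 10.5
    # estimates the hidden modifier without bias.
    return np.mean(rolls) - 10.5

true_modifier = 11
estimates = [mean_estimate(rng.integers(1, 21, size=100) + true_modifier)
             for _ in range(5000)]
print(round(np.mean(estimates), 2))  # lands very close to 11
```

Averaging over many repeated experiments like this is a quick way to check an estimator for bias before trusting it on a single set of observations.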
In [358]: check('tests/q1.7.py') Out[358]: All tests passed! 2. Sampling and GC content of DNA sequence DNA within a cell contains codes or sequences for the ultimate synthesis of proteins. DNA is made up of four types of nucleotides, guanine (G), cytosine (C), adenine (A), and thymine (T), connected in an ordered sequence. These nucleotides on a single strand pair with complementary nucleotides on a second strand: G pairs with C and A with T. Regions of DNA that code for RNA, which ultimately directs protein synthesis, are known as genes, and these segments often have higher GC content. Here we will sample 10-nucleotide segments of a DNA sequence and determine the GC content of these DNA segments. See DNA sequence basics and GC Content details. Our goal is to sample portions (10 nucleotides) of the sequence and determine the relative content of guanine (G) and cytosine (C) to adenine (A) and thymine (T). In [359]: # DNA sequence we will examine, a string seq = "CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG \ AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG \ CCGCCTCGGGAGCGTCCATGGCGGGTTTGAACCTCTAGCCCGGCGCAGTTTGGGCGCCAAGCCATATGAA \ AGCATCACCGGCGAATGGCATTGTCTTCCCCAAAACCCGGAGCGGCGGCGTGCTGTCGCGTGCCCAATGA \ ATTTTGATGACTCTCGCAAACGGGAATCTTGGCTCTTTGCATCGGATGGAAGGACGCAGCGAAATGCGAT \ AAGTGGTGTGAATTGCAAGATCCCGTGAACCATCGAGTCTTTTGAACGCAAGTTGCGCCCGAGGCCATCA \ GGCTAAGGGCACGCCTGCTTGGGCGTCGCGCTTCGTCTCTCTCCTGCCAATGCTTGCCCGGCATACAGCC \ AGGCCGGCGTGGTGCGGATGTGAAAGATTGGCCCCTTGTGCCTAGGTGCGGCGGGTCCAAGAGCTGGTGT \ TTTGATGGCCCGGAACCCGGCAAGAGGTGGACGGATGCTGGCAGCAGCTGCCGTGCGAATCCCCCATGTT \ GTCGTGCTTGTCGGACAGGCAGGAGAACCCTTCCGAACCCCAATGGAGGGCGGTTGACCGCCATTCGGAT \ GTGACCCCAGGTCAGGCGGGGGCACCCGCTGAGTTTACGC" # LCBO-Prolactin precursor-Bovine seq
Out[359]: 'CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG CCGCCTCGGGAGCGTCCATGGCGGGTTTGAACCTCTAGCCCGGCGCAGTTTGGGCGCCAAGCCATATGAA AGCATCACCGGCGAATGGCATTGTCTTCCCCAAAACCCGGAGCGGCGGCGTGCTGTCGCGTGCCCAATGA ATTTTGATGACTCTCGCAAACGGGAATCTTGGCTCTTTGCATCGGATGGAAGGACGCAGCGAAATGCGAT AAGTGGTGTGAATTGCAAGATCCCGTGAACCATCGAGTCTTTTGAACGCAAGTTGCGCCCGAGGCCATCA GGCTAAGGGCACGCCTGCTTGGGCGTCGCGCTTCGTCTCTCTCCTGCCAATGCTTGCCCGGCATACAGCC AGGCCGGCGTGGTGCGGATGTGAAAGATTGGCCCCTTGTGCCTAGGTGCGGCGGGTCCAAGAGCTGGTGT TTTGATGGCCCGGAACCCGGCAAGAGGTGGACGGATGCTGGCAGCAGCTGCCGTGCGAATCCCCCATGTT GTCGTGCTTGTCGGACAGGCAGGAGAACCCTTCCGAACCCCAATGGAGGGCGGTTGACCGCCATTCGGAT GTGACCCCAGGTCAGGCGGGGGCACCCGCTGAGTTTACGC' Question 2.1A Run the first two code cells below to see how substrings are extracted and how a character can be counted within a substring. Use the same strategy to determine GC content as fraction of the total in the first 10 nucleotides in the larger sequence above, seq In [360]: # Example A samplesize = 4 # Use this short sequence in this example seq0 = 'GTGAAAGATT' # How to get a substring seq0[0:samplesize] Out[360]: 'GTGA' In [361]: # Example B # How to count the number of times 'A' appears in sequence seq0[0:samplesize].count('A') Out[361]: 1 In [362]: GCcount = seq[0:10].count('G') + seq[0:10].count('C') GCfraction = (GCcount)/(seq[0:10].count('A') + seq[0:10].count('T') + GCcount) GCfraction Out[362]: 0.5 Lists Below we assemble a list and append an additional entry, 0.7. A useful strategy in creating your function In [363]: gc = [] gc.append(0.8) gc.append(0.7) gc Out[363]: [0.8, 0.7] Fill a list with 30 random G, C, T, A nucleotides use iteration and np.random.choice In [364]: my_sim_seq = []
nucleo = ['G','C','T','A'] for i in np.arange(1,31,1): my_sim_seq.append(np.random.choice(nucleo)) print(my_sim_seq) ['G', 'T', 'T', 'T', 'T', 'C', 'A', 'T', 'T', 'G', 'G', 'T', 'G', 'C', 'G', 'G', 'G', 'C', 'A', 'A', 'A', 'C', 'G', 'A', 'A', 'C', 'A', 'C', 'C', 'A'] Question 2.1B We will define a function, calcGC to do the repetitive work of computing GC content fraction for each segment (samplesize) and return a list with the fraction GC for each samplesize segment. In [365]: def calcGC(seq, samplesize): gc = [] for i in range(len(seq)-samplesize): segment = seq[i:i + samplesize] gc_count = segment.count('G') + segment.count('C') gc_fraction = (gc_count)/(samplesize) gc.append(gc_fraction) return gc In [366]: check('tests/q2.1.py') Out[366]: All tests passed! Question 2.2 Apply this function to our sequence above seq with a samplesize of 10. What is the maximum, minimum, mean, and median? (Hint: the max of a list can be obtained with the max(results) and we can use np.mean(results)). Plot the results by using plt.plot(results). In [367]: samplesize = 10 results = calcGC(seq,samplesize) maximum = max(results) minimum = min(results) median = np.median(results) mean = np.average(results) print(maximum, "Max") print(minimum, "Min") print(median, "Med") print(mean, "Mean") plt.plot(results) 0.9 Max 0.1 Min 0.6 Med 0.588783783784 Mean Out[367]: [<matplotlib.lines.Line2D at 0x7fec93fd9120>]
In [368]: check('tests/q2.2.py') Out[368]: All tests passed! Question 2.3 Now apply this function to our sequence above seq with a different, larger samplesize (>30) of your choosing and plot. What do you observe with the different sampling? What is the maximum, minimum, mean, and median? (Hint: the max of a list can be obtained with max(results) and we can use np.median(results)) In [369]: samplesize = 50 results = calcGC(seq,samplesize) maximum = max(results) minimum = min(results) median = np.median(results) mean = np.average(results) print(maximum, "Max") print(minimum, "Min") print(median, "Med") print(mean, "Mean") plt.plot(results) # Plot 0.72 Max 0.42 Min 0.6 Med 0.588371428571 Mean Out[369]: [<matplotlib.lines.Line2D at 0x7fec93e5ac50>]
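Aside: the sliding-window loop in calcGC can also be written with a cumulative sum, which scales much better on long sequences. This is a sketch, not required lab code; note that it keeps the final window, whereas the loop version (range(len(seq) - samplesize)) stops one position before the end, so it returns one extra value:

```python
import numpy as np

def calc_gc_fast(seq, samplesize):
    # Mark each base as GC (1) or AT (0); a cumulative sum then gives
    # every window's GC count with a single subtraction per window.
    is_gc = np.array([base in 'GC' for base in seq], dtype=int)
    csum = np.concatenate(([0], np.cumsum(is_gc)))
    counts = csum[samplesize:] - csum[:-samplesize]
    return counts / samplesize

print(calc_gc_fast("GGCCAATT", 4))  # windows: 1.0, 0.75, 0.5, 0.25, 0.0
```

The cumulative-sum trick turns the O(n * samplesize) character counting into O(n) work, which matters once sequences reach genomic lengths.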
Answer As we increase the sample size, the GC content is more likely to be somewhere around half than at the lower or higher end. With a sample size of 10, the GC content can be as low as 0.1 or as high as 0.9. On the other hand, with a samplesize of 50, the GC content only ranges from 0.42 to 0.72. 3. Wordle and sampling Inspired by Medium post: https://ericlani.medium.com/determining-the-best-first-wordle-word-to-guess-using-data-b93b975a6294 Letter frequency in sampled texts We will begin our study by trying to understand the frequency of certain letters by sampling texts. We will again use Charles Darwin's book On the Origin of Species to examine the frequency of letters in this text. To do this we will need to write a function which goes through all the words and determines the counts of letters. In [370]: darwin_string = open('data/darwin_origin_species.txt', encoding='utf-8').read() darwin_words = np.array(darwin_string.split()) Question 3.1 Define a function to determine letter frequency in text with words split into an array, as in the darwin_words array above, and return a Table with letters and their counts. We will use a nice trick with a Python dictionary (see: Python Dictionaries) as already encoded below. In [371]: def letter_freq(words): f = {} # Create an empty dictionary to store letters and their count found in words for letter in words: for l in letter: l = l.lower()
if l.isalpha(): # avoid punctuation f[l] = f.get(l,0) + 1 # Using Python dictionary return Table().with_columns('letters',list(f.keys()),'count',list(f.values())) Now test your function with your own sentence. In [372]: sentence = "I really don't even know if I'm doing this right." letter_freq(sentence) Out[372]: letters count i 6 r 2 e 3 a 1 l 2 y 1 d 2 o 3 n 4 t 3 ... (8 rows omitted) In [373]: check('tests/q3.1.py') Out[373]: All tests passed! Question 3.2 Apply the function to the darwin_string and examine the output. How many total letters are in the text (Hint: use np.sum(freq.column('count')))? Now compute a new column, frequency, which contains the fraction of each letter. What are the two most frequent letters in this sample? In [403]: freq = letter_freq(darwin_words).sort("letters") total_letters = np.sum(freq.column("count")) # How many letters freq = freq.with_columns("frequency", freq.column("count")/total_letters).sort("frequency",descending=True) total_letters Out[403]: 750341 In [404]: check('tests/q3.2.py')
Out[404]: All tests passed! Five letter words Question 3.3 Now look at a list of 5 letter words assembled by Professor Emeritus Donald Knuth of Stanford. Use your function to determine the letter frequency and compare to above. In [379]: from urllib.request import urlopen # Needed to read from internet url = "https://www-cs-faculty.stanford.edu/~knuth/sgb-words.txt" knuth5_string = urlopen(url).read().decode('utf-8') knuth_words = np.array(knuth5_string.split()) In [393]: # Now apply your function and compute letter frequencies freq = letter_freq(knuth_words) total_letters = np.sum(freq.column("count")) freq = freq.with_columns("frequency", freq.column("count") / total_letters).sort("frequency",descending=True) freq Out[393]: letters count frequency s 3033 0.105367 e 3009 0.104534 a 2348 0.0815703 o 1915 0.0665277 r 1910 0.066354 i 1592 0.0553066 l 1586 0.0550981 t 1585 0.0550634 n 1285 0.0446413 d 1181 0.0410283 ... (16 rows omitted) In [394]: check('tests/q3.3.py') Out[394]: All tests passed!
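Aside: the f.get(l, 0) + 1 dictionary pattern inside letter_freq is exactly what collections.Counter automates. A standalone sketch (returning a Counter rather than the lab's Table):

```python
from collections import Counter

def letter_counts(words):
    # Tally case-folded alphabetic characters across all words,
    # skipping punctuation just as letter_freq's isalpha check does.
    return Counter(ch.lower() for word in words for ch in word
                   if ch.isalpha())

counts = letter_counts(["Hello", "world!"])
print(counts.most_common(2))  # [('l', 3), ('o', 2)]
```

Counter also gives most_common for free, which is handy for the "two most frequent letters" style of question above.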
Compare with Oxford Dictionary Based on an analysis of the Oxford dictionary, these are its letter frequencies. Compare the three in the markdown below: what are the similarities? In [395]: url = "data/Oxford_Letter_frequency.csv" letters = Table().read_table(url, header=None, names=["letters","frequency","count"]) letters Out[395]: letters frequency count e 0.111607 56.88 a 0.084966 43.31 r 0.075809 38.64 i 0.075448 38.45 o 0.071635 36.51 t 0.069509 35.43 n 0.066544 33.92 s 0.057351 29.23 l 0.054893 27.98 c 0.045388 23.13 ... (16 rows omitted) Comparison Question 3.4 Let's look at 5-letter words and the frequency of each letter and compare with the Oxford case The letter frequency of 5 letter words in the Darwin strings is comparable to the 5 letter words from the Oxford case. The letters E and A have the top frequency according to the Oxford case, but they are second and third in the Darwin strings case. Wordle itself The New York Times hosts Wordle, where a 5 letter word (the Wordle) is determined in six or fewer tries using clues about letters contained and letter position. We will use our new knowledge of letter frequency and Knuth's 5 letter words to come up with the best letters and words to try. In [396]: # Reload Knuth's words in more convenient pandas format url = "https://www-cs-faculty.stanford.edu/~knuth/sgb-words.txt" # Better to read with pandas import pandas as pd words_df = pd.read_csv(url, header=None, names=["Words","value","letters"])
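Aside: the word-ranking idea used in the "Magic" Wordle computation below can be sketched compactly: score each word by the summed frequency of its letters, counting each distinct letter once. (The notebook's loop sums per character and then filters to words with five distinct letters, which yields the same ranking for the words that survive the filter.) The frequency values here are illustrative stand-ins, not the real tables:

```python
# Illustrative letter frequencies (made-up values, not the real tables)
freqs = {'s': 0.105, 'e': 0.105, 'a': 0.082, 'o': 0.067, 'r': 0.066,
         't': 0.055, 'i': 0.055}

def word_score(word):
    # Sum frequencies over the word's distinct letters;
    # letters missing from the table contribute 0.
    return sum(freqs.get(ch, 0.0) for ch in set(word))

words = ['arose', 'eerie', 'toast']
print(max(words, key=word_score))  # 'arose' scores highest with these toy numbers
```

Summing over set(word) is why words with five distinct common letters (like "arose") beat words with repeats (like "eerie"): a repeated letter adds no new information to a Wordle guess.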
Now load our letter frequency, freq, data table from a chosen text and convert to a pandas dataframe. In [397]: letters_df = freq.to_df() # Select table from each text above letters_df Out[397]: letters count frequency 0 s 3033 0.105367 1 e 3009 0.104534 2 a 2348 0.081570 3 o 1915 0.066528 4 r 1910 0.066354 5 i 1592 0.055307 6 l 1586 0.055098 7 t 1585 0.055063 8 n 1285 0.044641 9 d 1181 0.041028 10 u 1089 0.037832 11 c 964 0.033490 12 p 955 0.033177 13 y 886 0.030780 14 m 843 0.029286 15 h 814 0.028279 16 b 715 0.024839 17 g 679 0.023589 18 k 596 0.020705 19 f 561 0.019489 20 w 505 0.017544 21 v 318 0.011047 22 x 139 0.004829 23 z 135 0.004690 24 j 89 0.003092 25 q 53 0.001841 Question 3.5 Devise a plan to use the collection of letter frequencies and Knuth's five letter words to order the best words to guess for Wordle. Higher letter frequency means more words with the given letter. Describe the plan in text markdown only. We can create a loop using the letter frequencies of the letters given. By looping through the Knuth 5 letter words we narrow down the possible words, starting with the most frequent letter first. Words that don't contain the specific letter or letters in the loop will be excluded as each letter is added into the loop. "Magic" Wordle computation Takes a while, be patient. Can you interpret how this code is sorting Knuth's words based on our letter frequencies? In [400]: for index, row in words_df.iterrows(): prob = 0 letter_set = set() for ele in row['Words']: letter_set.add(ele) temp_df = letters_df[letters_df["frequency"].where(letters_df["letters"] == ele).notnull()] temp_df prob += temp_df["frequency"].iloc[0] words_df.loc[index, ["value"]] = prob words_df.loc[index, ["letters"]] = len(letter_set) words_df Out[400]: Words value letters 0 which 0.162897 4.0 1 there 0.358763 4.0 2 their 0.309536 5.0 3 about 0.265833 5.0 4 would 0.218030 5.0 ... ... ... ... 5752 osier 0.398089 5.0 5753 roble 0.317353 5.0 5754 rumba 0.239882 5.0 5755 biffy 0.149904 4.0 5756 pupal 0.240855 4.0 5757 rows × 3 columns Four-letter words are included; now we will select only five-letter words. In [401]: words_df.where(words_df["letters"]==5.0).sort_values(by='value',ascending=False).dropna() Out[401]: Words value letters 697 arose 0.424353 5.0 329 raise 0.413132 5.0 772 arise 0.413132 5.0 3522 aloes 0.413097 5.0 5043 stoae 0.413062 5.0 ... ... ... ... 4681 humpf 0.148063 5.0 5180 mujik 0.146222 5.0 4949 whump 0.146118 5.0 3006 junky 0.137051 5.0
2214 jumpy 0.134167 5.0 3834 rows × 3 columns Question 3.6 Compare the top words given based on Darwin's text, Knuth's letter frequencies, and the Oxford Dictionary. Top words for Wordle: Complete and tally your score. Submit .html and .ipynb of the completed laboratory In [405]: # For your convenience, you can run this cell to run all the tests at once! import glob from gofer.ok import check correct = 0 total = 11 checks = ['1.1','1.2', '1.3', '1.5', '1.6', '1.7', '2.1', '2.2', '3.1', '3.2','3.3'] for x in checks: print('Testing question {}: '.format(x)) g = check('tests/q{}.py'.format(x)) if g.grade == 1.0: print("Passed") correct += 1 else: print('Failed') display(g) print('Grade: {}'.format(str(correct/total))) Testing question 1.1: Passed Testing question 1.2: Passed Testing question 1.3: Passed Testing question 1.5: Passed Testing question 1.6: Passed Testing question 1.7: Passed Testing question 2.1: Passed Testing question 2.2: Passed Testing question 3.1: Passed Testing question 3.2: Passed
Testing question 3.3: Passed Grade: 1.0 In [406]: print("Nice work ",name, user) import time; localtime = time.asctime( time.localtime(time.time()) ) print("Submitted @ ", localtime) Nice work Sam tul06187@temple.edu Submitted @ Sat Oct 21 01:32:34 2023