HW1 (PDF, 8 pages)
School: Northeastern University
Course: 5645
Subject: Computer Science
Date: Dec 6, 2023
EECE 5645 Assignment 1: Text Analyzer

Follow the Discovery Cluster Rules:
- Never run jobs on the gateways.
- Do not reserve more than one node from the courses partition in the last 24 hours before a homework deadline.
- Start working on homework assignments early.
Preparation

Follow the "Discovery Cluster Checklist" under "Programming Resources" on Canvas to copy the latest .bashrc file to your Discovery cluster home directory. Familiarize yourself with the ld5645 command, also described there.

Make sure that a folder on Discovery named after your username exists under the directory /scratch. You can confirm this by logging into the cluster and typing:

ls /scratch/ | grep $USER

You should see a directory named after your username. Copy the directory /courses/EECE5645.202410/data/HW1/Files to the folder you just checked, renamed as HW1. You can do so by typing:

cp -r /courses/EECE5645.202410/data/HW1/Files /scratch/$USER/HW1

Make the contents of this directory private by typing:

chmod -R go-rx /scratch/$USER/HW1

After you do this, your scratch HW1 folder should contain two Python files, called TextAnalyzer.py and helpers.py.

Data

The directory /courses/EECE5645.202410/data/HW1/Data contains books from Project Gutenberg [1], other documents from the American National Corpus [2], and the files DaleChallEasyWordList.txt and fireandice.txt, which will be used throughout this assignment.

Deliverables

In this assignment, you are asked to modify the provided code and use it to analyze this dataset. You must:

1. Provide a report, in PDF format, outlining the answers to the questions below. The report should be type-written in a word processor of your choice (e.g., MS Word, LaTeX, etc.).
2. Provide the final files TextAnalyzer.py and helpers.py you wrote.

The report, along with your final code, should be uploaded on Canvas. Upload files separately. DO NOT UPLOAD .zip FILES.

[1] https://www.gutenberg.org
[2] http://anc.org/

© 2022, Stratis Ioannidis
Question 0: Go to the directory that contains TextAnalyzer.py and run the following from the command prompt:

python TextAnalyzer.py --help

What does this print? What portion of the code causes this to be printed? Find the documentation of the module that offers this functionality. Use it to describe what happens at each line of code that uses a method or object defined in this module.

Question 1: Implement the missing functions in file helpers.py. In particular:

1(a) Implement strip_non_alpha, as indicated in the docstring of the function. Modify the main body of helpers.py so that, when you run the program via

python helpers.py

the main body of the program runs unit tests (e.g., via the assert command) with several different inputs to confirm that the correct output was produced. Make sure that you include tests in which either the input or the output of the function is the empty string. Include (i) your definition of strip_non_alpha and (ii) the tests you implemented in your report.

1(b) Similarly, implement is_inflection_of, same, and find_match, as indicated by the corresponding docstrings. Again, test these extensively by modifying the main body of helpers.py, making sure these functions correctly handle empty strings. Include again (i)-(ii) as above (your code and unit tests) in your report.

Hint: If you have python3 installed on your computer, you can implement and test these functions on your own machine without connecting to the cluster.
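A hedged sketch of what 1(a) might look like follows. The docstring in helpers.py is the authoritative specification; the assumption here, suggested by Question 3(a) below, is that strip_non_alpha removes leading and trailing non-alphabetic characters:

```python
def strip_non_alpha(s):
    # Assumed behavior (check the docstring in helpers.py for the actual
    # spec): drop leading and trailing non-alphabetic characters, keeping
    # everything in between untouched.
    start, end = 0, len(s)
    while start < end and not s[start].isalpha():
        start += 1
    while end > start and not s[end - 1].isalpha():
        end -= 1
    return s[start:end]

if __name__ == "__main__":
    # Unit tests via assert, including empty-string input and output,
    # as the assignment requires.
    assert strip_non_alpha("12BanAna,!") == "BanAna"
    assert strip_non_alpha("hello") == "hello"
    assert strip_non_alpha("") == ""        # empty input
    assert strip_non_alpha("123!?") == ""   # empty output
    print("all tests passed")
```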
Question 2:

2(a) Implement functions count_sentences and count_words in file TextAnalyzer.py, as described in their docstrings (Hint: these function definitions should be very short). Include the function definitions in your report.

2(b) Modify the main body of TextAnalyzer.py so that, when called as:

python TextAnalyzer.py SEN input

it loads file input into an RDD, calls count_sentences, and prints the number of sentences in file input. You should assume that file input contains text over several lines, with each line corresponding to a different sentence. Include (i) the relevant code in the main body of TextAnalyzer.py, as well as (ii) its output when executed over file Data/Books/TheStoryOfTheStone.txt, in your report.

2(c) Modify the main body of TextAnalyzer.py so that, when called as:

python TextAnalyzer.py WRD input

it loads file input into an RDD, calls count_words, and prints the number of words in file input. Again, assume that file input contains text over several lines, with each line corresponding to a different sentence. Do not worry about cases or non-alphabetic characters in this computation: the strings "I've", "123", and "Atlas16" should each count as one word. Include (i) the relevant code in the main body of TextAnalyzer.py, as well as (ii) its output when executed over file Data/Books/TheStoryOfTheStone.txt, in your report.

2(d) Suppose that you were executing the command

python TextAnalyzer.py WRD input

on a standalone cluster. Explain (i) what data/variables are stored on the driver, (ii) what data/variables are stored on the workers, and (iii) what information is exchanged between the driver and the workers, or between the workers themselves, when your code is executed. This is a "thought experiment". Do not launch a standalone cluster or run anything. Try to reason about this based on what you have learned in class.
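The logic behind 2(a)-2(c) can be sketched as follows. This is an illustration only, not the official solution: a plain Python list of lines stands in for the Spark RDD so the sketch runs without a cluster; with a real RDD, the length would come from rdd.count() and the word total from a map/reduce over the lines.

```python
# Sketch only: a list of lines stands in for the RDD of lines.

def count_sentences(lines):
    # Each line is assumed to be one sentence, per the assignment.
    return len(lines)

def count_words(lines):
    # str.split() with no arguments splits on any run of whitespace,
    # so "I've", "123", and "Atlas16" each count as one word.
    return sum(len(line.split()) for line in lines)

text = ["I've read 123 books.", "Atlas16 is one word."]
print(count_sentences(text))  # 2
print(count_words(text))      # 8
```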
Question 3: As before, in this and all following questions, assume that file input contains text over several lines, with each line corresponding to a different sentence.

3(a) Implement function compute_counts in file TextAnalyzer.py, as described in its docstring. Include the function definition in your report. Make sure to convert to lowercase and remove leading or trailing non-alphabetic characters first: '12BanAna,!' and 'banana' should count as the same word. Hint: you may import any function you need from file helpers.py.

3(b) Modify the main body of TextAnalyzer.py so that, when called as:

python TextAnalyzer.py UNQ input --N 30

it loads file input into an RDD, calls compute_counts, produces an RDD with 30 partitions, and uses this to compute and print the number of unique words in file input. Include (i) the relevant code in the main body of TextAnalyzer.py, as well as (ii) its output when executed over file Data/Books/TheStoryOfTheStone.txt, in your report.

3(c) Modify the main body of TextAnalyzer.py so that, when called as:

python TextAnalyzer.py TOP20 input --N 30

it loads file input into an RDD, calls compute_counts, produces an RDD with 30 partitions, and uses this to compute and print the 20 most frequent words in file input, along with their counts. Include (i) the relevant code in the main body of TextAnalyzer.py, as well as (ii) its output when executed over file Data/Books/TheStoryOfTheStone.txt, in your report.

3(d) Suppose that you were executing the command

python TextAnalyzer.py UNQ input --N 30

on a standalone cluster. Explain (i) what data/variables are stored on the driver, (ii) what data/variables are stored on the workers, and (iii) what information is exchanged between the driver and the workers, or between the workers themselves, when your code is executed. Again, do not launch a cluster; reason about this from what you have learned in class.
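The normalization and counting that 3(a) asks for can be sketched like this. Again a hedged illustration, not the official solution: a Counter over a plain list of lines stands in for the RDD (in Spark this would be a flatMap over words followed by reduceByKey), and the regex plays the role the assignment assigns to strip_non_alpha:

```python
import re
from collections import Counter

def compute_counts(lines):
    # Sketch: lowercase each word and strip leading/trailing non-alphabetic
    # characters, so '12BanAna,!' and 'banana' count as the same word.
    words = (re.sub(r"^[^a-zA-Z]+|[^a-zA-Z]+$", "", w).lower()
             for line in lines for w in line.split())
    # Drop words that became empty after stripping (e.g. "123").
    return Counter(w for w in words if w)

counts = compute_counts(["12BanAna,! banana", "Banana split"])
print(counts["banana"])       # 3
print(len(counts))            # 2 unique words
print(counts.most_common(1))  # [('banana', 3)]
```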
Question 4:

4(a) Implement function count_difficult in file TextAnalyzer.py, as described in its docstring. Include the function definition in your report. Hint: you may again import any function you need from file helpers.py.

4(b) Modify the main body of TextAnalyzer.py so that, when called as:

python TextAnalyzer.py DFF input --N 30 --simple_words easy.txt

it loads file input into an RDD, reads easy.txt into a list, and uses count_difficult to compute and print the number of difficult words in file input. Include (i) the relevant code in the main body of TextAnalyzer.py, as well as (ii) its output when executed over file Data/Books/TheStoryOfTheStone.txt with easy.txt given by DaleChallEasyWordList.txt, in your report.

4(c) Suppose that you were executing the command

python TextAnalyzer.py DFF input --N 30 --simple_words easy.txt

on a standalone cluster. Beyond anything you already described in Q3(d), explain (i) what additional data/variables are stored on the driver, (ii) what additional data/variables are stored on the workers, and (iii) what additional information (if any) is exchanged between the driver and the workers, or between the workers themselves, when your code is executed. Again, do not launch a cluster; reason about this from what you have learned in class.

4(d) The (simplified) Dale-Chall Formula [3] is a score used to measure how difficult an English text is. The score is computed by the following formula:

0.1579 × (# difficult words / # words) × 100 + 0.0496 × (# words / # sentences),

where # sentences is the number of sentences in the file, # words is the number of words in the file, and # difficult words is the number of difficult words in the file. Difficult words are all words not appearing in DaleChallEasyWordList.txt, accounting for regular inflections, etc. Modify the main body of TextAnalyzer.py so that, when called as:

python TextAnalyzer.py DCF input

it computes and prints the Dale-Chall Formula on file input. It is fine to also print intermediate results, like # sentences, etc., if you like. Parameters --N and --simple_words should be supported but should also be optional, with default values set to 30 and DaleChallEasyWordList.txt, respectively. Include (i) the relevant code in the main body of TextAnalyzer.py, as well as (ii) its output when executed over file Data/Books/TheStoryOfTheStone.txt in your report.

[3] https://en.wikipedia.org/wiki/Dale-Chall_readability_formula
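Once the three counts are available, the formula itself is a one-liner; the sketch below checks it against the "Fire and Ice" values listed in the Addendum (9 sentences, 51 words, 3 difficult words, score 1.2098901960784314):

```python
def dale_chall(num_sentences, num_words, num_difficult):
    # Simplified Dale-Chall formula, exactly as given in the assignment:
    # 0.1579 * (# difficult / # words) * 100 + 0.0496 * (# words / # sentences)
    return (0.1579 * (num_difficult / num_words) * 100
            + 0.0496 * (num_words / num_sentences))

# "Fire and Ice" values from the Addendum:
print(dale_chall(9, 51, 3))  # matches 1.2098901960784314 up to float rounding
```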
Question 5: Use the timing code included in TextAnalyzer.py to compute the Dale-Chall Formula for N taking values in {2, 5, 10, 15, 20} when executed over the file Data/Books/TheStoryOfTheStone.txt. Produce a bar plot with N on the x-axis and the computation time on the y-axis in your report. Hint: make sure that you run the code with different values of N on the same compute node, or at least on different nodes with the same specifications.

Question 6: Use your code to produce a table with the Dale-Chall Formula for every file in directory Data. Report the resulting values in a table, grouping files in each category (e.g., books, news, etc.) together, e.g., in consecutive rows. What do you observe?
Addendum. These are a few values to help you check the correctness of your code. The poem "Fire and Ice" by Robert Frost is located at /courses/EECE5645.202410/data/HW1/Data/fireandice.txt. Treating each line/verse as a sentence, the poem has:

9 sentences
51 words
41 unique words
3 difficult words (perish, destruction, and suffice)

The Dale-Chall Formula is 1.2098901960784314.

The 20 most frequent words are (note that the list is non-unique after the 8th word):

1. i: 3
2. say: 3
3. of: 2
4. to: 2
5. some: 2
6. fire: 2
7. in: 2
8. ice: 2
9. think: 1
10. would: 1
11. and: 1
12. it: 1
13. twice: 1
14. also: 1
15. desire: 1
16. hold: 1
17. will: 1
18. i've: 1
19. had: 1
20. perish: 1