Lab 3: Hashtag Counting and Spark-submit in Cluster Mode

Instructor: Professor Romit Maulik
TAs: Haomiao Ni and Songtao Liu

The goals of this lab are for you to be able to
- Use the RDD transformations filter and sortBy.
- Compute hashtag counts for an input data file containing tweets.
- Modify PySpark code for local mode into PySpark code for cluster mode.
- Request a cluster from ICDS and run spark-submit in cluster mode for a big dataset.
- Obtain run-time performance for a choice of the number of output partitions for reduceByKey.
- Apply the above to compute hashtag counts for tweets related to the Boston Marathon Bombing (gathered on April 17, 2013, two days after the domestic terrorist attack).

Total Number of Exercises: 5
- Exercise 1: 5 points
- Exercise 2: 10 points
- Exercise 3: 10 points
- Exercise 4: 15 points
- Exercise 5: 10 points
Total Points: 50 points

Data for Lab 3
- sampled_4_17_tweets.csv: A random sample of a small set of tweets regarding the Boston Marathon Bombing on April 17, 2013. This data is used in local mode.
- BMB_4_17_tweets.csv: The entire set of tweets regarding the Boston Marathon Bombing on April 17, 2013. This data is used in cluster mode.

Like Lab 2, download the data from Canvas into a directory for the lab (e.g., Lab3) under your home directory.

Items to submit for Lab 3
- Completed Jupyter Notebook (HTML format)
- .py file used for cluster mode
- log file for a successful run in cluster mode
- a screenshot of the ls -al command in the output directory for a successful run in cluster mode
Due: 11:59 PM, Sept 10, 2023; 5 bonus points if you submit by 11:59 PM, Sept 8, 2023.

Like Lab 2, the first thing we need to do in each Jupyter Notebook running PySpark is to import pyspark.

In [1]: import pyspark

In [2]: from pyspark import SparkContext

Like Lab 2, we create a SparkContext object.

Note: We use "local" as the master parameter for SparkContext in this notebook so that we can run and debug it in the ICDS Jupyter Server. However, we need to remove master="local" later, when you convert this notebook into a .py file for running it in cluster mode.

In [3]: sc = SparkContext(appName="Lab3")
        sc

Out[3]: SparkContext
        Spark UI
        Version: v3.4.1
        Master: local
        AppName: Lab3

Exercise 1 (5 points)
Add your name below.

Answer for Exercise 1
Your Name: Ruthvik Uttarala

Exercise 2 (10 points)
Complete the path and run the code below to read the file "sampled_4_17_tweets.csv" from your Lab3 directory. For cluster-mode execution, change this path to "BMB_4_17_tweets.csv" for a big-data cluster job.

In [4]: tweets_RDD = sc.textFile("/storage/home/rpu5040/Lab3/BMB_4_17_tweets.csv")
        tweets_RDD

Out[4]: /storage/home/rpu5040/Lab3/sampled_4_17_tweets.csv MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0
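To make the local-versus-cluster switch concrete, here is a minimal sketch (not from the original notebook) contrasting the two constructions. Only one SparkContext can exist per process, so use one form or the other:

from pyspark import SparkContext

# Local mode: for running and debugging inside the ICDS Jupyter Server.
sc = SparkContext(master="local", appName="Lab3")

# Cluster mode: omit master= here; spark-submit supplies the master
# when the job is launched on the cluster.
# sc = SparkContext(appName="Lab3")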
Exercise 3 (10 points)
Complete and execute the code below, which computes the total count of each hashtag in the input tweets, sorts the hashtags by count (in descending order), and saves them in an output directory:
(a) Use flatMap to "flatten" the list of tokens from each tweet (using the split function) into one very large list of tokens.
(b) Filter the tokens for hashtags.
(c) Count the total number of hashtags in a way similar to Lab 2.
(d) Sort the hashtag counts in descending order.
(e) Save the sorted hashtag counts in an output directory.

Code for Exercise 3 is shown in the code cells below.

In [5]: tokens_RDD = tweets_RDD.flatMap(lambda line: line.strip().split(" "))
        #tokens_RDD.take(3)

Out[5]: [',Text', '0,You', 'know']

take (action for RDD)
take is an action for RDD. Its parameter is the number of elements of the input RDD you want to show. take is often used for debugging/learning purposes in local mode so that the contents of a few samples of an RDD can be inspected. This way, if the content and/or the format of the RDD differs from what you expected, you can investigate it and, if needed, fix it before proceeding further.

filter (transformation for RDD)
The syntax for filtering (one type of data transformation in Spark) an input RDD is
<input RDD>.filter(lambda <parameter>: <the body of a Boolean function>)
Notice the syntax is not what is described on p. 38 of the textbook. The result of filtering the input RDD is the collection of all elements of the input RDD that pass the filter condition (i.e., return True when the filtering Boolean function is applied to each element of the input RDD). For example, the filtering condition in the PySpark code below checks whether each element of the input RDD (i.e., tokens_RDD) starts with the character "#", using the Python startswith() method for strings.

In [6]: hashtag_RDD = tokens_RDD.filter(lambda x: x.startswith("#"))

In [8]: hashtag_1_RDD = hashtag_RDD.map(lambda x: (x, 1))

In [11]: hashtag_count_RDD = hashtag_1_RDD.reduceByKey(lambda x, y: x + y, 5)
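The second argument to reduceByKey (5 here) sets the number of output partitions, which is the knob whose run-time effect Exercise 5 measures. The combining function must be associative and commutative, because Spark merges per-partition partial counts in arbitrary order; lambda x, y: x + y sums the per-key 1s. A minimal, self-contained sketch of this counting pattern (the sample data and app name are illustrative, not from the lab):

from pyspark import SparkContext

sc = SparkContext(master="local", appName="CountSketch")  # local mode, for illustration
tokens = sc.parallelize(["#boston", "#prayforboston", "#boston"])
pairs = tokens.map(lambda t: (t, 1))               # emit (hashtag, 1) per occurrence
counts = pairs.reduceByKey(lambda x, y: x + y, 5)  # sum per key, 5 output partitions
print(counts.collect())  # [('#boston', 2), ('#prayforboston', 1)] (order may vary)
sc.stop()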
sortBy (transformation for RDD)
To sort the hashtag counts so that the hashtags that occur more frequently appear first, we use sortBy(lambda pair: pair[1], ascending=False). sortBy sorts the input RDD based on the value returned by the lambda function, which here returns the value of each key-value pair.

Note: The index of a list/tuple in Python starts with 0. Therefore pair[0] accesses the key of each key-value pair (in the input RDD), whereas pair[1] accesses the value of the key-value pair. The default sorting order is ascending. To sort in descending order, we need to set the parameter ascending to False, which means frequent/top hashtags occur first in the output.

In [12]: sorted_hashtag_count_RDD = hashtag_count_RDD.sortBy(lambda pair: pair[1], ascending=False)

Note: You need to complete the path with your output directory (e.g., Lab3 under your home directory).
Note: You also need to change the directory name (e.g., replace "_sampled" with "_cluster") before you convert this notebook into a .py file for submitting it to the ICDS cluster.

In [13]: output_path = "/storage/home/rpu5040/Lab3/sorted_hashtag_count_cluster.txt"
         sorted_hashtag_count_RDD.saveAsTextFile(output_path)

In [14]: sc.stop()

Exercise 4 (15 points)
After running this Jupyter Notebook successfully in local mode, modify the notebook in the following ways:
- Remove master="local", in the SparkContext statement so that the Spark code runs in cluster (standalone cluster) mode.
- Change the input file to "BMB_4_17_tweets.csv", which you have downloaded from Canvas and saved in your directory for Lab3.
- Change the output directory/path name.
- Comment out the take action for RDD, which is used for debugging in local mode.
- Save the modified notebook as Lab3C.ipynb (using Save Notebook As under File on the top menu, top left).
- Export the modified notebook using Export Notebook As under File to export it as an Executable Script (the exported Python file has a .py extension, e.g., Lab3C.py).
- Upload Lab3C.py to the Lab3 directory in ICDS Roar under your home directory.
- Follow the directions in "Instructions for Running Spark in Cluster Mode" to run Lab3C.py in cluster mode using spark-submit.
- Submit the .py file and a screenshot of ls -al in your output directory (in a Terminal window).

Exercise 5 (10 points)
Record the run time information below and submit the log file that contains the run time information.

Answer to Exercise 5
- Number of partitions used in reduceByKey: 5
- real time: 0m13.126s
- user time: 0m0.322s
- sys time: 0m0.295s
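The exact spark-submit invocation for ICDS comes from the course's "Instructions for Running Spark in Cluster Mode" handout, so the line below is only a hedged sketch of how the Exercise 5 numbers can be captured: wrapping the job in the shell's time builtin prints the real/user/sys lines, and grouping with braces lets that report land in the log file along with the job output (the log file name is an assumption, not from the lab):

{ time spark-submit Lab3C.py ; } > lab3_cluster.log 2>&1

With 5 output partitions in reduceByKey, saveAsTextFile writes one part file per partition (part-00000 through part-00004) plus a _SUCCESS marker; these are the files the ls -al screenshot of the output directory should show.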