Review Test Submission: Pair RDD and Dataframes Quiz

School: University of Texas, Dallas
Course: 6350
Subject: Computer Science
Date: Feb 20, 2024

6/26/22, 10:11 PM
CS 6350.0U1 - Big Data Management and Analytics - Su22

User: Aneena Manoj
Course: CS 6350.0U1 - Big Data Management and Analytics - Su22
Test: Pair RDD and Dataframes Quiz
Started: 6/26/22 11:47 AM
Submitted: 6/26/22 10:09 PM
Due Date: 6/26/22 11:59 PM
Status: Completed
Attempt Score: 100 out of 100 points
Time Elapsed: 10 hours, 21 minutes
Results Displayed: All Answers, Submitted Answers, Correct Answers

Question 1 (10 out of 10 points)
Which of the following are valid operations on key-value pair RDDs?

Selected Answers:
    sortByKey
    groupByKey
Answers:
    groupbyValue
    sortByKey
    groupByKey
    sortByValue

Question 2 (10 out of 10 points)
Consider the RDD defined below:

    data = [["john", 20], ["bill", 25], ["sarah", 30], ["mary", 18], ["sam", 32]]
    rdd = sc.parallelize(data)

and a function as defined:

    def myFun(rdd):
        mean = rdd.values().mean()
        rdd1 = rdd.map(lambda x: x[1] > mean)
        return rdd1

What will be the output of myFun(rdd).sum()?
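The arithmetic behind Question 2 can be traced without a Spark cluster. The sketch below is a plain-Python stand-in, not PySpark: `my_fun_local` is a hypothetical helper that mirrors what `myFun` does with `rdd.values().mean()` and `rdd.map(...)`, using an ordinary list in place of the RDD.

```python
from statistics import mean

# Same records as in Question 2; each element is [name, age].
data = [["john", 20], ["bill", 25], ["sarah", 30], ["mary", 18], ["sam", 32]]


def my_fun_local(records):
    """Plain-Python stand-in for myFun (hypothetical helper, no Spark involved)."""
    # Corresponds to rdd.values().mean(): the average of the ages.
    avg_age = mean(age for _, age in records)
    # Corresponds to rdd.map(lambda x: x[1] > mean): one boolean per record.
    return [age > avg_age for _, age in records]


flags = my_fun_local(data)
total = sum(flags)  # True counts as 1 and False as 0 when summed

print(flags)  # [False, False, True, False, True]
print(total)  # 2
```

The mean age is 125 / 5 = 25, so only the ages 30 and 32 exceed it. Summing the booleans counts the True entries, which gives 2.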

Selected Answer:
    2
Answers:
    2
    0
    [["sarah", 30], ["sam", 32]]
    [False, False, True, False, True]

Question 3 (10 out of 10 points)
Consider the RDD defined below:

    data = [["john", 20], ["bill", 25], ["sarah", 30], ["mary", 18], ["sam", 32], ['jill', 27], ['mike', 60], ['bella', 22]]
    rdd = sc.parallelize(data)

What will be the output of:

    rdd.map(lambda x: (x[0][0], x[1])).mapValues(lambda x: (x, 1)).reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1])).mapValues(lambda x: float(x[0])/float(x[1])).collect()

Selected Answer:
    It will first convert given RDD so that the new key becomes the first character of original key. It will then obtain average value for each key.
Answers:
    It will return average value for each key.
    It will first convert given RDD so that the new key becomes the first character of original key. It will then obtain average value for each key.
    It will generate count of values for each key.
    It will remove duplicates from the list.

Question 4 (10 out of 10 points)
Consider the RDD defined below:

    data = [["john", 20], ["bill", 25], ["sarah", 30], ["mary", 18], ["sam", 32], ['jill', 27], ['mike', 60], ['bella', 22]]
    rdd = sc.parallelize(data)

Each element of the RDD contains name as the key and age as the value. We would like to write a function that will sort this rdd by value in descending order. Which of the following is the correct way to write such a function and apply it to the given RDD?

Selected Answer:
    def sortByValue(rdd):
        return rdd.map(lambda x: (x[1], x[0])).sortByKey(ascending = False).map(lambda x: (x[1], x[0]))
    sortByValue(rdd)

Answers:
    def sortByValue(rdd):
        return rdd.map(lambda x: (x[1], x[0])).sortByKey(ascending = False).map(lambda x: (x[1], x[0]))
    sortByValue(rdd)

    def sortByValue(rdd):
        return rdd.map(lambda x: (x[1], x[0])).sortByKey(ascending = False)
    sortByValue(rdd)

    def sortByValue(rdd):
        return rdd.map(lambda x: (x[1], x[0])).sortByKey(ascending = False).map(lambda x: (x[1], x[0]))
    rdd.sortByValue()

    def sortByValue(rdd):
        return rdd.sortByKey(ascending = False).map(lambda x: (x[1], x[0]))
    sortByValue(rdd)

Question 5 (10 out of 10 points)
Suppose I create a pair RDD as follows:

    majors_data = [["john", "cs"], ["bill", "cs"], ["sarah", "math"], ["mary", "stats"], ["sam", "physics"], ['jill', "math"], ['mike', "cs"], ['bella', "cs"]]
    majors = sc.parallelize(majors_data)

In the above pair RDD, name of the person is the key and their major is the value. I would like to count how many students are enrolled in each major. Which of the following accomplishes this?

Selected Answers:
    majors.values().countByValue()
    majors.values().map(lambda x: (x, 1)).countByKey()
Answers:
    majors.values().countByValue()
    majors.values().map(lambda x: (x, 1)).countByKey()
    majors.values().count()
    majors.countByValue()

Question 6 (10 out of 10 points)
Suppose I create two pair-RDDs as follows:

    data = [["john", 20], ["bill", 25], ["sarah", 30], ["mary", 18], ["sam", 32], ['jill', 27], ['mike', 60], ['bella', 22]]
    rdd = sc.parallelize(data)
    majors_data = [["john", "cs"], ["bill", "cs"], ["sarah", "math"], ["mary", "stats"], ["sam", "physics"], ['jill', "math"], ['mike', "cs"], ['bella', "cs"]]
    majors = sc.parallelize(majors_data)

In the first pair-RDD, name of the person is the key and age is the value, and in the second pair-RDD, name of the person is the key and their major is the value. I would like to output a pair-RDD with major as the key and value as the oldest person enrolled in that major. Which of the following accomplishes this?

Selected Answers:
    rdd.join(majors).values().map(lambda x: (x[1], x[0])).reduceByKey(lambda x, y: max(x, y))
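The selected pipeline for Question 6 chains four steps: join on name, drop the name, swap to (major, age), and reduce by key with max. As a rough illustration, the sketch below walks through the same steps in plain Python with lists and dicts standing in for the RDDs. It is not PySpark, and the intermediate names (`joined`, `major_age`, `oldest_by_major`) are made up for this example.

```python
# Same records as in Question 6.
data = [["john", 20], ["bill", 25], ["sarah", 30], ["mary", 18],
        ["sam", 32], ["jill", 27], ["mike", 60], ["bella", 22]]
majors_data = [["john", "cs"], ["bill", "cs"], ["sarah", "math"], ["mary", "stats"],
               ["sam", "physics"], ["jill", "math"], ["mike", "cs"], ["bella", "cs"]]

# Stand-in for rdd.join(majors): (name, (age, major)) for names present on both sides.
# Names are unique here, so a dict lookup is enough to emulate the inner join.
major_by_name = dict(majors_data)
joined = [(name, (age, major_by_name[name]))
          for name, age in data if name in major_by_name]

# Stand-in for .values().map(lambda x: (x[1], x[0])): drop the name, swap to (major, age).
major_age = [(major, age) for _, (age, major) in joined]

# Stand-in for .reduceByKey(lambda x, y: max(x, y)): keep the largest age per major.
oldest_by_major = {}
for major, age in major_age:
    oldest_by_major[major] = max(age, oldest_by_major.get(major, age))

print(oldest_by_major)
```

Note that the value kept for each major is the largest age, since the name is discarded by `values()` before the reduction.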


    majors.join(rdd).values().reduceByKey(max)
Answers:
    majors.join(rdd).values().reduce(max)
    rdd.join(majors).map(lambda x: (x[1], x[0])).reduceByKey(lambda x, y: max(x, y))
    rdd.join(majors).values().map(lambda x: (x[1], x[0])).reduceByKey(lambda x, y: max(x, y))
    majors.join(rdd).values().reduceByKey(max)

Question 7 (10 out of 10 points)
Suppose I create two pair-RDDs as follows:

    data = [["john", 20], ["bill", 25], ["sarah", 30], ["mary", 18], ["sam", 32], ['jill', 27], ['mike', 60], ['bella', 22]]
    rdd = sc.parallelize(data)
    majors_data = [["john", "cs"], ["bill", "cs"], ["sarah", "math"], ["mary", "stats"], ["sam", "physics"], ['jill', "math"], ['mike', "cs"], ['bella', "cs"]]
    majors = sc.parallelize(majors_data)

In the first pair-RDD, name of the person is the key and age is the value, and in the second pair-RDD, name of the person is the key and their major is the value. I would like to find the major and age of 'john' using a single line query (Note: order of display of age and major doesn't matter and I don't want his name to be displayed). Which of the following accomplishes this?

Selected Answers:
    rdd.join(majors).lookup('john')
    majors.join(rdd).lookup('john')
    rdd.join(majors).filter(lambda x: x[0]=='john').values()
Answers:
    rdd.join(majors).lookup('john')
    rdd.join(majors).filter(lambda x, y: x =='john').values()
    majors.join(rdd).lookup('john')
    rdd.join(majors).filter(lambda x: x[0]=='john').values()

Question 8 (10 out of 10 points)
Suppose I create a pair-RDD as follows:

    data = [["john", 20], ["bill", 25], ["sarah", 30], ["mary", 18], ["sam", 32], ['jill', 27], ['mike', 60], ['bella', 22]]
    rdd = sc.parallelize(data)

In this pair-RDD, name of the person is the key and age is the value. I would like to create a PySpark Dataframe called nameDF from this pair-RDD with two columns named: "name" and "age". Which of the following accomplishes this?

Selected Answer:
    nameDF = rdd.toDF(["name", "age"])
Answers:
    nameDF = rdd.DataFrame(["name", "age"])
    nameDF = rdd.toDF(["name", "age"])
    nameDF = rdd.saveAsDataFrame(["name", "age"])
    nameDF = rdd.toDF("name", "age")

Question 9 (10 out of 10 points)
Suppose I create two pair-RDDs as follows:

    names_data = [["john", 20], ["bill", 25], ["sarah", 30], ["mary", 18], ["sam", 32], ['jill', 27], ['mike', 60], ['bella', 22]]
    names = sc.parallelize(names_data)
    majors_data = [["john", "cs"], ["bill", "cs"], ["sarah", "math"], ["mary", "stats"], ["sam", "physics"], ['jill', "math"], ['mike', "cs"], ['bella', "cs"]]
    majors = sc.parallelize(majors_data)

I convert these pair-RDDs to two PySpark Dataframes - namesDF with columns ['name', 'age'] and majorsDF with columns ['name', 'major']. Using the two Dataframes, I would like to find average age and count of students for each major. That is, I want one output dataframe with 3 columns: [major, avgAge, count]. Which of the following accomplishes this?

Selected Answers:
    from pyspark.sql.functions import avg, desc, count
    majorsDF.join(namesDF, "name").groupBy("major").agg(avg("age").alias("avgAge"), count("name").alias("count"))

    from pyspark.sql.functions import avg, desc, count
    namesDF.join(majorsDF, "name").groupBy("major").agg(avg("age").alias("avgAge"), count("name").alias("count"))

Answers:
    from pyspark.sql.functions import avg, desc, count
    majorsDF.join(namesDF, "name").groupBy("major").agg(avg("age").alias("avgAge"), count("name").alias("count"))

    from pyspark.sql.functions import avg, desc, count
    majorsDF.join(namesDF, "name").groupBy("major").avg("age").count()

    from pyspark.sql.functions import avg, desc, count
    namesDF.join(majorsDF, "name").groupBy("major").agg(avg("age").alias("avgAge"), count("name").alias("count"))

    from pyspark.sql.functions import avg, desc, count
    majorsDF.join(namesDF, "name").groupBy("major").count().avg("age")

Question 10 (10 out of 10 points)
Suppose I create two pair-RDDs as follows:

    names_data = [["john", 20], ["bill", 25], ["sarah", 30], ["mary", 18], ["sam", 32], ['jill', 27], ['mike', 60], ['bella', 22]]
    names = sc.parallelize(names_data)
    majors_data = [["john", "cs"], ["bill", "cs"], ["sarah", "math"], ["mary", "stats"], ["sam", "physics"], ['jill', "math"], ['mike', "cs"], ['bella', "cs"]]
    majors = sc.parallelize(majors_data)

I convert these pair-RDDs to two PySpark Dataframes - namesDF with columns ['name', 'age'] and majorsDF with columns ['name', 'major']. Using the two Dataframes, I would like to find the names and majors of people that are above 30 years old. Which of the following accomplishes this?

Selected Answers:
    namesDF.join(majorsDF, "name").filter("age > 30")
    namesDF.filter("age > 30").join(majorsDF, "name")
Answers:
    majorsDF.filter("age > 30").join(majorsDF, "name").show()
    namesDF.join(majorsDF, "name").filter("age > 30")
    namesDF.filter("age > 30").join(majorsDF, "name")
    namesDF.filter("age" > 30).join(majorsDF, "name")

Sunday, June 26, 2022 10:09:20 PM CDT
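Both selected answers for Question 10 combine an inner join on "name" with an age filter, just in opposite orders. The sketch below is a plain-Python stand-in (lists of tuples in place of the DataFrames, no Spark involved) suggesting why the two orders select the same rows. The variable names are illustrative only.

```python
# Same records as in Question 10.
names_data = [["john", 20], ["bill", 25], ["sarah", 30], ["mary", 18],
              ["sam", 32], ["jill", 27], ["mike", 60], ["bella", 22]]
majors_data = [["john", "cs"], ["bill", "cs"], ["sarah", "math"], ["mary", "stats"],
               ["sam", "physics"], ["jill", "math"], ["mike", "cs"], ["bella", "cs"]]

# Names are unique, so a dict lookup emulates the inner join on "name".
major_by_name = dict(majors_data)

# Order 1: join first, then keep rows with age > 30.
joined = [(name, age, major_by_name[name])
          for name, age in names_data if name in major_by_name]
join_then_filter = [row for row in joined if row[1] > 30]

# Order 2: filter on age first, then join the survivors.
older = [(name, age) for name, age in names_data if age > 30]
filter_then_join = [(name, age, major_by_name[name])
                    for name, age in older if name in major_by_name]

print(join_then_filter)   # [('sam', 32, 'physics'), ('mike', 60, 'cs')]
print(filter_then_join)   # [('sam', 32, 'physics'), ('mike', 60, 'cs')]
```

Only sam (32) and mike (60) pass the strict `age > 30` test, and sarah at exactly 30 does not. As a side note on the last listed option, the bare Python expression `"age" > 30` compares a string with an integer, which raises a TypeError in Python 3 before any filter call is made.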
