DS 5110: Introduction to Data Management and Processing (Fall 2023)

Sample Final Solutions

You have 2 hours. The exam is closed book, closed notes. You are allowed to bring two letter-sized pages of notes (written on both sides). Use of electronic devices is not allowed. Please read all instructions carefully.

Name:
NEU ID (optional):
1 Multiple-Choice Questions (40 points)

Circle ALL the correct choices: there may be more than one correct choice, but there is always at least one correct choice. NO partial credit: the set of all the correct answers must be checked. There are 8 multiple-choice questions worth 5 points each.

1. When should we not use a NoSQL database instead of a relational database?

(a) When the data structure is highly unstructured and rapidly changing.
(b) When scalability and handling large volumes of data with less structure are the main concerns.
(c) When ACID (Atomicity, Consistency, Isolation, Durability) properties are critically important for the application.
(d) When dealing with Big Data applications that require high-speed data processing and analytics.

Answer: (c). Relational databases are better suited for scenarios where maintaining the ACID properties is essential.

2. What are the advantages of the BSON format over JSON?

(a) BSON is optimized for data storage and retrieval.
(b) BSON supports more data types than JSON.
(c) BSON is more human-readable than JSON.
(d) BSON supports the storage of JavaScript code.

Answer: (a) and (b).
(a): BSON, as a binary format, is more optimized for data storage and retrieval than JSON, which is a textual format.
(b): JSON supports basic data types such as numbers, strings, booleans, arrays, and objects. BSON extends this by supporting additional data types, such as dates, binary data, and ObjectId, which are not natively supported in JSON.
(c): BSON is a binary format, which makes it less human-readable than JSON.
(d): BSON does not support the storage of executable code as a native data type.

3. Suppose you have the following collection in MongoDB. The collection is named contacts.

{"_id": 1, "region": "NW1", "leads": 1, "email": "mlangley@co1.com"}
{"_id": 2, "region": "NW1", "leads": 1, "email": "jpicoult@co4.com"}
{"_id": 3, "region": "NW1", "leads": 2, "email": "zzz@company2.com"}
{"_id": 4, "region": "SE1", "leads": 8, "email": "mary@hssu.edu"}
{"_id": 5, "region": "SE2", "leads": 4, "email": "janet@col.edu"}
{"_id": 6, "region": "SE2", "leads": 2, "email": "bill@uni.edu"}
{"_id": 7, "region": "SE2", "leads": 4, "email": "iii@company1.com"}
{"_id": 8, "region": "SW1", "leads": 1, "email": "phil@co3.com"}
{"_id": 9, "region": "SW1", "leads": 2, "email": "thomas@company.com"}
{"_id": 10, "region": "SW2", "leads": 2, "email": "sjohnson@uchi.edu"}
{"_id": 11, "region": "SW2", "leads": 5, "email": "tsamuel@someco.com"}

How many documents will be returned from the following query?
db.contacts.aggregate([
    {"$group": {"_id": "$region", "count": {"$count": {}}}},
    {"$match": {"count": {"$gte": 3}}}
])

(a) 0
(b) 1
(c) 2
(d) 3
(e) 4

Answer: (c). The first stage, $group, groups the documents by region and uses the $count accumulator (equivalent to {$sum: 1}) to count the number of documents in each group. These group documents then flow into the $match stage, where groups with a count of less than 3 (3 of the 5 groups) are filtered out, returning two documents:

{"_id": "SE2", "count": 3}
{"_id": "NW1", "count": 3}

4. In HDFS, what are the roles of the NameNode?

(a) Stores the Block IDs and locations of any given file in HDFS.
(b) Executes file system namespace operations such as renaming files and directories.
(c) Creates and deletes data blocks.
(d) Chooses which DataNodes store the replicas of a given file.

Answer: (a), (b), and (d). (c): Data blocks are physically created and deleted by the DataNodes, acting on instructions from the NameNode.

5. Which of the following statements related to MapReduce are true?

(a) Each mapper/reducer must generate the same number of output key/value pairs as it receives on the input.
(b) The input to reducers is grouped by key.
(c) The output type of keys generated by mappers must be of the same type as their input.
(d) The output type of keys generated by mappers must be of the same type as the input keys received by reducers.

Answer: (b) and (d).
(a): Mappers and reducers can generate a different number of output key/value pairs than they receive.
(b): The input to reducers is grouped by key.
(c): The output key types of mappers and reducers can be different from their input key types.
(d): True, since the output key/value pairs of the mappers are the input key/value pairs of the reducers. A sketch illustrating this type contract follows this question.
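To make statements 5(b) and 5(d) concrete, here is a minimal word-count sketch in plain Python (not part of the exam). The helper names map_task, reduce_task, and run_job are illustrative, not from any MapReduce framework; the point is that the (str, int) pairs the mapper emits are exactly what the reducer receives, grouped by key.

from collections import defaultdict

def map_task(line):
    # Mapper: emits (str, int) pairs -- one (word, 1) per word.
    for word in line.split():
        yield (word, 1)

def reduce_task(key, values):
    # Reducer: its input key type (str) matches the mapper's
    # output key type, per statement 5(d).
    yield (key, sum(values))

def run_job(lines):
    # Shuffle phase: group mapper output by key (statement 5(b)).
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_task(line):
            groups[key].append(value)
    return [pair for key, values in groups.items()
                 for pair in reduce_task(key, values)]

print(run_job(["to be or not to be"]))
# [('to', 2), ('be', 2), ('or', 1), ('not', 1)]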
6. Which of the following statements about Apache Hive is not true?

(a) Hive is a real-time data warehousing solution that provides immediate query responses.
(b) Hive operates on Hadoop's file system (HDFS) and provides an SQL-like query language called HiveQL.
(c) Hive translates HiveQL queries into MapReduce jobs under the hood to process the data.
(d) Hive is designed for managing and querying only structured data stored in databases.

Answer: (a) and (d).
(a): While Hive is a data warehouse infrastructure built on top of Hadoop, it is not accurate to describe it as a real-time data warehousing solution; it is primarily designed for batch processing and is not typically used for real-time query responses.
(d): Hive is not limited to traditional database formats and can also handle semi-structured and unstructured data (as we saw in the Word Count example in class).

7. Consider the following HTML document:

<table>
  <tr>
    <th style="height:100px">Header1</th>
    <th>Header2</th>
  </tr>
  <tr>
    <td style="height:100px">Cell1</td>
    <td>Cell2</td>
  </tr>
  <tr>
    <td style="height:100px">Cell3</td>
    <td colspan="2">Cell4</td>
  </tr>
</table>

How many elements will be returned from the following XPath query: //tr/td[@style]

(a) 0
(b) 1
(c) 2
(d) 3
(e) The query is incorrect.

Answer: (c). The query selects td elements that have a style attribute and whose parent is a tr element; only Cell1 and Cell3 match (Header1 is a th, not a td).

8. In Spark, which of the following operations triggers the execution of RDD transformations?

(a) map()
(b) reduce()
(c) transform()
(d) select()
(e) collect()

Answer: (b) and (e). These two functions are actions, and therefore they trigger the execution of the RDD transformations defined before them. A small sketch of this lazy-evaluation behavior follows.
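As a minimal illustration (not part of the exam), the following PySpark snippet shows that map() alone only builds a lazy transformation, while collect() and reduce() are actions that actually run the job. It assumes a local SparkContext named sc, as in the Spark section below.

# map() is a transformation: nothing is computed yet.
squares = sc.parallelize([1, 2, 3, 4]).map(lambda x: x * x)

# collect() and reduce() are actions: each one triggers execution
# of the pending transformation above.
print(squares.collect())                   # [1, 4, 9, 16]
print(squares.reduce(lambda x, y: x + y))  # 30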
2 MongoDB (25 points)

You have a collection called movies. A sample document in the collection is shown below:

{
  '_id': ObjectId('573a1394f29313caabcdf639'),
  'title': 'Titanic',
  'released': datetime.datetime(1953, 7, 13, 0, 0),
  'plot': 'An unhappy married couple deal with their problems on board the ill-fated ship.',
  'genres': ['Drama', 'History', 'Romance'],
  'runtime': 98,
  'cast': ['Clifton Webb', 'Barbara Stanwyck', 'Robert Wagner', 'Audrey Dalton'],
  'directors': ['Jean Negulesco'],
  'writers': ['Charles Brackett', 'Walter Reisch', 'Richard L. Breen'],
  'imdb': {'rating': 7.3, 'votes': 4677},
  'reviews': [
    {'username': 'user1', 'comment': 'THIS MOVIE IS AMAZING!!! I love watching it!'},
    {'username': 'user2', 'comment': 'Knowing it is the 3rd highest grossing film of all time as of now, I can imagine why!'},
    {'username': 'user3', 'comment': 'This movie is a masterpiece, everything is perfectly and beautifully shot and well acted.'}
  ]
}

Write the following queries in MongoDB (5 points for each query):

1. Find how many movies belong to the 'Romance' genre.

db.movies.find({genres: 'Romance'}).count()

2. Find the titles and the ratings of the three movies with the highest IMDB rating.

db.movies.find({}, {title: 1, 'imdb.rating': 1}).sort({'imdb.rating': -1}).limit(3)

3. Find the actor/actress who participated in the highest number of movies.

db.movies.aggregate([
    {$unwind: '$cast'},
    {$group: {'_id': '$cast', 'num_movies': {$count: {}}}},
    {$sort: {'num_movies': -1}},
    {$limit: 1}
])

4. Find the average number of reviews written per movie. Note that some movies don't have any reviews. (A PyMongo version of this query is sketched after query 5.)

db.movies.aggregate([
    {$project: {num_reviews: {$size: {$ifNull: ['$reviews', []]}}}},
    {$group: {_id: null, reviewsAvg: {$avg: '$num_reviews'}}}
])

5. Add an actor named "Samuel L. Jackson" to the movie "Pulp Fiction".

db.movies.updateOne(
    {'title': 'Pulp Fiction'},
    {$push: {'cast': 'Samuel L. Jackson'}}
)
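As a sketch of how these shell queries translate to Python (not required by the exam), here is query 4 run through PyMongo. The connection details and the database name moviesdb are made-up assumptions for illustration; adjust them for your deployment.

from pymongo import MongoClient

# Assumed connection details; adjust for your deployment.
client = MongoClient('localhost', 27017)
db = client['moviesdb']

# Query 4: average number of reviews per movie, counting
# movies with no 'reviews' field as having zero reviews.
pipeline = [
    {'$project': {'num_reviews': {'$size': {'$ifNull': ['$reviews', []]}}}},
    {'$group': {'_id': None, 'reviewsAvg': {'$avg': '$num_reviews'}}},
]
for doc in db.movies.aggregate(pipeline):
    print(doc)  # e.g. {'_id': None, 'reviewsAvg': 3.0}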
3 MapReduce (15 points)

For each of the following problems, describe how you would solve it using MapReduce. Write the map and reduce tasks in pseudo-code.

1. (5 points) The input is a list of housing data where each input record contains information about a single house: (address, city, state, zip, price). The output should be the average house price in each zip code.

function map(address, city, state, zip, price):
    emit(zip, price)

function reduce(zip, list of prices):
    emit(zip, avg(prices))

2. (10 points) The input contains two lists. The first list provides voter information for every registered voter: (voter-id, name, age, zip). The second list gives occupancy information: (zip, age, job). For each unique pair of zip and age values, the output should give a list of names and a list of jobs for people in that zip code with that age. If a particular zip/age pair appears in one input list but not the other, then that zip/age pair can appear in the output with an empty list of names or jobs, or you can omit it from the output entirely.

The map task first has to identify which list the input record came from. In this case, we can use the number of fields in the record to identify the list. The output of the mapper will be key = (zip, age) and value = either a name or a job, with an additional label to distinguish which it is.

function map(record):
    if len(record) == 4:  # Voter information: (voter-id, name, age, zip)
        emit((record[3], record[2]), ('name', record[1]))
    else:                 # Occupancy information: (zip, age, job)
        emit((record[0], record[1]), ('job', record[2]))

The reduce task will get all the names and jobs associated with a specific (zip, age) key and will aggregate the two lists together:

function reduce((zip, age), list of values):
    list_of_names = all values labeled 'name' (from the voter list)
    list_of_jobs = all values labeled 'job' (from the occupancy list)
    emit((zip, age), list_of_names, list_of_jobs)

If a particular zip/age pair appears in one input list but not the other, it will naturally result in an empty list for either names or jobs. A runnable version of this reduce-side join is sketched below.
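A minimal runnable version of problem 2 in plain Python (not part of the exam): the in-memory grouping below stands in for the framework's shuffle phase, and the sample records are made up for illustration.

from collections import defaultdict

def map_record(record):
    if len(record) == 4:  # (voter-id, name, age, zip)
        yield ((record[3], record[2]), ('name', record[1]))
    else:                 # (zip, age, job)
        yield ((record[0], record[1]), ('job', record[2]))

def reduce_group(key, values):
    # Separate the labeled values back into the two lists.
    names = [v for label, v in values if label == 'name']
    jobs = [v for label, v in values if label == 'job']
    return (key, names, jobs)

def run_join(records):
    # In-memory stand-in for the shuffle: group mapper output by key.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_record(record):
            groups[key].append(value)
    return [reduce_group(key, values) for key, values in groups.items()]

# Made-up sample records: two voters and one occupancy entry.
records = [
    (101, 'Ada', 34, '02115'),
    (102, 'Grace', 34, '02115'),
    ('02115', 34, 'engineer'),
]
for row in run_join(records):
    print(row)  # (('02115', 34), ['Ada', 'Grace'], ['engineer'])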
4 Spark (20 points)

You have been hired as a data scientist at Spaces, a large social networking website where users can follow each other. In your first task, you are given the Spaces follow graph. The graph has 550 million nodes and 100 billion directed edges.

Using Spark, write a parallel program to compute the in-degree distribution of the graph. That is, for each k, your program should compute the number of users who have k followers (unless this number is zero, in which case you do not need to include that count in the final output). Your program should effectively take advantage of parallel processing and not attempt to read 100 billion edges on a single machine.

Your input is an RDD, where each entry is an edge: (u, v) is an entry in the RDD if u follows v. Your function should return another RDD, where the entries are of the form (k, number of users who have k followers). Fill in the code stub provided for you below. You should write real code (rather than pseudocode).

def degree_distribution(edges):
    """
    Input: an RDD containing entries of the form (source node, destination node)
    Output: an RDD containing entries of the form (k, number of users
    who have k followers)
    """
    # Count followers per user: each edge (u, v) contributes one follower to v.
    rdd = edges.map(lambda edge: (edge[1], 1))
    rdd = rdd.reduceByKey(lambda x, y: x + y)
    # Re-key by follower count and count how many users share each count.
    rdd = rdd.map(lambda count: (count[1], 1))
    distribution = rdd.reduceByKey(lambda x, y: x + y)
    return distribution

# Read input and convert all node ids to integers
data = sc.textFile('followers.txt').map(
    lambda line: (int(line.split()[0]), int(line.split()[1])))

# Compute the degree distribution
distribution = degree_distribution(data)

# Write the output to a file
distribution.sortByKey().saveAsTextFile('distribution_output')
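As a quick sanity check (not part of the exam), the solution can be run on a tiny in-memory graph; the three edges below are made up for illustration.

# Made-up toy graph: 1 follows 2, 3 follows 2, and 1 follows 3,
# so user 2 has two followers and user 3 has one.
toy_edges = sc.parallelize([(1, 2), (3, 2), (1, 3)])
print(degree_distribution(toy_edges).sortByKey().collect())
# [(1, 1), (2, 1)]: one user with 1 follower, one user with 2 followers.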