Name: Ankush Maheshwari
Student ID: 0032646352
Purdue University (Spring 2023)
CS44000: Large-scale Data Analytics
Homework 1
IMPORTANT:
- Upload a PDF file with your answers to Gradescope.
- Please use either the LaTeX template or the Word template to write down your answers and generate a PDF file.
  - LaTeX template: https://www.cs.purdue.edu/homes/csjgwang/CS440/template.tex
  - Word template: https://www.cs.purdue.edu/homes/csjgwang/CS440/template.docx
Problem | Score
--------|------
1       |
2       |
3       |
4       |
5       |
Total   |
Problem 1

1a) We can solve this problem using the Hadoop MapReduce approach. The map function reads each document and emits key-value pairs, where the key is a word and the value is the ID of the document it appears in. This is done for every word in every document. The framework then sorts the key-value pairs by word and groups them together, so all the document IDs associated with a word arrive at the same reducer.

The reduce function receives the grouped key-value pairs, collects the unique document IDs for each word key, and emits the final output. A set ensures that the document IDs associated with each word are unique.
1b)
Map(String docID, String content):
    words = content.split()                # Split document content into words
    for word in words:
        emit(word, docID)                  # Emit (word, documentID) key-value pair

Reduce(String word, List<documentIDs>):
    unique_docs = Set()                    # Use a set to store unique document IDs
    for docID in documentIDs:
        unique_docs.add(docID)             # Add document ID to the set
    format_docs = join(unique_docs, ', ')  # Comma-separate doc IDs for output
    emit(word + ': ' + format_docs)        # Emit word and its set of document IDs
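The pseudocode above can be checked with a small single-process simulation of the map, shuffle, and reduce phases (plain Python, no Hadoop required; function and variable names are illustrative):

```python
from collections import defaultdict

def map_phase(doc_id, content):
    # Emit a (word, doc_id) pair for every word in the document
    return [(word, doc_id) for word in content.split()]

def shuffle(pairs):
    # Group values by key, as the MapReduce framework would between phases
    grouped = defaultdict(list)
    for word, doc_id in pairs:
        grouped[word].append(doc_id)
    return grouped

def reduce_phase(word, doc_ids):
    # Deduplicate document IDs and format the inverted-index entry
    unique_docs = sorted(set(doc_ids))
    return word + ': ' + ', '.join(unique_docs)

docs = {'d1': 'big data big', 'd2': 'big analytics'}
pairs = [kv for doc_id, content in docs.items()
            for kv in map_phase(doc_id, content)]
index = {w: reduce_phase(w, ids) for w, ids in shuffle(pairs).items()}
```

Note that the duplicate occurrence of "big" in d1 is collapsed by the set, exactly as in the reducer above.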
Problem 2

2a)
Databases:
Write-ahead Logging (WAL): Databases employ transaction logs and write-ahead logging mechanisms. When changes occur, they are first written to a log file before being applied to the actual database. In case of a crash, the system can use the log to replay the operations and restore the database to a consistent state. The log is usually much smaller than the data, and there are two types of log records: redo (contains the new data) and undo (contains the old data). Depending on the DBMS, one of three logging approaches may be used: UNDO only, REDO only, or both UNDO and REDO.
Pros:
- Provides the ACID (Atomicity, Consistency, Isolation, Durability) properties.
- Ensures data consistency by logging transactions before applying changes.
- Allows for point-in-time recovery.

Cons:
- The overhead of maintaining logs can impact performance.
- Recovery might be slower for large databases due to log replay.
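The redo-logging idea can be sketched in a few lines (a toy in-memory key-value store; the record format is illustrative and not tied to any specific DBMS, and a real system would write the log to durable storage):

```python
import json

class RedoLogStore:
    """Toy key-value store with redo-style write-ahead logging."""

    def __init__(self):
        self.data = {}   # the "database"
        self.log = []    # the write-ahead log (a file on disk in a real DBMS)

    def put(self, key, value):
        # Write the redo record (new value) to the log BEFORE applying it
        self.log.append(json.dumps({'op': 'put', 'key': key, 'value': value}))
        self.data[key] = value

    def recover(self):
        # After a crash, replay the log in order to rebuild a consistent state
        restored = {}
        for record in self.log:
            rec = json.loads(record)
            if rec['op'] == 'put':
                restored[rec['key']] = rec['value']
        return restored

store = RedoLogStore()
store.put('a', 1)
store.put('a', 2)   # later writes override earlier ones during replay
store.put('b', 3)
recovered = store.recover()
```

Replaying the log reproduces the same state the store had before the "crash", which is the core of redo recovery.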
Hadoop:
Replication and Redundancy: Hadoop handles failure by replicating data across multiple nodes in a cluster. HDFS (Hadoop Distributed File System) replicates data blocks across different nodes, ensuring redundancy. When a node fails, Hadoop can retrieve the data from other replicas.
Pros:
- Fault tolerance through data replication.
- No single point of failure due to data redundancy.
- Parallel processing allows for continued execution despite node failures.
Cons:
- High replication can lead to increased storage requirements.
- Less efficient for scenarios with frequent small writes due to replication overhead.
2b)
Hadoop:
(copied from above as the same points apply here)
Replication and Redundancy: Hadoop handles failure by replicating data across multiple nodes in a cluster. HDFS (Hadoop Distributed File System) replicates data blocks across different nodes, ensuring redundancy. When a node fails, Hadoop can retrieve the data from other replicas.
Pros:
- Fault tolerance through data replication.
- No single point of failure due to data redundancy.
- Parallel processing allows for continued execution despite node failures.

Cons:
- High replication can lead to increased storage requirements.
- Less efficient for scenarios with frequent small writes due to replication overhead.
Spark:
RDD Lineage and Resilient Distributed Datasets (RDDs): Spark employs RDD lineage, a directed acyclic graph (DAG) of operations. Spark stores information about how to recreate RDDs from the original data using transformations. In case of failure, Spark uses this lineage information to recompute lost RDD partitions, restoring the data to a consistent state.

Pros:
- Provides fault tolerance by tracking the sequence of operations needed to rebuild RDDs.
- Allows for in-memory computation with efficient recovery.
- Can recover lost data by recomputing transformations.

Cons:
- Recomputation can be resource-intensive.
- If the lineage information is somehow lost, recovery becomes challenging.
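The lineage idea can be illustrated with a toy RDD class that records its parent and transformation, so lost results can always be recomputed from the source (class and method names are illustrative; this is not the Spark API):

```python
class ToyRDD:
    """Minimal RDD-like object that remembers its lineage."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source   # original data (only set on the root RDD)
        self.parent = parent   # parent RDD in the lineage DAG
        self.fn = fn           # transformation applied to the parent

    def map(self, fn):
        # Record the transformation in the lineage; nothing is computed yet
        return ToyRDD(parent=self, fn=fn)

    def compute(self):
        # Rebuild this RDD from the root by replaying the lineage chain,
        # which is what Spark does to restore a lost partition
        if self.parent is None:
            return list(self.source)
        return [self.fn(x) for x in self.parent.compute()]

root = ToyRDD(source=[1, 2, 3])
doubled = root.map(lambda x: x * 2).map(lambda x: x + 1)
# Even if cached results are "lost", the lineage can rebuild them at any time
result = doubled.compute()
```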
Problem 3

3a) Advantages of HBase over traditional databases:
Scalability:
HBase is built on top of Hadoop and follows a distributed architecture. It scales horizontally by adding more machines to the cluster, accommodating massive amounts of data.
Schema Flexibility:
HBase allows columns to be added on the fly without redesigning the entire database structure. This flexibility is very useful in scenarios where the schema is subject to frequent changes.
Support for Semi-Structured Data:
HBase accommodates semi-structured or schema-less data.
Fault Tolerance:
HBase ensures high fault tolerance by replicating data across multiple nodes in the cluster. If a node fails, data is available on other nodes, minimizing the risk of data loss.
Big Data Processing:
HBase is well-suited for handling big data applications due to its distributed nature. It supports high-throughput read and write operations.
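The schema-flexibility point can be made concrete with a toy wide-column model (plain Python dicts standing in for HBase rows and "family:qualifier" columns; no HBase client is used):

```python
# Each row maps "family:qualifier" -> value; rows need not share columns
table = {}

def put(row_key, column, value):
    # New columns can be added on the fly; no schema change is required
    table.setdefault(row_key, {})[column] = value

put('user1', 'info:name', 'Alice')
put('user1', 'info:email', 'alice@example.com')
# user2 gets a column user1 never had, which a wide-column store allows
put('user2', 'info:name', 'Bob')
put('user2', 'stats:logins', 7)
```

Rows with different column sets coexist naturally, which is also why sparse tables are cheap: absent columns simply are not stored.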
3b) HBase vs HDFS:

HBase:

Pros:
- HBase offers low-latency random read and write access, while HDFS is optimized only for sequential reads.
- It provides ACID guarantees at the row level, ensuring data consistency.
- Supports both structured and semi-structured data, including big data and sparse tables.
Cons:
- Has additional overhead due to indexing.
- Not appropriate for batch analytics.
- Setting up an HBase cluster is more complex than for other database systems.
HDFS:

Pros:
- HDFS offers simple and robust storage for large-scale data.
- It is highly fault-tolerant and cost-effective.
- Suitable for batch analytics.

Cons:
- Not optimized for random read/write access, so lookups are slow.
- Lacks indexing and efficient lookup mechanisms.
- It has a rigid architecture and cannot perform real-time analysis.
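The random-access difference can be illustrated with a toy comparison (a Python dict stands in for HBase's keyed storage, a linear scan for HDFS-style sequential reads; the data is made up):

```python
# A dataset of (row_key, value) records
records = [('row%05d' % i, i * i) for i in range(10000)]

# HDFS-style access: no index, so finding one row means scanning everything
def scan_lookup(key):
    for k, v in records:
        if k == key:
            return v
    return None

# HBase-style access: a key index (here just a dict) gives direct random access
index = dict(records)

def indexed_lookup(key):
    return index.get(key)
```

Both return the same answer, but the scan does O(n) work per lookup while the indexed path is effectively constant time, which is the trade HBase makes by paying the indexing overhead listed above.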
Problem 4

4a) RDDs' importance to Spark:

Distributed Processing: RDDs allow parallel processing by distributing data across multiple nodes in a cluster. This enables efficient and scalable computation at large scale.

Fault Tolerance: RDDs maintain a lineage graph, enabling reconstruction of lost data partitions in case of failures.

Immutability: RDDs are immutable; their contents cannot be changed after creation.

Transformation Operations: RDDs support transformations from one RDD to another (e.g., map, filter) and actions on a particular RDD (e.g., count, reduce).

Lazy Evaluation: RDDs are evaluated on demand, saving substantial data-processing time.

Memory Storage: RDDs are stored in memory, so they can be reused across multiple computations, and they can be persisted to disk when required.

Functionality: RDDs provide more functionality than Hadoop MapReduce, and their operations are coarse-grained.
4b) In lazy evaluation, transformations on RDDs are not executed immediately; they are queued up until an action is called. This offers the following advantages:
Optimization: Delayed evaluation allows Spark to analyze the directed acyclic graph (DAG) of all previously defined transformations and then determine the most efficient execution plan (e.g., running stages in parallel or pipelining transformations).

Resource Utilization: It enables Spark to pipeline operations and perform them in a single pass, eliminating the need to materialize intermediate results on disk or in memory.

Reduced Overhead: Lazy evaluation minimizes unnecessary computation, avoiding the cost of operations whose results are never needed and improving performance.

Handling Large Datasets: It makes it possible to work with larger and more complex datasets.

Increased Modularity: It divides jobs into smaller, more modular stages for easier coding, debugging, and maintenance.
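Lazy evaluation can be mimicked with a small pipeline class that only queues transformations and runs them in a single pass when an action is called (class and method names are illustrative; this is not the PySpark API):

```python
class LazyPipeline:
    """Queues transformations; nothing runs until an action is called."""

    def __init__(self, data):
        self.data = data
        self.transformations = []   # queued, not yet executed

    def map(self, fn):
        self.transformations.append(('map', fn))
        return self

    def filter(self, pred):
        self.transformations.append(('filter', pred))
        return self

    def collect(self):
        # The action: apply all queued transformations in one streaming pass,
        # with no intermediate lists materialized between steps
        items = iter(self.data)
        for kind, fn in self.transformations:
            items = (map if kind == 'map' else filter)(fn, items)
        return list(items)

pipeline = LazyPipeline(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# No work has happened yet; calling collect() triggers the whole pipeline
result = pipeline.collect()
```

Because the steps are only queued, the pipeline is free to fuse them into one pass, which is a miniature version of the pipelining advantage described above.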
4c) The distributed join operation belongs to the wide dependency in Spark.
Narrow dependencies are one-to-one and arise when each partition of the parent RDD is used by at most one partition of the child RDD. Wide dependencies occur when multiple child partitions depend on one or more parent partitions.

In a distributed join operation, each partition of the parent RDD may be depended on by multiple child partitions. The join involves reshuffling and redistributing data from multiple partitions of different RDDs, because it combines data from two separate RDDs, connecting records with matching keys across them. This makes it an example of a wide dependency: it requires exchanging data across RDD partitions, and the mapping is not one-to-one. Furthermore, multiple child partitions can depend on one parent partition, which violates the definition of a narrow dependency.
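The shuffle behind a distributed join can be sketched as hash-partitioning both sides by key, so that each output partition may draw rows from every input partition (a single-process simulation; function names and data are illustrative):

```python
def hash_partition(pairs, num_partitions):
    # Route each (key, value) pair to a partition by hashing its key;
    # this is the shuffle step that makes a join a wide dependency
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[hash(key) % num_partitions].append((key, value))
    return parts

def join_partition(left, right):
    # Standard hash join within a single co-partitioned pair
    lookup = {}
    for key, value in left:
        lookup.setdefault(key, []).append(value)
    return [(k, (lv, rv)) for k, rv in right for lv in lookup.get(k, [])]

left = [('a', 1), ('b', 2), ('c', 3)]
right = [('a', 'x'), ('b', 'y'), ('d', 'z')]
num_partitions = 2
# Matching keys hash to the same partition index on both sides
left_parts = hash_partition(left, num_partitions)
right_parts = hash_partition(right, num_partitions)
joined = [row for i in range(num_partitions)
              for row in join_partition(left_parts[i], right_parts[i])]
```

Every input partition can contribute records to every output partition, so no child partition depends on just one parent partition: exactly the many-to-many mapping that defines a wide dependency.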
Problem 5

5a) Pros and Cons of BSP:

Pros:
- Parallel Programming: BSP offers a structured method for parallel programming. It divides the computation into a series of supersteps, each consisting of a computation phase, a communication phase, and barrier synchronization.
- Fault Tolerance: It inherently supports fault tolerance by organizing computation into supersteps. Synchronization occurs at the end of each superstep, allowing for state checkpointing and recovery in case of failures.
- Reduced Overhead: By enforcing synchronization only at the end of each superstep, BSP can potentially reduce the overhead associated with frequent synchronization.
- Scalability: BSP is scalable, as it allows efficient coordination and synchronization among parallel processes.

Cons:
- Implementation Issues: Caches, distribution of data across memories, consistency, and synchronization costs can pose challenges. Some algorithms also do not fit the BSP paradigm and need modification to work with it.
- Synchronization Overhead: The synchronization points at the end of each superstep can introduce overhead, especially if some nodes finish their computation faster than others, leaving them idle while waiting at the barrier.
- Communication Overhead: Transmitting data between supersteps can be costly, especially when large amounts of data must be communicated.
- Uneven Graph Partitioning: Uneven partitioning can cause workload imbalances across nodes.
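The superstep structure can be sketched with Python threads and a barrier (a toy that sums partial counts across workers; `threading.Barrier` plays the role of BSP's end-of-superstep synchronization, and the worker layout is illustrative):

```python
import threading

NUM_WORKERS = 4
barrier = threading.Barrier(NUM_WORKERS)
inboxes = [[] for _ in range(NUM_WORKERS)]   # messages delivered between supersteps
results = [0] * NUM_WORKERS

def worker(wid, data):
    # Superstep 1: local computation phase, then communication phase
    partial = sum(data)
    inboxes[0].append(partial)      # send the partial sum to worker 0
    barrier.wait()                  # barrier synchronization ends the superstep
    # Superstep 2: worker 0 now sees every message sent in superstep 1
    if wid == 0:
        results[0] = sum(inboxes[0])
    barrier.wait()

chunks = [[1, 2], [3, 4], [5, 6], [7, 8]]
threads = [threading.Thread(target=worker, args=(i, chunks[i]))
           for i in range(NUM_WORKERS)]
for t in threads: t.start()
for t in threads: t.join()
total = results[0]
```

The barrier guarantees that no worker starts superstep 2 until all messages from superstep 1 have been delivered; it also shows the idle-time cost, since fast workers wait there for slow ones.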
5b) Pros and Cons of Cloud-Native Database Architecture:

Pros:
- Decoupled Storage and Compute: Decoupling the storage and compute engines allows independent scaling. This architecture enables dynamic resource allocation, optimizing performance to match demand.
- Fault Isolation: If the compute engine fails, the stored data is not necessarily affected, ensuring better fault isolation and easier recovery.
- Cost Optimization: By scaling compute and storage independently, we can make more cost-effective decisions based on our strategic requirements.
- Scalability: The decoupled architecture, with no need for in-house compute and storage, allows the system to be extremely scalable.

Cons:
- Data Transfer Overhead: Moving data between the storage and compute layers adds overhead, most notably when the two layers communicate frequently.
- Complexity: Managing separate storage and compute layers introduces complexity in components such as query planning, cross-node communication, and concurrency control.
- Network Dependency: Performance can be affected by the network latency and bandwidth between the storage and compute components.
- Cost: In-house storage tends to be cheaper than cloud storage at scale; however, it requires maintenance and upkeep.