Question

How does Apache Spark differ from Hadoop?

Expert Solution

Step 1

Hadoop is an open-source framework for storing and processing big data in a distributed environment across clusters of computers. It is designed to scale from a single server to thousands of machines, each offering local computation and storage.

Spark is an open-source cluster computing framework designed for fast computation. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Its main feature is in-memory cluster computing, which increases the processing speed of an application.
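
To make the in-memory idea concrete, here is a minimal sketch in plain Python. It is a toy model, not Hadoop or Spark code: the function names `run_with_disk` and `run_in_memory` and the three arithmetic stages are assumptions made up for illustration. It runs the same multi-stage pipeline twice, once round-tripping every intermediate result through a temporary file (roughly how chained MapReduce-style jobs hand data to each other) and once keeping intermediate data in memory.

```python
import json
import os
import tempfile


def run_with_disk(records, stages):
    """Toy model: after every stage, write the result to disk and read it back."""
    data = list(records)
    disk_round_trips = 0
    with tempfile.TemporaryDirectory() as tmp:
        for i, stage in enumerate(stages):
            data = [stage(x) for x in data]
            path = os.path.join(tmp, f"stage_{i}.json")
            with open(path, "w") as f:   # persist the intermediate output
                json.dump(data, f)
            with open(path) as f:        # the next stage reloads it from disk
                data = json.load(f)
            disk_round_trips += 1
    return data, disk_round_trips


def run_in_memory(records, stages):
    """Toy model: intermediate results stay in memory between stages."""
    data = list(records)
    for stage in stages:
        data = [stage(x) for x in data]
    return data, 0


# Hypothetical three-stage pipeline, used only for illustration.
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]

disk_result, disk_trips = run_with_disk(range(5), stages)
mem_result, mem_trips = run_in_memory(range(5), stages)

print(disk_result == mem_result)  # both approaches compute the same answer
print(disk_trips, mem_trips)      # only the disk version pays per-stage I/O
```

Both versions return the same result; the difference is the per-stage disk round trip, and reducing that cost is the reason Spark keeps intermediate data in memory.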

Step 2

Key differences:

Hadoop is an open-source framework built around the MapReduce programming model, whereas Spark is a fast cluster computing engine that extends the MapReduce model to support more types of computation efficiently.
Hadoop’s MapReduce model reads from and writes to disk, which slows processing, whereas Spark reduces the number of disk read/write cycles and stores intermediate data in memory, which makes processing faster.
Hadoop MapReduce requires developers to hand-code every operation as map and reduce functions, whereas Spark is easier to program through the high-level operators of its RDD (Resilient Distributed Dataset) abstraction.
Hadoop MapReduce provides only a batch engine, so it depends on other engines for other workloads, whereas Spark runs batch, interactive, machine learning, and streaming workloads in the same cluster.
Hadoop is designed to handle batch processing efficiently, whereas Spark is designed to handle real-time data efficiently as well.
Hadoop MapReduce is a high-latency computing framework with no interactive mode, whereas Spark is a low-latency engine that can process data interactively.
With Hadoop MapReduce, a developer can process data only in batch mode, whereas Spark can also process real-time data through Spark Streaming.
Hadoop is designed to handle faults and failures and is therefore highly fault-tolerant, whereas Spark achieves fault tolerance through RDDs, which allow partitions lost on failed nodes to be recomputed.
Hadoop needs an external job scheduler, such as Oozie, to schedule complex flows, whereas Spark has its own built-in scheduler that plans each job as a graph of stages.
Hadoop is the cheaper option in terms of hardware cost, whereas Spark needs a lot of RAM to run in memory, which increases the cost of the cluster.
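
The point above about RDDs recovering partitions on failed nodes can be sketched in plain Python. This is a toy model under stated assumptions, not Spark's actual implementation or API: the class `ToyRDD` and its methods `compute` and `lose_partition` are hypothetical names invented here. The idea it illustrates is that each derived dataset remembers its parent and the function used to build it (its lineage), so a lost partition can be recomputed instead of being restored from a replica.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: partitions plus the recipe to rebuild them."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self.parent = parent  # lineage: the dataset this one was derived from
        self.fn = fn          # lineage: the per-element transformation applied
        if partitions is not None:
            self.partitions = [list(p) for p in partitions]
        else:
            # Derived datasets start empty and are computed lazily.
            self.partitions = [None] * len(parent.partitions)

    def map(self, fn):
        """Record a transformation without running it yet."""
        return ToyRDD(parent=self, fn=fn)

    def compute(self, i):
        """Return partition i, rebuilding it from the parent if it is missing."""
        if self.partitions[i] is None:
            if self.parent is None:
                raise RuntimeError("source partition lost and no lineage to rebuild it")
            self.partitions[i] = [self.fn(x) for x in self.parent.compute(i)]
        return self.partitions[i]

    def collect(self):
        """Gather all partitions into one list."""
        return [x for i in range(len(self.partitions)) for x in self.compute(i)]

    def lose_partition(self, i):
        """Simulate the node holding partition i failing."""
        self.partitions[i] = None


source = ToyRDD(partitions=[[1, 2], [3, 4]])
squared = source.map(lambda x: x * x)

before = squared.collect()
squared.lose_partition(1)         # pretend a worker died
lost = squared.partitions[1]      # None: the data is really gone
after = squared.collect()         # partition 1 is recomputed from lineage

print(before == after)
```

Only the missing partition is rebuilt; the surviving one is reused as is. The sketch assumes the source data itself remains available, which in a real cluster would typically mean durable storage such as HDFS.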