How does Apache spark differ from Hadoop

Computer Networking: A Top-Down Approach (7th Edition)
7th Edition
ISBN:9780133594140
Author:James Kurose, Keith Ross
Publisher:James Kurose, Keith Ross
Chapter1: Computer Networks And The Internet
Section: Chapter Questions
Problem R1RQ: What is the difference between a host and an end system? List several different types of end...
icon
Related questions
Question

How does Apache spark differ from Hadoop?

Expert Solution
Step 1

Hadoop is an open-source framework that allows to store and process big data, in a distributed environment across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, where every machine is offering local computation and storage.

Spark is an open-source cluster computing designed for fast computation. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. The main feature of Spark is in-memory cluster computing that increases the speed of an application

Step 2

Key differences:

  • Hadoop is an open source framework which uses a MapReduce algorithm whereas Spark is lightning fast cluster computing technology, which extends the MapReduce model to efficiently use with more type of computations.
  • Hadoop’s MapReduce model reads and writes from a disk, thus slow down the processing speed whereas Spark reduces the number of read/write cycles to disk and store intermediate data in-memory, hence faster-processing speed.
  • Hadoop requires developers to hand code each and every operation whereas Spark is easy to program with RDD – Resilient Distributed Dataset.
  • Hadoop MapReduce model provides a batch engine, hence dependent on different engines for other requirements whereas Spark performs batch, interactive, Machine Learning and Streaming all in the same cluster.
  • Hadoop is designed to handle batch processing efficiently whereas Spark is designed to handle real-time data efficiently.
  • Hadoop is a high latency computing framework, which does not have an interactive mode whereas Spark is a low latency computing and can process data interactively.
  • With Hadoop MapReduce, a developer can only process data in batch mode only whereas Spark can process real-time data through Spark Streaming.
  • Hadoop is designed to handle faults and failures, it is naturally resilient toward faults, hence a highly fault-tolerant system whereas, with Spark, RDD allows recovery of partitions on failed nodes.
  • Hadoop needs an external job scheduler for example – Oozie to schedule complex flows whereas Spark has in-memory computation, so it has its own flow scheduler.
  • Hadoop is a cheaper option available while comparing it in terms of cost whereas Spark requires a lot of RAM to run in-memory, thus increasing the cluster and hence cost.
steps

Step by step

Solved in 3 steps

Blurred answer
Recommended textbooks for you
Computer Networking: A Top-Down Approach (7th Edi…
Computer Networking: A Top-Down Approach (7th Edi…
Computer Engineering
ISBN:
9780133594140
Author:
James Kurose, Keith Ross
Publisher:
PEARSON
Computer Organization and Design MIPS Edition, Fi…
Computer Organization and Design MIPS Edition, Fi…
Computer Engineering
ISBN:
9780124077263
Author:
David A. Patterson, John L. Hennessy
Publisher:
Elsevier Science
Network+ Guide to Networks (MindTap Course List)
Network+ Guide to Networks (MindTap Course List)
Computer Engineering
ISBN:
9781337569330
Author:
Jill West, Tamara Dean, Jean Andrews
Publisher:
Cengage Learning
Concepts of Database Management
Concepts of Database Management
Computer Engineering
ISBN:
9781337093422
Author:
Joy L. Starks, Philip J. Pratt, Mary Z. Last
Publisher:
Cengage Learning
Prelude to Programming
Prelude to Programming
Computer Engineering
ISBN:
9780133750423
Author:
VENIT, Stewart
Publisher:
Pearson Education
Sc Business Data Communications and Networking, T…
Sc Business Data Communications and Networking, T…
Computer Engineering
ISBN:
9781119368830
Author:
FITZGERALD
Publisher:
WILEY