How does Apache spark differ from Hadoop
Computer Networking: A Top-Down Approach (7th Edition)
7th Edition
ISBN:9780133594140
Author:James Kurose, Keith Ross
Publisher:James Kurose, Keith Ross
Chapter1: Computer Networks And The Internet
Section: Chapter Questions
Problem R1RQ: What is the difference between a host and an end system? List several different types of end...
Related questions
Question
How does Apache spark differ from Hadoop?
Expert Solution
Step 1
Hadoop is an open-source framework that allows to store and process big data, in a distributed environment across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, where every machine is offering local computation and storage.
Spark is an open-source cluster computing designed for fast computation. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. The main feature of Spark is in-memory cluster computing that increases the speed of an application
Step 2
Key differences:
- Hadoop is an open source framework which uses a MapReduce algorithm whereas Spark is lightning fast cluster computing technology, which extends the MapReduce model to efficiently use with more type of computations.
- Hadoop’s MapReduce model reads and writes from a disk, thus slow down the processing speed whereas Spark reduces the number of read/write cycles to disk and store intermediate data in-memory, hence faster-processing speed.
- Hadoop requires developers to hand code each and every operation whereas Spark is easy to program with RDD – Resilient Distributed Dataset.
- Hadoop MapReduce model provides a batch engine, hence dependent on different engines for other requirements whereas Spark performs batch, interactive, Machine Learning and Streaming all in the same cluster.
- Hadoop is designed to handle batch processing efficiently whereas Spark is designed to handle real-time data efficiently.
- Hadoop is a high latency computing framework, which does not have an interactive mode whereas Spark is a low latency computing and can process data interactively.
- With Hadoop MapReduce, a developer can only process data in batch mode only whereas Spark can process real-time data through Spark Streaming.
- Hadoop is designed to handle faults and failures, it is naturally resilient toward faults, hence a highly fault-tolerant system whereas, with Spark, RDD allows recovery of partitions on failed nodes.
- Hadoop needs an external job scheduler for example – Oozie to schedule complex flows whereas Spark has in-memory computation, so it has its own flow scheduler.
- Hadoop is a cheaper option available while comparing it in terms of cost whereas Spark requires a lot of RAM to run in-memory, thus increasing the cluster and hence cost.
Step by step
Solved in 3 steps
Recommended textbooks for you
Computer Networking: A Top-Down Approach (7th Edi…
Computer Engineering
ISBN:
9780133594140
Author:
James Kurose, Keith Ross
Publisher:
PEARSON
Computer Organization and Design MIPS Edition, Fi…
Computer Engineering
ISBN:
9780124077263
Author:
David A. Patterson, John L. Hennessy
Publisher:
Elsevier Science
Network+ Guide to Networks (MindTap Course List)
Computer Engineering
ISBN:
9781337569330
Author:
Jill West, Tamara Dean, Jean Andrews
Publisher:
Cengage Learning
Computer Networking: A Top-Down Approach (7th Edi…
Computer Engineering
ISBN:
9780133594140
Author:
James Kurose, Keith Ross
Publisher:
PEARSON
Computer Organization and Design MIPS Edition, Fi…
Computer Engineering
ISBN:
9780124077263
Author:
David A. Patterson, John L. Hennessy
Publisher:
Elsevier Science
Network+ Guide to Networks (MindTap Course List)
Computer Engineering
ISBN:
9781337569330
Author:
Jill West, Tamara Dean, Jean Andrews
Publisher:
Cengage Learning
Concepts of Database Management
Computer Engineering
ISBN:
9781337093422
Author:
Joy L. Starks, Philip J. Pratt, Mary Z. Last
Publisher:
Cengage Learning
Prelude to Programming
Computer Engineering
ISBN:
9780133750423
Author:
VENIT, Stewart
Publisher:
Pearson Education
Sc Business Data Communications and Networking, T…
Computer Engineering
ISBN:
9781119368830
Author:
FITZGERALD
Publisher:
WILEY