Is Apache Spark faster than Hadoop?

Apache Spark is the new trend technology in big data, analytics and data science. Many Spark proponents even believe that the platform overshadows everything else to such an extent that it will soon be the dominant tool for data scientists. That view is not unfounded: Spark's high performance on very large data volumes has led to it being seen as the successor to Hadoop. "From a technical point of view, Spark is a significantly faster and more powerful engine than Hadoop," said Reynold Xin, data engineer and co-founder of Databricks, the company leading the Apache Spark project. Forrester analyst Mike Gualtieri also sees Spark at an advantage because of its faster processing. "Hadoop was built for big data; Spark was built for high speeds," he says.

Record-breaking performance

Spark's performance first gained public recognition when it set a new record in the Daytona GraySort benchmark last year. The test requires sorting 100 Tbytes of data. Databricks set up 206 machines with almost 6,600 cores for the purpose. Spark needed only 23 minutes for the sort job, significantly less than the previous record of 72 minutes held by Yahoo with Hadoop. It should also be noted that the Hadoop run used 2,100 nodes with over 50,000 cores.
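
To put these figures side by side, the following minimal sketch works through the arithmetic. It uses only the numbers quoted above; the core counts are the article's rounded values ("almost 6,600" and "over 50,000"), so the results are approximate, and the `core_minutes` helper is a hypothetical name introduced here for illustration.

```python
# Figures as quoted in the article; the core counts are its rounded values.
spark = {"machines": 206, "cores": 6_600, "minutes": 23}
hadoop = {"machines": 2_100, "cores": 50_000, "minutes": 72}


def core_minutes(run):
    """Total compute consumed: cores multiplied by wall-clock minutes."""
    return run["cores"] * run["minutes"]


# How much faster the sort finished in wall-clock time.
wall_clock_speedup = hadoop["minutes"] / spark["minutes"]
# How many times more machines the Hadoop run used.
machine_ratio = hadoop["machines"] / spark["machines"]
# How many times more core-minutes the Hadoop run consumed.
core_minute_ratio = core_minutes(hadoop) / core_minutes(spark)

print(f"Wall-clock speedup: {wall_clock_speedup:.1f}x")
print(f"Machines, Hadoop vs. Spark: {machine_ratio:.1f}x")
print(f"Core-minutes, Hadoop vs. Spark: {core_minute_ratio:.1f}x")
```

On these rounded inputs, Spark finished roughly three times faster on about a tenth of the machines. The comparison says nothing about differences in the hardware itself, which the article does not describe.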

Many more features than Hadoop

But much better performance alone is not enough to predict the end of Hadoop. Flexibility and range of applications are at least as important. Spark can be used with a variety of data platforms. It also offers native support for in-memory processing, including optimized distribution of data between memory and hard disk. In this respect, those who expect Hadoop to end soon seem to be right.

Friend and foe at the same time

In fact, Spark can be both: a dominant competitor or an excellent complement to Hadoop. Gualtieri qualifies his praise of Spark: "If you consider that opposites attract, then Spark and Hadoop form a perfect team; after all, both are cluster platforms that can be distributed over many nodes and have very different advantages and disadvantages."

The Hadoop specialist Cloudera emphasizes the combined market interest in the two platforms. "Anyone who adopts Hadoop today firmly assumes that Spark is part of it," says its chief technologist, Eli Collins. This fits his view that Spark is just one of many Hadoop tools, alongside MapReduce, Drill, Impala and a few others.

Universal data platforms

But Spark differs from many other tools in one essential point: it does not have to run on Hadoop's HDFS file system. It is just as efficient when operated with other data platforms such as AWS S3, HBase or Apache Cassandra. Cassandra is now becoming the preferred data platform for Spark. According to a study by Typesafe, 20 percent of all Spark instances already run on Cassandra.

Difficult times for MapReduce

Spark does much better especially when compared to MapReduce. MapReduce is a ten-year-old basic component of the original Hadoop platform. It is slow, batch-oriented and very complex. Spark, on the other hand, is fast and flexible; it can be used for batch-oriented as well as iterative or streaming analyses. The latter makes Spark particularly interesting for real-time analyses. "Spark's flexible range of use means that existing big data applications can be operated faster and in more varied ways," says Xin about the particular advantages of Spark.
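
To illustrate what "batch-oriented" means here, the following is a minimal single-process sketch of the map, shuffle and reduce phases that a MapReduce job runs in sequence, using a word count as the example. It is plain Python, not Hadoop or Spark code, and the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group all emitted values by key, between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    # One complete batch job: map, then shuffle, then reduce.
    return reduce_phase(shuffle_phase(map_phase(lines)))


sample = ["Spark is fast", "Hadoop is batch oriented", "Spark is flexible"]
counts = word_count(sample)
print(counts["spark"], counts["is"], counts["hadoop"])
```

In Hadoop MapReduce, each such job reads its input from storage and writes its result back to storage. An iterative analysis therefore repeats the whole cycle on every pass, which is the kind of workload where an engine that keeps data in memory can be expected to have an advantage.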

High speed of development

Meanwhile, Spark also enjoys the support of many important IT companies, such as IBM, Hortonworks, Cloudera, Pivotal and the R specialist Revolution Analytics, which was recently acquired by Microsoft. The pace of development is correspondingly high. Spark 1.3 was released on March 13th; compared to its predecessor, it is mainly characterized by faster data analysis. At its core is the new DataFrames API, which is comparable to the data frames in R and Python (Pandas). The new API allows faster analysis of structured data and makes Spark easier to use for anyone accustomed to working on a single machine. Spark 1.4 has already been announced for June and will primarily add an R interface. "Spark will then support Scala, Python, Java and R - all four dominant big data languages," says Matei Zaharia, CTO at Databricks.

Complementary properties

Most Hadoop-oriented providers do not see Spark as a harmful competitor, but expect a complementary division of tasks. "Hadoop is far superior to all other data warehouse solutions; there is nothing that matches Hadoop when it comes to offline analysis of big data," says Tomer Shiran, Vice President at MapR. And Patrick McFadin from DataStax sums it up as follows: "Hadoop is the standard when it comes to data warehouses and offline data analysis, but Spark with Cassandra is a better alternative for all applications where speed is an important factor, for example real-time analyses."

In addition, Hadoop itself has also seen a number of significant advances. The latest improvements come from UC Berkeley, of all places, where Spark was born. The new file system Tachyon for Hadoop was developed there. It is reported to be 300 times faster than HDFS while remaining fully backward compatible.