Frameworks for Big Data analytics have become essential for large-scale data processing. Hadoop MapReduce is one of the most prominent programming models in distributed systems, but it has several disadvantages for iterative algorithms, e.g., low performance caused by disk I/O cost. To overcome this inefficiency, in-memory frameworks such as Apache Flink and Apache Spark have been introduced. Spark-SQL, developed by the Spark community, is now very popular in the field. These systems are actively developed and widely used in industry.
In this thesis, we experimentally compare Flink and Spark-SQL using BigBench, a comprehensive benchmark suite for Big Data systems. Of the 30 BigBench queries, 21 are ported from Hive QL to Flink. Both systems are evaluated on these 21 queries to measure total elapsed time and data scalability across varying scale factors. In the data scalability experiment, the elapsed time of Spark-SQL stays nearly constant for more than half of the queries as the dataset grows, whereas that of Flink increases linearly. In the total elapsed time experiment, Spark-SQL is slower than Flink on Scale Factor 100 for most queries, but on Scale Factor 300 the elapsed time of Spark-SQL is similar to that of Flink for more than half of the queries. We analyze the behavior of Flink and Spark-SQL in detail on a per-query basis and point out the trade-offs of using each system in particular situations.