Performance Analysis and Caching Mechanism Study for Iterative MapReduce Systems

Abstract

MapReduce has become a dominant framework for big data analysis, and thus there have been significant efforts to implement various data analysis algorithms in MapReduce. Many data analysis algorithms are inherently iterative, repeating the same set of tasks until convergence. To efficiently support iterative algorithms at scale, several variants of Hadoop and new platforms have been proposed and are actively developed in both academia and industry; representative systems include HaLoop, iMapReduce, Twister, and Spark. In this dissertation, we analyze the performance of iterative MapReduce systems and, building on the lessons learned, study a new caching mechanism.

To better understand the distributed processing of iterative algorithms, we identify and categorize the limitations of MapReduce in handling iterative algorithms and then experimentally compare Hadoop with the aforementioned systems using various workloads and metrics. We thoroughly explore the effectiveness of their new caching, communication, and scheduling mechanisms in supporting iterative computation. Overall, Spark achieved the best performance, so we further investigate the limitations of Spark in handling iterative algorithms. According to our experimental results, network I/O overhead was the primary factor affecting system performance. Disk I/O overhead also affected performance, but less significantly than network I/O. Beyond these overheads, garbage collection is also heavily affected by the caching mechanism, especially when processing iterative algorithms.

We therefore conducted experiments to analyze the effect of garbage collection under different caching mechanisms when running iterative algorithms on Spark. The results show that frequent major garbage collection is a major cause of performance degradation, with garbage collection time accounting for up to 46.51% of the total execution time. Comparing the memory cache with the disk cache, their performance gap grows when the variable data alone fits within the eden space but the static and variable data combined do not.

Applying the lessons learned from these experiments, we study a new Spark memory management system called machine-learning tuning (MLTuning), which automatically optimizes the caching mechanism for Spark. MLTuning comprises two phases: a heuristic adjustment phase and a regression tuning phase. In the heuristic adjustment phase, the algorithm adaptively tunes the memory using a heuristic rule; in the regression tuning phase, it optimizes the memory by fitting a regression model to the execution logs collected during the heuristic adjustment phase. Experimental results indicate that MLTuning reduces memory overheads and improves the performance of a Spark job by up to 56% compared with existing Spark memory management systems. We believe that our work will contribute toward the efficient distributed processing of iterative algorithms, including machine-learning and deep-learning algorithms.
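To ground the caching discussion, the following is a minimal PySpark sketch (ours, not the dissertation's code) of an iterative job that persists its loop-invariant input; swapping StorageLevel.MEMORY_ONLY for StorageLevel.DISK_ONLY switches between the memory cache and the disk cache compared in the experiments. The workload and names are illustrative assumptions.

    # Minimal sketch of an iterative Spark job with explicit caching.
    # The workload is illustrative; the dissertation's benchmarks differ.
    from pyspark.sql import SparkSession
    from pyspark import StorageLevel

    spark = SparkSession.builder.appName("iterative-caching-sketch").getOrCreate()
    sc = spark.sparkContext

    # Static (loop-invariant) data: persist once so every iteration reuses it.
    # MEMORY_ONLY is the memory cache; DISK_ONLY would be the disk cache.
    static_data = sc.parallelize(range(1_000_000)).persist(StorageLevel.MEMORY_ONLY)

    estimate = 0.0
    for _ in range(10):  # a fixed iteration count stands in for a convergence test
        current = estimate  # capture the scalar by value for this iteration's closure
        # Variable data is recomputed each iteration from the cached static data.
        estimate = static_data.map(lambda x: (x + current) % 97).mean()

    static_data.unpersist()
    spark.stop()

Because the static data is re-read in every iteration, where it resides (memory, disk, or recomputation) dominates the per-iteration cost, which is the effect the system comparisons above measure.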
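The garbage-collection analysis depends on observing the JVM heap during a run. One standard way to expose this in Spark (a general technique, not the dissertation's tooling) is to pass HotSpot GC-logging flags to the executors and to pin the heap and unified-memory settings under study; the configuration keys below are real Spark options, while the sizes are placeholder values.

    # Sketch: surfacing executor GC activity so major/minor collection time
    # can be attributed to the caching mechanism under test.
    # spark.executor.memory, spark.memory.fraction, and
    # spark.executor.extraJavaOptions are standard Spark configuration keys;
    # the values are placeholders, and the GC flags are Java 8 HotSpot options.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("gc-observation-sketch")
        .config("spark.executor.memory", "4g")    # executor heap size under test
        .config("spark.memory.fraction", "0.6")   # share of heap given to Spark's unified memory
        .config("spark.executor.extraJavaOptions",
                "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
        .getOrCreate()
    )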
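The abstract specifies MLTuning only at the level of its two phases, so the following Python sketch is a schematic of that idea under our own assumptions: a heuristic phase that nudges a single memory parameter while logging (setting, runtime) pairs, and a regression phase that fits a model to those logs and picks the predicted-best setting. run_job_with_fraction, the stopping rule, and the quadratic model are all hypothetical stand-ins, not the thesis's method.

    # Schematic two-phase tuner in the spirit of MLTuning; not the thesis code.
    import numpy as np

    def run_job_with_fraction(fraction: float) -> float:
        # Hypothetical hook: launch one Spark job with
        # spark.memory.fraction = fraction and return its execution time.
        # Replaced by a synthetic cost surface so the sketch runs standalone.
        return (fraction - 0.55) ** 2 * 100 + 30

    # Phase 1: heuristic adjustment -- raise the fraction while runtime improves,
    # logging every observation for the regression phase.
    logs, fraction, best_time = [], 0.3, float("inf")
    while fraction <= 0.8:
        t = run_job_with_fraction(fraction)
        logs.append((fraction, t))
        if t >= best_time:  # assumed heuristic rule: stop once runtime worsens
            break
        best_time, fraction = t, round(fraction + 0.1, 2)

    # Phase 2: regression tuning -- fit a quadratic model to the logged runs
    # and solve for the fraction that minimizes the predicted runtime.
    xs, ys = zip(*logs)
    a, b, c = np.polyfit(xs, ys, deg=2)
    best = float(np.clip(-b / (2 * a), 0.1, 0.9)) if a > 0 else min(logs, key=lambda p: p[1])[0]
    print(f"tuned spark.memory.fraction ~ {best:.2f}")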
Advisors
Lee, Jae-Gil (이재길)
Description
Korea Advanced Institute of Science and Technology (KAIST) : Graduate School of Knowledge Service Engineering
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2020
Identifier
325007
Language
eng
Description

Doctoral thesis (Ph.D.) - Korea Advanced Institute of Science and Technology (KAIST) : Graduate School of Knowledge Service Engineering, 2020.2, [v, 73 p.]

Keywords

MapReduce; Hadoop; HaLoop; Twister; iMapReduce; Spark; Caching mechanism; Iterative algorithm

URI
http://hdl.handle.net/10203/283634
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=908425&flag=dissertation
Appears in Collection
KSE-Theses_Ph.D. (Doctoral Theses)
Files in This Item
There are no files associated with this item.
