Performance Analysis and Caching Mechanism Study for Iterative MapReduce Systems

Abstract

MapReduce has become a dominant framework for big data analysis, and thus there have been significant efforts to implement various data analysis algorithms in MapReduce. Many data analysis algorithms are inherently iterative, repeating the same set of tasks until convergence. To efficiently support iterative algorithms at scale, several variants of Hadoop and new platforms have been proposed and are actively developed in both academia and industry; representative systems include HaLoop, iMapReduce, Twister, and Spark. In this dissertation, we analyze the performance of iterative MapReduce systems and, building on the lessons learned, study a new caching mechanism.

To better understand the distributed processing of iterative algorithms, we identify and categorize the limitations of MapReduce in handling iterative algorithms and then experimentally compare Hadoop with the aforementioned systems using various workloads and metrics. We thoroughly explore the effectiveness of their new caching, communication, and scheduling mechanisms in supporting iterative computation. Overall, Spark achieved the best performance, so we further investigate the limitations of Spark in handling iterative algorithms. According to our experimental results, network I/O overhead was the primary factor affecting system performance. Disk I/O overhead also affected performance, but less significantly than network I/O. Beyond these overheads, garbage collection is also heavily affected by the caching mechanism, especially when processing iterative algorithms.

We therefore conducted experiments to analyze the effect of garbage collection under different caching mechanisms when running iterative algorithms on Spark. The results show that frequent major garbage collection is a major cause of performance degradation, with garbage collection time accounting for up to 46.51% of the total execution time. Comparing the memory cache with the disk cache, their performance gap grows when the variable data alone fits within the eden space but the static and variable data combined do not.

Applying the lessons learned from these experiments, we study a new Spark memory management system called machine-learning tuning (MLTuning), which automatically optimizes the caching mechanism for Spark. MLTuning comprises two phases: a heuristic adjustment phase and a regression tuning phase. In the heuristic adjustment phase, the algorithm adaptively tunes the memory using a heuristic rule; in the regression tuning phase, it optimizes the memory by fitting a regression model to the execution logs collected during the heuristic adjustment phase. Experimental results indicate that MLTuning reduces memory overheads and improves the performance of a Spark job by up to 56% compared with existing Spark memory management systems. We believe that our work will contribute toward the efficient distributed processing of iterative algorithms, including machine-learning and deep-learning algorithms.
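To ground the caching discussion, the following is a minimal PySpark sketch (ours, not the dissertation's code) of an iterative job that persists its loop-invariant input; swapping StorageLevel.MEMORY_ONLY for StorageLevel.DISK_ONLY switches between the memory cache and the disk cache compared in the experiments. The workload and names are illustrative assumptions.

    # Minimal sketch of an iterative Spark job with explicit caching.
    # The workload is illustrative; the dissertation's benchmarks differ.
    from pyspark.sql import SparkSession
    from pyspark import StorageLevel

    spark = SparkSession.builder.appName("iterative-caching-sketch").getOrCreate()
    sc = spark.sparkContext

    # Static (loop-invariant) data: persist once so every iteration reuses it.
    # MEMORY_ONLY is the memory cache; DISK_ONLY would be the disk cache.
    static_data = sc.parallelize(range(1_000_000)).persist(StorageLevel.MEMORY_ONLY)

    estimate = 0.0
    for _ in range(10):  # a fixed iteration count stands in for a convergence test
        current = estimate  # capture the scalar by value for this iteration's closure
        # Variable data is recomputed each iteration from the cached static data.
        estimate = static_data.map(lambda x: (x + current) % 97).mean()

    static_data.unpersist()
    spark.stop()

Because the static data is re-read in every iteration, where it resides (memory, disk, or recomputation) dominates the per-iteration cost, which is the effect the system comparisons above measure.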
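The garbage-collection analysis depends on observing the JVM heap during a run. One standard way to expose this in Spark (a general technique, not the dissertation's tooling) is to pass HotSpot GC-logging flags to the executors and to pin the heap and unified-memory settings under study; the configuration keys below are real Spark options, while the sizes are placeholder values.

    # Sketch: surfacing executor GC activity so major/minor collection time
    # can be attributed to the caching mechanism under test.
    # spark.executor.memory, spark.memory.fraction, and
    # spark.executor.extraJavaOptions are standard Spark configuration keys;
    # the values are placeholders, and the GC flags are Java 8 HotSpot options.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("gc-observation-sketch")
        .config("spark.executor.memory", "4g")    # executor heap size under test
        .config("spark.memory.fraction", "0.6")   # share of heap given to Spark's unified memory
        .config("spark.executor.extraJavaOptions",
                "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
        .getOrCreate()
    )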
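The abstract specifies MLTuning only at the level of its two phases, so the following Python sketch is a schematic of that idea under our own assumptions: a heuristic phase that nudges a single memory parameter while logging (setting, runtime) pairs, and a regression phase that fits a model to those logs and picks the predicted-best setting. run_job_with_fraction, the stopping rule, and the quadratic model are all hypothetical stand-ins, not the thesis's method.

    # Schematic two-phase tuner in the spirit of MLTuning; not the thesis code.
    import numpy as np

    def run_job_with_fraction(fraction: float) -> float:
        # Hypothetical hook: launch one Spark job with
        # spark.memory.fraction = fraction and return its execution time.
        # Replaced by a synthetic cost surface so the sketch runs standalone.
        return (fraction - 0.55) ** 2 * 100 + 30

    # Phase 1: heuristic adjustment -- raise the fraction while runtime improves,
    # logging every observation for the regression phase.
    logs, fraction, best_time = [], 0.3, float("inf")
    while fraction <= 0.8:
        t = run_job_with_fraction(fraction)
        logs.append((fraction, t))
        if t >= best_time:  # assumed heuristic rule: stop once runtime worsens
            break
        best_time, fraction = t, round(fraction + 0.1, 2)

    # Phase 2: regression tuning -- fit a quadratic model to the logged runs
    # and solve for the fraction that minimizes the predicted runtime.
    xs, ys = zip(*logs)
    a, b, c = np.polyfit(xs, ys, deg=2)
    best = float(np.clip(-b / (2 * a), 0.1, 0.9)) if a > 0 else min(logs, key=lambda p: p[1])[0]
    print(f"tuned spark.memory.fraction ~ {best:.2f}")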
Advisors
Lee, Jae-Gil (이재길)
Description
Korea Advanced Institute of Science and Technology (KAIST) : Graduate School of Knowledge Service Engineering
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2020
Identifier
325007
Language
eng
Description

Doctoral thesis (Ph.D.) - Korea Advanced Institute of Science and Technology (KAIST) : Graduate School of Knowledge Service Engineering, 2020.2, [v, 73 p.]

Keywords

MapReduce; Hadoop; HaLoop; Twister; iMapReduce; Spark; Caching mechanism; Iterative algorithm

URI
http://hdl.handle.net/10203/283634
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=908425&flag=dissertation
Appears in Collection
KSE-Theses_Ph.D. (Doctoral Theses)
Files in This Item
There are no files associated with this item.
