Lightweight fault-tolerant schemes for software distributed shared memory (소프트웨어 분산 공유 메모리를 위한 저비용의 고장 허용 기법)

As software Distributed Shared Memory (DSM) becomes attractive on large systems, attention shifts toward improving system reliability. A fault-tolerant DSM must achieve high reliability while preserving high performance during failure-free execution. This thesis proposes novel fault-tolerant schemes for efficient software DSMs. A common approach to fault-tolerant software DSMs is to take checkpoints with message logging. We first propose an efficient logging scheme, called remote logging, for a home-based software DSM. Remote logging stores the data indispensable for recovery in the volatile memory of a remote node. It tolerates multiple failures as long as the backup nodes of the failed nodes are alive, which substantially increases the reliability of software DSMs. In addition, the logging overhead can be moderated by a high-speed system area network and the user-level DMA operations supported by modern communication protocols. Thus, remote logging incurs much lower failure-free overhead than traditional stable logging, which flushes logs to local disk at each synchronization point. To further improve failure-free performance, we propose a lightweight checkpointing scheme dedicated to software DSMs. In this scheme, each node takes no checkpoint of shared memory and saves only its execution state and non-shared data. When a node fails, it regenerates its pages from the remote copies on live nodes. To reconstruct pages efficiently, we extend remote logging and introduce an XOR-diffing technique. The diff logs, which are created by XOR operations during failure-free execution, can be applied to any version of a remote copy, either backward or forward, for recovery. Experimental results show that our new approach achieves better performance than traditional independent checkpointing. The performance improvement comes ...
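The property the abstract attributes to XOR-based diff logs, that one log can move a page copy either forward or backward between versions, follows from XOR being its own inverse. The Python sketch below only illustrates that property and is not the thesis's implementation: the function names `xor_diff` and `apply_diff`, the 8-byte page size, and the sample page contents are all hypothetical.

```python
def xor_diff(old: bytes, new: bytes) -> bytes:
    """Return the byte-wise XOR of two equally sized page versions."""
    if len(old) != len(new):
        raise ValueError("page versions must have the same length")
    return bytes(a ^ b for a, b in zip(old, new))


def apply_diff(page: bytes, diff: bytes) -> bytes:
    """Apply an XOR diff to a page.

    Because (a ^ b) ^ a == b and (a ^ b) ^ b == a, the same diff turns the
    older version into the newer one and the newer one into the older one.
    """
    return xor_diff(page, diff)


# Three successive versions of a hypothetical 8-byte page.
v0 = bytes([0, 0, 0, 0, 0, 0, 0, 0])
v1 = bytes([1, 2, 0, 0, 0, 0, 0, 0])
v2 = bytes([1, 2, 0, 0, 7, 7, 0, 0])

# Diff logs created as the page changes during failure-free execution.
d01 = xor_diff(v0, v1)
d12 = xor_diff(v1, v2)

# Forward: an old copy (v0) is brought up to v2.
forward = apply_diff(apply_diff(v0, d01), d12)

# Backward: a newer copy (v2) is rolled back to v0 with the same logs.
backward = apply_diff(apply_diff(v2, d12), d01)
```

In this sketch a recovering node could start from whichever version a live node happens to hold and apply the logged diffs in the appropriate order; how the thesis selects versions and orders the logs is not stated in the abstract.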
Advisors
Maeng, Seung-Ryoul (맹승렬)
Description
Korea Advanced Institute of Science and Technology (KAIST) : Computer Science
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2005
Identifier
249460/325007  / 020005121
Language
eng
Description

Doctoral thesis (Ph.D.) - Korea Advanced Institute of Science and Technology (KAIST) : Computer Science, 2005.8, [ viii, 69 p. ]

Keywords

Software Distributed Shared Memory; Fault tolerance; Parallel system

URI
http://hdl.handle.net/10203/33207
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=249460&flag=dissertation
Appears in Collection
CS-Theses_Ph.D.(박사논문)
