2019 Fault Tolerant Distributed Algorithms

Academic unit or major
Graduate major in Computer Science
Instructor(s)
Bonnet Francois, Defago Xavier
Course component(s)
Lecture
Day/Period(Room No.)
Mon7-8(W831)  Thr7-8(W831)  
Group
-
Course number
CSC.T527
Credits
2
Academic year
2019
Offered quarter
3Q
Syllabus updated
2019/3/18
Lecture notes updated
-
Language used
English

Course description and aims

The course aims to develop a thorough understanding of fault tolerance in distributed systems. By their nature, distributed systems are vulnerable to failures unless they are designed properly. At any time, a subset of the processes in a distributed system may fail by crashing, or may be compromised and behave arbitrarily or maliciously (e.g., Byzantine failures). It is hence essential to design distributed systems and applications so that they can adequately cope with failures. The lectures focus on how to deal with these issues.

Student learning outcomes

By studying relevant methods and algorithms in detail, the student will acquire a deep understanding of the issues at hand and of the basic mechanisms for dealing with such failures. Although the course focuses on the theory of such systems, it also systematically draws links with practical applications, making it valuable to both theoreticians and practitioners.

Keywords

Distributed algorithms, message-passing, synchrony models, agreement, replication, fault-tolerance, Byzantine agreement, blockchain, self-stabilization, probabilistic algorithms

Competencies that will be developed

Intercultural skills, Communication skills, Specialist skills, Critical thinking skills, Practical and/or problem-solving skills

Class flow

Typical classes will alternate between slide-based presentations, interactive discussions, and class exercises. Active contribution to class discussions is strongly encouraged.

Course schedule/Required learning

Class | Course schedule | Required learning
Class 1 | Introduction, overview, reminder | Revision of basic concepts of distributed algorithms (models, synchrony, causality)
Class 2 | Models, faults, formalism, definition | Instructed in class.
Class 3 | Synchronous consensus | Instructed in class.
Class 4 | State-machine replication, replication techniques | Instructed in class.
Class 5 | Group membership, distributed transactions, atomic commit | Instructed in class.
Class 6 | Asynchronous consensus, FLP impossibility proof | Instructed in class.
Class 7 | Unreliable failure detectors | Instructed in class.
Class 8 | Eventual leader election, Paxos | Instructed in class.
Class 9 | Randomized consensus | Instructed in class.
Class 10 | Byzantine consensus | Instructed in class.
Class 11 | Byzantine randomized consensus | Instructed in class.
Class 12 | Self-stabilization (definition, requirements, mutual exclusion, proof) | Instructed in class.
Class 13 | Self-stabilization (spanning-tree, distributed reset, composition, ...) | Instructed in class.
Class 14 | Distributed ledger and blockchain mechanisms | Instructed in class.
Class 15 | Q&A + final test | Instructed in class.

Textbook(s)

Christian Cachin, Rachid Guerraoui, Luís Rodrigues, "Introduction to Reliable and Secure Distributed Programming," Springer, 2011. https://www.springer.com/jp/book/9783642152597

Reference books, course materials, etc.

Reference Books:
1. Shlomi Dolev, "Self-Stabilization," MIT Press, 2000. https://mitpress.mit.edu/books/self-stabilization
2. Michel Raynal, "Communication and agreement abstractions for fault-tolerant asynchronous distributed systems," Morgan & Claypool, 2010. https://www.morganclaypool.com/doi/abs/10.2200/S00236ED1V01Y201004DCT002
3. Michel Raynal, "Fault-tolerant Agreement in Synchronous Message-passing Systems," Morgan & Claypool, 2010. https://www.morganclaypool.com/doi/abs/10.2200/S00294ED1V01Y201009DCT003
4. Wan Fokkink, "Distributed algorithms: an intuitive approach," MIT Press, 2013.
5. Vijay K. Garg, "Elements of distributed computing," IEEE, 2002.
6. Gerard Tel, "Introduction to distributed algorithms (2nd ed.)," Cambridge Univ. Press, 2000.
7. Ajay Kshemkalyani, Mukesh Singhal, "Distributed computing: principles, algorithms, and systems," Cambridge Univ. Press, 2011.

Course materials:
Slide copies, additional article copies, etc. are distributed during lectures or made available for download from the course webpage.

Assessment criteria and methods

Homework assignments and contribution to class discussion (30%), reports (30%), and examination (40%).

Examination will assess the understanding of basic concepts of fault-tolerant distributed algorithms (problems, algorithms, and methodology) and reasoning (correctness and complexity).

Related courses

  • CSC.T438 : Distributed Algorithms
  • MCS.T406 : Distributed Systems
  • CSC.T524 : Dependable Computing
  • MCS.T213 : Introduction to Algorithms and Data Structures

Prerequisites (i.e., required knowledge, skills, courses, etc.)

Required knowledge:
Prior to taking this course, the student must have acquired,
through lectures or self-study, background knowledge of the basic concepts
of fault-free distributed algorithms, such as those taught in the following
courses:
- CSC.T438 Distributed Algorithms; or
- MCS.T406 (CSC.T406) Distributed Systems

Other

Related course:
In the field of fault-tolerant and dependable computing systems, this course is complementary to:
- CSC.T524 Dependable Computing
