News
Office Hours Timings and Location
Written on 28.04.25 by Vaastav Anand
Office Hours Timing: Mondays 2pm - 3pm
Location: Room 005, E1 5 (all Mondays except 12th May, 2025)
Location on May 12th, 2025: Room 029, E1 5
Lecture 2 slides posted
Written on 25.04.25 by Vaastav Anand
Lecture 2 slides are now posted on CMS.
Assignment 1 released
Written on 24.04.25 by Vaastav Anand
Assignment 1 is now released at: https://gitlab.cs.uni-saarland.de/os/cldrel-25ss/assignments/-/tree/assn1
Each student should have received an invite to join their own fork of the assignments repository. If you did not get an invitation to your own fork of the assignments repo, it means we were unable to find your username in the GitLab system. Please ensure that you have an active GitLab account and then contact the instructors with your account details to get access to your own fork.
Assignment Due Date: 10th May, 2025, 5pm CEST.
Lecture 1 Slides posted
Written on 14.04.25 by Vaastav Anand
Lecture 1 Slides are now posted on the CMS website.
Reliability in Modern Cloud Systems
Cloud systems power a large fraction of the computing world today. Ensuring that these systems are correct and performant remains a key challenge for developers. In this seminar, we will explore the different forms of reliability in modern cloud systems and learn about state-of-the-art strategies for mitigating incidents and understanding issues in these systems.
Pre-requisites: Programming 2, Software Engineering Lab (Praktikum)
Recommended: Distributed Systems
Places: 20
Kickoff Meeting: 14.04.25, Monday 2:15pm-3:45pm
Lecture Time (23.04.25 onwards): Wednesdays, 2:15pm-3:45pm
Lecture Room: 005, E1 5
Office Hours (28.04.25 onwards): Mondays, 2pm-3pm
Office Hours Room: 005, E1 5 on all days except May 12th (Room 029)
Format
Each lecture will be divided into 2 parts:
- Lecture Part: In this part, the instructors will give a lecture on a specific topic in reliability.
- Discussion Part: In this part we will discuss the assigned reading and the previous week's lecture.
Assignments
All assignments will be based on Blueprint, a toolchain for generating microservice implementations and for exploring the design space of microservices.
Grading
- Assignment 1 - Implementing a basic Microservice Application using Blueprint: 10%
- Assignment 2 - Adding Observability to the Application and collecting traces from a workload: 20%
- Assignment 3 - Reproducing a Retry Storm: 25%
- Assignment 4 - Open Ended Project: 40%
- Participation in Discussion: 5%
Course Schedule
Date | Lecture Details | Readings | Assignment | Slides
09.04.25 | ** No seminar ** | N/A | |
14.04.25 | Part 1: Kickoff Meeting; Part 2: From Monoliths to Microservices | | | Kickoff Logistics, Lecture 1
23.04.25 | Part 1: Paper Discussion; Part 2: The Tail at Scale | Blueprint: A Toolchain for Highly Reconfigurable Microservices | | Lecture 2
30.04.25 | Part 1: Paper Discussion; Part 2: Reliability Basics | | |
07.05.25 | Part 1: Paper Discussion; Part 2: The Pillars of Observability | | Assignment 1 due; Assignment 2 released |
14.05.25 | ** No seminar ** | | |
21.05.25 | Part 1: Discussion; Part 2: Of Failures and Incidents | Dapper, a Large-Scale Distributed Systems Tracing Infrastructure | |
28.05.25 | Part 1: Discussion; Part 2: Cross System Interaction Failures | | Assignment 2 due; Assignment 3 released |
04.06.25 | Part 1: Discussion; Part 2: Dealing with Metastability (Load Shedding Techniques) | Fail through the Cracks: Cross-System Interaction Failures in Modern Cloud Systems | |
11.06.25 | Part 1: Discussion; Part 2: Root Cause Analysis | | |
18.06.25 | Part 1: Discussion; Part 2: Testing & Formal Methods | How to Fight Production Incidents? An Empirical Study on a Large-scale Cloud Service | Assignment 3 due; Assignment 4 released |
25.06.25 | Part 1: Discussion; Part 2: Predicting and handling workloads | Executing microservice applications on serverless, correctly; Building Reliable Cloud Services Using P# (Experience Report) | |
02.07.25 | Part 1: Discussion; Part 2: Hardware Reliability | | |
09.07.25 | Part 1: Data Center Design; Part 2: Discussion | Characterizing Cloud Computing Hardware Reliability; RAS: Continuously Optimized Region-Wide Datacenter Resource Allocation | |
16.07.25 | Demos and Presentations | | Assignment 4 due |