
News

Office Hours Timings and Location

Written on 28.04.25 by Vaastav Anand

Office Hours Timing: Mondays 2pm - 3pm

Location: Room 005, E1 5 (all Mondays except 12th May, 2025)

Location on May 12th, 2025: Room 029, E1 5

 

Lecture 2 slides posted

Written on 25.04.25 by Vaastav Anand

Lecture 2 slides are now posted on the CMS.

Assignment 1 released

Written on 24.04.25 by Vaastav Anand

Assignment 1 is now released at: https://gitlab.cs.uni-saarland.de/os/cldrel-25ss/assignments/-/tree/assn1

Each student should have received an invite to join their own fork of the assignments repository. 

If you did not get an invitation to your own fork of the assignments repo, we were unable to find your username in the GitLab system. Please ensure that you have an active GitLab account and then contact the instructors with your account details to get access to your own fork.

Assignment Due Date: 10th May, 2025, 5pm CEST.

Lecture 1 Slides posted

Written on 14.04.25 by Vaastav Anand

Lecture 1 slides are now posted on the CMS website.

Reliability in Modern Cloud Systems

Cloud systems power a large fraction of the computing world today. Ensuring that these systems are correct and performant remains a key challenge that continues to bedevil developers. In this seminar, we will explore the different forms of reliability in modern cloud systems and learn about state-of-the-art strategies for mitigating incidents and diagnosing issues in these systems.

Pre-requisites: Programming 2, Software Engineering Lab (Praktikum)

Recommended: Distributed Systems

Places: 20

Kickoff Meeting: 14.04.25, Monday 2:15pm-3:45pm

Lecture Time (23.04.25 onwards): Wednesdays, 2:15pm-3:45pm

Lecture Room: 005, E1 5

Office Hours (28.04.25 onwards): Mondays, 2pm-3pm

Office Hours Room: 005, E1 5 on all days except May 12th (Room 029)

Format

Each lecture will be divided into 2 parts:

- Lecture Part: In this part, the instructors will give a lecture on a specific topic in reliability.

- Discussion Part: In this part we will discuss the assigned reading and the previous week's lecture.

Assignments

All assignments will be based on Blueprint, a toolchain for generating microservice implementations and for exploring the design space of microservices.

Grading

- Assignment 1 - Implementing a basic Microservice Application using Blueprint: 10%

- Assignment 2 - Adding Observability to the Application and collecting traces from a workload: 20%

- Assignment 3 - Reproducing a Retry Storm: 25%

- Assignment 4 - Open Ended Project: 40%

- Participation in Discussion: 5%

Course Schedule

| Date | Lecture Details | Readings | Assignment | Slides |
| --- | --- | --- | --- | --- |
| 09.04.25 | No seminar | N/A | | |
| 14.04.25 | Part 1: Kickoff Meeting; Part 2: From Monoliths to Microservices | | | Kickoff Logistics, Lecture 1 |
| 23.04.25 | Part 1: Paper Discussion; Part 2: The Tail at Scale | Blueprint: A Toolchain for Highly Reconfigurable Microservices | Assignment 1 released | Lecture 2 |
| 30.04.25 | Part 1: Paper Discussion; Part 2: Reliability Basics | Tales of The Tail: Past and the Future | | |
| 07.05.25 | Part 1: Paper Discussion; Part 2: The Pillars of Observability | If At First You Don’t Succeed, Try, Try, Again...? Insights and LLM-informed Tooling for Detecting Retry Bugs in Software Systems | Assignment 1 due; Assignment 2 released | |
| 14.05.25 | No seminar | | | |
| 21.05.25 | Part 1: Discussion; Part 2: Of Failures and Incidents | Dapper, a Large-Scale Distributed Systems Tracing Infrastructure | | |
| 28.05.25 | Part 1: Discussion; Part 2: Cross System Interaction Failures | What bugs cause production cloud incidents? | Assignment 2 due; Assignment 3 released | |
| 04.06.25 | Part 1: Discussion; Part 2: Dealing with Metastability (Load Shedding Techniques) | Fail through the Cracks: Cross-System Interaction Failures in Modern Cloud Systems | | |
| 11.06.25 | Part 1: Discussion; Part 2: Root Cause Analysis | Metastable Failures in the Wild | | |
| 18.06.25 | Part 1: Discussion; Part 2: Testing & Formal Methods | How to Fight Production Incidents? An Empirical Study on a Large-scale Cloud Service | Assignment 3 due; Assignment 4 released | |
| 25.06.25 | Part 1: Discussion; Part 2: Predicting and handling workloads | Executing microservice applications on serverless, correctly; Building Reliable Cloud Services Using P# (Experience Report) | | |
| 02.07.25 | Part 1: Discussion; Part 2: Hardware Reliability | Resource Central: Understanding and Predicting Workloads for Improved Resource Management in Large Cloud Platforms | | |
| 09.07.25 | Part 1: Data Center Design; Part 2: Discussion | Characterizing Cloud Computing Hardware Reliability; RAS: Continuously Optimized Region-Wide Datacenter Resource Allocation | | |
| 16.07.25 | Demos and Presentations | | Assignment 4 due | |
