Ceph Level 2 – Operations and Recovery

Ceph is an open-source distributed storage platform created to deliver scalable, software-defined block, object and file storage on commodity hardware. Originally developed by Sage Weil and now stewarded through the wider upstream Ceph community with strong enterprise backing, it was built around goals such as resilience, horizontal scale, self-healing behaviour and removal of traditional storage bottlenecks through intelligent data distribution.


This module builds on that foundation and moves learners from concepts into operations: CLI usage, health interpretation, data placement behaviour, scrubbing, and common recovery tasks. It is designed for engineers who need to work confidently with live Ceph environments, understand what health output really means, and follow structured recovery workflows without making a stressed cluster worse through rushed changes.

Course purpose

Move learners from architectural understanding into practical Ceph operations, with a focus on safe CLI work, health interpretation, data placement awareness, scrubbing, and structured recovery handling.

Suggested duration

  • 1.5–2 days

Target audience

  • operations engineers
  • storage support engineers
  • Proxmox/Ceph administrators
  • on-call staff responsible for cluster health

Prerequisites

  • completion of Ceph Level 1 or equivalent understanding
  • comfort with Linux shell and distributed system basics

Learning outcomes

  • use core Ceph command-line tools confidently
  • explain replication and erasure coding trade-offs
  • interpret how data is physically distributed via PGs and CRUSH rules
  • understand common cluster health states, especially unhealthy conditions
  • perform or supervise scrubbing operations
  • follow common recovery procedures safely

Detailed module structure

Unit 1: Core command-line tools

Topics:

  • Ceph CLI basics
  • reading cluster state
  • querying health, OSDs, pools and PGs
  • safe operational habits when using the CLI
  • distinguishing read-only inspection from state-changing commands

Lab ideas (a command sketch follows the list):

  • inspect cluster health
  • list pools and OSD topology
  • inspect PG state summaries
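
A minimal sketch of the read-only inspection commands these labs lean on, assuming a standard ceph CLI with admin keyring access; every command below only reports state and changes nothing:

  # cluster-wide status and health summary
  ceph -s
  ceph health detail

  # pools, OSD topology and capacity usage
  ceph osd pool ls detail
  ceph osd tree
  ceph df

  # placement group counts and state summary
  ceph pg stat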

Unit 2: Replication, erasure coding and data protection models

Topics:

  • replicated pools
  • erasure-coded pools
  • durability vs performance vs capacity trade-offs
  • operational implications of each model
  • when not to choose erasure coding

Lab ideas (a command sketch follows the list):

  • compare a replicated pool design and an erasure-coded design
  • assess suitability for VM storage, backups or object workloads
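
A hedged sketch of how the two designs could be built side by side in a disposable lab cluster; the pool names, PG counts and the k=4/m=2 profile are illustrative choices, not recommendations:

  # replicated pool: every object is stored as 3 full copies (3x raw capacity)
  ceph osd pool create lab-repl 32 32 replicated
  ceph osd pool set lab-repl size 3

  # erasure-coded pool: each object split into k=4 data + m=2 coding chunks (1.5x raw capacity)
  ceph osd erasure-code-profile set lab-ec-profile k=4 m=2
  ceph osd pool create lab-ec 32 32 erasure lab-ec-profile

  # compare the resulting protection settings
  ceph osd pool ls detail
  ceph osd erasure-code-profile get lab-ec-profile

Replicated pools generally give lower latency and simpler recovery; erasure coding trades CPU and recovery effort for capacity efficiency, which is why it tends to suit backup and object workloads better than latency-sensitive VM disks.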

Unit 3: CRUSH map and physical data location

Topics:

  • reading CRUSH placement logic
  • how failure domains influence data placement
  • how PGs map data onto OSDs and physical devices
  • reasoning about where data likely lives
  • why PG movement occurs during topology changes

Lab ideas (a command sketch follows the list):

  • review CRUSH rules
  • predict placement effects of adding or removing an OSD/host
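
A sketch of commands that support these labs; the pool and object names in the placement query are purely illustrative:

  # list CRUSH rules and inspect which failure domain each one selects
  ceph osd crush rule ls
  ceph osd crush rule dump

  # view the hierarchy (root / host / osd) the rules draw candidates from
  ceph osd tree

  # export and decompile the full CRUSH map for offline review
  ceph osd getcrushmap -o /tmp/crushmap.bin
  crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt

  # ask where a specific object would land (shows its PG and acting OSD set)
  ceph osd map lab-repl my-test-object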

Unit 4: Cluster health states and deep explanation of unhealthy conditions

Topics:

  • overall health states
  • degraded vs undersized vs misplaced data
  • peering issues
  • stale states
  • near-full and full conditions
  • slow operations
  • monitor quorum issues
  • OSD down/out scenarios
  • when “unhealthy” is expected temporarily vs when it is a serious incident

Lab ideas (a command sketch follows the list):

  • analyse example ceph -s and health detail outputs
  • classify the severity of several unhealthy states
  • decide whether to pause, observe or intervene
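
A sketch of read-only commands that help classify an unhealthy state before deciding whether to pause, observe or intervene; the grep filter is only a rough illustration:

  # the overall picture first
  ceph -s
  ceph health detail

  # which PGs are not active+clean, and why (degraded, undersized, stale, peering)
  ceph pg dump_stuck
  ceph pg dump pgs_brief | grep -v 'active+clean'

  # monitor quorum and OSD up/in state
  ceph quorum_status
  ceph osd stat
  ceph osd tree

As a rule of thumb, a handful of degraded or misplaced PGs during planned maintenance is expected and self-heals; lost monitor quorum, full OSDs or PGs stuck inactive are the conditions that call for immediate attention.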

Unit 5: Scrubbing and data consistency operations

Topics:

  • what scrubbing is
  • regular scrub vs deep scrub
  • scheduling and performance considerations
  • when to trigger manual scrub
  • when not to force manual operations during recovery
  • interpreting scrub-related warnings

Lab ideas (a command sketch follows the list):

  • inspect scrub settings
  • trigger a manual scrub in a safe lab context
  • interpret follow-up health messages
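
A sketch of scrub-related commands for a safe lab; the option names are standard Ceph settings, but the PG id 2.1a is illustrative, and manual scrubs are best triggered only on a cluster that is otherwise healthy and not recovering:

  # inspect a few scrub scheduling settings
  ceph config get osd osd_scrub_begin_hour
  ceph config get osd osd_scrub_end_hour
  ceph config get osd osd_deep_scrub_interval

  # trigger a manual scrub or deep scrub of a single PG
  ceph pg scrub 2.1a
  ceph pg deep-scrub 2.1a

  # follow up on the result
  ceph -s
  ceph health detail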

Unit 6: Common recovery processes

Topics:

  • failed OSD replacement workflow
  • rebalancing expectations
  • recovery throttling concepts
  • handling near-full conditions
  • dealing with stuck PGs
  • restoring health after host loss
  • operational sequencing: observe, confirm, act, validate
  • documenting recovery actions for handover and post-incident review

Lab ideas (a command sketch follows the list):

  • simulate OSD loss
  • observe recovery state transitions
  • walk through a controlled replacement scenario
  • analyse a “cluster unhealthy after maintenance” incident
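
A hedged outline of the observe, confirm, act, validate sequence for a failed-OSD lab; the OSD id 7 is illustrative, and the exact replacement steps depend on the deployment tooling (cephadm, Proxmox, manual), so treat this as a skeleton rather than a finished runbook:

  # observe: what is down, and which PGs are affected?
  ceph osd tree
  ceph health detail

  # confirm: mark the failed OSD out so its data rebuilds onto the remaining OSDs
  ceph osd out 7

  # act: remove the failed OSD once it is certain not to return
  ceph osd purge 7 --yes-i-really-mean-it

  # validate: watch recovery drain down and confirm the cluster returns to HEALTH_OK
  ceph -w
  ceph -s

  # for planned maintenance (not failures), noout avoids unnecessary rebalancing
  ceph osd set noout
  ceph osd unset noout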

Assessment

Practical troubleshooting

Given cluster health output, explain the state, likely cause and next safe action.

Recovery runbook exercise

Document the response to a failed OSD and degraded PGs.

Key takeaways

  • operational confidence
  • better health interpretation
  • safer recovery decisions

Built for engineers responsible for live Ceph clusters and real incident response.