Ceph Level 2 – Operations and Recovery

Ceph is an open-source distributed storage platform created to deliver scalable, software-defined block, object and file storage on commodity hardware. Originally developed by Sage Weil and now stewarded through the wider upstream Ceph community with strong enterprise backing, it was built around goals such as resilience, horizontal scale, self-healing behaviour and removal of traditional storage bottlenecks through intelligent data distribution.


This module builds on that foundation and moves learners from concepts into operations: CLI usage, health interpretation, data placement behaviour, scrubbing, and common recovery tasks. It is designed for engineers who need to work confidently with live Ceph environments, understand what health output really means, and follow structured recovery workflows without making a stressed cluster worse through rushed changes.

Course purpose

Move learners from architectural understanding into practical Ceph operations, with a focus on safe CLI work, health interpretation, data placement awareness, scrubbing, and structured recovery handling.

Suggested duration

  • 1.5–2 days

Target audience

  • operations engineers
  • storage support engineers
  • Proxmox/Ceph administrators
  • on-call staff responsible for cluster health

Prerequisites

  • completion of Ceph Level 1 or equivalent understanding
  • comfort with Linux shell and distributed system basics

Learning outcomes

  • use core Ceph command-line tools confidently
  • explain replication and erasure coding trade-offs
  • interpret how data is physically distributed via PGs and CRUSH rules
  • understand common cluster health states, especially unhealthy conditions
  • perform or supervise scrubbing operations
  • follow common recovery procedures safely

Detailed module structure

Unit 1: Core command-line tools

Topics:

  • Ceph CLI basics
  • reading cluster state
  • querying health, OSDs, pools and PGs
  • safe operational habits when using the CLI
  • distinguishing read-only inspection from state-changing commands

Lab ideas (a command sketch follows the list):

  • inspect cluster health
  • list pools and OSD topology
  • inspect PG state summaries
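
A minimal sketch of the read-only inspection commands these labs lean on, assuming a standard ceph CLI with admin keyring access; every command below only reports state and changes nothing:

  # cluster-wide status and health summary
  ceph -s
  ceph health detail

  # pools, OSD topology and capacity usage
  ceph osd pool ls detail
  ceph osd tree
  ceph df

  # placement group counts and state summary
  ceph pg stat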

Unit 2: Replication, erasure coding and data protection models

Topics:

  • replicated pools
  • erasure-coded pools
  • durability vs performance vs capacity trade-offs
  • operational implications of each model
  • when not to choose erasure coding

Lab ideas (a command sketch follows the list):

  • compare a replicated pool design and an erasure-coded design
  • assess suitability for VM storage, backups or object workloads
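
A hedged sketch of how the two designs could be built side by side in a disposable lab cluster; the pool names, PG counts and the k=4/m=2 profile are illustrative choices, not recommendations:

  # replicated pool: every object is stored as 3 full copies (3x raw capacity)
  ceph osd pool create lab-repl 32 32 replicated
  ceph osd pool set lab-repl size 3

  # erasure-coded pool: each object split into k=4 data + m=2 coding chunks (1.5x raw capacity)
  ceph osd erasure-code-profile set lab-ec-profile k=4 m=2
  ceph osd pool create lab-ec 32 32 erasure lab-ec-profile

  # compare the resulting protection settings
  ceph osd pool ls detail
  ceph osd erasure-code-profile get lab-ec-profile

Replicated pools generally give lower latency and simpler recovery; erasure coding trades CPU and recovery effort for capacity efficiency, which is why it tends to suit backup and object workloads better than latency-sensitive VM disks.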

Unit 3: CRUSH map and physical data location

Topics:

  • reading CRUSH placement logic
  • how failure domains influence data placement
  • how PGs map data onto OSDs and physical devices
  • reasoning about where data likely lives
  • why PG movement occurs during topology changes

Lab ideas (a command sketch follows the list):

  • review CRUSH rules
  • predict placement effects of adding or removing an OSD/host
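
A sketch of commands that support these labs; the pool and object names in the placement query are purely illustrative:

  # list CRUSH rules and inspect which failure domain each one selects
  ceph osd crush rule ls
  ceph osd crush rule dump

  # view the hierarchy (root / host / osd) the rules draw candidates from
  ceph osd tree

  # export and decompile the full CRUSH map for offline review
  ceph osd getcrushmap -o /tmp/crushmap.bin
  crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt

  # ask where a specific object would land (shows its PG and acting OSD set)
  ceph osd map lab-repl my-test-object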

Unit 4: Cluster health states and deep explanation of unhealthy conditions

Topics:

  • overall health states
  • degraded vs undersized vs misplaced data
  • peering issues
  • stale states
  • near-full and full conditions
  • slow operations
  • monitor quorum issues
  • OSD down/out scenarios
  • when “unhealthy” is expected temporarily vs when it is a serious incident

Lab ideas (a command sketch follows the list):

  • analyse example ceph -s and health detail outputs
  • classify the severity of several unhealthy states
  • decide whether to pause, observe or intervene
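
A sketch of read-only commands that help classify an unhealthy state before deciding whether to pause, observe or intervene; the grep filter is only a rough illustration:

  # the overall picture first
  ceph -s
  ceph health detail

  # which PGs are not active+clean, and why (degraded, undersized, stale, peering)
  ceph pg dump_stuck
  ceph pg dump pgs_brief | grep -v 'active+clean'

  # monitor quorum and OSD up/in state
  ceph quorum_status
  ceph osd stat
  ceph osd tree

As a rule of thumb, a handful of degraded or misplaced PGs during planned maintenance is expected and self-heals; lost monitor quorum, full OSDs or PGs stuck inactive are the conditions that call for immediate attention.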

Unit 5: Scrubbing and data consistency operations

Topics:

  • what scrubbing is
  • regular scrub vs deep scrub
  • scheduling and performance considerations
  • when to trigger manual scrub
  • when not to force manual operations during recovery
  • interpreting scrub-related warnings

Lab ideas (a command sketch follows the list):

  • inspect scrub settings
  • trigger a manual scrub in a safe lab context
  • interpret follow-up health messages
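
A sketch of scrub-related commands for a safe lab; the option names are standard Ceph settings, but the PG id 2.1a is illustrative, and manual scrubs are best triggered only on a cluster that is otherwise healthy and not recovering:

  # inspect a few scrub scheduling settings
  ceph config get osd osd_scrub_begin_hour
  ceph config get osd osd_scrub_end_hour
  ceph config get osd osd_deep_scrub_interval

  # trigger a manual scrub or deep scrub of a single PG
  ceph pg scrub 2.1a
  ceph pg deep-scrub 2.1a

  # follow up on the result
  ceph -s
  ceph health detail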

Unit 6: Common recovery processes

Topics:

  • failed OSD replacement workflow
  • rebalancing expectations
  • recovery throttling concepts
  • handling near-full conditions
  • dealing with stuck PGs
  • restoring health after host loss
  • operational sequencing: observe, confirm, act, validate
  • documenting recovery actions for handover and post-incident review

Lab ideas (a command sketch follows the list):

  • simulate OSD loss
  • observe recovery state transitions
  • walk through a controlled replacement scenario
  • analyse a “cluster unhealthy after maintenance” incident
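
A hedged outline of the observe, confirm, act, validate sequence for a failed-OSD lab; the OSD id 7 is illustrative, and the exact replacement steps depend on the deployment tooling (cephadm, Proxmox, manual), so treat this as a skeleton rather than a finished runbook:

  # observe: what is down, and which PGs are affected?
  ceph osd tree
  ceph health detail

  # confirm: mark the failed OSD out so its data rebuilds onto the remaining OSDs
  ceph osd out 7

  # act: remove the failed OSD once it is certain not to return
  ceph osd purge 7 --yes-i-really-mean-it

  # validate: watch recovery drain down and confirm the cluster returns to HEALTH_OK
  ceph -w
  ceph -s

  # for planned maintenance (not failures), noout avoids unnecessary rebalancing
  ceph osd set noout
  ceph osd unset noout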

Assessment

Practical troubleshooting

Given cluster health output, explain the state, likely cause and next safe action.

Recovery runbook exercise

Document the response to a failed OSD and degraded PGs.

Key takeaways

  • operational confidence
  • better health interpretation
  • safer recovery decisions

Built for engineers responsible for live Ceph clusters and real incident response.