Health Verification
Rook and Ceph upgrades are designed to ensure data remains available even while the upgrade is proceeding. Rook will perform the upgrades in a rolling fashion such that application pods are not disrupted. To ensure the upgrades are seamless, it is important to begin the upgrades with Ceph in a fully healthy state. This guide reviews ways of verifying the health of a CephCluster.
See the troubleshooting documentation for any issues during upgrades.
Pods all Running
In a healthy Rook cluster, all pods in the Rook namespace should be in the Running (or Completed) state and have few, if any, pod restarts.
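The check above can be sketched in Python against the JSON form of the pod list, for example the output of kubectl -n rook-ceph get pods -o json. This is a minimal sketch, not part of Rook: the rook-ceph namespace, the function name find_unhealthy_pods, and the max_restarts threshold are assumptions chosen for illustration. Note that a pod shown as Completed by kubectl has the phase Succeeded in its status.

```python
import json

# kubectl displays pods in the Succeeded phase as "Completed".
HEALTHY_PHASES = {"Running", "Succeeded"}


def find_unhealthy_pods(pod_list_json, max_restarts=3):
    """Return (pod name, reason) pairs for pods that look unhealthy.

    max_restarts is an arbitrary illustrative threshold for "few" restarts.
    """
    problems = []
    for pod in json.loads(pod_list_json).get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        if phase not in HEALTHY_PHASES:
            problems.append((name, f"phase is {phase}"))
        # Sum restarts over all containers in the pod.
        restarts = sum(
            c.get("restartCount", 0) for c in status.get("containerStatuses", [])
        )
        if restarts > max_restarts:
            problems.append((name, f"{restarts} container restarts"))
    return problems
```

An empty result means every pod is Running or Completed and under the restart threshold; any other result lists the pods worth inspecting before the upgrade.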
Status Output
The Rook toolbox contains the Ceph tools that give status details of the cluster through the ceph status command. In its output, note the following indications that the cluster is in a healthy state:
- Cluster health: The overall cluster status is HEALTH_OK and there are no warning or error status messages displayed.
- Monitors (mon): All of the monitors are included in the quorum list.
- Manager (mgr): The Ceph manager is in the active state.
- OSDs (osd): All OSDs are up and in.
- Placement groups (pgs): All PGs are in the active+clean state.
- (If applicable) Ceph filesystem metadata server (mds): All MDSes are active for all filesystems.
- (If applicable) Ceph object store RADOS gateways (rgw): All daemons are active.
If the ceph status output deviates from the general good health described above, there may be an issue that needs to be investigated further. Other commands, such as ceph osd status, may show more relevant details on the health of the system. See the Ceph troubleshooting docs for help.
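Some of the health indications can also be checked programmatically from the machine-readable form of the status, obtained with ceph status --format json in the toolbox. The sketch below covers only the cluster health, OSD, and placement group checks; the monitor, manager, MDS, and RGW checks are left out. The function name is hypothetical, and the JSON field names used (health.status, osdmap.num_osds, osdmap.num_up_osds, osdmap.num_in_osds, pgmap.pgs_by_state) are assumptions about the Ceph release in use and should be confirmed against real output.

```python
import json


def ceph_health_problems(status_json):
    """Return a list of deviations from the healthy state described above."""
    status = json.loads(status_json)
    problems = []

    # Cluster health: expect HEALTH_OK.
    health = status.get("health", {}).get("status")
    if health != "HEALTH_OK":
        problems.append(f"cluster health is {health}")

    # OSDs: expect all to be up and in.
    # Assumption: some Ceph releases nest the counters one level deeper.
    osdmap = status.get("osdmap", {})
    osdmap = osdmap.get("osdmap", osdmap)
    total = osdmap.get("num_osds", 0)
    up = osdmap.get("num_up_osds", 0)
    osds_in = osdmap.get("num_in_osds", 0)
    if up != total or osds_in != total:
        problems.append(f"{up}/{total} OSDs up, {osds_in}/{total} in")

    # Placement groups: expect every PG to be active+clean.
    for entry in status.get("pgmap", {}).get("pgs_by_state", []):
        if entry["state_name"] != "active+clean":
            problems.append(f"{entry['count']} PGs are {entry['state_name']}")
    return problems
```

An empty list corresponds to the healthy state for the three checks covered; anything else is a prompt to look further with the troubleshooting docs rather than a diagnosis.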
Upgrading an unhealthy cluster
Rook will not upgrade Ceph daemons if the health is in a HEALTH_ERR state. Rook can be configured to proceed with the (potentially unsafe) upgrade by setting either skipUpgradeChecks: true or continueUpgradeAfterChecksEvenIfNotHealthy: true as described in the cluster CR settings.
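As an illustration of how one of these settings might be applied, the sketch below builds a JSON merge patch for the CephCluster resource. It assumes that both fields sit directly under spec in the cluster CR, which should be confirmed in the cluster CR settings; the function name and its parameters are hypothetical.

```python
import json


def upgrade_override_patch(skip_checks=False, continue_if_unhealthy=False):
    """Build a JSON merge patch that enables the requested override(s).

    Assumption: both settings live directly under `spec` of the CephCluster.
    """
    spec = {}
    if skip_checks:
        spec["skipUpgradeChecks"] = True
    if continue_if_unhealthy:
        spec["continueUpgradeAfterChecksEvenIfNotHealthy"] = True
    return json.dumps({"spec": spec})
```

The resulting string could be passed to kubectl patch with --type merge against the CephCluster resource. Because the upgrade is potentially unsafe in this state, the override is best removed again once the upgrade has completed.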
Container Versions
The container version running in a specific pod in the Rook cluster can be verified in its pod spec output. For example, for the monitor pod mon-b, the image fields of the containers in the pod spec show the version it is running.
The status and container versions for all Rook pods can also be collected all at once from the pod list.
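One way to gather this information is to read the pod list as JSON (for example from kubectl get pods -o json in the Rook namespace) and pull out each pod's phase and container images. The following is a minimal sketch with a hypothetical function name; the optional name_contains filter is an illustrative way to narrow the result to a single pod such as mon-b.

```python
import json


def pod_images(pod_list_json, name_contains=""):
    """Map pod name -> (phase, [container images]) for matching pods.

    An empty name_contains matches every pod.
    """
    result = {}
    for pod in json.loads(pod_list_json).get("items", []):
        name = pod["metadata"]["name"]
        if name_contains not in name:
            continue
        images = [c["image"] for c in pod["spec"].get("containers", [])]
        phase = pod.get("status", {}).get("phase", "Unknown")
        result[name] = (phase, images)
    return result
```

Calling pod_images(data, "mon-b") limits the output to pods whose name contains mon-b, while pod_images(data) reports on every pod in the list.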
The rook-version label exists on Ceph resources. A summary of the various resource controllers reports the requested, updated, and currently available replicas for various Rook resources, in addition to the version of Rook for resources managed by Rook. Note that the operator and toolbox deployments do not have a rook-version label set.
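Such a summary can be sketched for deployments by reading the JSON deployment list (for example from kubectl get deployments -o json in the Rook namespace). The function name and the output format are assumptions for illustration; deployments without the rook-version label, such as the operator and toolbox, are reported with a placeholder.

```python
import json


def deployment_summary(deployment_list_json):
    """Return one line per deployment: replicas and rook-version label."""
    lines = []
    for dep in json.loads(deployment_list_json).get("items", []):
        meta = dep["metadata"]
        status = dep.get("status", {})
        requested = dep.get("spec", {}).get("replicas", 0)
        updated = status.get("updatedReplicas", 0)
        available = status.get("availableReplicas", 0)
        # The operator and toolbox deployments carry no rook-version label.
        version = meta.get("labels", {}).get("rook-version", "<none>")
        lines.append(
            f"{meta['name']} req/upd/avl: {requested}/{updated}/{available} "
            f"rook-version={version}"
        )
    return lines
```

During a rolling upgrade, a deployment whose updated or available count is below the requested count, or whose rook-version label still shows the old version, has not finished updating.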
Rook Volume Health
Any pod that is using a Rook volume should also remain healthy:
- The pod should be in the Running state with few, if any, restarts.
- There should be no errors in its logs.
- The pod should still be able to read and write to the attached Rook volume.