Ceph
PLEASE NOTE: This document applies to version v0.8, not to the latest stable release, v1.9.
Ceph Cluster CRD
Rook allows creation and customization of storage clusters through the custom resource definitions (CRDs).
Sample
To get you started, here is a simple example of a CRD to configure a Ceph cluster with all nodes and all devices. More examples are included later in this doc.
apiVersion: ceph.rook.io/v1beta1
kind: Cluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  dataDirHostPath: /var/lib/rook
  serviceAccount: rook-ceph-cluster
  storage:
    useAllNodes: true
    useAllDevices: true
In addition to the CRD, you will also need to create a namespace, role, and role binding as seen in the common cluster resources below.
Settings
Settings can be specified at the global level to apply to the cluster as a whole, while other settings can be specified at more fine-grained levels. If any setting is unspecified, a suitable default will be used automatically.
Cluster metadata
- name: The name that will be used internally for the Ceph cluster. Most commonly the name is the same as the namespace since multiple clusters are not supported in the same namespace.
- namespace: The Kubernetes namespace that will be created for the Rook cluster. The services, pods, and other resources created by the operator will be added to this namespace. The common scenario is to create a single Rook cluster. If multiple clusters are created, they must not have conflicting devices or host paths.
Cluster Settings
- dataDirHostPath: The path on the host (hostPath) where config and data should be stored for each of the services. If the directory does not exist, it will be created. Because this directory persists on the host, it will remain after pods are deleted. If this value is empty, each pod will get an ephemeral directory to store their config files that is tied to the lifetime of the pod running on that node. More details can be found in the Kubernetes empty dir docs.
  - On Minikube environments, use /data/rook. Minikube boots into a tmpfs but it provides some directories where files can be persisted across reboots. Using one of these directories will ensure that Rook’s data and configuration files are persisted and that enough storage space is available.
  - WARNING: For test scenarios, if you delete a cluster and start a new cluster on the same hosts, the path used by dataDirHostPath must be deleted. Otherwise, stale keys and other config will remain from the previous cluster and the new mons will fail to start.
- dashboard: Settings for the Ceph dashboard. To view the dashboard in your browser, see the dashboard guide.
  - enabled: Whether to enable the dashboard to view cluster status.
- serviceAccount: The service account under which the OSD pods will run, which gives them access to ConfigMaps in the cluster’s namespace. If not set, the default of rook-ceph-cluster will be used.
- network: The network settings for the cluster.
  - hostNetwork: Uses the network of the hosts instead of the SDN below the containers.
- mon: Contains mon-related options; see the mon settings. For more details on the mons and when to choose a number other than 3, see the mon health design doc.
- placement: See the placement configuration settings.
- resources: See the resources configuration settings.
- storage: Storage selection and configuration that will be used across the cluster. Note that these settings can be overridden for specific nodes.
  - useAllNodes: true or false, indicating if all nodes in the cluster should be used for storage according to the cluster level storage selection and configuration values. If individual nodes are specified under the nodes field below, then useAllNodes must be set to false.
  - nodes: Names of individual nodes in the cluster that should have their storage included in accordance with either the cluster level configuration specified above or any node specific overrides described in the next section below. useAllNodes must be set to false to use specific nodes and their config.
  - storage selection settings
  - storage configuration settings
Node Updates
Nodes can be added and removed over time by updating the Cluster CRD, for example with:
  kubectl -n rook-ceph edit cluster.ceph.rook.io rook-ceph
This will bring up your default text editor and allow you to add and remove storage nodes from the cluster.
This feature is only available when useAllNodes has been set to false.
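As a rough sketch of what such an edit might look like, the fragment below shows the storage section after a second node has been appended to the nodes list. The node names are only illustrative (borrowed from the samples later in this document), and the rest of the spec is assumed to be unchanged.

```yaml
spec:
  storage:
    useAllNodes: false        # must be false for per-node entries to apply
    nodes:
    - name: "172.17.4.101"    # node that was already part of the cluster
    - name: "172.17.4.201"    # entry added in the editor (illustrative name)
```

Removing a node would be the reverse edit: delete its entry from the list and save.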
Mon Settings
- count: Sets the number of mons to be started. The number should be odd and between 1 and 9. If not specified, the default is set to 3 and allowMultiplePerNode is also set to true.
- allowMultiplePerNode: Enables (true) or disables (false) the placement of multiple mons on one node. Default is false.
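To make the two settings concrete, here is a minimal sketch that assumes the mon block sits directly under spec, as the cluster settings list describes. The values are examples only, not recommendations.

```yaml
spec:
  mon:
    count: 3                      # should be odd and between 1 and 9
    allowMultiplePerNode: false   # do not place more than one mon on a node
```

With allowMultiplePerNode set to false, the cluster presumably needs at least as many schedulable nodes as mons.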
Node Settings
In addition to the cluster level settings specified above, each individual node can also specify configuration to override the cluster level settings and defaults. If a node does not specify any configuration then it will inherit the cluster level settings.
- name: The name of the node, which should match its kubernetes.io/hostname label.
- config: Config settings applied to all OSDs on the node unless overridden by devices or directories. See the config settings below.
- storage selection settings
- storage configuration settings
Storage Selection Settings
Below are the settings available, both at the cluster and individual node level, for selecting which storage resources will be included in the cluster.
- useAllDevices: true or false, indicating whether all devices found on nodes in the cluster should be automatically consumed by OSDs. Not recommended unless you have a very controlled environment where you will not risk formatting of devices with existing data. When true, all devices will be used except those with partitions created or a local filesystem. Is overridden by deviceFilter if specified.
- deviceFilter: A regular expression that allows selection of devices to be consumed by OSDs. If individual devices have been specified for a node then this filter will be ignored. This field uses golang regular expression syntax. For example:
  - "sdb": Only selects the sdb device if found
  - "^sd.": Selects all devices starting with sd
  - "^sd[a-d]": Selects devices starting with sda, sdb, sdc, and sdd if found
  - "^s": Selects all devices that start with s
  - "^[^r]": Selects all devices that do not start with r
- devices: A list of individual device names belonging to this node to include in the storage cluster.
  - name: The name of the device (e.g., sda).
  - config: Device-specific config settings. See the config settings below.
- directories: A list of directory paths that will be included in the storage cluster. Note that using two directories on the same physical device can cause a negative performance impact.
  - path: The path on disk of the directory (e.g., /rook/storage-dir).
  - config: Directory-specific config settings. See the config settings below.
- location: Location information about the cluster to help with data placement, such as region or data center. This is directly fed into the underlying Ceph CRUSH map. More information on CRUSH maps can be found in the ceph docs.
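As an illustration of a cluster-level filter, the sketch below assumes every node should contribute storage but only from devices whose names start with sdb, sdc, or sdd. The regular expression is an example chosen for this sketch, following the same pattern as the "^sd[a-d]" case above.

```yaml
spec:
  storage:
    useAllNodes: true
    useAllDevices: false
    # Go regular expression: matches device names starting with sdb, sdc, or sdd
    deviceFilter: "^sd[b-d]"
```

Quoting the expression keeps characters such as ^ and [ from being interpreted by the YAML parser.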
OSD Configuration Settings
The following storage selection settings are specific to Ceph and do not apply to other backends. All variables are key-value pairs represented as strings.
- metadataDevice: Name of a device to use for the metadata of OSDs on each node. Performance can be improved by using a low latency device (such as SSD or NVMe) as the metadata device, while other spinning platter (HDD) devices on a node are used to store data.
- storeType: filestore or bluestore, the underlying storage format to use for each OSD. The default is set dynamically to bluestore for devices, while filestore is the default for directories. Set this store type explicitly to override the default. Warning: Bluestore is not recommended for directories in production. Bluestore does not purge data from the directory and over time will grow without the ability to compact or shrink.
- databaseSizeMB: The size in MB of a bluestore database. Include quotes around the size.
- walSizeMB: The size in MB of a bluestore write ahead log (WAL). Include quotes around the size.
- journalSizeMB: The size in MB of a filestore journal. Include quotes around the size.
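The fragment below is a hedged sketch of how these keys might be combined at the cluster level. The device name nvme0n1 and the sizes are hypothetical placeholders, and the placement under storage.config follows the samples later in this document.

```yaml
spec:
  storage:
    config:
      storeType: bluestore
      metadataDevice: "nvme0n1"   # hypothetical low-latency device name
      databaseSizeMB: "1024"      # quoted, because all values are strings
      walSizeMB: "512"            # quoted; the size is illustrative only
```

journalSizeMB is left out here because it applies to filestore rather than bluestore.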
Placement Configuration Settings
Placement configuration for the cluster services. It includes the following keys: mgr, mon, osd and all. Each service will have its placement configuration generated by merging the generic configuration under all with the most specific one (which will override any attributes).
A Placement configuration is specified (according to the kubernetes PodSpec) as:
- nodeAffinity: Kubernetes NodeAffinity
- podAffinity: Kubernetes PodAffinity
- podAntiAffinity: Kubernetes PodAntiAffinity
- tolerations: list of Kubernetes Toleration
The mon pod does not allow Pod affinity or anti-affinity.
This is because the mons have built-in anti-affinity with each other through the operator. The operator chooses which nodes will run a mon. Each mon is then tied to a node with a node selector using a hostname.
See the mon design doc for more details on the mon failover design.
Cluster-wide Resources Configuration Settings
Resources should be specified so that the Rook components are handled according to the Kubernetes Pod Quality of Service classes. This helps keep Rook components running when, for example, a node runs out of memory, because whether a component is killed depends on its Quality of Service class.
You can set resource requests/limits for rook components through the Resource Requirements/Limits structure in the following keys:
- mgr: Set resource requests/limits for MGRs.
- mon: Set resource requests/limits for Mons.
- osd: Set resource requests/limits for OSDs.
Resource Requirements/Limits
For more information on resource requests/limits see the official Kubernetes documentation: Kubernetes - Managing Compute Resources for Containers
- requests: Requests for cpu or memory.
  - cpu: Request for CPU (examples: 1 for one CPU core, 500m for 50% of one CPU core).
  - memory: Request for Memory (examples: 1Gi for one gigabyte of memory, 512Mi for half a gigabyte of memory).
- limits: Limits for cpu or memory.
  - cpu: Limit for CPU (examples: 1 for one CPU core, 500m for 50% of one CPU core).
  - memory: Limit for Memory (examples: 1Gi for one gigabyte of memory, 512Mi for half a gigabyte of memory).
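A minimal sketch of cluster-wide settings, assuming the mgr, mon, and osd keys sit under spec.resources as described above. The numbers are placeholders for illustration, not sizing advice.

```yaml
spec:
  resources:
    mgr:
      requests:
        cpu: "500m"       # half a CPU core
        memory: "512Mi"
      limits:
        cpu: "500m"
        memory: "512Mi"
    mon:
      requests:
        cpu: "500m"
        memory: "1Gi"
      limits:
        cpu: "500m"
        memory: "1Gi"
    osd:
      requests:
        cpu: "1"          # one CPU core
        memory: "2Gi"
      limits:
        cpu: "1"
        memory: "2Gi"
```

Requests are set equal to limits in this sketch because Kubernetes assigns the Guaranteed Quality of Service class when that holds for every container in a pod. Whether that trade-off suits a given cluster is a judgment call.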
Samples
Here are several samples for configuring Ceph clusters. Each of the samples must also include the namespace and corresponding access granted for management by the Ceph operator. See the common cluster resources below.
Storage configuration: All devices
apiVersion: ceph.rook.io/v1beta1
kind: Cluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  dataDirHostPath: /var/lib/rook
  serviceAccount: rook-ceph-cluster
  network:
    hostNetwork: false
  dashboard:
    enabled: true
  # cluster level storage configuration and selection
  storage:
    useAllNodes: true
    useAllDevices: true
    deviceFilter:
    location:
    config:
      metadataDevice:
      databaseSizeMB: "1024" # this value can be removed for environments with normal sized disks (100 GB or larger)
      journalSizeMB: "1024" # this value can be removed for environments with normal sized disks (20 GB or larger)
Storage Configuration: Specific devices
Individual nodes and their config can be specified so that only the named nodes below will be used as storage resources. Each node’s ‘name’ field should match its ‘kubernetes.io/hostname’ label.
apiVersion: ceph.rook.io/v1beta1
kind: Cluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  dataDirHostPath: /var/lib/rook
  serviceAccount: rook-ceph-cluster
  network:
    hostNetwork: false
  dashboard:
    enabled: true
  # cluster level storage configuration and selection
  storage:
    useAllNodes: false
    useAllDevices: false
    deviceFilter:
    location:
    config:
      metadataDevice:
      databaseSizeMB: "1024" # this value can be removed for environments with normal sized disks (100 GB or larger)
      journalSizeMB: "1024" # this value can be removed for environments with normal sized disks (20 GB or larger)
    nodes:
    - name: "172.17.4.101"
      directories: # specific directories to use for storage can be specified for each node
      - path: "/rook/storage-dir"
    - name: "172.17.4.201"
      devices: # specific devices to use for storage can be specified for each node
      - name: "sdb"
      - name: "sdc"
      config: # configuration can be specified at the node level which overrides the cluster level config
        storeType: bluestore
    - name: "172.17.4.301"
      deviceFilter: "^sd."
Storage Configuration: Cluster wide Directories
This example is based on the Storage Configuration: Specific devices example. Individual nodes can override the cluster-wide directories list.
apiVersion: ceph.rook.io/v1beta1
kind: Cluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  dataDirHostPath: /var/lib/rook
  serviceAccount: rook-ceph-cluster
  network:
    hostNetwork: false
  dashboard:
    enabled: true
  # cluster level storage configuration and selection
  storage:
    useAllNodes: false
    useAllDevices: false
    config:
      databaseSizeMB: "1024" # this value can be removed for environments with normal sized disks (100 GB or larger)
      journalSizeMB: "1024" # this value can be removed for environments with normal sized disks (20 GB or larger)
    directories:
    - path: "/rook/storage-dir"
    nodes:
    - name: "172.17.4.101"
      directories: # specific directories to use for storage can be specified for each node
      # overrides the above `directories` values for this node
      - path: "/rook/my-node-storage-dir"
    - name: "172.17.4.201"
Node Affinity
To control where various services will be scheduled by Kubernetes, use the placement configuration sections below. The example under ‘all’ would have all services scheduled on Kubernetes nodes labeled with ‘role=storage-node’ and tolerate taints with a key of ‘storage-node’.
apiVersion: ceph.rook.io/v1beta1
kind: Cluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  dataDirHostPath: /var/lib/rook
  serviceAccount: rook-ceph-cluster
  network:
    hostNetwork: false
  dashboard:
    enabled: true
  placement:
    all:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: role
              operator: In
              values:
              - storage-node
      tolerations:
      - key: storage-node
        operator: Exists
    mgr:
      nodeAffinity:
      tolerations:
    mon:
      nodeAffinity:
      tolerations:
    osd:
      nodeAffinity:
      tolerations:
Resource requests/Limits
To control how many resources the rook components can request/use, you can set requests and limits in Kubernetes for them.
You can override these requests/limits for OSDs per node when using useAllNodes: false in the node item in the nodes list.
WARNING: Before setting resource requests/limits, please take a look at the Ceph documentation for recommendations for each component: Ceph - Hardware Recommendations.
apiVersion: ceph.rook.io/v1beta1
kind: Cluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  dataDirHostPath: /var/lib/rook
  serviceAccount: rook-ceph-cluster
  # cluster level resource requests/limits configuration
  resources:
  storage:
    useAllNodes: false
    nodes:
    - name: "172.17.4.201"
      resources:
        limits:
          cpu: "2"
          memory: "4096Mi"
        requests:
          cpu: "2"
          memory: "4096Mi"
Common Cluster Resources
Each Ceph cluster must be created in a namespace, and the Rook operator must be given access to manage the cluster in that namespace. The namespace and these access controls must be added to each of the examples shown previously.
apiVersion: v1
kind: Namespace
metadata:
  name: rook-ceph
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: rook-ceph-cluster
  namespace: rook-ceph
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: rook-ceph-cluster
  namespace: rook-ceph
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: [ "get", "list", "watch", "create", "update", "delete" ]
---
# Allow the operator to create resources in this cluster's namespace
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: rook-ceph-cluster-mgmt
  namespace: rook-ceph
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: rook-ceph-cluster-mgmt
subjects:
- kind: ServiceAccount
  name: rook-ceph-system
  namespace: rook-ceph-system
---
# Allow the pods in this namespace to work with configmaps
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: rook-ceph-cluster
  namespace: rook-ceph
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: rook-ceph-cluster
subjects:
- kind: ServiceAccount
  name: rook-ceph-cluster
  namespace: rook-ceph