Contents
- What is CephFS and Why Use It in Kubernetes?
- CephFS fundamentals and architecture
- CephFS vs RBD vs RGW: choosing the right interface
- Benefits of CephFS for Kubernetes workloads
- CephFS Integration Options for Kubernetes
- Ceph CSI and CephFS driver overview
- Rook: Kubernetes operator for Ceph
- External Ceph cluster vs in‑cluster Rook deployment
- Removal of in‑tree CephFS plugin and CSI migration
- Provisioning CephFS Storage in Kubernetes
- Defining CephFS storage classes (fsName, pool, reclaim policy)
- Creating PersistentVolumeClaims with ReadWriteMany
- Mounting CephFS volumes in pods (deployment examples)
- Sharing volumes across namespaces and enabling multi‑tenancy
- Performance, Reliability, and Best Practices
- Scaling metadata servers and designing pools
- Replication vs erasure coding for CephFS data
- Network and hardware tuning for throughput
- Monitoring CephFS with dashboards and metrics
- Common Pitfalls and Troubleshooting
- Avoiding misconfiguration of pools, secrets, and storage classes
- Kernel vs FUSE CephFS clients and compatibility
- Handling mount errors, slow requests, and stuck PGs
- Upgrading Ceph and Rook without downtime
- Use Cases and Deployment Patterns
- Shared file storage for microservices and logs
- High‑performance computing and AI workloads
- Container registries, CI/CD caches, and artifact storage
- Multi‑cluster CephFS and external Ceph clusters
- Considerations for SMEs and Managed Services
- Using MicroCeph, MicroK8s, or QuantaStor
- Scaling CephFS as your Kubernetes clusters grow
- Backup and Recovery Strategies for CephFS in Kubernetes with Bacula
- Key Takeaways
What is CephFS and Why Use It in Kubernetes?
CephFS is a distributed file system that integrates cleanly with Kubernetes storage requirements. Businesses running containerized workloads need a persistent storage solution that offers both horizontal scaling and data consistency across nodes at the same time.
The CephFS architecture delivers these capabilities through a POSIX-compliant (Portable Operating System Interface) interface that multiple pods can access simultaneously – making it well suited to shared-storage scenarios in Kubernetes environments.
CephFS fundamentals and architecture
CephFS is a file system that operates on top of the Ceph distributed storage system, separating data and metadata management into distinct components. The Ceph architecture consists of three primary components:
- Metadata servers (MDS) responsible for handling filesystem metadata operations
- Object storage daemons (OSD) that store actual data blocks
- Monitors (MON) which maintain cluster state
The metadata servers process file system operations – such as open, close, and rename commands. Meanwhile, the OSD layer distributes data across multiple nodes using the CRUSH algorithm, determining data placement without the need for a centralized lookup table.
The file system relies on pools to organize data storage. CephFS requires at least two pools:
- Actual data. Contains the file contents themselves, split into objects of 4 MB by default
- Metadata. Stores directory structures, file attributes, and access permissions, all of which must remain highly available at all times
This separation allows administrators to apply different replication or erasure coding strategies to data and metadata, optimizing for performance and reliability based on the specific requirements of each environment.
Client access occurs through kernel modules or FUSE (Filesystem in Userspace) implementations.
- The kernel client integrates directly with the Linux kernel, offering better performance and lower CPU overhead for environments that use compatible kernel versions
- FUSE clients, on the other hand, offer broader compatibility across operating systems and kernel versions but tend to introduce additional context switching that may impact performance during heavy workload situations
Both clients communicate with the MDS for metadata operations and directly with OSDs for data transfer, eliminating the single-server bottleneck of traditional client-server file systems.
CephFS vs RBD vs RGW: choosing the right interface
Ceph offers three primary interfaces for data access, each optimized for different use cases within Kubernetes environments – CephFS, RBD, and RGW. Knowing where each interface fits best helps architects select the appropriate storage backend for specific workload requirements.
The choice of storage interface directly affects application performance, scalability limits, and operational complexity in production deployments. The table below summarizes the basics of each interface.
| Interface | Access Mode | Best For | Key Characteristics |
| --- | --- | --- | --- |
| CephFS | ReadWriteMany (RWX) | Shared file access, logs, configuration files | POSIX-compliant, multiple concurrent clients, file system semantics |
| RBD | ReadWriteOnce (RWO) | Databases, exclusive block storage | Lowest latency, snapshots, single-pod attachment |
| RGW | S3/Swift APIs | Archives, backups, unstructured data | Horizontal scaling, eventual consistency, object storage |
CephFS provides a POSIX-compliant shared file system that multiple clients can mount at the same time. This particular interface excels in scenarios that require shared access to common datasets – be it configuration files, application logs, or media assets that multiple pods need to read and write concurrently.
Rados Block Device (RBD) delivers block storage using ReadWriteOnce persistent volumes. RBD images offer better performance for database workloads and applications that require low-latency access to storage, since block operations bypass file system overhead. That said, RBD volumes can only be attached to a single pod at a time with standard configurations.
Rados Gateway (RGW) exposes object storage through S3 and Swift-compatible APIs. The object storage model provides eventual consistency while scaling horizontally without the need for coordination overhead required by file systems. Applications need to use S3 SDKs rather than file system calls, though, necessitating code modifications for workloads that were not originally designed with object storage in mind.
Benefits of CephFS for Kubernetes workloads
CephFS addresses several persistent storage challenges that appear when attempting to run stateful applications in Kubernetes clusters. These key advantages include:
- ReadWriteMany (RWX) access – Multiple pods mount the same volume simultaneously, enabling horizontal scaling for shared datasets
- Dynamic provisioning – CSI driver automatically creates subvolumes from storage class definitions without manual intervention
- Data protection – Configurable replication or erasure coding ensures durability with automatic recovery from node failures
- Horizontal scaling – Add metadata servers and OSD nodes to increase capacity and throughput as workloads grow
- Native Kubernetes integration – Standard PersistentVolumeClaim resources work without requiring Ceph-specific knowledge
The ReadWriteMany access mode removes the storage bottlenecks that typically occur with ReadWriteOnce volumes, which can only be attached to a single pod. Applications that need shared access to configuration files, logs, or media assets can scale horizontally without hitting storage constraints.
Dynamic provisioning via the Ceph CSI driver removes the need for manual volume creation. Administrators define storage classes that specify pool names and file system identifiers, which the CSI driver then uses to provision volumes automatically once applications submit PersistentVolumeClaims. This workflow makes self-service storage consumption possible for development teams.
Data protection occurs either via replication or with erasure coding at the pool level. Replication keeps multiple copies across nodes for quick recovery, while erasure coding splits data into fragments with parity information, reducing storage overhead. These redundancy mechanisms operate with full transparency, and Ceph can even reconstruct data automatically when failures occur.
CephFS Integration Options for Kubernetes
Integrating CephFS with Kubernetes means choosing between several deployment approaches, each with its own trade-offs in complexity, control, and operational overhead. The integration method determines how storage provisioning occurs, which components manage the Ceph cluster lifecycle, and where infrastructure responsibilities lie.
Organizations must weigh several factors when selecting an integration path – including their existing infrastructure, operational expertise, and scalability requirements.
Ceph CSI and CephFS driver overview
The Container Storage Interface (CSI) is a standard API that enables storage vendors to develop plugins that operate across different container orchestration platforms. The Ceph CSI driver implements this specification for CephFS volumes, replacing the in-tree Kubernetes volume plugin that has since been deprecated and removed.
The driver consists of two primary components that handle different aspects of volume lifecycle:
- Controller plugin – Runs as a deployment, handles volume creation, deletion, snapshots, and expansion operations
- Node plugin – Runs as a daemonset on every node, manages volume mounting and unmounting for pods
The CSI driver communicates with Ceph monitors and metadata servers to provision subvolumes within existing CephFS file systems. Whenever applications request storage through PersistentVolumeClaims, the provisioner creates isolated subvolumes with independent quotas and snapshots. Subvolume isolation provides tenant separation without requiring a separate file system for each application.
Node plugins mount CephFS volumes via the kernel client by default and can fall back to FUSE when the kernel version does not support the required features. The driver handles authentication using Ceph user credentials stored as Kubernetes secrets, which it references during volume provisioning and staging.
Rook: Kubernetes operator for Ceph
Rook turns Ceph deployment and management into a cloud-native experience by implementing the Kubernetes operator pattern. The Rook operator watches custom resources that describe the desired state of a Ceph cluster, then creates and manages the pods, services, and configurations needed to maintain that state.
Rook can offer several operational advantages for Kubernetes environments, such as:
- Declarative configuration – Define entire Ceph clusters using YAML manifests instead of manual commands
- Automated lifecycle management – Handles cluster upgrades, scaling, and failure recovery without operator intervention
- Kubernetes-native operations – Uses standard kubectl commands for cluster management and troubleshooting
- Built-in monitoring – Deploys Prometheus exporters and Grafana dashboards automatically
The operator deploys Ceph components as Kubernetes workloads: each monitor runs as its own deployment, each OSD runs as a deployment (typically one per disk), and metadata servers run as deployments with anti-affinity rules for high availability. This pod-based architecture lets Kubernetes handle node failures, resource scheduling, and health monitoring with its native cluster capabilities.
Rook simplifies CephFS provisioning because storage classes can be created alongside the CephFS custom resources it manages. Administrators specify pool configurations, replica counts, and file system parameters in a CephFilesystem resource, which Rook translates into the appropriate Ceph commands. This abstraction eliminates the need to run ceph command-line tools manually.
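As an illustration, a minimal CephFilesystem manifest could look like the sketch below. The file system name, namespace, and sizing values are assumptions rather than recommendations, and real deployments usually add failure-domain and placement settings.

```yaml
# Hypothetical minimal CephFilesystem resource for a Rook-managed cluster;
# names and sizes are placeholders.
apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: myfs
  namespace: rook-ceph          # namespace of the Rook-managed Ceph cluster
spec:
  metadataPool:
    replicated:
      size: 3                   # three-way replication for metadata
  dataPools:
    - name: replicated
      replicated:
        size: 3                 # replicated data pool for general workloads
  metadataServer:
    activeCount: 1              # one active MDS
    activeStandby: true         # plus a hot standby for failover
```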
External Ceph cluster vs in‑cluster Rook deployment
Organizations can integrate CephFS with Kubernetes using either an external Ceph cluster managed independently or an in-cluster Rook deployment running Ceph components as pods. Each approach suits a different operational model and set of infrastructure requirements, as shown in the table below.
| Aspect | External Ceph Cluster | In-Cluster Rook Deployment |
| --- | --- | --- |
| Infrastructure | Dedicated bare-metal or VMs outside Kubernetes | Ceph components run as pods within Kubernetes |
| Management | Separate tools and procedures for Ceph | Unified Kubernetes-native operations |
| Failure domains | Clear separation between storage and compute | Storage and compute share infrastructure |
| Multi-cluster | Single cluster serves multiple Kubernetes clusters | Typically one Rook per Kubernetes cluster |
| Expertise required | Storage team manages Ceph independently | Kubernetes team manages entire stack |
| Resource planning | Storage capacity independent of compute nodes | Requires sufficient node resources for OSDs |
External clusters benefit organizations with existing Ceph deployments or dedicated storage teams. The separation allows storage administrators to manage Ceph with familiar tools and without extensive Kubernetes expertise, and infrastructure duplication drops significantly because multiple Kubernetes clusters can share a single external Ceph cluster.
Rook deployments work well for organizations seeking operational simplicity and unified infrastructure management. The approach reduces the number of systems to maintain but requires careful resource planning to prevent storage pods from competing with application workloads. Many deployments dedicate specific nodes to storage using taints and tolerations.
Hybrid approaches are also possible, such as running metadata servers and monitors under Rook while connecting to external OSDs for data storage.
Removal of in‑tree CephFS plugin and CSI migration
Kubernetes deprecated the in-tree CephFS volume plugin in version 1.28 and removed it completely in version 1.31. Organizations that still use the legacy plugin must migrate to the Ceph CSI driver to keep CephFS working on current Kubernetes versions.
The in-tree plugin implemented storage functionality directly in the Kubernetes codebase, which created a number of operational challenges: storage updates required Kubernetes releases, bug fixes could not be shipped independently, and the extra code increased project maintenance complexity.
Unlike several other in-tree plugins, the CephFS plugin has no automatic CSI migration path. Existing workloads must be moved to new PersistentVolumes and PersistentVolumeClaims backed by the Ceph CSI driver, which is why the migration should be planned and completed before upgrading to a Kubernetes version that removes the plugin.
Provisioning CephFS Storage in Kubernetes
Provisioning CephFS storage in Kubernetes requires configuring storage classes that define how volumes are created, establishing persistent volume claims that request storage, and mounting those volumes in pod specifications. The provisioning workflow connects application storage requirements to underlying CephFS infrastructure through declarative Kubernetes resources.
Understanding each component in the provisioning chain allows administrators to design storage configurations that match workload requirements for capacity, performance, and access patterns.
Defining CephFS storage classes (fsName, pool, reclaim policy)
Storage classes act as templates that describe how dynamic volumes should be provisioned. The CephFS storage class specifies which file system to use, which data pool stores file contents, and how volumes should be handled when claims are deleted.
Essential storage class parameters include:
- fsName – Identifies the CephFS file system where subvolumes are created
- pool – Specifies the data pool for storing file contents
- mounter – Selects kernel or fuse client for mounting volumes
- reclaimPolicy – Determines whether volumes are deleted or retained when claims are removed
- volumeBindingMode – Controls when volume provisioning occurs relative to pod scheduling
The fsName parameter must match an existing CephFS file system in the Ceph cluster. The CSI driver queries the Ceph cluster to verify the file system exists before attempting provisioning operations. The file system validation prevents provisioning failures caused by configuration errors.
Pool selection impacts performance and durability characteristics:
- SSD-backed pools – Low-latency storage for databases and performance-critical workloads
- HDD-backed pools – Cost-effective capacity for archives and bulk storage
- Mixed strategies – Different replication levels per storage tier
Reclaim policies define volume lifecycle behavior. The Delete policy automatically removes subvolumes when PersistentVolumeClaims are deleted, reclaiming storage capacity. The Retain policy preserves subvolumes after claim deletion, allowing administrators to recover data or investigate issues before manual cleanup. The reclaim policy selection balances operational convenience against data safety requirements.
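The following is a hedged sketch of such a storage class for the upstream Ceph CSI driver. The provisioner name, clusterID, pool, and secret references are placeholders that must match the actual CSI deployment (Rook, for example, typically ships its own provisioner name and secret names).

```yaml
# Illustrative CephFS storage class; all values are placeholders.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cephfs-sc
provisioner: cephfs.csi.ceph.com                  # CephFS CSI driver name
parameters:
  clusterID: <cluster-id>                         # cluster ID from the CSI configuration
  fsName: myfs                                    # existing CephFS file system
  pool: myfs-replicated                           # data pool backing the subvolumes
  csi.storage.k8s.io/provisioner-secret-name: csi-cephfs-secret
  csi.storage.k8s.io/provisioner-secret-namespace: ceph-csi
  csi.storage.k8s.io/node-stage-secret-name: csi-cephfs-secret
  csi.storage.k8s.io/node-stage-secret-namespace: ceph-csi
reclaimPolicy: Delete                             # or Retain to keep subvolumes after claim deletion
volumeBindingMode: Immediate
```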
Creating PersistentVolumeClaims with ReadWriteMany
PersistentVolumeClaims request storage from defined storage classes without requiring knowledge of underlying storage implementation. The ReadWriteMany access mode distinguishes CephFS from block storage by making it possible for multiple pods to mount volumes simultaneously.
Claims specify storage requirements through several key fields:
- accessModes – Must include ReadWriteMany for shared CephFS access
- resources.requests.storage – Defines required capacity for the volume
- storageClassName – References the storage class for provisioning
- volumeMode – Set to Filesystem for CephFS volumes
The ReadWriteMany access mode enables horizontal scaling patterns, with multiple pod replicas sharing common data. Applications such as content management systems, shared configuration stores, and distributed logging benefit from this capability. The simultaneous access eliminates the need to coordinate storage between pods.
Storage capacity requests drive quota enforcement on provisioned subvolumes. The CSI driver creates subvolumes with quotas matching the requested size, preventing individual applications from consuming excessive storage. Quota enforcement happens at the CephFS level, where the metadata servers reject write operations that would exceed the limit.
Storage class selection determines which CephFS file system and pool serve the claim. Applications can request different performance tiers or durability levels by specifying appropriate storage classes in claim definitions. The storage class abstraction allows applications to declare requirements without the need to understand all the Ceph infrastructure details.
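A claim against such a class could be sketched as follows; the namespace and the 20Gi request are arbitrary examples, and cephfs-sc refers to the placeholder storage class above.

```yaml
# Example PersistentVolumeClaim requesting shared CephFS storage.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data
  namespace: demo
spec:
  accessModes:
    - ReadWriteMany              # multiple pods may mount the volume concurrently
  volumeMode: Filesystem
  resources:
    requests:
      storage: 20Gi              # becomes the subvolume quota
  storageClassName: cephfs-sc    # placeholder CephFS storage class
```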
Mounting CephFS volumes in pods (deployment examples)
Pods consume provisioned storage by referencing PersistentVolumeClaims in volume specifications. The volume mount configuration connects claim names to mount paths within containers, making storage accessible to application processes.
Volume mounting involves the following specification fields:
- volumes[] – Declares which claims the pod uses
- volumeMounts[] – Defines mount paths within specific containers
- subPath – Optional field to mount subdirectories instead of entire volumes
- readOnly – Restricts mount to read-only access when needed
Multiple containers within a pod can mount the same volume at different paths, allowing for sidecar patterns where one container writes data while another processes or exports it. The shared volume access within pods simplifies data exchange between tightly coupled containers.
The CSI node plugin handles mounting through these steps:
- Retrieves Ceph credentials from Kubernetes secrets
- Establishes connections to monitors and metadata servers
- Mounts the subvolume using kernel or FUSE clients
- Completes automatically as part of pod startup
SubPath mounting allows pods to isolate their view of shared volumes. Instead of seeing the entire subvolume contents, containers only access specified subdirectories. This lets multiple applications share storage while maintaining logical separation and reduces complexity in multi-tenant scenarios.
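The deployment sketch below shows three replicas sharing the placeholder claim from the previous example; the image, paths, and subPath value are illustrative.

```yaml
# Sketch of a Deployment whose replicas all mount the same CephFS-backed claim.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  namespace: demo
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: nginx
          image: nginx:1.27
          volumeMounts:
            - name: shared-data
              mountPath: /usr/share/nginx/html
              subPath: web-assets       # optional: expose only a subdirectory
              readOnly: true            # serve content without allowing writes
      volumes:
        - name: shared-data
          persistentVolumeClaim:
            claimName: shared-data      # the ReadWriteMany claim defined earlier
```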
Sharing volumes across namespaces and enabling multi‑tenancy
CephFS volumes can be shared across namespace boundaries through PersistentVolume objects that reference existing subvolumes. The cross-namespace sharing enables centralized data management while distributing access to multiple teams or applications.
Sharing approaches include:
- Pre-provisioned PersistentVolumes – Administrators create volumes referencing specific subvolumes, then create claims in multiple namespaces
- StorageClass with shared fsName – Multiple namespaces use the same storage class, receiving isolated subvolumes in a common file system
- Volume cloning – Create new volumes from snapshots, distributing copies across namespaces
- Namespace resource quotas – Limit storage consumption per namespace to prevent resource exhaustion
Pre-provisioned volumes provide the most direct sharing mechanism. Administrators create PersistentVolume resources that specify CephFS subvolume details, then create corresponding claims in target namespaces. The static provisioning workflow gives operators complete control over which namespaces access which storage.
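A pre-provisioned volume can be expressed roughly as below, assuming the upstream ceph-csi driver name; the volumeHandle, rootPath, clusterID, and secret reference are placeholders that must point at a real subvolume and real credentials.

```yaml
# Hypothetical pre-provisioned PersistentVolume pointing at an existing subvolume.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: shared-dataset-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain    # never delete the shared data automatically
  storageClassName: ""                     # keep the volume out of dynamic binding
  csi:
    driver: cephfs.csi.ceph.com
    volumeHandle: shared-dataset-pv        # any unique identifier for a static volume
    volumeAttributes:
      clusterID: <cluster-id>
      fsName: myfs
      staticVolume: "true"
      rootPath: /volumes/csi/shared-dataset
    nodeStageSecretRef:
      name: csi-cephfs-secret
      namespace: ceph-csi
```

Because a PersistentVolume binds to exactly one claim, sharing the same subvolume across several namespaces in practice means creating one such PersistentVolume per namespace, each pointing at the same rootPath.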
Multi-tenancy security operates through several mechanisms:
- Subvolume-level access controls – Each volume receives unique Ceph credentials
- Automatic credential management – CSI driver creates users with restricted capabilities
- Namespace isolation – Prevents cross-namespace data access
Resource quotas enforce capacity limits per namespace, preventing individual tenants from consuming entire storage pools. Administrators set namespace quotas that aggregate all PersistentVolumeClaim sizes, rejecting new claims that would exceed the limit. This enforcement protects shared infrastructure from exhaustion by a single tenant.
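A per-namespace cap on both claim count and the total capacity requested from a particular storage class might be sketched like this; the namespace, class name, and limits are placeholders.

```yaml
# Illustrative quota limiting CephFS consumption in one tenant namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: cephfs-quota
  namespace: team-a
spec:
  hard:
    persistentvolumeclaims: "10"                                     # max number of claims
    cephfs-sc.storageclass.storage.k8s.io/requests.storage: 500Gi    # total capacity from this class
```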
Performance, Reliability, and Best Practices
Optimizing CephFS performance in Kubernetes requires balancing metadata server capacity, pool design, network throughput, and monitoring visibility. The performance tuning approach must address both Ceph infrastructure characteristics and Kubernetes workload patterns to achieve production-grade reliability.
Scaling metadata servers and designing pools
Metadata server capacity determines how many file operations CephFS can handle concurrently. Each MDS instance processes directory traversals, file opens, and permission checks for specific portions of the file system namespace. The MDS scaling strategy has a direct impact on application responsiveness under load.
Active-standby MDS configurations provide high availability: one MDS handles all metadata operations while standbys remain ready to take over during failures. Active-active configurations distribute namespace portions across multiple MDS instances, allowing horizontal scaling for workloads with high metadata operation rates.
Pool design considerations include:
- Separate metadata and data pools – Different performance requirements justify isolated configurations
- Replica count – Three replicas balance durability against storage efficiency for metadata
- Placement groups – Calculate appropriate PG counts based on OSD count and pool size
- CRUSH rules – Control data distribution across failure domains
Metadata pools require fast storage and higher replication since metadata loss can corrupt entire file systems. SSD-backed metadata pools with three-way replication provide both performance and durability. Data pools can use erasure coding to reduce storage overhead while maintaining acceptable performance for sequential workloads.
Replication vs erasure coding for CephFS data
Replication creates multiple complete copies of each object in different OSDs. The replication approach offers fast recovery with consistent performance but consumes more raw storage capacity. Three-way replication requires three times the logical data size in physical storage.
Erasure coding splits data into fragments with parity information, similar to how a RAID configuration works. For example, a 4+2 erasure code stores data across six fragments where any four fragments would be enough to reconstruct the original data. The erasure coding approach reduces storage overhead to 1.5x while maintaining data protection.
Performance trade-offs include:
- Replication advantages – Lower latency, faster rebuilds, simpler operations
- Erasure coding advantages – Reduced storage costs, acceptable for sequential access
- Workload suitability – Replication for databases, erasure coding for archives
Metadata pools should always use replication due to their high sensitivity to latency. Data pools can rely on erasure coding for cost reduction when workloads primarily perform large sequential reads and writes, not small random operations.
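Under Rook, that split could be sketched as a CephFilesystem with a replicated default data pool plus an erasure-coded pool for bulk data; the 4+2 profile and all names are illustrative assumptions.

```yaml
# Sketch: replicated metadata and default data pools, erasure-coded bulk data pool.
apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: myfs-ec
  namespace: rook-ceph
spec:
  metadataPool:
    replicated:
      size: 3                    # metadata stays on replicated storage
  dataPools:
    - name: default
      replicated:
        size: 3                  # first data pool remains replicated
    - name: erasurecoded
      erasureCoded:
        dataChunks: 4            # four data fragments
        codingChunks: 2          # two parity fragments (~1.5x raw overhead)
  metadataServer:
    activeCount: 1
    activeStandby: true
```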
Network and hardware tuning for throughput
Network configuration significantly impacts CephFS performance since all I/O traverses the network between clients and OSDs. The network architecture should provide sufficient bandwidth and low latency for storage traffic.
Critical network considerations:
- Separate storage networks – Isolate Ceph traffic from application traffic
- 10GbE or faster – Minimum recommended bandwidth for production deployments
- Jumbo frames – Enable 9000 MTU to reduce packet processing overhead
- Network redundancy – Bond multiple interfaces for bandwidth and failover
Hardware tuning focuses on OSD node configuration. NVMe SSDs offer better performance than SATA SSDs for both data and metadata workloads. Adequate CPU and RAM on OSD nodes prevents bottlenecks during recovery operations; each OSD typically requires at least 2GB of RAM, with additional memory improving cache effectiveness.
Client-side tuning includes selecting appropriate mount options. The kernel CephFS client tends to outperform FUSE on hosts with compatible kernel versions, and disabling atime (access time) updates reduces metadata operations for read-heavy workloads.
Monitoring CephFS with dashboards and metrics
Effective monitoring provides visibility into CephFS health, performance bottlenecks, and capacity utilization. The monitoring strategy should track both Ceph cluster metrics and Kubernetes storage consumption patterns.
Essential metrics to monitor:
- MDS performance – Request latency, queue depth, cache utilization
- Pool capacity – Used space, available space, growth rates
- OSD health – Disk utilization, operation latency, error rates
- Client operations – Read/write throughput, IOPS, error counts
The Ceph dashboard provides built-in visualization of cluster health and performance. Prometheus exporters collect detailed metrics that can be visualized using Grafana. Alert rules should be set up to notify operators of capacity thresholds, performance degradation, and component failures before they impact applications.
Kubernetes-level monitoring tracks PersistentVolume usage, provisioning failures, and mount errors. The CSI driver exposes metrics about volume operations that complement Ceph cluster metrics. Combining both perspectives enables comprehensive troubleshooting when storage issues occur.
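As a hedged example, assuming the Prometheus Operator is installed and the Ceph exporter metrics are being scraped, basic alert rules could look like the following; metric names follow the upstream Ceph exporter and the thresholds are placeholders.

```yaml
# Illustrative alerting rules for Ceph health and capacity.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cephfs-alerts
  namespace: monitoring
spec:
  groups:
    - name: ceph.rules
      rules:
        - alert: CephHealthError
          expr: ceph_health_status == 2          # 2 = HEALTH_ERR
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Ceph cluster reports HEALTH_ERR"
        - alert: CephCapacityHigh
          expr: ceph_cluster_total_used_bytes / ceph_cluster_total_bytes > 0.80
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Ceph raw capacity usage above 80%"
```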
Common Pitfalls and Troubleshooting
CephFS deployments tend to hit predictable failure patterns around configuration errors, client compatibility, and operational procedures. Awareness of these common pitfalls speeds up troubleshooting and prevents issues from recurring. Effective troubleshooting, however, requires examining both the Kubernetes and Ceph layers to identify root causes.
Avoiding misconfiguration of pools, secrets, and storage classes
Configuration errors are the most common cause of CephFS provisioning failures in Kubernetes environments. Configuration validation should verify pool existence, credential validity, and storage class parameters before volume provisioning is even attempted.
Common configuration mistakes include:
- Non-existent pool names – Storage classes reference pools that do not exist in Ceph
- Incorrect fsName values – File system names that do not match actual CephFS instances
- Missing or expired secrets – Ceph credentials deleted or rotated without updating Kubernetes secrets
- Wrong secret namespaces – CSI driver cannot access secrets in different namespaces
- Mismatched cluster IDs – Storage class references incorrect Ceph cluster
Verifying pool existence before deploying storage classes prevents provisioning failures. Administrators should confirm that pools exist with ceph osd pool ls and validate file systems with ceph fs ls; this pre-deployment validation catches configuration errors before applications encounter them.
Secret management requires careful attention to the credential lifecycle. Rotating Ceph credentials requires updating the corresponding Kubernetes secrets before the old credentials expire. Using a separate Ceph user with minimal capabilities for each storage class improves security and simplifies troubleshooting when access issues occur.
Storage class parameters must match Ceph cluster capabilities. Specifying erasure-coded pools for metadata or requesting features unsupported by the deployed Ceph version causes silent failures that manifest as stuck provisioning operations.
Kernel vs FUSE CephFS clients and compatibility
CephFS supports two client implementations with different performance characteristics and compatibility requirements. The choice between the two has a direct impact on both performance and operational complexity of the environment:
- Kernel client – Higher performance, lower CPU overhead, requires compatible kernel versions
- FUSE client – Broader compatibility, userspace implementation, additional context switching overhead
- Feature parity – Some newer CephFS features only available in FUSE initially
Kernel client compatibility depends on the Linux kernel versions shipped with container host operating systems. Older kernels lack support for recent CephFS features or contain bugs that cause mount failures, so the kernel version is often the deciding factor in whether the kernel client is viable at all.
FUSE clients provide an escape hatch when kernel compatibility issues block deployments. Organizations running older Kubernetes node operating systems can use FUSE to access CephFS without first upgrading host kernels. For initial rollouts, the performance penalty typically matters less than deployment feasibility.
Switching between clients requires modifying the storage class: the mounter parameter controls client selection, allowing administrators to test both implementations with identical storage configurations. Benchmarking representative workloads against both clients identifies performance differences tied to specific access patterns.
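A storage class variant that forces the FUSE client could be sketched as follows; apart from the mounter parameter, the values mirror the earlier placeholder example.

```yaml
# Hedged sketch: same placeholder storage class, but selecting the FUSE mounter.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cephfs-sc-fuse
provisioner: cephfs.csi.ceph.com
parameters:
  clusterID: <cluster-id>
  fsName: myfs
  mounter: fuse                  # accepted values include "kernel" and "fuse"
  csi.storage.k8s.io/provisioner-secret-name: csi-cephfs-secret
  csi.storage.k8s.io/provisioner-secret-namespace: ceph-csi
  csi.storage.k8s.io/node-stage-secret-name: csi-cephfs-secret
  csi.storage.k8s.io/node-stage-secret-namespace: ceph-csi
reclaimPolicy: Delete
```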
Handling mount errors, slow requests, and stuck PGs
Operational issues manifest through mount failures, degraded performance, or stalled I/O operations. The diagnostic process examines mount logs, Ceph cluster health, and network connectivity to isolate problems.
Common operational problems:
- Mount timeout errors – Network connectivity issues or monitor unavailability
- Permission denied failures – Incorrect Ceph credentials or insufficient capabilities
- Slow request warnings – OSD performance problems or network congestion
- Stuck placement groups – OSD failures preventing data availability
- Out of space errors – Pool capacity exhaustion or quota limits reached
Mount errors tend to indicate authentication failures or network problems. Examining CSI node plugin logs often reveals specific error messages from the Ceph clients, and testing network connectivity from Kubernetes nodes to Ceph monitors and OSDs helps separate infrastructure issues from driver configuration problems.
Slow request warnings point to performance bottlenecks in the Ceph cluster. Common causes include failing disks, network saturation, and insufficient OSD resources; diagnosis requires examining OSD latency metrics and network utilization patterns.
Stuck placement groups prevent I/O operations on the affected data. Recovery involves identifying failed OSDs, replacing failed hardware, or intervening manually when automatic recovery stalls. Regular monitoring usually catches PG issues before they impact application availability.
Upgrading Ceph and Rook without downtime
Upgrade procedures must maintain data availability while transitioning to new software versions. The upgrade strategy depends heavily on whether you are using an external Ceph cluster or an in-cluster Rook deployment.
Upgrade considerations:
- Version compatibility – Verify Ceph version compatibility with Kubernetes and CSI driver versions
- Rolling upgrades – Update components sequentially to maintain quorum and availability
- Backup verification – Confirm backups exist before major version upgrades
- Testing procedures – Validate upgrades in non-production environments first
Rook automates upgrade orchestration via operator version updates. The operator manages rolling upgrades of Ceph daemons while maintaining cluster availability. Administrators update the Rook operator version, which then progressively upgrades Ceph components according to dependency requirements.
External Ceph clusters require manual upgrade orchestration using Ceph orchestration tooling or configuration management systems. Following the Ceph project's upgrade documentation ensures the correct sequence of monitor, OSD, and MDS upgrades; sticking to that sequence prevents compatibility issues between components running different versions.
Use Cases and Deployment Patterns
CephFS serves diverse workload types that require shared storage capabilities in Kubernetes environments. Understanding common deployment patterns helps architects select appropriate configurations for specific application requirements. The use case alignment determines storage class parameters, capacity planning, and performance optimization strategies.
Shared file storage for microservices and logs
Microservices architectures frequently require shared access to configuration files, static assets, and centralized logging directories. The shared storage pattern allows multiple service replicas to access common data without complex synchronization logic.
Common use cases for microservices:
- Configuration management – Centralized config files accessed by multiple pods
- Static content serving – Web assets shared across frontend replicas
- Shared uploads – User-generated content accessible to processing pipelines
- Centralized logging – Log aggregation from distributed services
Configuration sharing simplifies application deployments by eliminating separate configuration distribution mechanisms. Pods mount shared volumes containing environment-specific settings that can be updated without pod restarts. For large or frequently changing settings, the configuration volume pattern reduces deployment complexity compared to ConfigMaps.
Log aggregation benefits from shared volumes where application pods write logs to common directories. Log processing sidecars or separate log shipper deployments read from these volumes and forward logs to centralized systems, which for certain workload types yields simpler log collection than agent-based solutions.
High‑performance computing and AI workloads
HPC and machine learning workloads process large datasets that must be accessible across multiple compute nodes. The parallel access pattern leverages CephFS ReadWriteMany capabilities to provide shared dataset storage for distributed processing.
HPC and AI requirements include:
- Training dataset access – Large datasets shared across multiple training pods
- Checkpointing storage – Model checkpoints written from distributed training jobs
- Result aggregation – Output data collected from parallel processing tasks
- Shared model repositories – Pre-trained models accessible to inference workloads
Training workloads benefit from CephFS when datasets exceed node local storage capacity or when multiple training jobs share common datasets. Pods that run on different nodes read training data simultaneously without the need for dataset replication. The shared dataset approach helps reduce storage duplication while simplifying dataset management.
Checkpoint storage requires reliable writes from training processes that periodically save model state. CephFS provides consistent storage where checkpoints remain accessible even if training pods restart on different nodes. Recovery from failures becomes simpler when checkpoint data persists independently of pod lifecycle.
Container registries, CI/CD caches, and artifact storage
Development infrastructure requires shared storage for container images, build caches, and compiled artifacts. The artifact storage pattern provides durable storage for CI/CD pipelines and development workflows.
Development infrastructure use cases:
- Container registry backends – Registry storage backed by CephFS volumes
- Build artifact caching – Maven, npm, or pip caches shared across build agents
- Compiled artifact storage – Build outputs accessible to deployment pipelines
- Test result archival – Historical test results and coverage reports
Container registries like Harbor or GitLab Registry can use CephFS for image storage layers. Shared storage enables high availability for the registry, with multiple registry instances being able to serve requests while accessing common image data. The registry HA pattern improves reliability without requiring storage replication at the application layer.
CI/CD caches accelerate build processes by preserving downloaded dependencies across builds. Build agents running as Kubernetes pods mount shared cache volumes, eliminating redundant package downloads. Cache sharing reduces build times and external bandwidth consumption when multiple builds occur concurrently.
Multi‑cluster CephFS and external Ceph clusters
Organizations running multiple Kubernetes clusters can share CephFS storage across cluster boundaries. The multi-cluster pattern centralizes storage infrastructure while distributing compute across isolated Kubernetes environments.
Multi-cluster benefits include:
- Centralized storage management – Single Ceph cluster serves multiple Kubernetes clusters
- Cross-cluster data sharing – Workloads in different clusters access common datasets
- Disaster recovery – Backup clusters mount production data for failover scenarios
- Cost efficiency – Consolidated storage reduces infrastructure duplication
External Ceph clusters enable this pattern by remaining independent of individual Kubernetes cluster lifecycles. Each Kubernetes cluster deploys CSI drivers that are configured to access the shared external Ceph cluster. Storage provisioning and lifecycle management occur at the Ceph level, not within Kubernetes itself.
Security considerations also require careful planning. Network policies must allow Kubernetes nodes to reach Ceph monitors and OSDs while preventing unauthorized access. Namespace-level credential isolation ensures workloads in one cluster cannot access volumes provisioned for other clusters without explicit authorization.
Considerations for SMEs and Managed Services
Small and medium enterprises often lack dedicated storage teams to manage full Ceph deployments. Simplified solutions reduce operational complexity while providing CephFS functionality for Kubernetes workloads. The simplified deployment approach balances feature requirements against available operational expertise.
Using MicroCeph, MicroK8s, or QuantaStor
Lightweight Ceph distributions simplify initial deployments for organizations without extensive storage infrastructure experience. These solutions provide opinionated configurations that reduce decision complexity during setup.
Simplified deployment options:
- MicroCeph – Snap-based Ceph distribution with simplified installation and management
- MicroK8s – Lightweight Kubernetes with integrated storage addons including Ceph
- QuantaStor – Commercial unified storage platform supporting CephFS
- Managed Ceph services – Cloud provider offerings handling infrastructure management
MicroCeph reduces Ceph deployment complexity by automating common configuration tasks and providing sensible defaults for small clusters. Organizations can deploy functional Ceph clusters in minutes rather than hours, lowering the barrier to CephFS adoption. The quick start approach enables experimentation before committing to production infrastructure.
MicroK8s integrates storage capabilities directly into Kubernetes distributions, eliminating the need to deploy and configure separate storage clusters. Built-in addons provide CephFS functionality without requiring separate infrastructure planning. This integration suits development environments and small production deployments where operational simplicity outweighs customization needs.
Commercial solutions like QuantaStor provide vendor support and unified management interfaces. Organizations preferring commercial backing over community-supported software can adopt CephFS through these platforms while receiving enterprise support contracts.
Scaling CephFS as your Kubernetes clusters grow
Initial deployments often start small but must accommodate growth as workload requirements expand. The growth planning process should anticipate capacity, performance, and operational requirements at larger scales.
Scaling considerations include:
- Capacity expansion – Adding OSD nodes to increase storage capacity
- Performance scaling – Additional MDS instances for increased metadata operations
- Network upgrades – Higher bandwidth links as throughput requirements grow
- Monitoring evolution – More sophisticated observability as complexity increases
Starting with three-node Ceph clusters provides redundancy while minimizing initial hardware investment. Organizations can add OSD nodes incrementally as capacity requirements increase, with Ceph automatically rebalancing data across expanded clusters. The incremental growth model avoids over-provisioning while maintaining expansion flexibility.
Metadata server scaling becomes necessary when file operation rates exceed single MDS capacity. Transitioning from active-standby to active-active MDS configurations distributes namespace load across multiple servers. This transition requires careful planning to avoid disruption during configuration changes.
Migration from simplified solutions to production-grade deployments may become necessary as scale increases. Organizations outgrowing MicroCeph or embedded solutions can migrate to full Rook deployments or external Ceph clusters while preserving existing data through backup and restore procedures.
Backup and Recovery Strategies for CephFS in Kubernetes with Bacula
Protecting CephFS data requires backup strategies that capture volume contents while minimizing impact on running workloads. Bacula Enterprise, an advanced solution for complex, demanding, and HPC environments, provides sophisticated backup capabilities that integrate with CephFS through multiple approaches. The backup integration strategy must balance recovery objectives against operational complexity.
Bacula backup approaches for CephFS include:
- Direct filesystem backup – Bacula File Daemon accesses mounted CephFS volumes
- Snapshot-based backup – Capture CSI snapshots, then backup snapshot contents
- Application-consistent backup – Coordinate with applications before snapshot creation
- Bare metal recovery – Include Ceph configuration alongside data backups
Direct filesystem backups mount CephFS volumes on nodes running Bacula File Daemons. The daemon traverses directory structures and streams file contents to Bacula Storage Daemons for archival. This approach provides file-level granularity for restoration but requires careful scheduling to avoid performance impact during backup windows.
Snapshot-based workflows leverage CephFS snapshot capabilities through the CSI driver. Administrators create snapshots of PersistentVolumes, mount those snapshots to backup pods, and run Bacula File Daemon against snapshot mounts. The snapshot backup pattern provides consistency without impacting production workloads during backup operations.
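A hedged sketch of that flow: snapshot the claim, then restore the snapshot into a scratch claim that a backup pod (for example, one running a Bacula File Daemon) mounts. The snapshot class and all names are placeholders.

```yaml
# Snapshot an existing CephFS-backed claim, then restore it for backup use.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: shared-data-snap
  namespace: demo
spec:
  volumeSnapshotClassName: csi-cephfsplugin-snapclass   # placeholder snapshot class
  source:
    persistentVolumeClaimName: shared-data              # claim to protect
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data-restore
  namespace: demo
spec:
  storageClassName: cephfs-sc
  dataSource:
    name: shared-data-snap
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 20Gi             # must be at least the source claim size
```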
Application-consistent backups require coordination between backup tools and applications. Databases and stateful applications should flush buffers and pause writes before snapshot creation. Kubernetes operators or scripts can orchestrate application quiesce, snapshot creation, application resume, and backup initiation sequences.
Recovery procedures depend on backup granularity. File-level backups enable selective restoration of individual files or directories. Volume-level backups require restoring entire volumes, which suits disaster recovery scenarios where complete volume reconstruction is necessary.
Testing recovery procedures validates backup effectiveness. Organizations should regularly restore backups to verify data integrity and measure recovery time objectives. The recovery validation process identifies backup configuration problems before actual disaster scenarios occur.
Bacula retention policies should align with organizational compliance and capacity constraints. Defining appropriate retention periods for daily, weekly, and monthly backups prevents excessive storage consumption while maintaining required recovery points.
Key Takeaways
- CephFS enables ReadWriteMany access for multiple pods to share volumes simultaneously
- External Ceph clusters suit dedicated storage teams while Rook simplifies Kubernetes-native operations
- Storage classes require careful configuration of fsName, pools, and reclaim policies
- Performance optimization depends on proper MDS scaling and pool design choices
- Common issues include pool misconfiguration, credential problems, and client compatibility
- Use cases range from shared configuration to ML datasets and multi-cluster storage
- Start simple with MicroCeph but plan capacity expansion and monitoring evolution