Availability and Recovery

Plan availability and recovery by matching your environment to the appropriate deployment topology, high availability baseline, Global Cluster Disaster Recovery path, and backup or restore procedure.

For detailed sizing, backup, restore, and disaster recovery procedures, follow the linked topic-specific procedures. The guidance below focuses on planning choices and support boundaries.

Deployment Topology

Choose a topology by scenario rather than from a single table of production minimums.

Topology	Use when	Availability notes
Multi-Cluster	One `global` cluster manages multiple workload clusters.	Avoid non-platform business workloads on the `global` cluster. Size the `global` cluster for the number of managed clusters, user concurrency, API traffic, and installed plugins.
Single Cluster	One cluster intentionally runs both platform components and business workloads.	Production-capable when it fits a single-business-system scenario. Plan resources for both platform components and workloads. The `global` cluster control plane uses 3 nodes in this mode.
Single Node	Test or proof-of-concept environments.	Do not use for production.

For installation planning, see Plan and Prerequisites.

High Availability Baseline

For production-like environments, use 3 control plane nodes as the HA baseline. A 5 control plane node topology can improve scale and reliability for larger environments, but it is not a universal hard requirement for every production deployment.

Infra nodes or custom role nodes are useful for isolating platform components or high-load components, but they are not a universal requirement unless a sizing tier or component document says so. For Extra Large global cluster sizing, follow the sizing guidance for dedicated infra nodes.

Planning topic	Source
`global` cluster sizing by managed cluster count	Evaluating Resources for Global Cluster
Workload cluster control plane sizing and scale factors	Evaluating Resources for Workload Cluster
Node roles and infra/custom role nodes	Cluster Node Planning
Node requirements and OS/kernel support	Node Preprocessing

Component-Level High Availability

Standard HA topology deployments rely on the following mechanisms across platform components. This section is reference context for architects; you do not configure these per component yourself when the platform is deployed in an HA topology.

etcd

Deployed on the global cluster control plane nodes.
Uses the Raft consensus protocol for leader election and replicated state.
A three control plane node deployment tolerates one node failure; a five control plane node deployment tolerates two.
Supports snapshot backups to local storage and to remote object storage for recovery.
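
The failure tolerance figures above follow from Raft majority arithmetic: a cluster stays writable only while more than half of its voting members are reachable. The following is a minimal sketch of that arithmetic. The function names are illustrative and are not part of any platform or etcd API.

```python
def quorum(members: int) -> int:
    """Return the smallest majority of a cluster with `members` voting members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """Return how many members can fail while a majority is still reachable."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in (1, 3, 4, 5):
        print(f"{n} members: quorum={quorum(n)}, tolerates={tolerated_failures(n)}")
```

Note that four members tolerate the same single failure as three, which is why odd control plane node counts are the usual choice.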

Platform access and ingress

The cluster ingress gateway runs with multiple replicas and leader election.
An external load balancer, DNS record, or a self-built VIP with heartbeat detection and active-standby failover provides the platform access entry point.

Image registry and object storage

The image registry component runs in multiple replicas behind the cluster ingress, with a highly available database backend and a highly available cache backend.
The object storage backend used by the platform runs in distributed mode with erasure coding, data redundancy, and automatic recovery.

Monitoring and logging

Monitoring runs as multiple instances with query-time deduplication, distributed storage components, and cross-region redundancy options.
Logging components are deployed in distributed, multi-replica modes across ingestion, transport, storage, and query.

Container network

The container network interface (CNI) achieves high availability through stateless per-node agents and triple-replica control plane components.

Self-healing

Health checks, failover, and traffic redirection are coordinated by the Kubernetes control plane and platform components.
Transient failures are retried and workloads are rescheduled onto healthy nodes when a node becomes unavailable.

For topology-level planning, see High Availability Baseline. For cluster-level disaster recovery, see Global Cluster Disaster Recovery.

Global Cluster Disaster Recovery

Global Cluster Disaster Recovery protects the platform management entry point and global control-plane services when the Primary global cluster becomes unavailable.

For 4.3, Global DR has the following scope:

It uses Primary and Standby global clusters.
It relies on real-time synchronization of the resource state stored in the Primary global cluster etcd, except for resources in excluded namespaces.
It restores the platform entry point and global control-plane services by switching DNS or VIP access to the Standby cluster.
It expects the Primary and Standby clusters to follow the validated path: aligned versions, patches, component versions, and key configuration.

Do not treat Global DR as full platform data DR, application data DR, automatic failover, or an SLA-backed RPO/RTO commitment. Global DR does not cover registry data, chartmuseum data, other component data, application data, or resources excluded from etcd synchronization.

For the procedure and supported scenarios, see Global Cluster Disaster Recovery.

Backup and Restore Paths

Plan recovery per scenario. Each mechanism protects a specific data domain and has its own procedure, prerequisites, and limitations.

Scenario	Use	Source
Primary `global` cluster failure	Global Cluster Disaster Recovery.	Global Cluster Disaster Recovery
Accidental cluster-state deletion or rollback	etcd backup and restore.	etcd Backup and Restore
Registry image repository data	Registry backup and recovery.	Registry Data Backup and Recovery
Monitoring data	Monitoring component backup or restore.	VictoriaMetrics Backup and Recovery
Logging data	Logging component backup or restore, according to the installed logging backend.	Logging Service
Application resources and persistent volumes	Application backup and restore with Data Backup Essentials and Data Backup for Velero.	Backup Overview

Application backup can protect namespaces, Kubernetes resources, and persistent volume data according to the backup configuration. It does not support every storage or application data pattern. For example, hostPath PersistentVolumes are not supported by the documented application backup path, and database workloads should follow data-service-specific backup guidance.
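
For planning reviews, the scenario table can be captured as a small lookup that turns the data domains you need to recover into the procedures to follow. This is only a sketch: the dictionary keys and the `plan_recovery` helper are hypothetical names chosen for illustration, while the mechanism and source names are copied from the table.

```python
# Data domain -> (recovery mechanism, source procedure).
# Keys are hypothetical labels; values are taken from the scenario table.
RECOVERY_PATHS = {
    "global_cluster_failure": ("Global Cluster Disaster Recovery",
                               "Global Cluster Disaster Recovery"),
    "cluster_state": ("etcd backup and restore", "etcd Backup and Restore"),
    "registry_images": ("Registry backup and recovery",
                        "Registry Data Backup and Recovery"),
    "monitoring_data": ("Monitoring component backup or restore",
                        "VictoriaMetrics Backup and Recovery"),
    "logging_data": ("Logging component backup or restore", "Logging Service"),
    "applications": ("Application backup and restore", "Backup Overview"),
}


def plan_recovery(domains):
    """Map each requested data domain to its mechanism and source procedure."""
    unknown = sorted(set(domains) - set(RECOVERY_PATHS))
    if unknown:
        # An unmapped domain means no documented recovery path was selected.
        raise ValueError(f"no recovery path recorded for: {', '.join(unknown)}")
    return {domain: RECOVERY_PATHS[domain] for domain in domains}


if __name__ == "__main__":
    for domain, (mechanism, source) in plan_recovery(
        ["cluster_state", "registry_images"]
    ).items():
        print(f"{domain}: {mechanism} (see {source})")
```

Raising an error for an unmapped domain makes gaps visible early, for example a database workload that needs data-service-specific backup guidance rather than one of the paths listed here.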

Recovery Checklist

Use this checklist to choose follow-up work. For detailed steps, follow the linked procedures:

Choose Single Cluster, Multi-Cluster, or Single Node during installation planning.
Size the global cluster and workload clusters from the scalability guidance.
Decide whether Global Cluster Disaster Recovery is required before installing Core.
Configure backups for etcd, registry data, monitoring data, logging data, and applications according to the data domains you need to recover.
Run regular recovery checks and failover drills where your operational process requires them.
Verify platform access, global services, connected cluster access, and component-level recovery after failover or restore.
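
When failover is performed by switching DNS, one part of verifying platform access is confirming that the platform domain now resolves only to Standby addresses. The sketch below uses Python's standard `socket.getaddrinfo` for that check. The function names are illustrative, and the domain and addresses in the usage note are placeholders, not values from the platform. It checks name resolution only; it does not cover VIP-based switching or confirm that global services are healthy.

```python
import socket
from typing import Callable, Iterable, Set


def resolve_addresses(host: str) -> Set[str]:
    """Resolve `host` with the system resolver and return its IP addresses."""
    # Each getaddrinfo entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6 results.
    return {info[4][0] for info in socket.getaddrinfo(host, None)}


def entry_point_switched(
    host: str,
    standby_addresses: Iterable[str],
    resolver: Callable[[str], Set[str]] = resolve_addresses,
) -> bool:
    """Return True when `host` resolves only to the given Standby addresses."""
    resolved = resolver(host)
    # An empty result or any non-Standby address means the switch is incomplete.
    return bool(resolved) and resolved <= set(standby_addresses)
```

A call might look like `entry_point_switched("platform.example.com", {"192.0.2.20"})`, where both values are placeholders. Cached DNS records can delay the result, so repeat the check from the networks your users and connected clusters actually use.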

For related planning and operations paths, see Learn More.
