OpenStack Private Cloud Architecture

A production OpenStack private cloud requires careful architecture across compute, storage, networking, and control plane layers. This guide presents a reference architecture for a medium-scale private cloud (50–200 compute nodes) using OpenStack 2024.2 Dalmatian.

Architecture Overview

Layer          Components                                                          Node Count
Control plane  Keystone, Nova API, Neutron, Glance, Cinder API, Horizon, HAProxy   3 (HA)
Network        Neutron L3/DHCP agents, OVN controllers                             2–3
Compute        Nova compute, OVS/OVN, libvirt                                      50–200
Storage        Ceph MON/MGR/OSD or enterprise SAN                                  3+ (Ceph)
Monitoring     Prometheus, Grafana, Alertmanager                                   1–3

Network Architecture

A production cloud uses four physically or logically separated networks:

Network            Purpose                               VLAN/Subnet                 MTU
Management         API, database, message queue          VLAN 10 / 172.29.236.0/22   1500
Tunnel/Overlay     VXLAN/Geneve between compute nodes    VLAN 20 / 172.29.240.0/22   9000
Storage            Ceph replication, iSCSI traffic       VLAN 30 / 172.29.244.0/22   9000
Provider/External  Tenant external access, floating IPs  VLAN 40 / public range      1500

Network Hardware

  • Spine-leaf topology for predictable latency
  • 25 GbE minimum for compute nodes (2x bonded)
  • 100 GbE spine uplinks
  • Jumbo frames (MTU 9000) on storage and tunnel networks
  • MLAG/VPC for switch-level redundancy
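On an Ubuntu-based compute node, the bonded, VLAN-tagged layout above might be expressed with netplan roughly as follows. Treat this as a sketch: the interface names and host addresses are placeholders, not values from this guide.

```yaml
# /etc/netplan/01-bond.yaml  (interface names and addresses are illustrative)
network:
  version: 2
  ethernets:
    enp1s0f0: {}
    enp1s0f1: {}
  bonds:
    bond0:
      interfaces: [enp1s0f0, enp1s0f1]
      parameters:
        mode: 802.3ad              # LACP, matching the MLAG/VPC switch pair
        lacp-rate: fast
        transmit-hash-policy: layer3+4
  vlans:
    bond0.20:                      # tunnel/overlay network
      id: 20
      link: bond0
      mtu: 9000
      addresses: [172.29.240.11/22]
    bond0.30:                      # storage network
      id: 30
      link: bond0
      mtu: 9000
      addresses: [172.29.244.11/22]
```

Note that jumbo frames must also be enabled end-to-end on the switch ports; a host-side MTU of 9000 with a switch-side MTU of 1500 causes hard-to-diagnose packet loss.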

Control Plane Design

The control plane runs on three nodes behind HAProxy for high availability:

                    +-----------+
                    | HAProxy   |
                    | (VIP)     |
                    +-----+-----+
                          |
            +-------------+-------------+
            |             |             |
      +-----+-----+ +----+------+ +----+------+
      | ctrl-01   | | ctrl-02   | | ctrl-03   |
      | Keystone  | | Keystone  | | Keystone  |
      | Nova API  | | Nova API  | | Nova API  |
      | Neutron   | | Neutron   | | Neutron   |
      | Glance    | | Glance    | | Glance    |
      | MariaDB   | | MariaDB   | | MariaDB   |
      | RabbitMQ  | | RabbitMQ  | | RabbitMQ  |
      +-----------+ +-----------+ +-----------+

Database

  • MariaDB Galera Cluster across all 3 controllers
  • Synchronous replication ensures consistency
  • Use SSDs for database storage
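A minimal Galera fragment for one of the three controllers might look like the following sketch; the file path, cluster name, and addresses are illustrative placeholders.

```ini
# /etc/mysql/mariadb.conf.d/99-galera.cnf  (addresses are placeholders)
[galera]
wsrep_on                 = ON
wsrep_provider           = /usr/lib/galera/libgalera_smm.so
wsrep_cluster_name       = openstack-db
wsrep_cluster_address    = gcomm://172.29.236.11,172.29.236.12,172.29.236.13
wsrep_node_address       = 172.29.236.11
binlog_format            = ROW
default_storage_engine   = InnoDB
innodb_autoinc_lock_mode = 2     # required by Galera for parallel applying
```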

Message Queue

  • 3-node RabbitMQ cluster for quorum
  • Use quorum queues for HA; classic mirrored queues are deprecated in current RabbitMQ releases
  • Monitor queue depth for capacity planning

API Load Balancing

  • HAProxy with a virtual IP (keepalived)
  • Health checks on each API endpoint
  • SSL termination at HAProxy
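As a sketch, one frontend/backend pair in haproxy.cfg, here for Keystone; the VIP, certificate path, and controller addresses are placeholders, and every other API service gets an analogous pair on its own port.

```text
# /etc/haproxy/haproxy.cfg  (illustrative; addresses and cert path are placeholders)
frontend keystone_public
    bind 172.29.236.10:5000 ssl crt /etc/haproxy/certs/cloud.pem
    default_backend keystone_backend

backend keystone_backend
    balance roundrobin
    option httpchk GET /v3
    server ctrl-01 172.29.236.11:5000 check inter 2000 rise 2 fall 3
    server ctrl-02 172.29.236.12:5000 check inter 2000 rise 2 fall 3
    server ctrl-03 172.29.236.13:5000 check inter 2000 rise 2 fall 3
```

keepalived moves the 172.29.236.10 VIP between the controllers so that HAProxy itself is not a single point of failure.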

Compute Node Design

Each compute node runs:

Service                     Purpose
nova-compute                Instance lifecycle management
OVN controller / OVS agent  Virtual networking
libvirt + QEMU/KVM          Hypervisor
ceph-common                 RBD client for Ceph storage
telegraf/node_exporter      Monitoring agent

Hardware Recommendations

Component       Specification
CPU             2x Intel Xeon or AMD EPYC (64+ cores total)
RAM             512 GB–1 TB DDR5
Boot disk       2x 480 GB SSD (RAID 1)
NIC             2x 25 GbE (bonded, LACP)
GPU (optional)  NVIDIA A100/H100 for AI workloads

CPU and RAM Allocation

# /etc/nova/nova.conf
[DEFAULT]
cpu_allocation_ratio = 4.0     # 4:1 for general workloads
ram_allocation_ratio = 1.0     # no RAM overcommit in production
reserved_host_memory_mb = 8192 # reserve 8 GB for host OS
reserved_host_cpus = 4         # reserve 4 cores for host

Storage Architecture

Ceph (Recommended)

Component      Specification
MON/MGR nodes  3 (can be colocated with controllers)
OSD nodes      5+ dedicated storage nodes
OSD disks      NVMe for performance, HDD for capacity
Replication    3x for production
Pools          volumes (Cinder), images (Glance), vms (Nova ephemeral)

Storage Tiers

Use Ceph CRUSH rules to create tiers:

# Fast tier (NVMe)
ceph osd crush rule create-replicated fast-rule default host ssd
ceph osd pool create fast-volumes 128 128 replicated fast-rule

# Bulk tier (HDD)
ceph osd crush rule create-replicated bulk-rule default host hdd
ceph osd pool create bulk-volumes 128 128 replicated bulk-rule

Map to Cinder volume types for user-facing storage tiers.
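One way to wire those pools into Cinder is a multi-backend configuration; the section names and volume_backend_name values below are illustrative, not prescribed by this guide.

```ini
# /etc/cinder/cinder.conf  (illustrative multi-backend mapping to the pools above)
[DEFAULT]
enabled_backends = fast-rbd,bulk-rbd

[fast-rbd]
volume_driver       = cinder.volume.drivers.rbd.RBDDriver
rbd_pool            = fast-volumes
rbd_ceph_conf       = /etc/ceph/ceph.conf
rbd_user            = cinder
volume_backend_name = FAST

[bulk-rbd]
volume_driver       = cinder.volume.drivers.rbd.RBDDriver
rbd_pool            = bulk-volumes
rbd_ceph_conf       = /etc/ceph/ceph.conf
rbd_user            = cinder
volume_backend_name = BULK
```

User-facing volume types are then created against each backend name, e.g. `openstack volume type create --property volume_backend_name=FAST fast`, so tenants pick a tier without knowing the Ceph topology.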

Security Architecture

Layer           Mechanism
API             TLS everywhere (HAProxy termination or end-to-end)
Authentication  Keystone with LDAP/AD federation
Network         Security groups, project isolation
Secrets         Barbican for key management
Compliance      Audit logging via oslo.messaging notifications

Monitoring and Operations

Tool                  Purpose
Prometheus + Grafana  Metrics collection and dashboards
Alertmanager          Alert routing (PagerDuty, Slack)
ELK / Loki            Log aggregation
Ceilometer / Gnocchi  OpenStack-native metering
Rally / Tempest       Performance and integration testing

Key Metrics to Monitor

  • Nova: instance count, scheduler latency, hypervisor utilization
  • Neutron: agent health, port creation latency
  • Ceph: cluster health, OSD latency, pool usage
  • RabbitMQ: queue depth, message rates
  • HAProxy: backend health, request latency
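A couple of the metrics above, expressed as Prometheus alert rules. Metric names and thresholds depend on the exporters actually deployed (RabbitMQ's built-in prometheus plugin and the Ceph mgr prometheus module are assumed here), so treat the values as a starting point.

```yaml
# alerts.yml  (illustrative; metric names assume specific exporters)
groups:
  - name: openstack
    rules:
      - alert: RabbitMQQueueBacklog
        expr: rabbitmq_queue_messages > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "RabbitMQ queue depth is high"
      - alert: CephHealthError
        expr: ceph_health_status == 2   # 0=OK, 1=WARN, 2=ERR in the mgr module
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Ceph cluster reports HEALTH_ERR"
```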

Capacity Planning

Resource            Formula
vCPUs available     (physical cores - reserved) x allocation ratio
RAM available       (physical RAM - reserved) x allocation ratio
Storage             (total OSD capacity / replication factor) x 0.85
Instances per host  min(vCPUs / flavor vCPUs, RAM / flavor RAM)

(Placement applies the allocation ratio after subtracting reserved resources, so reserved cores and RAM are excluded before overcommit.)
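The formulas above can be sketched as a small calculator. The host and flavor figures in the example are illustrative (they match the nova.conf values earlier in this guide), not a sizing recommendation.

```python
# Capacity calculator implementing the planning formulas above.
# Host and flavor figures below are illustrative examples only.

def vcpus_available(physical_cores, reserved_cores, allocation_ratio):
    """Reserved cores are excluded before overcommit is applied."""
    return (physical_cores - reserved_cores) * allocation_ratio

def ram_available_gb(physical_ram_gb, reserved_gb, allocation_ratio):
    return (physical_ram_gb - reserved_gb) * allocation_ratio

def usable_storage_tb(total_osd_tb, replication_factor, headroom=0.85):
    # Keep ~15% free so Ceph can rebalance after an OSD failure.
    return total_osd_tb / replication_factor * headroom

def instances_per_host(vcpus, ram_gb, flavor_vcpus, flavor_ram_gb):
    # The scarcer resource (CPU or RAM) caps instance density.
    return int(min(vcpus // flavor_vcpus, ram_gb // flavor_ram_gb))

# Example: 64-core host with 4 cores reserved and 4:1 CPU overcommit;
# 512 GB RAM with 8 GB reserved and no RAM overcommit.
vcpus = vcpus_available(64, 4, 4.0)      # 240.0 vCPUs
ram = ram_available_gb(512, 8, 1.0)      # 504.0 GB
print(instances_per_host(vcpus, ram, flavor_vcpus=4, flavor_ram_gb=8))  # 60
print(usable_storage_tb(total_osd_tb=600, replication_factor=3))        # 170.0
```

With a 4 vCPU / 8 GB flavor this host is CPU-bound (60 instances by CPU vs 63 by RAM), which is typical under 4:1 overcommit.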

Deployment Tools

Tool                Best For
OpenStack-Ansible   Full control, bare-metal deployments
Kolla-Ansible       Containerized services, easy upgrades
TripleO/Director    Legacy Red Hat environments (retired upstream)
Sunbeam/MicroStack  Small-scale Canonical deployments

Summary

A production OpenStack private cloud requires a three-node HA control plane, spine-leaf networking with jumbo frames, Ceph distributed storage, and comprehensive monitoring. The architecture scales from 50 to 200+ compute nodes by adding hardware without changing the control plane design.