January 8th. We’re well into the cloud storage upgrade we started just before Christmas – upgrading all Ceph storage servers and Proxmox compute nodes to 40Gbit InfiniBand, and adding extra replication storage pools.
Virtual machines on the new storage are already performing better than the last generation of Proxmox/KVM VMs, and the improvements will only grow as more disk rows are upgraded. We’re now upgrading the SSD caches so that each “row” of spinning disks gets a new enterprise-SSD-based journal/cache, further enhancing performance.
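For the curious, pairing a spinning-disk OSD with an SSD journal looks roughly like this on a FileStore-era deployment. This is a sketch only – the device paths are placeholders, not our actual layout, and the commands need to run on a live Ceph node:

```shell
# Prepare a new OSD: data on a spinning disk (/dev/sdb), journal on a
# partition of an enterprise SSD (/dev/sdk1). Paths are illustrative.
ceph-disk prepare --cluster ceph /dev/sdb /dev/sdk1

# Activate the freshly prepared data partition so the OSD joins the cluster.
ceph-disk activate /dev/sdb1
```

Keeping the journal on fast flash means small synchronous writes are acknowledged by the SSD, while the spinning disk absorbs them at its own pace.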
Storage is spread across 10 HPE servers running Ceph (a mixture of DL360 and DL380 hosts), each connected by dual InfiniBand links (active/backup) at a line speed of 40Gbit.
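An active/backup pair like this is typically built with Linux bonding over the IPoIB interfaces. A minimal sketch of the idea for a Debian-based host such as Proxmox – interface names and addresses here are assumptions, not our production config:

```
auto bond0
iface bond0 inet static
    address 10.10.10.11
    netmask 255.255.255.0
    bond-slaves ib0 ib1
    bond-mode active-backup
    bond-miimon 100
```

With `active-backup`, only one link carries traffic at a time; the other takes over if the active one fails, which is why the usable line speed stays at 40Gbit rather than doubling.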
All that said, we did have a problem this afternoon when we swapped a disk in the array. The server didn’t acknowledge that the slot was empty before the new disk went in (a swap we’ve done successfully countless times before!), so when the new disk was inserted the volume dropped to a degraded state requiring human intervention. A small handful of VMs were affected. We’re still trying to work out why the server didn’t release the slot this time, but rebooting the storage node caused the SAS bus to rescan, and all disks were picked up again OK. Now that we know how it can happen, we can plan around it next time.
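For reference, the same SAS rescan can usually be triggered without a full reboot by poking the kernel’s SCSI host entries (requires root on the affected node – shown here as a general Linux technique, not the exact fix we applied):

```shell
# Ask every SCSI/SAS host adapter to rescan its bus for new devices.
# "- - -" means: all channels, all targets, all LUNs.
for host in /sys/class/scsi_host/host*; do
    echo "- - -" > "$host/scan"
done
```

Whether this works depends on the controller and its firmware having actually released the slot, which is exactly what didn’t happen in our case.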
Our Ceph storage array has “rows” dedicated to general, dedicated, and cache roles, comprising a mixture of SATA, SAS, and SSD disks. All customer data is spread across multiple physical servers for redundancy, and customers with dedicated storage pools (private cloud storage) get their own rows and journals, physically separate from other data.
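Carving out a physically separate “row” for a dedicated customer is done through Ceph’s CRUSH map. A hedged sketch using the standard `ceph osd crush` tooling – every name below is hypothetical, and pool sizing would differ in practice:

```shell
# Create a row bucket for the customer and place it under the default root.
ceph osd crush add-bucket customer-a-row row
ceph osd crush move customer-a-row root=default

# Build a replication rule that only chooses OSDs inside that row,
# using individual hosts as the failure domain.
ceph osd crush rule create-simple customer-a-rule customer-a-row host

# Create the customer's pool against that rule.
ceph osd pool create customer-a-pool 128 128 replicated customer-a-rule
```

Because the rule’s root is the customer’s own row bucket, CRUSH can never place that pool’s replicas on disks shared with other tenants.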
Ceph is expandable well beyond the multi-petabyte level. Ours currently stands at just over 60TB, so we have a long way to go!