Degraded Stability in Computer Science Core Infrastructure
Incident Report for CU Boulder CS
Resolved
Stability issues should be resolved. We will continue to monitor service status.
Posted Feb 07, 2023 - 05:00 MST
Monitoring
The replacement network hardware has arrived and been installed in all cluster nodes. Emergency maintenance is scheduled for Feb 7, 2023, 03:00-07:00 MST to fully resolve the infrastructure stability issues. Please view details at the bottom of the page.
Posted Feb 06, 2023 - 01:13 MST
Identified
An issue has been identified in the Computer Science Cloud Storage platform.

Following the recent addition of the Ceph NVMe io2 tier, we are experiencing service degradation caused by congestion on the network attached to our cloud object storage platform.

Significant pause frames and packet loss are occurring on many nodes due to recent traffic increases. This can only be remediated by replacing the networking components on these nodes. The hardware has been ordered and is expected to be delivered in two weeks. We hope services will be fully restored by the second week of February.

The degradation will be especially apparent in services that are sensitive to I/O delay from flapping.
JupyterHub appears to be the most affected, followed by Moodle.

To mitigate downtime, services are being migrated from the io2 tier (NVMe) to the st1 tier (magnetic media).
Posted Jan 17, 2023 - 15:14 MST
This incident affected: Computer Science Core Infrastructure and Red Hat Ceph Object Storage Cluster.