Monitoring - Network hardware has arrived and been installed in all cluster nodes. Emergency maintenance is scheduled for Feb 7, 2023, 03:00-07:00 MST in an attempt to fully resolve the infrastructure stability issues. Please view details at the bottom of the page.
Feb 06, 2023 - 01:13 MST
Identified - An issue has been identified in the Computer Science Cloud Storage platform.
Following the recent addition of the Ceph NVMe io2 tier, we are experiencing service degradation caused by network congestion on our cloud object storage platform.
Significant pause frames and packet loss are occurring on many nodes due to recent traffic increases. This can only be remediated by replacing the networking components on these nodes. The replacement hardware has been ordered and is expected to be delivered in two weeks. We hope services will be fully restored by the second week of February.
This will be especially apparent in services sensitive to IO delay from flapping. JupyterHub appears to be the most affected, followed by Moodle.
To mitigate downtime, services are being migrated off the io2 tier (NVMe) to the st1 tier (magnetic media).
Jan 17, 2023 - 15:14 MST
Computer Science Core Infrastructure: Under Maintenance
Science Network: Operational
Red Hat Ceph Object Storage Cluster: Operational (100.0% uptime over the past 90 days)
JupyterHub: Under Maintenance (99.78% uptime over the past 90 days)
CS Cloud OpenStack Platform: Under Maintenance (99.81% uptime over the past 90 days)
CS vSphere for VDI Labs: Operational (100.0% uptime over the past 90 days)
Managed Servers: Operational
Moodle LTI Provider to Canvas: Under Maintenance (99.77% uptime over the past 90 days)
Moodle Computer Science Post-Baccalaureate: Under Maintenance (99.83% uptime over the past 90 days)
ELRA Environment: Operational
CEAS Redirector: Operational
Departmental Sites: Operational
CS Home: Operational
CS Financials: Operational
Foundations: Operational
IEEE FOCS 2021: Operational
Resolved - Our monitoring systems alerted us to a sudden increase in storage utilization over the weekend for the JupyterHub datastore. Storage was exhausted at around 5:30 PM MST today, Sunday, February 5th. During this time, users would have seen intermittent errors such as load failures and messages similar to "no space left on device". The disk was live-expanded at around 6:00 PM MST.
Feb 5, 17:30 MST