JupyterHub Outage
Incident Report for CU Boulder CS
Resolved
JupyterHub's backend storage was changed to a CephFS pool backed by flash media during the spring break maintenance window. We believe that this reconfiguration will greatly improve stability.
Posted Mar 27, 2023 - 07:02 MDT
Update
The cluster reboot is complete and services have been restored.
Posted Mar 18, 2023 - 15:28 MDT
Update
Load averages on many cluster nodes have climbed to unacceptable values, leading to extremely poor performance. We are rebooting the cluster before it crashes completely.
Posted Mar 18, 2023 - 15:14 MDT
Monitoring
We adjusted some storage parameters early this morning and will continue to monitor the cluster.
Posted Mar 17, 2023 - 06:32 MDT
Update
All nodes have finished rebooting and the hub is available again. We are, however, still investigating the cause.
Posted Mar 16, 2023 - 18:17 MDT
Update
All worker nodes have now failed. We will perform a cluster-wide reboot for JupyterHub. The hub will be inaccessible until all nodes finish rebooting.
Posted Mar 16, 2023 - 17:54 MDT
Identified
Some worker nodes have failed again. We are resetting those nodes and trying to locate the cause.
Posted Mar 16, 2023 - 17:46 MDT
Monitoring
Remaining worker nodes have been restored and the cluster should be back to full capacity. We are monitoring status.
Posted Mar 16, 2023 - 14:10 MDT
Update
A set of worker nodes has been restored, allowing the cluster to operate at reduced capacity.
Posted Mar 16, 2023 - 13:56 MDT
Identified
All JupyterHub backends are either degraded, slow to respond, or unavailable. We are working on restoring all worker nodes.
Posted Mar 16, 2023 - 13:00 MDT
This incident affected: JupyterHub.