JupyterHub Outage

Incident Report for CU Boulder CS

Resolved

JupyterHub's backend storage was changed to a CephFS pool backed by flash media during the spring break maintenance window. We believe that this reconfiguration will greatly improve stability.

Posted Mar 27, 2023 - 07:02 MDT

Update

The cluster reboot is complete and services have been restored.

Posted Mar 18, 2023 - 15:28 MDT

Update

Load averages on many cluster nodes have climbed to unacceptable values, leading to extremely poor performance. We are rebooting the cluster before it crashes completely.

Posted Mar 18, 2023 - 15:14 MDT

Monitoring

We adjusted some storage parameters early this morning and will continue to monitor the cluster.

Posted Mar 17, 2023 - 06:32 MDT

Update

All nodes have finished rebooting and the hub is available again. We are, however, still investigating the cause.

Posted Mar 16, 2023 - 18:17 MDT

Update

All worker nodes have now failed. We will perform a cluster-wide reboot for JupyterHub. It will be inaccessible until all nodes finish rebooting.

Posted Mar 16, 2023 - 17:54 MDT

Identified

Some worker nodes have once again failed. We are resetting those nodes and trying to locate the cause.

Posted Mar 16, 2023 - 17:46 MDT

Monitoring

Remaining worker nodes have been restored and the cluster should be back to full capacity. We are monitoring status.

Posted Mar 16, 2023 - 14:10 MDT

Update

A set of worker nodes has been restored, allowing the cluster to operate at reduced capacity.

Posted Mar 16, 2023 - 13:56 MDT

Identified

All JupyterHub backends are either degraded, slow to respond, or unavailable. We are working on restoring all worker nodes.

Posted Mar 16, 2023 - 13:00 MDT

This incident affected: JupyterHub.