Resolved
The current incident has been mitigated.
Monitoring
We've identified a faulty node and removed it. We have double-checked that no data loss occurred at any point.
We're continuing to monitor the current state. In parallel, we'll start investigating how to fix the scheduling error that led to this affecting users.
Investigating
We've noticed an increased number of workspace start failures, in addition to abnormally high CPU usage. Some users reported "OutOfmemory: Pod Node didn't have enough resources" errors.
We have added resources and are investigating the root cause.