The problem occurred just after 08:00 UTC on 27th July 21.
The CPU usage index reached 1.13, which appears to have caused a problem with this course, https://dd4t.dadesktop.com/da/courses/0fed1f7e-df3a-41c2-bbee-e4b74a1a0373, where VMs were freezing / inaccessible at the same time.
VMs on this course were also reported to be frozen / inaccessible on 26th July at 14:04 UTC and again at 14:08 UTC. I then moved 3 of the ten VMs to the bl3de server.
Please see the screenshots of the DD node Apache logs on bl4de.npg.io and of journalctl for the dd4t container, which also show Apache dying.
HOW TO REPLICATE ISSUE
Later, at 15:47 UTC, I found that this problem can easily be replicated. Visit the course URL above, then press hard reload in the browser (Firefox). For some of the VMs you will see ‘Operation in progress’, which takes longer than it should, and then 503 errors appear.
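The manual steps above could be roughly approximated with a script, so that the 503s can be counted rather than watched for. This is only a sketch under assumptions: the function names `fetch_status` and `probe`, the number of attempts and the delay are all hypothetical, the `Cache-Control` / `Pragma` headers only imitate a hard reload, and the course page may require a logged-in session that this script does not provide. A real browser reload also fetches every VM panel on the page, which a single GET does not.

```python
import time
import urllib.error
import urllib.request
from collections import Counter

# Course URL taken from the report above.
COURSE_URL = "https://dd4t.dadesktop.com/da/courses/0fed1f7e-df3a-41c2-bbee-e4b74a1a0373"


def fetch_status(url, timeout=30.0):
    """Do one uncached GET and return the HTTP status code, or None if no response."""
    request = urllib.request.Request(
        url,
        # Assumption: these headers approximate a browser hard reload.
        headers={"Cache-Control": "no-cache", "Pragma": "no-cache"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses (including 503) are raised as HTTPError.
        return err.code
    except OSError:
        # Connection refused, DNS failure, timeout: no HTTP status at all.
        return None


def probe(url, attempts=20, delay=1.0, fetch=fetch_status):
    """Request `url` repeatedly and count how often each status code comes back."""
    counts = Counter()
    for i in range(attempts):
        counts[fetch(url)] += 1
        if i < attempts - 1:
            time.sleep(delay)
    return counts


if __name__ == "__main__":
    results = probe(COURSE_URL)
    for status, count in results.items():
        print(f"{status}: {count}")
    print("503 responses:", results[503])
```

Running this while watching the Apache logs on the DD node would show whether the 503 count lines up with Apache dying. The `fetch` parameter is there so the counting logic can be checked without touching the live server.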
Then the trainer said that the problem with the VMs freezing / becoming inaccessible had happened again. Zabbix and Grafana both clearly show spikes in CPU usage, network traffic and memory usage on the bl4 and bl3 servers.
video which shows the problem clearly