Apache is dying on the DD bl4de node; the issue is also seen in the dd4t Apache logs, and VMs are freezing

The problem occurred just after 08:00 UTC on 27 July 2021.

The CPU usage index reached 1.13, which appears to have caused a problem with this course, https://dd4t.dadesktop.com/da/courses/0fed1f7e-df3a-41c2-bbee-e4b74a1a0373, where VMs were freezing or inaccessible at the same time.

VMs on this course were also reported to be frozen or inaccessible on 26 July at 14:04 UTC and again at 14:08 UTC. I then moved 3 of the 10 VMs to the bl3de server.

Please see the screenshots of the DD node Apache logs on bl4de.npg.io and of the journalctl output for the dd4t container, which also show Apache dying.

HOW TO REPLICATE ISSUE

Later, at 15:47 UTC, I found that this problem can easily be replicated. Visit the course URL above, then do a hard reload in the browser (Firefox). For some of the VMs you will see 'Operation in progress', which takes longer than it should, and then 503 errors appear.

The trainer then said that the freezing/inaccessibility problem with the VMs had happened again. Zabbix and Grafana clearly show spikes in CPU usage, network traffic and memory usage on both the bl4 and bl3 servers.


video which shows the problem clearly

Post by Xiong Peng

Run the following command and you can see that there have been many network interface down events in the past two days. A detailed check of the journal log shows many failed attempts to bring the interface back up after each down event. In other words, the server's network was interrupted for a few minutes each time, which left dd4t unable to connect to it and made the noVNC connection appear frozen:
journalctl -x --since='2021-07-26 00:00:00' | grep 'Link is Down'
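
To line the interface drops up against the reported freeze times, the same search can be grouped by hour. The following is only a sketch: the function name count_link_down_per_hour is made up for illustration, and it assumes the journal is read with `-o short-iso` so that each line starts with an ISO timestamp (`-x` is left out so the extra explanatory lines do not get in the way).

```shell
# Hypothetical helper: count "Link is Down" journal lines per hour.
# Expects journalctl output in short-iso format on stdin, where the first
# field is a timestamp such as 2021-07-26T14:04:31+0000.
count_link_down_per_hour() {
    # Characters 1-13 of the timestamp are the date plus the hour (YYYY-MM-DDTHH).
    awk '/Link is Down/ { count[substr($1, 1, 13)]++ }
         END { for (h in count) print h, count[h] }' | sort
}

# Usage on the affected server (assumes permission to read the journal):
#   journalctl -o short-iso --since='2021-07-26 00:00:00' | count_link_down_per_hour
```

Each output line is an hour followed by the number of down events in that hour, which can then be compared with the times the trainer reported and with the Zabbix and Grafana spikes.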

The problem can be reproduced when the 8 heavy machines of this course are running at the same time. However, after moving 2 machines to another server the problem no longer seems to appear, so it is probably related to a lack of free memory.

The trainer told me that the VM for delegate Bruno was inaccessible for a moment or two about 2 hours ago, i.e. at 11:08 UTC.

This VM was on the bl3 server, whereas almost all previous issues were spotted on bl4.

Memory appeared OK at this time, though.