A perfect storm

On Tuesday there was an outage of over two hours in the Winklerstraße student residences that also caused problems for some of our users in the other buildings. We are terribly sorry and hope that the disruption did not cause you any major inconvenience. We describe what happened further below.

We know that network outages of this severity are very annoying. Since the middle of last year we have been stepping up our efforts to improve the resilience of our core network components.

In the coming weeks, we will restore the redundant connection to the university computing center, replace the old router in Winklerstraße 22 and deploy a new, fully redundant core network. It is designed so that the failure of any single optical fibre or core network component cannot cause an outage for our users.

I will now describe the cause of this week's problems. For Monday we had announced short network interruptions on our website and posted notices in the affected buildings. We wanted to update the network components in Winklerstraße to the newest software so that we would be able to restore the redundant connection to the university computing center (URZ). The URZ replaced its routers last year, which meant we could no longer use the "RIP" routing protocol. Everything worked as planned: there were two short network outages of less than five minutes.

However, for one of the upgrades we also had to change the partition layout, i.e. the way data is stored on the hard drives. All systems in StuNet store their data on at least two different hard drives simultaneously ("RAID"), so these kinds of changes can be made online without interrupting service. But when we rebooted the access router in Winklerstraße 22 on Tuesday to complete the upgrade, the system could no longer boot. Apparently the tool "resize2fs" had damaged the partition in such a way that the bootloader could no longer read it (the operating system itself was unaffected).

We are prepared for these kinds of problems, but neither of the two USB sticks we had brought could be used to boot the rescue system. We had brought a CD as well, but with an incompatible version (64-bit instead of 32-bit). We should have checked beforehand, but we simply did not expect both USB sticks to fail. So we had to find an internet-connected PC with a CD burner and a blank CD before we were able to boot our rescue system, create the partitions from scratch and copy the data over to restore the system to a usable state.

Unfortunately, our problems did not end there. On Tuesday and Wednesday we noticed that the fibre-optic interfaces were going offline at irregular intervals for several seconds at a time. We were finally able to trace the problem to a newer version of the driver for our Intel cards, and hopefully everything is now working again as you have come to expect from us.

AG StuNet