Dear Customers,
We would like to inform you about a network incident that occurred on Saturday, June 14, 2025, impacting our infrastructure at the Munich data center. We understand how critical uninterrupted service is for our customers, and we are committed to full transparency regarding this event, its resolution, and the improvements we are implementing to strengthen our systems.
Incident Timeline
01:30 AM – We received initial alerts of backup failures in the Munich environment. These were the first signs of a malfunction in our core router cr2.MUC1.
06:30 AM – Technicians arrived at the data center and began diagnostics on-site.
07:20 AM – A potentially faulty line card was replaced with a spare, and the router was restarted.
07:30 AM – The router crashed again, affecting routing across most VLANs and subnets.
08:15 AM – A second spare line card was brought in from our Network Operations Center (NOC) and installed.
08:25 AM – The router crashed again with identical symptoms.
Our team collected the crash data, documented the recurring error output, and forwarded everything to vendor support for analysis. Unfortunately, vendor response times were delayed because the incident fell on a Saturday.
While waiting for vendor feedback, we activated our emergency routing system and began migrating VLANs and subnets to maintain service continuity.
09:40 AM – The first services began to come back online.
10:00 AM – Most services had been restored, except for two VLANs/subnets and one caching nameserver.
10:30 AM – All IPv4 services were operational, with partial IPv6 restoration.
12:00 PM (midday) – Full restoration was completed, including IPv6 and special routing configurations.
Service Credits
All customers covered by a Service Level Agreement (SLA) have already received or will shortly receive the appropriate service credits. Should you have any questions regarding these credits or need further assistance, our support team is at your disposal.
Root Cause & Future Improvements
While initial symptoms suggested a hardware failure, our analysis indicates the root cause was most likely a software fault, possibly related to firmware or a bug in a routing process under specific load conditions.
To reduce the risk of such issues in the future, we are accelerating our infrastructure diversification strategy, including:
Introducing routing platforms based on open-source software.
Expanding the use of generic and redundant hardware for critical network functions. (Note: redundancy was in place and sufficient spare parts were available at the NOC during the incident.)
Enhancing our disaster recovery and emergency routing capabilities to ensure faster and more seamless failover in the event of future disruptions.
These changes aim to minimize vendor lock-in, improve fault isolation, and increase the overall resilience and flexibility of our infrastructure.
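For technically interested customers, the sketch below illustrates the general principle behind automated failover of the kind described above. It is a deliberately simplified, hypothetical Python example: the gateway addresses, thresholds, and the use of Linux iproute2 commands are placeholders for illustration only and do not represent our production emergency routing system.

#!/usr/bin/env python3
# Illustrative sketch only: probe a primary next-hop and, after repeated
# failures, shift the default route to a backup next-hop. All values below
# are placeholders, not production configuration.

import subprocess
import time

PRIMARY_GW = "192.0.2.1"    # placeholder primary next-hop (TEST-NET-1 range)
BACKUP_GW = "192.0.2.254"   # placeholder backup next-hop
FAIL_THRESHOLD = 3          # consecutive failed probes before failing over
PROBE_INTERVAL = 5          # seconds between probes


def gateway_reachable(address: str) -> bool:
    """Send a single ICMP echo request and report whether it succeeded."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def fail_over_to(backup: str) -> None:
    """Replace the default route with the backup next-hop (Linux iproute2)."""
    subprocess.run(["ip", "route", "replace", "default", "via", backup], check=True)


def main() -> None:
    failures = 0
    while True:
        if gateway_reachable(PRIMARY_GW):
            failures = 0
        else:
            failures += 1
            if failures >= FAIL_THRESHOLD:
                fail_over_to(BACKUP_GW)
                break
        time.sleep(PROBE_INTERVAL)


if __name__ == "__main__":
    main()

In practice, our failover relies on routing protocols and redundant paths rather than a standalone script; the sketch is only meant to convey the idea of detecting a failure and switching traffic to an alternative path automatically.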
Communication During the Outage
Throughout the incident, our NOC remained fully operational and reachable via phone and other communication channels. We thank the many customers who stayed in touch with us during the recovery process.
We sincerely apologize for the inconvenience caused by this incident. Thank you for your continued trust and understanding. We remain fully committed to delivering reliable services and transparent communication at all times.
With best regards,
Your Server24 Team
Incident UUID: a69afeec-aabe-41d1-ae70-175df7b025c0