We wrote to Pure Fibre to ask about the recent Internet outage. For me, that was four hours on Wednesday 26 May, and one hour on Saturday 29 May. Our questions (in bold) and their answers (in italic) are below.
The short story is that the outage affected 50% of residents. We have two links to London, where we connect to the rest of the Internet. Each household is connected through one of these links and can be switched over to the other if there's a failure. Sadly, the routers in our houses don't always respond well to being switched over or switched back.
Pure Fibre says that they have fixed an issue that made it hard to identify the cause of the outage. And they’re working with the manufacturers of our routers to fix the problem with switching between circuits. Fixing that problem should make it possible to implement automatic failover, and that could make outages like this a thing of the past.
NB: the text below is an almost exact copy of their email. I’ve fixed a few typos, and added “DRA” and “PF” to identify the authors of each paragraph.
DRA (Ian Eiloart): I’m writing about the downtime yesterday, seeking some more information about it.
- DRA: As far as I understand, the outage affected many residents. Would it be fair to say that about 50% of residents were affected? Do you have a more precise number?
- PF (Mark Trojacek): We balance the number of users across the two connections into the development, and we lost one of them, so 50% of residents lost their service when that line went down.
- DRA: From your status site, it appears that one of the uplinks to London failed due to a fibre cut. Is it reasonable to assume that this was the old route along the railway line?
- PF: No, this was actually the newer line that went down. It was a result of a fibre break on a section of cable owned by CityFibre. We have not been provided with the precise location but have been told that it was somewhere in Yorkshire.
- DRA: For this to cause outages at Derwenthorpe, it must be the case either that (a) redundant routing is in place, but failed, or (b) automatic failover is in place, but failed, or (c) that you have no automatic redundancy at all. Can you say which is the case, please?
- PF: While we have redundant routing in place at Derwenthorpe, we have not been able to implement automatic failover. Although automatic failover is in place in our core, since deploying the second line last year we have found that the customer premises equipment [CPE] devices have struggled to react to the change in gateway MAC address as we switch from one core router to the other. As part of necessary security over the network, we force traffic from the CPE to the learned MAC of the gateway, but when this changes (even though the IP address of the gateway remains the same) we have found that many of the devices hang on to the old MAC address.
- While we are working on a resolution to this, we currently have to perform a manual changeover, rebooting all the CPE just prior to switching over from one core router to the other. Even with this manual reboot we still find that 10-15% of the devices need a further reboot to re-associate with the gateway MAC. While this is not so much of an issue when the circuit goes down, it becomes more of an issue when we have to switch back to the original circuit, as we then have users with a working connection that suddenly goes down for a second time when the circuit is restored.
- I believe that your device may have experienced this issue over the weekend. For our part, we have no visibility of which devices have working service, as they all appear to be online, and so we generally field a number of calls from residents who we have advised that the issue is resolved, yet who are still without service.
- We are working on a resolution of this with the manufacturer but for the moment we have to be cautious about when we switch networks over to ensure that we don’t generate more disruption than the original outage causes.
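The stale-MAC problem PF describes can be illustrated with a small sketch. This is not Pure Fibre's code, and the addresses, class, and cache behaviour are all hypothetical; it only models the mechanism they outline: the gateway IP stays the same across failover, but the MAC changes, and a CPE that pins the learned MAC keeps sending frames to the dead router until its cache is cleared (for example, by a reboot).

```python
GATEWAY_IP = "10.0.0.1"            # hypothetical addresses, for illustration only
ROUTER_A_MAC = "aa:aa:aa:aa:aa:aa"  # original core router
ROUTER_B_MAC = "bb:bb:bb:bb:bb:bb"  # backup core router

class CPE:
    """A customer device with a simple ARP-style cache for its gateway."""
    def __init__(self):
        self.arp_cache = {}

    def learn_gateway(self, ip, mac):
        # The MAC is learned once and pinned; it is not refreshed
        # until the cache is cleared.
        self.arp_cache[ip] = mac

    def send_upstream(self, live_gateway_mac):
        # Delivery succeeds only if the cached MAC matches whichever
        # router is actually answering the gateway IP right now.
        return self.arp_cache.get(GATEWAY_IP) == live_gateway_mac

    def reboot(self):
        # Rebooting clears the cache, forcing the gateway MAC to be relearned.
        self.arp_cache.clear()

cpe = CPE()
cpe.learn_gateway(GATEWAY_IP, ROUTER_A_MAC)
assert cpe.send_upstream(ROUTER_A_MAC)       # normal operation

# Failover: the same gateway IP is now answered by router B, with a new MAC.
assert not cpe.send_upstream(ROUTER_B_MAC)   # CPE still sends to the old MAC

cpe.reboot()                                 # the manual fix PF describes
cpe.learn_gateway(GATEWAY_IP, ROUTER_B_MAC)
assert cpe.send_upstream(ROUTER_B_MAC)       # service restored
```

On this model, the 10-15% of devices that need a second reboot would be those whose caches are somehow repopulated with the old MAC, which is presumably part of what PF is investigating with the manufacturer.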
- DRA: Manual failover between circuits is available, but it was not employed for about four hours after you first reported the incident, and two hours after you reported that the circuit failure was identified. Why is this?
- PF: While we were aware of a circuit failure, we were working with the provider to identify the cause. As detailed above, we did not want to switch the circuits if there was likely to be a rapid resolution. Unfortunately, there was some confusion between ourselves and the provider about circuit IDs (we still have the original 1Gbps circuit into Derwenthorpe as well as the 10Gbps bearer), and for a couple of hours we and they were trying to identify the cause of the issue. At that point, we didn't know that we were the victim of a fibre break, and so we were performing tests with them to identify the cause of the outage. Once we had confirmed the precise circuit reference with them, a new ticket needed to be opened, at which point we made the decision to cut over the line, as we were concerned about further delays with the new ticket.
- The original circuit was not restored until the early hours of Friday morning and so we monitored it during the day to ensure that it was stable prior to switching customers back onto it in the early hours of Saturday morning.
- DRA: Will you be putting in place an automatic failover mechanism? If not, why not, and what procedures will you be implementing to ensure that manual failover is deployed much earlier?
- PF: We continue to work on deploying automatic failover, and it remains our intention to implement it as soon as possible. We have spare redundant lines into one of our other sites and have equipment deployed there to assist with the testing. We have cleared up the confusion over circuit IDs with the provider, which, in this instance, was the main reason for the prolonged delay in making the decision to cut over onto the backup circuit.