Internet Outage

We wrote to Pure Fibre to ask about the recent Internet outage. For me, that was four hours on Wednesday 26 May, and one hour on Saturday 29 May. Our questions (in bold) and their answers (in italic) are below.

The short story is that the outage affected 50% of residents. We have two links to London, where we connect to the rest of the Internet. Each household is connected through one of these links and can be switched over to the other if there’s a failure. Sadly, the routers in our houses don’t always respond well to being switched over, or being switched back.

Pure Fibre says that they have fixed an issue that made it hard to identify the cause of the outage. And they’re working with the manufacturers of our routers to fix the problem with switching between circuits. Fixing that problem should make it possible to implement automatic failover, and that could make outages like this a thing of the past.

NB: the text below is an almost exact copy of their email. I’ve fixed a few typos, and added “DRA” and “PF” to identify the authors of each paragraph.

DRA (Ian Eiloart): I’m writing about the downtime yesterday, seeking some more information about it. 

  1. DRA: As far as I understand, the outage affected many residents. Would it be fair to say that about 50% of residents were affected? Do you have a more precise number?
    • PF (Mark Trojacek): We lost one of the two connections into the development and balance the number of users on each of the two lines, so 50% of residents lost their service when the line went down.
  2. DRA: From your status site, it appears that one of the uplinks to London failed due to a fibre cut. Is it reasonable to assume that this was the old route along the railway line?
    • PF: No, this was actually the newer line that went down. It was a result of a fibre break on a section of cable owned by CityFibre. We have not been provided with the precise location but have been told that it was somewhere in Yorkshire.  
  3. DRA: For this to cause outages at Derwenthorpe, it must be the case either that (a) redundant routing is in place, but failed, or (b) automatic failover is in place, but failed, or (c) that you have no automatic redundancy at all. Can you say which is the case, please?
    • PF: While we have redundant routing in place at Derwenthorpe, we have not been able to implement automatic failover. Although automatic failover is in place in our core, since deploying the second line last year, we have found that the customer premise equipment [CPE] devices have struggled to react to the change in gateway MAC address as we switch from one core router to the other. As part of necessary security over the network, we force traffic from the CPE to the learned MAC of the gateway but when this changes (even though the IP address of the gateway remains the same) we have found that many of the devices hang on to the old MAC address. 
    • While we are working on a resolution to this, we currently have to invoke a manual changeover to reboot all the CPE just prior to switching over from one core router to the other. Even with this manual re-boot we still find 10-15% of the devices need a further re-boot to re-associate with the gateway MAC. While this is not so much of an issue when the circuit goes down, it becomes more of an issue when we have to switch back to the original circuit as we have users with a working connection that suddenly goes down for a second time when the circuit is restored.
    • I believe that your device may have experienced this issue over the weekend. For our part, we have no visibility of which devices are online and which are not as they all appear to be online and so we generally field a number of calls from residents who we have advised that the issue is resolved, yet who still have lost service.  
    • We are working on a resolution of this with the manufacturer but for the moment we have to be cautious about when we switch networks over to ensure that we don’t generate more disruption than the original outage causes.
  4. DRA: Manual failover between circuits is available, but it was not employed for about four hours after you first reported the incident, and two hours after you reported that the circuit failure was identified. Why is this?
    • PF: While we were aware of a circuit failure, we were working with the provider to identify the cause. As detailed above, we did not want to switch the circuits if there was likely to be a rapid resolution. Unfortunately, there was some confusion between ourselves and the provider about circuit IDs (we still have the original 1Gbps circuit into Derwenthorpe as well as the 10Gbps bearer) and for a couple of hours, we/they were trying to identify the cause of the issue. At that point, we didn’t know that we were the victim of a fibre break and so we were performing tests in association with them to identify the cause of the outage. Once we confirmed with them the precise circuit reference a new ticket needed to be opened at which point we made the decision to cut over the line as we were concerned about further delays with the new ticket. 
    • The original circuit was not restored until the early hours of Friday morning and so we monitored it during the day to ensure that it was stable prior to switching customers back onto it in the early hours of Saturday morning.
  5. DRA: Will you be putting in place an automatic failover mechanism? If not, why not, and what procedures will you be implementing to ensure that manual failover is deployed much earlier?
    • PF: We continue to work on deploying automatic failover and it remains our intention to implement it as soon as possible. We have spare redundant lines into one of our other sites and have equipment deployed there to assist with the testing. We have cleared up the confusion over circuit IDs with the provider which, in this instance, was the main reason for the prolonged delay in making the decision to cut over onto the back up circuit. 

Broadband upgrade

The Pure Fibre broadband upgrade is now complete. They’ve installed a second fibre link to London, which should improve speed and reliability of your broadband connection.

The link was ordered last October, and is now completed, having been somewhat delayed by OpenReach misunderstanding what was required, and then by Covid-19.

The second link takes a separate route to London, so the chance of both links being down at the same time is very low. And while they’re both up, bandwidth will now be twice what it was, so speeds should generally be much faster.

I’ve been getting 350Mb/s on the “average 500Mb/s” plan today, which is quite enough for my purposes.

Broadband speeds have suffered over the last few months, but are much improved now. Note that you’ll get better speeds using Ethernet (plugged into the wall outlets) than with wi-fi. Wi-fi speeds might be reduced by interference from electrical devices, and will be reduced if the signal has to pass through solid objects like brick and plaster walls.


Fibre upgrade

There has been some progress on the fibre upgrade, which should make the Internet and phone service faster and much more reliable.

Ducts were installed under Fifth Avenue and Derwent Way last Monday (30 March). Fibre was laid in the ducts. We’re now awaiting OpenReach engineers to connect the fibre at both ends, then make the link live. The equipment at both ends belongs to OpenReach.

It’s been quite a wait. That’s partly because OpenReach originally wanted to install a cabinet on Fifth Avenue, which would require power, and maintenance. Having that equipment in the SSC instead is a much better option in the long run.


Internet Service

A Derwenthorpe resident has been in touch with Pure Fibre asking about our levels of broadband service recently. Here is their reply.

We are sorry that you have been experiencing slower speeds than normal on the Derwenthorpe network. Unfortunately, since the Government’s closing of schools and advice for all to work from home as much as possible, in common with many other Service Providers, we have seen a consistent, unprecedented level of traffic across the Derwenthorpe network. With home schooling and home working increasingly requiring online access during the day and many more people at home at night rather than out socialising, we have seen a doubling of our normal traffic levels throughout the day. 

With the network between Derwenthorpe and London being of fixed capacity, this has put tremendous strain on the network resources, resulting in saturation of the available capacity. We have therefore had to take measures to ensure that all Derwenthorpe subscribers get as fair and reasonable access to the internet as possible during peak periods.   

We are exploring methods of increasing the capacity to Derwenthorpe but this is likely to be challenging as most of the providers are operating skeleton crews offering critical support and not accepting or implementing new orders. We will however continue to engage with suppliers to see if a solution can be found.

In recognition of the reduced service levels at this challenging time, April bills for our Lightspeed package will reflect a reduced monthly rate of £21.60 for the service.


Broadband reliability

We know the broadband has been unreliable since, well… forever. That should change soon. The problem has been that the service relies on a single fibre that runs from the estate to London. If it gets cut at any point on the route, we lose service. If they need to do maintenance at either end of the route, we lose service.

Pure Fibre are working to fix this. They have commissioned a second line to London. The line will be provided by OpenReach, and the service by SSE. Pure Fibre will remain our service provider. In normal operation, this will provide Derwenthorpe with twice the bandwidth. If either line is cut, or otherwise unavailable, then the remaining line will be able to carry all the traffic.

The line requires a short dig on Fifth Avenue/Derwent Way. This is due to happen on 30 March. In fact, it will be a tunnel under the school entrance, to the edge of the estate. Fibre should be run through the tunnel, and connected at either end. Then OpenReach will need to enable the service – and that might not happen that day.

I’m a bit nervous that this dig might not happen, due to the coronavirus problem. But I’ve emphasised the importance to both JRHT, and Pure Fibre. And telecoms is recognised as essential to the nation during this crisis, so fingers crossed, we should soon have a more capable service available 365 days a year.

How reliable will it be? Well, if the first route is 98% reliable (that seems to be about the right figure), and the second route is also 98% reliable, and the two fail independently of each other, then the combined facility should give (1 – 0.02 × 0.02) × 100 = 99.96% reliability. That should mean less than four hours of downtime in a year (0.04% of the 8,760 hours in a year is about 3.5 hours).
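
For anyone who wants to check the arithmetic, here is a short Python sketch. It assumes the two routes fail independently of each other, which is plausible for separate routes but not guaranteed, and that service stays up as long as at least one route is up. The function names are my own, chosen purely for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours in a non-leap year


def combined_availability(*route_availabilities):
    """Availability when service survives as long as any one route is up.

    Assumes route failures are independent, so the chance that every
    route is down at once is the product of the individual chances.
    """
    chance_all_down = 1.0
    for availability in route_availabilities:
        chance_all_down *= 1.0 - availability
    return 1.0 - chance_all_down


def expected_downtime_hours(availability, hours=HOURS_PER_YEAR):
    """Expected hours without service over the given period."""
    return (1.0 - availability) * hours


single = 0.98  # rough estimate for one route, taken from the text above
dual = combined_availability(single, single)

print(f"one route:  {single:.2%} reliable, "
      f"about {expected_downtime_hours(single):.1f} hours down per year")
print(f"two routes: {dual:.2%} reliable, "
      f"about {expected_downtime_hours(dual):.1f} hours down per year")
```

With a 98% estimate for each route, this gives 99.96% for the pair, or about 3.5 hours of downtime a year, compared with roughly 175 hours on a single route. If the two routes ever share a point of failure, the real figure will be worse than this estimate.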

We’ll keep you updated.