Avoid harmful fall-out from outages with extreme reliability at marginal cost

Pragmatic solutions based on emergency national roaming do not have to compromise operators’ assets or businesses

Across the western world, failures in communications networks are infrequent, but when they occur they come at the expense of personal safety, data security and business continuity. The recent Rogers outage in Canada lasted nearly a day, and the UK experienced a similar outage on O2’s network in December 2018. Both felt even longer because our economy is internet-reliant, and such outages become more problematic in an emergency, when people need to call the emergency services.

Rogers recently acknowledged that 2.92 million wireline and 10.242 million wireless customers were affected during the blackout. While subsequent reports determined that it did not breach service level agreements (SLAs) with retail customers, Rogers is assessing whether it breached SLAs with its vendors.

Update errors

Rogers’ outage was triggered by an update to the distribution routers in its network, which caused Rogers’ internet gateway, core gateway and distribution routers to cease communicating with one another, as well as with Rogers’ cellular, enterprise and cable networks.

The network of mobile operator O2 experienced an outage in December 2018 affecting all of its 25 million customers. A recent Ofcom inquiry concluded that O2’s outage was significant, and that the disruption was caused by an issue with software provided by Ericsson. A fault in this critical software, linked to the expiry of a ‘security certificate’, caused the software to fail and disrupted O2’s network[1].

Both of these outages were caused by software bugs – unintentional errors rather than malicious activity – and made headline news for good reason: millions were inconvenienced or put at risk. We have also recently seen IoT networks fail, disrupting or idling a variety of systems such as information signs, in-store payments and mobility networks.

So the question is whether these very public failures of policy and of systems are fixable.

The path to network reliability

Outages like those described are rare, which is partly why they make front-page news. There has not been a complete failure of a mobile network in the UK since 2018, but there have been many less notorious cases of local unavailability of services and applications.

Mobile networks comprise millions of lines of code and some of the most advanced technology in existence. That they fail as infrequently as they do is amazing (by way of comparison, just think how often your PC or laptop needs a reboot). While we should ensure that they are as reliable as possible, complete reliability is not feasible. Going from, for example, one failure every five years to one every 20 years carries a very high cost that will be passed on to consumers, who may not value the additional reliability as much as the raised cost. So all stakeholders must accept that perfection is a journey, not a destination, and that the risks are always with us.

Potential solutions

Governments often believe that they should get involved – not least when loud calls for “something to be done” echo in national parliaments – and there can be a role for intervention, but like most things done in haste, such interventions tend to be ill-judged. Governments focus on security threats and worry loudly about Chinese equipment; while these are potential risks, they should worry at least as much about insufficiently tested software and unintended errors.

Governments also tend to believe that having more suppliers lowers risk, which is true in part, but each supplier is as likely as any other to have bugs in its code. The more suppliers there are, the harder it is to ensure that their equipment can be integrated and that their code is error-free. Finally, government intervention in a competitive market (arguably not the case in Canada today, to revisit that example) is difficult and risks market distortions.

The best form of resilience is technological redundancy: having a second option available when the first, inevitably, fails. Generally, in G7 countries we already have one: when the mobile network fails, devices shift to Wi-Fi, often without us noticing. Of course, Wi-Fi only works in or near buildings, so it is not a perfect substitute, and ever more people work, live or travel outside Wi-Fi coverage. The same is true in reverse: if Wi-Fi or broadband fails, we can switch to cellular data, using a mobile hotspot to connect Wi-Fi-only devices.

Satellite solutions can help

Satellite connectivity can also play a role in some cases, not least in less connected jurisdictions, although only the most up-to-date space solutions have the capacity to be a complete solution.

Another solution for cases when Wi-Fi can’t be used is national mobile roaming during network failure. Here, when one mobile network fails, the affected subscribers are distributed across the other mobile networks in the country until their home network comes back to life – effectively the model that Canada’s practically-minded minister seeks to enshrine in commercial agreements between his operators this month.

Technically, this solution is relatively easy to implement by giving subscribers a pre-programmed network ID in their SIM cards to which they can roam. The ID is only activated by an operator with a working network once a national network failure has been declared, and is deactivated once it is over.
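The activation logic described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the class names, the `EMERGENCY-ROAM` identifier and the selection order are hypothetical and do not represent real SIM provisioning or network selection procedures.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Sim:
    """A SIM with a home network ID and a pre-programmed emergency ID."""
    home_id: str
    emergency_id: str = "EMERGENCY-ROAM"  # hypothetical placeholder value


@dataclass
class Network:
    """A mobile network that advertises a set of network IDs while it is up."""
    name: str
    up: bool = True
    broadcast_ids: Set[str] = field(default_factory=set)


def activate_emergency_roaming(networks: List[Network], emergency_id: str) -> None:
    """Once a failure is declared, working networks start advertising the ID."""
    for net in networks:
        if net.up:
            net.broadcast_ids.add(emergency_id)


def deactivate_emergency_roaming(networks: List[Network], emergency_id: str) -> None:
    """Once the failure is over, every network stops advertising the ID."""
    for net in networks:
        net.broadcast_ids.discard(emergency_id)


def select_network(sim: Sim, networks: List[Network]) -> Optional[str]:
    """Prefer the home network; fall back to any network advertising the emergency ID."""
    for wanted in (sim.home_id, sim.emergency_id):
        for net in networks:
            if net.up and wanted in net.broadcast_ids:
                return net.name
    return None


# Example: operator A fails, operator B hosts A's subscribers until A recovers.
a = Network("A", broadcast_ids={"A"})
b = Network("B", broadcast_ids={"B"})
sim = Sim(home_id="A")

a.up = False
no_service = select_network(sim, [a, b])        # failure not yet declared
activate_emergency_roaming([a, b], sim.emergency_id)
during_failure = select_network(sim, [a, b])    # roams to B
a.up = True
deactivate_emergency_roaming([a, b], sim.emergency_id)
after_recovery = select_network(sim, [a, b])    # back on A
```

In this model the emergency ID sits dormant on every SIM and costs nothing until a working operator begins advertising it, which is the property that keeps the scheme cheap.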

There are challenges, such as ensuring the other networks are not overwhelmed by traffic, but these can be solved using throttling, reduced data rates or similar. To discourage over-reliance on this roaming mechanism, there should be substantial penalties for any operator whose network fails.
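One way to picture the load management is a simple proportional split with a rate cap. The sketch below is illustrative only: the function names, the proportional-to-spare-capacity rule and the figures are assumptions, not a description of how any operator actually manages traffic.

```python
from typing import Dict


def distribute_subscribers(affected: int, spare_capacity: Dict[str, int]) -> Dict[str, int]:
    """Split affected subscribers across host networks in proportion to spare capacity."""
    total = sum(spare_capacity.values())
    if total <= 0:
        raise ValueError("no host network has spare capacity")
    shares = {name: affected * cap // total for name, cap in spare_capacity.items()}
    # Integer division can leave a few subscribers unassigned; give them
    # to the networks with the most spare capacity, one each.
    leftover = affected - sum(shares.values())
    by_capacity = sorted(spare_capacity, key=spare_capacity.get, reverse=True)
    for name in by_capacity[:leftover]:
        shares[name] += 1
    return shares


def throttled_rate(normal_rate_mbps: float, capacity: int, load: int) -> float:
    """Scale the per-user data rate down when load exceeds capacity."""
    if load <= capacity:
        return normal_rate_mbps
    return normal_rate_mbps * capacity / load


# Hypothetical figures: 10 affected subscribers, two host networks.
split = distribute_subscribers(10, {"B": 3, "C": 1})
rate = throttled_rate(10.0, capacity=100, load=200)
```

With these made-up numbers, network B takes 8 subscribers and network C takes 2, and a host carrying twice its capacity halves each user's data rate rather than refusing service.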

This solution is not costly to implement and, except where multiple mobile networks fail at once, should mean few users are affected by, or even notice, the network failure.

Opposition to this solution generally comes from operators that worry it will set a precedent leading to national roaming at all times, where one network lacks coverage that others can provide. One should not lead to the other: there are good reasons to avoid general national roaming, and hence any emergency roaming arrangement needs to come with clear guarantees, ideally legally enforceable, that it will not be the thin end of a wedge that inexorably leads to wider roaming.

Extreme reliability at marginal cost

Much of the modern world is built on fast, highly reliable networks. In addition to telecoms and manufacturing, everything from trains to home thermostats and wristwatches relies on networks. The outages we touched on show how even a few hours of disruption can grind economies to a halt and, in some cases, endanger people’s safety. For these reasons, we believe that resilience can only be achieved by investing in technological redundancy.

With care and suitable intermediaries who can bring all stakeholders together and help them reach a position that works for them all, we can deliver extreme reliability at marginal cost. In doing so we get closer to perfection while not undermining the workability of the very good – and pave the way for a safer and more reliable technology that works for every user.


[1] o2-network-outage-cceb.pdf (ofcom.org.uk)