Bell tackles service assurance as BT rings Network changes

Alex Bell, Enterprise Architect, BT, talks about progress in automating service operations, benefits at scale and next steps

BT’s outgoing Chief Architect, Neil McRae, brought Alex Bell into BT’s Networks unit two years ago to transform service. Since Bell moved to the 18-month-old Digital division in September, there has been a shake-up within Networks*, including McRae’s departure at the end of the year.

Bell starts by saying, “In services, we have 56 different tools for managing tickets; we have 60 to 65 different tools for monitoring our various different components in our estate,” resulting from the organic evolution that is typical of most telcos.

Unsurprisingly, this “introduced unnecessary cost because if you’ve got tickets that have to go through multiple different systems to fix customers’ issues, it’s a drain”. Hence two years ago, Bell started work on a service strategy to improve the management of services with better ticketing, monitoring and assurance in the network, including monitoring assurance in the IT estate, with the ultimate goal of automating operations.

He describes the three programmes BT has activated to achieve that desired, automated end state:

• digital ops (digi-ops) transformation to improve service management, including standardising on ServiceNow;

• AIOps, which means transitioning to AI-enabled intelligent root cause analysis to pinpoint issues without human intervention; then

• zero-ops, which takes automated root cause analysis and fixes those problems automatically wherever possible, split across Networks and IT.

Digi-ops

BT opted to standardise on ServiceNow because, Bell says, “it seems to have done very well at putting everything together as a component that you license with optional add-ons that all fit together really neatly. So if you’ve got the CMDB [configuration management database] that tells you about your estate, and you want to run change or incident, problem or monitoring, [ServiceNow] uses that same single CMDB as its reference point.”

The team found that in the other major contender’s offer, “there was a lot of difficulty in getting some of the components to work together,” he adds. “For us in the telco world – and we’ve co-created with ServiceNow on this – they’ve been very focused on how they evolve their product into the telco market, such as enhancements to the CMDB, to cater for enhancing telcos’ capabilities for their workflow and product engine.”

AIOps accelerates

Bell notes there are a number of offers in the full stack observability sector, including dynatrace, Datadog, AppDynamics and New Relic, all of which BT assessed. Bell stresses, “For us the hybrid model is important because we’ve got a lot of on-premise capability to provide us with the security we need in the network, as well as hybrid workloads in the cloud. We needed something that supported that and…automated root cause analysis. dynatrace came out ahead.”

BT soon discovered from its pilot with dynatrace that “the time to value is quick”. This “was a real-life trial of the capability across the actual estate,” according to Bell. “In a period of about two months, we were out across the mobile provision and dev-estate and after three months, we were in production. That pace is something I hadn’t seen before.”

He explains, “We saw things like a 90% reduction in mean time to identify compared with the previous environment”. Mean time to identify contributes to mean time to repair (MTTR), which is what makes the really big impact on service in some situations, but a 90% improvement in MTTR “would be unrealistic.”

This was seriously good news: “In our business case, we were expecting a 50% reduction in MTTR and chain, not only from doing automated root cause, but with zero-ops in the mix, to detect the root cause automatically and fix it. That’s where the much more significant savings come into it. I think we also saw a 50% reduction in ticket volume, which has a big impact in the consumer estate and translates into an impact on us as a business.”
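To illustrate why a 90% cut in mean time to identify does not translate into a 90% cut in MTTR, here is a minimal arithmetic sketch. It assumes MTTR is simply time to identify plus time to fix, and that only the identify phase speeds up. The function name and the 50/50 split are illustrative assumptions, not figures from BT.

```python
def mttr_reduction(identify_share: float, identify_reduction: float) -> float:
    """Fractional MTTR reduction when only the identify phase gets faster.

    identify_share: fraction of the original MTTR spent identifying the fault (0..1).
    identify_reduction: fractional cut in time to identify (0.9 means 90% faster).
    """
    if not 0.0 <= identify_share <= 1.0:
        raise ValueError("identify_share must be between 0 and 1")
    if not 0.0 <= identify_reduction <= 1.0:
        raise ValueError("identify_reduction must be between 0 and 1")
    # Time spent on the fix itself is unchanged, so only the identify slice shrinks.
    return identify_share * identify_reduction


# Hypothetical split: identification takes half of the total repair time.
print(f"{mttr_reduction(0.5, 0.9):.0%}")  # prints 45%
```

Under that assumed split, a 90% faster diagnosis gives a 45% shorter repair overall, which is in the region of the 50% the business case targets. Getting beyond that means shortening the fix itself, which is where automated remediation comes in.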

Zeroing in on zero-ops

The business case is to save £27 million over five years. “We’re very focused on hard metrics, hard savings. It’s foundational to Harmeen’s work to double productivity and halve costs.” Harmeen Mehta (pictured) is Chief Digital and Innovation Officer and leads the Digital division. She joined BT with much fanfare in March 2021 to drive rapid change.

Bell continues, “That £27 million is all about speeding up the process of identifying root causes without human involvement. Data is exploding at the moment – becoming an exponential problem. Having computers work out the root cause then zero-ops to fix things automatically is going to be massive.”

As an example, Bell says iPhone launches are critical times for BT’s network. “The iPhone launch two years ago was very stable. With dynatrace involved, we detected issues that we estimate would have taken us a couple of hours to identify without them,” he explains. “They did it within tens of minutes and fixed things. This year dynatrace is across the whole consumer estate and it was the most stable launch that we’ve had ever. We attribute that in no small part to having dynatrace across the full stack.”

Bell says dynatrace is currently deployed across about 25% of the estate and being rolled out “at pace. The target in the business case was to be across the full estate within a year and a half, but the way it’s shaping up, we’re hopeful that we can get there within 12 months.”

The integration of dynatrace with ServiceNow – as opposed to adopting other popular technologies in the market such as CloudBees and Ansible – is at the heart of zero-ops.

dynatrace is embedded in BT’s applications and web pages, monitoring customers and their experience, as well as in the IT estate behind customer-facing assets. It traces transactions through the stack. When it detects an anomaly in users’ experience on a server in the chain, it looks for the root cause – such as a full database queue or infrastructure issue – and identifies it.

Bell picks up the story: “Then dynatrace notifies ServiceNow via an integration that the two vendors accelerated for us, and…ServiceNow, as a workflow engine, calls out to the appropriate scripting engines to remedy the situation.”

Two game-changing chances

It’s still early days for BT’s zero-ops. Bell says, “We’re using a couple of use cases, simple stuff like server restarts and restarting services, when it detects an issue with experience. When it’s an issue we know about, we call the automated scripts to auto-remediate that problem, bring down the service, bring it back up, then dynatrace confirms the issue is fixed and the ticket is closed.”
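The closed loop Bell describes (detect, match a known issue, run a script, verify, close the ticket) can be sketched roughly as follows. This is a simplified illustration only: it does not use any dynatrace or ServiceNow API, and every name in it (`Ticket`, `handle_problem`, `runbooks`, `is_healthy`, the `service_hung` label) is a hypothetical stand-in.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Ticket:
    problem_id: str
    root_cause: str
    status: str = "open"
    log: List[str] = field(default_factory=list)


def handle_problem(
    problem_id: str,
    root_cause: str,
    runbooks: Dict[str, Callable[[], None]],
    is_healthy: Callable[[], bool],
) -> Ticket:
    """Open a ticket, run a known fix if one exists, and close only once verified."""
    ticket = Ticket(problem_id, root_cause)
    fix = runbooks.get(root_cause)
    if fix is None:
        # Unknown root cause: leave the ticket for a human engineer.
        ticket.status = "escalated"
        ticket.log.append("no runbook for this root cause")
        return ticket
    fix()
    ticket.log.append(f"ran runbook for {root_cause}")
    # The monitoring side has to confirm recovery before the ticket is closed.
    if is_healthy():
        ticket.status = "closed"
    else:
        ticket.status = "escalated"
        ticket.log.append("fix did not clear the problem")
    return ticket


# Usage with stand-in callables; a real setup would call monitoring and scripting tools.
state = {"service_up": False}


def restart_service() -> None:
    state["service_up"] = True


ticket = handle_problem(
    "P-1",
    "service_hung",
    runbooks={"service_hung": restart_service},
    is_healthy=lambda: state["service_up"],
)
print(ticket.status)  # closed
```

The design point worth noting is that the ticket closes only after an independent health check passes. Unknown causes and failed fixes both fall back to a human, which matches the "issue we know about" qualifier in Bell's description.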

However, “Over the next couple of months, we expect that to really ramp up. That’s where Harmeen’s applying a lot of the focus, talking to the SRE [site reliability engineering] community that she’s introduced and concepts she’s brought in around that auto remediation,” he adds.

BT expects this to benefit both its consumer and enterprise customer bases. “We’re expecting this to have a big uplift in NPS for both,” Bell states. “Customers are more and more impatient about experience…they’ll switch off and go elsewhere or call us. With that change in dynamics, it’s vital we’re able to avoid those situations.

“Detecting anomalies before things are launched changes the game for us and even more so if something makes it through test cycles but we can detect it once in production, the minute customers start to see it, and recall it. We invoke the right roll back and deal with problems before we lose customers’ confidence. That goes for consumers as much as for business.”

The sticky problems of scale

There is a lot of grumbling among European telcos’ shareholders in particular about the lack of tangible results and impact on the bottom line from telcos’ transformation efforts. While BT has travelled a considerable distance in the last two years regarding service assurance and automation, when will it really see these benefits scale? 

Bell says, “We talk about a hockey stick, and about how the transformation…is expected to really ramp up and already, in some cases, we are seeing that. Just looking at dynatrace, in the early days, we were deploying at the rate of 25 hosts at a time, and now it’s about 1,000 hosts a night.” But this “phenomenal pace” brought some problems of its own, “because some of the engines processing in the background started to creak at that pace.”

He continues, “We’re not going quite at that pace at the moment, but it’s still a huge volume of hosts we’re able to get through, which means we can drive change quickly.”

Likewise, telcos tend to talk up their use of DevOps methodologies – continuous integration, delivery/deployment and test, usually abbreviated to CI/CD/CT – but their use is typically very limited. Bell says that Mehta is determined to expand the CI/CD pipeline and that, “The work that’s happening in Digi-Ops, AIOps and zero-ops is growing their use across the organisation…dynatrace assures those faster, more frequent deployments.”

Getting away from the 65 different monitoring tools and the cost of keeping them all up to date and overseeing the contracts and more has been a surprisingly big benefit too. “By bringing it all together with a single vendor end to end you get that root cause ability, real depth of insight, and generate significant savings. That’s been particularly interesting.”

What about the dangers of being overly reliant on a single vendor, including missing out on innovation? Bell says, “There are parts of the estate where we have more than one [vendor involved] as a result of exactly that [avoiding over-reliance]…In this space, the real difference is gaining deep observability…you just can’t get that if you try to run silos of assurance.”

He adds, “If innovation proves problematic, just think what we’ve got across within a year, it’s not the end of the world. As we become more cloud orientated in IT, our ability to change things becomes far quicker and more agile.”

Dynatrace recently announced Grail, which it describes as “Boundless observability, security, and business analytics with context”. Grail changes how data is ingested and handled within dynatrace, which according to Bell makes it a direct competitor “with the likes of the ELKs or Splunks of the world” – platform-based capabilities that BT hasn’t previously targeted.

Next steps – introducing Grail

“Grail’s first step of log ingestion puts [dynatrace] on a footing with them and we’ll be looking hard at options to expand the [dynatrace] footprint across more functions. That’s a cost opportunity because it’s further consolidating and simplifying the estate.” It also means the analytics for those logs are handled in one place, which should deliver more accurate root causes and therefore better outcomes, faster.

Specifically, Bell says BT is working closely with dynatrace, looking “at whether there’s opportunity to bring logs in from the network to give us even bigger insights. We’re very excited about Grail, although it is yet to be proven. A proof of concept will be first port of call but if we like what we see, we’ll be expanding the remit.”

* BT is restructuring its Networks internal service unit, which is led by CTO Howard Watson. In October, it was announced that he would become Chief Security & Networks Officer, shortly after it was made public that Neil McRae, MD Architecture & Strategy, BT Group Chief Architect, would leave the operator at the end of the year after a 12-year stint. Greg McCall, who is MD of Service Platforms, will become Chief Networks Officer. There are also restructured teams in Networks: Network Services; Strategy & Research; Cyber & Information Security; Operational Resilience & Service Management; Security Transformation; and Health, Safety and Environment.

There are two other leavers: Andy Skingley, MD of Dynamic Infrastructure, will shortly retire and Tim Whitley, MD for Applied Research, will lead the new Strategy & Research team on a short-term basis before quitting BT next summer.