A few years ago, a small town in Alaska experienced a series of winter storms—that’s business as usual—but this time a rising river flooded the local telecommunications facility and knocked it offline. All services were switched over to a backup data center.
Unfortunately, a network card then failed, leaving the village completely cut off during a blizzard.
“It was messy,” recalls Matt Childs, a Senior Solutions Architect at Snowflake who at the time worked at this Alaskan telecom provider. “The card had to be replaced, but nobody could get out in that weather. Finally a guy—and this guy is the hero of the story—grabbed a card, jumped on a snow machine, and drove 200 miles to make the fix.”
Heroic? Yes. But optimal? Certainly not, says Childs. He tells the story to illustrate how the emerging discipline of observability—using new tools and data for a more complete view of complex systems—holds the key to solving some of the challenges telecoms face now: maintaining reliable services over complex networks, whether in sparse rural areas or skyscraper-cluttered dense cities, while juggling 5G rollouts and extreme weather.
In a perfect world, the backup data center, the network, or even the network card itself could have alerted the team to the potential problem before the town got disconnected.
And that is what observability is about: faster, more intelligent understanding of the state of entire networks and applications. Determining what kinds of observability data to use, and how to move, store, enrich, and analyze that data, will help separate the telecoms that prosper from those that flounder.
What’s observability in the telecom context?
Once reliant on purpose-built hardware to deliver “five-nines” service for voice and data services, telecom networks now compete to support:
- More services with a wider variety of quality of service (QoS) requirements
- More software-defined capabilities
- More microservice-based applications
- Even more complex pricing tiers and plans
- Compliance with an array of different and changing geographic privacy requirements
Today’s technologies, Childs notes, eventually turn into tomorrow’s dinosaurs, but telecoms have millions of dollars invested in legacy equipment and applications. That infrastructure will roll offline gradually—5G wasn’t born in a day, as many have noticed—which means services new and old alike must keep humming along over a changing mix of technologies.
Providing observability across a telecom network means gathering and analyzing new amounts of data about hardware and software, including relationships and dependencies, in order to ensure service quality. It encompasses classical network monitoring logs and metrics, as well as newer tools such as telemetry data and distributed tracing.
The value of observability for telecoms starts with the vital ability both to guarantee the right QoS for each customer’s requirements and to price it correctly.
On the QoS side, Childs says, “For example, a customer might be able to make a call or send data, but it’s not optimal: the service is degraded but not gone. The provider is not delivering at the level they should.
“The goal is, can the telecom providers be proactive in looking at that across the network—before the customer calls and asks why the call quality is so poor.” Good observability means the provider can quickly spot problems from jitter to packet loss to Wi-Fi interference, and zero in on a specific line, box, subsystem, or application that’s causing those issues.
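To make the "degraded but not gone" case concrete, here is a minimal Python sketch of a proactive check for packet loss and jitter. The `LinkSample` record, its field names, and both thresholds are assumptions made for illustration; they do not come from Childs or from any particular monitoring product.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class LinkSample:
    """One measurement window for a single line or device (hypothetical schema)."""
    link_id: str
    packets_sent: int
    packets_received: int
    latencies_ms: List[float]  # per-packet delay observed during the window


# Illustrative thresholds only; real limits depend on the service and its SLA.
MAX_LOSS_RATIO = 0.01
MAX_JITTER_MS = 30.0


def packet_loss_ratio(sample: LinkSample) -> float:
    """Fraction of packets sent in the window that never arrived."""
    if sample.packets_sent == 0:
        return 0.0
    return 1.0 - sample.packets_received / sample.packets_sent


def mean_jitter_ms(sample: LinkSample) -> float:
    """Mean absolute difference between consecutive packet delays."""
    lat = sample.latencies_ms
    if len(lat) < 2:
        return 0.0
    return mean(abs(b - a) for a, b in zip(lat, lat[1:]))


def degraded_links(samples):
    """Return (link_id, reasons) for links that still work but miss the quality targets."""
    flagged = []
    for s in samples:
        reasons = []
        if packet_loss_ratio(s) > MAX_LOSS_RATIO:
            reasons.append("packet_loss")
        if mean_jitter_ms(s) > MAX_JITTER_MS:
            reasons.append("jitter")
        if reasons:
            flagged.append((s.link_id, reasons))
    return flagged
```

Run over each measurement window, a check like this would surface the specific link that is slipping before a customer calls to complain. In practice the thresholds would be set per service tier rather than hard-coded.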
On the issue of pricing, Childs says observability can help a telecom refine its service guarantees; for example, by predicting and lessening the number of times that spikes in demand will cause issues with bandwidth availability.
Beyond those bedrock quality and pricing use cases for observability, other potential benefits include:
- Predictive maintenance to reduce downtime with better-informed, targeted interventions and schedules. A 2022 State of Observability Report found that observability leaders report 69% faster mean time to remediate, along with nearly 90% lower costs of downtime.
- Guiding infrastructure investment and new services based on better analysis of customer needs.
- Autonomous network management, with good algorithms and data models enabling the network to make smart decisions faster than humans can.
So much data
Observability requires a lot of data. Beyond basic communications data such as when a transmission began and ended, a telecom provider would like to see transmission speed, retries, lost packets on a call, and quality measures such as whether the voice frequency on a given call rose above, or fell below, defined limits.
5G makes the volume of metadata even bigger, but also more necessary. For example, many telecoms aim to use 5G’s native network-slicing capabilities to offer more fine-grained services and SLAs.
Achieving all of this requires providers to think carefully about data issues, including:
Storage. The sheer quantity of data involved may be daunting. Saving frequent snapshots or “samples” of distributed traces, for example, can add up quickly. In a late-2022 telecom observability study conducted by EMA and NS1, 43.5% of respondents said data storage is a significant challenge.
Analyze it, then condense or delete it. “Network operators are looking at live dashboards, but a lot of that information has a short shelf life—they’re not going to keep all that data, but aggregate it in greatly reduced form,” Childs says.
“The overall health of the network two weeks ago is important for seeing trends or setting new baselines, especially if you are adding traffic or adding new services. But you’re not going to be using weeks-old data for troubleshooting,” he explains.
Thoughtful decisions about frequency of data capture and about data retention policies are in order.
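As a rough illustration of "analyze it, then condense or delete it," the Python sketch below rolls raw samples up into hourly summaries and then drops raw data that has aged out. The tuple layout `(timestamp, link_id, latency_ms)` and the two-day retention window are assumptions chosen for the example, not recommendations.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def rollup_hourly(samples):
    """Condense raw (timestamp, link_id, latency_ms) samples into hourly summaries."""
    buckets = defaultdict(list)
    for ts, link_id, latency_ms in samples:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[(hour, link_id)].append(latency_ms)
    return {
        key: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": sum(values) / len(values),
        }
        for key, values in buckets.items()
    }


def drop_expired(samples, now, retention=timedelta(days=2)):
    """Keep only raw samples newer than the retention window (2 days is an assumed value)."""
    cutoff = now - retention
    return [s for s in samples if s[0] >= cutoff]
```

The summaries are small enough to keep for trend analysis and baselining, while the detailed samples are kept only as long as they are useful for troubleshooting.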
Movement. “This is the big one,” says Childs. Even if data isn’t stored long-term, a lot of it can be moved around the network. He estimates that telemetry and metadata used in observability might add 25% more overhead to the transmission volume.
In these considerations, “What do I need to know, and when do I need to know it?” is an important tactical question.
Silos. It’s hard to claim a system-wide view if different monitoring tools keep their data separate. Real observability means being able to correlate network and application performance, to have all troubleshooting or reliability teams working from one data source, and to enable consistent, informed pricing decisions.
Telecoms will need to actively reduce silos, especially since proliferating hardware types, data types, and service offerings pull in the opposite direction. This may require technical, infrastructure, and procedural changes.
Enrichment. Telecoms can find value in combining their own network data with third-party data sources; weather data is the most salient example. Particularly in rugged, sparsely populated areas—which are often served by smaller regional telecoms—knowing about temperatures and wind can help understand when a problem is caused by local conditions, versus an inherent equipment issue.
What data to combine is one question. When to combine it is just as important. In the weather data example, temperatures and wind speed could be recorded locally and backhauled alongside local telemetry data—but that’s expensive. It may be possible, and cheaper, to add third-party weather data and match it with the telemetry data in the data center, reducing those transmission costs.
“Don’t pay to move it” if you don’t have to, as Childs says.
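One way to picture the cheaper option is a join performed in the data center: telemetry arrives from the field on its own, and each record is matched to the nearest third-party weather observation for the same site. The Python sketch below assumes hypothetical record shapes (`site`, `ts`, and a per-site list of observations sorted by time) and an arbitrary 30-minute matching tolerance.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def nearest_weather(weather_by_site, site, ts, max_gap=timedelta(minutes=30)):
    """Return the weather reading closest in time to ts for a site, or None."""
    # Each site maps to a list of (timestamp, reading) tuples sorted by timestamp.
    observations = weather_by_site.get(site, [])
    if not observations:
        return None
    times = [t for t, _ in observations]
    i = bisect_left(times, ts)
    # Only the observations just before and just after ts can be the closest.
    candidates = observations[max(i - 1, 0): i + 1]
    closest = min(candidates, key=lambda obs: abs(obs[0] - ts))
    return closest[1] if abs(closest[0] - ts) <= max_gap else None


def enrich(telemetry, weather_by_site):
    """Attach weather readings to telemetry records after both reach the data center."""
    return [
        {**rec, "weather": nearest_weather(weather_by_site, rec["site"], rec["ts"])}
        for rec in telemetry
    ]
```

Because the weather feed never travels over the provider's own backhaul, the only field data that has to move is the telemetry itself.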
Smarter devices. Network overload can also be reduced by intelligent devices throughout the network. This isn’t only a question of better routing, it’s also about being able to spot problems or patterns in local data without backhauling everything to the cloud or data center for analysis.
“I want things in the field to be smart, to not only carry the traffic but tell me how they’re doing,” says Childs.
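A minimal sketch of what "smart" could mean on the device side: keep a short rolling window of readings locally and send something upstream only when a reading looks unusual. The `EdgeMonitor` class, the window size, the ten-reading warm-up, and the z-score threshold are all hypothetical choices for illustration, not a description of any vendor's firmware.

```python
from collections import deque
from statistics import mean, pstdev


class EdgeMonitor:
    """Keeps a rolling window on the device and reports only unusual readings."""

    def __init__(self, window=60, z_threshold=3.0):
        self.readings = deque(maxlen=window)
        self.z_threshold = z_threshold  # assumed sensitivity, not a recommended value

    def observe(self, value):
        """Return an alert dict if value is an outlier against recent history, else None."""
        alert = None
        # Wait for a few readings before judging, so the baseline is meaningful.
        if len(self.readings) >= 10:
            mu = mean(self.readings)
            sigma = pstdev(self.readings)
            if sigma > 0 and abs(value - mu) / sigma > self.z_threshold:
                alert = {"value": value, "baseline_mean": mu, "baseline_stdev": sigma}
        self.readings.append(value)
        return alert
```

Only the occasional alert would need to be backhauled, rather than every raw reading, which is the trade-off between local intelligence and transmission volume described above.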
Despite ongoing improvements on this front, those smart devices aren’t yet in place throughout the network.
Which brings us back to the storm-tossed village in Alaska.
“How much better it would have been if that network component could have told us, ‘Hey I know I’m just a redundant backup, but something’s wrong with me’?” Childs says.
Telecoms will continue to press vendors for technology that not only provides full visibility into network data, routes data, and captures telemetry signals, but can perform ongoing self-examination to ensure the network is working correctly.
As the world’s appetite for faster networks keeps rising, telecoms’ central role continues to expand as well. The demand for reliability and quality across complex networks and service portfolios means network visibility is more important than ever. Making good observability decisions will help telecoms, and their customers, thrive.