IP, Subnets, Routing, and DNS
When “the server is up” but customers can’t connect
A telecom operations team rolls out a routine change to a pooled service: a new VM for a policy API and a small DNS update so clients can find it. The hypervisor dashboard is green, the VM responds to ping, and CPU looks fine—yet call setup starts failing in one region and VoLTE registrations time out intermittently. The on-call engineer hears the familiar line: “But the server is reachable.”
In telecom, that’s a trap. Reachable only means packets can get to an IP address; it says nothing about whether the right clients can find the right endpoint, whether traffic is using the right route, or whether name lookups are returning the intended targets. IP addressing, subnetting, routing, and DNS are the “plumbing” that decides where traffic goes and how dependencies are discovered.
This lesson gives you a practical mental model: how IP addresses are structured, how subnets draw boundaries, how routes choose paths, and how DNS turns service names into the IPs your apps actually use.
The network “stack” you troubleshoot: IP, subnets, routes, names
An IP address identifies a network interface on an IP network—think "where to deliver the packet." In most enterprise telecom environments you'll see IPv4 used heavily (e.g., 10.28.14.7), often with IPv6 running in parallel depending on the domain. The key beginner takeaway is that IP is about addressing and delivery, not "the internet" versus "internal." Internal networks still use IP; they just use private ranges and controlled routing.
A subnet is an IP address range that shares a common network prefix. Subnetting is how you draw boundaries: “these hosts are local to each other” and “that destination is elsewhere.” A subnet is defined by a prefix length (CIDR notation) like /24 or /20, which determines how many addresses are in the block. Subnets affect real operations: where gateways sit, where broadcast or ARP behavior occurs, and which systems are considered “on-link” versus requiring a router hop.
Routing is the decision-making that sends packets to the next hop toward a destination network. End hosts keep a small routing table (often “local subnet + default gateway”), while routers keep large tables that reflect policy, topology, and segmentation. In telecom environments, routing boundaries often map to failure domains and security zones: access, core, management, and service networks may be separate by design.
DNS (Domain Name System) maps names to records—most commonly A/AAAA records that resolve a service name to an IP address. DNS is not just “website stuff.” It’s a key dependency in service chains: if authentication, policy, or routing control-plane components can’t resolve each other consistently, you get timeouts that look like random instability. DNS also introduces caching and time-to-live (TTL) behavior—meaning a change can take time to fully propagate across clients.
IP and subnets: the boundary lines that decide “local” vs “routed”
An IPv4 address has two conceptual parts: a network portion and a host portion. The prefix length (like /24) tells you how many leading bits define the network. Practically, you’ll often translate this into a subnet mask (/24 = 255.255.255.0) and a usable range. For example, 10.28.14.0/24 means any address from 10.28.14.0 to 10.28.14.255 is in that subnet, with some addresses reserved by convention (network and broadcast in traditional IPv4 subnetting).
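The prefix-to-mask-to-range translation above can be checked mechanically. This is a minimal sketch using Python's standard `ipaddress` module; the 10.28.14.0/24 block is the example from the text:

```python
import ipaddress

# The /24 prefix means the first 24 bits identify the network.
net = ipaddress.ip_network("10.28.14.0/24")
print(net.netmask)                    # 255.255.255.0
print(net.num_addresses)              # 256 addresses in the block
print(net.network_address)            # 10.28.14.0 (reserved by convention)
print(net.broadcast_address)          # 10.28.14.255 (reserved by convention)

# Membership test: is a given host inside this subnet?
host = ipaddress.ip_address("10.28.14.7")
print(host in net)                    # True
```

The same calls work for any prefix length, which makes them handy for sanity-checking /20 or /27 designs where the boundaries are less obvious than /24.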
Subnetting matters because it determines who can talk directly at Layer 2 versus who must go through a default gateway (a router interface). If a host believes a destination is “local,” it will try to reach it directly (e.g., via ARP in IPv4). If it believes the destination is “remote,” it sends the packet to the gateway. A single wrong prefix length can turn a working network into a confusing partial outage: some destinations appear reachable, others blackhole, and symptoms vary by source host.
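The "local versus remote" decision a host makes can be sketched in a few lines. This is an illustrative model, not an OS implementation; the addresses are made up:

```python
import ipaddress

def first_hop(src_iface: str, dest: str) -> str:
    """Model how a host decides where to send a packet:
    on-link destinations are reached directly (ARP in IPv4),
    everything else goes to the default gateway."""
    iface = ipaddress.ip_interface(src_iface)  # address + prefix, e.g. 10.28.14.7/24
    if ipaddress.ip_address(dest) in iface.network:
        return "on-link"   # resolve via ARP, deliver directly
    return "gateway"       # hand the packet to the router

print(first_hop("10.28.14.7/24", "10.28.14.99"))  # on-link: same /24
print(first_hop("10.28.14.7/24", "10.28.77.5"))   # gateway: different subnet
```

Note that the decision depends entirely on the configured prefix length, which is why a wrong prefix changes behavior even when the IP address itself is correct.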
Best practice in telecom operations is to treat subnet design as intentional architecture, not just address allocation. Subnets are often aligned to functions and risk: management networks separate from service traffic; latency-sensitive functions separated from noisy domains; and redundancy designed so that a single subnet or gateway failure does not take out all instances of a critical service. This mirrors the earlier operational theme of failure domains: where you draw boundaries determines what can fail together.
Common pitfalls for beginners include confusing “same first three octets” with “same subnet” (that only holds for /24, and real networks frequently use /20, /21, or /27), and assuming “if I can ping it, the subnet is fine.” Ping can succeed across routed paths even when the subnetting is wrong locally, and it can fail due to firewalls even when subnetting is perfect. A typical misconception is that subnetting is “just for saving addresses.” In production, subnetting is primarily about containment, policy, and predictable routing behavior.
Routing: how packets choose a path (and how they choose the wrong one)
Routing is the process of selecting a next hop based on a destination IP and a routing table. On an ordinary server, the most important routes are the directly connected route (your local subnet) and the default route (0.0.0.0/0) pointing at the gateway. On routers, there are many specific routes, and the general rule is longest prefix match: the most specific route wins. This is why 10.28.14.0/24 takes precedence over 10.28.0.0/16 if both exist.
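Longest prefix match can be demonstrated with a toy routing table. This is a simplified sketch (real routers also weigh administrative distance and metrics); the networks and next hops are illustrative:

```python
import ipaddress

def next_hop(dest: str, routes) -> str:
    """Pick the most specific (longest prefix) route that matches."""
    d = ipaddress.ip_address(dest)
    matches = [(net, hop) for net, hop in routes if d in net]
    if not matches:
        raise ValueError(f"no route to {dest}")
    # The route with the largest prefix length wins.
    return max(matches, key=lambda m: m[0].prefixlen)[1]

routes = [
    (ipaddress.ip_network("0.0.0.0/0"), "10.28.14.1"),      # default route
    (ipaddress.ip_network("10.28.0.0/16"), "10.28.0.254"),  # regional aggregate
    (ipaddress.ip_network("10.28.14.0/24"), "direct"),      # connected subnet
]

print(next_hop("10.28.14.99", routes))  # direct: /24 beats /16 and /0
print(next_hop("10.28.77.5", routes))   # 10.28.0.254: /16 beats /0
print(next_hop("8.8.8.8", routes))      # 10.28.14.1: only the default matches
```

This is exactly why 10.28.14.0/24 takes precedence over 10.28.0.0/16 when both exist: specificity, not the order of the entries, decides.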
In telecom environments, routing is also about policy and segmentation. You may intentionally prevent certain subnets from reaching others, or force traffic through inspection points, load balancers, or service meshes at the network layer. The operational implication is that a connectivity issue is often not “a broken cable,” but a missing or incorrect route, an unexpected asymmetric path (packets go out one way and return another), or a gateway doing something different than you assume.
Best practices start with clarity and consistency. Keep routing design understandable: document which networks exist, what gateways they use, and what the expected traffic paths are for critical service chains. When introducing virtualization, remember that there can be multiple “routing-like” decisions: inside the VM (guest OS routes), inside the hypervisor virtual switch, and in the physical network. This is the same layered thinking used for servers and virtualization: symptoms can originate from different layers even when they look identical from the application.
Common pitfalls include relying on “it works from one host” as proof that routing is correct everywhere. Different hosts can have different default gateways, different static routes, or different policy routes. Another frequent problem is mistaking NAT for routing; NAT changes addresses, while routing decides the path. Even if NAT is in play at network edges, inside most telecom service networks you still need correct routing for stability and observability.
DNS: service discovery, caching, and the hidden time delay
DNS translates names into records so clients don’t hardcode IP addresses. In telecom systems, DNS is often used for service discovery and resiliency: a name might map to multiple A/AAAA records (multiple instances), or it might use CNAME chains to point “service” names at environment-specific targets. The operational advantage is flexibility: you can move or replace endpoints without reconfiguring every client—if DNS is managed carefully.
The feature that makes DNS efficient—caching—is also what makes it tricky during change. DNS responses are cached by resolvers and sometimes by the application itself, for a duration influenced by TTL. That means updates do not instantly take effect everywhere. During cutovers, you can get a mixed world: some clients use the new IP, others keep the old one, and failures appear “random” unless you remember caching. This is especially visible in telecom where distributed nodes and many clients amplify inconsistency.
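The "mixed world" during a cutover follows directly from how TTL-based caching works. Here is a minimal simulation of resolver-style caching (not a real DNS client; names, IPs, and TTLs are illustrative):

```python
import time

class TtlCache:
    """Sketch of resolver caching: an answer is reused until its TTL
    expires, so a record change is invisible to this client until
    the cached entry ages out."""

    def __init__(self):
        self._store = {}  # name -> (ips, expiry_time)

    def resolve(self, name, upstream, now=None):
        now = time.monotonic() if now is None else now
        hit = self._store.get(name)
        if hit and hit[1] > now:
            return hit[0]                  # cached answer, possibly stale
        ips, ttl = upstream(name)          # ask the authoritative source
        self._store[name] = (ips, now + ttl)
        return ips

# Simulated cutover: the authoritative record changes at t=0, but this
# client cached the old answer with a 300-second TTL just beforehand.
cache = TtlCache()
cache.resolve("auth.service.core", lambda n: (["10.28.14.31"], 300), now=0)
stale = cache.resolve("auth.service.core",
                      lambda n: (["10.28.40.7"], 300), now=120)
print(stale)   # still ["10.28.14.31"]: old IP survives until t=300
fresh = cache.resolve("auth.service.core",
                      lambda n: (["10.28.40.7"], 300), now=301)
print(fresh)   # ["10.28.40.7"]: cache expired, new record picked up
```

Multiply this by thousands of clients whose caches expire at different moments and you get the staggered, "random-looking" failure pattern described above.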
Best practices focus on controlling blast radius and timing. Keep TTLs appropriate for how often you expect change, and plan DNS changes with propagation in mind. Ensure redundancy in resolvers and make sure critical components know which resolvers to use; a resolver outage can look like an application outage. Also, be careful with “quick fixes” like editing /etc/hosts on one server—those changes bypass DNS and can create configuration drift, making incidents harder to diagnose in the future.
Common pitfalls include assuming DNS is always the culprit when a connection fails, or assuming DNS can’t be the culprit when ping by IP works. DNS failures often show up as timeouts and retries that look like slowness rather than a clean error. A typical misconception is that DNS is purely an “internet” dependency; in reality, internal DNS is often more critical because most services in a telecom stack rely on names to locate each other reliably.
How these pieces differ (and how they interact)
| Dimension | IP address | Subnet (CIDR) | Routing | DNS |
|---|---|---|---|---|
| Primary job | Identifies an interface so packets can be delivered. | Defines a local network boundary (which addresses are “on-link”). | Selects the next hop/path to reach non-local destinations. | Maps names to records, usually IPs, so clients can find services. |
| Where it’s configured | NIC/interface config on hosts; also on load balancers and routers. | Host interface settings and network design documentation/IPAM. | Host routing table; routers; sometimes hypervisor/virtual network layers. | Authoritative DNS servers; recursive resolvers; client resolver settings. |
| Common failure pattern | Duplicate IP, wrong IP, wrong gateway association. | Wrong prefix length causing “local vs remote” confusion. | Missing/incorrect route, asymmetric path, wrong default gateway. | Stale cache, wrong record, resolver outage, split-horizon mismatch. |
| What it looks like to an operator | Some services unreachable; ARP issues; conflicts. | Intermittent reachability; “works from some hosts.” | Large-scale reachability problems by network segment. | Name lookup failures; timeouts; “IP works, name doesn’t.” |
A helpful way to keep this straight: IP is “where,” subnet is “who’s local,” routing is “how to get there,” and DNS is “what name points where.” Incidents often combine them: DNS points to an IP in the wrong subnet, or routing can’t reach the subnet DNS returned.
Two telecom-flavored incidents, worked end-to-end
Example 1: VoLTE registrations fail after “just a DNS change”
During a busy hour, multiple sites report increased VoLTE registration timeouts. Core services look healthy on their hosts, but application logs show repeated failures when contacting an authentication endpoint by name (for example, auth.service.core). Operators try the direct IP from one node and it works, but name-based calls fail from several others. That mismatch is your clue: transport works sometimes, but discovery is inconsistent.
Step-by-step reasoning helps isolate it. First, verify whether the issue is name resolution or routing by checking if affected nodes resolve the name to the same IPs as unaffected nodes. You may find the resolver at Site A returns a new IP while Site B still returns the old one—classic caching plus staggered resolver behavior. Next, check whether the new IP is in the expected subnet and reachable via the intended gateway; a DNS record that points to an IP in a management subnet can “work” from a few admin hosts but fail from service subnets due to segmentation policies.
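Both checks in that reasoning—do resolvers agree, and is the returned IP in a subnet service clients can reach—can be automated. This is a sketch over illustrative data (the resolver names, IPs, and subnets are made up for the incident scenario):

```python
import ipaddress

def check_answers(answers_by_resolver, expected_subnet):
    """Compare the A records each resolver returns and flag any IP
    outside the subnet that service traffic is allowed to reach."""
    expected = ipaddress.ip_network(expected_subnet)
    answer_sets = {r: frozenset(ips) for r, ips in answers_by_resolver.items()}
    consistent = len(set(answer_sets.values())) == 1
    out_of_subnet = {
        r: [ip for ip in ips if ipaddress.ip_address(ip) not in expected]
        for r, ips in answers_by_resolver.items()
    }
    return consistent, out_of_subnet

# Site A's resolver already serves the new record (which landed in a
# management subnet); Site B still returns the cached old answer.
answers = {
    "siteA-resolver": ["10.28.40.7"],   # new IP, wrong subnet for clients
    "siteB-resolver": ["10.28.14.31"],  # stale cached record
}
consistent, flagged = check_answers(answers, "10.28.14.0/24")
print(consistent)  # False: the resolvers disagree
print(flagged)     # Site A's answer is outside the service subnet
```

Running a comparison like this from each affected site turns "failures look random" into two concrete findings: stale caches and a record pointing into the wrong subnet.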
Impact, benefits, and limitations show up clearly. DNS-based discovery gives you flexibility to move endpoints without touching every client, which is valuable in fast-moving telecom environments. The limitation is propagation delay: until caches converge, you get partial rollout behavior that looks like instability. The operational best practice is to plan DNS changes like real deployments: control TTL ahead of time, validate reachability from each relevant subnet, and ensure resolver redundancy so a single resolver’s stale cache doesn’t become a region-wide symptom.
Example 2: A new VM is “up,” but half the network can’t reach it
A team provisions a new VM for an internal API used by multiple network functions. Monitoring from the same rack shows the VM responds to ICMP and the service port is open. Yet remote nodes in another region time out. The initial suspicion is “firewall,” but the pattern—local success, remote failure—often points to subnetting or routing rather than the application.
Work it like a layered diagnosis. Confirm the VM’s IP, subnet mask/prefix length, and default gateway. A common misconfiguration is assigning the correct IP but the wrong prefix (e.g., /16 instead of /24): the VM believes many remote addresses are “local,” tries to ARP for them, and never sends traffic to the gateway. From nearby hosts that really are local, it works; from everywhere else, it appears dead. If subnetting is correct, look at routing: is the subnet of the new VM advertised or reachable from remote domains, or did the VM land on a VLAN/subnet that’s only present locally on that hypervisor cluster?
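The /16-instead-of-/24 misconfiguration can be made concrete. A minimal sketch with the lesson's example addresses, showing how the wrong prefix changes the host's on-link decision:

```python
import ipaddress

correct = ipaddress.ip_interface("10.28.14.7/24")  # intended config
wrong = ipaddress.ip_interface("10.28.14.7/16")    # misconfigured prefix
dest = ipaddress.ip_address("10.28.200.9")         # a host in another subnet

# With /24 the destination is remote, so traffic goes to the gateway.
print(dest in correct.network)  # False -> send via default gateway
# With /16 the VM believes the destination is local, ARPs for it,
# gets no reply, and the traffic blackholes instead of being routed.
print(dest in wrong.network)    # True -> ARP directly, never use gateway
```

Hosts that genuinely share the /24 still answer ARP, which is exactly why the VM "works from nearby hosts" while appearing dead to everyone else.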
The impact is operationally expensive in telecom because partial reachability doesn’t trip every alarm. Some call flows succeed while others fail, retries increase load, and the incident surface grows. The benefit of getting the network primitives right is predictability: when IP/subnet/gateway and routing are consistent, service rollouts behave consistently. The limitation is that virtualization can hide complexity—there may be a virtual switch or mis-tagged VLAN between the VM and the physical network—so you must be disciplined about checking each layer rather than assuming “VM up” equals “service reachable.”
[[flowchart-placeholder]]
What to remember when you’re on-call
IP, subnets, routing, and DNS are not separate trivia topics—they are one system that decides whether clients can locate and reach services reliably. Keep your mental model tight:
- IP + subnet decide what a host thinks is local and where it sends everything else (the gateway).
- Routing decides the path across network boundaries, using longest-prefix match and policy.
- DNS decides which IP a client attempts in the first place, with caching that can delay or fragment changes.
Now that the foundation is in place, we’ll move into Telecom Components and Traffic Flows [25 minutes].