Observability-Driven Performance Investigation
Using trace and log analysis to eliminate hidden inefficiencies in distributed APIs.
"Hope is not a strategy. Engineering reliability requires measuring it."
The Theory
Google’s “Site Reliability Engineering” (SRE) book describes a hierarchy of service reliability with monitoring at its base: you cannot improve what you cannot measure. Observability tools such as Dynatrace (distributed tracing) and Sumo Logic (log aggregation) provide the visibility needed to diagnose issues that traditional metrics miss.
The diagram above shows the investigation flow: from detecting duplicate requests in the GraphQL Federation layer, through trace and log analysis, to the final fix that reduced system requests by 10-12%.
The Problem
At Kiwibank, the team tracked average request latency per endpoint to identify performance bottlenecks. While working on a performance improvement project, we noticed unusually high request volumes that didn’t correlate with user traffic patterns.
The Investigation Challenge:
- GraphQL Federation architecture meant requests traversed multiple services
- Traditional metrics showed elevated load, but couldn’t pinpoint the source
- The system appeared healthy, yet infrastructure costs were higher than expected
- Needed to trace individual requests across the federated API boundary
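Tracing a single request across a federated boundary only works if every hop forwards a correlation identifier. A minimal sketch of that propagation rule; the `x-request-id` header name is an illustrative assumption (tracing platforms like Dynatrace typically use their own propagation headers, such as the W3C `traceparent`):

```python
import uuid

# Hypothetical header name -- stands in for whatever correlation header
# your tracing platform propagates (e.g. W3C "traceparent").
CORRELATION_HEADER = "x-request-id"

def outgoing_headers(incoming_headers: dict) -> dict:
    """Build headers for a downstream call: forward the caller's
    correlation ID, minting a new one only at the edge of the system."""
    request_id = incoming_headers.get(CORRELATION_HEADER) or str(uuid.uuid4())
    return {CORRELATION_HEADER: request_id}
```

With this in place, every log line and trace span for one user request shares the same ID, which is what makes the duplicate-detection steps below possible.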
The Investigation
As part of the team, I investigated duplicate requests produced by the GraphQL Federation API layer.
The Approach:
- Trace Analysis (Dynatrace): Analyzed distributed traces to follow request paths across federated services, identifying where duplication occurred.
- Log Analysis (Sumo Logic): Correlated log patterns to confirm which specific GraphQL API was responsible for the duplicated behavior.
- Root Cause Identification: Pinpointed the exact service and code path causing unnecessary duplicate downstream calls.
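The log-correlation step can be approximated outside any particular tool: export the relevant log entries and group them by trace ID and operation name. A minimal sketch, assuming each entry has already been parsed into a dict with hypothetical `trace_id` and `operation` fields:

```python
from collections import Counter

def find_duplicates(log_entries: list[dict]) -> dict:
    """Count downstream calls per (trace_id, operation) pair; any pair
    seen more than once within a single trace is a candidate duplicate."""
    counts = Counter((e["trace_id"], e["operation"]) for e in log_entries)
    return {key: n for key, n in counts.items() if n > 1}
```

Any pair this surfaces is worth opening in the trace view to see which federated service issued the repeated call.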
Impact Metrics
| Metric | Before | After |
|---|---|---|
| System Request Volume | Baseline | 10-12% reduction |
| Infrastructure Cost | Elevated | Reduced proportionally |
| Root Cause Visibility | Limited | Full trace correlation |
When To Use This Approach
Combine trace and log analysis when:
- Request volumes don’t match expected user traffic
- Distributed systems (microservices, federation) hide inefficiencies
- You suspect duplicate or unnecessary internal calls
- Infrastructure costs are higher than traffic would suggest
Key investigation techniques:
- Trace analysis to follow request paths across service boundaries
- Log correlation to confirm patterns across multiple services
- Compare request IDs to identify unexpected duplication
- Measure before/after to quantify the impact of fixes
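The last technique, quantifying impact, reduces to comparing request counts over equal-length windows before and after the fix. A minimal sketch; the counts below are illustrative, not the team's actual numbers:

```python
def percent_reduction(before: int, after: int) -> float:
    """Percentage drop in request volume between two comparable windows."""
    return (before - after) / before * 100

# Illustrative weekly request counts (not real figures):
weekly_before = 1_000_000
weekly_after = 890_000
print(f"{percent_reduction(weekly_before, weekly_after):.1f}% fewer requests")
```

Comparing like-for-like windows (same weekdays, similar traffic) matters here, since raw counts also move with user demand.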
The Outcome
- Request Reduction: After the fix, system-wide request volume fell by 10-12%, a meaningful saving in compute and infrastructure cost.
- Methodology: Demonstrated the value of combining Dynatrace traces with Sumo Logic logs for cross-service investigation.
- Team Impact: Contributed to improving average request latency per endpoint as part of the broader performance project.