Observability-Driven Performance Investigation
Using trace and log analysis to eliminate hidden inefficiencies in distributed APIs.
"Hope is not a strategy. Engineering reliability requires measuring it."
The Theory
Google’s “Site Reliability Engineering” (SRE) book describes a hierarchy of service reliability with monitoring at its base: you cannot improve what you cannot measure. Observability tools such as Dynatrace (distributed tracing) and Sumo Logic (log aggregation) provide the visibility needed to diagnose issues that traditional metrics miss.
The diagram above shows the investigation flow: from detecting duplicate requests in the GraphQL Federation layer, through trace and log analysis, to the final fix that reduced system requests by 10-12%.
The Problem
At Kiwibank, the team tracked average request latency per endpoint to identify performance bottlenecks. While working on a performance improvement project, we noticed unusually high request volumes that didn’t correlate with user traffic patterns.
The Investigation Challenge:
- GraphQL Federation architecture meant requests traversed multiple services
- Traditional metrics showed elevated load, but couldn’t pinpoint the source
- The system appeared healthy, yet infrastructure costs were higher than expected
- Needed to trace individual requests across the federated API boundary
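Tracing a single request across a federated boundary only works if every hop forwards a correlation identifier. A minimal sketch of that propagation rule; the `x-request-id` header name is an illustrative assumption (tracing platforms like Dynatrace typically use their own propagation headers, such as the W3C `traceparent`):

```python
import uuid

# Hypothetical header name -- stands in for whatever correlation header
# your tracing platform propagates (e.g. W3C "traceparent").
CORRELATION_HEADER = "x-request-id"

def outgoing_headers(incoming_headers: dict) -> dict:
    """Build headers for a downstream call: forward the caller's
    correlation ID, minting a new one only at the edge of the system."""
    request_id = incoming_headers.get(CORRELATION_HEADER) or str(uuid.uuid4())
    return {CORRELATION_HEADER: request_id}
```

With this in place, every log line and trace span for one user request shares the same ID, which is what makes the duplicate-detection steps below possible.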
The Investigation
As part of the team, I investigated duplicate requests produced by the GraphQL Federation API layer.
The Approach:
- Trace Analysis (Dynatrace): Analyzed distributed traces to follow request paths across federated services, identifying where duplication occurred.
- Log Analysis (Sumo Logic): Correlated log patterns to confirm which specific GraphQL API was responsible for the duplicated behavior.
- Root Cause Identification: Pinpointed the exact service and code path causing unnecessary duplicate downstream calls.
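The log-correlation step can be approximated outside any particular tool: export the relevant log entries and group them by trace ID and operation name. A minimal sketch, assuming each entry has already been parsed into a dict with hypothetical `trace_id` and `operation` fields:

```python
from collections import Counter

def find_duplicates(log_entries: list[dict]) -> dict:
    """Count downstream calls per (trace_id, operation) pair; any pair
    seen more than once within a single trace is a candidate duplicate."""
    counts = Counter((e["trace_id"], e["operation"]) for e in log_entries)
    return {key: n for key, n in counts.items() if n > 1}
```

Any pair this surfaces is worth opening in the trace view to see which federated service issued the repeated call.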
Impact Metrics
| Metric | Before | After |
|---|---|---|
| System Request Volume | Baseline | 10-12% reduction |
| Infrastructure Cost | Elevated | Reduced proportionally |
| Root Cause Visibility | Limited | Full trace correlation |
When To Use This Approach
Combine trace and log analysis when:
- Request volumes don’t match expected user traffic
- Distributed systems (microservices, federation) hide inefficiencies
- You suspect duplicate or unnecessary internal calls
- Infrastructure costs are higher than traffic would suggest
Key investigation techniques:
- Trace analysis to follow request paths across service boundaries
- Log correlation to confirm patterns across multiple services
- Compare request IDs to identify unexpected duplication
- Measure before/after to quantify the impact of fixes
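The last technique, quantifying impact, reduces to comparing request counts over equal-length windows before and after the fix. A minimal sketch; the counts below are illustrative, not the team's actual numbers:

```python
def percent_reduction(before: int, after: int) -> float:
    """Percentage drop in request volume between two comparable windows."""
    return (before - after) / before * 100

# Illustrative weekly request counts (not real figures):
weekly_before = 1_000_000
weekly_after = 890_000
print(f"{percent_reduction(weekly_before, weekly_after):.1f}% fewer requests")
```

Comparing like-for-like windows (same weekdays, similar traffic) matters here, since raw counts also move with user demand.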
The Outcome
- Request Reduction: After the fix, system-wide request volume fell by 10-12%, a meaningful saving in compute and infrastructure cost.
- Methodology: Demonstrated the value of combining Dynatrace traces with Sumo Logic logs for cross-service investigation.
- Team Impact: Contributed to improving average request latency per endpoint as part of the broader performance project.