"Hope is not a strategy. Engineering reliability requires measuring it."
— Google, Site Reliability Engineering
[Architecture diagram: Observability-Driven Performance Investigation]

The Theory

Google’s “Site Reliability Engineering” (SRE) handbook defines a hierarchy of service reliability and places monitoring at its base: you cannot improve what you cannot measure. Observability tools such as Dynatrace (distributed tracing) and Sumo Logic (log aggregation) provide the visibility needed to diagnose issues that aggregate metrics alone miss.

The diagram above shows the investigation flow: from detecting duplicate requests in the GraphQL Federation layer, through trace and log analysis, to the final fix that reduced system requests by 10-12%.

The Problem

At Kiwibank, the team tracked average request speed per endpoint to identify performance bottlenecks. While working on a performance improvement project, we noticed unusually high request volumes that didn’t correlate with user traffic patterns.

The Investigation Challenge:

  • GraphQL Federation architecture meant requests traversed multiple services
  • Traditional metrics showed elevated load, but couldn’t pinpoint the source
  • The system appeared healthy, yet infrastructure costs were higher than expected
  • Needed to trace individual requests across the federated API boundary
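Tracing individual requests across a federated boundary depends on every hop forwarding the same correlation ID. A minimal sketch of that propagation, assuming a hypothetical `x-trace-id` header (Dynatrace uses its own trace headers in practice):

```python
import uuid

# Hypothetical sketch: attach a correlation ID at the gateway and forward it
# on every downstream call so logs and traces can be stitched together later.
def make_headers(incoming_headers: dict) -> dict:
    """Reuse the caller's trace ID if present, otherwise mint a new one."""
    trace_id = incoming_headers.get("x-trace-id") or uuid.uuid4().hex
    return {**incoming_headers, "x-trace-id": trace_id}

gateway_headers = make_headers({})          # new trace minted at the edge
downstream = make_headers(gateway_headers)  # same trace propagated inward
assert gateway_headers["x-trace-id"] == downstream["x-trace-id"]
```

Because every service reuses the incoming ID, a single search for that ID later returns the full cross-service path of one request.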

The Investigation

As part of the team, I investigated duplicate requests produced by the GraphQL Federation API layer.

The Approach:

  1. Trace Analysis (Dynatrace): Analyzed distributed traces to follow request paths across federated services, identifying where duplication occurred.
  2. Log Analysis (Sumo Logic): Correlated log patterns to confirm which specific GraphQL API was responsible for the duplicated behavior.
  3. Root Cause Identification: Pinpointed the exact service and code path causing unnecessary duplicate downstream calls.
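The log-correlation step can be sketched as grouping exported log records by trace ID and downstream operation; any combination that appears more than once within a single trace is a candidate duplicate. The field names and records below are illustrative, not actual Sumo Logic output:

```python
from collections import Counter

# Illustrative log records as they might look after export from a log
# aggregator; trace_id/service/operation are assumed field names.
logs = [
    {"trace_id": "t1", "service": "accounts-api", "operation": "getBalance"},
    {"trace_id": "t1", "service": "accounts-api", "operation": "getBalance"},  # duplicate
    {"trace_id": "t1", "service": "payments-api", "operation": "listPayments"},
    {"trace_id": "t2", "service": "accounts-api", "operation": "getBalance"},
]

# Count downstream calls per (trace, service, operation); a count > 1 inside
# one trace points at duplication introduced by the federation layer.
calls = Counter((e["trace_id"], e["service"], e["operation"]) for e in logs)
duplicates = {key: n for key, n in calls.items() if n > 1}
print(duplicates)  # {('t1', 'accounts-api', 'getBalance'): 2}
```

Running the same aggregation over a large log window shows not just that duplication exists, but which service and operation produce it.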

Impact Metrics

Metric                | Before   | After
--------------------- | -------- | ----------------------
System Request Volume | Baseline | 10-12% reduction
Infrastructure Cost   | Elevated | Reduced proportionally
Root Cause Visibility | Limited  | Full trace correlation

When To Use This Approach

Combine trace and log analysis when:

  • Request volumes don’t match expected user traffic
  • Distributed systems (microservices, federation) hide inefficiencies
  • You suspect duplicate or unnecessary internal calls
  • Infrastructure costs are higher than traffic would suggest

Key investigation techniques:

  • Trace analysis to follow request paths across service boundaries
  • Log correlation to confirm patterns across multiple services
  • Compare request IDs to identify unexpected duplication
  • Measure before/after to quantify the impact of fixes
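The before/after measurement is simple arithmetic once you have request counts over comparable windows; the numbers here are illustrative, not the original data:

```python
# Quantify the impact of a fix from request counts sampled over comparable
# windows (e.g. the week before and the week after deployment).
before = 1_000_000   # requests in the window before the fix (illustrative)
after = 890_000      # requests in the window after the fix (illustrative)

reduction = (before - after) * 100 / before
print(f"{reduction:.1f}% fewer requests")  # 11.0% fewer requests
```

Comparing like-for-like windows (same weekday mix, no traffic anomalies) matters more than the arithmetic: an uncontrolled window can hide or exaggerate the effect of the fix.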

The Outcome

  • Request Reduction: After the fix, system-wide requests dropped by 10-12%, a significant saving in computation and infrastructure cost.
  • Methodology: Demonstrated the value of combining Dynatrace traces with Sumo Logic logs for cross-service investigation.
  • Team Impact: Contributed to improving average request speed per endpoint as part of the broader performance project.
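One common way to eliminate duplicate downstream calls in a gateway or federation layer is request-scoped memoization, in the spirit of a DataLoader cache. The sketch below is a hypothetical illustration of the technique, not the actual Kiwibank fix:

```python
# Hypothetical sketch: memoize downstream fetches for the lifetime of a
# single request, so two resolvers asking for the same key trigger only
# one downstream call. Names are illustrative.
class RequestScopedFetcher:
    def __init__(self, fetch):
        self._fetch = fetch   # the real downstream call
        self._cache = {}      # per-request cache, discarded with the request
        self.calls = 0        # downstream calls actually made

    def get(self, key):
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._fetch(key)
        return self._cache[key]

fetcher = RequestScopedFetcher(lambda k: f"result-for-{k}")
fetcher.get("account:42")
fetcher.get("account:42")  # served from cache, no second downstream call
print(fetcher.calls)  # 1
```

A new fetcher is constructed per incoming request, so the cache never serves stale data across requests; it only collapses duplicates within one.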