"The key insight is that derived data can be recomputed from the original source, which means it can be optimized for read patterns."
— Martin Kleppmann, Designing Data-Intensive Applications
[Architecture diagram: Decoupled Data Processing]

The Theory

In “Designing Data-Intensive Applications”, Martin Kleppmann describes how derived data systems can be optimized independently of source data. By precomputing results and storing them in formats optimized for specific access patterns, you avoid expensive transformations at request time.

As illustrated in the diagram above, the key transformation is moving from on-request XML parsing to background preprocessing with SOLR indexing.

The Problem

At Landcare Research, the BioTaNZ platform stored biological specimen data as deeply nested XML documents (4-5 levels deep). When researchers requested CSV exports, the system would:

  1. Parse XML documents on-the-fly
  2. Transform nested structures into flat CSV rows
  3. Stream results to the client
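The on-request path above can be sketched as follows. This is a minimal, hypothetical reconstruction — the element names and the sample document are invented for illustration; the real BioTaNZ schema was 4-5 levels deep:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical nested specimen XML (the real schema was far deeper).
SAMPLE_XML = """
<specimens>
  <specimen id="s1">
    <taxon><genus>Hebe</genus><species>salicifolia</species></taxon>
    <collection><site><region>Canterbury</region></site></collection>
  </specimen>
</specimens>
"""

def export_csv_on_request(xml_text: str) -> str:
    """Parse nested XML and flatten it to CSV rows -- runs on EVERY request."""
    root = ET.fromstring(xml_text)       # expensive parse, repeated per client
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["id", "genus", "species", "region"])
    for sp in root.findall("specimen"):  # deep traversal of nested structure
        writer.writerow([
            sp.get("id"),
            sp.findtext("taxon/genus"),
            sp.findtext("taxon/species"),
            sp.findtext("collection/site/region"),
        ])
    return buf.getvalue()
```

The cost is hidden in the first two steps: every concurrent download repeats the same parse and traversal, so CPU scales with requests rather than with data.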

The Bottleneck:

  • Each download request triggered expensive XML parsing
  • Nested document traversal consumed significant CPU
  • System could only handle ~300 concurrent download units
  • Researchers experienced timeouts on large exports

The Solution

I decoupled document transformation from client requests by introducing background preprocessing:

  1. Automated Backend Processing: When documents arrived, a background job converted XML to CSV format immediately—not on client request.
  2. SOLR Document Storage: Preprocessed data indexed in SOLR for fast retrieval with automated updates when source documents changed.
  3. Direct CSV Delivery: Client requests now served preprocessed CSV files instead of triggering on-the-fly transformation.
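The three steps above can be sketched with an in-memory store standing in for the SOLR index (the document id and function names here are illustrative assumptions, not the production code):

```python
import csv
import io
import xml.etree.ElementTree as ET

# Stand-in for the SOLR index: document id -> preprocessed CSV.
# In production this would be a SOLR core queried over HTTP.
preprocessed_store: dict[str, str] = {}

def flatten_to_csv(xml_text: str) -> str:
    """Flatten nested specimen XML to CSV (same transform as before)."""
    root = ET.fromstring(xml_text)
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["id", "genus", "species"])
    for sp in root.findall("specimen"):
        writer.writerow([
            sp.get("id"),
            sp.findtext("taxon/genus"),
            sp.findtext("taxon/species"),
        ])
    return buf.getvalue()

def on_document_arrival(doc_id: str, xml_text: str) -> None:
    """Background job: transform once, on arrival -- not per request."""
    preprocessed_store[doc_id] = flatten_to_csv(xml_text)

def serve_download(doc_id: str) -> str:
    """Client request path: a cheap lookup, no XML parsing."""
    return preprocessed_store[doc_id]
```

The expensive work now happens exactly once per document, when it arrives; every subsequent download is a lookup.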

Impact Metrics

| Metric              | Before     | After             |
|---------------------|------------|-------------------|
| Download Throughput | 300 units  | 30,000 units      |
| Processing Model    | On-request | Background        |
| User Experience     | Timeouts   | Instant downloads |

When To Use This Approach

Decouple processing from requests when:

  • Export/download operations involve expensive transformations
  • Source data format differs significantly from delivery format (e.g., XML → CSV)
  • Users experience timeouts on data-heavy operations
  • The same transformation runs repeatedly for different clients

Key implementation decisions:

  • Process on data arrival, not on data request
  • Use search indexes (SOLR, Elasticsearch) for fast document retrieval
  • Keep preprocessed data in sync with source via automated jobs
  • Design for eventual consistency—slight delay is acceptable for massive throughput gains
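One way to keep preprocessed data in sync while tolerating eventual consistency is a content-hash check in the sync job, so only changed documents are re-flattened. This is a sketch under assumed names, not the actual BioTaNZ sync logic:

```python
import hashlib

# Hypothetical sync-job state: document id -> last-seen content hash.
source_versions: dict[str, str] = {}

def needs_reprocessing(doc_id: str, xml_text: str) -> bool:
    """Return True only when the source document's content has changed."""
    digest = hashlib.sha256(xml_text.encode("utf-8")).hexdigest()
    if source_versions.get(doc_id) == digest:
        return False                    # preprocessed copy is already current
    source_versions[doc_id] = digest    # record new version; caller re-flattens
    return True
```

Between the source update and the next sync run, clients may briefly see the previous export — the acceptable delay that buys the throughput gain.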

The Outcome

  • 100x Throughput: Download capacity increased from 300 to 30,000 units
  • Eliminated Timeouts: Researchers could export large datasets reliably
  • Reduced Server Load: CPU-intensive XML parsing moved to background processing