"The key insight is that derived data can be recomputed from the original source, which means it can be optimized for read patterns."
— Martin Kleppmann, Designing Data-Intensive Applications
[Architecture diagram: Decoupled Data Processing]

The Theory

In “Designing Data-Intensive Applications”, Martin Kleppmann describes how derived data systems can be optimized independently of source data. By precomputing results and storing them in formats optimized for specific access patterns, you avoid expensive transformations at request time.

As illustrated in the diagram above, the key transformation is moving from on-request XML parsing to background preprocessing with SOLR indexing.

The Problem

At Landcare Research, the BioTaNZ platform stored biological specimen data as deeply nested XML documents (4-5 levels deep). When researchers requested CSV exports, the system would:

  1. Parse XML documents on-the-fly
  2. Transform nested structures into flat CSV rows
  3. Stream results to the client
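The on-request path above can be sketched as follows. This is a minimal, hypothetical reconstruction — the element names and the sample document are invented for illustration; the real BioTaNZ schema was 4-5 levels deep:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical nested specimen XML (the real schema was far deeper).
SAMPLE_XML = """
<specimens>
  <specimen id="s1">
    <taxon><genus>Hebe</genus><species>salicifolia</species></taxon>
    <collection><site><region>Canterbury</region></site></collection>
  </specimen>
</specimens>
"""

def export_csv_on_request(xml_text: str) -> str:
    """Parse nested XML and flatten it to CSV rows -- runs on EVERY request."""
    root = ET.fromstring(xml_text)       # expensive parse, repeated per client
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["id", "genus", "species", "region"])
    for sp in root.findall("specimen"):  # deep traversal of nested structure
        writer.writerow([
            sp.get("id"),
            sp.findtext("taxon/genus"),
            sp.findtext("taxon/species"),
            sp.findtext("collection/site/region"),
        ])
    return buf.getvalue()
```

The cost is hidden in the first two steps: every concurrent download repeats the same parse and traversal, so CPU scales with requests rather than with data.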

The Bottleneck:

  • Each download request triggered expensive XML parsing
  • Nested document traversal consumed significant CPU
  • System could only handle ~300 concurrent download units
  • Researchers experienced timeouts on large exports

The Solution

I decoupled document transformation from client requests by introducing background preprocessing:

  1. Automated Backend Processing: When documents arrived, a background job converted XML to CSV format immediately—not on client request.
  2. SOLR Document Storage: Preprocessed data indexed in SOLR for fast retrieval with automated updates when source documents changed.
  3. Direct CSV Delivery: Client requests now served preprocessed CSV files instead of triggering on-the-fly transformation.
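The three steps above can be sketched with an in-memory store standing in for the SOLR index (the document id and function names here are illustrative assumptions, not the production code):

```python
import csv
import io
import xml.etree.ElementTree as ET

# Stand-in for the SOLR index: document id -> preprocessed CSV.
# In production this would be a SOLR core queried over HTTP.
preprocessed_store: dict[str, str] = {}

def flatten_to_csv(xml_text: str) -> str:
    """Flatten nested specimen XML to CSV (same transform as before)."""
    root = ET.fromstring(xml_text)
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["id", "genus", "species"])
    for sp in root.findall("specimen"):
        writer.writerow([
            sp.get("id"),
            sp.findtext("taxon/genus"),
            sp.findtext("taxon/species"),
        ])
    return buf.getvalue()

def on_document_arrival(doc_id: str, xml_text: str) -> None:
    """Background job: transform once, on arrival -- not per request."""
    preprocessed_store[doc_id] = flatten_to_csv(xml_text)

def serve_download(doc_id: str) -> str:
    """Client request path: a cheap lookup, no XML parsing."""
    return preprocessed_store[doc_id]
```

The expensive work now happens exactly once per document, when it arrives; every subsequent download is a lookup.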

Impact Metrics

| Metric              | Before     | After             |
|---------------------|------------|-------------------|
| Download Throughput | 300 units  | 30,000 units      |
| Processing Model    | On-request | Background        |
| User Experience     | Timeouts   | Instant downloads |

When To Use This Approach

Decouple processing from requests when:

  • Export/download operations involve expensive transformations
  • Source data format differs significantly from delivery format (e.g., XML → CSV)
  • Users experience timeouts on data-heavy operations
  • The same transformation runs repeatedly for different clients

Key implementation decisions:

  • Process on data arrival, not on data request
  • Use search indexes (SOLR, Elasticsearch) for fast document retrieval
  • Keep preprocessed data in sync with source via automated jobs
  • Design for eventual consistency—slight delay is acceptable for massive throughput gains
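One way to keep preprocessed data in sync while tolerating eventual consistency is a content-hash check in the sync job, so only changed documents are re-flattened. This is a sketch under assumed names, not the actual BioTaNZ sync logic:

```python
import hashlib

# Hypothetical sync-job state: document id -> last-seen content hash.
source_versions: dict[str, str] = {}

def needs_reprocessing(doc_id: str, xml_text: str) -> bool:
    """Return True only when the source document's content has changed."""
    digest = hashlib.sha256(xml_text.encode("utf-8")).hexdigest()
    if source_versions.get(doc_id) == digest:
        return False                    # preprocessed copy is already current
    source_versions[doc_id] = digest    # record new version; caller re-flattens
    return True
```

Between the source update and the next sync run, clients may briefly see the previous export — the acceptable delay that buys the throughput gain.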

The Outcome

  • 100x Throughput: Download capacity increased from 300 to 30,000 units
  • Eliminated Timeouts: Researchers could export large datasets reliably
  • Reduced Server Load: CPU-intensive XML parsing moved to background processing