"In a data system, many things can go wrong. The key is to build systems that are resilient to the failures that will inevitably happen."
— Martin Kleppmann, Designing Data-Intensive Applications
Architecture diagram for Phishing Detection Pipeline

The Theory

In “Designing Data-Intensive Applications”, Martin Kleppmann describes how modern systems must handle derived data, where the same information is transformed and stored in multiple formats to serve different access patterns. A phishing detection platform exemplifies this: raw emails must be parsed, enriched with external data, scored using multiple ML signals, and served to analyst dashboards.

The pipeline stages shown in the diagram above illustrate this flow: from email ingestion through data enrichment, decision tree scoring with Google Vision and sentiment analysis, to the MongoDB-backed dashboard.

The Architecture

At Rapid7, the Cyber Threat Intelligence platform used a distributed microservices architecture deployed on Kubernetes, with the number of workers scaled dynamically based on incoming job volume.
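
As a rough illustration of scaling workers with job volume, the sketch below applies the replica formula documented for the Kubernetes Horizontal Pod Autoscaler: desired = ceil(current * currentMetric / targetMetric). The source does not say which autoscaling mechanism was used, so the metric (jobs per worker), the bounds, and the function name are assumptions made for the example.

```python
import math


def desired_workers(
    current_workers: int,
    jobs_per_worker: float,
    target_jobs_per_worker: float,
    min_workers: int = 1,   # hypothetical lower bound
    max_workers: int = 50,  # hypothetical upper bound
) -> int:
    """Return the worker count suggested by the HPA-style ratio formula."""
    if current_workers < 1:
        raise ValueError("current_workers must be at least 1")
    if target_jobs_per_worker <= 0:
        raise ValueError("target_jobs_per_worker must be positive")

    # Scale the current replica count by how far the observed load
    # is from the target load, rounding up.
    raw = math.ceil(current_workers * jobs_per_worker / target_jobs_per_worker)

    # Clamp to the configured bounds so a spike cannot scale without limit.
    return max(min_workers, min(max_workers, raw))
```

With 4 workers each handling 30 queued jobs against a target of 10, the function asks for 12 workers; when load drops well below target it falls back to the minimum.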

Pipeline Stages:

  1. Email Parsing: Microservices responsible for parsing incoming emails and extracting content, attachments, and metadata.
  2. Data Enrichment: Industry-specific data aggregation including domain zone information and IP-related statistics.
  3. Phishing Scoring: Python-based decision tree scoring system combining multiple signals to calculate phishing likelihood.
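
To make the first stage concrete, here is a minimal sketch of an email parsing step using only Python's standard `email` package. It is not the platform's actual parser; the function names and the shape of the returned dictionary are assumptions chosen for illustration.

```python
from email import message_from_string, policy
from email.message import EmailMessage


def parse_email(raw: str) -> dict:
    """Extract subject, sender, plain-text body, and attachment names."""
    # policy.default makes the parser return an EmailMessage,
    # which provides get_body() and iter_attachments().
    msg = message_from_string(raw, policy=policy.default)

    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""

    return {
        "subject": str(msg["subject"] or ""),
        "sender": str(msg["from"] or ""),
        "body": body.strip(),
        "attachments": [part.get_filename() for part in msg.iter_attachments()],
    }


def build_sample() -> str:
    """Build a small message with one image attachment for demonstration."""
    m = EmailMessage()
    m["From"] = "billing@example.com"
    m["Subject"] = "Urgent Action Required"
    m.set_content("Act now! Your account will be suspended.")
    m.add_attachment(b"\x89PNG", maintype="image", subtype="png", filename="logo.png")
    return m.as_string()
```

The parsed dictionary is the kind of record that later stages (enrichment, scoring) could consume without needing to know anything about MIME structure.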

Scoring Signals

Image Analysis (Google Vision API):

  • Analyzed image content attached to emails
  • Compared against known brand identities to detect spoofing attempts
  • Scored similarity to legitimate brand assets as a phishing indicator
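
One way to picture the brand comparison is sketched below. It does not call the Google Vision API; it assumes an upstream step has already produced a list of detected brands with confidence values, and it simply checks whether the sender's domain belongs to each detected brand. The brand registry, the `LogoDetection` type, and the threshold are all hypothetical, and the real system scored asset similarity rather than this simplified mismatch check.

```python
from dataclasses import dataclass

# Hypothetical registry mapping a brand to the domains it legitimately sends from.
KNOWN_BRANDS = {
    "paypal": {"paypal.com"},
    "microsoft": {"microsoft.com", "office.com"},
}


@dataclass(frozen=True)
class LogoDetection:
    brand: str         # brand name reported by the image analysis step
    confidence: float  # 0.0 to 1.0


def sender_domain(address: str) -> str:
    """Return the lower-cased domain part of an email address."""
    return address.rsplit("@", 1)[-1].lower()


def domain_belongs_to(domain: str, legit_domains: set) -> bool:
    """True if the domain is a legitimate domain or one of its subdomains."""
    return any(domain == d or domain.endswith("." + d) for d in legit_domains)


def spoofed_brands(detections, sender: str, min_confidence: float = 0.7) -> list:
    """List known brands seen in images whose domains do not match the sender."""
    domain = sender_domain(sender)
    flagged = []
    for det in detections:
        legit = KNOWN_BRANDS.get(det.brand.lower())
        if legit is None or det.confidence < min_confidence:
            continue  # unknown brand or weak detection: no signal
        if not domain_belongs_to(domain, legit):
            flagged.append(det.brand.lower())
    return flagged
```

A PayPal logo in a message sent from `paypa1-secure.com` would be flagged, while the same logo from `mail.paypal.com` would not.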

Text Sentiment Analysis:

  • Detected time pressure language (“Act Now!”, “Urgent Action Required”)
  • Identified manipulative techniques common in social engineering
  • Scored psychological manipulation indicators
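
The time pressure signal can be approximated with plain pattern matching, as in the sketch below. This is a deliberately simple keyword-based stand-in, not the sentiment analysis the platform used; the phrase list and the saturation value are assumptions.

```python
import re

# Hypothetical phrase list; a production system would maintain a larger, curated set.
URGENCY_PATTERNS = [
    r"\bact now\b",
    r"\burgent action required\b",
    r"\bimmediately\b",
    r"\bwithin \d+ hours?\b",
    r"\baccount (?:will be )?suspended\b",
]
_COMPILED = [re.compile(p, re.IGNORECASE) for p in URGENCY_PATTERNS]


def urgency_hits(text: str) -> list:
    """Return every matched urgency phrase, lower-cased, grouped by pattern order."""
    return [m.group(0).lower() for pat in _COMPILED for m in pat.finditer(text)]


def urgency_score(text: str, saturation: int = 3) -> float:
    """Map the number of hits to a 0.0-1.0 score that saturates at `saturation` hits."""
    return min(len(urgency_hits(text)), saturation) / saturation
```

Returning the matched phrases alongside the score keeps the signal explainable: an analyst can see exactly which wording triggered it.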

My Contributions

Dashboard Query Optimization:

  • Identified performance bottleneck in main dashboard queries
  • Optimized MongoDB queries and added missing indexes
  • Result: Dashboard full content load improved from 12 seconds → 3.4 seconds
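
The source does not list the actual queries or indexes, so the following is only a sketch of the general technique: a compound index whose leading field matches the equality filter and whose second field matches the sort and range condition. The field names `status` and `received_at` are hypothetical. The `collection` argument is expected to behave like a PyMongo `Collection` (`create_index`, and `find(...).sort(...).limit(...)`); nothing here connects to a database.

```python
from datetime import datetime, timezone

# Same integer values as pymongo.ASCENDING and pymongo.DESCENDING.
ASCENDING, DESCENDING = 1, -1

# Hypothetical compound index: equality field first, then the sort/range field.
DASHBOARD_INDEX = [("status", ASCENDING), ("received_at", DESCENDING)]


def dashboard_filter(status: str, since: datetime) -> dict:
    """Build a filter that the compound index above can serve."""
    return {"status": status, "received_at": {"$gte": since}}


def ensure_dashboard_index(collection):
    """Create the index if it does not already exist."""
    return collection.create_index(DASHBOARD_INDEX)


def load_dashboard_page(collection, status: str, since: datetime, page_size: int = 50) -> list:
    """Fetch the newest matching documents, limited to one page."""
    cursor = (
        collection.find(dashboard_filter(status, since))
        .sort("received_at", DESCENDING)
        .limit(page_size)
    )
    return list(cursor)
```

Against a real deployment, running the same query with `explain()` would show whether the planner chose an index scan (IXSCAN) or fell back to a collection scan (COLLSCAN).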

Scoring System Maintenance:

  • Participated in Python-based decision tree scoring system updates
  • Modified weights and logic of certain phishing scoring functions
  • Maintained and evolved the scoring rules based on emerging threat patterns
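
To show what adjusting weights and logic can look like in practice, here is a small hand-written decision tree over two signals. The branch structure, signal names, and weight values are invented for the example and do not reflect the production rules; the point is that weights live in one dictionary and each branch records why it fired.

```python
# Hypothetical weights; tuning the scorer means editing these values.
DEFAULT_WEIGHTS = {
    "brand_spoof": 0.5,
    "urgency_with_spoof": 0.25,
    "urgency_alone": 0.25,
}


def score_email(signals: dict, weights: dict = DEFAULT_WEIGHTS) -> tuple:
    """Walk a fixed decision tree and return (score, reasons).

    `signals` may contain:
      brand_spoof: bool  - image brand does not match the sender
      urgency: float     - 0.0 to 1.0 time pressure score
    """
    score = 0.0
    reasons = []
    urgent = signals.get("urgency", 0.0) >= 0.5

    if signals.get("brand_spoof", False):
        score += weights["brand_spoof"]
        reasons.append("brand imagery does not match sender")
        # Urgent wording counts for more when combined with spoofed branding.
        if urgent:
            score += weights["urgency_with_spoof"]
            reasons.append("urgent wording alongside spoofed brand")
    elif urgent:
        score += weights["urgency_alone"]
        reasons.append("urgent wording without other indicators")

    return min(score, 1.0), reasons
```

Because the reasons list mirrors the path taken through the tree, a flagged email carries its own explanation, and a weight change can be tested by passing an alternative `weights` dictionary.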

Impact Metrics

Metric               Before        After
Dashboard Load Time  12 seconds    3.4 seconds
Query Performance    Unoptimized   Indexed

When To Use This Approach

Build a multi-signal scoring pipeline when:

  • Single indicators are insufficient (legitimate emails can share traits with phishing)
  • You need explainable scores (why was this flagged?)
  • Processing volume requires horizontal scaling
  • Different analysis types (image, text, metadata) must be combined

Key architecture patterns:

  • Kubernetes with dynamic worker scaling for variable load
  • Separate microservices for parsing, enrichment, and scoring
  • MongoDB with proper indexing for dashboard query performance
  • Decision tree scoring for explainable, tunable classification

The Outcome

  • Performance: Dashboard load time reduced from 12 seconds to 3.4 seconds by rewriting MongoDB queries and adding missing indexes.
  • Scoring Accuracy: Contributed to decision tree scoring logic updates to improve phishing detection.
  • Scalability: Kubernetes infrastructure with dynamic worker scaling handled variable email ingestion volumes.