Phishing Detection Pipeline

"In a data system, many things can go wrong. The key is to build systems that are resilient to the failures that will inevitably happen."
— Martin Kleppmann, Designing Data-Intensive Applications

[Figure: Architecture diagram for the Phishing Detection Pipeline]

The Theory

In “Designing Data-Intensive Applications”, Martin Kleppmann describes how modern systems must handle derived data, where the same information is transformed and stored in multiple formats to suit different access patterns. A phishing detection platform exemplifies this: raw emails are parsed, enriched with external data, scored on multiple ML signals, and served to analyst dashboards.

The pipeline stages shown in the diagram above illustrate this flow: from email ingestion through data enrichment, decision tree scoring with Google Vision and sentiment analysis, to the MongoDB-backed dashboard.

The Architecture

At Rapid7, the Cyber Threat Intelligence platform ran as a set of distributed microservices on Kubernetes, with the number of workers scaled dynamically to match incoming job volume.

Pipeline Stages:

Email Parsing: Microservices responsible for parsing incoming emails and extracting content, attachments, and metadata.
Data Enrichment: Industry-specific data aggregation including domain zone information and IP-related statistics.
Phishing Scoring: Python-based decision tree scoring system combining multiple signals to calculate phishing likelihood.
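
The three stages above can be sketched as plain functions chained over a shared record. This is only an illustration: the `EmailRecord` fields, the function names, the `domain_info` lookup table, and the placeholder scoring rule are all assumptions made for the example, not the platform's actual code. In production each stage ran as its own microservice.

```python
from dataclasses import dataclass, field
from email import message_from_string
from email.utils import parseaddr


@dataclass
class EmailRecord:
    """Hypothetical record passed between pipeline stages."""
    sender: str = ""
    subject: str = ""
    body: str = ""
    enrichment: dict = field(default_factory=dict)
    score: float = 0.0


def parse_email(raw: str) -> EmailRecord:
    """Stage 1: extract sender, subject, and plain-text body from a raw message."""
    msg = message_from_string(raw)
    _, address = parseaddr(msg.get("From", ""))
    payload = msg.get_payload()
    body = payload if isinstance(payload, str) else ""  # multipart is skipped here
    return EmailRecord(sender=address, subject=msg.get("Subject", ""), body=body)


def enrich(record: EmailRecord, domain_info: dict) -> EmailRecord:
    """Stage 2: attach external data about the sender's domain.

    `domain_info` stands in for a real lookup service (assumption).
    """
    domain = record.sender.rpartition("@")[2].lower()
    record.enrichment = {"domain": domain, **domain_info.get(domain, {})}
    return record


def score(record: EmailRecord) -> EmailRecord:
    """Stage 3: placeholder rule; the real system combined many signals."""
    record.score = 0.8 if record.enrichment.get("recently_registered") else 0.1
    return record


raw = "From: Support <help@examp1e-bank.test>\nSubject: Urgent\n\nVerify your account now."
lookup = {"examp1e-bank.test": {"recently_registered": True}}
result = score(enrich(parse_email(raw), lookup))
```

Keeping each stage a function of its input record mirrors the service boundaries: any stage can be scaled or replaced without touching the others.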

Scoring Signals

Image Analysis (Google Vision API):

Analyzed image content attached to emails
Compared against known brand identities to detect spoofing attempts
Scored similarity to legitimate brand assets as a phishing indicator
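
One way to turn detected brand logos into a spoofing signal is to compare each detected brand against the sender's domain. The sketch below assumes the image step has already produced (brand name, confidence) pairs, for example from a logo detection service. The `BRAND_DOMAINS` table, the function name, and the 0.7 confidence cutoff are hypothetical, and the source does not describe how the platform actually performed this comparison.

```python
# Hypothetical mapping from brand name to domains it legitimately sends from.
BRAND_DOMAINS = {
    "paypal": {"paypal.com"},
    "microsoft": {"microsoft.com", "office.com"},
}


def brand_spoof_signal(detected_logos, sender_domain, min_confidence=0.7):
    """Return (brand, confidence) for the strongest detected logo whose brand
    does not match the sender's domain, or None if nothing looks suspicious.

    `detected_logos` is a list of (brand_name, confidence) pairs.
    """
    sender_domain = sender_domain.lower()
    suspicious = []
    for brand, confidence in detected_logos:
        legit = BRAND_DOMAINS.get(brand.lower())
        if legit is None or confidence < min_confidence:
            continue  # unknown brand or weak detection: no signal
        # Accept exact matches and subdomains of a legitimate domain.
        if not any(sender_domain == d or sender_domain.endswith("." + d) for d in legit):
            suspicious.append((brand, confidence))
    return max(suspicious, key=lambda item: item[1], default=None)
```

For instance, a confidently detected PayPal logo in a message from `paypa1-secure.test` would be returned as a signal, while the same logo from `mail.paypal.com` would not.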

Text Sentiment Analysis:

Detected time pressure language (“Act Now!”, “Urgent Action Required”)
Identified manipulative techniques common in social engineering
Scored psychological manipulation indicators
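
Time-pressure detection can be illustrated with simple phrase matching. This is a keyword-pattern stand-in, not the platform's actual text analysis, which the description above does not detail. The patterns and weights are hypothetical.

```python
import re

# Hypothetical phrase patterns with weights; a real system would maintain a
# much larger, regularly reviewed list or use a trained model.
URGENCY_PATTERNS = {
    r"\bact now\b": 0.4,
    r"\burgent(ly)?\b": 0.3,
    r"\bimmediately\b": 0.2,
    r"\bwithin \d+ (hours?|minutes?)\b": 0.3,
    r"\baccount (will be )?(suspended|closed|locked)\b": 0.4,
}


def urgency_score(text: str) -> tuple[float, list[str]]:
    """Return a score capped at 1.0 and the matched phrases that produced it."""
    total = 0.0
    hits = []
    for pattern, weight in URGENCY_PATTERNS.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            total += weight
            hits.append(match.group(0))
    return min(total, 1.0), hits


score, hits = urgency_score(
    "URGENT action required: act now or your account will be suspended."
)
```

Returning the matched phrases alongside the score keeps the signal explainable: an analyst can see which wording triggered it.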

My Contributions

Dashboard Query Optimization:

Identified a performance bottleneck in the main dashboard queries
Optimized the MongoDB queries and added missing indexes
Result: Dashboard full content load improved from 12 seconds to 3.4 seconds (a reduction of roughly 72%)
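
The kind of change involved can be sketched in pymongo style. The field names (`status`, `received_at`), the index name, and the query shape are hypothetical, since the actual dashboard queries and indexes are not reproduced here. The functions accept any collection object exposing `create_index` and `find`, as a pymongo collection does.

```python
# Direction constants match pymongo.ASCENDING (1) and pymongo.DESCENDING (-1).
ASCENDING, DESCENDING = 1, -1

# Hypothetical compound index: equality field first, then the sort field,
# so the filter and the sort can both be served by the index.
DASHBOARD_INDEX = [("status", ASCENDING), ("received_at", DESCENDING)]


def ensure_dashboard_index(collection):
    """Create the compound index; re-creating an identical index is a no-op."""
    return collection.create_index(DASHBOARD_INDEX, name="status_received_at")


def recent_flagged(collection, limit=50):
    """Fetch the newest flagged emails, returning only fields the dashboard shows."""
    cursor = collection.find(
        {"status": "flagged"},
        {"_id": 0, "subject": 1, "sender": 1, "score": 1, "received_at": 1},
    )
    return cursor.sort("received_at", DESCENDING).limit(limit)
```

Two ideas are at work: an index that matches both the filter and the sort, and a projection that limits each document to the fields the page renders. Calling `explain()` on the returned cursor is one way to check whether the query plan uses the index instead of a full collection scan.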

Scoring System Maintenance:

Participated in updates to the Python-based decision tree scoring system
Modified weights and logic of certain phishing scoring functions
Maintained and evolved the scoring rules based on emerging threat patterns
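
To show what tuning weights and logic can look like, here is a minimal weighted-rule scorer. It is a flat rule list, a simplified stand-in for the decision tree, and not the production code. The rule names, signal keys, weights, and the 0.6 threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    """One scoring branch: a named condition and the weight it contributes."""
    name: str
    condition: Callable[[dict], bool]
    weight: float


# Hypothetical rules and weights; tuning means editing these values or the
# conditions as new threat patterns appear.
RULES = [
    Rule(
        "brand_logo_mismatch",
        lambda s: s.get("logo_similarity", 0.0) > 0.8
        and not s.get("sender_domain_matches_brand", False),
        0.5,
    ),
    Rule("urgent_language", lambda s: s.get("urgency", 0.0) >= 0.5, 0.3),
    Rule("new_domain", lambda s: s.get("domain_age_days", 9999) < 30, 0.2),
]


def score_email(signals: dict, threshold: float = 0.6) -> dict:
    """Sum the weights of every rule that fires; keep rule names as reasons."""
    fired = [rule for rule in RULES if rule.condition(signals)]
    total = min(sum(rule.weight for rule in fired), 1.0)
    return {
        "score": round(total, 2),
        "is_phishing": total >= threshold,
        "reasons": [rule.name for rule in fired],
    }


verdict = score_email({
    "logo_similarity": 0.93,
    "sender_domain_matches_brand": False,
    "urgency": 0.7,
    "domain_age_days": 400,
})
```

Because every contribution is tied to a named rule, changing a weight has a traceable effect, and the `reasons` list answers the analyst's question of why a message was flagged.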

Impact Metrics

Metric	Before	After
Dashboard Load Time	12 seconds	3.4 seconds
Query Performance	Unoptimized	Indexed

When To Use This Approach

Build a multi-signal scoring pipeline when:

Single indicators are insufficient (legitimate emails can share traits with phishing)
You need explainable scores (why was this flagged?)
Processing volume requires horizontal scaling
Different analysis types (image, text, metadata) must be combined

Key architecture patterns:

Kubernetes with dynamic worker scaling for variable load
Separate microservices for parsing, enrichment, and scoring
MongoDB with proper indexing for dashboard query performance
Decision tree scoring for explainable, tunable classification
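
Dynamic worker scaling comes down to a small calculation: how many workers are needed for the current backlog, within fixed bounds. The sketch below assumes a queue-depth metric and hypothetical limits. It is not the platform's autoscaling configuration, and in practice a Kubernetes autoscaler would apply the result.

```python
import math


def desired_workers(queue_depth: int, jobs_per_worker: int = 100,
                    min_workers: int = 1, max_workers: int = 20) -> int:
    """Workers needed so each handles at most `jobs_per_worker` queued jobs,
    clamped to the range [min_workers, max_workers]."""
    if jobs_per_worker <= 0:
        raise ValueError("jobs_per_worker must be positive")
    needed = math.ceil(queue_depth / jobs_per_worker)
    return max(min_workers, min(needed, max_workers))
```

With these defaults, an empty queue keeps one worker running, 250 queued jobs call for three, and any backlog above 2,000 jobs is capped at twenty.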

The Outcome

Performance: Dashboard full content load reduced from 12 seconds to 3.4 seconds through MongoDB query optimization and indexing.
Scoring Accuracy: Contributed to updates of the decision tree scoring logic to improve phishing detection.
Scalability: Kubernetes infrastructure with dynamic scaling handled variable email ingestion volumes.
