Phishing Detection Pipeline
Multi-signal analysis for scoring phishing likelihood in enterprise email security.
"In a data system, many things can go wrong. The key is to build systems that are resilient to the failures that will inevitably happen."
The Theory
In “Designing Data-Intensive Applications”, Martin Kleppmann describes how modern systems must handle derived data, where the same information is transformed and stored in multiple formats for different access patterns. A phishing detection platform exemplifies this: raw emails must be parsed, enriched with external data, scored by multiple ML signals, and served to analyst dashboards.
The pipeline stages shown in the diagram above illustrate this flow: email ingestion, data enrichment, decision tree scoring that draws on Google Vision and sentiment analysis signals, and finally the MongoDB-backed dashboard.
The Architecture
At Rapid7, the Cyber Threat Intelligence platform used a distributed microservices architecture deployed on Kubernetes, with workers scaled dynamically based on incoming job volume.
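Scaling workers with job volume can be illustrated with a small sizing rule. This is a minimal sketch, not the platform's actual autoscaling logic: the function name, the `jobs_per_worker` target, and the minimum and maximum bounds are all assumptions chosen for illustration.

```python
import math


def desired_workers(
    pending_jobs: int,
    jobs_per_worker: int = 50,   # hypothetical target backlog per worker
    min_workers: int = 1,        # hypothetical lower bound
    max_workers: int = 20,       # hypothetical upper bound
) -> int:
    """Return a worker count proportional to queue depth, clamped to bounds."""
    if jobs_per_worker <= 0:
        raise ValueError("jobs_per_worker must be positive")
    # Round up so a partial batch of jobs still gets a worker.
    needed = math.ceil(max(pending_jobs, 0) / jobs_per_worker)
    return max(min_workers, min(max_workers, needed))
```

Clamping keeps at least one worker warm during quiet periods and caps cost during ingestion spikes.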
Pipeline Stages:
- Email Parsing: Microservices responsible for parsing incoming emails and extracting content, attachments, and metadata.
- Data Enrichment: Industry-specific data aggregation including domain zone information and IP-related statistics.
- Phishing Scoring: Python-based decision tree scoring system combining multiple signals to calculate phishing likelihood.
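To make the scoring stage concrete, here is a minimal sketch of a hand-written decision tree that combines several signals and records the path it took, so the result stays explainable. The signal names, thresholds, and returned scores are hypothetical and are not taken from the production system.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Signals:
    # All fields are hypothetical stand-ins for the pipeline's real signals.
    brand_similarity: float      # 0..1, from image analysis
    urgency: float               # 0..1, from text analysis
    domain_age_days: int         # from the enrichment stage
    sender_matches_brand: bool   # sender domain belongs to the detected brand


def score_email(s: Signals) -> Tuple[float, List[str]]:
    """Walk a small decision tree; return (phishing score, reasons)."""
    path: List[str] = []
    if s.brand_similarity >= 0.8:
        path.append("image closely resembles a known brand")
        if s.sender_matches_brand:
            path.append("sender domain belongs to that brand")
            return 0.10, path
        path.append("sender domain does not belong to that brand")
        if s.urgency >= 0.5:
            path.append("urgent language present")
            return 0.95, path
        return 0.80, path
    path.append("no strong brand resemblance")
    if s.domain_age_days < 30:
        path.append("sender domain registered recently")
        if s.urgency >= 0.5:
            path.append("urgent language present")
            return 0.70, path
        return 0.40, path
    return 0.05, path
```

Because each branch appends a reason, an analyst can see why an email was flagged, and tuning the system amounts to adjusting a threshold or a leaf score.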
Scoring Signals
Image Analysis (Google Vision API):
- Analyzed image content attached to emails
- Compared against known brand identities to detect spoofing attempts
- Scored similarity to legitimate brand assets as a phishing indicator
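The brand comparison step might look roughly like the sketch below. It does not call the Google Vision API; it assumes brand detections have already been extracted as `(brand, confidence)` pairs, and the `BRAND_DOMAINS` table and function name are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical mapping from brand name to domains it legitimately sends from.
BRAND_DOMAINS: Dict[str, Set[str]] = {
    "examplebank": {"examplebank.com"},
    "acme": {"acme.example"},
}


def spoof_indicator(
    detections: Iterable[Tuple[str, float]], sender_domain: str
) -> float:
    """Return the highest confidence among detected brands that the
    sender domain does not belong to (0.0 if there is no mismatch)."""
    sender = sender_domain.lower().rstrip(".")
    worst = 0.0
    for brand, confidence in detections:
        domains = BRAND_DOMAINS.get(brand.lower())
        if domains is None:
            continue  # unknown brand: no basis for a spoofing claim
        # Accept the brand's own domains and any of their subdomains.
        legit = any(sender == d or sender.endswith("." + d) for d in domains)
        if not legit:
            worst = max(worst, confidence)
    return worst
```

A strong brand match in an image is only suspicious when the sender is unrelated to that brand, which is why the check pairs the two.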
Text Sentiment Analysis:
- Detected time pressure language (“Act Now!”, “Urgent Action Required”)
- Identified manipulative techniques common in social engineering
- Scored psychological manipulation indicators
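As a simplified illustration of the time-pressure signal, the sketch below uses keyword patterns instead of a sentiment model. The pattern list and the normalization (three hits saturate the score) are assumptions for illustration only.

```python
import re

# Hypothetical patterns for time-pressure language; a real system would
# likely use a trained model or a much larger curated list.
URGENCY_PATTERNS = [
    r"\bact now\b",
    r"\burgent\b",
    r"\bimmediately\b",
    r"\bwithin \d+ (?:hours?|days?)\b",
    r"\baccount (?:will be )?(?:suspended|closed|locked)\b",
]
_COMPILED = [re.compile(p, re.IGNORECASE) for p in URGENCY_PATTERNS]


def urgency_score(text: str) -> float:
    """Return a 0..1 score based on how many distinct patterns match."""
    hits = sum(1 for pattern in _COMPILED if pattern.search(text))
    return min(1.0, hits / 3)
```

Counting distinct patterns, not total matches, avoids letting a single repeated word dominate the score.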
My Contributions
Dashboard Query Optimization:
- Identified a performance bottleneck in the main dashboard queries
- Optimized MongoDB queries and added missing indexes
- Result: Full dashboard content load time dropped from 12 seconds to 3.4 seconds
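The kind of change involved can be sketched as follows. The field names (`status`, `received_at`, `subject`, `score`) and the query shape are hypothetical; the actual dashboard schema is not described here. The functions accept any pymongo-style collection object and rely only on its `create_index`, `find`, `sort`, and `limit` methods.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical compound index: equality field first, then the sort field,
# so the filter and the sort can both be served from the index.
DASHBOARD_INDEX: List[Tuple[str, int]] = [("status", 1), ("received_at", -1)]


def ensure_dashboard_index(collection: Any) -> None:
    """Create the index once at startup; a no-op if it already exists."""
    collection.create_index(DASHBOARD_INDEX)


def load_dashboard_page(
    collection: Any, status: str, page_size: int = 50
) -> List[Dict[str, Any]]:
    """Fetch the newest emails with the given status."""
    # Project only the fields the dashboard renders, to keep documents small.
    projection = {"subject": 1, "score": 1, "received_at": 1}
    cursor = (
        collection.find({"status": status}, projection)
        .sort("received_at", -1)
        .limit(page_size)
    )
    return list(cursor)
```

Running the query through `explain()` before and after adding an index is the usual way to confirm that a collection scan has been replaced by an index scan.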
Scoring System Maintenance:
- Participated in updates to the Python-based decision tree scoring system
- Adjusted the weights and logic of selected phishing scoring functions
- Maintained and evolved the scoring rules based on emerging threat patterns
Impact Metrics
| Metric | Before | After |
|---|---|---|
| Dashboard Load Time | 12 seconds | 3.4 seconds |
| Query Performance | Unoptimized | Indexed |
When To Use This Approach
Build a multi-signal scoring pipeline when:
- Single indicators are insufficient (legitimate emails can share traits with phishing)
- You need explainable scores (why was this flagged?)
- Processing volume requires horizontal scaling
- Different analysis types (image, text, metadata) must be combined
Key architecture patterns:
- Kubernetes with dynamic worker scaling for variable load
- Separate microservices for parsing, enrichment, and scoring
- MongoDB with proper indexing for dashboard query performance
- Decision tree scoring for explainable, tunable classification
The Outcome
- Performance: Dashboard load time reduced from 12 seconds to 3.4 seconds through MongoDB query optimization and indexing.
- Scoring Accuracy: Contributed updates to the decision tree scoring logic to improve phishing detection.
- Scalability: Kubernetes infrastructure with dynamic scaling handled variable email ingestion volumes.