NLP: Financial News Classification
Comparative analysis of statistical (TF-IDF + Naive Bayes) versus embedding-based (LSA + MLP) models for categorizing Reuters-21578 financial news into earn, acq, trade, and crude categories.
Initialization
- Click 'Initialize Python Runtime' to load Pyodide and scikit-learn
- Wait for the status to show 'Ready' before proceeding

Execution
- The classification analysis runs automatically after initialization
- Click 'Re-run Classification' to execute both models again
- View real-time updates in the status indicator

Layout
- Python Implementation section: view the complete syntax-highlighted code (5 algorithm steps)
- Analysis Results section: interactive output with a model comparison table, F1-score charts, confusion matrices, and LSA variance analysis
import json
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import classification_report, confusion_matrix
# === STEP 1: Load Real Reuters-21578 Dataset from API ===
# Fetch authentic financial news articles from Reuters-21578 corpus
# Categories: earn (earnings), acq (acquisitions), trade (tariffs), crude (oil)
from js import fetch
np.random.seed(42)
# Fetch real Reuters articles from API
response = await fetch('/api/ReutersData/articles')
text = await response.text()
samples = json.loads(text)
# Create dataset with labels
texts = []
labels = []
for category, docs in samples.items():
    texts.extend(docs)
    labels.extend([category] * len(docs))
# Split into train (70%) and test (30%)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)
# === STEP 2: TF-IDF Vectorization ===
# Convert text to numerical features using Term Frequency-Inverse Document Frequency
vectorizer = TfidfVectorizer(max_features=500, stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
# === STEP 3: Baseline Model - Multinomial Naive Bayes ===
# Statistical classifier assuming feature independence (Bag of Words)
nb_model = MultinomialNB()
nb_model.fit(X_train_tfidf, y_train)
nb_predictions = nb_model.predict(X_test_tfidf)
# Calculate metrics
nb_report = classification_report(y_test, nb_predictions, output_dict=True, zero_division=0)
nb_cm = confusion_matrix(y_test, nb_predictions, labels=['acq', 'crude', 'earn', 'trade'])
# === STEP 4: LSA Dimensionality Reduction ===
# Compress high-dimensional TF-IDF vectors to capture latent semantic structure
lsa = TruncatedSVD(n_components=20, random_state=42)
X_train_lsa = lsa.fit_transform(X_train_tfidf)
X_test_lsa = lsa.transform(X_test_tfidf)
# Calculate explained variance
explained_variance = lsa.explained_variance_ratio_[:10] # Top 10 components
# === STEP 5: Neural Network Model - MLP on LSA Embeddings ===
# Multi-Layer Perceptron for non-linear classification
mlp_model = MLPClassifier(hidden_layer_sizes=(50, 25), max_iter=500, random_state=42)
mlp_model.fit(X_train_lsa, y_train)
mlp_predictions = mlp_model.predict(X_test_lsa)
# Calculate metrics
mlp_report = classification_report(y_test, mlp_predictions, output_dict=True, zero_division=0)
mlp_cm = confusion_matrix(y_test, mlp_predictions, labels=['acq', 'crude', 'earn', 'trade'])
# === STEP 6: Prepare Output ===
results = {
    'dataset_stats': {
        'total_docs': len(texts),
        'train_docs': len(X_train),
        'test_docs': len(X_test),
        'categories': list(samples.keys()),
        'vocab_size': len(vectorizer.vocabulary_)
    },
    'naive_bayes': {
        'accuracy': float(nb_report['accuracy']),
        'precision': float(nb_report['weighted avg']['precision']),
        'recall': float(nb_report['weighted avg']['recall']),
        'f1_score': float(nb_report['weighted avg']['f1-score']),
        'per_class': {
            cat: {
                'precision': float(nb_report[cat]['precision']),
                'recall': float(nb_report[cat]['recall']),
                'f1': float(nb_report[cat]['f1-score'])
            } for cat in ['acq', 'crude', 'earn', 'trade'] if cat in nb_report
        },
        'confusion_matrix': nb_cm.tolist()
    },
    'mlp_lsa': {
        'accuracy': float(mlp_report['accuracy']),
        'precision': float(mlp_report['weighted avg']['precision']),
        'recall': float(mlp_report['weighted avg']['recall']),
        'f1_score': float(mlp_report['weighted avg']['f1-score']),
        'per_class': {
            cat: {
                'precision': float(mlp_report[cat]['precision']),
                'recall': float(mlp_report[cat]['recall']),
                'f1': float(mlp_report[cat]['f1-score'])
            } for cat in ['acq', 'crude', 'earn', 'trade'] if cat in mlp_report
        },
        'confusion_matrix': mlp_cm.tolist(),
        'lsa_components': 20,
        'explained_variance': [float(v) for v in explained_variance]
    }
}
print(json.dumps(results))
The Synonymy Problem
In 1987, algorithmic traders faced a multi-million dollar problem: How do you teach a machine that “acquire”, “purchase”, and “buy” all mean the same thing?
If a computer only looks for keywords, it acts like a rigorous librarian—it finds exactly what you asked for, but misses everything that means the same thing but is written differently.
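A toy sketch makes the failure concrete (the two sentences are invented for illustration): under exact-keyword features, "acquire" and "purchase" occupy separate dimensions, so two paraphrases of the same deal overlap only on their incidental shared words.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Exxon will acquire the refinery",
    "Exxon will purchase the refinery",
]
# Exact-keyword features: "acquire" and "purchase" are unrelated dimensions
vec = TfidfVectorizer(stop_words='english')
X = vec.fit_transform(docs)
sim = cosine_similarity(X[0], X[1])[0, 0]
# Well below 1.0 despite identical meaning: the only overlap
# comes from "exxon" and "refinery"
print(sim)
```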
The Contestants
1. TF-IDF + Naive Bayes (The Librarian)
- The Strategy: Counts exact keywords. “Earnings” appeared 5 times? Must be an earnings report.
- The Flaw: Blind to context. It treats “bank” (river) and “bank” (money) as the same word.
- Best For: Speed. It’s incredibly fast and “good enough” for simple tasks.
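A minimal sketch of this baseline, using invented toy headlines rather than actual Reuters text:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy headlines, not real Reuters articles
train_texts = [
    "company reports quarterly earnings profit rose",
    "net profit earnings per share increased",
    "firm agrees to acquire rival in takeover",
    "merger deal acquisition approved by board",
]
train_labels = ["earn", "earn", "acq", "acq"]

# TF-IDF features feed a Naive Bayes classifier
clf = make_pipeline(TfidfVectorizer(stop_words='english'), MultinomialNB())
clf.fit(train_texts, train_labels)
# Keyword overlap ("earnings", "profit") drives the prediction
print(clf.predict(["record earnings and profit this quarter"]))  # -> ['earn']
```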
2. LSA + MLP (The Detective)
- The Strategy: It compresses words into “concepts” (Latent Semantic Analysis). It notices that “barrel”, “crude”, and “OPEC” often appear together, forming a concept of “Oil”.
- The Advantage: It spots the hidden connections. Even if the word “oil” never appears, it knows a story about “barrels” and “pipelines” is about Oil.
- Result: Slower to train, but far smarter at catching subtle meanings.
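The concept compression can be sketched with TruncatedSVD on invented documents: because the oil terms co-occur, the two oil stories collapse onto the same latent component, far from the earnings story.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Invented documents: two "oil" stories share vocabulary,
# one earnings story shares none of it
docs = [
    "opec cut crude output barrel prices rose",
    "crude barrel pipeline opec supply",
    "quarterly earnings profit net income",
]
X = TfidfVectorizer().fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=42)
Z = lsa.fit_transform(X)
# In 2-D concept space the two oil documents land close together,
# far from the earnings document
print(Z.round(2))
```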
Making It Visible
This lab uses real articles from the Reuters-21578 corpus, a benchmark dataset for text classification research. The dataset includes authentic financial news from 1987 covering:
- earn: Quarterly earnings reports and financial results
- acq: Mergers, acquisitions, and takeover announcements
- trade: International trade agreements and tariff policies
- crude: Oil prices, OPEC decisions, and petroleum markets
Articles are served from a static JSON resource containing actual Reuters newswire text.
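The loading step in the code above implies a category-keyed shape for that JSON resource; a minimal stand-in (the article snippets here are invented, not actual Reuters text) looks like:

```python
import json

# Illustrative stand-in for the /api/ReutersData/articles response:
# a JSON object mapping each category to a list of article texts
payload = json.dumps({
    "earn": ["Acme Corp posted higher fourth quarter net profit"],
    "acq": ["Widget Inc agreed to acquire Gadget Ltd"],
    "trade": ["Ministers announced new tariff measures on imports"],
    "crude": ["OPEC members agreed to cut crude oil output"],
})
samples = json.loads(payload)
texts, labels = [], []
for category, docs in samples.items():
    texts.extend(docs)
    labels.extend([category] * len(docs))
print(labels)  # -> ['earn', 'acq', 'trade', 'crude']
```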
- F1-Score: The overall performance grade, combining precision and recall into a single number.
- Confusion Matrix: A heat map of mistakes. If the model keeps confusing “Trade” stories for “Crude” oil stories, this map will light up.
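Both numbers come straight from scikit-learn; a minimal sketch with made-up labels and predictions shows how to read them:

```python
from sklearn.metrics import f1_score, confusion_matrix

# Made-up true labels and predictions for illustration
y_true = ["earn", "earn", "acq", "trade", "crude", "trade"]
y_pred = ["earn", "acq", "acq", "crude", "crude", "trade"]

# Weighted F1 averages each class's F1, weighted by class support
f1 = f1_score(y_true, y_pred, average="weighted")
print(f1)

# Rows = true class, columns = predicted class;
# off-diagonal cells show which categories get confused
cm = confusion_matrix(y_true, y_pred, labels=["acq", "crude", "earn", "trade"])
print(cm)
```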
References
[1] Lewis, D. D., Yang, Y., Rose, T. G., & Li, F. (2004). RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 5, 361-397.
[2] Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press. (The foundational text on TF-IDF and LSA).
[3] Kingma, D. P., & Ba, J. (2014). Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980. (The optimizer used in our MLP neural network).