NLP: Financial News Classification
Comparative analysis of statistical (TF-IDF + Naive Bayes) versus embedding-based (LSA + MLP) models for categorizing Reuters-21578 financial news into earn, acq, trade, and crude categories.
Initialization
- Click 'Initialize Python Runtime' to load Pyodide and scikit-learn
- Wait for the status to show 'Ready' before proceeding

Execution
- The classification analysis runs automatically after initialization
- Click 'Re-run Classification' to execute both models again
- View real-time updates in the status indicator

Layout
- Python Implementation section: view the complete syntax-highlighted code (5 algorithm steps)
- Analysis Results section: interactive output with a model comparison table, F1-score charts, confusion matrices, and LSA variance analysis
import json
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import classification_report, confusion_matrix
# === STEP 1: Load Real Reuters-21578 Dataset from API ===
# Fetch authentic financial news articles from Reuters-21578 corpus
# Categories: earn (earnings), acq (acquisitions), trade (tariffs), crude (oil)
from js import fetch
np.random.seed(42)
# Fetch real Reuters articles from API
response = await fetch('/api/ReutersData/articles')
text = await response.text()
samples = json.loads(text)
# Create dataset with labels
texts = []
labels = []
for category, docs in samples.items():
    texts.extend(docs)
    labels.extend([category] * len(docs))
# Split into train (70%) and test (30%)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)
# === STEP 2: TF-IDF Vectorization ===
# Convert text to numerical features using Term Frequency-Inverse Document Frequency
vectorizer = TfidfVectorizer(max_features=500, stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
# === STEP 3: Baseline Model - Multinomial Naive Bayes ===
# Statistical classifier assuming feature independence (Bag of Words)
nb_model = MultinomialNB()
nb_model.fit(X_train_tfidf, y_train)
nb_predictions = nb_model.predict(X_test_tfidf)
# Calculate metrics
nb_report = classification_report(y_test, nb_predictions, output_dict=True, zero_division=0)
nb_cm = confusion_matrix(y_test, nb_predictions, labels=['acq', 'crude', 'earn', 'trade'])
# === STEP 4: LSA Dimensionality Reduction ===
# Compress high-dimensional TF-IDF vectors to capture latent semantic structure
lsa = TruncatedSVD(n_components=20, random_state=42)
X_train_lsa = lsa.fit_transform(X_train_tfidf)
X_test_lsa = lsa.transform(X_test_tfidf)
# Calculate explained variance
explained_variance = lsa.explained_variance_ratio_[:10] # Top 10 components
# === STEP 5: Neural Network Model - MLP on LSA Embeddings ===
# Multi-Layer Perceptron for non-linear classification
mlp_model = MLPClassifier(hidden_layer_sizes=(50, 25), max_iter=500, random_state=42)
mlp_model.fit(X_train_lsa, y_train)
mlp_predictions = mlp_model.predict(X_test_lsa)
# Calculate metrics
mlp_report = classification_report(y_test, mlp_predictions, output_dict=True, zero_division=0)
mlp_cm = confusion_matrix(y_test, mlp_predictions, labels=['acq', 'crude', 'earn', 'trade'])
# === STEP 6: Prepare Output ===
results = {
    'dataset_stats': {
        'total_docs': len(texts),
        'train_docs': len(X_train),
        'test_docs': len(X_test),
        'categories': list(samples.keys()),
        'vocab_size': len(vectorizer.vocabulary_)
    },
    'naive_bayes': {
        'accuracy': float(nb_report['accuracy']),
        'precision': float(nb_report['weighted avg']['precision']),
        'recall': float(nb_report['weighted avg']['recall']),
        'f1_score': float(nb_report['weighted avg']['f1-score']),
        'per_class': {
            cat: {
                'precision': float(nb_report[cat]['precision']),
                'recall': float(nb_report[cat]['recall']),
                'f1': float(nb_report[cat]['f1-score'])
            } for cat in ['acq', 'crude', 'earn', 'trade'] if cat in nb_report
        },
        'confusion_matrix': nb_cm.tolist()
    },
    'mlp_lsa': {
        'accuracy': float(mlp_report['accuracy']),
        'precision': float(mlp_report['weighted avg']['precision']),
        'recall': float(mlp_report['weighted avg']['recall']),
        'f1_score': float(mlp_report['weighted avg']['f1-score']),
        'per_class': {
            cat: {
                'precision': float(mlp_report[cat]['precision']),
                'recall': float(mlp_report[cat]['recall']),
                'f1': float(mlp_report[cat]['f1-score'])
            } for cat in ['acq', 'crude', 'earn', 'trade'] if cat in mlp_report
        },
        'confusion_matrix': mlp_cm.tolist(),
        'lsa_components': 20,
        'explained_variance': [float(v) for v in explained_variance]
    }
}
print(json.dumps(results))
The Synonymy Problem
In 1987, algorithmic traders faced a multi-million dollar problem: How do you teach a machine that “acquire”, “purchase”, and “buy” all mean the same thing?
If a computer only looks for keywords, it acts like a rigorous librarian—it finds exactly what you asked for, but misses everything that means the same thing but is written differently.
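A toy sketch makes the failure concrete (the two sentences are invented for illustration): under exact-keyword features, "acquire" and "purchase" occupy separate dimensions, so two paraphrases of the same deal overlap only on their incidental shared words.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Exxon will acquire the refinery",
    "Exxon will purchase the refinery",
]
# Exact-keyword features: "acquire" and "purchase" are unrelated dimensions
vec = TfidfVectorizer(stop_words='english')
X = vec.fit_transform(docs)
sim = cosine_similarity(X[0], X[1])[0, 0]
# Well below 1.0 despite identical meaning: the only overlap
# comes from "exxon" and "refinery"
print(sim)
```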
The Contestants
1. TF-IDF + Naive Bayes (The Librarian)
- The Strategy: Counts exact keywords. “Earnings” appeared 5 times? Must be an earnings report.
- The Flaw: Blind to context. It treats “bank” (river) and “bank” (money) as the same word.
- Best For: Speed. It’s incredibly fast and “good enough” for simple tasks.
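A minimal sketch of this baseline, using invented toy headlines rather than actual Reuters text:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy headlines, not real Reuters articles
train_texts = [
    "company reports quarterly earnings profit rose",
    "net profit earnings per share increased",
    "firm agrees to acquire rival in takeover",
    "merger deal acquisition approved by board",
]
train_labels = ["earn", "earn", "acq", "acq"]

# TF-IDF features feed a Naive Bayes classifier
clf = make_pipeline(TfidfVectorizer(stop_words='english'), MultinomialNB())
clf.fit(train_texts, train_labels)
# Keyword overlap ("earnings", "profit") drives the prediction
print(clf.predict(["record earnings and profit this quarter"]))  # -> ['earn']
```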
2. LSA + MLP (The Detective)
- The Strategy: It compresses words into “concepts” (Latent Semantic Analysis). It notices that “barrel”, “crude”, and “OPEC” often appear together, forming a concept of “Oil”.
- The Advantage: It spots the hidden connections. Even if the word “oil” never appears, it knows a story about “barrels” and “pipelines” is about Oil.
- Result: Slower to train, but far smarter at catching subtle meanings.
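The concept compression can be sketched with TruncatedSVD on invented documents: because the oil terms co-occur, the two oil stories collapse onto the same latent component, far from the earnings story.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Invented documents: two "oil" stories share vocabulary,
# one earnings story shares none of it
docs = [
    "opec cut crude output barrel prices rose",
    "crude barrel pipeline opec supply",
    "quarterly earnings profit net income",
]
X = TfidfVectorizer().fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=42)
Z = lsa.fit_transform(X)
# In 2-D concept space the two oil documents land close together,
# far from the earnings document
print(Z.round(2))
```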
Making It Visible
This lab uses real articles from the Reuters-21578 corpus, a benchmark dataset for text classification research. The dataset includes authentic financial news from 1987 covering:
- earn: Quarterly earnings reports and financial results
- acq: Mergers, acquisitions, and takeover announcements
- trade: International trade agreements and tariff policies
- crude: Oil prices, OPEC decisions, and petroleum markets
Articles are served from a static JSON resource containing actual Reuters newswire text.
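The loading step in the code above implies a category-keyed shape for that JSON resource; a minimal stand-in (the article snippets here are invented, not actual Reuters text) looks like:

```python
import json

# Illustrative stand-in for the /api/ReutersData/articles response:
# a JSON object mapping each category to a list of article texts
payload = json.dumps({
    "earn": ["Acme Corp posted higher fourth quarter net profit"],
    "acq": ["Widget Inc agreed to acquire Gadget Ltd"],
    "trade": ["Ministers announced new tariff measures on imports"],
    "crude": ["OPEC members agreed to cut crude oil output"],
})
samples = json.loads(payload)
texts, labels = [], []
for category, docs in samples.items():
    texts.extend(docs)
    labels.extend([category] * len(docs))
print(labels)  # -> ['earn', 'acq', 'trade', 'crude']
```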
- F1-Score: The overall performance grade, combining precision and recall into a single number.
- Confusion Matrix: A heat map of mistakes. If the model keeps confusing “Trade” stories for “Crude” oil stories, this map will light up.
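Both numbers come straight from scikit-learn; a minimal sketch with made-up labels and predictions shows how to read them:

```python
from sklearn.metrics import f1_score, confusion_matrix

# Made-up true labels and predictions for illustration
y_true = ["earn", "earn", "acq", "trade", "crude", "trade"]
y_pred = ["earn", "acq", "acq", "crude", "crude", "trade"]

# Weighted F1 averages each class's F1, weighted by class support
f1 = f1_score(y_true, y_pred, average="weighted")
print(f1)

# Rows = true class, columns = predicted class;
# off-diagonal cells show which categories get confused
cm = confusion_matrix(y_true, y_pred, labels=["acq", "crude", "earn", "trade"])
print(cm)
```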
References
[1] Lewis, D. D., Yang, Y., Rose, T. G., & Li, F. (2004). RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 5, 361-397.
[2] Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press. (The foundational text on TF-IDF and LSA).
[3] Kingma, D. P., & Ba, J. (2014). Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980. (The optimizer used in our MLP neural network).