OPERATIONAL MANUAL

Initialization

  • Click 'Initialize Python Runtime' to load Pyodide and scikit-learn
  • Wait for the status to show 'Ready' before proceeding

Execution

  • Analysis runs automatically after initialization
  • Click 'Re-run Analysis' to execute the models again
  • View real-time updates in the status indicator

Layout

  • Python Implementation section: view the complete syntax-highlighted code (5 analysis steps)
  • Analysis Results section: interactive output with model metrics, predictions, and PCA variance charts

Python Implementation
This Python code runs in your browser using Pyodide (Python compiled to WebAssembly).
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.pipeline import Pipeline
import json

# ============================================================
# STEP 1: Generate Synthetic Transaction Dataset
# ============================================================
np.random.seed(42)
n_samples = 300

data = {
    'time_delta': np.random.randint(1, 120, n_samples),
    'category_id': np.random.randint(1, 6, n_samples),
    'hour_of_day': np.random.randint(0, 24, n_samples),
    'day_of_week': np.random.randint(1, 8, n_samples),
}

data['amount'] = (
    50 + 
    data['time_delta'] * 0.5 +
    data['category_id'] * 15 +
    (data['hour_of_day'] - 12)**2 * 0.3 +
    np.random.normal(0, 10, n_samples)
)

df = pd.DataFrame(data)
X = df[['time_delta', 'category_id', 'hour_of_day', 'day_of_week']]
y = df['amount']

# Perform cross-validation on full dataset (before split)
lr_model_cv = LinearRegression()
lr_cv = cross_val_score(lr_model_cv, X, y, cv=5, scoring='neg_mean_absolute_error').mean()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# ============================================================
# STEP 2: Model 1 - Linear Regression
# ============================================================
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
lr_pred = lr_model.predict(X_test)
lr_mae = mean_absolute_error(y_test, lr_pred)
lr_mse = mean_squared_error(y_test, lr_pred)
lr_r2 = r2_score(y_test, lr_pred)

# ============================================================
# STEP 3: Model 2 - Polynomial Regression
# ============================================================
# Cross-validation with pipeline on full dataset
poly_pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('lr', LinearRegression())
])
poly_cv = cross_val_score(poly_pipeline, X, y, cv=5, scoring='neg_mean_absolute_error').mean()

poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
poly_model = LinearRegression()
poly_model.fit(X_train_poly, y_train)
poly_pred = poly_model.predict(X_test_poly)
poly_mae = mean_absolute_error(y_test, poly_pred)
poly_mse = mean_squared_error(y_test, poly_pred)
poly_r2 = r2_score(y_test, poly_pred)

# ============================================================
# STEP 4: Model 3 - PCA + Regression (with Scaling)
# ============================================================
# Cross-validation with full pipeline on full dataset
pca_pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=3)),
    ('lr', LinearRegression())
])
pca_cv = cross_val_score(pca_pipeline, X, y, cv=5, scoring='neg_mean_absolute_error').mean()

# Train final model
scaler = StandardScaler()
X_train_poly_scaled = scaler.fit_transform(X_train_poly)
X_test_poly_scaled = scaler.transform(X_test_poly)

pca = PCA(n_components=3)
X_train_pca = pca.fit_transform(X_train_poly_scaled)
X_test_pca = pca.transform(X_test_poly_scaled)
pca_model = LinearRegression()
pca_model.fit(X_train_pca, y_train)
pca_pred = pca_model.predict(X_test_pca)
pca_mae = mean_absolute_error(y_test, pca_pred)
pca_mse = mean_squared_error(y_test, pca_pred)
pca_r2 = r2_score(y_test, pca_pred)

# Export results as JSON
results = {
    'metrics': [
        {'model': 'Linear', 'mae': lr_mae, 'mse': lr_mse, 'r2': lr_r2, 'cv': -lr_cv},
        {'model': 'Polynomial', 'mae': poly_mae, 'mse': poly_mse, 'r2': poly_r2, 'cv': -poly_cv},
        {'model': 'PCA', 'mae': pca_mae, 'mse': pca_mse, 'r2': pca_r2, 'cv': -pca_cv}
    ],
    'predictions': {
        'actual': y_test.tolist()[:10],
        'linear': lr_pred.tolist()[:10],
        'polynomial': poly_pred.tolist()[:10],
        'pca': pca_pred.tolist()[:10]
    },
    'pca_variance': pca.explained_variance_ratio_.tolist()
}

print(f"RESULTS_JSON:{json.dumps(results)}")

The Complexity of Human Behavior

Predicting financial transactions isn't simple. People don't spend money in straight lines: we tend to spend more during lunch hours or on weekends. This lab demonstrates why simple models fail to capture these human patterns and how more flexible models succeed.

The Models Compared

1. Linear Regression (The Underachiever) Draws a straight line through data.

  • The Flaw: It assumes spending increases linearly with time. It completely misses the “noon peak” in transaction volume.
  • Result: High error and a low R² score, because reality is curved.
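
The flaw is easy to reproduce. This sketch fits a plain LinearRegression on the raw hour feature against a quadratic spending pattern (a toy reconstruction in the spirit of the lab's synthetic data, not the lab code itself); the straight line explains almost none of the variance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
hour = rng.integers(0, 24, size=300).reshape(-1, 1)
# Spending peaks away from noon, mirroring the lab's (hour - 12)^2 term
amount = 50 + 0.3 * (hour.ravel() - 12) ** 2 + rng.normal(0, 5, 300)

lr = LinearRegression().fit(hour, amount)
r2 = r2_score(amount, lr.predict(hour))
# A straight line through a U-shaped pattern captures almost no signal
print(f"Linear fit on raw hour: R^2 = {r2:.2f}")
```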

2. Polynomial Regression (The Curve Master) Allows the line to bend.

  • The Fix: By squaring features (e.g., hour²), it captures non-linear patterns like the lunch rush.
  • Result: Fits the data curve beautifully, drastically reducing error.
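
A minimal sketch of the fix, using the same toy data: wrapping PolynomialFeatures(degree=2) around LinearRegression lets the model learn the hour² term and recover the curve.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
hour = rng.integers(0, 24, size=300).reshape(-1, 1)
amount = 50 + 0.3 * (hour.ravel() - 12) ** 2 + rng.normal(0, 5, 300)

# Degree-2 expansion turns [hour] into [hour, hour^2], letting the line bend
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                      LinearRegression())
model.fit(hour, amount)
r2 = r2_score(amount, model.predict(hour))
print(f"Quadratic fit: R^2 = {r2:.2f}")
```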

3. PCA + Polynomial (The Efficiency Expert) Smart complexity.

  • The Insight: We created 14 new features to fit the curve, which is computationally heavy. PCA (Principal Component Analysis) compresses them down to the 3 components that capture the most variance.
  • Result: Nearly the same accuracy as the full polynomial model, but faster and less prone to overfitting.
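
The feature counts can be verified directly. This sketch (with random stand-in data, not the lab's dataset) shows the degree-2 expansion of 4 features producing 14 columns (4 linear + 4 squared + 6 interactions), which PCA then compresses to 3.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))  # stand-in for the 4 transaction features

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print(X_poly.shape[1])  # 4 linear + 4 squared + 6 interaction terms = 14

# Scale before PCA so no single polynomial feature dominates the components
pca = PCA(n_components=3)
X_pca = pca.fit_transform(StandardScaler().fit_transform(X_poly))
print(X_pca.shape[1])  # compressed down to 3 components
```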

Powered by WebAssembly

This entire data science pipeline—generating synthetic data, training three models, and cross-validating—runs locally in your browser via Pyodide. No server required.

  • Scalability: naturally distributed, since each user provides their own compute

Trade-offs:

  • Initial load time (~5-10 seconds for Pyodide + packages)
  • Limited to browser memory constraints (~4GB typical)
  • Single-threaded execution (no multi-core parallelism yet)

References

[1] Alpaydin, E. (2014). Introduction to Machine Learning. MIT Press. (See Chapter 4.6: “Regression” for the mathematical basis of our polynomial models).

[2] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. (See Chapter 2: “Linear Algebra” for PCA and dimensionality reduction concepts).