SIADS 542: Supervised Learning · University of Michigan · December 2025
This project was completed as part of SIADS 542: Supervised Learning at the University of Michigan. The task is binary classification: given features extracted from email content, identify whether a message is spam (Class 1) or legitimate (Class 0). What makes spam detection analytically interesting is that the two types of errors have very different costs — making it a natural case study in precision-recall tradeoffs and threshold selection.
The analysis builds from baseline dummy classifiers through SVC and Logistic Regression, evaluating each model with confusion matrices, precision-recall curves, and ROC curves, and tuning hyperparameters with GridSearchCV over regularization settings. A final comparison between normalized and unnormalized features demonstrates the concrete impact of feature scaling on classifier performance.
Skills demonstrated in this project
Spam detection is a canonical example of asymmetric error costs. A false positive — flagging a legitimate email as spam — causes a user to miss a real message, which is often the more damaging outcome. A false negative — letting spam through — is annoying but recoverable. This asymmetry means overall accuracy is the wrong metric to optimize: a classifier that maximizes accuracy might still let through a lot of spam or, worse, incorrectly filter legitimate messages at an unacceptable rate.
The project frames every modeling decision around this tradeoff explicitly. Rather than reporting a single accuracy number, each classifier is evaluated with confusion matrices that make false positives and false negatives visible, precision-recall curves that show how the tradeoff shifts with decision threshold, and ROC curves that summarize performance across all possible thresholds. The GridSearchCV optimization targets precision specifically — reflecting the real-world priority of a spam filter that doesn't lose good emails.
Two dummy classifiers established the baseline: one using stratified random prediction (respecting the training label distribution) and one always predicting the majority class. Comparing their precision, recall, and accuracy made explicit the bar a real classifier has to clear to be useful, and illustrated why the majority-class dummy scores high accuracy but zero recall for spam.
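A minimal sketch of how these baselines are typically set up in scikit-learn; the split variables X_train, X_test, y_train, y_test are hypothetical placeholders, not the project's actual identifiers:

```python
# Baseline sketch: X_train, X_test, y_train, y_test are hypothetical
# placeholders for the project's train/test split of the spam features.
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

for strategy in ("stratified", "most_frequent"):
    dummy = DummyClassifier(strategy=strategy, random_state=0)
    dummy.fit(X_train, y_train)
    y_pred = dummy.predict(X_test)
    # zero_division=0: the majority-class dummy never predicts spam,
    # which would otherwise leave precision undefined.
    print(strategy,
          accuracy_score(y_test, y_pred),
          precision_score(y_test, y_pred, zero_division=0),
          recall_score(y_test, y_pred))
```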
An SVC with default hyperparameters was fit on standardized training features and evaluated on the test set. A second SVC with parameters C=1e9, gamma=1e-8 was then used to explore decision-function thresholds, classifying instances as spam whenever their raw decision score exceeded −100 rather than the standard cutoff of 0. This more permissive cutoff traded precision for recall, and the resulting confusion matrix showed directly how threshold choice controls which type of error the model commits more often.
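A sketch of the thresholding experiment under the same assumptions; the C and gamma values come from the description above, while the data variables remain hypothetical:

```python
# Threshold sketch: C and gamma are the values quoted in the write-up;
# the split variables are hypothetical placeholders for standardized features.
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

svc = SVC(C=1e9, gamma=1e-8).fit(X_train, y_train)

# Raw margin scores rather than hard 0/1 predictions.
scores = svc.decision_function(X_test)

# Flag as spam when the score exceeds -100 instead of the default 0.
# The looser cutoff flags more messages, raising recall at the
# expense of precision.
y_pred = (scores > -100).astype(int)
print(confusion_matrix(y_test, y_pred))
```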
A Logistic Regression classifier was evaluated using both the precision-recall curve and the ROC curve on the test set. From the precision-recall curve, the recall at a precision of 0.90 was extracted programmatically, answering the practical question: if the filter is tuned so that 90% of the messages it flags are genuinely spam, how much of the total spam does it actually catch? The ROC curve answered the complementary question: at a false positive rate of 10%, what true positive rate does the model achieve?
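One simple way to read these values off the curves is to take the curve point nearest the target; a sketch under that assumption, again with hypothetical variable names:

```python
# Curve-reading sketch (hypothetical variable names): pick the curve
# point closest to the target value, one straightforward lookup method.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_curve

lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = lr.decision_function(X_test)

# Recall at the PR-curve point where precision is closest to 0.90.
precision, recall, _ = precision_recall_curve(y_test, scores)
recall_at_p90 = recall[np.argmin(np.abs(precision - 0.90))]

# True positive rate at the ROC point where FPR is closest to 0.10.
fpr, tpr, _ = roc_curve(y_test, scores)
tpr_at_fpr10 = tpr[np.argmin(np.abs(fpr - 0.10))]

print(recall_at_p90, tpr_at_fpr10)
```

Interpolating between adjacent curve points would be marginally more precise, but nearest-point lookup is usually adequate for reporting.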
A GridSearchCV over six values of C (0.005 to 10.0) and two penalty types (L1 and L2) was run with 5-fold cross-validation, optimizing for precision. The mean cross-validated precision for each of the 12 combinations was returned as a 6×2 array and visualized as a heatmap — showing how regularization strength and penalty type jointly affect the spam filter's precision across folds.
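A sketch of how such a search is typically wired up; the 0.005 and 10.0 endpoints of the C range come from the description, but the four intermediate values here are assumptions, as are the data variable names:

```python
# Grid-search sketch: the 0.005 and 10.0 endpoints are from the
# write-up; the four intermediate C values are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {
    "C": [0.005, 0.05, 0.5, 1.0, 5.0, 10.0],
    "penalty": ["l1", "l2"],
}

# The liblinear solver supports both the L1 and L2 penalties.
grid = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid,
    scoring="precision",
    cv=5,
)
grid.fit(X_train, y_train)

# Mean CV precision for the 12 combinations, reshaped so rows index C
# and columns index penalty (ParameterGrid varies the alphabetically
# later key, "penalty", fastest).
mean_precision = np.array(grid.cv_results_["mean_test_score"]).reshape(6, 2)
print(mean_precision)
```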
The GridSearchCV was then re-run on raw, unnormalized features, and the best cross-validated precision from the two runs was compared. In the normalized run, scaling was applied correctly: the StandardScaler was fit on the training data only, with the resulting transform applied to both the training and test sets, avoiding data leakage. The performance gap between the runs demonstrated concretely that regularized linear models are sensitive to feature scale: without normalization, the regularization penalty bears unevenly on features of different magnitudes, degrading precision meaningfully.
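A sketch of the leak-free scaling comparison, reusing the same hypothetical grid and variable names:

```python
# Scaling-comparison sketch, reusing the hypothetical names above.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

param_grid = {"C": [0.005, 0.05, 0.5, 1.0, 5.0, 10.0],
              "penalty": ["l1", "l2"]}

# Fit the scaler on the training data only, then transform both sets:
# this is what keeps test-set statistics out of the training pipeline.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

grid_scaled = GridSearchCV(LogisticRegression(solver="liblinear"),
                           param_grid, scoring="precision", cv=5)
grid_scaled.fit(X_train_scaled, y_train)

grid_raw = GridSearchCV(LogisticRegression(solver="liblinear"),
                        param_grid, scoring="precision", cv=5)
grid_raw.fit(X_train, y_train)

# Best mean cross-validated precision with and without normalization.
print(grid_scaled.best_score_, grid_raw.best_score_)
```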
The Logistic Regression classifier with optimized regularization produced the strongest results. Evaluated on the test set at the default threshold: