SIADS 542: Supervised Learning · University of Michigan · December 2025
This project was completed as part of SIADS 542: Supervised Learning at the University of Michigan. The task is binary classification: given features extracted from email content, identify whether a message is spam (Class 1) or legitimate (Class 0). What makes spam detection analytically interesting is that the two types of errors have very different costs — making it a natural case study in precision-recall tradeoffs and threshold selection.
The analysis builds from baseline dummy classifiers through SVC and Logistic Regression, evaluating each model with confusion matrices, precision-recall curves, and ROC curves, and tuning hyperparameters with GridSearchCV over regularization settings. A final comparison between normalized and unnormalized features demonstrates the concrete impact of feature scaling on classifier performance.
Skills demonstrated in this project
Spam detection is a canonical example of asymmetric error costs. A false positive — flagging a legitimate email as spam — causes a user to miss a real message, which is often the more damaging outcome. A false negative — letting spam through — is annoying but recoverable. This asymmetry means overall accuracy is the wrong metric to optimize: a classifier that maximizes accuracy might still let through a lot of spam or, worse, incorrectly filter legitimate messages at an unacceptable rate.
The project frames every modeling decision around this tradeoff explicitly. Rather than reporting a single accuracy number, each classifier is evaluated with confusion matrices that make false positives and false negatives visible, precision-recall curves that show how the tradeoff shifts with decision threshold, and ROC curves that summarize performance across all possible thresholds. The GridSearchCV optimization targets precision specifically — reflecting the real-world priority of a spam filter that doesn't lose good emails.
Two dummy classifiers established the baseline: one using stratified random prediction (respecting the training label distribution) and one always predicting the majority class. Comparing their precision, recall, and accuracy made explicit the bar a real classifier has to clear to be useful, and illustrated why the majority-class dummy scores high accuracy but zero recall for spam.
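A minimal sketch of how these baselines are typically set up in scikit-learn; the split variables X_train, X_test, y_train, y_test are hypothetical placeholders, not the project's actual identifiers:

```python
# Baseline sketch: X_train, X_test, y_train, y_test are hypothetical
# placeholders for the project's train/test split of the spam features.
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

for strategy in ("stratified", "most_frequent"):
    dummy = DummyClassifier(strategy=strategy, random_state=0)
    dummy.fit(X_train, y_train)
    y_pred = dummy.predict(X_test)
    # zero_division=0: the majority-class dummy never predicts spam,
    # which would otherwise leave precision undefined.
    print(strategy,
          accuracy_score(y_test, y_pred),
          precision_score(y_test, y_pred, zero_division=0),
          recall_score(y_test, y_pred))
```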
An SVC with default hyperparameters was fit on standardized training features and evaluated on the test set. A second SVC with parameters C=1e9, gamma=1e-8 was then used to explore decision-function thresholds, classifying instances as spam whenever their raw decision score exceeded −100 rather than the standard cutoff of 0. This more permissive cutoff traded precision for recall, and the resulting confusion matrix showed directly how threshold choice controls which type of error the model commits more often.
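A sketch of the thresholding experiment under the same assumptions; the C and gamma values come from the description above, while the data variables remain hypothetical:

```python
# Threshold sketch: C and gamma are the values quoted in the write-up;
# the split variables are hypothetical placeholders for standardized features.
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

svc = SVC(C=1e9, gamma=1e-8).fit(X_train, y_train)

# Raw margin scores rather than hard 0/1 predictions.
scores = svc.decision_function(X_test)

# Flag as spam when the score exceeds -100 instead of the default 0.
# The looser cutoff flags more messages, raising recall at the
# expense of precision.
y_pred = (scores > -100).astype(int)
print(confusion_matrix(y_test, y_pred))
```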
A Logistic Regression classifier was evaluated using both the precision-recall curve and the ROC curve on the test set. From the precision-recall curve, the recall at a precision of 0.90 was extracted programmatically, answering the practical question: if the filter is tuned so that 90% of the messages it flags are genuinely spam, how much of the total spam does it actually catch? The ROC curve answered the complementary question: at a false positive rate of 10%, what true positive rate does the model achieve?
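One simple way to read these values off the curves is to take the curve point nearest the target; a sketch under that assumption, again with hypothetical variable names:

```python
# Curve-reading sketch (hypothetical variable names): pick the curve
# point closest to the target value, one straightforward lookup method.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_curve

lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = lr.decision_function(X_test)

# Recall at the PR-curve point where precision is closest to 0.90.
precision, recall, _ = precision_recall_curve(y_test, scores)
recall_at_p90 = recall[np.argmin(np.abs(precision - 0.90))]

# True positive rate at the ROC point where FPR is closest to 0.10.
fpr, tpr, _ = roc_curve(y_test, scores)
tpr_at_fpr10 = tpr[np.argmin(np.abs(fpr - 0.10))]

print(recall_at_p90, tpr_at_fpr10)
```

Interpolating between adjacent curve points would be marginally more precise, but nearest-point lookup is usually adequate for reporting.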
A GridSearchCV over six values of C (0.005 to 10.0) and two penalty types (L1 and L2) was run with 5-fold cross-validation, optimizing for precision. The mean cross-validated precision for each of the 12 combinations was returned as a 6×2 array and visualized as a heatmap — showing how regularization strength and penalty type jointly affect the spam filter's precision across folds.
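A sketch of how such a search is typically wired up; the 0.005 and 10.0 endpoints of the C range come from the description, but the four intermediate values here are assumptions, as are the data variable names:

```python
# Grid-search sketch: the 0.005 and 10.0 endpoints are from the
# write-up; the four intermediate C values are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {
    "C": [0.005, 0.05, 0.5, 1.0, 5.0, 10.0],
    "penalty": ["l1", "l2"],
}

# The liblinear solver supports both the L1 and L2 penalties.
grid = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid,
    scoring="precision",
    cv=5,
)
grid.fit(X_train, y_train)

# Mean CV precision for the 12 combinations, reshaped so rows index C
# and columns index penalty (ParameterGrid varies the alphabetically
# later key, "penalty", fastest).
mean_precision = np.array(grid.cv_results_["mean_test_score"]).reshape(6, 2)
print(mean_precision)
```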
The GridSearchCV was then re-run on raw, unnormalized features, and the best cross-validated precision from the two runs was compared. In the normalized run, scaling was applied correctly: the StandardScaler was fit on the training data only, with the resulting transform applied to both the training and test sets, avoiding data leakage. The performance gap between the runs demonstrated concretely that regularized linear models are sensitive to feature scale: without normalization, the regularization penalty bears unevenly on features of different magnitudes, degrading precision meaningfully.
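A sketch of the leak-free scaling comparison, reusing the same hypothetical grid and variable names:

```python
# Scaling-comparison sketch, reusing the hypothetical names above.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

param_grid = {"C": [0.005, 0.05, 0.5, 1.0, 5.0, 10.0],
              "penalty": ["l1", "l2"]}

# Fit the scaler on the training data only, then transform both sets:
# this is what keeps test-set statistics out of the training pipeline.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

grid_scaled = GridSearchCV(LogisticRegression(solver="liblinear"),
                           param_grid, scoring="precision", cv=5)
grid_scaled.fit(X_train_scaled, y_train)

grid_raw = GridSearchCV(LogisticRegression(solver="liblinear"),
                        param_grid, scoring="precision", cv=5)
grid_raw.fit(X_train, y_train)

# Best mean cross-validated precision with and without normalization.
print(grid_scaled.best_score_, grid_raw.best_score_)
```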
The Logistic Regression classifier with optimized regularization produced the strongest results. Evaluated on the test set at the default threshold: