
Regression & Classification Fundamentals

SIADS 542: Supervised Learning  ·  University of Michigan  ·  December 2025

Project Overview

This project spans two assignments from SIADS 542: Supervised Learning at the University of Michigan, covering the foundational methods of both regression and classification. Rather than treating them separately, the assignments are structured around a shared theme: understanding how model complexity relates to generalization — where underfitting and overfitting live, and how to navigate between them.

The regression work uses a synthetic noisy cubic dataset to explore polynomial feature expansion, bias-variance tradeoff, and Lasso regularization. The classification work applies k-Nearest Neighbors and Support Vector Classifiers to the Breast Cancer Wisconsin dataset, exploring hyperparameter tuning, weighted distance, and the effect of the RBF kernel's gamma parameter on train vs. test accuracy.

Skills demonstrated in this project: Python · scikit-learn · Polynomial Regression · Lasso Regularization · KNN · Support Vector Classifier · Hyperparameter Tuning · Bias-Variance Tradeoff · One-Hot Encoding

4 · Polynomial degrees evaluated (1, 3, 7, 11)
569 · Patient records in the classification dataset
30 · Features in the Breast Cancer Wisconsin dataset
6 · Gamma values swept in the SVC validation curve

Regression: Polynomial Models & Regularization

The regression portion of the project works with a synthetic dataset where the true underlying function is a known cubic polynomial — but the observed data includes Gaussian noise. This setup makes it possible to evaluate models not just against the noisy test set, but against a "gold standard" noiseless version of the true function, revealing how well each model recovers the actual signal.

Polynomial Feature Expansion

Fitting Polynomials of Degrees 1, 3, 7, and 11

Using scikit-learn's PolynomialFeatures and LinearRegression, polynomial models of degrees 1, 3, 7, and 11 were fit to the training data and evaluated by R² on both training and test sets. Degree 1 underfit the data badly — a straight line can't capture a cubic curve. Degree 11 memorized the training data nearly perfectly but generalized poorly to the test set, producing wild oscillations in regions between training points. Degree 3 achieved the best test R², aligning with the true underlying cubic structure.
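The workflow above can be sketched as follows. The synthetic data generation here is illustrative: the specific cubic coefficients, noise level, sample size, and random seeds are assumptions, not the assignment's actual values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Assumed stand-in for the assignment's noisy cubic dataset.
rng = np.random.RandomState(0)
x = rng.uniform(-2, 3, size=120)
y = x**3 - 2 * x**2 + x + rng.normal(scale=2.0, size=x.shape)

X_train, X_test, y_train, y_test = train_test_split(
    x.reshape(-1, 1), y, random_state=0)

# Fit polynomial models of each degree and record train/test R^2.
scores = {}
for degree in (1, 3, 7, 11):
    poly = PolynomialFeatures(degree=degree)
    X_tr = poly.fit_transform(X_train)   # fit feature map on train only
    X_te = poly.transform(X_test)
    model = LinearRegression().fit(X_tr, y_train)
    scores[degree] = (model.score(X_tr, y_train), model.score(X_te, y_test))

for degree, (r2_train, r2_test) in scores.items():
    print(f"degree {degree:2d}: train R2 = {r2_train:.3f}, test R2 = {r2_test:.3f}")
```

On data like this, degree 3 matches the true functional form and posts the best test R², while degree 1 underfits and degree 11 widens the train/test gap.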

Lasso Regularization

Constraining Model Complexity with Sparse Coefficients

The same polynomial degrees were re-fit using Lasso regression (α=0.01), which adds an L1 penalty that drives less-useful coefficients to zero. Compared to unregularized linear regression, Lasso produced smoother fits at high polynomial degrees — damping the oscillations seen at degree 11 by effectively ignoring coefficients that weren't supported by the data. Evaluated against the gold standard noiseless test set, degree 3 again performed best, confirming that the regularization was helping the model recover the true function rather than fitting noise.
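A minimal sketch of the regularized variant, again on an assumed noisy cubic (the data, `max_iter` value, and pipeline structure are illustrative; only `alpha=0.01` comes from the write-up):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
x = rng.uniform(-2, 3, size=120).reshape(-1, 1)
y = (x**3 - 2 * x**2 + x).ravel() + rng.normal(scale=2.0, size=len(x))

# L1-penalized fit at the highest degree; max_iter is raised because
# coordinate descent on unscaled high-degree features converges slowly.
model = make_pipeline(
    PolynomialFeatures(degree=11),
    Lasso(alpha=0.01, max_iter=100_000),
).fit(x, y)

coefs = model.named_steps["lasso"].coef_
print("nonzero coefficients:", int(np.sum(coefs != 0)), "of", coefs.size)
print(f"train R2: {model.score(x, y):.3f}")
```

The L1 penalty shrinks coefficients the data doesn't support, which is what damps the degree-11 oscillations relative to the unregularized fit.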

KNN Regression

Non-Parametric Baseline

A KNN regressor with default hyperparameters was fit as a non-parametric baseline. Its R² on the test set provided a useful reference point: it performed reasonably but was outperformed by the correctly specified polynomial models, illustrating that when the true functional form is recoverable, parametric models with the right structure have an advantage over flexible non-parametric alternatives.
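The baseline amounts to a few lines; the synthetic data and seeds below are the same illustrative assumptions as above:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(0)
x = rng.uniform(-2, 3, size=120).reshape(-1, 1)
y = (x**3 - 2 * x**2 + x).ravel() + rng.normal(scale=2.0, size=len(x))
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)

# Default hyperparameters: n_neighbors=5, uniform weights.
knn = KNeighborsRegressor().fit(X_train, y_train)
r2 = knn.score(X_test, y_test)
print(f"KNN regressor test R2: {r2:.3f}")
```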

Classification: KNN & SVC

The classification portion applies supervised learning to the Breast Cancer Wisconsin dataset — a real medical dataset where each of 569 patient records is labeled malignant or benign based on 30 digitized cell nucleus measurements. The stakes of the problem frame the model evaluation: missing a true malignant case (false negative) is costlier than a false alarm, which motivated the exploration of recall-focused evaluation alongside overall accuracy.

k-Nearest Neighbors

Hyperparameter Tuning and Overfitting

KNN classifiers were built and evaluated at k=1 and k=15, then a parameter sweep across all odd values of k from 1 to 19 identified the optimal value for test set accuracy. Separately, the same sweep was run optimizing for training set accuracy instead — a deliberate overfitting exercise. At k=1, the model achieved perfect training accuracy but lower test accuracy, illustrating the classic overfitting signature. The gap between training-optimized and test-optimized k values quantified exactly how much performance was being sacrificed by selecting on the wrong metric.
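The sweep described above can be sketched like this (the train/test split's random state is an assumption):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Sweep odd k from 1 to 19, recording train and test accuracy for each.
results = {}
for k in range(1, 20, 2):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    results[k] = (clf.score(X_train, y_train), clf.score(X_test, y_test))

# Selecting on training accuracy always favors k=1 (each point is its
# own nearest neighbor); selecting on test accuracy picks a larger k.
best_test_k = max(results, key=lambda k: results[k][1])
print("k=1 train accuracy:", results[1][0])
print("best k by test accuracy:", best_test_k)
```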

Support Vector Classifier

Gamma Sweep with Validation Curves

An SVC with an RBF kernel was evaluated across six values of the gamma parameter (1e-7 to 1e-2) using scikit-learn's validation_curve with 3-fold cross-validation. At very low gamma, the decision boundary is too smooth — the model underfits. At very high gamma, the boundary becomes too sensitive to individual training points — the model overfits, with training accuracy near 1.0 and test accuracy dropping sharply. The optimal gamma produced the highest cross-validated test accuracy and was identified programmatically from the resulting accuracy curves.
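A sketch of the validation-curve setup; `C=1` is an assumption, while the kernel, gamma range, and 3-fold CV come from the description above:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
gammas = np.logspace(-7, -2, 6)   # six values from 1e-7 to 1e-2

# For each gamma, validation_curve returns per-fold train and test scores.
train_scores, test_scores = validation_curve(
    SVC(kernel="rbf", C=1), X, y,
    param_name="gamma", param_range=gammas,
    cv=3, scoring="accuracy")

train_mean = train_scores.mean(axis=1)
test_mean = test_scores.mean(axis=1)
best_gamma = gammas[test_mean.argmax()]   # programmatic pick from the curves
print("best gamma by cross-validated accuracy:", best_gamma)
```

Plotting `train_mean` and `test_mean` against `gammas` on a log axis reproduces the underfit-to-overfit arc described above, with training accuracy climbing toward 1.0 at the high-gamma end.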

One-Hot Encoding

Preparing Categorical Features for ML Models

A separate component of the classification work focused on preprocessing: applying one-hot encoding to categorical features in a housing price dataset using scikit-learn's OneHotEncoder. The implementation correctly fits the encoder on training data only and applies the transform to both sets — avoiding data leakage — and handles unseen categories in the test set through the handle_unknown parameter.

Key Insights