
Classifying Physical Activity from Wearable Sensor Data

SIADS 542: Supervised Learning  ·  University of Michigan  ·  December 2025

Project Overview

This project was completed as part of SIADS 542: Supervised Learning at the University of Michigan. The goal is four-class activity classification: given physiological sensor measurements from 40 test subjects, predict which type of activity the subject is performing — neutral, emotional, mental, or physical — using no more than 10 of the 533 available features.

The dataset contains 4,480 rows, one per sensor collection event, across 40 subjects each exposed to 28 collection events per activity type. The feature constraint — 10 of 533 — reflects a realistic deployment scenario where minimizing sensor count and computational overhead matters. The open-ended final task required building a feature selection pipeline and optimizing a model to maximize ROC-AUC under that constraint.

The most important contribution of the project isn't the final model — it's identifying and correcting a structural data leakage problem that inflated baseline accuracy by over 20 percentage points.

Skills demonstrated in this project: Python · scikit-learn · Gradient Boosting · Random Forest · Feature Selection · Multi-Class Classification · ROC-AUC · Data Leakage Detection · Custom Train/Test Split

4,480 · Sensor observations across 40 subjects
533 · Available features
10 · Maximum features allowed in the final model
0.913 · ROC-AUC (macro, one-vs-rest)

Identifying and Solving Data Leakage

The first significant finding came before any modeling. A standard scikit-learn train_test_split on this dataset produces an inflated accuracy score — around 80.7% — that makes the baseline model look far more capable than it actually is. The cause is a structural property of the data: each of the 40 subjects contributes 112 rows (28 collection events across each of the four activity types), and a row-level random split assigns some of a subject's rows to training and others to the test set.

The problem: When the same subject appears in both training and test sets, the model can learn subject-specific physiological patterns and exploit them at test time. This isn't activity classification — it's subject recognition. The model memorizes individuals rather than generalizing to new people.
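A quick way to see the problem, sketched below under the assumption that the data is already loaded into a pandas DataFrame named df with a subject identifier column called "subject" (both names are placeholders, not the project's actual variable names):

```python
from sklearn.model_selection import train_test_split

# df is assumed to be the loaded sensor DataFrame with a "subject" column.
train_df, test_df = train_test_split(df, test_size=0.25, random_state=0)

# With a row-level split, nearly every test-set subject also appears in training,
# so the model can memorize individuals instead of learning activity signatures.
overlap = set(test_df["subject"]) & set(train_df["subject"])
print(f"{len(overlap)} of {test_df['subject'].nunique()} test subjects also appear in training")
```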

To eliminate this, a custom_train_test_split() function was written that splits by subject rather than by row. Test subjects are selected at random and all of their rows are held out entirely — ensuring the model never sees any data from a test subject during training. When the baseline model was re-run with the corrected split, accuracy dropped to ~58.6%, a 22-point reduction that reflects what the model actually knows how to do on genuinely unseen subjects.

The function accepts both float and integer values for test_size (proportion vs. count of subjects), uses numpy.random.default_rng() for reproducibility, and rounds up subject counts in accordance with scikit-learn conventions.
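A minimal sketch of the idea follows. It is not the exact implementation from the project; the signature and variable names are assumptions, but it mirrors the behavior described above (subject-level hold-out, float-or-int test_size, default_rng seeding, and ceiling rounding of the subject count):

```python
import math
import numpy as np

def custom_train_test_split(X, y, subjects, test_size=0.25, random_state=None):
    """Split by subject: every row from a held-out subject goes to the test set."""
    rng = np.random.default_rng(random_state)
    unique_subjects = np.unique(subjects)

    # test_size can be a proportion of subjects (float) or an absolute count (int);
    # proportions are rounded up, mirroring scikit-learn's ceiling convention.
    if isinstance(test_size, float):
        n_test = math.ceil(test_size * len(unique_subjects))
    else:
        n_test = int(test_size)

    test_subjects = rng.choice(unique_subjects, size=n_test, replace=False)
    test_mask = np.isin(subjects, test_subjects)

    return X[~test_mask], X[test_mask], y[~test_mask], y[test_mask]
```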

Modeling & Feature Selection

With a valid train/test split in place, the modeling work proceeded in two stages: establishing performance with interpretable baseline models, then building an optimized pipeline for the constrained 10-feature final task.

Baseline

Decision Tree & Logistic Regression

A Decision Tree classifier using a subset of MAD (median absolute deviation) features from the sensor data served as the baseline after the corrected split. Logistic Regression with StandardScaler preprocessing was then applied to the same feature subset, and its multi-class confusion matrix was analyzed in detail — identifying which activity pairs were most often confused and quantifying per-class precision and recall. Emotional and mental activities showed the most cross-class confusion, which aligned with the expectation that physiological signals for cognitive states are harder to separate than signals for physical movement.
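The baseline stage looked roughly like the sketch below, assuming X_train, X_test, y_train, and y_test come from the subject-aware split and mad_cols is a placeholder for the MAD feature subset:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import confusion_matrix, classification_report

# Decision Tree baseline on the MAD feature subset.
tree = DecisionTreeClassifier(random_state=0).fit(X_train[mad_cols], y_train)
print("Decision Tree accuracy:", tree.score(X_test[mad_cols], y_test))

# Logistic Regression is scale-sensitive, so scaling lives inside the pipeline.
logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg.fit(X_train[mad_cols], y_train)

# Rows are true activity classes, columns are predicted classes.
y_pred = logreg.predict(X_test[mad_cols])
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```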

Feature Importance

Random Forest Feature Ranking

A Random Forest classifier was used to rank all MAD features by importance, then a loop evaluated models trained on incrementally increasing feature counts — one feature, then two, then three, and so on — tracking how accuracy improved with each addition. This revealed that the top few features captured most of the available signal, with diminishing returns beyond the top 5 or 6. The importance scores and accuracy curve were plotted together, making the feature value cutoff visually interpretable.
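A condensed version of that ranking-and-accumulation loop, using the same placeholder names as the baseline sketch (hyperparameters here are illustrative, not the project's exact settings):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train[mad_cols], y_train)

# Order the MAD features from most to least important.
ranked = np.array(mad_cols)[np.argsort(rf.feature_importances_)[::-1]]

# Re-fit on the top-k features for k = 1, 2, 3, ... and record test accuracy,
# which makes the diminishing-returns cutoff easy to plot.
accuracies = []
for k in range(1, len(ranked) + 1):
    cols = list(ranked[:k])
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    accuracies.append(model.fit(X_train[cols], y_train).score(X_test[cols], y_test))
```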

Final Model

Gradient Boosting with Two-Stage Feature Selection

The final model used a two-stage approach. First, a lightweight GradientBoostingClassifier (25 estimators, max depth 5) was trained on all numeric features to generate importance scores across the full feature space. The top 10 features by importance were selected. A second, more fully-specified GradientBoostingClassifier (100 estimators) was then trained on those 10 features and evaluated using ROC-AUC with macro averaging and a one-vs-rest multi-class strategy. This approach is computationally tractable — exhaustively searching all C(533, 10) ≈ 4.7×10²⁰ combinations is infeasible — and still achieves meaningful feature selection by using the model's own learned signal to guide the search.
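In outline, the two-stage pipeline looks like this (a sketch using the estimator counts and depth stated above; X_train and X_test are assumed to be DataFrames of the numeric features from the subject-aware split):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Stage 1: a lightweight booster scores every numeric feature.
probe = GradientBoostingClassifier(n_estimators=25, max_depth=5, random_state=0)
probe.fit(X_train, y_train)
top10 = X_train.columns[np.argsort(probe.feature_importances_)[::-1][:10]]

# Stage 2: a fuller model trained on only those 10 features.
final = GradientBoostingClassifier(n_estimators=100, random_state=0)
final.fit(X_train[top10], y_train)

# Macro-averaged, one-vs-rest ROC-AUC on the held-out subjects.
proba = final.predict_proba(X_test[top10])
print(roc_auc_score(y_test, proba, multi_class="ovr", average="macro"))
```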

Results

The final Gradient Boosting model achieved a ROC-AUC of 0.913 (macro, one-vs-rest) using 10 features selected from 533 — clearing the assignment's top scoring threshold of ≥0.90. The 10 selected features spanned multiple sensor modalities including ECG heart rate variability, electrodermal activity, and several interbeat interval (IBI) metrics, suggesting the model found complementary signal across different physiological channels rather than concentrating on a single sensor type.

For context, the baseline Decision Tree evaluated with a standard row-level split scored ~80.7% accuracy — but that figure was inflated by data leakage. After applying the subject-aware split, the same model dropped to ~58.6%, which is the honest starting point. The final model's 0.913 ROC-AUC represents genuine improvement over that realistic baseline, not over a leaked one.

Key Insights