Privacy-Preserving Spam Email Detection

Nov 2024 - Dec 2024

Category: Data Classification

Role

Lead Developer for Data Processing & Modeling

Course

EECE 5644 – Machine Learning & Pattern Recognition

Objective

This project aimed to build a lightweight, privacy-preserving spam email classification system. The system replaces raw email text with word frequency vectors and applies dimensionality reduction (e.g., PCA) before classification. The project compared the effectiveness of Random Forest and RBF-SVM classifiers under several preprocessing pipelines.
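As a rough illustration of the word frequency representation, the sketch below maps an email body to counts over a fixed vocabulary, so that downstream models see only numbers and not the message text. The vocabulary, the tokenization rule, and the function name `to_frequency_vector` are assumptions made for this example and are not taken from the project code.

```python
import re
from collections import Counter

# Hypothetical fixed vocabulary; a real one would be built from the training corpus.
VOCABULARY = ["free", "win", "meeting", "money", "report"]


def to_frequency_vector(text, vocabulary=VOCABULARY):
    """Map raw text to per-word counts over a fixed vocabulary."""
    # Lowercase and keep alphabetic tokens only (a simplifying assumption).
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    # Words outside the vocabulary are dropped, so the text cannot be read back.
    return [counts[word] for word in vocabulary]


vector = to_frequency_vector("Win free money now! Free entry, win big.")
print(vector)  # [2, 2, 0, 1, 0]
```

Only the count vector would need to be stored or shared; word order and out-of-vocabulary words are discarded at this step.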

Technologies

Python, scikit-learn, NumPy, Pandas

Natural Language Processing (word frequency vectors), Random Forest, RBF Support Vector Machine (SVM), Principal Component Analysis (PCA), SMOTE oversampling for class balancing, Seaborn and Matplotlib for visualization

My Contributions

Led the development of the data preprocessing pipeline: cleaning, SMOTE, normalization, and PCA
Implemented and compared Random Forest and RBF-SVM models in Python
Evaluated performance with and without dimensionality reduction
Generated confusion matrices, accuracy scores, and feature importance plots
Contributed all code for modeling and analysis phases
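The steps listed above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the data is synthetic (`make_classification`), the hyperparameters are arbitrary, and `oversample_minority` is a simplified, hypothetical stand-in for SMOTE that interpolates between random pairs of minority samples instead of between nearest neighbours.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def oversample_minority(X, y, seed=0):
    """Simplified SMOTE-style oversampling (illustrative stand-in only).

    New minority points are interpolated between random pairs of minority
    samples; real SMOTE interpolates toward k-nearest neighbours.
    """
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_new = counts.max() - counts.min()
    X_min = X[y == minority]
    a = X_min[rng.integers(0, len(X_min), n_new)]
    b = X_min[rng.integers(0, len(X_min), n_new)]
    gap = rng.random((n_new, 1))
    X_new = a + gap * (b - a)
    return np.vstack([X, X_new]), np.concatenate([y, np.full(n_new, minority)])


def compare_models(n_components=10, seed=0):
    """Fit Random Forest and RBF-SVM with and without PCA; return test accuracy."""
    # Synthetic, imbalanced stand-in for the word frequency matrix.
    X, y = make_classification(n_samples=600, n_features=50, n_informative=10,
                               weights=[0.8, 0.2], random_state=seed)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed)
    # Balance the training split only, so the test set stays untouched.
    X_train, y_train = oversample_minority(X_train, y_train, seed)

    models = {
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=seed),
        "RBF-SVM": SVC(kernel="rbf", gamma="scale"),
    }
    scores = {}
    for name, model in models.items():
        for use_pca in (False, True):
            steps = [StandardScaler()]
            if use_pca:
                steps.append(PCA(n_components=n_components, random_state=seed))
            steps.append(clone(model))
            pipe = make_pipeline(*steps)
            pipe.fit(X_train, y_train)
            scores[(name, use_pca)] = pipe.score(X_test, y_test)
    return scores


for (name, use_pca), acc in compare_models().items():
    print(f"{name:13s} PCA={use_pca!s:5s} accuracy={acc:.3f}")
```

Wrapping the scaler and PCA in a pipeline means both are fitted on the training split alone, which avoids leaking test-set statistics into the comparison.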

Results

Achieved over 90% accuracy with reduced input dimensions
Preserved privacy by training on word frequency vectors rather than raw email content
Random Forest showed more stable and interpretable results than the RBF-SVM
Visualizations included PCA projections, feature importance plots, and confusion matrices
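The quantities behind these results and plots could be computed along the following lines. This is a hedged sketch on synthetic data with arbitrary settings; the function name `evaluate` and its parameters are hypothetical, and the numbers it produces are not the project's results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split


def evaluate(seed=0, top_k=5):
    """Return accuracy, confusion matrix, top features, and a 2-D PCA projection."""
    # Synthetic stand-in for the word frequency features (assumption).
    X, y = make_classification(n_samples=500, n_features=30, n_informative=8,
                               random_state=seed)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed)

    forest = RandomForestClassifier(n_estimators=100, random_state=seed)
    forest.fit(X_train, y_train)
    y_pred = forest.predict(X_test)

    return {
        "accuracy": accuracy_score(y_test, y_pred),
        # Rows are true labels, columns are predicted labels.
        "confusion": confusion_matrix(y_test, y_pred),
        # Indices of the most important features, highest first.
        "top_features": np.argsort(forest.feature_importances_)[::-1][:top_k],
        # Two principal components of the test set, suitable for a scatter plot.
        "projection": PCA(n_components=2).fit_transform(X_test),
    }


report = evaluate()
print(f"accuracy: {report['accuracy']:.3f}")
print(report["confusion"])
print("top features:", report["top_features"])
```

The returned arrays can be passed to plotting tools, for example `seaborn.heatmap` for the confusion matrix and a Matplotlib scatter plot for the projection.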

GitHub Repository

View GitHub Repository