
Privacy-Preserving Spam Email Detection

Nov 2024 - Dec 2024

Category: Data Classification

Role

Lead Developer for Data Processing & Modeling

Course

EECE 5644 – Machine Learning & Pattern Recognition

Objective

This project built a lightweight, privacy-preserving spam email classification system. Each email is represented by a word frequency vector rather than its raw text, and dimensionality reduction techniques (e.g., PCA) are applied to those vectors. The project compared the effectiveness of Random Forest and RBF-SVM models under several preprocessing pipelines.
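As a minimal sketch of the word frequency representation, the snippet below turns an email into counts over a fixed vocabulary. The vocabulary, the sample email, and the function name `word_frequency_vector` are hypothetical and chosen for illustration only; the project's actual tokenization and vocabulary are not described here.

```python
import re
from collections import Counter


def word_frequency_vector(text, vocabulary):
    """Return how often each vocabulary word occurs in `text`."""
    # Lowercase and keep alphabetic tokens only (an assumed tokenization).
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    # Counter returns 0 for words that never appear.
    return [counts[word] for word in vocabulary]


# Hypothetical vocabulary and email, for illustration only.
vocabulary = ["free", "meeting", "offer", "report", "win"]
email = "Win a FREE prize! Free offer ends today, win now."

vector = word_frequency_vector(email, vocabulary)
print(vector)
```

The resulting vector keeps only per-word counts: word order and any words outside the vocabulary are discarded, so the original message cannot be read back from it.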

Technologies

Python, scikit-learn, NumPy, Pandas

Natural Language Processing (word frequency vectors), Random Forest, RBF Support Vector Machine (SVM), Principal Component Analysis (PCA), SMOTE oversampling for class balancing, Seaborn and Matplotlib for visualization

My Contributions

  • Led development of the data preprocessing pipeline: cleaning, SMOTE, normalization, PCA
  • Implemented and compared Random Forest and RBF-SVM models in Python
  • Evaluated performance with and without dimensionality reduction
  • Generated confusion matrices, accuracy scores, and feature importance plots
  • Contributed all code for modeling and analysis phases
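The SMOTE step in the preprocessing pipeline can be illustrated with a small NumPy sketch. SMOTE balances classes by creating synthetic minority samples, each placed at a random point on the line between a minority sample and one of its nearest minority neighbors. The function `smote_sketch`, its parameters, and the toy data are assumptions made for this example; a library implementation such as `imblearn.over_sampling.SMOTE` would be the usual choice in practice, and which one the project used is not stated here.

```python
import numpy as np


def smote_sketch(X_minority, n_new, k=5, seed=0):
    """Create `n_new` synthetic samples by interpolating between
    minority samples and their k nearest minority neighbors."""
    X = np.asarray(X_minority, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(X)
    k = min(k, n - 1)  # cannot have more neighbors than other samples

    # Pairwise squared Euclidean distances between minority samples.
    diffs = X[:, None, :] - X[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbor

    # Indices of the k nearest neighbors of every sample.
    neighbors = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base sample and one of its neighbors for each new point.
    base = rng.integers(0, n, size=n_new)
    picked = neighbors[base, rng.integers(0, k, size=n_new)]

    # Interpolate at a random position between base and neighbor.
    gap = rng.random((n_new, 1))
    return X[base] + gap * (X[picked] - X[base])


# Toy minority-class word counts (synthetic, for illustration only).
data_rng = np.random.default_rng(42)
X_minority = data_rng.poisson(2.0, size=(20, 6)).astype(float)

synthetic = smote_sketch(X_minority, n_new=30)
print(synthetic.shape)
```

Because every synthetic point is a convex combination of two real minority samples, it stays inside the range of the observed minority data. Oversampling of this kind is normally applied to the training split only, so that no synthetic points leak into evaluation.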

Results

  • Achieved over 90% accuracy with reduced input dimensions
  • Preserved privacy by avoiding direct use of email content
  • Random Forest gave more stable and more interpretable results than the RBF-SVM
  • Visualizations included PCA projections, feature importance plots, and confusion matrices
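A comparison of this kind can be sketched with scikit-learn as follows. The data are synthetic Poisson word counts standing in for the real dataset, and the number of PCA components, the hyperparameters, and the helper `build` are assumptions for illustration, not the project's actual settings or results.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for word frequency vectors (not the project's data).
rng = np.random.default_rng(0)
n_per_class, n_words = 300, 50
ham = rng.poisson(1.0, size=(n_per_class, n_words))
spam_rates = np.ones(n_words)
spam_rates[:10] = 3.0  # the first 10 "words" are more frequent in spam
spam = rng.poisson(spam_rates, size=(n_per_class, n_words))
X = np.vstack([ham, spam]).astype(float)
y = np.array([0] * n_per_class + [1] * n_per_class)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)


def build(model, n_components=None):
    """Normalize, optionally reduce with PCA, then classify."""
    steps = [StandardScaler()]
    if n_components is not None:
        steps.append(PCA(n_components=n_components, random_state=0))
    steps.append(model)
    return make_pipeline(*steps)


candidates = {
    "Random Forest": lambda: RandomForestClassifier(n_estimators=100, random_state=0),
    "RBF-SVM": lambda: SVC(kernel="rbf", gamma="scale"),
}

results = {}
importances = None
for name, make_model in candidates.items():
    for n_components in (None, 10):  # without and with PCA
        pipe = build(make_model(), n_components)
        pipe.fit(X_train, y_train)
        pred = pipe.predict(X_test)
        results[(name, n_components)] = (
            accuracy_score(y_test, pred),
            confusion_matrix(y_test, pred),
        )
        if name == "Random Forest" and n_components is None:
            # Per-word importances are only meaningful without PCA.
            importances = pipe[-1].feature_importances_

for (name, n_components), (acc, cm) in results.items():
    label = "no PCA" if n_components is None else f"PCA({n_components})"
    print(f"{name:14s} {label:8s} accuracy={acc:.3f}")
```

Fitting the scaler and PCA inside the pipeline means they are learned from the training split only. The Random Forest importances map back to individual words only in the variant without PCA, since PCA components mix many words together.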

GitHub Repository

View GitHub Repository