Role
Lead Developer for Data Processing & Modeling
Course
EECE 5644 – Machine Learning & Pattern Recognition
Objective
This project built a lightweight, privacy-preserving spam email classifier. Raw email text was replaced with word-frequency vectors, and dimensionality reduction (e.g., PCA) was applied to them. The project then compared the effectiveness of Random Forest and RBF-SVM models under several preprocessing pipelines.
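To make the idea of replacing raw text with word-frequency vectors concrete, here is a minimal sketch using scikit-learn's `CountVectorizer` followed by PCA. The toy corpus, the variable names, and the choice of two components are illustrative assumptions, not the project's actual data or settings.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical; not the project's dataset).
emails = [
    "win a free prize now",
    "free money click now",
    "meeting agenda for monday",
    "please review the project report",
]

# Each email becomes a row of word counts; word order is discarded,
# so the downstream model never sees the original message text.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(emails).toarray()

# Project the count vectors onto two principal components (assumed value).
pca = PCA(n_components=2, random_state=0)
reduced = pca.fit_transform(counts)

print(counts.shape)   # (number of emails, vocabulary size)
print(reduced.shape)  # (number of emails, 2)
```

After this step, only the numeric matrix needs to be passed on to the classifiers.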
Technologies
Python, scikit-learn, NumPy, Pandas
Natural Language Processing (word-frequency vectors), Random Forest, RBF Support Vector Machine (SVM), Principal Component Analysis (PCA), SMOTE oversampling for class balancing, Seaborn and Matplotlib for visualization
My Contributions
- Led development of the data preprocessing pipeline: cleaning, SMOTE, normalization, and PCA
- Implemented and compared Random Forest and RBF-SVM models in Python
- Evaluated performance with and without dimensionality reduction
- Generated confusion matrices, accuracy scores, and feature importance plots
- Contributed all code for modeling and analysis phases
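The comparison described in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: the data is synthetic (`make_classification` stands in for word-frequency features), the helper `build_pipeline` and the value of 10 components are hypothetical, and the SMOTE step is left out because it comes from the separate imbalanced-learn package rather than scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, mildly imbalanced stand-in for word-frequency features (assumption).
X, y = make_classification(
    n_samples=400, n_features=50, n_informative=10,
    weights=[0.7, 0.3], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)


def build_pipeline(model, n_components=None):
    """Hypothetical helper: normalize, optionally apply PCA, then classify."""
    steps = [StandardScaler()]
    if n_components is not None:
        steps.append(PCA(n_components=n_components, random_state=0))
    steps.append(model)
    return make_pipeline(*steps)


# Factories so that every pipeline gets a fresh, unfitted estimator.
models = {
    "random_forest": lambda: RandomForestClassifier(n_estimators=100, random_state=0),
    "rbf_svm": lambda: SVC(kernel="rbf", gamma="scale"),
}

# Fit each model with and without dimensionality reduction.
results = {}
for name, make_model in models.items():
    for n_components in (None, 10):
        pipe = build_pipeline(make_model(), n_components)
        pipe.fit(X_train, y_train)
        suffix = "full" if n_components is None else f"pca{n_components}"
        results[f"{name}_{suffix}"] = pipe.score(X_test, y_test)

for label, accuracy in results.items():
    print(f"{label}: {accuracy:.3f}")
```

Wrapping the scaler and PCA in a pipeline means both are fitted on the training split only, which keeps test data from leaking into the preprocessing.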
Results
- Achieved over 90% accuracy with reduced input dimensions
- Preserved privacy by avoiding direct use of email content
- Random Forest gave more stable and more interpretable results than the RBF-SVM
- Visualizations included PCA projections, feature importance plots, and confusion matrices
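As a rough illustration of how the evaluation artifacts listed above might be computed, the sketch below derives a confusion matrix, an accuracy score, and Random Forest feature importances with scikit-learn. The data is synthetic and every name and parameter value here is an assumption; the numbers it prints say nothing about the project's reported results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption; not the project's dataset).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)

# Impurity-based importances, one value per input feature, summing to 1.
importances = forest.feature_importances_
top5 = np.argsort(importances)[::-1][:5]

print(cm)
print(f"accuracy: {acc:.3f}")
print("top features by importance:", top5)
```

The `cm` array can be passed to `seaborn.heatmap` and `importances` to a Matplotlib bar chart to produce the corresponding plots.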
GitHub Repository
View GitHub Repository