Sole contributor: completed the end-to-end data science pipeline, including data processing, modeling, and analytics
DS5110 – Essentials of Data Science
This project analyzes the Behavioral Risk Factor Surveillance System (BRFSS), a large-scale public health survey dataset released by the U.S. CDC. The dataset spans multiple years and regions and contains complex demographic and health-related variables. The objective was to design and implement an end-to-end data science pipeline that turns raw, heterogeneous BRFSS data into structured, interpretable insights about health risk factors across populations, covering data cleaning, data modeling, exploratory analysis, and visualization.
Python data analysis (Pandas, NumPy), large-scale data cleaning and ETL pipelines, multi-table and hierarchical data structure modeling, Parquet data storage and performance optimization, exploratory data analysis (EDA), data visualization (Plotly, Matplotlib)
Data analysis pipeline completed; the code will be updated once the frontend visualization dashboard is finished
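Illustrative sketch of the cleaning step: the snippet below is not the project code, only a minimal example of the kind of cleaning and aggregation the pipeline describes, written with Pandas. The column names (`STATE`, `GENHLTH`), the sentinel codes treated as missing, and the helper names `clean_brfss` and `mean_by_group` are assumptions made for illustration; the actual codes differ by variable and survey year and should be taken from the BRFSS codebook.

```python
import pandas as pd

# Assumed sentinel codes for "don't know" / "refused" answers.
# Hypothetical mapping: real codes must be checked against the BRFSS codebook.
MISSING_CODES = {"GENHLTH": [7, 9]}


def clean_brfss(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `raw` with sentinel codes replaced by NaN."""
    cleaned = raw.copy()
    for column, codes in MISSING_CODES.items():
        if column in cleaned.columns:
            # mask() sets entries to NaN wherever the condition is True.
            cleaned[column] = cleaned[column].mask(cleaned[column].isin(codes))
    return cleaned


def mean_by_group(cleaned: pd.DataFrame, group_col: str, value_col: str) -> pd.Series:
    """Average `value_col` within each `group_col`, skipping missing values."""
    return cleaned.groupby(group_col)[value_col].mean()


# Tiny made-up sample, not real survey data.
sample = pd.DataFrame(
    {
        "STATE": [1, 1, 2, 2],
        "GENHLTH": [1, 9, 3, 5],
    }
)

cleaned = clean_brfss(sample)
summary = mean_by_group(cleaned, "STATE", "GENHLTH")
print(summary)
```

A cleaned frame like this could then be stored with `cleaned.to_parquet(...)`, which requires an installed Parquet engine such as pyarrow or fastparquet.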