BRFSS Interactive Data Dashboard

October 2025 - December 2025

Categories: Data Engineering, Data Visualization, Web Development

Project Overview

A production-ready, interactive web dashboard for exploring CDC's Behavioral Risk Factor Surveillance System (BRFSS) data. The system processes over 1.7 million records with 98 unique health questions across 56 U.S. states and territories, spanning 13 years (2011-2023) of public health survey data.

Key Highlights

  • Data Scale: 1.7M+ records processed with automated quality filtering
  • Geographic Coverage: 56 U.S. states and territories
  • Time Range: 2011-2023 (13 years of data)
  • Data Quality: 99.76% completeness with automated validation
  • Visualizations: 7 multi-dimensional analysis panels with interactive features

Role

Sole contributor. Built the end-to-end data science pipeline, including data processing, modeling, and analytics.

Course

DS5110 – Essentials of Data Science

Objective

This project is based on the Behavioral Risk Factor Surveillance System (BRFSS), a large-scale public health survey dataset released by the U.S. CDC. The dataset spans multiple years and regions, containing complex demographic and health-related variables. The objective was to design and implement an end-to-end data science pipeline that transforms raw, heterogeneous BRFSS data into structured insights through data cleaning, modeling, exploratory analysis, and visualization, enabling interpretable analysis of health risk factors across populations.

Technology Stack

Backend & Data Processing

  • Python 3.8+ - Core programming language
  • Pandas 2.0+ - Data manipulation and analysis
  • NumPy 1.24+ - Numerical computations
  • PyArrow 12.0+ - Parquet file handling for efficient storage

Frontend & Visualization

  • Dash 2.14+ - Web application framework for interactive dashboards
  • Plotly 5.17+ - Interactive visualizations and charts
  • Plotly Express - High-level chart creation API
  • HTML/CSS - UI styling and responsive layout

Data Engineering

  • 7-Step ETL Pipeline - Automated data cleaning and transformation
  • Parquet Format - Columnar storage for efficient querying
  • Quality Validation - Automated data quality checks and filtering

Key Features & Implementation

1. Automated Data Cleaning Pipeline (7 Steps)

  • Step 0 - Preprocessing: Text normalization, encoding resolution, special character handling
  • Step 1 - QuestionID Merging: Consolidated duplicate QuestionIDs for identical questions
  • Step 2 - ResponseID Merging: Standardized historical response changes and variants
  • Step 3 - BreakoutID Merging: Normalized age groups, income brackets, education levels
  • Step 4 - Numeric Cleaning: Type conversion, validation, and proportion calculation
  • Step 5 - Aggregation: Recalculated confidence intervals for the aggregated proportions using the binomial distribution
  • Step 6 - Quality Enhancement: Missing value handling, consistency validation, quality reporting
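
As an illustration of the Step 5 idea, the sketch below pools several rows into one proportion and recomputes a 95% confidence interval from the binomial model. The project description does not say which interval formula was used, so the normal-approximation (Wald) interval here is an assumption, and the function name `pooled_proportion_ci` and the sample numbers are hypothetical.

```python
import math


def pooled_proportion_ci(groups, z=1.96):
    """Pool (count, sample_size) pairs and return (p, lower, upper).

    Uses the normal approximation to the binomial (Wald interval).
    This is an assumed choice, not the project's confirmed method.
    """
    total_n = sum(n for _, n in groups)
    if total_n <= 0:
        raise ValueError("total sample size must be positive")
    total_x = sum(x for x, _ in groups)
    p = total_x / total_n
    half_width = z * math.sqrt(p * (1 - p) / total_n)
    # Clamp to [0, 1] because the Wald interval can overshoot near the edges.
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Two hypothetical rows being merged: (respondents giving the answer, sample size)
p, lo, hi = pooled_proportion_ci([(120, 400), (80, 600)])
print(f"{p:.1%} (95% CI {lo:.1%} to {hi:.1%})")
```

For small samples or proportions near 0 or 1, a Wilson or exact interval would usually be a safer choice than the Wald interval shown here.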

2. Interactive Dashboard Features

  • Multi-Dimensional Analysis: 7 visualization panels covering overall distribution, gender, age, education, income, geography, and temporal trends
  • Hierarchical Filtering: Class → Topic → Question selection with cascading dropdowns
  • Interactive U.S. Map: Choropleth visualization with intelligent color mapping and statistical summary panel
  • Confidence Intervals: Displayed on all charts with detailed hover tooltips
  • Age Granularity Toggle: Switch between 7-group and 3-group age analysis
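
The Class → Topic → Question cascade can be sketched as a pair of pure functions that turn the current selection into the option list for the next dropdown. This is a minimal sketch, not the project's code: the `CATALOG` contents and the function names are hypothetical, and in a Dash app each function would typically be called from a callback that writes to the next dropdown's `options` property.

```python
# Hypothetical catalog: Class -> Topic -> list of question labels.
# The entries are illustrative placeholders, not taken from the project.
CATALOG = {
    "Class A": {
        "Topic A1": ["Question A1-1", "Question A1-2"],
        "Topic A2": ["Question A2-1"],
    },
    "Class B": {
        "Topic B1": ["Question B1-1"],
    },
}


def to_options(values):
    """Convert plain strings into the label/value dicts a dropdown expects."""
    return [{"label": v, "value": v} for v in sorted(values)]


def topic_options(catalog, selected_class):
    """Topics available for the chosen class; empty if nothing is selected."""
    return to_options(catalog.get(selected_class, {}))


def question_options(catalog, selected_class, selected_topic):
    """Questions available for the chosen class and topic."""
    topics = catalog.get(selected_class, {})
    return to_options(topics.get(selected_topic, []))


print(topic_options(CATALOG, "Class A"))
print(question_options(CATALOG, "Class A", "Topic A1"))
```

Keeping the lookup logic free of any UI framework makes it easy to unit-test and lets an unselected parent (`None`) fall through to an empty option list.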

3. Data Engineering & Architecture

  • Designed modular backend architecture separating data loading, processing, aggregation, and API-style callbacks
  • Implemented efficient Parquet-based data storage for fast columnar querying
  • Built reusable aggregation pipelines with proper statistical weighting (sample size-based)
  • Automated quality filtering (rows with sample size below 30 are excluded) for statistical reliability
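
A minimal sketch of the sample-size weighting and the minimum-sample-size filter described above. It is written in plain Python so it runs without dependencies; the same logic maps onto a Pandas group-by. The field names (`pct`, `sample_size`), the function name, and the example rows are assumptions for illustration only.

```python
MIN_SAMPLE_SIZE = 30  # threshold stated in the project description


def weighted_percentage(rows, min_n=MIN_SAMPLE_SIZE):
    """Sample-size-weighted mean of 'pct' over rows with sample_size >= min_n.

    Returns None when no row passes the filter.
    """
    kept = [r for r in rows if r["sample_size"] >= min_n]
    total_n = sum(r["sample_size"] for r in kept)
    if total_n == 0:
        return None
    return sum(r["pct"] * r["sample_size"] for r in kept) / total_n


# Hypothetical rows for one question and response
rows = [
    {"state": "A", "pct": 10.0, "sample_size": 100},
    {"state": "B", "pct": 20.0, "sample_size": 300},
    {"state": "C", "pct": 90.0, "sample_size": 12},  # dropped: below threshold
]
result = weighted_percentage(rows)
print(result)
```

Weighting by sample size keeps a small, noisy group from pulling the aggregate as strongly as a large one, and the threshold removes groups too small to report reliably.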

Results & Impact

  • Production-Ready System: Delivered a complete, reusable end-to-end data science pipeline with automated quality assurance
  • Data Quality: Achieved 99.76% data completeness with automated validation and filtering
  • Multi-Dimensional Analysis: Enabled systematic exploration of health indicators across 7 analysis views (overall distribution, gender, age, education, income, geography, and time)
  • Longitudinal Analysis: Supported 13-year trend analysis (2011-2023) with temporal visualization
  • Geographic Insights: Interactive U.S. map visualization enabling regional health comparisons across 56 states and territories
  • Real-World Application: Produced interpretable results relevant to public health research and policy decision-making

GitHub Repository

View on GitHub: BRFSS-Dashboard

The repository includes the complete source code, documentation, data cleaning pipeline scripts, and deployment guides. The project follows standard software engineering practices: modular architecture, comprehensive documentation, and production-ready deployment configurations.