BRFSS Interactive Data Dashboard

October 2025 - December 2025
Categories: Data Engineering, Data Visualization, Web Development

Project Overview

A production-ready, interactive web dashboard for exploring data from the CDC's Behavioral Risk Factor Surveillance System (BRFSS). The system processes more than 1.7 million records covering 98 unique health questions across 56 U.S. states and territories and 13 years (2011-2023) of public health survey data.

Key Highlights

Data Scale: 1.7M+ records processed with automated quality filtering
Geographic Coverage: 56 U.S. states and territories
Time Range: 2011-2023 (13 years of data)
Data Quality: 99.76% completeness with automated validation
Visualizations: 7 multi-dimensional analysis panels with interactive features

Role

Sole contributor: completed the end-to-end data science pipeline, including data processing, modeling, and analytics.

Course

DS5110 – Essentials of Data Science

Objective

This project is based on the Behavioral Risk Factor Surveillance System (BRFSS), a large-scale public health survey dataset released by the U.S. CDC. The dataset spans multiple years and regions, containing complex demographic and health-related variables. The objective was to design and implement an end-to-end data science pipeline that transforms raw, heterogeneous BRFSS data into structured insights through data cleaning, modeling, exploratory analysis, and visualization, enabling interpretable analysis of health risk factors across populations.

Technology Stack

Backend & Data Processing

Python 3.8+ - Core programming language
Pandas 2.0+ - Data manipulation and analysis
NumPy 1.24+ - Numerical computations
PyArrow 12.0+ - Parquet file handling for efficient storage

Frontend & Visualization

Dash 2.14+ - Web application framework for interactive dashboards
Plotly 5.17+ - Interactive visualizations and charts
Plotly Express - High-level chart creation API
HTML/CSS - UI styling and responsive layout

Data Engineering

7-Step ETL Pipeline - Automated data cleaning and transformation
Parquet Format - Columnar storage for efficient querying
Quality Validation - Automated data quality checks and filtering

Key Features & Implementation

1. Automated Data Cleaning Pipeline (7 Steps)

Step 0 - Preprocessing: Text normalization, encoding resolution, special character handling
Step 1 - QuestionID Merging: Consolidated duplicate QuestionIDs for identical questions
Step 2 - ResponseID Merging: Standardized historical response changes and variants
Step 3 - BreakoutID Merging: Normalized age groups, income brackets, education levels
Step 4 - Numeric Cleaning: Type conversion, validation, and proportion calculation
Step 5 - Aggregation: Recalculated confidence intervals using the binomial distribution
Step 6 - Quality Enhancement: Missing value handling, consistency validation, quality reporting
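
Step 5 can be illustrated with a small sketch. When several rows are merged into one group, their counts and sample sizes are pooled and the interval is recomputed from the pooled proportion. The text only says the intervals use the binomial distribution, so the normal-approximation (Wald) interval, the 1.96 critical value, and the name `pooled_proportion_ci` are assumptions for illustration, not the project's actual code.

```python
import math


def pooled_proportion_ci(groups, z=1.96):
    """Pool (count, sample_size) pairs and return (p, lower, upper).

    Uses the normal approximation to the binomial distribution
    (an assumption; the project may use a different interval).
    """
    total_count = sum(count for count, _ in groups)
    total_n = sum(n for _, n in groups)
    if total_n <= 0:
        raise ValueError("total sample size must be positive")

    p = total_count / total_n
    # Binomial standard error of a proportion: sqrt(p * (1 - p) / n)
    margin = z * math.sqrt(p * (1 - p) / total_n)
    # Clamp to [0, 1] because the approximation can overshoot near the edges
    return p, max(0.0, p - margin), min(1.0, p + margin)


# Two hypothetical subgroups merged into one
p, lower, upper = pooled_proportion_ci([(30, 100), (50, 100)])
print(f"{p:.3f} [{lower:.3f}, {upper:.3f}]")
```

Pooling counts before recomputing the interval keeps the result consistent with the merged sample size, which averaging the original interval bounds would not.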

2. Interactive Dashboard Features

Multi-Dimensional Analysis: 7 visualization panels covering overall distribution, gender, age, education, income, geography, and temporal trends
Hierarchical Filtering: Class → Topic → Question selection with cascading dropdowns
Interactive U.S. Map: Choropleth visualization with intelligent color mapping and statistical summary panel
Confidence Intervals: Displayed on all charts with detailed hover tooltips
Age Granularity Toggle: Switch between 7-group and 3-group age analysis
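
The cascading Class → Topic → Question selection can be sketched as three small option builders, each narrowing the choices based on the level above it. The catalog rows, question wording, and function names below are hypothetical. In a Dash app each function would typically be the body of a callback that writes to the next dropdown's `options` property; that wiring is left out so the sketch stays self-contained.

```python
# Hypothetical catalog of (class, topic, question) rows; real values
# would come from the cleaned BRFSS data.
CATALOG = [
    ("Chronic Health Indicators", "Diabetes", "Ever told you have diabetes?"),
    ("Chronic Health Indicators", "Asthma", "Ever told you have asthma?"),
    ("Tobacco Use", "Current Smoker Status", "Do you currently smoke?"),
]


def _as_options(values):
    """Convert values to sorted, de-duplicated label/value dicts."""
    return [{"label": v, "value": v} for v in sorted(set(values))]


def class_options(catalog):
    """Options for the top-level Class dropdown."""
    return _as_options(c for c, _, _ in catalog)


def topic_options(catalog, selected_class):
    """Topics that belong to the selected class."""
    return _as_options(t for c, t, _ in catalog if c == selected_class)


def question_options(catalog, selected_class, selected_topic):
    """Questions that belong to the selected class and topic."""
    return _as_options(
        q for c, t, q in catalog
        if c == selected_class and t == selected_topic
    )


print(topic_options(CATALOG, "Chronic Health Indicators"))
```

An unknown or cleared selection yields an empty option list, so a downstream dropdown simply empties instead of showing stale choices.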

3. Data Engineering & Architecture

Designed modular backend architecture separating data loading, processing, aggregation, and API-style callbacks
Implemented efficient Parquet-based data storage for fast columnar querying
Built reusable aggregation pipelines with sample-size-based statistical weighting
Automated quality filtering (sample size ≥ 30) for statistical reliability
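
As a rough sketch of how the weighting and the quality filter fit together: rows below the sample-size threshold are dropped first, and the remaining values are averaged with their sample sizes as weights. The threshold of 30 comes from the text; the `(value, sample_size)` row shape and the function name are assumptions, and the project itself presumably works on Pandas DataFrames rather than plain tuples.

```python
MIN_SAMPLE_SIZE = 30  # threshold stated in the text


def weighted_percentage(rows, min_n=MIN_SAMPLE_SIZE):
    """Sample-size-weighted mean of `value`, ignoring small samples.

    `rows` is an iterable of (value, sample_size) pairs; this shape is
    an assumption made for the sketch.
    """
    reliable = [(v, n) for v, n in rows if n >= min_n]
    total_n = sum(n for _, n in reliable)
    if total_n == 0:
        return None  # nothing passed the quality filter
    return sum(v * n for v, n in reliable) / total_n


# Hypothetical rows; the last one is below the threshold and is dropped
rows = [(50.0, 100), (30.0, 300), (90.0, 10)]
print(weighted_percentage(rows))
```

Filtering before weighting means that a small, noisy group cannot shift the aggregate, and returning `None` lets the caller show "insufficient data" instead of a misleading number.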

Results & Impact

Production-Ready System: Delivered a complete, reusable end-to-end data science pipeline with automated quality assurance
Data Quality: Achieved 99.76% data completeness with automated validation and filtering
Multi-Dimensional Analysis: Enabled systematic exploration of health indicators across 7 analysis dimensions (overall distribution, gender, age, education, income, geography, and time)
Longitudinal Analysis: Supported 13-year trend analysis (2011-2023) with temporal visualization
Geographic Insights: Interactive U.S. map visualization enabling regional health comparisons across 56 states and territories
Real-World Application: Produced interpretable results relevant to public health research and policy decision-making

GitHub Repository

View on GitHub: BRFSS-Dashboard

The repository includes the complete source code, documentation, data cleaning pipeline scripts, and deployment guides. The project follows standard software engineering practices: modular architecture, comprehensive documentation, and production-ready deployment configurations.