NAFLD Data Analysis System
Non-Alcoholic Fatty Liver Disease
Comprehensive Project Documentation
A FastAPI-based Web Application for Medical Data Analysis
April 8, 2026
Table of Contents
1. Executive Summary
2. Project Overview
3. System Architecture
4. Module Documentation
5. API Endpoints
6. Data Analysis Functions
7. Visualization Functions
8. Technical Implementation Details
9. Usage Guide
10. Future Enhancements
1. Executive Summary
This project implements a comprehensive web-based data analysis system for Non-Alcoholic Fatty Liver Disease (NAFLD) research. The system is built using FastAPI, a modern Python web framework, and provides automated exploratory data analysis (EDA), data preprocessing, and interactive visualizations for medical researchers and healthcare professionals.
Key Features
• RESTful API for data access and analysis
• Automated missing value imputation using iterative methods
• Comprehensive statistical analysis including correlation and outlier detection
• Interactive visualizations using Plotly
• Medical-specific analytics for liver disease indicators
• HTML-based reporting for easy sharing and presentation
Technology Stack
• Web Framework: FastAPI
• Data Processing: Pandas, NumPy
• Machine Learning: Scikit-learn
• Visualization: Plotly, Missingno, Yellowbrick
2. Project Overview
2.1 About NAFLD
Non-Alcoholic Fatty Liver Disease (NAFLD) is a medical condition characterized by excess fat accumulation in the liver of individuals who consume little to no alcohol. It represents a spectrum of liver conditions ranging from simple fatty liver (NAFL) to non-alcoholic steatohepatitis (NASH), which can progress to cirrhosis and liver cancer.
2.2 Project Purpose
This system is designed to facilitate medical research by providing:
• Automated data quality assessment and cleaning
• Statistical analysis of patient biomarkers and clinical indicators
• Visualization of disease patterns and risk factors
• Web-based access to analysis results
2.3 Dataset Description
The dataset contains comprehensive medical information for NAFLD patients including:
• Demographics: Age, Gender, Height, Weight, BMI
• Body Measurements: Waist and Hip Circumference
• Vital Signs: Systolic and Diastolic Blood Pressure
• Comorbidities: Diabetes, Hypertension, Hyperlipidemia, Metabolic Syndrome
• Liver Function Tests: AST, ALT, ALP, GGT, LDH
• Lipid Profile: Total Cholesterol, Triglycerides, HDL, LDL
• Blood Parameters: Glucose, Insulin, HOMA Index
• Disease Indicators: Fibrosis Status, Steatosis, NAS Score, Diagnosis
3. System Architecture
The system follows a modular architecture with clear separation of concerns:
3.1 Application Structure
• main.py: FastAPI application, API endpoints, request routing
• analysis.py: Statistical analysis functions (summary stats, correlation, outliers)
• preprocessing.py: Data cleaning and missing value imputation
• visualisation.py: Data visualization functions using Plotly and other libraries
3.2 Data Flow
1. User sends HTTP request to FastAPI endpoint
2. Main application loads Excel dataset into Pandas DataFrame
3. Preprocessing module cleans and imputes missing data
4. Analysis module performs statistical computations
5. Results formatted as HTML or JSON
6. Response returned to user's browser
4. Module Documentation
4.1 main.py - Application Core
This is the central module that orchestrates the entire application. It initializes the FastAPI framework, loads the dataset, and defines all API endpoints.
Key Components:
• FastAPI Instance: Creates the web application server
• Dataset Loading: Reads NAFLD.xlsx file into memory using Pandas
• Module Imports: Imports analysis and visualization functions
• Data Copy: Creates a working copy to preserve original data
Design Pattern:
The module follows the Model-View-Controller (MVC) pattern where FastAPI acts as the controller, the DataFrame represents the model, and HTML responses serve as views.
4.2 analysis.py - Statistical Analysis
This module provides core statistical analysis functions that extract insights from the dataset. All functions are pure and stateless, accepting a DataFrame and returning processed results.
Functions Overview:
• basic_info(): Returns row count and column names
• missing_values(): Counts null values per column
• summary_stats(): Computes mean, std, min, max, and quartiles for numeric columns
• correlation(): Calculates the Pearson correlation matrix between numeric features
• outliers(): Detects outliers using the IQR method (below Q1 - 1.5*IQR or above Q3 + 1.5*IQR)
4.3 preprocessing.py - Data Cleaning
This module handles data quality issues, specifically missing value imputation. It uses advanced machine learning techniques to predict and fill missing values based on patterns in complete data.
Iterative Imputation Method:
The IterativeImputer from scikit-learn uses a round-robin approach where each feature with missing values is modeled as a function of other features. This is more sophisticated than simple mean/median imputation as it captures relationships between variables.
Algorithm Steps:
1. Identify numeric columns with missing values
2. For each column, use the other columns to predict its missing values
3. Iterate until convergence or the maximum number of iterations is reached
4. Replace missing values with the predicted values
Advantages:
• Preserves relationships between variables
• More accurate than simple mean imputation
• Handles multiple missing values per row
• Uses Bayesian ridge regression by default
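As a concrete illustration of the round-robin method, the sketch below fills gaps in a small frame of made-up values; in the project this would run on the numeric columns of the NAFLD dataset.

```python
# Iterative imputation sketch with made-up values. IterativeImputer is
# still experimental in scikit-learn and must be enabled via the
# explicit import below.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.DataFrame({
    "BMI":     [31.2, 27.8, np.nan, 35.0],
    "Glucose": [110.0, np.nan, 98.0, 140.0],
    "ALT":     [55.0, 32.0, 40.0, np.nan],
})

numeric = df.select_dtypes(include="number")
imputer = IterativeImputer(max_iter=10, random_state=0)  # BayesianRidge by default
filled = pd.DataFrame(imputer.fit_transform(numeric), columns=numeric.columns)
```

After fitting, every cell holds either an observed or a predicted value, so `filled` contains no missing entries.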
4.4 visualisation.py - Data Visualization
This module contains functions for creating interactive visualizations using Plotly and specialized medical data visualization libraries. The visualizations are designed specifically for medical research and disease pattern analysis.
Visualization Categories:
1. Distribution Analysis
• plot_distribution(): Histograms for Age, BMI, Glucose, Cholesterol
• Purpose: Detect skewness and abnormal ranges
2. Correlation Analysis
• plot_correlation_matrix(): Heatmap of all numeric features
• Key Relationships: BMI ↔ Glucose, ALT ↔ AST, Insulin ↔ HOMA
3. Outlier Detection
• plot_outliers(): Box plots for BMI, Glucose, ALT, AST
• Purpose: Identify extreme patients requiring special attention
4. Disease Risk Analysis
• plot_disease_vs_feature(): Compare features across disease groups
• Examples: BMI vs Fibrosis, Glucose vs Diabetes, ALT vs NASH
5. Categorical Distribution
• plot_categorical_distribution(): Count plots for Gender, Smoking, Disease Type
6. Feature Relationships
• feature_relationships(): Scatter matrix for BMI, Glucose, Insulin, HOMA
• Purpose: Identify clustering and multivariate relationships
7. Missing Value Patterns
• plot_missing_values(): Matrix, bar, heatmap, and dendrogram visualizations using missingno
8. Medical-Specific Combinations
• plot_medical_combos(): Clinically relevant pairs
• BMI vs HOMA (insulin resistance indicator)
• Glucose vs HbA1c (blood sugar control)
• ALT vs AST ratio (liver damage pattern)
• Cholesterol vs HDL (cardiovascular risk)
5. API Endpoints
5.1 GET /view
• Purpose: Display raw dataset in HTML table format
• Response Type: HTML
• Use Case: Quick data inspection in browser
This endpoint reads the Excel file and converts it to an HTML table using Pandas' to_html() method, allowing quick visual inspection of the raw data without any processing.
5.2 GET /NAFLD
• Purpose: Provide comprehensive information about NAFLD
• Response Type: Plain Text
• Content Includes:
• Disease definition and progression
• Symptoms and causes
• Diagnosis methods
• Treatment approaches
• Complete data dictionary for all patient variables
5.3 GET /eda
• Purpose: Perform comprehensive Exploratory Data Analysis
• Response Type: HTML
• Processing Steps:
1. Create a local copy of the data
2. Count missing values (before imputation)
3. Apply iterative imputation
4. Count missing values (after imputation; should be zero)
5. Calculate basic info (rows, columns)
6. Compute summary statistics
7. Calculate the correlation matrix
8. Detect outliers using the IQR method
9. Format results as HTML tables
10. Return a comprehensive HTML report
5.4 GET /visualisation
• Status: Incomplete (endpoint defined but not implemented)
• Intended Purpose: Generate and display interactive visualizations
Note: The visualization functions are defined in visualisation.py but are not currently integrated into this endpoint. Implementation would involve calling the plot functions and embedding the results in HTML.
6. Data Analysis Functions - Detailed Explanation
6.1 basic_info(df)
Input: Pandas DataFrame
Output: Dictionary with 'rows' and 'columns' keys
This function provides the most basic metadata about the dataset. It returns the total number of rows (patient records) and a list of all column names. This is typically the first step in any data analysis to understand the dataset structure.
6.2 missing_values(df)
Input: Pandas DataFrame
Output: Dictionary mapping column names to count of missing values
Uses Pandas' isnull() method to identify missing (NaN, None, NaT) values and returns a count for each column. This is crucial for data quality assessment and for deciding on imputation strategies. In medical datasets, missing values can arise from tests not being performed, patient non-compliance, or data entry errors.
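Plausible implementations of both functions, matching the described inputs and outputs (sketches, not necessarily the project's exact code):

```python
# Sketches of basic_info() and missing_values() with a tiny example frame.
import pandas as pd

def basic_info(df: pd.DataFrame) -> dict:
    return {"rows": len(df), "columns": list(df.columns)}

def missing_values(df: pd.DataFrame) -> dict:
    # isnull() flags NaN/None/NaT; sum() counts them per column
    return df.isnull().sum().to_dict()

df = pd.DataFrame({"Age": [45, None, 61], "BMI": [31.2, 27.8, None]})
```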
6.3 summary_stats(df)
Input: Pandas DataFrame
Output: Nested dictionary with statistics for each numeric column
Computes descriptive statistics using Pandas' describe() method. For each numeric column, calculates:
• Count: Number of non-null values
• Mean: Average value
• Std: Standard deviation (measure of spread)
• Min: Minimum value
• 25%: First quartile (25th percentile)
• 50%: Median (50th percentile)
• 75%: Third quartile (75th percentile)
• Max: Maximum value
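A sketch of summary_stats() built on describe(), as described above (example values are made up):

```python
# summary_stats() sketch: describe() yields count, mean, std, min,
# quartiles, and max for each numeric column.
import pandas as pd

def summary_stats(df: pd.DataFrame) -> dict:
    return df.describe().to_dict()

df = pd.DataFrame({"Glucose": [90.0, 110.0, 130.0]})
stats = summary_stats(df)
```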
6.4 correlation(df)
Input: Pandas DataFrame
Output: Nested dictionary representing correlation matrix
Calculates Pearson correlation coefficients between all pairs of numeric variables. Correlation values range from -1 to 1:
• 1: Perfect positive correlation
• 0: No linear correlation
• -1: Perfect negative correlation
In medical research, correlation analysis helps identify relationships between biomarkers, risk factors, and disease outcomes. For example, high correlation between BMI and insulin resistance (HOMA index) would support their known relationship.
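A correlation() sketch is shown below; the made-up values mimic the BMI/HOMA relationship mentioned above, and numeric_only=True skips non-numeric columns such as Gender.

```python
# correlation() sketch using Pandas' Pearson correlation (the default).
import pandas as pd

def correlation(df: pd.DataFrame) -> dict:
    return df.corr(numeric_only=True).to_dict()

df = pd.DataFrame({
    "BMI":    [22.0, 27.0, 31.0, 35.0],
    "HOMA":   [1.1, 2.0, 3.2, 4.5],
    "Gender": ["F", "M", "F", "M"],  # excluded from the matrix
})
corr = correlation(df)
```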
6.5 outliers(df)
Input: Pandas DataFrame
Output: Dictionary mapping column names to outlier counts
Uses the Interquartile Range (IQR) method to detect outliers. This is a robust statistical method that works as follows:
1. Calculate Q1 (25th percentile) and Q3 (75th percentile)
2. Calculate IQR = Q3 - Q1
3. Define the lower bound = Q1 - 1.5 × IQR
4. Define the upper bound = Q3 + 1.5 × IQR
5. Values outside these bounds are considered outliers
In medical datasets, outliers may represent rare conditions, measurement errors, or genuinely extreme cases requiring special clinical attention. The 1.5 × IQR threshold is a standard convention that balances sensitivity and specificity.
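An outliers() sketch implementing the IQR rule above; the ALT value of 400 is a made-up extreme meant to be flagged.

```python
# outliers() sketch: count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
# for each numeric column.
import pandas as pd

def outliers(df: pd.DataFrame) -> dict:
    counts = {}
    for col in df.select_dtypes(include="number"):
        q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        counts[col] = int(((df[col] < lower) | (df[col] > upper)).sum())
    return counts

df = pd.DataFrame({"ALT": [30, 35, 40, 38, 33, 400]})
```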
7. Visualization Functions - Detailed Explanation
All visualization functions in this module use Plotly, which creates interactive HTML-based charts. Unlike static images, Plotly visualizations allow users to zoom, pan, hover for details, and export images directly from the browser.
7.1 Medical Significance of Visualizations
Each visualization type serves specific clinical and research purposes:
Distribution Plots
• Age Distribution: Identifies if study population is representative
• BMI Distribution: Shows obesity prevalence in patient population
• Glucose Distribution: Reveals diabetes and prediabetes prevalence
• Cholesterol Distribution: Indicates cardiovascular risk in population
Correlation Matrix
The correlation heatmap is one of the most important visualizations for medical research. Key relationships to look for:
• BMI ↔ Insulin/HOMA: Validates insulin resistance in obesity
• ALT ↔ AST: Both are liver enzymes; strong correlation expected
• Glucose ↔ HbA1c: Both measure blood sugar; correlation validates data quality
• Triglycerides ↔ HDL: Inverse relationship indicates metabolic dysfunction
Box Plots for Outlier Detection
Box plots visually display the distribution and identify extreme values. In medical context:
• Outliers may represent measurement errors requiring data cleaning
• Outliers may represent rare conditions worthy of case studies
• Severe outliers in liver enzymes may indicate acute liver injury
Disease vs Feature Comparisons
These visualizations compare biomarker levels across disease categories:
• BMI vs Fibrosis: Tests if obesity correlates with liver scarring
• Glucose vs Diabetes: Validates diabetes diagnosis accuracy
• ALT vs NASH: Determines if liver enzyme elevation distinguishes NASH from NAFL
Feature Importance (Machine Learning)
After training a predictive model (e.g., Random Forest), this visualization shows which features contribute most to predictions. High importance features become:
• Candidates for biomarker panels
• Focus areas for clinical intervention
• Variables to monitor during treatment
8. Technical Implementation Details
8.1 Why FastAPI?
FastAPI was chosen for several technical advantages:
• High Performance: Built on Starlette and Pydantic, comparable to NodeJS and Go
• Automatic API Documentation: Generates interactive docs at /docs endpoint
• Type Safety: Uses Python type hints for validation
• Easy Testing: Built-in test client for endpoint testing
• Modern Python: Async/await support for concurrent requests
8.2 Data Loading Strategy
The dataset is loaded once at application startup and stored in memory. This design choice offers:
• Speed: No disk I/O for each request
• Consistency: All requests work with same dataset version
• Simplicity: No database setup required
However, this approach has limitations:
• Changes to source Excel file require application restart
• Memory usage scales with dataset size
• Not suitable for very large datasets (>10GB)
8.3 Imputation Methodology
The IterativeImputer uses a sophisticated algorithm:
1. Initialize missing values with column means
2. For each feature with missing values:
• Treat it as the target variable
• Use the other features as predictors
• Train a regression model (Bayesian Ridge by default)
• Predict the missing values
3. Repeat step 2 for all features in round-robin fashion
4. Iterate until convergence or the maximum number of iterations (10 by default)
Why Bayesian Ridge Regression?
• Robust to multicollinearity (common in medical data)
• Automatic regularization prevents overfitting
• Probabilistic predictions account for uncertainty
• Works well with limited sample sizes
8.4 HTML Rendering Approach
The EDA endpoint constructs HTML programmatically using Python string formatting. This approach:
• Avoids template engine overhead
• Provides full control over HTML structure
• Enables dynamic table generation from dictionaries
• Results in self-contained HTML documents
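A minimal sketch of this string-formatting approach (the helper name is a made-up illustration, not the project's function): turn a results dict into an HTML table with no template engine.

```python
# Build an HTML table from a results dict using f-strings only.
def dict_to_html_table(title: str, data: dict) -> str:
    rows = "".join(f"<tr><td>{k}</td><td>{v}</td></tr>" for k, v in data.items())
    return f"<h2>{title}</h2><table border='1'>{rows}</table>"

html = dict_to_html_table("Missing Values", {"Age": 0, "BMI": 3})
```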
9. Usage Guide
9.1 Installation Requirements
Install required Python packages:
pip install fastapi pandas scikit-learn plotly missingno yellowbrick openpyxl
9.2 Starting the Application
Run with Uvicorn (ASGI server):
uvicorn main:app --reload
The --reload flag enables auto-restart on code changes (development only).
9.3 Accessing Endpoints
Once running, access endpoints through your browser:
• View Dataset: http://localhost:8000/view
• NAFLD Information: http://localhost:8000/NAFLD
• Exploratory Data Analysis: http://localhost:8000/eda
• API Documentation: http://localhost:8000/docs
9.4 Interpreting EDA Results
Missing Values Report
The report shows missing values before and after imputation. Columns with high missing rates (>20%) should be:
• Investigated for systematic missingness patterns
• Considered for exclusion if missingness is non-random
• Flagged in research limitations
Summary Statistics Interpretation
• Mean vs Median: Large difference indicates skewed distribution
• Standard Deviation: High values indicate high variability in population
• Min/Max: Check for biologically impossible values (data errors)
Correlation Matrix Guidelines
• |r| > 0.7: Strong correlation (consider multicollinearity)
• 0.3 < |r| < 0.7: Moderate correlation
• |r| < 0.3: Weak or no correlation
Outlier Analysis
High outlier counts may indicate:
• Data entry errors (values should be verified)
• Heterogeneous patient population
• Rare medical conditions worthy of investigation
10. Future Enhancements
10.1 Immediate Improvements
1. Complete Visualization Endpoint: Implement the /visualisation route to generate and display plots
2. Error Handling: Add try-except blocks for file loading and processing errors
3. Configuration File: Move the file path to a config file instead of hardcoding it
4. Logging: Add comprehensive logging for debugging and monitoring
5. Unit Tests: Write tests for the analysis and preprocessing functions
10.2 Feature Additions
1. Machine Learning Models:
• Predictive models for fibrosis progression
• Classification models for NASH vs NAFL
• Risk scoring algorithms
2. Advanced Visualizations:
• Interactive dashboards with filters
• Time series analysis if longitudinal data becomes available
• 3D scatter plots for multivariate relationships
3. Database Integration:
• PostgreSQL or MongoDB for scalability
• Support for multiple datasets
• Data versioning and audit trails
4. User Interface:
• React or Vue.js frontend
• User authentication and role-based access
• Report export functionality (PDF, Excel)
5. Statistical Testing:
• T-tests and ANOVA for group comparisons
• Chi-square tests for categorical associations
• Survival analysis for disease progression
10.3 Production Readiness
• Docker Containerization: Create Dockerfile for easy deployment
• HTTPS/TLS: Secure communication for patient data
• HIPAA Compliance: Implement data encryption and access controls
• Load Balancing: Support multiple concurrent users
• Monitoring: Application performance monitoring and alerting
10.4 Research Extensions
• Multi-center Data Integration: Combine datasets from different hospitals
• Genomic Data Integration: Link genetic markers to disease phenotypes
• Natural Language Processing: Extract insights from clinical notes
• Deep Learning: Neural networks for image analysis (ultrasound, MRI)
Conclusion
This NAFLD Data Analysis System represents a solid foundation for medical data analysis and research. The modular architecture separates concerns effectively, making the codebase maintainable and extensible. The use of modern Python libraries (FastAPI, Pandas, Scikit-learn, Plotly) ensures that the system leverages best-in-class tools for web development, data processing, and visualization.
The iterative imputation approach demonstrates sophistication beyond simple statistical methods, while the comprehensive visualization framework provides multiple perspectives for data exploration. The RESTful API design makes the system accessible and integrable with other tools and platforms.
While the current implementation focuses on exploratory analysis, the architecture is well-positioned for future enhancements including machine learning models, interactive dashboards, and production deployment. The clear separation between data processing, analysis, and presentation layers will facilitate these extensions without major refactoring.
For medical researchers working with NAFLD data, this system provides an efficient pathway from raw data to actionable insights, enabling evidence-based clinical decision making and supporting the advancement of liver disease research.
Appendix: Code Structure Summary
File Organization
• main.py (~120 lines): GET /view, GET /NAFLD, GET /eda, GET /visualisation
• analysis.py (~30 lines): basic_info, missing_values, summary_stats, correlation, outliers
• preprocessing.py (~10 lines): iterative_imputation
• visualisation.py (~150 lines): 10 visualization functions (distribution, correlation, outliers, etc.)
Key Dependencies
• fastapi - Web framework
• pandas - Data manipulation
• scikit-learn - Machine learning and preprocessing
• plotly - Interactive visualizations
• missingno - Missing data visualization
• yellowbrick - Machine learning visualization
• openpyxl - Excel file handling