NAFLD Data Analysis System
Non-Alcoholic Fatty Liver Disease
Comprehensive Project Documentation
A FastAPI-based Web Application for Medical Data Analysis
April 8, 2026
Table of Contents
1. Executive Summary
2. Project Overview
3. System Architecture
4. Module Documentation
5. API Endpoints
6. Data Analysis Functions
7. Visualization Functions
8. Technical Implementation Details
9. Usage Guide
10. Future Enhancements
1. Executive Summary
This project implements a comprehensive web-based data analysis system for Non-Alcoholic Fatty Liver Disease (NAFLD) research. The system is built using FastAPI, a modern Python web framework, and provides automated exploratory data analysis (EDA), data preprocessing, and interactive visualizations for medical researchers and healthcare professionals.
Key Features
• RESTful API for data access and analysis
• Automated missing value imputation using iterative methods
• Comprehensive statistical analysis including correlation and outlier detection
• Interactive visualizations using Plotly
• Medical-specific analytics for liver disease indicators
• HTML-based reporting for easy sharing and presentation
Technology Stack
• Web Framework: FastAPI
• Data Processing: Pandas, NumPy
• Machine Learning: Scikit-learn
• Visualization: Plotly, Missingno, Yellowbrick
2. Project Overview
2.1 About NAFLD
Non-Alcoholic Fatty Liver Disease (NAFLD) is a medical condition characterized by excess fat accumulation in the liver of individuals who consume little to no alcohol. It represents a spectrum of liver conditions ranging from simple fatty liver (NAFL) to non-alcoholic steatohepatitis (NASH), which can progress to cirrhosis and liver cancer.
2.2 Project Purpose
This system is designed to facilitate medical research by providing:
• Automated data quality assessment and cleaning
• Statistical analysis of patient biomarkers and clinical indicators
• Visualization of disease patterns and risk factors
• Web-based access to analysis results
2.3 Dataset Description
The dataset contains comprehensive medical information for NAFLD patients including:
• Demographics: Age, Gender, Height, Weight, BMI
• Body Measurements: Waist and Hip Circumference
• Vital Signs: Systolic and Diastolic Blood Pressure
• Comorbidities: Diabetes, Hypertension, Hyperlipidemia, Metabolic Syndrome
• Liver Function Tests: AST, ALT, ALP, GGT, LDH
• Lipid Profile: Total Cholesterol, Triglycerides, HDL, LDL
• Blood Parameters: Glucose, Insulin, HOMA Index
• Disease Indicators: Fibrosis Status, Steatosis, NAS Score, Diagnosis
3. System Architecture
The system follows a modular architecture with clear separation of concerns:
3.1 Application Structure
• main.py: FastAPI application, API endpoints, request routing
• analysis.py: Statistical analysis functions (summary stats, correlation, outliers)
• preprocessing.py: Data cleaning and missing value imputation
• visualisation.py: Data visualization functions using Plotly and other libraries
3.2 Data Flow
1. User sends HTTP request to FastAPI endpoint
2. Main application loads Excel dataset into Pandas DataFrame
3. Preprocessing module cleans and imputes missing data
4. Analysis module performs statistical computations
5. Results formatted as HTML or JSON
6. Response returned to user's browser
4. Module Documentation
4.1 main.py - Application Core
This is the central module that orchestrates the entire application. It initializes the FastAPI framework, loads the dataset, and defines all API endpoints.
Key Components:
• FastAPI Instance: Creates the web application server
• Dataset Loading: Reads NAFLD.xlsx file into memory using Pandas
• Module Imports: Imports analysis and visualization functions
• Data Copy: Creates a working copy to preserve original data
Design Pattern:
The module follows the Model-View-Controller (MVC) pattern where FastAPI acts as the controller, the DataFrame represents the model, and HTML responses serve as views.
4.2 analysis.py - Statistical Analysis
This module provides core statistical analysis functions that extract insights from the dataset. All functions are pure and stateless, accepting a DataFrame and returning processed results.
Functions Overview:
• basic_info(): Returns row count and column names
• missing_values(): Counts null values per column
• summary_stats(): Computes mean, std, min, max, and quartiles for numeric columns
• correlation(): Calculates the Pearson correlation matrix between numeric features
• outliers(): Detects outliers using the IQR method (below Q1 - 1.5*IQR or above Q3 + 1.5*IQR)
4.3 preprocessing.py - Data Cleaning
This module handles data quality issues, specifically missing value imputation. It uses advanced machine learning techniques to predict and fill missing values based on patterns in complete data.
Iterative Imputation Method:
The IterativeImputer from scikit-learn uses a round-robin approach where each feature with missing values is modeled as a function of other features. This is more sophisticated than simple mean/median imputation as it captures relationships between variables.
Algorithm Steps:
1. Identify numeric columns with missing values
2. For each column, use the other columns to predict its missing values
3. Iterate until convergence or the maximum number of iterations is reached
4. Replace missing values with the predicted values
Advantages:
• Preserves relationships between variables
• More accurate than simple mean imputation
• Handles multiple missing values per row
• Uses Bayesian ridge regression by default
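As a concrete illustration of the round-robin method, the sketch below fills gaps in a small frame of made-up values; in the project this would run on the numeric columns of the NAFLD dataset.

```python
# Iterative imputation sketch with made-up values. IterativeImputer is
# still experimental in scikit-learn and must be enabled via the
# explicit import below.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.DataFrame({
    "BMI":     [31.2, 27.8, np.nan, 35.0],
    "Glucose": [110.0, np.nan, 98.0, 140.0],
    "ALT":     [55.0, 32.0, 40.0, np.nan],
})

numeric = df.select_dtypes(include="number")
imputer = IterativeImputer(max_iter=10, random_state=0)  # BayesianRidge by default
filled = pd.DataFrame(imputer.fit_transform(numeric), columns=numeric.columns)
```

After fitting, every cell holds either an observed or a predicted value, so `filled` contains no missing entries.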
4.4 visualisation.py - Data Visualization
This module contains functions for creating interactive visualizations using Plotly and specialized medical data visualization libraries. The visualizations are designed specifically for medical research and disease pattern analysis.
Visualization Categories:
1. Distribution Analysis
• plot_distribution(): Histograms for Age, BMI, Glucose, Cholesterol
• Purpose: Detect skewness and abnormal ranges
2. Correlation Analysis
• plot_correlation_matrix(): Heatmap of all numeric features
• Key Relationships: BMI ↔ Glucose, ALT ↔ AST, Insulin ↔ HOMA
3. Outlier Detection
• plot_outliers(): Box plots for BMI, Glucose, ALT, AST
• Purpose: Identify extreme patients requiring special attention
4. Disease Risk Analysis
• plot_disease_vs_feature(): Compare features across disease groups
• Examples: BMI vs Fibrosis, Glucose vs Diabetes, ALT vs NASH
5. Categorical Distribution
• plot_categorical_distribution(): Count plots for Gender, Smoking, Disease Type
6. Feature Relationships
• feature_relationships(): Scatter matrix for BMI, Glucose, Insulin, HOMA
• Purpose: Identify clustering and multivariate relationships
7. Missing Value Patterns
• plot_missing_values(): Matrix, bar, heatmap, and dendrogram visualizations using missingno
8. Medical-Specific Combinations
• plot_medical_combos(): Clinically relevant pairs
• BMI vs HOMA (insulin resistance indicator)
• Glucose vs HbA1c (blood sugar control)
• ALT vs AST ratio (liver damage pattern)
• Cholesterol vs HDL (cardiovascular risk)
5. API Endpoints
5.1 GET /view
• Purpose: Display raw dataset in HTML table format
• Response Type: HTML
• Use Case: Quick data inspection in browser
This endpoint reads the Excel file and converts it to an HTML table using Pandas' to_html() method, allowing quick visual inspection of the raw data without any processing.
5.2 GET /NAFLD
• Purpose: Provide comprehensive information about NAFLD
• Response Type: Plain Text
• Content Includes:
• Disease definition and progression
• Symptoms and causes
• Diagnosis methods
• Treatment approaches
• Complete data dictionary for all patient variables
5.3 GET /eda
• Purpose: Perform comprehensive Exploratory Data Analysis
• Response Type: HTML
• Processing Steps:
1. Create a local copy of the data
2. Count missing values (before imputation)
3. Apply iterative imputation
4. Count missing values (after imputation; should be zero)
5. Calculate basic info (rows, columns)
6. Compute summary statistics
7. Calculate the correlation matrix
8. Detect outliers using the IQR method
9. Format results as HTML tables
10. Return a comprehensive HTML report
5.4 GET /visualisation
• Status: Incomplete (endpoint defined but not implemented)
• Intended Purpose: Generate and display interactive visualizations
Note: The visualization functions are defined in visualisation.py but are not currently integrated into this endpoint. Implementation would involve calling the plot functions and embedding the results in HTML.
6. Data Analysis Functions - Detailed Explanation
6.1 basic_info(df)
Input: Pandas DataFrame
Output: Dictionary with 'rows' and 'columns' keys
This function provides the most basic metadata about the dataset. It returns the total number of rows (patient records) and a list of all column names. This is typically the first step in any data analysis to understand the dataset structure.
6.2 missing_values(df)
Input: Pandas DataFrame
Output: Dictionary mapping column names to count of missing values
Uses Pandas' isnull() method to identify missing (NaN, None, NaT) values and returns a count for each column. This is crucial for data quality assessment and for deciding on imputation strategies. In medical datasets, missing values can arise from tests not being performed, patient non-compliance, or data entry errors.
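Plausible implementations of both functions, matching the described inputs and outputs (sketches, not necessarily the project's exact code):

```python
# Sketches of basic_info() and missing_values() with a tiny example frame.
import pandas as pd

def basic_info(df: pd.DataFrame) -> dict:
    return {"rows": len(df), "columns": list(df.columns)}

def missing_values(df: pd.DataFrame) -> dict:
    # isnull() flags NaN/None/NaT; sum() counts them per column
    return df.isnull().sum().to_dict()

df = pd.DataFrame({"Age": [45, None, 61], "BMI": [31.2, 27.8, None]})
```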
6.3 summary_stats(df)
Input: Pandas DataFrame
Output: Nested dictionary with statistics for each numeric column
Computes descriptive statistics using Pandas' describe() method. For each numeric column, calculates:
• Count: Number of non-null values
• Mean: Average value
• Std: Standard deviation (measure of spread)
• Min: Minimum value
• 25%: First quartile (25th percentile)
• 50%: Median (50th percentile)
• 75%: Third quartile (75th percentile)
• Max: Maximum value
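A sketch of summary_stats() built on describe(), as described above (example values are made up):

```python
# summary_stats() sketch: describe() yields count, mean, std, min,
# quartiles, and max for each numeric column.
import pandas as pd

def summary_stats(df: pd.DataFrame) -> dict:
    return df.describe().to_dict()

df = pd.DataFrame({"Glucose": [90.0, 110.0, 130.0]})
stats = summary_stats(df)
```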
6.4 correlation(df)
Input: Pandas DataFrame
Output: Nested dictionary representing correlation matrix
Calculates Pearson correlation coefficients between all pairs of numeric variables. Correlation values range from -1 to 1:
• 1: Perfect positive correlation
• 0: No linear correlation
• -1: Perfect negative correlation
In medical research, correlation analysis helps identify relationships between biomarkers, risk factors, and disease outcomes. For example, high correlation between BMI and insulin resistance (HOMA index) would support their known relationship.
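A correlation() sketch is shown below; the made-up values mimic the BMI/HOMA relationship mentioned above, and numeric_only=True skips non-numeric columns such as Gender.

```python
# correlation() sketch using Pandas' Pearson correlation (the default).
import pandas as pd

def correlation(df: pd.DataFrame) -> dict:
    return df.corr(numeric_only=True).to_dict()

df = pd.DataFrame({
    "BMI":    [22.0, 27.0, 31.0, 35.0],
    "HOMA":   [1.1, 2.0, 3.2, 4.5],
    "Gender": ["F", "M", "F", "M"],  # excluded from the matrix
})
corr = correlation(df)
```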
6.5 outliers(df)
Input: Pandas DataFrame
Output: Dictionary mapping column names to outlier counts
Uses the Interquartile Range (IQR) method to detect outliers. This is a robust statistical method that works as follows:
1. Calculate Q1 (25th percentile) and Q3 (75th percentile)
2. Calculate IQR = Q3 - Q1
3. Define the lower bound = Q1 - 1.5 × IQR
4. Define the upper bound = Q3 + 1.5 × IQR
5. Values outside these bounds are considered outliers
In medical datasets, outliers may represent rare conditions, measurement errors, or genuinely extreme cases requiring special clinical attention. The 1.5 × IQR threshold is a standard convention that balances sensitivity and specificity.
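An outliers() sketch implementing the IQR rule above; the ALT value of 400 is a made-up extreme meant to be flagged.

```python
# outliers() sketch: count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
# for each numeric column.
import pandas as pd

def outliers(df: pd.DataFrame) -> dict:
    counts = {}
    for col in df.select_dtypes(include="number"):
        q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        counts[col] = int(((df[col] < lower) | (df[col] > upper)).sum())
    return counts

df = pd.DataFrame({"ALT": [30, 35, 40, 38, 33, 400]})
```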
7. Visualization Functions - Detailed Explanation
All visualization functions in this module use Plotly, which creates interactive HTML-based charts. Unlike static images, Plotly visualizations allow users to zoom, pan, hover for details, and export images directly from the browser.
7.1 Medical Significance of Visualizations
Each visualization type serves specific clinical and research purposes:
Distribution Plots
• Age Distribution: Identifies if study population is representative
• BMI Distribution: Shows obesity prevalence in patient population
• Glucose Distribution: Reveals diabetes and prediabetes prevalence
• Cholesterol Distribution: Indicates cardiovascular risk in population
Correlation Matrix
The correlation heatmap is one of the most important visualizations for medical research. Key relationships to look for:
• BMI ↔ Insulin/HOMA: Validates insulin resistance in obesity
• ALT ↔ AST: Both are liver enzymes; strong correlation expected
• Glucose ↔ HbA1c: Both measure blood sugar; correlation validates data quality
• Triglycerides ↔ HDL: Inverse relationship indicates metabolic dysfunction
Box Plots for Outlier Detection
Box plots visually display the distribution and identify extreme values. In medical context:
• Outliers may represent measurement errors requiring data cleaning
• Outliers may represent rare conditions worthy of case studies
• Severe outliers in liver enzymes may indicate acute liver injury
Disease vs Feature Comparisons
These visualizations compare biomarker levels across disease categories:
• BMI vs Fibrosis: Tests if obesity correlates with liver scarring
• Glucose vs Diabetes: Validates diabetes diagnosis accuracy
• ALT vs NASH: Determines if liver enzyme elevation distinguishes NASH from NAFL
Feature Importance (Machine Learning)
After training a predictive model (e.g., Random Forest), this visualization shows which features contribute most to predictions. High importance features become:
• Candidates for biomarker panels
• Focus areas for clinical intervention
• Variables to monitor during treatment
8. Technical Implementation Details
8.1 Why FastAPI?
FastAPI was chosen for several technical advantages:
• High Performance: Built on Starlette and Pydantic, comparable to NodeJS and Go
• Automatic API Documentation: Generates interactive docs at /docs endpoint
• Type Safety: Uses Python type hints for validation
• Easy Testing: Built-in test client for endpoint testing
• Modern Python: Async/await support for concurrent requests
8.2 Data Loading Strategy
The dataset is loaded once at application startup and stored in memory. This design choice offers:
• Speed: No disk I/O for each request
• Consistency: All requests work with same dataset version
• Simplicity: No database setup required
However, this approach has limitations:
• Changes to source Excel file require application restart
• Memory usage scales with dataset size
• Not suitable for very large datasets (>10GB)
8.3 Imputation Methodology
The IterativeImputer uses a sophisticated algorithm:
1. Initialize missing values with column means
2. For each feature with missing values:
• Treat it as the target variable
• Use the other features as predictors
• Train a regression model (Bayesian Ridge by default)
• Predict the missing values
3. Repeat step 2 for all features in round-robin fashion
4. Iterate until convergence or the maximum number of iterations (10 by default)
Why Bayesian Ridge Regression?
• Robust to multicollinearity (common in medical data)
• Automatic regularization prevents overfitting
• Probabilistic predictions account for uncertainty
• Works well with limited sample sizes
8.4 HTML Rendering Approach
The EDA endpoint constructs HTML programmatically using Python string formatting. This approach:
• Avoids template engine overhead
• Provides full control over HTML structure
• Enables dynamic table generation from dictionaries
• Results in self-contained HTML documents
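A minimal sketch of this string-formatting approach (the helper name is a made-up illustration, not the project's function): turn a results dict into an HTML table with no template engine.

```python
# Build an HTML table from a results dict using f-strings only.
def dict_to_html_table(title: str, data: dict) -> str:
    rows = "".join(f"<tr><td>{k}</td><td>{v}</td></tr>" for k, v in data.items())
    return f"<h2>{title}</h2><table border='1'>{rows}</table>"

html = dict_to_html_table("Missing Values", {"Age": 0, "BMI": 3})
```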
9. Usage Guide
9.1 Installation Requirements
Install required Python packages:
pip install fastapi pandas scikit-learn plotly missingno yellowbrick openpyxl
9.2 Starting the Application
Run with Uvicorn (ASGI server):
uvicorn main:app --reload
The --reload flag enables auto-restart on code changes (development only).
9.3 Accessing Endpoints
Once running, access endpoints through your browser:
• View Dataset: http://localhost:8000/view
• NAFLD Information: http://localhost:8000/NAFLD
• Exploratory Data Analysis: http://localhost:8000/eda
• API Documentation: http://localhost:8000/docs
9.4 Interpreting EDA Results
Missing Values Report
The report shows missing values before and after imputation. Columns with high missing rates (>20%) should be:
• Investigated for systematic missingness patterns
• Considered for exclusion if missingness is non-random
• Flagged in research limitations
Summary Statistics Interpretation
• Mean vs Median: Large difference indicates skewed distribution
• Standard Deviation: High values indicate high variability in population
• Min/Max: Check for biologically impossible values (data errors)
Correlation Matrix Guidelines
• |r| > 0.7: Strong correlation (consider multicollinearity)
• 0.3 < |r| < 0.7: Moderate correlation
• |r| < 0.3: Weak or no correlation
Outlier Analysis
High outlier counts may indicate:
• Data entry errors (values should be verified)
• Heterogeneous patient population
• Rare medical conditions worthy of investigation
10. Future Enhancements
10.1 Immediate Improvements
1. Complete Visualization Endpoint: Implement the /visualisation route to generate and display plots
2. Error Handling: Add try-except blocks for file loading and processing errors
3. Configuration File: Move the file path to a config file instead of hardcoding it
4. Logging: Add comprehensive logging for debugging and monitoring
5. Unit Tests: Write tests for the analysis and preprocessing functions
10.2 Feature Additions
1. Machine Learning Models:
• Predictive models for fibrosis progression
• Classification models for NASH vs NAFL
• Risk scoring algorithms
2. Advanced Visualizations:
• Interactive dashboards with filters
• Time series analysis if longitudinal data becomes available
• 3D scatter plots for multivariate relationships
3. Database Integration:
• PostgreSQL or MongoDB for scalability
• Support for multiple datasets
• Data versioning and audit trails
4. User Interface:
• React or Vue.js frontend
• User authentication and role-based access
• Report export functionality (PDF, Excel)
5. Statistical Testing:
• T-tests and ANOVA for group comparisons
• Chi-square tests for categorical associations
• Survival analysis for disease progression
10.3 Production Readiness
• Docker Containerization: Create Dockerfile for easy deployment
• HTTPS/TLS: Secure communication for patient data
• HIPAA Compliance: Implement data encryption and access controls
• Load Balancing: Support multiple concurrent users
• Monitoring: Application performance monitoring and alerting
10.4 Research Extensions
• Multi-center Data Integration: Combine datasets from different hospitals
• Genomic Data Integration: Link genetic markers to disease phenotypes
• Natural Language Processing: Extract insights from clinical notes
• Deep Learning: Neural networks for image analysis (ultrasound, MRI)
Conclusion
This NAFLD Data Analysis System represents a solid foundation for medical data analysis and research. The modular architecture separates concerns effectively, making the codebase maintainable and extensible. The use of modern Python libraries (FastAPI, Pandas, Scikit-learn, Plotly) ensures that the system leverages best-in-class tools for web development, data processing, and visualization.
The iterative imputation approach demonstrates sophistication beyond simple statistical methods, while the comprehensive visualization framework provides multiple perspectives for data exploration. The RESTful API design makes the system accessible and integrable with other tools and platforms.
While the current implementation focuses on exploratory analysis, the architecture is well-positioned for future enhancements including machine learning models, interactive dashboards, and production deployment. The clear separation between data processing, analysis, and presentation layers will facilitate these extensions without major refactoring.
For medical researchers working with NAFLD data, this system provides an efficient pathway from raw data to actionable insights, enabling evidence-based clinical decision making and supporting the advancement of liver disease research.
Appendix: Code Structure Summary
File Organization
• main.py (~120 lines): GET /view, GET /NAFLD, GET /eda, GET /visualisation
• analysis.py (~30 lines): basic_info, missing_values, summary_stats, correlation, outliers
• preprocessing.py (~10 lines): iterative_imputation
• visualisation.py (~150 lines): 10 visualization functions (distribution, correlation, outliers, etc.)
Key Dependencies
• fastapi - Web framework
• pandas - Data manipulation
• scikit-learn - Machine learning and preprocessing
• plotly - Interactive visualizations
• missingno - Missing data visualization
• yellowbrick - Machine learning visualization
• openpyxl - Excel file handling