R.Ranjani Priyaa | Freelancer Data Cleaning Report

data cleaning report

DATA CLEANING AND PREPROCESSING (Medical Appointment No Shows) For this project, I used the Medical Appointment No-Shows dataset from Kaggle, which I renamed as task1. Two methods were applied for data cleaning and preprocessing: Microsoft Excel and Python. 1) Microsoft Excel To work with the dataset in Excel, I first converted it from CSV to XLSX format. After reviewing the data, I performed several cleaning and preprocessing steps: Column Names and Data Standardization: • Column names had mixed casing and spaces. I standardized them by converting to lowercase and replacing spaces with underscores. o Example: Patient ID → patient_id, Appointment Date → appointment_date • The gender column used abbreviations M and F instead of full names. • scheduled_day and appointment_day columns combined both date and time. The appointment_day time was always 00:00:00. • The age column contained a value of -1. Handling Missing Values: • Applied filters to all columns to identify missing values. Removing Duplicates: • Used the Remove Duplicates tool from the Data tab to eliminate duplicate rows. Standardizing Data Values: • Changed column headings to a consistent format (lowercase, no spaces). • Used Find and Replace to update F → Female and M → Male. (Initially, I mistakenly replaced all occurrences of M in the sheet, which required correction.) Date and Time Formatting: • Used the Text to Columns tool to separate dates and times. • Original format:-T08:39:56Z • Steps: 1. Split by T →- (date) and 08:39:56Z (time) 2. Split by Z → 08:39:56 3. Reformatted date from YYYY-MM-DD to DD-MM-YYYY • Deleted appointment_time column since it was always 00:00:00. Correcting Age Values: • Changed invalid age -1 to 1. Mistake I have done so far: I used find and replace for whole sheet so, every M change into Male. And I clean the error once again. 2) Python For automated and reproducible cleaning, I used Python with Jupyter Notebook in Visual Studio. Steps: Load Dataset: import pandas as pd df = pd.read_csv("task 1 csv.csv") Handle Missing Values: print("Missing values:\n", df.isnull().sum()) Remove Duplicates: df.drop_duplicates(inplace=True) Standardize Text Columns: for col in df.select_dtypes(include='object').columns: df[col] = df[col].str.strip().str.lower() df['gender'] = df['gender'].astype(str).str.strip().str.upper().map({ 'F': 'female', 'M': 'male' }) Format Date and Time Columns: if 'scheduledday' in df.columns: df['scheduledday'] = pd.to_datetime(df['scheduledday'], errors='coerce') df['scheduled_date'] = df['scheduledday'].dt.strftime('%d-%m-%Y') df['scheduled_time'] = df['scheduledday'].dt.strftime('%H:%M:%S') df.drop(columns=['scheduledday'], inplace=True) if 'appointmentday' in df.columns: df['appointmentday'] = pd.to_datetime(df['appointmentday'], errors='coerce') df['appointment_date'] = df['appointmentday'].dt.strftime('%d-%m-%Y') df['appointment_time'] = df['appointmentday'].dt.strftime('%H:%M:%S') df.drop(columns=['appointmentday'], inplace=True) Rename Columns: df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_') df.drop(columns=['appointment_time'], inplace=True) Fix Data Types: if 'age' in df.columns: df['age'] = df['age'].astype(int, errors='ignore') if 'date' in df.columns: df['date'] = pd.to_datetime(df['date'], errors='coerce') Export Cleaned Data: df.to_csv("cleaned_data.csv", index=False) print("\nData cleaning completed! Cleaned file saved as 'cleaned_data.csv'") Summary of Python Cleaning: • Missing values identified and handled. • Duplicate rows removed. • Text columns standardized. • Gender values mapped correctly to Male and Female. • Dates and times extracted and formatted properly. • Unnecessary columns removed. • Data types corrected. • Cleaned dataset saved as cleaned_data.csv.