Demographic analysis using pandas
Analyzing demographic data using Pandas in Python
The provided code performs demographic data analysis using the Pandas library in Python. It reads a dataset from a CSV file and calculates various statistics and percentages based on the data. Here's a general overview of what the code does:
The code imports the Pandas library, which is used for data manipulation and analysis.
The calculate_demographic_data function is defined. It takes an optional argument print_data which determines whether the calculated results will be printed.
Inside the function, the dataset is read from the 'adult_data.csv' file using pd.read_csv, and the resulting data is stored in a Pandas DataFrame called df.
The code calculates the number of people of each race by counting the occurrences of each race in the 'race' column using the value_counts() method. The result is stored in a Pandas Series called race_count, with the race names as the index labels.
The average age of men is calculated by filtering the DataFrame for rows where the 'sex' column is 'Male' and then computing the mean of the 'age' column. The average age is rounded to one decimal place and stored in the average_age_men variable.
The percentage of people with a Bachelor's degree is determined by filtering the DataFrame for rows where the 'education' column is 'Bachelors', counting the number of such rows, dividing it by the total number of rows in the DataFrame, and multiplying by 100. The percentage is rounded to one decimal place and stored in the percentage_bachelors variable.
The code calculates the percentage of people with advanced education (Bachelors, Masters, or Doctorate) who earn more than $50K. It filters the DataFrame for rows where the 'education' column is 'Bachelors', 'Masters', or 'Doctorate', and then filters for those with salaries greater than $50K. The percentage is rounded to one decimal place and stored in the higher_education_rich variable.
The code calculates the percentage of people without advanced education (excluding Bachelors, Masters, and Doctorate) who earn more than $50K. Similar to the previous step, it filters the DataFrame for rows where the 'education' column does not contain those advanced degrees, and then filters for those with salaries greater than $50K. The resulting percentage is rounded to one decimal place and stored in the lower_education_rich variable.
The minimum number of hours a person works per week is determined by finding the minimum value in the 'hours-per-week' column of the DataFrame using the min() method. The minimum value is stored in the min_work_hours variable.
The code calculates the percentage of people who work the minimum number of hours per week and earn more than $50K. It filters the DataFrame for rows where the 'hours-per-week' column equals the minimum value and then filters for those with salaries greater than $50K. The length of this subset is divided by the total number of people working the minimum hours and multiplied by 100. The resulting percentage is rounded to one decimal place and stored in the rich_percentage variable.
The code identifies the country with the highest percentage of people earning more than $50K. It calculates the ratio of people with salaries greater than $50K to the total number of people for each country and determines the country with the highest ratio. The highest-earning country name is stored in the highest_earning_country variable.
The code calculates the percentage of people in the highest-earning country who earn more than $50K. It filters the DataFrame for rows where the salary is greater than $50K and the native country matches the highest-earning country. The ratio of these rows to the total number of rows for the highest-earning country is calculated and multiplied by 100. The resulting percentage is rounded to one decimal place and stored in the highest_earning_country_percentage variable.
The code identifies the most popular occupation among those who earn more than $50K in India. It filters the DataFrame for rows where the salary is greater than $50K and the native country is India, and then determines the occupation with the highest frequency using the value_counts() method. The most popular occupation is stored in the top_IN_occupation variable.
Finally, a dictionary named demographic_data is created, containing all the calculated values.
If print_data is set to True, the function prints the calculated values in a key-value format.
The demographic_data dictionary is returned by the function, providing access to the calculated values for further usage.
import pandas as pd
def calculate_demographic_data(print_data=True):
# Read the dataset
df = pd.read_csv('adult_data.csv')
# Calculate the number of people of each race
race_count = df['race'].value_counts()
# Calculate the average age of men
average_age_men = round(df[df['sex'] == 'Male']['age'].mean(), 1)
# Calculate the percentage of people with a Bachelor's degree
percentage_bachelors = round(len(df[df['education'] == 'Bachelors']) / len(df) * 100, 1)
# Calculate the percentage of people with advanced education earning >50K
higher_education = df[df['education'].isin(['Bachelors', 'Masters', 'Doctorate'])]
higher_education_rich = round(len(higher_education[higher_education['salary'] == '>50K']) / len(higher_education) * 100, 1)
# Calculate the percentage of people without advanced education earning >50K
lower_education = df[~df['education'].isin(['Bachelors', 'Masters', 'Doctorate'])]
lower_education_rich = round(len(lower_education[lower_education['salary'] == '>50K']) / len(lower_education) * 100, 1)
# Calculate the minimum number of hours a person works per week
min_work_hours = df['hours-per-week'].min()
# Calculate the percentage of the people who work the minimum number of hours per week and earn >50K
num_min_workers = len(df[df['hours-per-week'] == min_work_hours])
rich_percentage = round(len(df[(df['hours-per-week'] == min_work_hours) & (df['salary'] == '>50K')]) / num_min_workers * 100, 1)
# Identify the country with the highest percentage of people earning >50K and calculate that percentage
highest_earning_country = (df[df['salary'] == '>50K']['native-country'].value_counts() / df['native-country'].value_counts()).idxmax()
highest_earning_country_percentage = round((df[(df['salary'] == '>50K') & (df['native-country'] == highest_earning_country)].shape[0] / df[df['native-country'] == highest_earning_country].shape[0]) * 100, 1)
# Identify the most popular occupation for those who earn >50K in India
top_IN_occupation = df[(df['salary'] == '>50K') & (df['native-country'] == 'India')]['occupation'].value_counts().idxmax()
# Create a dictionary with the calculated values
demographic_data = {
'race_count': race_count,
'average_age_men': average_age_men,
'percentage_bachelors': percentage_bachelors,
'higher_education_rich': higher_education_rich,
'lower_education_rich': lower_education_rich,
'min_work_hours': min_work_hours,
'rich_percentage': rich_percentage,
'highest_earning_country': highest_earning_country,
'highest_earning_country_percentage': highest_earning_country_percentage,
'top_IN_occupation': top_IN_occupation
}
# Print the data if requested
if print_data:
for key, value in demographic_data.items():
print(f'{key}: {value}')
return demographic_data