FUNDAMENTALS OF DATA SCIENCE
CENSUS DATA REPORT
INTRODUCTION
We are presented with the 1881 census record of a moderately sized town between two large
cities. The town does not have a university but has university students who reside in the town
and commute to the university in the neighboring city.
The objective of this report is to examine the census data, apply the necessary data cleaning and
analysis methods, and develop insights that will enable the local government to decide:
• What to build on an unoccupied piece of land
• What to invest in within the town
To answer these questions, the report is divided into three sections. The first section is data
cleaning, which corrects inappropriate entries and errors in the data. The second section is data
analysis and visualization, which gives insight into the attributes of the population. The last
section discusses the insights and provides recommendations to meet these objectives.
DATA CLEANING
To clean the data, the records were first checked for duplicates, and the columns that make up
the data frame were then examined to identify those that are important for analysis. After the
census CSV dataset was loaded into a Pandas data frame, it had 12 columns. The first column,
named ‘Unnamed: 0’, contains the serial numbers of the records and was dropped from the data
frame because it is not important to our analysis.
Each column was then cleaned by inspecting its unique values to check that the responses were
imported correctly. In the House Number column, the incorrect input ‘Four’ was discovered. This
was replaced with the correct input of ‘4’ and the column was converted to the integer datatype.
In the First Name and Surname columns, one record had a missing first name and four records had
a missing surname. No record had both first name and surname missing. These records were left
in the data frame because the surname and first name are not needed for our analysis.
The Age column had some incorrect inputs such as ’-’ and ‘eight’. These were corrected by
replacing the text entry with the integer 8 and converting float values to integers. Three
records were found to have no input for age. These were corrected by replacing the empty string
with the mode of the Age column.
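The House Number and Age corrections described above could be sketched in pandas roughly as follows. This is a minimal illustration on made-up rows, not the report's actual code: the column names and the sample values are assumptions.

```python
import pandas as pd

# Toy rows standing in for the census file; names and values are assumed.
df = pd.DataFrame({
    "House Number": ["1", "Four", "12", "7", "7"],
    "Age": ["35", "eight", "35", "", "62.0"],
})

# Replace the spelled-out house number, then cast the column to integers.
house = df["House Number"].replace({"Four": "4"})
df["House Number"] = pd.to_numeric(house).astype(int)

# Replace the spelled-out age; anything still non-numeric becomes NaN.
age = pd.to_numeric(df["Age"].replace({"eight": "8"}), errors="coerce")

# Fill missing ages with the most frequent age, then cast floats to integers.
df["Age"] = age.fillna(age.mode().iloc[0]).astype(int)
```

Using `errors="coerce"` turns empty strings into missing values, so the mode is computed only from valid ages before it is used as the fill value.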
Two records were missing input in the Relationship to Head of House column. This was
corrected by examining each person's address, household composition and age to determine the
most appropriate relationship to assign.
In the Marital Status and Gender columns, abbreviated inputs were expanded to their full values,
for example ‘F’ to Female and ‘M’ to Male in the Gender column, and ‘M’ to Married and ‘D’ to
Divorced in the Marital Status column. Marital Status for minors was entered as NA, meaning ‘Not
Applicable’. The legal marriage age is 16 years for females with the consent of a parent, while
that for males is 18 years.
In one household, the Head was a 17-year-old student and all other occupants were lodgers. The
records for this household were dropped from the dataset because a respondent must be 18 years
or older to be considered the Head of Household (Sharfman, 2022).
One outlier was removed: a record showing a 35-year-old man with the occupation ‘Child’ was
dropped from the data frame.
In the Religion column, the religion for minors is recorded as NA. Pandas reads this as NaN, so
it was replaced with the string ‘NA’ for it to be considered during analysis. Two records with
‘Sith’ as religion were dropped from the data frame. This is because ‘Sith’ is not a real-world
religion; it is a fictional religion in the movie franchise ‘Star Wars’ (Wikipedia, 2022). There
are two records with the input ‘Undecided’ and one with ‘Private’ for religion. Since the number
of records with these inputs is too small to be statistically meaningful, the inputs were
replaced with ‘None’ to get a cleaner analysis.
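A possible pandas sketch of the three Religion steps above is shown here. The rows are invented for illustration and the column names are assumed; only the cleaning rules come from the text.

```python
import pandas as pd

# Toy rows; the first entry mimics a minor whose religion loaded as missing.
df = pd.DataFrame({
    "Age": [10, 40, 33, 51, 27],
    "Religion": [None, "Sith", "Undecided", "Private", "Christian"],
})

# Missing religion entries become the literal string "NA" so they are counted.
df["Religion"] = df["Religion"].fillna("NA")

# Drop rows carrying the fictional religion.
df = df[df["Religion"] != "Sith"].reset_index(drop=True)

# Fold the rare "Undecided" and "Private" responses into "None".
df["Religion"] = df["Religion"].replace({"Undecided": "None", "Private": "None"})
```

Keeping ‘NA’ (not applicable) distinct from ‘None’ (no religion) matters later, because the religion percentages are reported for non-minors only.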
ANALYSIS AND INSIGHTS
Population Demographics
After data cleaning, an initial review of the population was carried out to get a basic
understanding of the population and its unique attributes. Insights from the population
demographics indicate which attributes need further analysis.
Population pyramid
The population pyramid is a graphical representation of the distribution of age groups by gender.
To plot the age pyramid, a new column for age group was added to the data frame by grouping ages
into five-year intervals. Analysis of the age distribution shows that the town does not have an
aging population: a considerable number of people are of working age, which is 16-64 years
(GOV.UK, 2018). Since the median age of the town is 35, there is no pressing need for the
construction of care homes for the elderly, and more opportunities should be provided for the
mostly youthful population.
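The five-year grouping behind the pyramid can be illustrated with a small plain-Python sketch. The helper name `age_group` and the sample records are hypothetical; the report itself may have used a pandas binning function instead.

```python
from collections import Counter


def age_group(age: int, width: int = 5) -> str:
    """Return a label such as '35-39' for the band that contains age."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


# Hypothetical (age, gender) pairs standing in for census rows.
people = [(3, "Female"), (37, "Male"), (35, "Female"), (39, "Male"), (71, "Female")]

# Count residents per (age group, gender); these counts give the bar lengths
# on each side of the pyramid.
pyramid = Counter((age_group(age), gender) for age, gender in people)
```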
Figure 1: Population Pyramid of the town residents
Marital Status
The legal age for marriage in the United Kingdom is 18 years (GOV.UK, 2022). Marital status
for people under 18 years is entered as Not Applicable. Minors represent 25% of the total
population of the town. Among those eligible for marriage, the largest group (45%) are single,
38% are married, 12% are divorced and 4.8% are widowed. Approximately 50% of the divorced
people in the town are female, which may suggest that some men leave the town after divorce.
The slightly higher proportion of females is expected given that there are more females than
males in the town.
Figure 2: Marital status by population
Figure 3: Marital Status by Age
Religion
Religion for minors (people under 18 years) is recorded as ‘NA’, meaning not applicable.
Of the remaining residents, 45% reported having no religion, 28% identified as Christian,
14% as Catholic and 9% as Methodist, while other religions together represent 3%.
Figure 4: Religion Distribution
Analysis of religion shows that although Christianity has more followers than any other
religion, the largest share of residents reported having no religion. The chart also shows that
Christians are a growing population in the town. While the construction of a Christian church
might be needed in the future, it is not a high priority now, given that it would not benefit
a sizeable percentage of the population.
Figure 5: Religion by Age
Infirmity
99.2% of the population reported having no infirmity, while only 0.8% reported one of the six
infirmities recorded. The number of people with an infirmity is too small to justify dedicated
investment in this area.
Occupation and Unemployment
To better analyze the occupations of the town's inhabitants, the occupation field was further
classified into six distinct groups. These groups are:
• Child: People who responded with ‘Child’ as occupation. They are between the ages of 0 and 4
years and represent 5.8% of the total population.
• Student: Responses with ‘Student’ as occupation are kept as a group of their own. They are
between the ages of 5 and 18 years and make up 19.5% of the population.
• University Student: These are the university students in the town. PhD students are also
classified as university students. They are between the ages of 19 and 32 years and make up
6.9% of the total population.
• Unemployed: 5.9% of the population reported being unemployed. 71.5% of the unemployed
population are between the ages of 30 and 55 years.
• Employed: 53.5% of the population reported being in some form of employment. This group is
evenly distributed across the working age range (16-64 years).
• Retired: These are people in the town who are retired. They are 65 years and above and
represent 8.2% of the population.
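One way the six-group classification above might be expressed in code is sketched below. The function name and the exact raw labels it matches (for example ‘PhD Student’ or a ‘Retired ...’ prefix) are assumptions about how responses appear in the Occupation column, not details confirmed by the report.

```python
def occupation_group(occupation: str) -> str:
    """Map a raw occupation response to one of the six groups."""
    text = occupation.strip().lower()
    if text == "child":
        return "Child"
    if text == "student":
        return "Student"
    # PhD students are counted together with university students.
    if text in ("university student", "phd student"):
        return "University Student"
    if text == "unemployed":
        return "Unemployed"
    # Assumed label pattern: retired people keep their former job title.
    if text.startswith("retired"):
        return "Retired"
    # Any other response is treated as a job title.
    return "Employed"
```

For instance, `occupation_group("PhD Student")` returns `"University Student"`, and any job title that matches none of the special labels falls through to `"Employed"`.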
Figure 6: Occupation Age Distribution
Unemployment Statistics
5.9% of the population responded as unemployed in the occupation field. This only gives an
overview of unemployment in the town. To get a detailed analysis of the unemployment
situation, we use the percentage unemployment and the unemployment rate. Both measures are
calculated for the town with the formulas below.
\[ \text{Total percentage of unemployed} = \frac{\text{Unemployed individuals}}{\text{Working age population}} \times 100 \] (Lee, 2021)
\[ \text{Unemployment rate} = \frac{\text{Total number unemployed}}{\text{Total number in labour force}} \times 100 \] (Lee, 2021)
5,647 people in the town reported being engaged in some form of employment. 625 individuals
reported being unemployed, and the total working age population in the town is 7,218. Using
the above formulas, the percentage unemployment in the town is 8.7% while the unemployment
rate is 11%. This is more than what is considered a healthy or acceptable unemployment rate
of 4-5% (Kim, 2022). This percentage indicates that investment in skills acquisition and
training should be strongly considered in the town.
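The two measures can be checked against the reported counts with a short sketch. It assumes the labour force is the employed plus the unemployed, which the report does not state explicitly. Under that assumption the rate comes out just under 10%, whereas the 11% quoted is closer to 625 divided by the 5,647 employed.

```python
def percentage(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    return part / whole * 100


employed = 5647      # reported as employed
unemployed = 625     # reported as unemployed
working_age = 7218   # working age population (16-64 years)

# Share of the working age population that is unemployed.
pct_unemployed = percentage(unemployed, working_age)

# Assumption: labour force = employed + unemployed.
labour_force = employed + unemployed
unemployment_rate = percentage(unemployed, labour_force)

print(f"Percentage unemployed: {pct_unemployed:.1f}%")
print(f"Unemployment rate: {unemployment_rate:.1f}%")
```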
More females than males are unemployed in the town, so any investment in skills empowerment
and training for the unemployed should take into account the types of training that are
favourable to females.
Figure 7: Unemployed Age Distribution
Regular Commuters
To assess the type of investment that should be made in the town, it is important to analyze the
number of commuters in the town. Lodgers are people who rent from and live with the homeowner.
447 people indicated that they are lodgers in the town. The lodgers are university students and
people who work in either of the two adjacent cities and who benefit from the low rent in the
town. There are 695 university students (including PhD students) who are not lodgers but reside
with their families or rent a house not shared with a family, and who will also need to commute
to the university. 5,276 people in the town are employed and are not lodgers. These are
permanent residents of the town or people who rent houses not shared with a family. The
occupations of the employed show a high percentage of specialists and professionals living in
the town who will need to travel to the larger cities for work. 25% (1,319) of this population is
assumed to be employed outside the town and to commute. The total number of commuters in the
town is calculated with the formula below.
\[ \text{Total number of regular commuters} = LD + US + OE \]
Where LD = Number of lodgers
US = Number of University Students
OE = Other Employed Commuters
Using the formula above, the number of regular commuters in the town is calculated to be
2,461, which is approximately 23% of the town’s population. This shows that there are enough
commuters to justify the construction of a railway system in the town.
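The commuter estimate can be reproduced from the figures quoted above. The variable names below are illustrative, and the 25% commuting share is the report's own assumption, not a measured value.

```python
lodgers = 447                 # LD: people who reported being lodgers
university_students = 695     # US: university students who are not lodgers
employed_non_lodgers = 5276   # employed residents who are not lodgers

# Assumption taken from the report: a quarter of employed non-lodgers commute.
commuting_share = 0.25

# OE: other employed commuters.
other_employed_commuters = round(employed_non_lodgers * commuting_share)

total_commuters = lodgers + university_students + other_employed_commuters
print(total_commuters)
```

Because the total depends directly on `commuting_share`, changing that assumption (for example to 0.15) is a quick way to see how sensitive the railway recommendation is to it.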
Household occupancy
Analysis of the relationships within households shows that children tend to live with their
parents well into adulthood, and some households have three generations living together. This
could be because the house is big enough to accommodate an extended family, which would
indicate affluence of the household.
Figure 8: Relationship to Head of House by Age
There are 3,546 households in the town, with an average household occupancy of 3 and most
houses having only 2 occupants. This is a modest figure considering that the average number of
bedrooms in a house is 3.2 (Statista, 2018). This shows that there is no pressing need for
investment in housing, although it can be considered for future investment.
Birth Rate and Death Rate
There were 100 new births in the town. This figure is taken from the number of residents who
are less than one year old. The birth and death rates are calculated with the formulas below.
\[ \text{Birth rate} = \frac{\text{Births}}{\text{Population}} \times 1000 \] (Watt-Watchers, 2022)
\[ \text{Death rate} = \frac{\text{Deaths}}{\text{Population}} \times 1000 \] (Watt-Watchers, 2022)
Using the formula above, the birth rate for the town is 9.5 births per 1,000 of the population.
The death rate could not be calculated because the number of deaths was not recorded in the
census. From the population pyramid above, it can be deduced that there will be a slight
increase in population.
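As a rough check of the birth rate, the formula can be written as a small helper. The report does not state the total population, so the value used below is a hypothetical figure back-derived from the quoted 100 births and 9.5 per 1,000, and should be read as an assumption.

```python
def crude_rate(events: int, population: int, per: int = 1000) -> float:
    """Return the number of events per `per` people in the population."""
    return events / population * per


births = 100

# Hypothetical population implied by 100 births at 9.5 per 1,000.
implied_population = round(births / 9.5 * 1000)

birth_rate = crude_rate(births, implied_population)
print(f"{birth_rate:.1f} births per 1,000")
```

The same helper would give the death rate if a count of deaths were available.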
RECOMMENDATIONS AND CONCLUSION
Analysis of the census data showed that the town has a large population of people of working
age who are mostly students and employed individuals. However, there is still a high
unemployment rate of 11%. About 70% of the unemployed are between 30 and 55 years, indicating
that they are most probably individuals with skills that are no longer in high demand. To remedy
this, it is recommended that a skills acquisition centre be built on the unoccupied piece of
land. The centre will train unemployed individuals in skills that are in high demand.
Given the number of lodgers and university students in the town, it is estimated that 23% of the
population will commute regularly to the adjacent cities. Included in this estimate is the 25% of
the employed population that is assumed to commute to work in the larger city. The number of
commuters justifies investment in a rail network because a significant percentage of the
population will benefit from it. This will result in a further increase in the population size
in the near future.
The study is limited by the available data, which does not include the records of deaths needed
to calculate the death rate. Income was also not captured in the data collection. This would
have allowed us to compute disposable income and gauge the affluence of the population, which
is a good metric for understanding the economy of the town. A broader insight would have been
achieved if multi-year population data had been provided to show the progression of the
population metrics.
References
GOV.UK, 2018. Working age population. [Online]
Available at: https://www.ethnicity-facts-figures.service.gov.uk/uk-population-byethnicity/demographics/working-age-population/latest
[Accessed 04 December 2022].
GOV.UK, 2022. GOV.UK. [Online]
Available at: https://www.gov.uk/government/news/implementation-of-the-marriage-and-civilpartnership-minimum-age-act2022#:~:text=The%20Marriage%20and%20Civil%20Partnership%20(Minimum%20Age)%20Act%202022%20received,on%20Monday%2027%20February%202023.&text=The%20Act%20
[Accessed 04 December 2022].
HackettT, I. W. C. a. C., 2022. Pew Research Center. [Online]
Available at: https://www.pewresearch.org/fact-tank/2022/08/31/global-population-skews-malebut-un-projects-parity-between-sexes-by2050/#:~:text=In%202021%2C%20the%20global%20sex,and%20following%20pregnancy%20and%20childbirth.
[Accessed 04 December 2022].
Kim, P., 2022. The unemployment rate: A key health indicator for a country's economy. [Online]
Available at: https://www.businessinsider.com/personal-finance/unemployment-rate?r=US&IR=T
[Accessed 06 December 2022].
Lee, D., 2021. How To Calculate Unemployment Rate (And Why It's Important). [Online]
Available at: https://www.indeed.com/career-advice/career-development/how-is-unemploymentrate-calculated
[Accessed 06 December 2022].
Sharfman, A., 2022. Office for National Statistics. [Online]
Available at: https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/families/articles/familiesandhouseholdsstatisticsexplained/-
[Accessed 04 December 2022].
Statista, 2018. Statista. [Online]
Available at: https://www.statista.com/statistics/-/average-number-bedrooms-new-britishhouses-/
[Accessed 06 December 2022].
Watt-Watchers, 2022. Activity: Population Math. [Online]
Available at: www.watt-watchers.com
[Accessed 06 December 2022].
Wikipedia, 2022. Wikipedia. [Online]
Available at: https://en.wikipedia.org/wiki/Sith
[Accessed 04 December 2022].