NCI’s Breast Cancer Analysis and Visualization Using Python
Breast cancer is the most common type of cancer among women worldwide. Although men can
also be affected by this type of cancer, it is relatively uncommon. To sharpen my Python skills, I
decided to use Python to analyze a breast cancer dataset.
Flow Chart
Data Collection
In this report, I analyzed a dataset obtained from Kaggle (here) that contains information on female
patients with cancer between 2006 and 2010. The dataset was obtained from the November 2017
update of the SEER Program of the National Cancer Institute. According to the source of the data,
patients with unknown tumor size, unknown examined regional LNs, or unknown positive regional LNs,
and patients whose survival was less than 1 month, were removed from the dataset.
Data Wrangling
Data wrangling, or cleaning, is the first thing that needs to be done when working on a dataset. (See
how I thoroughly cleaned this FIFA dataset here and here.) After importing the necessary libraries,
I first checked the data quality;
While profiling the data, I saw that a column name was misspelled. It was corrected;
Null values were checked;
Since there are no null values in the dataset, I moved on to check for duplicates;
I found one duplicated row and dropped it;
I changed the datatype of some columns;
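The cleaning steps above can be sketched roughly as follows. This is a minimal illustration on a tiny made-up DataFrame, not the code used in the report: the column names, the misspelling `Rce`, and the values are all assumptions chosen only to make the example runnable.

```python
import pandas as pd

# Hypothetical miniature of the dataset; names and values are illustrative only.
df = pd.DataFrame({
    "Age": ["43", "47", "67", "47"],
    "Rce": ["White", "Black", "White", "Black"],  # deliberately misspelled column name
    "Survival Months": [60, 62, 14, 62],
    "Status": ["Alive", "Alive", "Dead", "Alive"],
})

# Correct the misspelled column name.
df = df.rename(columns={"Rce": "Race"})

# Count missing values in each column.
null_counts = df.isnull().sum()

# Count fully duplicated rows, then drop them.
n_duplicates = int(df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)

# Change the datatype of some columns.
df["Age"] = df["Age"].astype(int)
df["Status"] = df["Status"].astype("category")

print(null_counts)
print(n_duplicates, len(df))
```

On the real data, the same calls would be applied to the DataFrame loaded from the Kaggle file.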
A function (defined with def) was used to create a new column that shows the age category of each
patient, after using df['Age'].min() and df['Age'].max() to check the minimum and maximum ages of
the patients respectively;
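A function of this kind might look like the sketch below. The function name `age_category`, the label wording, and the decade boundaries are assumptions for illustration; the report does not show the exact categories that were used.

```python
# Hypothetical labels for each decade; the report's actual category names may differ.
AGE_LABELS = {30: "Thirties", 40: "Forties", 50: "Fifties", 60: "Sixties"}


def age_category(age):
    """Return the decade label for an age, e.g. 43 -> "Forties"."""
    decade = (age // 10) * 10  # 43 -> 40, 69 -> 60
    return AGE_LABELS.get(decade, "Other")


ages = [30, 43, 58, 69]  # illustrative ages only
categories = [age_category(a) for a in ages]
print(categories)
```

With pandas, the new column could then be built with something like `df['Age Category'] = df['Age'].apply(age_category)`, where the column name `Age Category` is again an assumption.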
A new column called year was created to convert the survival months to years by dividing the
months by 12 and rounding to a whole number. Check out the concluding part of the
cleaning process on my GitHub.
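The month-to-year conversion can be sketched as below. The helper name `months_to_years` and the sample values are hypothetical, and rounding to the nearest whole number is an assumption, since the report does not say which rounding rule was applied.

```python
def months_to_years(months):
    """Convert survival months to whole years by dividing by 12 and rounding."""
    # Python's round() goes to the nearest integer; exact halves go to the even one.
    return round(months / 12)


survival_months = [1, 14, 60, 100]  # illustrative values only
years = [months_to_years(m) for m in survival_months]
print(years)
```

If truncation is preferred over rounding, `months // 12` would be the alternative.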
Exploratory Data Analysis
After cleaning and preparing the dataset, I delved into the data and posed various questions to
acquire a deeper understanding of breast cancer patients. To get proper insight, dead and living
patients were analyzed separately. The questions asked are:
1. What is the total number of patients?
2. How many patients are dead?
3. How many patients are alive?
4. What are the ages of the youngest and oldest patients?
5. Which age categories have the highest and lowest number of cases?
6. Which racial demographic does the data show to be most affected by breast cancer?
7. How big are the tumors in the patients based on their age range?
8. Are there more progesterone- and estrogen-positive patients than negative ones?
9. What is the marital status of these patients, and how long (survival period) do patients with
spouses live compared to patients without spouses?
10. Given the stage the cancer has reached, how likely is the patient to respond to treatment?
11. Has the cancer metastasized to other parts of the body? How far has it spread?
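Several of these questions map onto one-line pandas operations. The sketch below uses a small made-up DataFrame; the column names and every value in it are assumptions for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical rows; not real patient data.
df = pd.DataFrame({
    "Age": [34, 45, 52, 58, 61, 67],
    "Age Category": ["Thirties", "Forties", "Fifties", "Fifties", "Sixties", "Sixties"],
    "Marital Status": ["Single", "Married", "Married", "Divorced", "Widowed", "Married"],
    "Survival Months": [40, 72, 65, 30, 12, 80],
    "Status": ["Alive", "Alive", "Alive", "Dead", "Dead", "Alive"],
})

# Questions 1-3: total, dead, and living patients.
total_patients = len(df)
status_counts = df["Status"].value_counts()

# Question 4: youngest and oldest ages.
youngest, oldest = df["Age"].min(), df["Age"].max()

# Question 5: number of cases per age category.
cases_per_category = df["Age Category"].value_counts()

# Question 9: average survival period by marital status.
mean_survival_by_marital = df.groupby("Marital Status")["Survival Months"].mean()

# Dead and living patients can be analyzed separately by filtering on status.
alive = df[df["Status"] == "Alive"]
dead = df[df["Status"] == "Dead"]

print(total_patients, youngest, oldest)
print(status_counts)
print(mean_survival_by_marital)
```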
Data Visualization With Python
Using several charts, I got answers to all my exploratory questions and combined the visuals into
a dashboard. Check out the Python code I wrote to create this dashboard here. The dashboard is
shown below:
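As a rough idea of how one panel of such a dashboard could be drawn, the sketch below plots the alive and dead counts reported in the Insights section as a bar chart. The choice of matplotlib, the colors, and the titles are assumptions; the actual dashboard code is the one linked above.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Counts taken from the Insights section of this report.
labels = ["Alive", "Dead"]
counts = [3407, 616]

fig, ax = plt.subplots(figsize=(5, 4))
bars = ax.bar(labels, counts, color=["tab:green", "tab:red"])
ax.set_title("Patients by survival status")
ax.set_ylabel("Number of patients")
fig.tight_layout()
```

Calling `fig.savefig("status_counts.png")` afterwards would write the chart to a file; the file name is only an example.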
Insights
A total of 4,023 patients were analyzed.
3,407 of these patients are currently alive while 616 are dead.
The oldest patient is 69 and the youngest is 30.
1,390 patients are in their fifties, and 1,201 of them are alive. They are closely
followed by patients in their sixties, numbering 1,279, of whom 1,035 are alive and 244 are
dead. This shows the effect of increasing age on the mortality rate of breast cancer, as
more of the patients in their sixties are dead (244, compared with 189 of those in their
fifties). There is also more severe associated morbidity in patients in the higher age group.
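The comparison between the two age groups is easier to see as a share of each group. The short calculation below uses only the counts quoted above; it is a descriptive comparison, not a statistical test.

```python
# Counts quoted in this report for the two largest age groups.
groups = {
    "Fifties": {"total": 1390, "alive": 1201},
    "Sixties": {"total": 1279, "alive": 1035},
}

mortality = {}
for name, g in groups.items():
    dead = g["total"] - g["alive"]  # patients not alive in this group
    mortality[name] = round(dead / g["total"] * 100, 1)  # percentage of the group

print(mortality)  # {'Fifties': 13.6, 'Sixties': 19.1}
```

So about 13.6% of patients in their fifties are dead, compared with about 19.1% of patients in their sixties.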
More patients in their forties have breast cancer than patients in their thirties, because
prevalence rises with advancing age.
3,412 patients are white. This could be a result of where the data was collected; it
could be a predominantly white region.
The tumor size ranges from 10 to 140 mm. However, patients in their thirties have larger
tumors than patients in the other age categories. This may be due to
exposure of the tumors to higher estrogen/progesterone levels.
There is a large number of patients with positive estrogen and progesterone status. In the
index study, breast cancer size has a direct positive correlation with higher progesterone
and estrogen levels. Hormone-receptor-positive breast cancer relies on estrogen and
progesterone to thrive, so treatment may be targeted at lowering the progesterone/estrogen
levels in the patients.
The survival of patients with breast cancer ranges from 1 month to 8 years.
Separated patients have a survival period of less than 4 years, while single and
widowed patients mostly die after a year. Thus, having a partner correlates positively with
survival in patients with breast cancer. This may be due to the support system that
partners provide.
60% of these patients have moderately differentiated cancer cells, meaning that the size
and shape of the cancer cells under the microscope moderately resemble the normal cells
of a healthy breast. This is encouraging because, if these patients are well taken
care of and given adequate treatment, there is a possibility of recovery over time,
with a reduction in metastasis and tumor recurrence. Likewise, 14.8% of
patients have well-differentiated cancer cells, which means the cancer cells grow slowly and
carry a better prognosis. However, 24.9% and 0.3% of patients have poorly
differentiated and undifferentiated histological cancer variants respectively. These
patients have microscopically disorganized cells with abnormal size and morphology.
They also exhibit increased mitotic activity (rapid cell division) and have a poorer
prognosis, with a higher propensity for early metastasis and recurrence.
In all patients, whether alive or deceased, it is observed that the cancer cells might have
spread to various regions of the body. However, a significant proportion of these patients
(98.3% in living patients and 94.3% in dead patients) have localized cancer cells around
the breast and its surrounding areas. Only a small number of patients (1.7% in living
patients and 5.7% in dead patients) have distant metastasis.
Recommendations
Early detection is the best way to tackle the menace of breast cancer as early treatment can
be instituted before the cancer can spread to local and distant sites. I also strongly believe
that from age 30, periodic routine breast check-ups should be mandated for everybody,
especially for people with a family history of breast cancer. Breast cancer awareness
programs should also be held regularly.
For patients with breast cancer, a support plan or system involving friends and family
should be made available. Where patients have no one close to them,
the government or hospital should provide a paid care provider, especially for
single, widowed, and separated patients.
Treatment of breast cancer is multifactorial and depends on the histological type, the
age of the patient, the stage of the cancer, and whether the cancer is sensitive to hormones.
Hormone therapy is recommended for patients with positive progesterone and estrogen status.
Of course, this is determined by the stage of the cancer and the overall health of the patient.
Clinical trials, surgery, or systemic therapies are all recommended for patients with
different stages of differentiated breast cancer cells.
Immunotherapy and/or chemotherapy, or any other type of therapy advised by the doctor, is
recommended to reduce and stop the metastasis of these cancer cells to other parts of the
body. Also, adopting a healthy lifestyle can contribute to reducing the risk of cancer
spread. This lifestyle includes, but is not limited to, regular exercise, maintaining a
balanced diet, avoiding tobacco and excessive alcohol consumption, and managing stress.
In conclusion, it is advisable not to rely on home remedies and self-medication. Patients are
advised to visit facilities where they can access multidisciplinary care.