SPSS Survey Report
Breast Cancer Analysis 2016
Can I predict breast cancer
DATA MINING PROJECT
Name | Course Title | Date
Contents
Executive Summary
Introduction
Analysis of Data
What analysis procedures/techniques would be used?
Descriptive Statistics
Decision tree
Testing Reliability
Reliability Statistics
Correlation
Multicollinearity
Multiple Linear Regression
Factor Analysis
Cluster Analysis
Conclusion
References
Appendix
Executive Summary
The cancer research data (“BREAST CANCER WISCONSIN (ORIGINAL) DATA SET”) was
sourced from the University of Wisconsin Hospitals. The data was collected from 1988 to 1991 in 8
separate groups, and 700 observations were recorded in the database. Nine indicators were
collected for each record, covering the cytology of cells, the sizes of tumors and the
functionality of the patient’s cells, together with whether the breast tumor was malignant or not.
Starting from a description of the dataset, I checked the data for normality and profiled it.
Using decision tree analysis I then identified the most important indicators of the presence or
absence of a malignant breast tumor. I also conducted a reliability analysis to establish the
credibility and integrity of the data. Using correlation and regression I established the
relationship between the dependent and independent variables. The dependent variable, class,
was recoded into malignant and benign categories, and factor analysis was performed to identify
factors by which patients with benign and malignant tumors could be predicted. Cluster analysis
was then performed to reduce the observations to salient groups of malignant and benign tumor
patients. The regression equation was able to explain about 85% of the variance within the
sample. Further, the regression equation may be able to predict the presence of malignant or
benign tumors using data for the identified indicators.
Analysis of Data
The data will be analyzed using SPSS v22. For the purposes of this project I will undertake
description, testing of means, classification, regression, factor analysis and cluster analysis.
I will add labels to the dataset to make it more readable. There are no missing values, so the
dataset does not require missing-value handling. Next I will perform some basic descriptive and
frequency statistical tests to get a better understanding of the dataset. These will be followed by
correlation, regression, factor and cluster analysis to answer the research question: What is the
likelihood of predicting breast cancer based on examination of basic indicators? I will also
perform tree analysis and reliability analysis for the data set.
What analysis procedures/techniques would be used?
The data is already clean; however, labels need to be assigned so that the results are more
readable. Data exploration techniques will be used to highlight important aspects of the data.
Descriptive analysis will be used to characterize the variables in terms of their mean, minimum,
maximum, standard deviation, skewness and kurtosis values. The kurtosis and skewness values
show the degree of normality, and histograms with a superimposed normal curve show clearly
which variables are normally distributed. Decision tree analysis will be used to assess how
accurately the indicators classify the cases, and reliability analysis will be used to assess
reliability, which is the ability of the questionnaire to consistently measure the topic being
studied at different times and across different populations. Since in this case the observations
were taken at different times in 8 groups, assessing reliability is important for the credibility of
the analysis.
Correlation and then regression will be used to determine the relationship between the dependent
and independent variables. Principal component analysis and factor analysis will then be used
to reduce the variables to a smaller set of pertinent factors that may explain the linkage between
the onset of cancer and the indicators. Cluster analysis will be used to identify homogeneous
groupings within the dataset, and discriminant analysis may then be run to assess the goodness of
fit of the model that the cluster analysis found and thus profile the clusters.
Descriptive Statistics
                                                              Skewness               Kurtosis
Variable                     Mean   Std. Deviation  Variance  Statistic  Std. Error  Statistic  Std. Error
Clump Thickness              4.42   2.816           7.928     .593       .092        -.624      .185
Uniformity of Cell Size      3.13   3.051           9.311     1.233      .092        .099       .185
Uniformity of Cell Shape     3.21   2.972           8.832     1.162      .092        .007       .185
Marginal Adhesion            2.81   2.855           8.153     1.524      .092        .988       .185
Single Epithelial Cell Size  3.22   2.214           4.903     1.712      .092        2.169      .185
Bare Nuclei                  3.46   3.641           13.255    1.016      .092        -.733      .185
Bland Chromatin              3.44   2.438           5.946     1.100      .092        .185       .185
Normal Nucleoli              2.87   3.054           9.325     1.422      .092        .474       .185
Mitoses                      1.59   1.715           2.941     3.561      .092        12.658     .185
Valid N (listwise)
The table shows the basic descriptive statistics for all independent variables. Clump thickness
has the highest mean value. The kurtosis and skewness values show the degree of normality.
Skewness measures the asymmetry of a distribution, and kurtosis measures how peaked or flat it
is relative to the superimposed normal curve. Negative kurtosis values indicate a distribution
flatter than the normal curve and positive values indicate a more peaked one (McCormick,
Salcedo and Poh, 2015).
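The standard errors in the table are the same for every variable because they depend only on the sample size. The snippet below is a minimal sketch, assuming the commonly used formulas for the standard error of sample skewness and kurtosis (that SPSS applies exactly these formulas is an assumption here) and the 700 observations stated in the Executive Summary. The function names are my own. Dividing a statistic by its standard error gives a rough z-score, and a common rule of thumb treats an absolute value above 1.96 as a significant departure from normality.

```python
import math


def skewness_std_error(n: int) -> float:
    """Standard error of sample skewness for a sample of size n."""
    return math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


def kurtosis_std_error(n: int) -> float:
    """Standard error of sample (excess) kurtosis for a sample of size n."""
    ses = skewness_std_error(n)
    return 2.0 * ses * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))


def z_score(statistic: float, std_error: float) -> float:
    """Ratio of a statistic to its standard error."""
    return statistic / std_error


n = 700  # sample size as stated in the report
ses = skewness_std_error(n)
sek = kurtosis_std_error(n)

# Mitoses row of the table: skewness 3.561, kurtosis 12.658
mitoses_skew_z = z_score(3.561, ses)
mitoses_kurt_z = z_score(12.658, sek)

print(f"SE skewness = {ses:.3f}, SE kurtosis = {sek:.3f}")
print(f"Mitoses z (skewness) = {mitoses_skew_z:.1f}, z (kurtosis) = {mitoses_kurt_z:.1f}")
```

On this reading, Mitoses, with the largest skewness and kurtosis in the table, is the variable furthest from a normal distribution.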
Two variables are shown with a superimposed normal curve to illustrate the distribution. In almost
all variables there are outliers that fall outside the normal curve. Further, looking at the class
variable, benign cases outnumber malignant cases by roughly 2:1, as seen in the pie chart below.
For the decision tree, nine independent variables were specified as inputs, but only two were
included in the final model. The variables Uniformity of Cell Shape, Marginal Adhesion,
Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli and Mitoses did not
make a significant contribution to the model, so they were automatically dropped from the final
model. The decision tree selected Clump Thickness and Uniformity of Cell Size.
The tree diagram is a graphic representation of the tree model. This tree diagram shows that:
1. Using the CHAID method, uniformity of cell size is the best predictor of a subject’s
classification.
2. For subjects with a uniformity of cell size value of 1, the next best predictor is clump
thickness.
3. Within clump thickness, the malignant nodes are predicted for clump thickness
values of 6, 4, 9 and 10.
4. The model’s p values are significant (p < 0.05).
5. The model’s risk of 0.073 indicates that the classification predicted by the model (benign or
malignant) is wrong for only 7.3% of the cases.
6. The results in the classification table are consistent with the risk estimate. The table
confirms that the model classifies 92.7% of the patients correctly.
The model has one potential drawback. For patients with benign tumors, the model predicts
benign for 96.9% of them, i.e. a false positive rate of 3.1%. For patients with malignant
tumors, however, the prediction percentage is only 84.6%, which means it gives a false negative
for 15.4% of the patients (“IBM SPSS Decision Trees 21”, 2012).
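The risk estimate and the per-class percentages quoted above all come from the same classification table. As a minimal sketch of that arithmetic, the snippet below derives them from a 2x2 confusion matrix. The cell counts are hypothetical: they are not taken from the report and were chosen only so that the resulting percentages roughly match the reported ones. The function and key names are my own.

```python
def classification_summary(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Summarize a 2x2 confusion matrix where 'positive' means malignant."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    return {
        "accuracy": accuracy,
        "risk": 1.0 - accuracy,                 # share of misclassified cases
        "specificity": tn / (tn + fp),          # benign correctly predicted
        "false_positive_rate": fp / (tn + fp),  # benign predicted malignant
        "sensitivity": tp / (tp + fn),          # malignant correctly predicted
        "false_negative_rate": fn / (tp + fn),  # malignant predicted benign
    }


# Hypothetical counts (assumption, not from the report's classification table)
summary = classification_summary(tn=444, fp=14, fn=37, tp=204)

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

The false negative rate is simply one minus the share of malignant cases predicted correctly, which is why 84.6% correct corresponds to about 15.4% missed.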
Testing Reliability
None of the analysis and predictions can be relied upon unless I can verify the reliability of the
survey questionnaire. Reliability can be explained as the ability of the questionnaire to
consistently measure the topic, in our case the incidence of breast cancer, at different times and
across different populations. Since the data for the survey was collected over an extended period
and in 8 groups, verification of reliability is essential. I will use Cronbach’s Alpha, which
takes into account the number of items and their inter-item correlations; high correlations
between the different items indicate that they are measuring the same thing. The value
of Cronbach’s alpha should be higher than 0.75 for the questionnaire to be considered reliable,
while a value above 0.8 indicates high reliability.
The Corrected Item-Total Correlation shows the relationship between the overall total score and
the responses on individual questions. A reliable question should have a value greater than 0.3,
and all of our variables satisfy this criterion (Hinton et al. 2004). None of the variables, if
removed, would improve Cronbach’s alpha significantly.
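To make the reliability measures concrete, the following is a minimal sketch of how Cronbach's alpha, the alpha-if-item-deleted values and the corrected item-total correlation can be computed. It uses the textbook variance-based formula for alpha, not SPSS itself. The three items and five cases are hypothetical scores invented for illustration, and the function names are my own.

```python
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def cronbach_alpha(items):
    """items: list of equal-length score lists, one list per item."""
    k = len(items)
    totals = [sum(case) for case in zip(*items)]
    item_variance_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


def corrected_item_total(items, index):
    """Correlation of one item with the sum of all the other items."""
    rest = [sum(v for j, v in enumerate(case) if j != index) for case in zip(*items)]
    return pearson(items[index], rest)


# Hypothetical scores on a 1-10 scale: three items, five cases (assumption)
items = [
    [1, 3, 5, 8, 10],
    [1, 2, 6, 7, 10],
    [2, 3, 4, 9, 9],
]

alpha = cronbach_alpha(items)
alpha_if_deleted = [cronbach_alpha(items[:i] + items[i + 1:]) for i in range(len(items))]
item_total = [corrected_item_total(items, i) for i in range(len(items))]

print(f"alpha = {alpha:.3f}")
print("alpha if item deleted:", [round(a, 3) for a in alpha_if_deleted])
print("corrected item-total:", [round(r, 3) for r in item_total])
```

An item is a candidate for removal when its corrected item-total correlation falls below 0.3 or when deleting it raises alpha noticeably.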
Item-Total Statistics
                             Scale Mean  Scale Variance  Corrected    Squared      Cronbach's
                             if Item     if Item         Item-Total   Multiple     Alpha if
Item                         Deleted     Deleted         Correlation  Correlation  Item Deleted
Clump Thickness              26.41       362.165         .677         .532         .928
Uniformity of Cell Size      27.70       333.865         .892         .863         .915
Uniformity of Cell Shape     27.62       337.264         .884         .848         .916
Marginal Adhesion            28.02       353.673         .752         .595         .924
Single Epithelial Cell Size  27.61       372.945         .758         .608         .924
Bare Nuclei                  27.37       329.020         .760         .695         .926
Bland Chromatin              27.39       362.949         .794         .660         .922
Normal Nucleoli              27.96       347.285         .756         .603         .924
Mitoses                      29.24       406.286         .483         .280         .935
Class                        28.14       407.167         .900         .839         .930
The overall standardized alpha value of 0.943 indicates high reliability of the questions.
Reliability Statistics
Cronbach's Alpha   Cronbach's Alpha Based on Standardized Items   N of Items
.932               .943                                           10
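The standardized alpha can be related back to the inter-item correlations. As a minimal sketch, assuming the usual formula for standardized alpha in terms of the number of items k and the mean inter-item correlation, the snippet below backs out the mean correlation implied by the reported value of .943 with 10 items. The implied figure is derived from the formula, not read from the SPSS output, and the function names are my own.

```python
def standardized_alpha(k: int, mean_r: float) -> float:
    """Standardized alpha from the number of items and mean inter-item correlation."""
    return k * mean_r / (1 + (k - 1) * mean_r)


def implied_mean_correlation(k: int, alpha_std: float) -> float:
    """Invert standardized_alpha: mean inter-item correlation implied by alpha."""
    return alpha_std / (k - (k - 1) * alpha_std)


k = 10             # N of Items from the Reliability Statistics table
alpha_std = 0.943  # Cronbach's Alpha Based on Standardized Items

mean_r = implied_mean_correlation(k, alpha_std)
print(f"implied mean inter-item correlation = {mean_r:.3f}")
print(f"round trip alpha = {standardized_alpha(k, mean_r):.3f}")
```

A mean inter-item correlation of roughly 0.62 is in line with the strong correlations reported in the Correlation section.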
Correlation
Next I checked the correlations amongst the variables. The enclosed table shows the inter-item
correlations. The cells highlighted in yellow indicate correlations of 0.7 or more, and
correlations of 0.8 or higher are highlighted in green. Please note that all correlations are
significant at the 95% confidence level. A positive correlation indicates that the two variables
move together, i.e. an increase or decrease in one variable is accompanied by a
corresponding increase or decrease in the other variable. In the case of a negative
correlation, on the other hand, an increase in the value of one variable is accompanied by a
decrease in the value of the other variable, and vice versa (Greasly, 2008).
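The highlighting rule described above can be sketched as follows. This is a minimal illustration, not the SPSS procedure: it computes Pearson correlations for every pair of variables and flags them with the 0.7 and 0.8 thresholds from the text. The variable names and scores are hypothetical values made up for the example.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def highlight(r: float) -> str:
    """Colour used in the report: green for r >= 0.8, yellow for r >= 0.7."""
    if r >= 0.8:
        return "green"
    if r >= 0.7:
        return "yellow"
    return ""


# Hypothetical scores for six cases (assumption, not the report's data)
scores = {
    "cell_size": [1, 1, 2, 4, 8, 10],
    "cell_shape": [1, 2, 2, 5, 7, 10],
    "mitoses": [1, 1, 1, 2, 1, 3],
}

flags = {}
for a, b in combinations(scores, 2):
    r = pearson(scores[a], scores[b])
    flags[(a, b)] = highlight(r)
    print(f"{a} vs {b}: r = {r:.3f} {flags[(a, b)]}")
```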
The strength of a correlation is interpreted using the table provided by Greasly (2008).
Hence in our case the strength of the relationships is very strongly positive, as seen in the
following table, where significant correlations are marked with asterisks.