Shariq Anis | Freelancer Spss Survey Report

SPSS Survey Report

Breast Cancer Analysis 2016

Can I predict breast cancer

DATA MINING PROJECT

Name | Course Title | Date

Contents

Executive Summary
Introduction
Analysis of Data
What analysis procedures/techniques would be used?
Descriptive Statistics
Decision tree
Testing Reliability
Reliability Statistics
Correlation
Multicollinearity
Multiple Linear Regression
Factor Analysis
Cluster Analysis
Conclusion
References
Appendix

Executive Summary

The cancer research data (“BREAST CANCER WISCONSIN (ORIGINAL) DATA SET”) was sourced from the University of Wisconsin Hospitals. The data was collected from 1988 to 1991 in 8 separate groups. There were 700 observations recorded in the database. For each record, 9 separate indicators were collected, covering the cytology of cells, the sizes of tumors and the functionality of the patient’s cells, along with whether the breast tumor was malignant or not.

Starting from a description of the dataset, the data was analyzed for normality. The data was profiled, and decision tree analysis was then used to find the most important indicators that affect the presence or absence of a malignant breast tumor. I also conducted reliability analysis to establish the credence and integrity of the data for analysis. Using correlation and regression I was able to establish the relationship between the dependent and independent variables. The dependent variable, class, was recoded into malignant and benign sections, and factor analysis was performed to identify factors by which patients with benign and malignant tumors could be predicted. Later, cluster analysis was performed to reduce the observations into salient observations for malignant and benign tumor patients.
The regression equation accounted for about 85% of the variance within the sample. Further, the regression equation may be able to predict the presence of malignant or benign tumors using data for the identified indicators.

Analysis of Data

The data will be analyzed using SPSS v22. For the purposes of this project I will undertake description, testing of means, classification, regression, factor and cluster analysis. I will add labels to the dataset to make it more readable. There are no missing values, so the dataset does not require missing value handling. Next I will perform some basic descriptive and frequency statistical tests to get a better understanding of the dataset. These will be followed by correlation, regression, factor and cluster analysis to answer the research question: What is the likelihood of predicting breast cancer based on examination of basic indicators? I will also perform tree analysis and reliability analysis for the data set.

What analysis procedures/techniques would be used?

The data is already clean; however, labels need to be assigned so that the results are more readable. Data exploration techniques will be used to highlight important aspects of the data. Descriptive analysis will be used to describe the variables in terms of their mean, minimum, maximum, standard deviation, skewness and kurtosis values. Kurtosis and skewness values show the degree of normality, and drawing a histogram with the normal curve superimposed shows clearly which variables are normal. Decision tree analysis will be used to identify the accuracy of the dataset, and reliability analysis will be used to test reliability, which is the ability of the questionnaire to consistently measure the topic being studied at different times and across different populations. Since in this case the observations were taken at different times in 8 groups, assessing reliability is important for the credence of the analysis.
Correlation and then regression will be used to determine the relationship between the dependent and independent variables, and then principal component analysis and factor analysis will be used to reduce the variables into more pertinent factors that may explain the linkage between the onset of cancer and the indicators. Cluster analysis will be used to identify homogeneous groupings within the dataset, and discriminant analysis may then be run to assess the goodness of fit of the model that the cluster analysis found and thus profile the clusters.

Descriptive Statistics

Descriptive Statistics

                             Mean   Std. Deviation  Variance  Skewness               Kurtosis
                                                              Statistic  Std. Error  Statistic  Std. Error
Clump Thickness              4.42   2.816            7.928     .593      .092         -.624     .185
Uniformity of Cell Size      3.13   3.051            9.311    1.233      .092          .099     .185
Uniformity of Cell Shape     3.21   2.972            8.832    1.162      .092          .007     .185
Marginal Adhesion            2.81   2.855            8.153    1.524      .092          .988     .185
Single Epithelial Cell Size  3.22   2.214            4.903    1.712      .092         2.169     .185
Bare Nuclei                  3.46   3.641           13.255    1.016      .092         -.733     .185
Bland Chromatin              3.44   2.438            5.946    1.100      .092          .185     .185
Normal Nucleoli              2.87   3.054            9.325    1.422      .092          .474     .185
Mitoses                      1.59   1.715            2.941    3.561      .092        12.658     .185
Valid N (listwise)

The table shows the basic descriptive statistics for all independent variables. Clump thickness has the highest mean value. The kurtosis and skewness values show the degree of normality. Skewness represents the symmetry of the distribution and kurtosis its degree of flatness relative to the superimposed normal curve. Negative kurtosis values indicate a distribution flatter than the normal curve, and positive values indicate a distribution more peaked than normal (McCormick, Salcedo and Poh, 2015).

Two variables are shown with a superimposed normal curve to show the distribution. In almost all variables there are outliers that breach the normal curve.
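The skewness, kurtosis and standard error columns of the descriptive statistics table can be illustrated with a minimal Python sketch. It assumes the sample-adjusted skewness and excess kurtosis formulas that statistical packages commonly report, and it takes n = 700 from the observation count stated in the summary. The function names and the toy scores are hypothetical and are not part of the SPSS output.

```python
import math


def _central_moment(xs, order):
    """Population central moment of the given order."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** order for x in xs) / n


def skewness(xs):
    """Sample-adjusted skewness (assumed to correspond to the Skewness column)."""
    n = len(xs)
    g1 = _central_moment(xs, 3) / _central_moment(xs, 2) ** 1.5
    return g1 * math.sqrt(n * (n - 1)) / (n - 2)


def excess_kurtosis(xs):
    """Sample-adjusted excess kurtosis (0 for a normal distribution)."""
    n = len(xs)
    g2 = _central_moment(xs, 4) / _central_moment(xs, 2) ** 2 - 3.0
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6.0)


def se_skewness(n):
    """Standard error of skewness; depends only on the sample size."""
    return math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


def se_kurtosis(n):
    """Standard error of excess kurtosis; depends only on the sample size."""
    return 2.0 * se_skewness(n) * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))


# Assumption: n = 700 observations, as stated in the executive summary.
n = 700
print(round(se_skewness(n), 3), round(se_kurtosis(n), 3))

# Hypothetical right-skewed scores on a 1 to 10 scale, for illustration only.
toy_scores = [1, 1, 1, 2, 2, 3, 5, 8, 10]
print(skewness(toy_scores), excess_kurtosis(toy_scores))
```

Under these assumptions the two standard errors depend only on the sample size, which is why every row of the table carries the same values (.092 and .185). A positive skewness, as in every row of the table, indicates a longer right tail.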
Further, looking at the class variable, the distribution of benign to malignant cases is 1:3, as seen in the pie chart below.

Decision tree

For the decision tree, nine independent variables were specified as inputs, but only two were included in the final model. The variables for Uniformity of Cell Shape, Marginal Adhesion, Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli and Mitoses did not make a significant contribution to the model, so they were automatically dropped from the final model. The decision tree selected clump thickness and uniformity of cell size.

The tree diagram is a graphic representation of the tree model. This tree diagram shows that:

1. Using the CHAID method, uniformity of cell size is the best predictor of the subject’s classification.
2. For subjects with uniformity of cell size at value 1, the next best predictor is clump thickness.
3. In the case of clump thickness, malignant nodes can be predicted for clump thickness values of 6, 4, 9 and 10.
4. The model’s p values are significant, as p < 0.05.
5. The model’s risk of 0.073 indicates that the classification predicted by the model (benign or malignant) is wrong for only 7.3% of the cases.
6. The results in the classification table are consistent with the estimates in the risk table. The table confirms that the model classifies 92.7% of the patients correctly.

The model has one potential drawback: its accuracy differs between the two classes. For patients with benign tumors, the model predicts benign for 96.9% of them, i.e. a false positive for 3.1%. For patients with malignant tumors the prediction percentage is only 84.6%, which means it gives a false negative for 15.4% of the patients (“IBM SPSS Decision Trees 21”, 2012).

Testing Reliability

None of our analysis and predictions will be accurate unless I can verify the reliability of the survey questionnaire.
Reliability can be explained as the ability of the questionnaire to consistently measure the topic, in our case the incidence of breast cancer, at different times and across different populations. Since the data for the survey was collected over an extended period and in 8 groups, verification of reliability is essential. I will use Cronbach’s alpha, which is based on the number of items and their inter-item correlations; high correlations between the different items indicate that they are measuring the same thing. The value of Cronbach’s alpha should be higher than 0.75 for the questionnaire to be considered reliable, while a value above 0.8 indicates high reliability. The Corrected Item-Total Correlation shows the relationship between the overall total score and the responses on individual questions. A reliable question should have a value greater than 0.3, and all of our variables satisfy this condition (Hinton et al. 2004). None of the variables, if removed, would improve Cronbach’s alpha significantly.

Item-Total Statistics

                             Scale Mean  Scale Variance  Corrected    Squared      Cronbach's
                             if Item     if Item         Item-Total   Multiple     Alpha if Item
                             Deleted     Deleted         Correlation  Correlation  Deleted
Clump Thickness              26.41       362.165         .677         .532         .928
Uniformity of Cell Size      27.70       333.865         .892         .863         .915
Uniformity of Cell Shape     27.62       337.264         .884         .848         .916
Marginal Adhesion            28.02       353.673         .752         .595         .924
Single Epithelial Cell Size  27.61       372.945         .758         .608         .924
Bare Nuclei                  27.37       329.020         .760         .695         .926
Bland Chromatin              27.39       362.949         .794         .660         .922
Normal Nucleoli              27.96       347.285         .756         .603         .924
Mitoses                      29.24       406.286         .483         .280         .935
Class:                       28.14       407.167         .900         .839         .930

The overall standardized alpha value of 0.943 indicates high reliability of the questions.

Reliability Statistics

Cronbach's Alpha  Cronbach's Alpha Based on Standardized Items  N of Items
.932              .943                                          10

Correlation

Next I checked the correlations amongst the variables. The enclosed table shows the inter-item correlations.
The cells highlighted in yellow indicate correlations of 0.7 and more. Correlations of 0.8 and higher are highlighted in green. Please note that all correlations are significant at the 95% confidence level. A positive correlation indicates that the two variables move together, i.e. an increase or decrease in one variable is accompanied by a corresponding increase or decrease in the other variable. In the case of a negative correlation, on the other hand, an increase in the value of one variable is accompanied by a decrease in the value of the other variable and vice versa (Greasly, 2008). The strength of a correlation is judged against the table provided by Greasly (2008). In our case the relationships are very strongly positive, as seen in the correlation table, where significant correlations are marked with an asterisk.
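The reliability and correlation measures discussed above can be sketched in Python as follows. This is an illustrative sketch rather than the SPSS procedure itself: the function names and the toy item scores are hypothetical, and the formulas are the standard definitions of Pearson's r, Cronbach's alpha and the standardized alpha based on the mean inter-item correlation.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def sample_variance(x):
    """Unbiased sample variance (divides by n - 1)."""
    n = len(x)
    m = sum(x) / n
    return sum((a - m) ** 2 for a in x) / (n - 1)


def cronbach_alpha(items):
    """Raw Cronbach's alpha; `items` is a list of columns, one list per item."""
    k = len(items)
    totals = [sum(row) for row in zip(*items)]  # total score per observation
    item_variance = sum(sample_variance(col) for col in items)
    return k / (k - 1) * (1 - item_variance / sample_variance(totals))


def standardized_alpha(items):
    """Alpha based on standardized items, via the mean inter-item correlation."""
    k = len(items)
    rs = [pearson(a, b) for a, b in combinations(items, 2)]
    r_bar = sum(rs) / len(rs)
    return k * r_bar / (1 + (k - 1) * r_bar)


def mean_r_from_standardized_alpha(alpha, k):
    """Invert the standardized alpha formula to recover the mean correlation."""
    return alpha / (k - alpha * (k - 1))


# Hypothetical toy scores for three items, for illustration only.
toy_items = [
    [1, 2, 3, 4, 5],
    [2, 1, 4, 3, 5],
    [1, 3, 2, 5, 4],
]
print(cronbach_alpha(toy_items), standardized_alpha(toy_items))

# Values reported in the Reliability Statistics table: .943 with 10 items.
print(mean_r_from_standardized_alpha(0.943, 10))
```

Inverting the standardized alpha formula for the reported values (.943 with 10 items) gives a mean inter-item correlation of roughly 0.62, which fits the strongly positive correlations described above.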