Tony Nyumba Apindi

Support Vector Machines Project

APPLICATION OF SUPPORT VECTOR MACHINES IN IDENTIFYING FACTORS RELATED TO CD4 CELL COUNT LEVELS AMONG HIV PATIENTS

A RESEARCH PROJECT SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE AWARD OF THE DEGREE OF BACHELOR OF SCIENCE IN APPLIED STATISTICS WITH COMPUTING IN THE DEPARTMENT OF MATHEMATICS, PHYSICS AND COMPUTING

FEBRUARY, 2023

DECLARATION

This Proposal is our original work and has not been presented for a Degree in any other University.

NAME: TONY N. APINDI
REG. NO.: AST/08/19
SIGN: ____________
DATE: __________

SUPERVISORS DECLARATION

This proposal has been submitted with my/our approval as University supervisor(s).

1. Signature: ___________________ Date: ________________________
Name of the Supervisor: Dr. Gregory Kerich
2. Signature: ___________________ Date: ________________________
Name of the Supervisor: Mr. Dennis Mwan

ABBREVIATIONS AND ACRONYMS

AIDS: Acquired Immunodeficiency Syndrome
ART: Antiretroviral Therapy
COVID-19: Corona Virus Disease 2019
HIV: Human Immunodeficiency Virus
HIV-ASES: HIV Treatment Adherence Self-Efficacy Scale
KENPHIA: Kenya Population-based HIV Impact Assessment
PHIA: Population-based HIV Impact Assessment
PMTCT: Prevention of Mother to Child Transmission
TB: Tuberculosis
UN: United Nations
UNAIDS: Joint United Nations Programme on HIV/AIDS
WHO: World Health Organization

ABSTRACT

The spread of HIV/AIDS in Kenya has ravaged many communities over many decades without a vaccine being found. Many researchers have found that, in Kenya, most of the affected people are adolescent girls and women, who form a larger percentage of the people who are HIV positive than males do. Therefore, this study seeks to find out which factors affect CD4 cell count levels in females by using a support vector machine (SVM). The general objective of the study is to apply machine learning models in the identification of factors associated with CD4 cell count levels.
The specific objectives are to investigate demographic and socio-economic factors related to CD4 cell count levels, to fit a support vector machine learning model, and to investigate the factors affecting CD4 cell levels in HIV+ women in Kenya. The results were obtained by using SVM to identify factors related to CD4 cell count levels. The data is the latest data set of HIV positive women in Kenya extracted from Kenya's PHIA. From the analysis, 90% of individuals who did not enroll in school had a low CD4 count while 10% had a high CD4 count. Furthermore, 90% of respondents related to the head of the family had a low CD4 count, and 9% of heads of families had a high CD4 count. Of the respondents whose relationship with the head of the family was that of a relative, 50% had a high CD4 count. The performance evaluation revealed that the SVM model had an accuracy of 0.949, which indicates that the model correctly classified 94.9% of the test data. The precision for CD4 categories 1 and 2 was 0.992 and 0.143 respectively. The recall for CD4 categories 1 and 2 was 0.956 and 0.5 respectively. The F1 score for CD4 categories 1 and 2 was 0.974 and 0.222 respectively. The variables with a significant relationship to CD4 cell count were Relationship With Head, Sick to Work last 3 Months, Ever Attended School, Ever Enrolled in School, Work for Pay, Married/Live Together, Number of Pregnancies, Pregnant Currently, Ever Avoided Pregnancy, Ever Sought TB Treatment, Duration on ART and ARVs Detected. The study recommends that healthcare providers and policymakers prioritize educating individuals, particularly those living with HIV, on the significance of enrolling in school and obtaining employment. Additionally, healthcare providers should offer comprehensive HIV management programs that focus on the social and economic factors that affect CD4 cell count levels.
Finally, further studies should be conducted to explore the role of other social and economic factors on CD4 cell count levels.

TABLE OF CONTENTS

DECLARATION
SUPERVISORS DECLARATION
ABBREVIATIONS AND ACRONYMS
ABSTRACT
TABLE OF CONTENTS
CHAPTER ONE: INTRODUCTION
1.1 Background of the Study
1.2 Problem Statement
1.3 Justification
1.4 Purpose of the Study
1.5 Objectives of the Study
1.5.1 General Objective
1.5.2 Specific Objectives
1.6 Study Hypothesis
1.6.1 Null Hypothesis
1.7 Significance of the Study
1.8 Scope of the Study
1.9 Limitations
CHAPTER TWO: LITERATURE REVIEW
2.1 HIV AIDS in Kenya
2.2 CD4 cells
2.3 Treatment of HIV
2.4 Support Vector Machines (SVM)
CHAPTER THREE: METHODOLOGY
3.1 Data Description
3.2 Data Pre-Processing
3.3 Data Analysis
3.3.1. To investigate demographic and socio-economic factors related to CD4 count levels.
3.3.2. To fit an appropriate model to the data.
3.3.3. To find out the factors affecting the CD4 levels on HIV positive women in Kenya
CHAPTER FOUR: RESULTS
4.1. To investigate demographic and socio-economic factors related to CD4 cell count levels.
4.1.1. Education
4.1.2 Relationship with head
4.2. Machine Learning Model
4.2.1. Support Vector Machine
4.3. To find out the factors affecting the CD4 cell levels on HIV positive women in Kenya
4.3.1. Household Characteristics
CHAPTER FIVE: CONCLUSION & RECOMMENDATIONS
5.1 Conclusion
CHAPTER SIX: REFERENCES
APPENDIX

CHAPTER ONE: INTRODUCTION

1.1 Background of the Study

The HIV epidemic has been present for more than thirty years, yet there is still no cure or vaccine for the disease. HIV/AIDS has therefore remained a major health crisis, especially in Sub-Saharan Africa, where adolescent girls and young women are at higher risk of infection and account for about one in four new infections.
In addition, in the Eastern and Southern African regions, adolescent girls and young women accounted for 26% of new infections (UNAIDS, 2020). HIV has caused immense human suffering in Sub-Saharan Africa, with the most obvious impact on individuals being illness and death. The larger effect has been felt in the health and socio-economic sectors. For instance, in Sub-Saharan Africa, people living with HIV/AIDS-related problems occupy half of the hospital beds. The World Health Organization (WHO) has associated increased vulnerability to HIV with legal and social factors (World Health Organization, 2022).

However, great milestones have been achieved in treating HIV. These include the availability and rapid scale-up of antiretroviral therapy (ART), voluntary medical male circumcision, antiretroviral medication for the prevention of mother to child transmission, and pre-exposure prophylaxis, among others. UNAIDS has also been keen on setting targets for specific countries while giving grants to research studies in order to attain zero new infections, zero discrimination and zero AIDS-related deaths. New HIV infections across countries in Africa declined by more than 33%, from an estimated 2.2 million in 2005 to 1.5 million in 2013 (UNAIDS, 2022). The scale-up and widespread coverage of ART led to substantial declines in new HIV infections. Despite these declines, HIV incidence rates remained unacceptably high, with the largest numbers of new infections coming from South Africa (22%), Nigeria (15%), Uganda (10%), Rwanda (7%) and Kenya (7%). The epidemic in other Sub-Saharan countries saw a substantial decline due to the impact of modest ART coverage at CD4 cell counts ranging from <200 to 500 cells per cubic millimetre of blood. This resulted in significant declines in mortality, with life expectancy increasing by an additional ten years.
These studies provided evidence of the benefits of early ART initiation for HIV positive individuals (UNAIDS, 2014). Previous research has established the CD4 cell count as one of the parameters used to measure disease progression. HIV attacks CD4 cells, reducing their levels in the body and making it difficult to fight diseases. Furthermore, the CD4 cell count has been used for immunological classification of HIV infection, where the levels have been shown to correlate with the clinical stages of HIV-related diseases (Barnett et al., 2008).

In 2019, Kenya had 1.5 million people living with HIV, a 4.9% adult prevalence (ages 15-49), 77,000 new infections and 22,000 AIDS-related deaths (Kamer, 2022). The HIV prevalence in women was 6.6% while that of men was 3.1% (Ministry of Health, 2020). The burden of the disease was therefore higher on women than on men. The country recorded the following results: 79% of people living with HIV were aware of their status, 78.9% were on antiretroviral therapy, and 85.3% were virally suppressed (NSCOP, 2020). These results approached, but did not fully meet, the UNAIDS 90-90-90 targets set for 2020.

Women were disproportionately affected by HIV in Kenya. In 2018, 890,000 women aged 15 and above were living with HIV compared to 510,000 men of the same age group. Similarly, in the same year, 36,000 women were newly infected with HIV compared to 27,000 men (UNAIDS, 2020). In Kenya, there was a great disparity between the two genders. Women faced more discrimination than men, with statistics showing that approximately 45% of women aged 15-49 who had never been married or in a long-term relationship were estimated to have experienced physical or sexual violence from an intimate male partner in 2019 (Odhiambo, 2020). Women also tended to be infected earlier because they had older partners and were married off earlier. Therefore, women were a key indicator of the country's progress towards eliminating HIV/AIDS as a public health threat.
They made up most of the total HIV infections and were more likely to experience challenges accessing antiretroviral therapy.

1.2 Problem Statement

The prevalence of HIV among adults aged 15 to 64 years in Kenya was 4.9 per cent in 2020 (KENPHIA, 2020). HIV therefore remains a public health concern, with a significant number of adults living with the virus. The country has made progress towards the UNAIDS 90-90-90 goal but needs to accelerate efforts towards the 2030 vision of ending AIDS as a public health threat (Frescura et al., 2022). To achieve this, it is crucial to understand the factors that affect CD4 cell count among the HIV population in Kenya. Machine learning techniques, particularly support vector machines, have shown promise in analysing existing data sets related to HIV and CD4 cell count in the country. However, there is limited public research available on the use of these methods.

1.3 Justification

Gender plays a big role in the community, and women living with HIV in Kenya experience inequality compared to their male counterparts. This study aimed to help break down barriers that prevent women from accessing quality and affordable testing and treatment services by using CD4 level as an indicator of good health. The analysis used the Support Vector Machines (SVM) machine learning model to provide a basis of estimation with a dichotomous outcome. The model would predict either a low or a high CD4 cell count from socio-demographic and other factors, with a high count being the more desirable outcome. The significant factors identified would therefore act as a compass for improving the health of HIV+ women, thereby improving general public health. Globally, this would contribute towards attaining the UNAIDS target of eliminating HIV/AIDS as a public health threat while attaining the sustainable development goals.
In conclusion, quality affordable health care is a basic right for everyone, and this study strived to provide statistical inferences towards the same.

1.4 Purpose of the Study

To provide an insight into various factors that affect CD4 cell count levels among women and give recommendations on the factors to be mitigated.

1.5 Objectives of the Study

1.5.1 General Objective

The general objective of this study was to apply machine learning models in the identification of factors associated with CD4 cell count levels.

1.5.2 Specific Objectives

1. To investigate demographic and socio-economic factors related to CD4 cell count levels.
2. To fit a support vector machine learning model.
3. To find out the factors affecting the CD4 cell levels on HIV+ women in Kenya.

1.6 Study Hypothesis

1.6.1 Null Hypothesis

There is no association between the selected factors and CD4 cell count levels.

1.7 Significance of the Study

This study would enable people and the government to be apprised of the factors associated with CD4 cell count levels. This would enable the government to develop appropriate ways of mitigating these factors to alleviate the HIV pandemic. The models identified the main demographic and socio-economic factors associated with CD4 cell levels to help the relevant stakeholders provide solutions. The study therefore provides data insights to scientists, students and members of the general public working on related research topics. The government could implement relevant policies to bridge the inequalities faced by women. This would greatly improve the health sector, which would in turn improve the country's economy. In addition, due to the COVID-19 outbreak, most health resources had been channeled towards combating the COVID-19 pandemic. This shift had led to the neglect of other health threats such as HIV infection, which motivates a keen interest in using machine learning models to develop indicators towards eliminating HIV as a public health threat.
1.8 Scope of the Study

The study was centered on women aged 15-64 years in Kenya who were already infected with HIV.

1.9 Limitations

Lack of funds and adequate time confined the study to secondary data sources.

CHAPTER TWO: LITERATURE REVIEW

2.1 HIV AIDS in Kenya

According to UNAIDS, in 2019, 1.5 million people were living with HIV, translating to a 4.7% adult prevalence (ages 15-49), a slight decrease from 4.8% in 2018. In the same year, 42,000 new HIV infections were recorded, decreasing from 46,000 the previous year (UNAIDS, 2020). Approximately 50% of new infections were among ages 15-29 (AIDSinfo | UNAIDS, 2021). HIV prevalence is higher in urban areas than in rural areas, at 6.3% and 3.6% respectively. HIV prevalence among women, at 6.6%, is more than double the 3.1% among men. HIV prevalence is highest among women aged 45-49 at 12%. Among young people (ages 15-24), the HIV prevalence is 1%, with women having a higher prevalence of 2% compared to 0.6% in men. The prevalence among children under 15 years is 0.3%. The PHIA report established that 79% of people living with HIV know their status, 78.9% of those are on ART, and 85.3% of those on treatment are virally suppressed (2021). 80% of pregnant women living with HIV received ART for PMTCT, while 67% of children living with HIV are on ART (UNAIDS, 2020).

2.2 CD4 cells

The CD4 helper cell or T cell, also known as the CD4 T lymphocyte, is a type of white blood cell responsible for fighting off bacteria, viruses, and other invading germs. The CD4 count is a test that estimates the number of CD4 cells in a cubic millimetre of blood (Okoye and Picker, 2013). The test helps establish how much damage has been done to the immune system and the likely outcome if antiretroviral treatment (ART) is not started. For an HIV negative person, the CD4 count should be between 500 and 1500 (Nall, 2021).
People with HIV who have a CD4 count of over 500 are considered to be in good health. HIV positive people with a CD4 count of less than 200 are considered to be at a greater risk of developing a serious illness. Upon gaining entry to the body, HIV targets the immune system (white blood cells), particularly the CD4 cells. When too many CD4 cells are lost, the immune system weakens and struggles to fight infections. HIV lacks mechanisms to replicate on its own; hence, it attaches itself to the surface of the CD4 cell, gets inside and becomes part of the cell (Nall, 2021). Once the CD4 cell is dead, HIV begins releasing more copies of HIV into the bloodstream. The newly released copies of HIV take over more CD4 cells, and the cycle continues, reducing the number of HIV-free, working CD4 cells (Nall, 2021). When the destruction advances to the stage where the CD4 count drops below 200, the host is said to have AIDS.

The CD4 count, also known as the CD4 lymphocyte count, CD4+ count or T4 count, enables the health care provider to check whether an individual is at risk of any complications from HIV. The CD4 count can also be used to analyse and observe how HIV affects an individual's immune system and whether the individual is developing any complications from HIV. The test shows the state of the immune system with regard to HIV, indicating whether a change in medication will suffice to manage the situation. Also, when the CD4 cell count is too low, the patient can be diagnosed with AIDS. AIDS is the most advanced stage of HIV infection, and it opens the body to opportunistic infections due to the damage done to the immune system.

For more than two decades, CD4 cell counts have been critical in understanding the progression of HIV (Ford et al., 2015). This measurement determines when a patient should begin antiretroviral therapy (ART), and it shows the progression of the virus during the administration of this therapy. According to Ford et al. (2017), advances in technology have displaced the CD4 cell count as the marker for starting ART. The introduction of viral load testing to monitor the virus's progression in patients has made CD4 cell counts largely unnecessary for that purpose (Ford et al., 2017).

Studies have been carried out to identify the various factors that may affect CD4 cell counts in HIV positive individuals. The factors can be medical or related to the socio-economic situation of the individuals. These studies show that CD4 cell counts vary considerably with environmental factors. Jones et al. (1993) and Ickovics et al. (2001) show the reaction of CD4 cell counts to medical issues and conditions. According to Jones et al. (1993), tuberculosis has a relationship with CD4 cell counts in HIV positive patients. The study showed that patients with low CD4 cell counts were more likely to contract severe tuberculosis, while those with a higher CD4 cell count were less likely to contract the disease. This study showed how critical the CD4 cell count is to an HIV positive individual. Ickovics et al. (2001) carried out a study to determine the association of depressive symptoms with HIV-related mortality and the decline in CD4 cell count among women with HIV. Using a prospective, longitudinal cohort study and a multivariate analysis, the study showed that depressive symptoms are associated with the progression of HIV (Ickovics et al., 2001). The symptoms are associated with declines in CD4 cell counts, which advance the disease. These studies show the effect of mental and general illnesses on CD4 cell counts, which enables the progression of HIV. Montarroyos et al. (2014) carried out a study meant to identify the factors related to variations in CD4 counts in HIV positive patients.
The study implemented a multilevel model using three levels of aggregation to analyse the association between the predictor variables and the fluctuations in CD4 count over time (Montarroyos et al., 2014). The study found that the CD4 count is related to factors such as treatment adherence, patients' habits, change in treatment or doctor, and use of ART. The lifestyles patients lead greatly determine CD4 count levels and the progression of the virus. Patients should therefore take responsibility for how the lives they lead affect the progression of HIV in their bodies (Montarroyos et al., 2014).

2.3 Treatment of HIV

Today there is no cure for HIV. The only existing remedies are medications that control HIV and avert complications (HIV/AIDS - Diagnosis and treatment - Mayo Clinic, 2021). These medications are known as antiretroviral therapy (ART). ART prevents HIV from replicating and from destroying the immune system of an infected person. ART is usually a combination of three or more medications from several different drug classes (HIV/AIDS - Diagnosis and treatment - Mayo Clinic, 2021). The treatment uses drugs from different classes to account for individual drug resistance, avoid generating new drug-resistant strains of HIV and maximise suppression of the virus in the blood. This combination is known as Highly Active Antiretroviral Therapy (HAART). These medications help lower the viral load in the body. Anyone diagnosed with HIV should immediately be enrolled on these medications. Although there is no cure for this virus, guidelines have been set to ensure that patients are well cared for and that their health improves. Studying the factors that may affect individuals with HIV is critical when providing them with care and guidance. Studies have been conducted to develop strategies that will reduce the occurrence of HIV.
According to Hayes et al. (2019), a combination prevention intervention with ART provided per the local guidelines resulted in a 30% lower incidence of HIV infection. The study compared the combined method of intervention to standard care to point out what can be improved to provide quality treatment for HIV positive individuals. The availability of ART to the population led to a significant decline in the incidence of HIV. Johnson et al. (2007) stress that adherence to treatment is critical in managing the virus in HIV positive patients. The study focused on treatment adherence self-efficacy for HIV. The paper also validates the use of the HIV Treatment Adherence Self-Efficacy Scale (HIV-ASES) using two samples of HIV+ adults on ART (Johnson et al., 2007). The successful development of Highly Active Antiretroviral Therapy (HAART) was a great achievement in the management of HIV (Floridia et al., 2008). The study conducted by Floridia et al. (2008) investigated gender differences in HIV therapeutics. The data on drug response showed a similar outcome in men and women in the study. However, women appeared to be more vulnerable to adverse events related to the treatment (Floridia et al., 2008). This disparity between the genders poses a challenge, and the treatment needs to be optimised to address it.

2.4 Support Vector Machines (SVM)

Support Vector Machines (SVM) is a popular machine learning technique used for classification and regression analysis. SVM is particularly useful when the data is not linearly separable, meaning that a linear decision boundary cannot accurately separate the data points. SVM works by finding the hyperplane that maximizes the margin between the classes, where the margin is determined by the support vectors, the data points closest to the decision boundary.
SVM is based on the idea of finding the optimal trade-off between minimizing the classification error and maximizing the margin, which makes it a powerful algorithm for complex datasets. SVM can be applied to a wide range of applications, including image classification, natural language processing, and financial prediction. However, one limitation of SVM is that it can be computationally expensive for large datasets, and the choice of kernel function can also have a significant impact on the accuracy of the model.

There is some published evidence on the use of machine learning techniques to analyze existing data sets related to HIV and CD4 count in Kenya. However, the use of machine learning in this context is still a relatively new area of research, and there may not be as much literature available on this topic as in more established areas of HIV research. Regarding the use of machine learning, a study by Daniel Niguse Mamo et al. (2023) found that a random forest classifier outperformed the other models considered in predicting virological failure and identifying its relevant predictors in Ethiopia. This outcome suggested that these techniques may have utility in improving HIV care. Another study, published in the journal PLOS ONE in 2021, used machine learning techniques to identify factors associated with virologic failure among HIV-positive individuals receiving antiretroviral therapy in Kenya. The study found that several clinical and demographic factors were strongly associated with virologic failure, including age, sex, baseline CD4 count, and viral load (Masaba et al., 2023).

CHAPTER THREE: METHODOLOGY

3.1 Data Description

The data is the latest data set of HIV positive women in Kenya extracted from Kenya's PHIA. The sample size is 1242 women, and 28 variables were surveyed. The variables include categorical and numerical variables. The levels of the categorical variables are coded in numerical form.
They are described as follows:

Variable Name | Description of Variable | Coded Values and Labels

Dependent variable
CD4.category | CD4 level category | 1- High; 0- Low

Independent variables
age | Age in years | Between 15 and 65
agelim | Age groups for population pyramid | 0-4 years; 5-9 years; 10-14 years; 15-19 years; 20-24 years; 25-29 years; 30-34 years; 35-39 years; 40-44 years; 45-49 years; 50-54 years; 55-59 years; 60-64 years; 65-69 years
shipwhead | Relationship of the individual with the head of the family | 1- Head; 2- Wife/husband/partner; 3- Son/daughter; 4- Son/daughter in law; 5- Grandchild; 6- Parent; 7- Parent in law; 8- Brother/sister; 9- Co wife; 10- Other relatives; 11- Adopted/foster/stepchild; 12- Not related
liveinhold | Does the individual live in the house | 1- No; 2- Yes
sickwork | Has the individual been very sick for at least three months | 1- No; 2- Yes
gosch | Ever attended school | 1- No; 2- Yes
enrollsch | Ever enrolled in school | 0- No; 1- Yes
leveledu | Highest level of school you attended | 1- Primary; 2- Post-primary training; 3- Secondary (O-level)
gradelevel | Highest grade at the school level | Between 0 and 14
workpay | Work for payment past 12 months | 0- No; 1- Yes
livetogether | Ever married or lived together | 0- No; 1- Yes
pregnancies | Number of pregnancies | Between 0 and 13
liveborn | Ever had a pregnancy that resulted in a live birth | 1- No; 2- Yes
numchild2012 | Number of children given birth since 2012 | 0, 1, …, 10
pregnow | Current pregnancy status | 1- Not currently pregnant; 2- Currently pregnant
avoidpreg | Avoiding pregnancy | 1- No; 2- Yes
age1stsex | Age at first sex | Between 8 and 35
age1stsexlim | Age groups for population pyramid for age at first sex | 0-4 years; 5-9 years; 10-14 years; 15-19 years; 20-24 years; 25-29 years; 30-34 years; 35-39 years
tbtreat | Ever sought TB treatment | 1- No; 2- Yes
alcofreq | How often does the individual have a drink containing alcohol | 1- Never; 2- Monthly or less; 3- 2-4 times a month; 4- 2-3 times a week; 5- 4 or more times a week
urban | Urban area indicator | 1- Rural; 2- Urban
knownstat | Known HIV status | 1- Stated HIV negative; 2- Stated HIV positive
wealthq | Wealth quintile | 1- Lowest; 2- Second; 3- Middle; 4- Fourth; 5- Highest
sexlast12 | Respondent had sexual intercourse in the past 12 months | 1- No; 2- Yes
everhadsex | Respondent ever had sexual intercourse | 1- No; 2- Yes
buysellsex12 | Bought/sold sex past 12 months | 1- No; 2- Yes
onart | Indicator whether the respondent is on ART | 1- On ART; 2- Not on ART
timeonart | Duration of time on ART | 1- On ART 24 months or more; 2- On ART 12-23 months; 3- On ART <12 months; 4- Not on ART
arvsdetected | Indicator whether ARVs detected | 1- ARVs not detected; 2- ARVs detected

3.2 Data Pre-Processing

The data was pre-processed through data wrangling, also known as data cleaning. Ridzuan (2022) describes data cleaning as the process of modifying data to ensure that it is free of irrelevant and incorrect information, expounds on the data cleaning steps, and weighs the advantages and disadvantages of data cleaning. The data was cleaned in preparation for its analysis, using R. Irrelevant observations were removed. The data structure was declared in the software so that categorical and numerical variables were recognised as such. Outliers were also identified and filtered out. The age-based variables, age and age1stsex, were regrouped into age groups to better understand their characteristics. In addition, the variables were renamed to aid the imputation and avoid overlapping labels in the visualisations. The data was checked for missing values in terms of percentage; where the percentage of missing values was below thirty per cent (30%), imputation was carried out. In addition, the method of handling the missing data was chosen according to whether or not the values were missing at random.
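The missing-data check described above can be sketched in a few lines. The study itself reports using R; the Python below is only an illustrative sketch, and the toy records, the helper names `missing_percentages` and `imputation_candidates`, and the use of `None` to mark a missing value are all assumptions made for the example rather than details taken from the PHIA data.

```python
# Hypothetical toy records; None marks a missing value. Not the PHIA data.
records = [
    {"age": 34, "workpay": 1, "pregnancies": 2},
    {"age": None, "workpay": 0, "pregnancies": 3},
    {"age": 41, "workpay": None, "pregnancies": None},
    {"age": 29, "workpay": 1, "pregnancies": None},
]

IMPUTE_THRESHOLD = 30.0  # per cent, the cut-off stated in the text


def missing_percentages(rows):
    """Return the percentage of missing (None) values for each variable."""
    variables = rows[0].keys()
    return {
        var: 100.0 * sum(row[var] is None for row in rows) / len(rows)
        for var in variables
    }


def imputation_candidates(rows, threshold=IMPUTE_THRESHOLD):
    """List variables that have missing data but less than `threshold` per cent."""
    pct = missing_percentages(rows)
    return [var for var, p in pct.items() if 0 < p < threshold]


pct = missing_percentages(records)
candidates = imputation_candidates(records)
```

In this toy example `age` and `workpay` are each 25% missing and would be flagged for imputation, while `pregnancies` is 50% missing and would not. How the flagged variables are then imputed is a separate choice that the sketch does not cover.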
The basic assumption for our data was that the missing values are missing at random, which allows them to be imputed. Visualisations were used to understand the distribution of the missing data. The missing values were imputed using the tidyverse package in R. The tidyverse package is a suitable tool for data pre-processing (Schober & Vetter, 2020).

3.3 Data Analysis

3.3.1. To investigate demographic and socio-economic factors related to CD4 count levels.

Socio-demographic factors are characteristics of a population. Generally, characteristics such as age, gender, ethnicity, education level, income, years of experience and location are considered socio-demographic factors. Cross tabulation (also referred to as cross-tabs) is a quantitative research method appropriate for analysing the relationship between two or more variables. Cross tabulations provide a way of analysing and comparing the results for one or more variables with the results of others. Dwumoh et al. (2014), in a study of the determinants of factors associated with child health outcomes and service utilisation in Ghana based on the multiple indicator cluster survey conducted in 2011, used cross-tables to relate socio-demographic characteristics to National Health Insurance Scheme membership of children under five, based on the Chi-square test statistic with the corresponding p-value. In this study, the socio-demographic factors were crossed with the dependent variable (CD4 level) to determine whether there is an association between the socio-demographics and the CD4 level outcome in the data. R was used to determine the associations.

3.3.2. To fit an appropriate model to the data.

3.3.2.1. Support Vector Machines (SVM)

SVM is a supervised machine learning algorithm used in classification as well as regression problems.
The goal of this algorithm is to create the best line or decision boundary with which to partition n-dimensional space into classes, so that new data points can be placed in the right category. This best boundary is known as a hyperplane. SVM chooses the extreme points or vectors, known as support vectors, to create the hyperplane; hence the algorithm is known as the support vector machine. The algorithm is popularly used in face detection and image classification.

Types of SVM

1. Linear SVM: used for linearly separable data.
2. Non-linear SVM: used for non-linearly separable data.

The linear SVM is used as a classifier that segregates classes or categories in n-dimensional space. The data points closest to the decision boundary are known as support vectors, and they determine the position of the decision boundary. The distance between the support vectors and the decision boundary is known as the margin. The goal of SVM is to maximise this margin and find the hyperplane/decision boundary with the maximum distance from the support vectors, known as the optimal hyperplane.

Linear SVM: The mathematical model for a linear SVM is:

$$f(x) = w^{T} x + b \qquad \text{(i)}$$

Where:
w is the weight vector
x is the input vector
b is the bias term

In the case of a non-linear SVM, a straight line cannot separate the data points. The classifier separates the classes in a space of more than two dimensions. For our case we use a non-linear SVM, since our dataset cannot be classified using a straight line.

The mathematical equation for a non-linear SVM involves the use of a kernel function to transform the input data into a higher-dimensional space. The most commonly used kernel functions are the Gaussian kernel and the polynomial kernel. In our project, the kernel function is used to identify the factors related to CD4 cell count among HIV positive patients that may not be linearly separable in the original feature space.
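To make the linear model in equation (i) and the two kernel functions named above concrete, the following is a minimal sketch in plain Python. The study reports fitting its model in R, so this is not the author's implementation; the function names, the example weights and inputs, the kernel parameters `gamma`, `degree` and `coef0`, and the rule that a non-negative score maps to class 1 are all assumptions made for illustration.

```python
import math


def linear_decision(w, x, b):
    """Equation (i): f(x) = w^T x + b, for vectors given as Python lists."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def gaussian_kernel(u, v, gamma=0.5):
    """Gaussian (RBF) kernel: exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return math.exp(-gamma * sq_dist)


def polynomial_kernel(u, v, degree=2, coef0=1.0):
    """Polynomial kernel: (u . v + coef0) ** degree."""
    dot = sum(ui * vi for ui, vi in zip(u, v))
    return (dot + coef0) ** degree


# Hypothetical weight vector, bias and input; not fitted to any data.
w, b = [2.0, -1.0], 0.5
x = [1.0, 3.0]

score = linear_decision(w, x, b)   # 2*1 + (-1)*3 + 0.5
label = 1 if score >= 0 else 0     # assumed rule: non-negative score -> class 1
```

The Gaussian kernel equals 1 when its two arguments are identical and decays towards 0 as they move apart, which is what lets a kernel SVM draw curved boundaries in the original feature space. In practice the weights, bias and kernel parameters are learned or tuned by the fitting routine rather than set by hand as they are here.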
The mathematical model for a non-linear SVM is:

$$f(x) = \sum_{i} \alpha_i y_i K(x_i, x) + b \qquad \text{(ii)}$$

Where: $\alpha_i$ are the coefficients, $y_i$ are the class labels, $K$ is the kernel function, $x_i$ are the training input vectors and $b$ is the bias term. The $x_i$ represent the input features related to CD4 cell count in HIV positive patients, and the $y_i$ represent the low or high CD4 cell count class. The goal of the non-linear SVM is to find the hyperplane that separates the data with the maximum margin in the transformed feature space. R will be used to fit a support vector machine model to the data.

3.3.3. To find out the factors affecting the CD4 levels on HIV positive women in Kenya

Research develops new theories, ideas and products that shape our society and our everyday lives. Its purpose is to understand the world further and to learn how this knowledge can be applied to better everyday life; it is an integral part of problem-solving. Using the data, we conduct a detailed analysis to investigate the factors that affect CD4 levels in HIV positive women in Kenya. Information on the variables affecting CD4 levels in HIV positive women in Kenya is presented to promote intervention studies and surveys. The variables that significantly affect an individual's CD4 level are identified. The association between each factor and the CD4 level was extracted using cross-tables and the results interpreted. These procedures were accomplished using R. Understanding the functions of the programs was critical to understanding and internalising the entire project.

CHAPTER FOUR: RESULTS

4.1. To investigate demographic and socio-economic factors related to CD4 cell count levels.

4.1.1. Education

4.1.1.1 Highest grade at school

Figure 1: Highest grade in school

The majority of the respondents had grade 4 as their highest grade level (233), followed by grade 2 with 214 respondents.
Those with 14 and 0 as their highest grade level recorded relatively few respondents.

Figure 2: Highest grade in school against CD4 cell count

From figure 2, the respondents with grade 2 recorded the highest proportion of low CD4 counts at 94%, while those with grade 4 recorded the lowest proportion of low CD4 counts at 82%. On the other hand, the highest proportion of high CD4 counts was 18%, recorded for grade 3, while the lowest proportion of high CD4 counts was 6%, for respondents with grade 2 as the highest grade level. Respondents in grade 0 and 14 are considered outliers and carry no significant weight.

4.1.1.2 Attended school

Figure 3: Attended school

The figure below indicates the association between those who attended school and their CD4 levels. With 84% of the respondents having attended school [figure 3], 10% of those who attended school had a high CD4 count compared to 16% of those who did not attend school [figure 4]. Correspondingly, 90% of those who attended school had a low CD4 count compared to 84% of those who did not attend school.

Figure 4: Attended school against CD4 cell count

4.1.1.3 Enrolled in school

According to figure 5, 95% of the respondents have never enrolled in school.

Figure 5: Enrolled in school

In figure 6, 90% of individuals who did not enroll in school have a low CD4 count while 10% have a high CD4 count. Furthermore, 81% of those who enrolled in school have a low CD4 count, while 19% have a high CD4 count.

Figure 6: Enrolled in school against CD4 cell count

4.1.1.4 Highest level of education

Figure 7 shows that 76% of the respondents had primary school education and 21% had post-primary training. Only 3% of the respondents had secondary education.

Figure 7: Highest level of education

From figure 8, 91% of respondents with post-primary training had a low CD4 count, and 9% with the same training had a high CD4 count.
89% of the respondents with primary education had a low CD4 count compared to 11% with a high CD4 count. 81% of those with secondary education had a low CD4 count compared with 19% with a high CD4 count.

Figure 8: Highest level of education against CD4 cell count

4.1.2 Relationship with head

From figure 9, respondents whose relationship to the head was co-wife, grandchild, not related, parent-in-law or daughter-in-law had no record of a high CD4 count, resulting in a 100% record of low CD4 count. Among respondents who were themselves the head of the family, about 91% had a low CD4 count and 9% had a high CD4 count. Of the respondents recorded as another relative of the head, 50% had a high CD4 count. Among sons/daughters of the head, 12% had a high CD4 count, and 11% of partners of the family head had a high CD4 count.

Figure 9: Relationship with head against CD4 cell count

4.2. Machine Learning Model

4.2.1. Support Vector Machine

To fit a support vector machine (SVM) model, we first installed and loaded the e1071 package. We then split the dataset into training and testing sets, with 70% of the data used for training and 30% for testing. Next, we built the SVM model using the 'svm' function from the e1071 package. We specified CD4.category as the target variable and used all other variables as predictors with a linear kernel. We then made predictions on the test data using the 'predict' function and evaluated the performance of the SVM model using accuracy, precision, recall and F1 score. The performance evaluation revealed that the SVM model had an accuracy of 0.949, which indicates that the model correctly classified 94.9% of the test data. The precision for CD4 categories 1 and 2 was 0.992 and 0.143 respectively. The recall for CD4 categories 1 and 2 was 0.956 and 0.5 respectively. The F1 score for CD4 categories 1 and 2 was 0.974 and 0.222 respectively.
These results indicate that the SVM model performed well in predicting CD4 category 1 (low CD4) from the selected predictors, while its performance on category 2 (high CD4) was much weaker, with an F1 score of 0.222. We therefore agreed on using SVM rather than logistic regression, which had an accuracy of 88 percent.

SVM model

Call: svm(formula = CD4.category ~ ., data = train_data, kernel = "linear")
Parameters:
SVM-Type: C-classification
SVM-Kernel: linear
cost: 1
Number of Support Vectors: 53

SVM-Type: This shows that the SVM is a C-classification type, which means that the SVM is trained to perform classification tasks.
SVM-Kernel: This shows that a linear kernel was used in this model, which means that the decision boundary is a hyperplane (a straight line when there are only two predictors).
cost: This shows that the cost parameter was set to 1. The cost parameter controls the trade-off between maximizing the margin and minimizing the classification error.
Number of Support Vectors: This shows that 53 support vectors were used in the model. Support vectors are the data points that are closest to the decision boundary and have the most influence on the classification.

1. Accuracy: Accuracy is a metric used to measure the overall performance of a classification model. It represents the proportion of correctly classified samples out of the total number of samples.
Formula: Accuracy = (True Positive + True Negative) / (True Positive + False Positive + True Negative + False Negative)
Explanation:
True Positive (TP): The number of samples that are actually positive and are correctly classified as positive by the model.
False Positive (FP): The number of samples that are actually negative but are incorrectly classified as positive by the model.
True Negative (TN): The number of samples that are actually negative and are correctly classified as negative by the model.
False Negative (FN): The number of samples that are actually positive but are incorrectly classified as negative by the model.
2. Precision: Precision is a metric that measures the proportion of correctly classified positive samples out of all the samples classified as positive.
Formula: Precision = True Positive / (True Positive + False Positive)
Explanation:
True Positive (TP): The number of samples that are actually positive and are correctly classified as positive by the model.
False Positive (FP): The number of samples that are actually negative but are incorrectly classified as positive by the model.

3. Recall: Recall is a metric that measures the proportion of correctly classified positive samples out of all the actual positive samples.
Formula: Recall = True Positive / (True Positive + False Negative)
Explanation:
True Positive (TP): The number of samples that are actually positive and are correctly classified as positive by the model.
False Negative (FN): The number of samples that are actually positive but are incorrectly classified as negative by the model.

4. F1 Score: F1 score is the harmonic mean of precision and recall. It provides a balanced measure between precision and recall, which is useful when the classes are imbalanced.
Formula: F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
Explanation:
Precision: The proportion of correctly classified positive samples out of all the samples classified as positive.
Recall: The proportion of correctly classified positive samples out of all the actual positive samples.

# Calculate performance metrics
# Rows of the table are the predicted classes, columns are the actual classes
conf_mat <- table(svm_pred, test_data$CD4.category)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
precision <- diag(conf_mat) / rowSums(conf_mat) # correct / all predicted as that class
recall <- diag(conf_mat) / colSums(conf_mat) # correct / all actually in that class
f1_score <- 2 * precision * recall / (precision + recall)

Output

| Metric | Low CD4 | High CD4 |
|---|---|---|
| Accuracy | - | - |
| Precision | - | - |
| Recall | - | - |
| F1 Score | - | - |

4.3.
To find out the factors affecting the CD4 cell levels on HIV positive women in Kenya

The significant characteristics of the study participants are summarised in Table 1 below.

Table 1: Characteristics of the study participants by CD4 level

| Characteristics | Total (N=1242) | High CD4 (N=133) | Low CD4 (N=1109) |
|---|---|---|---|
| Relationship With Head | | | |
| brother/sister | 24 (1.9%) | 4 (3.0%) | 20 (1.8%) |
| Co-wife | 1 (0.1%) | 0 (0%) | 1 (0.1%) |
| grandchild | 8 (0.6%) | 0 (0%) | 8 (0.7%) |
| head | 681 (54.8%) | 62 (46.6%) | 619 (55.8%) |
| not related | 6 (0.5%) | 0 (0%) | 6 (0.5%) |
| other relative | 20 (1.6%) | 10 (7.5%) | 10 (0.9%) |
| parent | 15 (1.2%) | 1 (0.8%) | 14 (1.3%) |
| parent in law | 1 (0.1%) | 0 (0%) | 1 (0.1%) |
| partner | 412 (33.2%) | 47 (35.3%) | 365 (32.9%) |
| son/daughter | 73 (5.9%) | 9 (6.8%) | 64 (5.8%) |
| Son-in-law/daughter-in-law | 1 (0.1%) | 0 (0%) | 1 (0.1%) |
| Sick to Work last 3 Months | | | |
| no | 1096 (88.2%) | 103 (77.4%) | 993 (89.5%) |
| yes | 146 (11.8%) | 30 (22.6%) | 116 (10.5%) |
| Ever Attended School | | | |
| no | 197 (15.9%) | 31 (23.3%) | 166 (15.0%) |
| yes | 1045 (84.1%) | 102 (76.7%) | 943 (85.0%) |
| Ever Enrolled in School | | | |
| no | 1180 (95.0%) | 121 (91.0%) | 1059 (95.5%) |
| yes | 62 (5.0%) | 12 (9.0%) | 50 (4.5%) |
| Work for Pay | | | |
| no | 919 (74.0%) | 104 (78.2%) | 815 (73.5%) |
| yes | 323 (26.0%) | 29 (21.8%) | 294 (26.5%) |
| Married/Live Together | | | |
| no | 90 (7.2%) | 8 (6.0%) | 82 (7.4%) |
| yes | 1152 (92.8%) | 125 (94.0%) | 1027 (92.6%) |
| Number of Pregnancies | | | |
| Mean (SD) | 4.15 (2.39) | 3.62 (2.11) | 4.22 (2.41) |
| Median [Min, Max] | 4.00 [0, 13.0] | 4.00 [0, 9.00] | 4.00 [0, 13.0] |
| Pregnant Currently | | | |
| Currently not pregnant | 1190 (95.8%) | 121 (91.0%) | 1069 (96.4%) |
| Currently pregnant | 52 (4.2%) | 12 (9.0%) | 40 (3.6%) |
| Ever Avoided Pregnancy | | | |
| no | 669 (53.9%) | 96 (72.2%) | 573 (51.7%) |
| yes | 573 (46.1%) | 37 (27.8%) | 536 (48.3%) |
| Ever Sought TB Treatment | | | |
| no | 1009 (81.2%) | 99 (74.4%) | 910 (82.1%) |
| yes | 233 (18.8%) | 34 (25.6%) | 199 (17.9%) |
| Duration on ART | | | |
| < 12 months | 118 (9.5%) | 8 (6.0%) | 110 (9.9%) |
| 12-23 months | 120 (9.7%) | 18 (13.5%) | 102 (9.2%) |
| 24 months or more | 346 (27.9%) | 70 (52.6%) | 276 (24.9%) |
| Not on ART | 658 (53.0%) | 37 (27.8%) | 621 (56.0%) |
| ARVs Detected | | | |
| ARVs detected | 916 (73.8%) | 61 (45.9%) | 855 (77.1%) |
| ARVs not detected | 326 (26.2%) | 72 (54.1%) | 254 (22.9%) |

Crosstab of HIV status and sex

| | HIV Positive | HIV Negative | Total |
|---|---|---|---|
| Male | 63 | 17 | 80 |
| Female | - | - | - |
| Transgender | 1 | 0 | 1 |
| Total | - | - | - |

Crosstab of HIV status and marital status

| | HIV Positive | HIV Negative | Total |
|---|---|---|---|
| Married/Cohabiting | - | - | - |
| Single | - | - | - |
| Separated/Divorced | 25 | 1 | 26 |
| Widowed | 2 | 2 | 4 |
| Total | - | - | - |

Crosstab of HIV status and age group

| | HIV Positive | HIV Negative | Total |
|---|---|---|---|
| - | - | - | - |
| Total | - | - | - |

Crosstab of HIV status and education level

| | HIV Positive | HIV Negative | Total |
|---|---|---|---|
| No education | 24 | 6 | 30 |
| Primary incomplete | 60 | 15 | 75 |
| Primary complete | 57 | 18 | 75 |
| Secondary incomplete | 54 | 23 | 77 |
| Secondary complete | 30 | 10 | 40 |
| Tertiary | 10 | 7 | 17 |
| Vocational/technical | 1 | 0 | 1 |
| Other | 0 | 0 | 0 |
| Total | - | - | - |

CD4 Cell Count by Age Group

| CD4 Cell Count | 18-24 years | 25-34 years | 35-44 years | 45-54 years | 55+ years |
|---|---|---|---|---|---|
| < 200 | 18 (40.9%) | 36 (37.5%) | 52 (38.2%) | 39 (34.2%) | 26 (29.5%) |
| - | - (38.6%) | 36 (37.5%) | 50 (36.8%) | 43 (37.7%) | 34 (38.6%) |
| >= 500 | 9 (20.5%) | 21 (21.9%) | 35 (25.7%) | 33 (28.9%) | 31 (35.2%) |

CD4 Cell Count by Gender

| CD4 Cell Count | Female | Male |
|---|---|---|
| < 200 | 97 (36.7%) | 74 (32.7%) |
| - | - (38.6%) | 85 (37.6%) |
| >= 500 | 68 (25.7%) | 62 (27.4%) |

Based on the tables above, we can see that several characteristics of the study participants are associated with their CD4 cell levels. For instance, being too sick to work in the last three months and having ever sought TB treatment both differ by CD4 group: in Table 1, 22.6% of the high CD4 group against 10.5% of the low CD4 group had been too sick to work, and 25.6% against 17.9% had sought TB treatment. Being on ART for 24 months or more (52.6% of the high CD4 group against 24.9% of the low CD4 group) and being currently pregnant (9.0% against 3.6%) are associated with high CD4 levels, whereas ARVs were detected in 45.9% of the high CD4 group compared with 77.1% of the low CD4 group. Moreover, the table suggests that some categories show no meaningful relationship with CD4 levels because they contain very few respondents, such as being a co-wife or a parent-in-law (one respondent each). Additionally, some characteristics are not evenly distributed among the high and low CD4 categories, such as the duration on ART and ever avoided pregnancy. Women who have been on ART for 24 months or more are more likely to have high CD4 levels than those who have been on ART for less than 24 months, and women who have ever avoided pregnancy are more likely to have low CD4 levels than those who have never avoided pregnancy.
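The comparison of ART duration groups can be checked directly from the counts reported in Table 1. The sketch below, in base R, rebuilds the Duration on ART by CD4 level cross-tabulation, computes the share of each duration group with a high or low CD4 count, and runs a Chi-square test of association. The object names (`art_by_cd4`, `row_pct`, `chi`) are illustrative assumptions; only the counts come from Table 1.

```r
# Counts taken from Table 1 (Duration on ART by CD4 level)
art_by_cd4 <- matrix(
  c( 8, 110,
    18, 102,
    70, 276,
    37, 621),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    Duration = c("< 12 months", "12-23 months", "24 months or more", "Not on ART"),
    CD4      = c("High", "Low")
  )
)

# Row percentages: share of each duration group with a high / low CD4 count
row_pct <- round(100 * prop.table(art_by_cd4, margin = 1), 1)
print(row_pct)

# Chi-square test of association between ART duration and CD4 level
chi <- chisq.test(art_by_cd4)
print(chi)
```

Read row-wise, about 20% of the women on ART for 24 months or more have a high CD4 count, compared with about 7% of those on ART for under 12 months. The same pattern of `prop.table()` followed by `chisq.test()` applies to any of the other categorical rows of Table 1.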
Overall, these findings suggest that various factors, such as ART adherence, TB treatment, and pregnancy status, are associated with CD4 cell levels in HIV-positive women in Kenya. Healthcare providers can use this information to target interventions and provide appropriate care to improve the CD4 cell levels of HIV-positive women in Kenya.

4.3.1. Household Characteristics

4.3.1.1 Relationship with the household head

From Figure 1 below, most (55%) of the respondents were the heads of their households, and they accounted for 46.6% of the respondents with a high CD4 level, as shown in Table 1. Only 0.5% were not related to the head of the household, and all of them recorded low CD4 levels.

CHAPTER FIVE: CONCLUSION & RECOMMENDATIONS

5.1 Conclusion

The general objective of this study was to investigate whether the level of CD4 cells can be affected by social and economic factors. Using data from Kenya, we achieved our objective by performing descriptive statistics on the selected variables. Based on the analysis and findings of this study, it can be concluded that social and economic factors have a significant impact on the level of CD4 cells in individuals with HIV. Specifically, factors such as education, employment, and the ability to work can affect CD4 cell count, which plays a critical role in managing HIV.

Our SVM model with a linear kernel predicted the CD4 category with a high accuracy of 94.96% using a set of 27 predictors. The precision for the low CD4 category was 99.24% and for the high CD4 category was 14.29%, indicating that the model was highly accurate in predicting individuals who have a low CD4 count, but less accurate in identifying those who do not have a low CD4 count. The recall for the low CD4 category was 95.62% and for the high CD4 category was 50%, indicating that the model was highly effective in identifying individuals who have a low CD4 count, but less effective in identifying those who have a high CD4 count.
The F1 score for the low CD4 category was 97.40% and for the high CD4 category was 22.22%. The SVM model performed well in predicting the CD4 category using the set of 27 predictors, with high precision and recall for the low CD4 category and lower precision and recall for the high CD4 category. This suggests that the model can effectively identify individuals who are at higher risk of having a low CD4 count, but may require further refinement to identify those who are at lower risk.

Furthermore, this study emphasizes the crucial role of enrolling in school and having gainful employment in the management of HIV. Education and employment can provide individuals with HIV access to resources, which can help them better manage their condition and improve their overall quality of life. These findings can be valuable for informing clinical decisions and interventions related to HIV treatment and management.

In conclusion, the findings of this study can be used to inform policymakers and healthcare providers on the importance of addressing social and economic factors in the management of HIV. This can help improve the effectiveness of HIV management programs and ultimately lead to better health outcomes for individuals living with HIV.

5.2. Recommendations

This study covered numerous issues that affect the public and, most importantly, a group of vulnerable people. Understanding what HIV positive individuals need is the first step to successfully managing this virus. Our research should shed light on the various challenges these people face in our communities and on the effort to address these challenges. Although the data we used is for Kenyan women living with HIV, the information gathered in this study reflects challenges faced by other people living with HIV worldwide. The government and relevant stakeholders in Kenya should consider these recommendations to improve the management of this virus.
Our research pointed out some issues that, if addressed, would improve the management of HIV in Kenya. As pointed out by Anand et al. (2009), implementing comprehensive positive prevention measures will go a long way in reducing the impact of HIV in Sub-Saharan Africa.

Based on the findings of this study, we recommend that healthcare providers and policymakers prioritize the education of individuals on the significance of enrolling in school and obtaining employment, particularly those living with HIV. This recommendation is based on our results, which showed that enrolling in school and being employed are positively correlated with CD4 cell count. Therefore, promoting education and employment opportunities for individuals living with HIV may have a positive impact on their CD4 cell count levels and, ultimately, their overall health.

Moreover, we recommend that healthcare providers offer comprehensive HIV management programs that focus on the social and economic factors that affect CD4 cell count levels. This study has shown that these factors have a significant impact on the management of HIV, and therefore healthcare providers must consider these factors when designing treatment plans for individuals living with HIV.

Lastly, we recommend that further studies be conducted to explore the role of other social and economic factors on CD4 cell count levels. This study focused on a limited number of variables, and more research is needed to gain a deeper understanding of the social and economic factors affecting CD4 cell count levels. By doing so, we can enhance the current knowledge on HIV management and develop more effective strategies to manage this pandemic.

CHAPTER SIX: REFERENCES

Anand, P., Hunter, G., Carter, I., Dowding, K., Guala, F., & Van Hees, M. (2009). The Development of Capability Indicators. Journal of Human Development and Capabilities, 10(1), 125–152. https://doi.org/10.1080/-

Barnett, D., Walker, B., Landay, A., & Denny, T. N.
(2008). CD4 immunophenotyping in HIV infection. Nature Reviews Microbiology, 6(S11), S7–S15. https://doi.org/10.1038/nrmicro1998 Daniel Niguse Mamo, Tesfahun Melese Yilma, Makida Fekadie, Sebastian, Y., Tilahun Bizuayehu, Mequannent Sharew Melaku, & Agmasie Damtew Walle. (2023). Machine learning to predict virological failure among HIV patients on antiretroviral therapy in the University of Gondar Comprehensive and Specialized Hospital, in Amhara Region, Ethiopia, 2022. BMC Medical Informatics and Decision Making, 23(1). https://doi.org/10.1186/s- Elul, B., Basinga, P., Nuwagaba-Biribonwoha, H., Saito, S., Horowitz, D., Nash, D., Mugabo, J., Mugisha, V., Rugigana, E., Nkunda, R., & Asiimwe, A. (2013). High Levels of Adherence and Viral Suppression in a Nationally Representative Sample of HIV-Infected Adults on Antiretroviral Therapy for 6, 12 and 18 Months in Rwanda. PLoS ONE, 8(1), e53586. https://doi.org/10.1371/journal.pone- Frescura, L., Godfrey-Faussett, P., Feizzadeh A., A., El-Sadr, W., Syarif, O., & Ghys, P. D. (2022). Achieving the 95 95 95 targets for all: A pathway to ending AIDS. PLOS ONE, 17(8), e-. https://doi.org/10.1371/journal.pone- Kamer, L. (2022). AIDS-related deaths leading countries worldwide 2021. Statista. https://www.statista.com/statistics/281396/countries-with-highest-number-of-aids-deaths/ KENPHIA. (2020). KENPHIA Preliminary Report. Www.health.go.ke. https://www.health.go.ke/wp-content/uploads/2020/02/KENPHIA-2018-PREL-REP-2020-HR3-final.pdf Masaba, R., Woelk, G., Siamba, S., Ndimbii, J., Ouma, M., Khaoya, J., Kipchirchir, A., Boniface Ochanda, & Okomo, G. (2023). Antiretroviral treatment failure and associated factors among people living with HIV on therapy in Homa Bay, Kenya: A retrospective study. PLOS Global Public Health, 3(3), e-–e-. https://doi.org/10.1371/journal.pgph- Ministry of Health. (2020). Kenya’s National HIV Survey Shows Progress Towards Control of the Epidemic. Nairobi, 20th February 2020 – MINISTRY OF HEALTH. 
Health.go.ke. https://www.health.go.ke/kenyas-national-hiv-survey-shows-progress-towards-control-of-the-epidemic-nairobi-20th-february-2020/#:~:text=The%20Government%20today%20released%20preliminary Nsanzimana, S., Rwibasira, G. N., Malamba, S. S., Musengimana, G., Kayirangwa, E., Jonnalagadda, S., Fazito Rezende, E., Eaton, J. W., Mugisha, V., Remera, E., Muhamed, S., Mulindabigwi, A., Omolo, J., Weisner, L., Moore, C., Patel, H., & Justman, J. E. (2022). HIV incidence and prevalence among adults aged 15-64 years in Rwanda: Results from the Rwanda Population-based HIV Impact Assessment (RPHIA) and District-level Modeling, 2019. International Journal of Infectious Diseases, 116, 245–254. https://doi.org/10.1016/j.ijid- NSCOP. (2020). Division of National AIDS & STI Control Program | Fight Against HIV and AIDS. Www.nascop.or.ke. https://www.nascop.or.ke/#:~:text=In%202019%2C%20a%20total%20of Odhiambo, A. (2020, April 8). Tackling Kenya’s Domestic Violence Amid COVID-19 Crisis. Human Rights Watch. https://www.hrw.org/news/2020/04/08/tackling-kenyas-domestic-violence-amid-covid-19-crisis Ridzuan, F. (2022). A Review on Data Cleansing Methods for Big Data. Sciencedirect.com. https://www.sciencedirect.com/science/article/pii/S-/pdf?md5=c8d975a00d9baaf0fdbcf1c527ccc96a&pid=1-s2.0-S--main.pdf Schober, P., & Vetter, T. R. (2020). Missing Data and Imputation Methods. Anesthesia & Analgesia, 131(5),-. https://doi.org/10.1213/ane- UNAIDS. (2014). UNAIDS report shows that 19 million of the 35 million people living with HIV today do not know that they have the virus. Www.unaids.org. https://www.unaids.org/en/resources/presscentre/pressreleaseandstatementarchive/2014/july/-prgapreport UNAIDS. (2020). UNAIDS data 2020. Www.unaids.org. https://www.unaids.org/sites/default/files/media_asset/2020_aids-data-book_en.pdf UNAIDS. (2022). 2022 GLOBAL HIV STATISTICS. https://www.unaids.org/sites/default/files/media_asset/UNAIDS_FactSheet_en.pdf UNICEF. (2017). 
Recent study finds that over 50% of children in Rwanda are victims of sexual, physical or emotional violence. Www.unicef.org. https://www.unicef.org/rwanda/press-releases/recent-study-finds-over-50-children-rwanda-are-victims-sexual-physical-or-emotional

World Health Organization. (2022). Vulnerable groups and key populations at increased risk of HIV. World Health Organization - Regional Office for the Eastern Mediterranean. https://www.emro.who.int/asd/health-topics/vulnerable-groups-and-key-populations-at-increased-risk-of-hiv.html

APPENDIX

```{r}
#Reading Data and Extracting Only KENYA Data
library(dplyr)
library(readr)
hivdt <- read_csv("C:/Users/HP/Downloads/phiacd4.csv")
hivtz <- hivdt[hivdt$Kenya==1,] #subsetting my dataframe to only include rows where the Kenya column is equal to 1
head(hivtz)
colnames(hivtz)
hivtz2 <- select(hivtz,-c(31,32,33)) #remove Country labelled variables
colnames(hivtz2)
hivtz3<-hivtz2 #duplicate datasets

#Number to Factor for hivtz2 dataset
str(hivtz2) #check for type of datatype in columns
numcol<-c(2:10,12,14:15,17:30) #number columns to be factors
hivtz2[numcol]<-lapply(hivtz2[numcol],factor) #convert to categorical variables
str(hivtz2) #check the structure of the dataset
```

```{r}
#Check for NAs
sapply(hivtz2, function(x) sum(is.na(x))) #check for no. of NAs in columns
mean(is.na(hivtz2)) #overall missing data proportion
apply(hivtz2,2,function(col)sum(is.na(col))/length(col)) #per column missing data proportion
nalist <- colnames(hivtz2)[apply(hivtz2,2,anyNA)]
nalist #list of columns with NAs
```

```{r}
#install.packages("simputation")
#Addressing the NAs
library(visdat) #visualize data
library(naniar) #visualizing and work with missing data
library(simputation) #simple imputation
library(tidyverse)
colnames(hivtz2)
hivtz2tbl <- as_tibble(hivtz2)
#########MISSING DATA VISUALIZATIONS #########
hivtz2tbl %>% vis_dat() #types and na distr
```

```{r}
hivtz2tbl %>% vis_miss() #distr of missing na
```

```{r}
hivtz2tbl %>% gg_miss_upset() #upset plot- nas in columns and interaction
```

```{r}
#Fill missing data using mice - 5% max for imputation for per column
library(tidyverse)
library(mice) #install.packages("mice")
set.seed(5)
# Check for missing values pattern
md.pattern(hivtz2)
hivtz2$`Pregnacy status now`<- as.numeric(hivtz2$`Pregnacy status now`)
# Impute missing values with mean
hivtz2$`Pregnacy status now`[is.na(hivtz2$`Pregnacy status now`)] <- mean(hivtz2$`Pregnacy status now`, na.rm = TRUE)
# Check for missing values after imputation
sum(is.na(hivtz2$`Pregnacy status now`))
###Exporting Dataset
# Note: write.csv() writes CSV content regardless of the file extension
write.csv(hivtz2,"C:/Users/HP/Downloads/filtered phiacd4 (1).xlsx")
```

**DATA ANALYSIS**

```{r message=FALSE, warning=FALSE}
library(readxl)
df<-read_excel("C:/Users/HP/Downloads/filtered phiacd4 (1) (1).xlsx")
head(df)
numcol2<-c(2:11,13,15:16,18:30) #number columns to be factors
df[numcol2]<-lapply(df[numcol2],factor) #convert to categorical variables
str(df)
colnames(df)
library(gmodels)
library(MASS)
#AGE
CrossTable(df$Age.at.first.sex,df$CD4.category, chisq = TRUE)
#HH ARRANGEMENTS
CrossTable(df$Relationship.with.family.head,df$CD4.category, chisq = TRUE)
CrossTable(df$Respondent.live.in.household,df$CD4.category, chisq = TRUE)
CrossTable(df$Ever.married.lived.together,df$CD4.category, chisq = TRUE)
#EDUCATION STATUS
CrossTable(df$Ever.attended.school,df$CD4.category, chisq = TRUE)
CrossTable(df$Ever.enrolled.in.school,df$CD4.category, chisq = TRUE)
CrossTable(df$Highest.level.of.education,df$CD4.category, chisq = TRUE)
CrossTable(df$Highest.grade.at.that.school.level,df$CD4.category, chisq = TRUE)
#ALCOHOL
CrossTable(df$Alcohol.drink.frequency,df$CD4.category, chisq = TRUE)
#URBAN
CrossTable(df$Urban.area.indicator,df$CD4.category, chisq = TRUE)
#WEALTHQ
CrossTable(df$Wealth.quintile,df$CD4.category, chisq = TRUE)
```

**OBJECTIVE 2: To fit a binary logistic regression model**

```{r}
#Dataset with numeric variables
library(readr)
mine <- df
```

```{r}
head(df)
```

```{r}
#Check and Remove Highly Correlated Columns
library(dplyr)
library(corrr)
library(tidyverse)
minenocd4 <- subset(mine, select = -CD4.category)
#inenocd4<- select(minenocd4, -c(12,14,30))
minenocd4 <- mine[, -c(12, 14, 30)]
```

```{r}
#check unique values for dataset
sapply(minenocd4, function(x) length(unique(x)))
```

```{r}
res.cor<-correlate(minenocd4, method = "pearson", use = "pairwise.complete.obs")
res.cor
res.cor %>%
  gather(-term, key = "colname", value = "cor") %>%
  filter(abs(cor)>0.85)
```

```{r}
#check unique values for dataset
sapply(minenocd4, function(x) length(unique(x)))
```

```{r}
library(gtsummary)
# dplyr:: prefix because MASS, attached above, also exports select()
df <- dplyr::select(minenocd4, -Bought.sold.sex.in.the.past.12.months)
head(minenocd4)
```

```{r}
#Model Formulation
modl<-glm(CD4.category ~ Age + Bought.sold.sex.in.the.past.12.months +
            Whether.ARVs.detected + Duration.of.time.on.ART + On.ART +
            LAg..recent.long.term.infection + Wealth.quintile,
          data = minenocd4, family = "binomial")
print(summary(modl), signif.stars = TRUE)
modl %>% tbl_regression(exponentiate = FALSE)
```

```{r}
# Split data into training and testing sets
train_index <- sample(nrow(minenocd4), 0.8 * nrow(minenocd4))
trainer <- minenocd4[train_index, ]
tester <- minenocd4[-train_index, ]
```

```{r}
#Final Model
library(gtsummary)
final_modl <- glm(CD4.category ~ Relationship.with.family.head + worksicklast3mon +
                    attendedschool + enrolledschool + wrkpaymtlst12mon +
                    marriedorlivedtogether + nopregnancies + pregstatusnw +
                    avoidpregnancy + soughtTBtrtment + timeonART + ARVsdetected,
                  data = trainer, family = "binomial")
print(summary(final_modl), signif.stars = TRUE)
final_modl %>% tbl_regression(exponentiate = TRUE)
#Odds Ratio
exp(coef(final_modl))
#Testing
res<-predict(final_modl,tester, type="response")
res
#Confusion Matrix
table(ActualValue = tester$CD4cat, PredictedValue = res > 0.5)
```

```{r}
#Finding the correct threshold for the model to reduce the false positive rate
#Change res to training dataset
res<-predict(final_modl,trainer, type="response")
library(ROCR)
ROCRPred <- prediction(res, trainer$CD4cat) #check prediction
ROCRPerf <- performance(ROCRPred,"tpr","fpr")
plot(ROCRPerf,colorize = TRUE,print.cutoffs.at = seq(0.1, by= 0.1)) #check performance
#tpr = true positive rate
#fpr = false positive rate
```

```{r}
#Check the threshold
res<-predict(final_modl,tester, type="response")
table(ActualValue = tester$CD4cat, PredictedValue = res > 0.2)
#(180+9)/-) using 0.2 = 75.9%
table(ActualValue = tester$CD4cat, PredictedValue = res > 0.3)
#(200+7)/-) using 0.3 = 83.13%
table(ActualValue = tester$CD4cat, PredictedValue = res > 0.4)
#(215+4)/-) using 0.4 = 87.95%
#Using 0.4, which does not excessively reduce the efficiency but reduces false positive by one
```

**OBJECTIVE 2: To fit a support vector machine model**

```{r}
# Install and load e1071 package
library(e1071)
```

Before building the SVM model, we need to split the dataset into training and testing sets. We will use 70% of the data for training and 30% for testing.
The following code splits the data into training and testing sets:

```{r}
# Set seed for reproducibility
set.seed(123)
library(dplyr)
#check unique values for dataset
sapply(df, function(x) length(unique(x)))
#remove the last variable because it has one level, and two others which are not scalable and not relevant
# dplyr:: prefix because MASS, attached earlier, also exports select()
df <- dplyr::select(df, -c(12,14,30))
# Split data into training and testing sets
train_index <- sample(nrow(df), 0.7 * nrow(df))
train_data <- df[train_index, ]
test_data <- df[-train_index, ]
```

Now, we can build the SVM model using the 'svm' function from the 'e1071' package. We will use a linear kernel and the default values for other parameters. The code for building the SVM model is as follows:

```{r}
# Build SVM model
svm_model <- svm(CD4.category ~ ., data = train_data, kernel = "linear")
```

In the above code, we specified CD4.category as the target variable and used all other variables as predictors. We also specified the kernel as 'linear'.

We can now use the model to make predictions on the test data using the 'predict' function. The code for making predictions is as follows:

```{r}
# Make predictions on test data
# predict() omits rows with missing values, so keep only complete rows
# to make the predictions line up with the observed classes
test_data <- na.omit(test_data)
svm_pred <- predict(svm_model, newdata = test_data)
```

Finally, we can evaluate the performance of the SVM model using various metrics such as accuracy, precision, recall, and F1 score. Here's the code for calculating these metrics:

```{r}
# Calculate performance metrics
# Rows of the table are the predicted classes, columns are the actual classes
conf_mat <- table(svm_pred, test_data$CD4.category)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
precision <- diag(conf_mat) / rowSums(conf_mat) # correct / all predicted as that class
recall <- diag(conf_mat) / colSums(conf_mat) # correct / all actually in that class
f1_score <- 2 * precision * recall / (precision + recall)
# Print performance metrics
cat("Accuracy:", accuracy, "\n")
cat("Precision:", precision, "\n")
cat("Recall:", recall, "\n")
cat("F1 Score:", f1_score, "\n")
```

In the above code, we first calculated the confusion matrix using the 'table' function.
Then, we calculated accuracy, precision, recall, and F1 score from the confusion matrix. Finally, we printed the performance metrics to the console.

**OBJECTIVE 3: To investigate the factors affecting the CD4 levels of HIV+ women in Kenya**

```{r}
library(ggplot2)
library(psych)      # describe()
library(reshape2)   # wide format to long format
library(scales)
library(moments)
library(lessR)      # bar charts for categorical variables and detailed visual normality tests
library(DT)         # generating datatables
library(dplyr)      # selecting and sorting
library(tidyverse)
library(ggpubr)     # normality graphs
library(gmodels)    # cross tables
```

```{r}
# Charts for categorical variables
# Relationship with head
table(df$Relationship.with.family.head)
prop.table(table(df$Relationship.with.family.head))
BarChart(Relationship.with.family.head, data = df, horiz = TRUE, sort = "-",
         stat = "count", main = "Relationship With Head",
         ylab = "Count", xlab = "Type of Relationship with Head")
CrossTable(df$Relationship.with.family.head, df$CD4.category)
```

```{r}
# Time on ART
table(df$Duration.of.time.on.ART)
prop.table(table(df$Duration.of.time.on.ART))
BarChart(Duration.of.time.on.ART, data = df, sort = "-", stat = "proportion",
         main = "Time on ART for the Respondents",
         ylab = "Proportion", xlab = "Time Range")
CrossTable(df$Duration.of.time.on.ART, df$CD4.category)
```

```{r}
# ARVs detected
table(df$Whether.ARVs.detected)  # report as a statement in the paper
prop.table(table(df$Whether.ARVs.detected))
CrossTable(df$Whether.ARVs.detected, df$CD4.category)
```

```{r}
# Categorical summaries for the Yes/No (boolean) variables
dfsickwork     <- table(df$Sick.to.work.last.three.months)
dfgosch        <- table(df$Ever.attended.school)
dfenrollsch    <- table(df$Ever.enrolled.in.school)
dfworkpay      <- table(df$Work.for.payment.in.last.12.months)
dflivetogether <- table(df$Ever.married.lived.together)
dfavoidpreg    <- table(df$Avoiding.pregnancy)
dftbtreat      <- table(df$Ever.sought.TB.treatment)
```

```{r}
df.cat2 <-
  rbind(dfsickwork, dfgosch, dfenrollsch, dfworkpay, dflivetogether, dfavoidpreg, dftbtreat)
df.cat2
rownames(df.cat2) <- c("Sick to Work Last 3 Months", "Ever Attended School",
                       "Ever Enrolled in School", "Worked for Pay in Last 12 Months",
                       "Ever Married/Lived Together", "Avoiding Pregnancy",
                       "Ever Sought TB Treatment")
colnames(df.cat2) <- c("No", "Yes")

# Transform the matrix of counts to long format
long <- melt(df.cat2)
long
colnames(long) <- c("Variable", "Condition", "Value")
colnames(long)

# Convert counts to proportions using a denominator of 1242
long$Proportion <- long$Value / 1242

# Grouped bar plot using ggplot2
ggplot(long, aes(x = Variable, y = Proportion, fill = Condition,
                 label = scales::percent(Proportion))) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_y_continuous(labels = function(x) paste0(x * 100, "%")) +
  labs(x = "Variable", y = "Frequency (%)", title = "Boolean Categorical Variables") +
  theme_classic() +
  geom_text(position = position_dodge(width = .9),  # move to center of bars
            vjust = -0.5,                           # nudge above top of bar
            size = 1.9) +
  coord_flip()
```

```{r}
# Numerical data
dfpreg <- describe(as.numeric(df$Number.of.pregnancies))
dfpreg
datatable(dfpreg)

# If the p-value of the Shapiro-Wilk test is greater than 0.05, the data are consistent with a normal distribution.
# If it is below 0.05, the data deviate significantly from a normal distribution.
shapiro.test(as.numeric(df$Number.of.pregnancies))

# Density plot
ggdensity(as.numeric(df$Number.of.pregnancies),
          main = "Density plot of Number of Pregnancies",
          xlab = "Number of Pregnancies")

# Cross table with CD4
library(dplyr)
library(table1)
colnames(df)
df3 <- select(df, c(2, 4, 5, 6, 9, 10, 11, 13, 15, 26, 21, 23))
colnames(df3)
```

```{r}
head(df3)
```

```{r}
labels <- list(
  variables = list(
    Relationship.with.family.head      = "Relationship With Head",
    Sick.to.work.last.three.months     = "Sick to Work last 3 Months",
    Ever.attended.school               = "Ever Attended School",
    Ever.enrolled.in.school            = "Ever Enrolled in School",
    Work.for.payment.in.last.12.months = "Work for Pay",
    Ever.married.lived.together        = "Married/Live Together",
    Number.of.pregnancies              = "Number of Pregnancies",
    Avoiding.pregnancy                 = "Ever Avoided Pregnancy",
    Ever.sought.TB.treatment           = "Ever Sought TB Treatment",
    Duration.of.time.on.ART            = "Duration on ART",
    Whether.ARVs.detected              = "ARVs Detected"),
  groups = list("", "CD4 Level"))

levels(df$CD4.category) <- c("High", "Low")
strata <- c(list(Total = df), split(df, df$CD4.category))

dftbl3 <- table1(strata, labels, groupspan = c(1, 2),
                 rowlabelhead = "Characteristics", overall = "Total",
                 caption = "CD4 Levels against the Characteristics",
                 footnote = "CD4 Levels against the Significant Variables",
                 data = df)
print(dftbl3)
```
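
As a cross-check on the evaluation step under Objective 2, the sketch below computes accuracy and per-class precision, recall, and F1 score from a small confusion matrix. The counts in `toy_cm` and the helper `class_metrics` are hypothetical and introduced only for illustration; they are not the study's data or results. The sketch assumes the same layout as a `table(predicted, actual)` call, with predicted classes in rows and actual classes in columns.

```{r}
# Hypothetical confusion matrix (illustrative counts only, not the study's results).
# Rows are predicted classes, columns are actual classes.
toy_cm <- matrix(c(90, 5,
                    3, 2),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Predicted = c("1", "2"),
                                 Actual    = c("1", "2")))

# Hypothetical helper: per-class metrics from a square confusion matrix
class_metrics <- function(cm) {
  tp        <- diag(cm)            # correctly classified cases per class
  precision <- tp / rowSums(cm)    # share of cases predicted as class k that are truly k
  recall    <- tp / colSums(cm)    # share of actual class k cases that were predicted as k
  f1        <- 2 * precision * recall / (precision + recall)
  list(accuracy  = sum(tp) / sum(cm),
       per_class = data.frame(precision, recall, f1))
}

toy_metrics <- class_metrics(toy_cm)
print(toy_metrics)
```

With an imbalanced outcome such as this toy example, overall accuracy is driven mostly by the larger class, so the per-class precision and recall for the smaller class are worth reporting alongside it.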
