Data Cleaning Report
DATA PREPROCESSING AND TRANSFORMATION
Group 5B|0801-DV Associate internship
INTRODUCTION
Data preprocessing and transformation are crucial steps in the data analysis and machine
learning workflow, and in the broader data mining process. These steps involve cleaning,
organizing, and structuring raw data to make it suitable for analysis or for training
machine learning models, and doing them properly can significantly improve the
performance and reliability of the resulting models. Common steps in data preprocessing
include:
Data Cleaning: This involves identifying and correcting errors or inconsistencies in the
data, such as missing values, outliers, and duplicates. Various techniques can be used for
data cleaning, such as imputation, removal, and transformation.
Data Integration: This involves combining data from multiple sources to create a
unified dataset. Data integration can be challenging as it requires handling data with
different formats, structures, and semantics. Techniques such as record linkage and data
fusion can be used for data integration.
USER DATA
The user data, which covers every user who has ever created an account on the
Excelerate platform, contains 8 columns: the user's preferred sponsors, gender, country of
residence at signup, degree or academic level, signup date and time, city of residence, zip
or postal code, and a final column indicating whether the user found the platform through
social media. The data has a total of 27,563 rows, corresponding to the same number of
users with an account on Excelerate.
OPPORTUNITY WISE DATA
The Opportunity Wise Data comprises 17 columns and 20,323 rows. The rows were
reduced to 11,482 after duplicates were removed. The Profile ID column was used to
identify duplicates because it is the most reliably unique field in the dataset.
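As a minimal sketch of the duplicate check, assuming the Profile ID values sit in column A with a header in A1 (a placement assumed for illustration), a helper column can flag repeated IDs before removal:
=COUNTIF($A$2:A2, A2) > 1
Filling this down marks every repeat of a Profile ID after its first occurrence as TRUE; those rows can then be filtered and deleted, or the built-in Remove duplicates tool can be used instead.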
PREPROCESS THE DATASETS
1. Handling Outliers and Anomalies
Opportunity Sign-Up and Completion Data:
1. Select the Rewards Amount column.
2. Copy the selected data into a new sheet and sort it in ascending order, with $50 as the
lowest non-null value, values up to $2,500 following, and null values treated as 0.
3. Apply the interquartile range method to the range A1:A1449.
Under this method, an outlier is any data value that is
1. less than the first quartile (Q1) minus 1.5 × IQR, or
2. greater than the third quartile (Q3) plus 1.5 × IQR,
where Interquartile Range (IQR) = Third Quartile (Q3) - First Quartile (Q1).
The reward amounts were also checked manually; the values observed were $2,500,
$1,000, $500, $250, $150, $120, $100, $80, $50, and $0, with the majority being null
values.
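A minimal Sheets sketch of the IQR check, assuming the sorted reward amounts sit in A1:A1449 and helper cells C1:C3 hold the quartile values (cell placements are assumptions for illustration):
C1: =QUARTILE($A$1:$A$1449, 1)    → first quartile, Q1
C2: =QUARTILE($A$1:$A$1449, 3)    → third quartile, Q3
C3: =C2-C1                        → IQR = Q3 - Q1
B1: =OR(A1 < $C$1 - 1.5*$C$3, A1 > $C$2 + 1.5*$C$3)    → TRUE flags an outlier; fill down
QUARTILE and OR are standard Sheets/Excel functions.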
2. Normalize or Scale Relevant Features:
USER DATA
The null handling applied to the user data at this stage (modal imputation of the Gender
column, removal of rows with a null current student status, and case normalization of the
City column) is described in detail under "Addressing Data Quality Issues" below.
OPPORTUNITY WISE DATA
Duplicate entries were removed. To normalize and standardize the numerical Reward
Amount and Skill Points columns, entries whose status description was "Rejected" or
"Withdrawn" were cleared. Missing or null Reward Amount and Skill Points entries were
then filled with the modal value of the filled entries for each opportunity category type.
Using the Badge Name column, this modal value also corresponds to the lowest possible
skill points and reward amount for participation in each category, taking the reward
amount and skill points of entries marked "Completed" in the status description as the
reference for each category. Opportunity categories with no "Completed" entry at all were
filled with 0 for both skill points and reward amount.
3. Addressing Data Quality Issues:
USER DATA
HANDLING NULLS
The null cells in the Gender column were filled with the modal gender, i.e. the gender
with the highest frequency, which is "Male"; because gender is nominal data, the mode is
the appropriate fill value. The columns were filled to ensure consistency and reduce
discrepancies.
Entire rows containing null cells in the Current Student Status column were removed,
because the large majority of rows with a null entry in that column also had null entries in
most of the other columns. Leaving these cells blank, or filling them with statistical
methods, could skew the analysis given how frequent the null entries were.
The alphabetical case of the City column was standardized, as some entries were in
uppercase and some in lowercase. All remaining null entries were removed along with the
rows that had a null current student status.
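A minimal sketch of these fixes, assuming (hypothetically) that Gender is in column B and City in column F of the user sheet:
=COUNTIF($B$2:$B$27564, "Male")    → frequency per gender; repeat per category to confirm the mode
=IF(B2="", "Male", B2)             → fill blanks with the modal gender; fill down in a helper column
=PROPER(F2)                        → normalize the case of City entries
COUNTIF, IF, and PROPER are standard Sheets/Excel functions; the column letters are assumptions.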
OPPORTUNITY WISE DATA
HANDLING NULL VALUES
Duplicate entries were removed, and the status description column was cleared of entries
marked "Rejected" or "Withdrawn". The missing or null entries for reward amount and
skill points were filled with the modal value of the filled entries (completed opportunities)
for each opportunity category type. This modal value also corresponds to the lowest
possible points and reward amount for participation in each category, using the reward
amount and skill points of entries with "Completed" in the status description as the
reference for each category type. Opportunity categories with no "Completed" status and
no rewards awarded across all available entries were filled with 0 for both skill points and
reward amount.
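A minimal sketch of the per-category fill, assuming (hypothetically) that Opportunity Category is in column H, status description in column E, Reward Amount in column J, and a helper table lists each category in P2:P6 with its reference amount in Q2:Q6:
Q2: =MINIFS($J:$J, $H:$H, P2, $E:$E, "Completed")    → lowest completed reward per category, which the report notes equals the modal value
=IF(J2="", IFERROR(VLOOKUP(H2, $P$2:$Q$6, 2, FALSE), 0), J2)    → fill blanks from the helper table, defaulting to 0 when a category has no completed entry
MINIFS, VLOOKUP, IF, and IFERROR are standard Sheets/Excel functions; all column letters here are assumptions.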
4. Feature Engineering:
Feature engineering is the process of leveraging existing data to create or derive new
variables that are not in the original training set.
* Highlighted and copied the Opportunity Name, Skill Points, and Reward Amount
columns into a new sheet.
* Used sort and find-and-replace to locate opportunity names with null skill points and
null rewards, setting those null values to zero.
* Sorted and filtered for unique values, which reduced the rows from 10,737 to 4,749,
then began the analysis.
* Used the COUNTIF formula to count the occurrences of each unique opportunity.
* Used the AVERAGEIF formula to determine the average skill points earned and the
average reward for each opportunity.
* Divided reward by skill points to get the reward-to-skill-points ratio.
* Used the IF function to grade opportunities based on skill points, and the same
approach to categorize opportunities based on reward amount.
* Used the SUMIF function to calculate the total points and the total reward for each
opportunity.
* Performed a percentage analysis of opportunities by skill-points grade using the
COUNTA function. A sketch of these formulas follows below.
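A minimal Sheets sketch of these steps, assuming the copied columns sit in A (Opportunity Name), B (Skill Points), and C (Reward Amount), with the unique opportunity names in column E and derived values in F onward (all placements are assumptions):
F2: =COUNTIF($A:$A, E2)              → occurrences of each unique opportunity
G2: =AVERAGEIF($A:$A, E2, $B:$B)     → average skill points per opportunity
H2: =AVERAGEIF($A:$A, E2, $C:$C)     → average reward per opportunity
I2: =H2/G2                           → reward-to-skill-points ratio
J2: =IF(G2>=500, "High", IF(G2>=100, "Medium", "Low"))    → example grade; the actual cut-offs are not stated in the report
K2: =SUMIF($A:$A, E2, $B:$B)         → total skill points per opportunity
L2: =SUMIF($A:$A, E2, $C:$C)         → total reward per opportunity
=COUNTIF($J$2:$J$4750, "High")/COUNTA($J$2:$J$4750)    → share of opportunities in a given grade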
5. One Hot Encoding:
One-hot encoding is a technique used in machine learning and natural language
processing to represent categorical variables as binary vectors. In this encoding scheme,
each category or class is represented by a unique binary value, and all values are mutually
exclusive.
Here's how one-hot encoding works:
Identification of Categories: Identify all the unique categories or classes in the
categorical variable.
Assigning Binary Values: Assign a unique binary value to each category. Typically,
these binary values are 0 or 1.
Creating Binary Vectors: Represent each data point by a binary vector whose length
equals the number of unique categories. Each position in the vector corresponds to a
category, and exactly one position holds the value 1, indicating the presence of that
category. For example, with the categories {Internship, Event, Course}, "Event" would be
encoded as (0, 1, 0).
Process: Datasheet 1
- Performed one-hot encoding on the "Gender" column, creating separate indicator
columns for "Male" and "Female", and used the same approach to create a new column
indicating whether "TRUE" or "FALSE" is present in another column.
- To create the one-hot columns for "Male" and "Female" in Sheets, with the data starting
from cell A1 and the gender values in column F (as the formulas reference F2), the
following formulas were used in the new indicator columns and filled down from row 2:
For the "Male" column:
=IF(F2="Male", 1, 0)
For the "Female" column:
=IF(F2="Female", 1, 0)
- To one-hot encode the "IsFromSocialMedia" column, indicator columns for "TRUE"
and "FALSE" were created the same way, with the source values in column M (as the
formulas reference M2) and filled down from row 2:
For the "TRUE" column:
=IF(M2="TRUE", 1, 0)
For the "FALSE" column:
=IF(M2="FALSE", 1, 0)
FINAL OUTPUT:
Process : Datasheet 2
-- Performed one-hot encoding on the "Gender" column, creating separate indicator
columns for "Male" and "Female".
-- With the gender values in column B (as the formulas reference B2), the indicator
columns use the following formulas, filled down from row 2:
For the "Male" column:
=IF(B2="Male", 1, 0)
For the "Female" column:
=IF(B2="Female", 1, 0)
-- Performed one-hot encoding on the "Opportunity Category" column, creating separate
indicator columns for the 5 categories: Internship, Event, Course, Competition, and
Engagement. With the category values in column H (as the formulas reference H2), the
indicator columns use the following formulas, filled down from row 2:
For the "Internship" column:
=IF(H2="Internship", 1, 0)
For the "Event" column:
=IF(H2="Event", 1, 0)
For the "Course" column:
=IF(H2="Course", 1, 0)
For the "Competition" column:
=IF(H2="Competition", 1, 0)
For the "Engagement" column:
=IF(H2="Engagement", 1, 0)
FINAL OUTPUT:
6. Documentation:
1. Handling Outliers and Anomalies
Action: Identify outliers and anomalies.
Rationale: Outliers can distort analysis and negatively affect model performance.
Identifying and handling them ensures a more accurate representation of the data.
2. Normalize or Scale Relevant Features
Action: Normalize or scale numerical features.
Rationale: Features with different scales can impact model training differently.
Normalizing or scaling ensures consistency and improves model convergence.
3. Addressing Data Quality Issues
Action: Identify and address missing values.
Rationale: Missing values can lead to biased or inaccurate results. Addressing data
quality issues ensures the reliability of the analysis.
4. Feature Engineering
Action: Create new features.
Rationale: Feature engineering introduces new information to enhance the dataset,
potentially improving model performance.
5. Data Transformation (One-Hot Encoding)
Action: Perform additional transformations, including one-hot encoding.
Rationale: Data transformations, such as log transformations or encoding, help meet
model assumptions, handle skewed distributions, and capture non-linear relationships,
contributing to improved model fit and the capture of complex patterns. Many machine
learning algorithms require numerical input; one-hot encoding transforms categorical
variables into a binary format, creating a binary column for each category so that the
algorithm can consider each category independently.
-Handling Outliers and Anomalies:
Outliers in sales data might distort the overall trend.
Identifying anomalies helps ensure the accuracy of customer demographics analysis (Q2).
Modified Rationale:
Handling outliers and anomalies is crucial to maintaining the integrity of the data,
ensuring that the overall sales trend and customer demographics insights are not skewed
by extreme values.
-Normalize or Scale Relevant Features:
Scaling features is essential for accurate analysis of different product categories'
contributions.
Modified Rationale:
Scaling relevant features ensures that the contributions of different product categories are
comparable, aiding in the analysis of their impact on overall sales.
-Addressing Data Quality Issues:
Addressing data quality issues ensures the reliability of insights related to factors
influencing sales.
Modified Rationale:
Ensuring data quality is critical to reliably identify and analyze factors like seasonality or
promotional events that might influence sales.
-Feature Engineering:
Creating new features can enhance the dataset for better predictions of sales trends.
Modified Rationale:
Feature engineering contributes to a more nuanced understanding of the factors
contributing to the overall sales performance trend.
-Data Transformation (One-Hot Encoding):
One-hot encoding enables the analysis of customer demographics.
Modified Rationale:
The one-hot encoding of categorical variables, such as customer demographics, facilitates
a detailed analysis of patterns and insights that may affect sales.
CONDUCT INITIAL ANALYSIS
Trends and Comparative Analysis
Trend Analysis
Trend analysis involves examining data over time to identify patterns, tendencies, or
movements that may indicate a consistent direction in a particular variable.
In the context of a sales performance dashboard, trend analysis is crucial for
understanding how sales have been evolving.
Trend analysis is a quantitative review of what happens over time. It entails the collection
of data from multiple periods and plotting the information on a line graph for further
analysis. There are three types of trend analysis: geographic, temporal, and intuitive.
1) Trend Analysis > Using User Data
a) User Growth Over Time: Utilized timestamp information to visualize user sign-ups
and identify periods of growth (a counting sketch follows the outcomes below).
Outcome:
• The growth period started in January 2023, with the highest growth in signups
occurring in July 2023.
• Signups declined from July 2023 to November 2023.
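A minimal sketch of the monthly signup counts behind this trend, assuming (hypothetically) that the signup timestamps sit in column E of the user sheet, with a helper column G holding the month:
G2: =TEXT(E2, "yyyy-mm")              → extract the signup month; fill down
=COUNTIF($G$2:$G$27564, "2023-07")    → signups in a given month, repeated per month and plotted as a line
TEXT and COUNTIF are standard Sheets/Excel functions; the column letters are assumptions.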
b) Demographic Shift: Used the categorical variables (gender, IsFromSocialMedia) to
observe changes over time.
Outcome:
• Users who indicated their gender as "Other" have the highest number of signups,
followed by "Male".
• The majority of users signed up through social media platforms.
2) Trend Analysis > Using Opportunity Dataset
a) Opportunity Participation Trend: Analyzed the trend (growth or decline) in the
number of learners signing up for opportunities over time.
Outcome:
• Participation peaked on August 7, 2023.
b) Completion Rate: Calculated and analyzed the trend in the completion rate of the
various opportunities, to identify which opportunities consistently show a high
completion rate over time.
Outcome:
• Created a new "Completion Status" column from the status description using the IF
formula (where a status of "Reward Awards" maps to Completed, else Not Completed),
then calculated the completion rate as the count of completed users divided by the total
number of users: 1285 ÷ 11481 = 11.19%. A formula sketch follows below.
• Internships have the highest completion rate over time; on July 13th, 2022, the
internship opportunity had a 100% completion rate.
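A minimal Sheets sketch of the completion-rate calculation, assuming (hypothetically) that the status description is in column E and the derived Completion Status in column R:
R2: =IF(E2="Reward Awards", "Completed", "Not Completed")    → derive the completion status; fill down
=COUNTIF($R$2:$R$11482, "Completed")/COUNTA($R$2:$R$11482)   → completion rate, here 1285/11481 ≈ 11.19%
Column letters and the exact status text are assumptions based on the report's description.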
Demographic comparative analysis
Demographic comparative analysis was carried out using statistical charts to visualize
and understand demographic differences among user groups, including the distribution of
student statuses across regions and their completion rates. The demographic distribution
of users spans 170 countries across all continents. Chart 1 shows the top 10 countries
with the highest frequency of users, from India to Nepal, and Chart 2 shows the map
distribution.
CHART 1
CHART 2
The chart below shows the demographic distribution of opportunity signups for the top
five countries, with India the highest, followed by Nigeria, the USA, Pakistan, and
Ghana, along with the completion rates per region.
The rewards awarded chart below shows the student statuses for completed opportunities
together with their rewards awarded. Of the 1,449 completed opportunities in total,
graduate programme students showed the highest count, while high school students had
the lowest with just 73 opportunities completed.
The opportunity category chart shows the most preferred opportunity type based on the
opportunity signup data. Internship opportunities showed by far the highest frequency
with 7,966 entries, and Engagement the lowest with 60 entries. The pie chart below
shows the percentage of users preferring each sponsor.
CROSS DATA SET ANALYSIS
A major challenge was faced in combining the two datasets for joint analysis using
different tools, partly due to the contrast in their numbers of rows. The observations made
by comparing charts from both datasets are as follows (a comparison sketch follows the
list):
- The user data shows 27,563 users with an account, while fewer than half of them
(11,482) proceeded to sign up for opportunities.
- The demographic signups follow the same trend, with the same top five countries in
both datasets: India, the USA, Pakistan, Ghana, and Nigeria.
- The current student status also shows a level of consistency, with graduate programme
students forming the largest group in both datasets.
- Male is the most frequent gender in both datasets after the null entries were handled.
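A minimal sketch of one such cross-dataset comparison, assuming (hypothetically) two sheets named Users and Opportunities with the country column in Users!C and Opportunities!G:
=COUNTIF(Users!$C:$C, "India")            → accounts per country in the user data
=COUNTIF(Opportunities!$G:$G, "India")    → opportunity signups per country; repeat per country to compare the top-five rankings
Sheet names and column letters are assumptions for illustration.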
CREATING WIREFRAME
To create a wireframe, begin by defining the purpose and target audience of your project.
Gather relevant information and sketch initial ideas on paper, focusing on the layout and
structure. Choose a digital tool like Balsamiq, Adobe XD, or Figma for more detailed
work. Establish the basic page structure, including the header, footer, navigation, and
content areas. Add key elements such as buttons, forms, images, and text boxes,
prioritizing functionality over visual design. Define user interactions and navigation,
considering how users will move through the interface. Seek feedback from team
members or stakeholders and refine the wireframe accordingly. Iterate through this
process, making adjustments until you have a solid foundation. Annotate the wireframe to
explain specific features and functionalities. Present the wireframe to stakeholders for
approval, and once finalized, it can serve as a guide for designers and developers during
the visual design and development phases.
USER DATA DASHBOARD:
OPPORTUNITY WISE DATA:
ANNOTATIONS AND EXPLANATION
USER DATA DASHBOARD:
Data visualization: The dashboard uses a variety of charts and graphs to visualize the
data, including bar charts, pie charts, and line graphs. This is a good way to present a
variety of data in a way that is easy to understand.
Data selection: The dashboard allows users to select a specific year from a dropdown
menu. This is a good way to focus on the data that is most relevant to the user.
User information: The dashboard shows some basic information about the user, such as
their gender and the country they are from. This can be helpful for understanding the
context of the data.
OPPORTUNITY WISE DATA:
The dashboard seems to be focused on opportunity data, with sections for the total
number of users, total opportunities, completion rate, most completed opportunities, most
popular opportunities, and top cities in the US.
There is also a section for user gender and top 10 countries.
The layout is fairly simple, with a sidebar for navigation and a main content area.
The data is visualized using a variety of charts and graphs.
FLEXIBILITY AND ITERATIONS
*Embrace low-fidelity: Start with low-fidelity wireframes – simple sketches or diagrams
focusing on layout and information hierarchy. This makes it easier and faster to iterate
and adapt as new information emerges.
*Utilize modular components: Design key elements like buttons, cards, and charts as
independent, reusable components. This allows you to easily swap or rearrange them as
needed without affecting the overall structure.
*Focus on information architecture: Prioritize the organization and flow of information
over visual details. This ensures the core structure remains adaptable even if the visual
design evolves.
*Leverage white space: Leave ample white space on your wireframes. This provides
room for incorporating additional data points or functionalities without feeling cluttered.
*Document assumptions and considerations: Clearly document any assumptions you
make or design decisions you take. This transparency helps everyone understand the
rationale behind the wireframe and facilitates smoother future adjustments.
*Use digital wireframing tools: Consider using digital wireframing tools that allow easy
editing and version control. This makes it simple to track changes, revert to previous
iterations, and collaborate with stakeholders.