Data Cleaning and Title Optimization
DATA CLEANING AND TITLE OPTIMIZATION
Introduction
The data cleaning and preparation procedure carried out on the supplied Product Data
dataset will be documented in this report. The goal was to improve the quality of the data
by solving problems which include duplicate records, missing values and inconsistent
formatting. For better readability and SOE, a “SHORT_TITLE” function was created to
enhance the product titles.
Data Overview
Above is the first look of the Product Data dataset. The raw dataset consists of 3847 rows
and 6 columns which has missing values and inconsistency and some key columns which
are; PRODUCT_TITLE, Title, BULLET_POINTS, Description, PRODUCT_TYPE_ID,
PRODUCT_LENGTH, SHORT_TITLE.
Data Cleaning Optimization
Handling Missing Values
The dataset was inspected to identify inconsistency, missing values and inaccuracies the
following were identified and addressed
I.
BULLET_POINTS: Originally had 37% missing values. The missing values were
replaced with N/A to ensure consistency in the analysis
II.
Description: Contained null values, which were initially replaced with “None” for
proper missing values handling.
III.
PRODUCT_TYPE_ID and PRODUCT_LENGTH: Had 89 missing values each. The
missing values were cleared after applying the ‘Remove Duplicate’ function, and the
columns were converted to an integer format.
Removing Duplicate Entries
The dataset was checked for duplicate, 217 duplicates values were found and removed,
ensuring data integrity. The duplicate was removed using the PRODUCT_TITLE which
affected every other column.
Standardizing Columns Names
Column names were reviewed and kept consistent. Underscores were placed where
necessary and some columns were converted from lower case to uppercase. No major
changes were required.
Verifying Data Accuracy
Negative or Invalid Values: No negative or invalid were found in the numerical columns.
Data Types: correct formatting was ensured, converting PRODUCT_TYPE_ID and
PRODUCT_LENGTH to an integer.
Short Title Creation
Objective
The SHORT_TITLE featured was introduced to create brief SEO-friendly product titles that
maintain essential key word of the actual title. The following steps were taken to achieve
this:
•
Added a custom column to remove key details from the main title and removed
redundant words like "includes," "set of," and "for" by using this code;
Text.Replace(Text.Replace(Text.Replace([Title], " Set of ", ""), " for ", ""), " Includes
", "")
•
Added another custom column and ensured the title length remained between 30–
50 characters for optimization with by implementing an IF statement;
if Text.Length([SHORT_TITLE]) > 50 then Text.Start([SHORT_TITLE], 50) else
[SHORT_TITLE]
•
Lastly created the last column named “SHORT_TITLE” to add achieve this I extracted
the first 3 words and the last two words of the main title with this code;
Text.Combine
(
List.FirstN(Text.Split([Title], " "), 3)
&List.LastN(Text.Split([Title], " "), 2),
""
)
1.Image showing the final step in creating the "SHORT_TITLE"
Samples of Enhanced Titles
ORIGINAL TITLE
SHORT TITLE
ArtzFolio Tulip Flowers Blackout Curtain for
Door, Window & Room | Eyelets & Tie Back |
Canvas Fabric | Width 4.5feet (54inch) Height
5 feet (60 inch); Set of 2 PCS
ArtzFolio Tulip Flowers 2 PCS
Marks & Spencer Girls' Pyjama Sets
T86_2561C_Navy Mix_9-10Y
Marks & Spencer T86_2561C_Navy Mix_910Y
Mediterranean diet for beginners: 7Benefits
of mediterranean diet,7day plan and
70yummy easy recipes
Mediterranean diet for easy recipes
Star Trek 50th Anniversary Cereamic Storage
Jar
Star Trek 50th Storage Jar
PosterHub Pink Floyd The Wall Poster Matte
Finish Paper Print 12 x18 Inch (Multicolor) HS
- P076
PosterHub Pink Floyd - P076
Overview of Cleaned Dataset
The dataset was thoroughly cleaned after identifying issues that need fixing. Some of the issues
identified are;
•
•
•
•
•
•
Missing values in BULLET_POINTS: the action taken to replacing these missing values with
N/A.
Null Values in DESCRIPTION: the null values were replaced as None.
Missing values in PRODUCT_TYPE_ID and PRODUCT_LENGTH: the action taken was to
remove duplicates from the entire table.
Duplicate were found and removed; 271 duplicates were removed and we have 3541 rows.
The data types were standardized to ensure consistency
The lengthy titles were optimized to SHORT_TITLE
After completing the data cleaning and title optimization process, the dataset was loaded
into the excel sheet. Which shows the total rows and the total number of columns.
Total Columns: 3541
Total Rows: 7
Key Columns: PRODUCT_TITLE, TITLE, BULLET_POINTS, DESCRIPTION, PRODUCT_TYPE_ID,
PRODUCT_LENGTH, SHORT_TITLE
Insights Generated
Count of Product
4000
Count of PRODUCT_TITLE
3541
-
3541
-
-
Total
PRODUCT_TITLE PRODUCT_TYPE_
114676
13329
669086
13330
-
13294
-
13118
-
13169
-
13121
-
13294
-
13329
-
13117
-
13319
Grand Total
132420
Top 10 Product Type-
Conclusion
This report gives and overview of the data cleaning and title optimization process for the
Product Data dataset. The cleaning focused on handling duplicates, missing values and
inconsistencies.
A new column “SHORT_TITLE” was created to enhance SEO and readability by drafting key
words from the main Title. The column names were standardized and duplicates records
removed. After cleaning and optimization, the dataset is now reliable for further marketing
analysis and insights.