SHAHZAD ZUBAIRI
https://github.com/shzubairi https://www.linkedin.com/in/shahzadzubairi
DATA ANALYTICS PROJECTS PORTFOLIO
Understanding Customers and Predicting Profitability in RapidMiner for an Electronics Retailer
Investigation of customer buying patterns
Data mining – investigated relationships between different variables, such as customers’ regions, ages, and amount spent per transaction
Built decision trees to evaluate the relationships between variables
Machine learning – classifying where a transaction took place
Trained and tested a model
Applied machine learning to a classification problem to understand the relationship between items and amount spent
Predicting profitability
Prepared and pre-processed the data
Feature selection with correlation – built a correlation matrix to understand the relationships between features
Trained and tested models with three algorithms – KNN, SVM, and Gradient Boosted Trees
Optimized the models before making the final predictions
Presentation to senior management
Reported the findings to a non-technical audience in a simplified manner
Predictive Analytics in RStudio – Brand Preference, Sales, and Market Basket Analysis
Predicting brand preference using classification techniques
Used caret package tools – data splitting, pre-processing, feature selection, model tuning using resampling, and variable importance estimation
Trained and tested models using different classifiers, i.e. C5.0 and Random Forest
Made predictions using the best-performing model
Predicting sales of different product types using regression techniques
Developed multiple regression models using SVM, Gradient Boosted Trees, and Random Forest to predict sales for PCs, Laptops, Netbooks, and Smartphones
Presented a correlation matrix visualization to analyze how service and customer variables relate to sales volume
Market basket analysis to discover associations between products
Applied the Apriori algorithm to find association rules within the customers’ transactions dataset
Developed a model to find the optimal level of support and confidence
Plotted the results to reveal patterns and presented them to management
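The support and confidence measures behind the association rules can be illustrated with a minimal sketch. It is not the project's R code and not a full Apriori implementation (there is no candidate pruning); the toy transactions and the function names are hypothetical.

```python
from itertools import combinations

# Hypothetical toy transactions, not data from the project.
transactions = [
    {"laptop", "mouse"},
    {"laptop", "mouse", "keyboard"},
    {"laptop", "keyboard"},
    {"mouse", "keyboard"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """support(lhs union rhs) / support(lhs), an estimate of P(rhs | lhs)."""
    both = support(set(lhs) | set(rhs), transactions)
    return both / support(lhs, transactions)


def rules(transactions, min_support, min_confidence):
    """Brute-force single-item -> single-item rules passing both thresholds."""
    items = sorted(set().union(*transactions))
    found = []
    for a, b in combinations(items, 2):
        for lhs, rhs in ((a, b), (b, a)):
            s = support({lhs, rhs}, transactions)
            if s < min_support:
                continue
            c = confidence({lhs}, {rhs}, transactions)
            if c >= min_confidence:
                found.append((lhs, rhs, s, c))
    return found
```

Raising either threshold shrinks the rule set, which is the trade-off explored when searching for a suitable level of support and confidence.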
Deep Analytics and Visualization in RStudio – Time Series Analysis of Energy Data and Indoor Wi-Fi Locationing
Domain research and exploratory data analysis
Used the RMySQL package to obtain and query the data
During data pre-processing, used the lubridate package to parse the Date and Time attributes and combine them into a single date-time attribute
As part of the initial data exploration, worked on three areas – documenting the data, assessing summary statistics, and proposing three recommendations
Visualize and analyze energy data by conducting time series analysis
Created several plots to visualize the data from different perspectives
Applied the forecast package, including its tslm() function, to forecast time series and plot the results
Used the decompose() function to estimate and remove the seasonal component of the time series
Used the HoltWinters() function to produce forecasts by exponential smoothing
Presented the report with business insights to management
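The idea behind exponential smoothing can be shown with a minimal level-only sketch. It is an illustration in Python, not the project's R code: R's HoltWinters() can also fit trend and seasonal components and estimates its smoothing parameters from the data, whereas here alpha is supplied by hand and the function name is hypothetical.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed level after each observation.

    Each level is a weighted average of the newest observation and the
    previous level; the last level serves as the flat forecast for
    future periods.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]  # initialise the level with the first observation
    levels = [level]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level
        levels.append(level)
    return levels
```

A larger alpha makes the forecast react faster to recent observations; a smaller alpha gives a smoother series.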
Evaluate techniques for Wi-Fi locationing
Classification problem – used three algorithms within the caret package (Random Forest, C5.0, and KNN) to analyze Wi-Fi fingerprinting data and determine a person’s location indoors
Used accuracy and kappa scores to interpret the results and make recommendations to the client
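For reference, the two metrics can be computed as in the sketch below. It is a plain-Python illustration rather than the caret output used in the project, and the function names are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Agreement corrected for the agreement expected by chance.

    Undefined (division by zero) when chance agreement equals 1, e.g.
    when both label lists contain a single identical class.
    """
    n = len(y_true)
    observed = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: product of the marginal class frequencies.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1 - expected)
```

Kappa is useful alongside accuracy because it discounts correct predictions that an uninformed classifier would also get right when classes are imbalanced.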
Data Science and Big Data – Sentiment Analysis of Smartphones
Setting up computing environment with Python and Amazon Web Services (AWS)
Data preparation
Used mapper, reducer, and concatenation programs written in Python
Sourced the WET files from Common Crawl needed for data analysis
Ran an EMR job flow using the EMR console
Set up S3 buckets
Used Cyberduck to upload the mapper and reducer scripts to the S3 bucket
Created EMR clusters
Ran the job flows using the AWS CLI
Selected the WET file addresses
Ran CreateJsonFiles.py in Python to generate the JSON files
Submitted the JSON files from the CLI to create EMR clusters
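The mapper and reducer pattern used in this kind of job can be sketched as follows. This is an illustrative example, not the project's scripts: the keyword list and the function names are assumptions, and in a real streaming job the framework, not the caller, sorts the mapper output by key before it reaches the reducer.

```python
from itertools import groupby

# Hypothetical keywords to count; the project's actual terms are not given.
KEYWORDS = ("iphone", "galaxy")


def mapper(lines):
    """Emit 'keyword<TAB>count' for every keyword found in each line."""
    for line in lines:
        text = line.lower()
        for keyword in KEYWORDS:
            count = text.count(keyword)
            if count:
                yield f"{keyword}\t{count}"


def reducer(sorted_pairs):
    """Sum the counts per keyword; input must already be sorted by key."""
    parsed = (pair.split("\t") for pair in sorted_pairs)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield key, sum(int(value) for _, value in group)


lines = ["The iPhone and the Galaxy", "iphone iphone", "nothing relevant"]
totals = dict(reducer(sorted(mapper(lines))))
```

Sorting the mapper output locally, as in the last line, stands in for the shuffle-and-sort step that a cluster performs between the two stages.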
Develop models to conduct sentiment analysis of iPhone and Galaxy smartphones in RStudio
Set up parallel processing with the doParallel library
Explored the data by plotting a histogram
Conducted data pre-processing and feature selection
Developed and evaluated models
Applied the final model to the data
Analyzed the results and reported the findings
Data Science with Python – Default of Credit Card Customers
Data science framework and environment
Defined a process using the BADIR data-to-decision framework
Set up the environment in Python and Jupyter Notebook by installing the necessary tools
Prepare and explore data
In Jupyter Notebook – imported and prepared the data
Cleaned the data
Addressed any missing values
Addressed any redundancies and reduced the data
Discretized the data by binning some variables
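Discretization by binning can be illustrated with a minimal standard-library sketch. The age edges and labels are hypothetical and are not taken from the project; in a notebook, pandas.cut offers comparable functionality.

```python
from bisect import bisect_right

# Hypothetical bin edges and labels for an age variable.
AGE_EDGES = [30, 45, 60]
AGE_LABELS = ["<30", "30-44", "45-59", "60+"]


def bin_value(value, edges, labels):
    """Map a numeric value to the label of the bin it falls into.

    edges must be sorted ascending and labels must have exactly one
    more entry than edges. A value equal to an edge goes to the upper bin.
    """
    if len(labels) != len(edges) + 1:
        raise ValueError("labels must have one more entry than edges")
    return labels[bisect_right(edges, value)]


ages = [22, 30, 44, 45, 60, 71]
binned = [bin_value(a, AGE_EDGES, AGE_LABELS) for a in ages]
```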
Performed detailed Exploratory Data Analysis (EDA)
Libraries used – Pandas, NumPy, Matplotlib, Seaborn, SciPy.stats
Conducted detailed analysis of credit card customers’ data
Created several visualizations to tell the story of the bivariate and multivariate analysis
Created a correlation coefficient matrix using Pandas
Uploaded the notebook with the findings to GitHub
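What a correlation coefficient matrix contains can be shown with a minimal pure-Python sketch of the Pearson coefficient. It is not the project's notebook code, which used Pandas, and the column names and values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Nested dict of pairwise Pearson coefficients, keyed by column name."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical columns, not values from the credit card dataset.
columns = {
    "limit": [1, 2, 3, 4],
    "bill": [2, 4, 6, 8],
    "payment": [4, 3, 2, 1],
}
matrix = correlation_matrix(columns)
```

Values near 1 or -1 flag strongly related variables, which is the signal used when checking for redundant features.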
Build and evaluate models
Built and evaluated classification models using three algorithms – Random Forest, Support Vector Machine, and Gaussian Naïve Bayes
Libraries used – scikit-learn, Pandas, NumPy, Matplotlib, SciPy.stats
Selected the final model based on the highest accuracy
Made predictions using the final model, evaluated the results, and reported them to management