Shahzad Husain Zubairi | Freelancer Portfolio Item #372048

SHAHZAD ZUBAIRI
https://github.com/shzubairi
https://www.linkedin.com/in/shahzadzubairi

DATA ANALYTICS PROJECTS PORTFOLIO

Understanding Customers and Predicting Profitability in RapidMiner for an Electronics Retailer

  Investigation of customer buying patterns
    - Data mining: investigated relationships between variables such as customers' regions, ages, and amount spent per transaction
        - Built decision trees to evaluate the relationships between variables
    - Machine learning: classified where a transaction took place
        - Trained and tested a model
        - Applied machine learning to a classification problem to understand the relationship between items and amount spent

  Predicting profitability
    - Prepared and pre-processed the data
    - Feature selection with correlation: built a correlation matrix to understand the relationships between features
    - Trained and tested models with three algorithms: KNN, SVM, and Gradient Boosted Trees
    - Optimized the models before making the final predictions

  Presentation to senior management
    - Reported the findings to a non-technical audience in simplified terms

Predictive Analytics in RStudio – Brand Preference, Sales, and Market Basket Analysis

  Predicting brand preference using classification techniques
    - Used caret package tools: data splitting, pre-processing, feature selection, model tuning, resampling, and variable importance estimation
    - Trained and tested models using two classifiers, C5.0 and Random Forest
    - Made predictions using the best-tuned model

  Predicting sales of different product types using regression techniques
    - Developed multiple regression models using SVM, Gradient Boosted Trees, and Random Forest to predict sales for PCs, laptops, netbooks, and smartphones
    - Presented a correlation matrix visualization to analyze how service and customer reviews relate to sales volume

  Market basket analysis to discover associations between products
    - Applied the Apriori algorithm to find association rules within the customers' transactions dataset
    - Developed a model to find the optimal levels of support and confidence
    - Plotted the results to visualize them, observe patterns, and present to management

Deep Analytics and Visualization in RStudio – Time Series Analysis of Energy Data and Indoor Wi-Fi Locationing

  Domain research and exploratory data analysis
    - Used the RMySQL package to obtain and query the data
    - During data pre-processing, used the lubridate package to combine the separate date and time attributes into a single date-time attribute
    - As part of the initial data exploration, worked on three areas: data documentation, assessment of summary statistics, and three proposed recommendations

  Visualizing and analyzing energy data with time series analysis
    - Created several plots to visualize the data from different perspectives
    - Applied the forecast package, including its tslm() function, to forecast the time series and plot the results
    - Used the decompose() function to remove seasonal components from the time series
    - Used the HoltWinters() function to produce exponentially smoothed forecasts
    - Presented the report with business insights to management

  Evaluating techniques for Wi-Fi locationing
    - Classification problem: used three algorithms within the caret package (Random Forest, C5.0, and KNN) to analyze Wi-Fi fingerprinting and determine a person's location indoors
    - Used accuracy and kappa scores to interpret the results and make recommendations to the client

Data Science and Big Data – Sentiment Analysis of Smartphones

  Setting up the computing environment with Python and Amazon Web Services (AWS)
    - Data preparation
        - Used Mapper, Reducer, and Concatenate programs in Python
        - Sourced the WET files needed for the analysis from Common Crawl
    - Ran an EMR job flow using the EMR console
        - Set up S3 buckets
        - Used Cyberduck to upload the mapper and reducer scripts to an S3 bucket
        - Created EMR clusters
    - Ran the job flows using the AWS CLI
        - Selected the WET file addresses
        - Ran CreateJsonFiles.py in Python to generate the JSON files
        - Ran the JSON files from the CLI to create EMR clusters

  Developing models for sentiment analysis of iPhone and Galaxy smartphones in RStudio
    - Set up parallel processing with the doParallel library
    - Explored the data by plotting a histogram
    - Conducted data pre-processing and feature selection
    - Developed and evaluated models
    - Applied the final model to the data
    - Analyzed the results and reported the findings

Data Science with Python – Default of Credit Card Customers

  Data science framework and environment
    - Defined a process using the BADIR data-to-decision framework
    - Set up the environment in Python and Jupyter Notebook by installing the necessary tools

  Preparing and exploring data
    - In Jupyter Notebook, imported and prepared the data
        - Cleaned the data
        - Addressed missing values
        - Addressed redundancies and reduced the data
        - Discretized the data by binning some variables
    - Performed detailed Exploratory Data Analysis (EDA)
        - Libraries used: Pandas, NumPy, Matplotlib, Seaborn, SciPy.stats
        - Conducted detailed analysis of credit card customers' data
        - Created several visualizations to tell the story of the bivariate and multivariate analysis
        - Created a correlation coefficient matrix using Pandas
        - Uploaded the notebook with the findings to GitHub

  Building and evaluating models
    - Built and evaluated classification models using three algorithms: Random Forest, Support Vector Machine, and Gaussian Naïve Bayes
    - Libraries used: scikit-learn, Pandas, NumPy, Matplotlib, SciPy.stats
    - Selected the final model based on the highest accuracy
    - Made predictions with the final model, evaluated the results, and reported them to management