Snehitha Pothina
Data Scientist
Dallas, TX – 75081
PROFESSIONAL SUMMARY
Data Scientist with 5 years of experience designing and deploying end-to-end machine learning and AI solutions across
finance, healthcare, and technology sectors. Skilled in Python, Spark, Hadoop, AWS, and Snowflake, with proven
expertise in developing predictive models, NLP systems, and time series forecasts for structured and unstructured data.
Adept at building scalable data pipelines, implementing advanced analytics, and creating impactful dashboards using
Tableau and Python. Strong collaborator with a track record of translating business requirements into actionable insights
and driving data-driven strategies for stakeholders.
TECHNICAL SKILLS
• Databases: MySQL, PostgreSQL, Oracle, HBase, Amazon Redshift, MS SQL Server 2016/2014/2012/2008 R2/2008, Teradata, MongoDB, Snowflake
• Statistical Methods: Hypothesis Testing, ANOVA, Time Series Forecasting, Confidence Intervals, Bayes' Theorem, Principal Component Analysis (PCA), Dimensionality Reduction, Cross-Validation, Autocorrelation, A/B Testing, Experimental Design
• Machine Learning: Regression Analysis, Bayesian Methods, Decision Trees, Random Forests, Support Vector Machines, Neural Networks, Sentiment Analysis, K-Means Clustering, KNN, and Ensemble Methods
• Deep Learning & NLP: Transformers (BERT, GPT, CLIP), Hugging Face, PyTorch, TensorFlow, Embeddings, Attention Mechanisms, Tokenization, Multimodal Models, Transfer Learning, Sequence Models (RNN, LSTM, GRU)
• AI/GenAI: Generative AI, Large Language Models (LLMs), Prompt Engineering, LangChain, Semantic Kernel
• Hadoop Ecosystem: Hadoop 2.x, Spark 2.x, MapReduce, Hive, HDFS, Sqoop, Flume
• Reporting Tools: Tableau Suite 10.x/9.x/8.x (Desktop, Server, and Online), SQL Server Reporting Services (SSRS)
• Data Visualization: Tableau, Matplotlib, Seaborn, ggplot2
• Languages: Python (2.x/3.x), R, SAS, SQL, T-SQL
• Operating Systems & Scripting: Windows, Linux/UNIX, PowerShell, UNIX shell scripting (via PuTTY)
WORK HISTORY
Goldman Sachs - Dallas, TX
Data Scientist
Nov 2024 – Present
• Developed and implemented predictive and classification models to analyze customer behavior, optimize decision-making, and enhance data-driven insights.
• Collaborated with data engineers and the operations team to implement ETL processes, writing and optimizing SQL queries and using Hive to retrieve data from Hadoop clusters and Redshift to fit analytical requirements.
• Performed univariate and multivariate analysis to identify underlying patterns and associations in the data, and used F-Score, AUC/ROC, Confusion Matrix, MAE, and RMSE to evaluate the performance of different models (see the evaluation sketch below).
• Participated in feature engineering such as feature intersection generation, normalization, and label encoding with Scikit-learn preprocessing, including data cleaning and feature scaling using pandas and NumPy in Python.
• Analyzed customer purchasing behavior and quantified customer value with RFM (recency, frequency, monetary) analysis, applying customer segmentation with clustering algorithms such as K-Means and Hierarchical Clustering (see the segmentation sketch below).
• Built regression models including Lasso, Ridge, SVR, and XGBoost to predict Customer Lifetime Value, using an XGBoost classifier for categorical targets and an XGBoost regressor for continuous targets, and combining feature sets with scikit-learn's FeatureUnion and FunctionTransformer (see the CLV pipeline sketch below).
• Used Principal Component Analysis in feature engineering to analyze high-dimensional data.
• Created deep learning models using TensorFlow and Keras, combining all test results into a single normalized score to predict students' residency attainment.
• Produced graphs showing student performance by demographic group and mean scores on the different USMLE exams.
• Designed and implemented recommender systems using collaborative filtering to recommend courses to different customers, and deployed them to an AWS EMR cluster (see the ALS sketch below).
• Utilized natural language processing (NLP) techniques to improve customer satisfaction.
• Designed rich data visualizations to present data in human-readable form with Tableau and Matplotlib.
• Used generative AI techniques, including BERT, GPT, and prompt engineering, to automate insight extraction and improve customer satisfaction metrics.
• Integrated reinforcement learning techniques and agent-based models to optimize recommendation systems and personalize content delivery, contributing to real-time adaptive customer engagement strategies.
• Developed and maintained MLOps pipelines using Airflow and CI/CD, automating model retraining and monitoring, and ensuring seamless deployment of machine learning solutions into production (see the Airflow sketch below).
Environment: AWS Redshift, EC2, EMR, Hadoop Framework, S3, HDFS, Spark (PySpark, MLlib, Spark SQL), Python 3.x (Scikit-Learn/SciPy/NumPy/Pandas/NLTK/Matplotlib/Seaborn), Tableau Desktop (9.x/10.x), Tableau Server (9.x/10.x), Machine Learning (Regressions, KNN, SVM, Decision Tree, Random Forest, XGBoost, LightGBM, Collaborative Filtering, Ensemble), NLP, Teradata, Git 2.x, Agile/SCRUM
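Evaluation sketch: a minimal scikit-learn example of the metrics named above (F1, AUC/ROC, confusion matrix, MAE, RMSE); the synthetic data and the RandomForest stand-in are illustrative, not the production models.

```python
# Minimal sketch: classifier evaluation with F1, AUC/ROC, and a confusion
# matrix; data and model are synthetic placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (f1_score, roc_auc_score, confusion_matrix,
                             mean_absolute_error, mean_squared_error)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]   # scores for AUC/ROC
pred = clf.predict(X_test)                # hard labels for F1 / confusion matrix

print("F1:", f1_score(y_test, pred))
print("AUC/ROC:", roc_auc_score(y_test, proba))
print("Confusion matrix:\n", confusion_matrix(y_test, pred))

# For the regression models, MAE and RMSE are computed analogously:
# mae = mean_absolute_error(y_true, y_pred)
# rmse = np.sqrt(mean_squared_error(y_true, y_pred))
```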
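Segmentation sketch: one way to compute RFM features and cluster them with K-Means; the transactions DataFrame, column names, and cluster count are hypothetical.

```python
# Minimal sketch: RFM scoring per customer, then K-Means segmentation.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

transactions = pd.DataFrame({            # hypothetical order history
    "customer_id": [1, 1, 2, 3, 3, 3],
    "order_date": pd.to_datetime(["2024-01-05", "2024-03-01", "2024-02-10",
                                  "2024-01-20", "2024-02-15", "2024-03-10"]),
    "amount": [120.0, 80.0, 200.0, 40.0, 60.0, 55.0],
})

snapshot = transactions["order_date"].max() + pd.Timedelta(days=1)
rfm = transactions.groupby("customer_id").agg(
    recency=("order_date", lambda d: (snapshot - d.max()).days),
    frequency=("order_date", "count"),
    monetary=("amount", "sum"),
)

# Scale the RFM features so no single axis dominates, then cluster.
scaled = StandardScaler().fit_transform(rfm)
rfm["segment"] = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(scaled)
print(rfm)
```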
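CLV pipeline sketch: the FeatureUnion/FunctionTransformer pattern feeding an XGBoost regressor. The feature split, synthetic data, and hyperparameters are assumptions, and xgboost must be installed.

```python
# Minimal sketch: combine feature subsets via FeatureUnion, then regress CLV.
import numpy as np
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                      # cols 0-3 behavioral, 4-5 demographic (assumed)
y = X[:, 0] * 3 + X[:, 4] + rng.normal(size=500)   # synthetic CLV target

behavioral = FunctionTransformer(lambda a: a[:, :4])
demographic = FunctionTransformer(lambda a: a[:, 4:])

model = Pipeline([
    ("features", FeatureUnion([
        ("behavioral", Pipeline([("select", behavioral),
                                 ("scale", StandardScaler())])),
        ("demographic", demographic),
    ])),
    ("xgb", XGBRegressor(n_estimators=200, max_depth=4)),
])
model.fit(X, y)
print("R^2 on training data:", model.score(X, y))
```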
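ALS sketch: a collaborative-filtering recommender with Spark MLlib's ALS, as it might run on an EMR cluster; the ratings data and column names are illustrative.

```python
# Minimal sketch: matrix-factorization recommendations with Spark ALS.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("course-recs").getOrCreate()
ratings = spark.createDataFrame(
    [(0, 10, 4.0), (0, 11, 2.0), (1, 10, 5.0), (1, 12, 3.0), (2, 11, 4.0)],
    ["user_id", "course_id", "rating"],
)

als = ALS(userCol="user_id", itemCol="course_id", ratingCol="rating",
          rank=8, maxIter=10, coldStartStrategy="drop")
model = als.fit(ratings)

# Top-3 course recommendations per user.
model.recommendForAllUsers(3).show(truncate=False)
```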
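Airflow sketch: the shape of a retraining DAG like the one described above. The DAG id, schedule, and task bodies are hypothetical placeholders, and the `schedule` argument assumes Airflow 2.4+.

```python
# Minimal sketch: weekly retrain-then-deploy pipeline as an Airflow DAG.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def retrain_model(**context):
    ...  # placeholder: pull fresh features, refit the model, log metrics

def validate_and_deploy(**context):
    ...  # placeholder: compare against the champion model, promote if better

with DAG(
    dag_id="clv_model_retraining",   # hypothetical name
    start_date=datetime(2024, 11, 1),
    schedule="@weekly",
    catchup=False,
) as dag:
    retrain = PythonOperator(task_id="retrain", python_callable=retrain_model)
    deploy = PythonOperator(task_id="validate_and_deploy",
                            python_callable=validate_and_deploy)
    retrain >> deploy                # deploy only runs after a successful retrain
```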
McKesson - Irving, TX
Data Scientist
May 2024 – Oct 2024
• Performed exploratory data analysis (EDA) to uncover patterns, correlations, and trends in biological and medical data.
• Developed MapReduce and Spark Python modules for predictive analytics and machine learning in Hadoop on AWS, and wrote complex Spark SQL queries for business-driven data analysis (see the Spark SQL sketch below).
• Performed data cleaning and ensured data quality, consistency, and integrity using Pandas and NumPy, and participated in feature engineering such as feature intersection generation, normalization, and label encoding with Scikit-learn preprocessing.
• Developed and deployed computer vision models for object detection and image classification tasks, leveraging deep learning frameworks such as TensorFlow and PyTorch to analyze medical images and automate quality control processes.
• Designed and implemented end-to-end pipelines for processing and annotating large-scale image datasets, optimizing data ingestion, augmentation, and preprocessing workflows for computer vision training.
• Applied transfer learning and convolutional neural networks (CNNs) to accelerate model development for anomaly detection and pattern recognition in medical imaging data (see the transfer-learning sketch below).
• Designed and optimized high-throughput vision pipelines, enabling the analysis of millions of images for real-time decision making.
• Used big data tools such as Spark (PySpark, Spark SQL, MLlib) on AWS to conduct real-time analysis of loan defaults.
• Conducted data blending and preparation using Alteryx and SQL for Tableau consumption, and published data sources to Tableau Server.
• Created multiple custom SQL queries in Teradata SQL Workbench to prepare optimized data sets for Tableau dashboards, retrieving data from multiple tables using various join conditions for efficient and actionable visualization.
• Deployed and managed machine learning and NLP models using Azure Machine Learning Studio and Azure Synapse Analytics, ensuring scalable and secure integration of AI solutions into cloud-based business workflows.
Environment: CNN, Computer Vision, MS SQL Server 2014, Teradata, ETL, SSIS, Alteryx, Tableau (Desktop 9.x/Server 9.x), Python 3.x (Scikit-Learn/SciPy/NumPy/Pandas), Machine Learning (Naïve Bayes, KNN, Regressions, Random Forest, SVM, XGBoost, Ensemble), AWS Redshift, Spark (PySpark, MLlib, Spark SQL), Hadoop 2.x, MapReduce, HDFS, SharePoint
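Spark SQL sketch: a small PySpark module running an aggregation over a registered view; the claims table and its columns are hypothetical stand-ins for the actual medical data.

```python
# Minimal sketch: Spark SQL over a temp view for business-driven analysis.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("claims-analysis").getOrCreate()
claims = spark.createDataFrame(
    [("A", "cardiology", 1200.0), ("B", "cardiology", 900.0),
     ("A", "oncology", 5400.0)],
    ["patient_id", "department", "claim_amount"],
)
claims.createOrReplaceTempView("claims")

summary = spark.sql("""
    SELECT department,
           COUNT(*)          AS n_claims,
           AVG(claim_amount) AS avg_amount
    FROM claims
    GROUP BY department
    ORDER BY avg_amount DESC
""")
summary.show()
```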
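Transfer-learning sketch: freezing a pretrained ResNet-18 backbone and training a new classification head in PyTorch. The class count and dummy batch are placeholders, and the `weights=` API assumes torchvision 0.13+.

```python
# Minimal sketch: transfer learning with a frozen pretrained backbone.
import torch
import torch.nn as nn
from torchvision import models

num_classes = 2  # placeholder: e.g., normal vs. anomalous scans

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False                               # freeze backbone
model.fc = nn.Linear(model.fc.in_features, num_classes)      # trainable new head

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch.
images = torch.randn(4, 3, 224, 224)
labels = torch.tensor([0, 1, 0, 1])
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print("loss:", loss.item())
```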
Tanla Platforms – Hyderabad, Telangana, India
Data Scientist
Aug 2021 – July 2023
• Gathered, analyzed, documented, and translated application requirements into data models, supporting the standardization of documentation and adoption of best practices related to data and applications.
• Participated in data acquisition with the Data Engineering team to extract historical and real-time data using Sqoop, Pig, Flume, Hive, MapReduce, and HDFS, and wrote user-defined functions (UDFs) in Hive to manipulate strings, dates, and other data.
• Performed data cleaning, feature scaling, and feature engineering using pandas and NumPy in Python, and applied clustering algorithms such as Hierarchical and K-Means with Scikit-learn and SciPy.
• Performed complex pattern recognition on automotive time series data and forecasted demand through ARMA and ARIMA models and exponential smoothing for multivariate time series data (see the forecasting sketch below).
• Delivered and communicated research results, recommendations, and opportunities to managerial and executive teams, and implemented techniques for priority projects.
• Designed, developed, and maintained a repository of daily and monthly summary, trending, and benchmark reports in Tableau Desktop, generating complex calculated fields, parameters, toggled and global filters, dynamic sets, groups, actions, custom color palettes, and statistical analyses to meet business requirements.
• Implemented a variety of Tableau visualizations and views, including combo charts, stacked bar charts, Pareto charts, donut charts, geographic maps, sparklines, and crosstabs.
• Published workbooks and extracted data sources to Tableau Server, implemented row-level security, and scheduled automatic extract refreshes to ensure up-to-date and secure reporting.
Environment: Machine Learning (KNN, Clustering, Regressions, Random Forest, SVM, Ensemble), Linux, Python 2.x (Scikit-Learn/SciPy/NumPy/Pandas), R, Tableau (Desktop 8.x/Server 8.x), Hadoop, MapReduce, HDFS, Hive, Pig, HBase, Sqoop, Flume, Oracle 11g, SQL Server 2012
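Forecasting sketch: ARIMA and exponential-smoothing forecasts with statsmodels; the demand series, model order, and trend settings are illustrative rather than the production configuration.

```python
# Minimal sketch: demand forecasting with ARIMA and exponential smoothing.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Synthetic monthly demand with a linear trend plus noise.
idx = pd.date_range("2022-01-01", periods=36, freq="MS")
demand = pd.Series(
    100 + np.arange(36) * 2 + np.random.default_rng(0).normal(0, 5, 36),
    index=idx,
)

arima = ARIMA(demand, order=(1, 1, 1)).fit()   # order chosen for illustration
print(arima.forecast(steps=6))                 # 6-month ahead forecast

ets = ExponentialSmoothing(demand, trend="add", seasonal=None).fit()
print(ets.forecast(6))
```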
Reckitt – Hyderabad, Telangana, India
Data Scientist
June 2020 – June 2021
• Involved in developing and optimizing data integration processes, creating detailed financial reports, and performing advanced statistical analyses to support business decision-making.
• Used SSIS to create ETL packages to validate, extract, transform, and load data into the Data Warehouse and Data Marts.
• Maintained and developed complex SQL queries, stored procedures, views, table-valued functions, Common Table Expressions (CTEs), joins, and complex subqueries to provide comprehensive reporting solutions in Microsoft SQL Server 2008 R2.
• Optimized the performance of queries with modification in T-SQL, removed unnecessary columns and redundant
data, normalized tables, established joins, and created indexes.
• Created SSIS packages utilizing Pivot Transformation, Fuzzy Lookup, Derived Column, Conditional Split, Aggregate, Execute SQL Task, Data Flow Task, and Execute Package Task.
• Migrated data from the SAS environment to SQL Server 2008 via SQL Server Integration Services (SSIS) and used SAS/SQL to pull data from databases and aggregate it for detailed reporting based on user requirements.
• Designed and developed new reports and maintained existing reports using SQL Server Reporting Services (SSRS) and Excel to support the firm's strategy and management, including sub-reports, drill-down reports, summary reports, parameterized reports, and ad-hoc reports.
• Used SAS for pre-processing data, SQL queries, data analysis, generating reports, graphics, and statistical analyses.
• Provided statistical research analyses and data modeling support for mortgage products, performing analyses such
as regression analysis, logistic regression, discriminant analysis, and cluster analysis using SAS programming.
Environment: SQL Server 2008 R2, DB2, Oracle, SQL Server Management Studio, SAS/BASE, SAS/SQL, SAS/Enterprise Guide, MS BI Suite (SSIS/SSRS), T-SQL, SharePoint 2010, Visual Studio 2010, Agile/SCRUM
EDUCATION
Texas Tech University - Lubbock, TX
Master's in Computer Science