Kamlesh Gupta
Bangalore
Profile Summary
Over 8.5 years of experience in Data Integration, Data Engineering, and the build of large-scale Data Warehouse and Data Lake implementations using various ETL, ELT, Big Data, and Cloud technologies.
Core Competencies
PySpark, AWS Glue, Redshift, Core Java, Kafka, WhereScape, AWS CloudFormation, Scala, Java, Hive, SQL, Athena, Data Vault 2.0, Databricks
Professional Experience
Current organization: Sept 2018 - till date
Project - Data Vault Data Warehouse Implementation for FMCG client
Technologies: AWS Glue, Redshift, WhereScape 3D and RED, S3, Athena, Spark, AWS CodePipeline, AWS CloudFormation, AWS Lambda, DynamoDB, AWS Step Functions, Airflow
Business Functions:
The customer required migration of on-premises data from different business units, along with on-prem Hadoop-based data processing and reports. The target was a robust technical and functional model that can append new data to the data product without altering the existing models.
Developed Data Vault 2.0 Data Warehouse on Redshift.
Performed data modeling and developed data processing pipelines using AWS big data solutions (EMR, Glue, Step Functions, Lambda).
Developed a Data Lake and various data ingestion pipelines using the Glue Data Catalog and Athena.
Created AWS Glue Spark Jobs for data transformations of formats like CSV, JSON,
and XML.
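For illustration, a minimal sketch of such a Glue job, converting a catalog-registered CSV source and an S3 JSON source to Parquet (the database, table, and bucket names are hypothetical, not the project's actual ones):

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# CSV source registered in the Glue Data Catalog (hypothetical names)
csv_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="sales_csv"
)

# JSON source read directly from S3 (hypothetical path); XML can be read
# similarly with format="xml" and a rowTag format option
json_dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-bucket/raw/json/"]},
    format="json",
)

# Simple transformation: drop an unused field and rename a column
cleaned = csv_dyf.drop_fields(["unused_col"]).rename_field("amt", "amount")

# Write the transformed data back to S3 as Parquet (hypothetical paths)
glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/sales/"},
    format="parquet",
)
glue_context.write_dynamic_frame.from_options(
    frame=json_dyf,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/events/"},
    format="parquet",
)

job.commit()
```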
Optimized the AWS Glue processes and Redshift queries.
Implemented CI/CD for AWS Glue using AWS CloudFormation, AWS CodeBuild, and AWS CodeDeploy.
Used DynamoDB for config storage and retrieval.
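As an illustration of that pattern, a small boto3 sketch for storing and fetching pipeline configuration (the table name, key, and attributes are hypothetical):

```python
import boto3

# Hypothetical config table keyed by pipeline name
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
config_table = dynamodb.Table("etl_pipeline_config")

def get_pipeline_config(pipeline_name: str) -> dict:
    """Fetch the stored configuration item for a pipeline, or an empty dict."""
    response = config_table.get_item(Key={"pipeline_name": pipeline_name})
    return response.get("Item", {})

def put_pipeline_config(pipeline_name: str, settings: dict) -> None:
    """Store or overwrite the configuration item for a pipeline."""
    config_table.put_item(Item={"pipeline_name": pipeline_name, **settings})

if __name__ == "__main__":
    put_pipeline_config("sales_ingest", {"source_bucket": "example-bucket", "batch_size": 500})
    print(get_pipeline_config("sales_ingest"))
```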
Migrated the existing scheduling process to Airflow.
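A minimal Airflow DAG sketch for one such migrated schedule, using the AWS provider's Glue operator (the DAG id, job name, schedule, and region are assumptions, and the exact import path and DAG arguments depend on the Airflow and provider versions):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

# Hypothetical daily schedule replacing a legacy scheduler entry
with DAG(
    dag_id="daily_sales_load",
    start_date=datetime(2022, 1, 1),
    schedule_interval="0 2 * * *",
    catchup=False,
) as dag:
    run_glue_job = GlueJobOperator(
        task_id="run_sales_transform",
        job_name="sales_transform_job",  # hypothetical Glue job name
        region_name="us-east-1",
        wait_for_completion=True,
    )
```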
Built WhereScape 3D Data Vault 2.0 model mappings to incorporate new changes, created mappings and other changes in the Information Hub Console to add new functionality, and was involved in the Redshift warehouse implementations for this strategy.
Built WhereScape RED based pipelines for automated delta load and change capture, and automated the deployment of 3D based models.
Used WhereScape RED processes for deployment and automation of Data Vault entities such as Load, Stage, Hubs, Links, Satellites, SOCV, and SOV.
Project - Data analysis platform and bot development for a Telecom client
Technologies: Scala, Shell Scripts, Kafka, Elasticsearch, AWS EMR, Athena, Spark, AWS CodePipeline, AWS CloudFormation, AWS Lambda, DynamoDB, AWS Step Functions
The basic function of this application is to migrate processing from Teradata to AWS. The motive is to use S3 for storing and processing the files, explicitly using Spark (Scala), Athena, and Elasticsearch.
Developed data processing pipelines using AWS big data solutions (EMR, Glue, Step Functions, Lambda).
Developed AWS EMR jobs to create the processing pipelines.
Responsible for writing custom UDFs in Spark for handling XML and JSON data.
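Those UDFs were written in Scala on EMR; purely as an illustration of the idea, an equivalent JSON-parsing UDF in PySpark could look like this (the column and field names are hypothetical):

```python
import json

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf_example").getOrCreate()

@udf(returnType=StringType())
def extract_account_id(raw_json: str) -> str:
    """Pull a nested field out of a raw JSON string, tolerating bad records."""
    try:
        return json.loads(raw_json).get("account", {}).get("id")
    except (TypeError, ValueError, AttributeError):
        return None

# Hypothetical input: a DataFrame with one raw JSON string column
df = spark.createDataFrame(
    [('{"account": {"id": "A123"}}',), ("not-json",)], ["payload"]
)
df.withColumn("account_id", extract_account_id("payload")).show()
```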
Involved in requirement gathering, design, development, and testing. Defined strategies and frameworks for implementation across different data sources and data types.
Calculated the percentage difference of monthly, yearly, and seasonal sales.
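For example, a month-over-month percentage difference of this kind can be computed with a lag window function; a small PySpark sketch with made-up column names:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales_pct_diff").getOrCreate()

# Hypothetical monthly sales aggregates
sales = spark.createDataFrame(
    [("2021-01", 100.0), ("2021-02", 120.0), ("2021-03", 90.0)],
    ["month", "total_sales"],
)

# Compare each month against the previous one
w = Window.orderBy("month")
result = sales.withColumn("prev_sales", F.lag("total_sales").over(w)).withColumn(
    "pct_change",
    F.round((F.col("total_sales") - F.col("prev_sales")) / F.col("prev_sales") * 100, 2),
)
result.show()
```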
Read and wrote data in various file formats.
Integrated Elasticsearch with a streaming Kafka consumer to load user account data for optimized search. Implemented a REST auth module along with the end-to-end pipeline using Java and Spark, and implemented AES-256 encryption and X.509 certificate custom modules using the Elasticsearch Java APIs.
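The project modules were built with Java and the Elasticsearch Java APIs; as a simplified illustration of the ingestion pattern only, a Python sketch using kafka-python and the elasticsearch client (topic, index, and hosts are hypothetical):

```python
import json

from elasticsearch import Elasticsearch, helpers
from kafka import KafkaConsumer

# Hypothetical endpoints and names
consumer = KafkaConsumer(
    "user-accounts",                       # topic
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)
es = Elasticsearch(["http://localhost:9200"])

def to_actions(messages):
    """Turn consumed Kafka records into Elasticsearch bulk-index actions."""
    for msg in messages:
        yield {
            "_index": "user_accounts",
            "_id": msg.value.get("account_id"),
            "_source": msg.value,
        }

# Index consumed messages in small bulk batches
batch = []
for message in consumer:
    batch.append(message)
    if len(batch) >= 100:
        helpers.bulk(es, to_actions(batch))
        batch = []
```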
Implemented CI/CD for AWS Glue using AWS CloudFormation, AWS CodeBuild, and AWS CodeDeploy.
Project - Data Onboarding for a US biopharmaceutical company on Azure
Technologies: PySpark, Azure Synapse Analytics, Cosmos DB, Databricks, NetApp, Power BI, Spark, SQL
Designed the Azure platform solution for data processing and analysis of various file formats using Databricks.
Set up Azure SFTP and automated file onboarding to the server.
Developed ETL pipelines for onboarding and analysis of files of various types and formats, such as SAS7BDAT.
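One possible shape of that onboarding step in a Databricks notebook, sketched with pandas as the SAS reader and a conversion to a Spark DataFrame (the path and table names are hypothetical; a dedicated Spark SAS reader could equally have been used):

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sas_onboarding").getOrCreate()

# Hypothetical mounted path to a vendor-delivered SAS file
sas_path = "/dbfs/mnt/raw/study_001/adsl.sas7bdat"

# pandas can read SAS7BDAT directly; the encoding may need adjusting per vendor
pdf = pd.read_sas(sas_path, format="sas7bdat", encoding="latin-1")

# Convert to a Spark DataFrame and register it for downstream SQL/ETL steps
# (the "raw" database is assumed to exist)
sdf = spark.createDataFrame(pdf)
sdf.write.mode("overwrite").saveAsTable("raw.study_001_adsl")
```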
Developed Databricks jobs for processing files related to various studies and from various vendors.
Utilized Cosmos DB for data quality management and for storing process configurations.
Developed the Azure Synapse serverless data warehouse.
Managed the security of blinded and unblinded files across all studies.
Developed Clean Patient Tracker common data models related to various studies and programs.
DXC Technology (formerly HPE), February 2017 - Sept 2018
Project: Analytics Platform development, Feb 2017 - Sept 2018
Technologies: Sqoop, Hive, Spark, Scala, Oozie, Shell Scripts
The basic functions of this application are:
Migration to Azure Data Lake using Azure Data Factory V2.
A log analyzer for logs generated by different applications, covering log file transformation and analysis in Spark (Scala, PySpark) and generation of Hive tables or other desired output formats for later process mining.
Migration of data from SQL Server and Alteryx-generated files to Hadoop, making it analytically presentable for BI; the motive is to use Hadoop for storing and processing the files for growth analytics development.
Key Roles and Responsibilities:
Imported survey data files from different providers and stored them in Hadoop using Slurper and PyNotify.
Worked on the Spark DataFrame API and Spark SQL as part of transformations.
Transformations and processing were done primarily in the Parquet and ORC file formats.
Configured Azure cloud resources for the Azure Data Factory pipelines.
Provided Hive structure to the files and merged them with base and historical files for processing.
Performed analysis on the processed data as per business analytics requirements and generated KPIs.
Scheduled Oozie jobs on defined timeframes, monitored them, and resolved query errors.
Infosys Technologies, Bangalore, February 2014 – January 2017
Project: Teradata and Streaming Data to Hadoop Offloading
Technologies: Sqoop, Hive, Spark, Scala, Kafka, Flume, Oozie, Shell Scripts
Converted existing Teradata ODI code into corresponding big data solutions, using the Cloudera platform for development and delivery. The idea was to reverse engineer the code to understand it and replace it with corresponding big data solutions using Spark SQL, Java, Hive, and Sqoop, thereby utilizing faster data processing to enhance ticket booking and reduce booking failures in near real time.
Key Roles and Responsibilities:
Worked on the Spark DataFrame API and Spark SQL as part of transformations.
Worked on Sqoop to move historical data from Teradata to Hadoop.
Involved in writing Hive queries and improved Hive query performance by implementing partitioning and bucketing based on different term levels.
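In the project these ran as Hive queries; as an illustration only, the DDL for a partitioned, bucketed ORC table and a partition load, issued here through PyHive (host, table, and column names are hypothetical):

```python
from pyhive import hive

# Hypothetical HiveServer2 endpoint
conn = hive.Connection(host="hive-server.example.com", port=10000, database="default")
cursor = conn.cursor()

# Partitioning by booking date lets range queries prune whole directories;
# bucketing by customer_id reduces shuffle for joins/aggregations on that key.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS bookings_curated (
        booking_id  STRING,
        customer_id STRING,
        amount      DOUBLE
    )
    PARTITIONED BY (booking_date STRING)
    CLUSTERED BY (customer_id) INTO 32 BUCKETS
    STORED AS ORC
""")

# Load a single partition from a staging table (hypothetical source);
# on Hive 1.x, SET hive.enforce.bucketing=true would be needed before the INSERT
cursor.execute("""
    INSERT OVERWRITE TABLE bookings_curated PARTITION (booking_date = '2016-05-01')
    SELECT booking_id, customer_id, amount
    FROM bookings_staging
    WHERE booking_date = '2016-05-01'
""")
```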
Analyzed structured data and created tables using Hive.
Persisted Kafka streaming data analyzed by Spark into Cassandra.
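A sketch of that persistence step with the spark-cassandra-connector DataFrame API (keyspace, table, and columns are hypothetical; the connector package must be supplied, e.g. via --packages):

```python
from pyspark.sql import SparkSession

# Assumes the spark-cassandra-connector package is on the classpath
spark = (
    SparkSession.builder.appName("kafka_to_cassandra")
    .config("spark.cassandra.connection.host", "127.0.0.1")
    .getOrCreate()
)

# Hypothetical DataFrame produced by the Spark analysis of the Kafka stream
events = spark.createDataFrame(
    [("A123", "2016-05-01T10:00:00", 42.0)],
    ["account_id", "event_time", "score"],
)

# Append the analyzed records into a Cassandra table
(
    events.write.format("org.apache.spark.sql.cassandra")
    .options(table="account_events", keyspace="analytics")
    .mode("append")
    .save()
)
```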
Monitored the Spark History Server for performance optimization of Spark DataFrame jobs.
Project: Data Migration and Analysis for a Telecom Client
The basic function of this application is to migrate processing from Teradata to Hadoop. The motive is to use Hadoop for storing and processing the files, explicitly using Hive.
Key Roles and Responsibilities:
o Defined strategies and frameworks for implementation across different data sources and data types.
o Involved in gathering requirements, design, development, and testing.
o Designed Hive tables, both partitioned and non-partitioned.
o Calculated the percentage difference of monthly, yearly, seasonal, and functional sales.
o Read and wrote data in various file formats.
o Proficient in Relational Database Management Systems (RDBMS).
o Extensive knowledge of SQL (DDL, DML, DCL) and of the design and normalization of database tables.
o Extensive experience in writing stored procedures, triggers, functions, indexes, and views.
o Extensive knowledge of advanced query concepts (e.g., GROUP BY, HAVING clause, UNION, and so on).