Example of Copy-writing
What is Data Science Vs Data Engineering?
This is much like asking: what is science vs engineering?
The age of mass data collection began before CDs, DVDs, and streaming technologies had even arrived on the scene. Since the late 1960s, when the newest and most popular technologies were the color television, cassette and 8-track tape players, and the microwave oven, data collection, storage, and analytics, along with the ways information is used, have gone through remarkable changes that few people could have envisioned back in those days.
A plethora of new occupations have been born and have evolved with the speedy advance of computer technology: programmers, system administrators, hardware and software experts, data entry personnel, networking specialists, data analysts, machine learning engineers, and, last but not least, data scientists and data engineers.
Meanwhile, in the past decade or two, companies such as Google, Facebook, Snapchat, Viber, WhatsApp, and TikTok, to name just a few, have jumped onto the Big Data bandwagon, scooping up vast quantities of information collected by their applications and software in order to sell things to almost everyone who uses electronic devices. Ways of handling the seemingly overwhelming quantity of data had to be devised, and they keep evolving as the technology grows. To grow, these companies had to hire many types of data scientists.
In this article we’ll look at some of the overlap between data science and data engineering, but primarily we’ll examine the question: what is a data scientist vs a data engineer?
While data science and data engineering share some overarching similarities and functions, they are two distinctly different disciplines. So what is a Data Scientist really, and how does the role relate to that of a data engineer? Data engineers serve data scientists by providing the structure needed for their work. Data scientists depend on the systems built by data engineers to work their magic.
The flow of data is sort of like a river. Picture the Nile River as a giant flow of data. Some of it is used for agricultural irrigation, some of it for tourism, and some of it for domestic use by Egyptian citizens: drinking water, bathing, sewage, and so on. It’s also used to generate electricity by means of hydroelectric dams. Each of these uses has its own unique purpose, and different data needs to be produced to manage it. Data scientists decide what data is available, or how to create it, and how it can be used to manage the various resources and purposes. Data engineers then build the infrastructure based on the models that data scientists have created. All the irrigation canals, water treatment plants, hydroelectric dams, drinking water distribution services, and sewage removal networks are distinctly different systems, generating various types of data that can be used to manage those systems.
What is data science?
Wikipedia defines it, in part, this way:
“Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from many structured and unstructured data.”
What that means is that Data Science includes many different sub-specialties. Each sub-specialty has some overlap with the others. Some of the raw data they use is organized and some of it is not.
It can become confusing. What is a data scientist vs a machine learning engineer? What is a data scientist vs a data analyst? What about the difference between a data analyst and a data scientist, and the difference between data science and data analytics?
To put it in simpler terms: Data scientists build digital models of problems and their solutions using whatever data is available. How does one become a data scientist? One must learn to use the basic tools of the trade.
They all use the following principles and tools to accomplish their work:
The scientific method is fundamental to Data Science
All branches of science, including chemistry, biology, medical research, agriculture, and so forth, use this method to organize their work, and data science is no exception. It helps control and manage the processes and systems in a logical way in order to produce a desired result.
1. Make an observation.
2. Ask a question.
3. Form a hypothesis, or an explanation that can be tested.
4. Make a prediction according to the hypothesis.
5. Test the prediction.
6. Iterate: use the results to make another hypothesis.
7. Start the cycle over again.
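The cycle above can be sketched in a few lines of code. What follows is only an illustration under stated assumptions: the question (does a new delivery route shorten delivery times?), the sample numbers, and the function name permutation_p_value are all invented, and a simple permutation test stands in for "test the prediction".

```python
import random
from statistics import mean


def permutation_p_value(old, new, n_permutations=2000, seed=0):
    """Estimate how often a difference at least as large as the observed one
    appears when the 'old'/'new' labels are shuffled at random."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = mean(old) - mean(new)
    pooled = list(old) + list(new)
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        shuffled_old = pooled[:len(old)]
        shuffled_new = pooled[len(old):]
        if mean(shuffled_old) - mean(shuffled_new) >= observed:
            at_least_as_large += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (at_least_as_large + 1) / (n_permutations + 1)


# Steps 1-2, observation and question: deliveries seem slow; would a new route help?
# Step 3, hypothesis: the new route shortens delivery time.
# Step 4, prediction: average minutes on the new route will be lower.
old_route_minutes = [42, 45, 39, 47, 44, 41, 46, 43]  # invented data
new_route_minutes = [36, 38, 35, 40, 37, 34, 39, 36]  # invented data

# Step 5, test the prediction.
p_value = permutation_p_value(old_route_minutes, new_route_minutes)
observed_gain = mean(old_route_minutes) - mean(new_route_minutes)

# Steps 6-7, iterate: a small p-value supports the hypothesis; either way,
# the result feeds the next question.
print(f"average gain: {observed_gain:.1f} min, p-value: {p_value:.4f}")
```

A small p-value means that random relabeling rarely produces a gap as large as the one observed, which is the kind of evidence that feeds the next round of the cycle.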
Processes are important too
Examples of some of the processes:
While the processes are almost always unique to a specific goal, they usually have similarities as well. Each process can differ from the one before it and the one after it, according to the needed outcome.
Algorithms used in the field of Data Science
Then there are some standard algorithms that are commonly used, as well as bespoke ones tailored to the task.
Common Machine Learning Algorithms
This is a list of machine learning algorithms commonly used by data scientists. Between them, these algorithms can be applied to a very wide range of data problems:
1. Linear Regression
2. Logistic Regression
3. Decision Tree
4. SVM
5. Naive Bayes
6. kNN
7. K-Means
8. Random Forest
9. Dimensionality Reduction Algorithms
10. Gradient Boosting algorithms
    a. GBM
    b. XGBoost
    c. LightGBM
    d. CatBoost
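To make the first entry on the list concrete, here is a minimal sketch of simple linear regression fitted by ordinary least squares, using only the Python standard library. The function name fit_line and the sample numbers are invented for illustration; in practice a data scientist would more likely reach for an established library.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope = co-variation of x and y divided by the variation of x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Invented example: advertising spend (in thousands) vs units sold.
ad_spend = [1, 2, 3, 4, 5]
units_sold = [12, 14, 16, 18, 20]  # exactly 10 + 2 * spend, to keep the check easy

slope, intercept = fit_line(ad_spend, units_sold)
predicted = slope * 6 + intercept  # prediction for a spend of 6
print(slope, intercept, predicted)
```

Because the invented data lie exactly on a line, the fit recovers a slope of 2 and an intercept of 10; real data would scatter around the fitted line.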
Systems must be created to deal with various issues
Then there are the systems. As in the example given at the beginning of this article, about the Nile River, each situation can have various one-of-a-kind networks of interconnected systems to solve problems and to manage a wide variety of issues.
From scientific exploration and discovery done by companies such as SpaceX to the businesses that bring products to your doorstep, like Amazon and the US Postal Service, data science is now used to manage those services and to make them more efficient both logistically and economically.
Here are some (but not all) of the systems used by Amazon for example:
Supply chain optimization (I) - Sites are chosen according to what location will make the rest of the process better, time-wise and economically.
Supply chain optimization (II) - Routes are laid out to make the process of delivery better, faster, and less expensive.
Supply chain optimization (III) - Fuel fill-ups and mechanical upkeep are optimized for the tractor-trailers that move the goods from the factories to the warehouses.
Supply chain optimization (IV) - Traffic jams and other hindrances are avoided for both tractor-trailers and delivery vehicles.
Pricing and profit optimization - Products are grouped into categories for pricing and accounting purposes.
Fraud detection - Credit card transactions are monitored to prevent fraud, and the systems used for purchases are made safe from hackers.
Smart search - Search engine technology helps customers find their desired products quickly and easily.
Multivariate testing - Several variations of page elements are tested at the same time to discover which combination works best for customers.
Recommendation engine - To track user activity and propose advertisements for products that they will be more likely to buy.
Customer segmentation - Used for sales and marketing purposes.
Advertising optimization - Nowadays Native Advertising uses this method to optimize ad campaigns.
Inventory forecasting - For deciding the quantity of products to keep on hand in warehouses based on historical sales and distribution data.
Sales forecasting - For predicting targets, production, and a wide range of other issues.
HR analytics - These systems do everything from predicting who will need to be hired at any given time to predicting employee theft.
Payments analytics - For making payments to authors, vendors, publishers, and others, while maximizing profits.
Competitive analysis - For tracking competitor activity, and detecting current and future trends in the field.
Ad Relevancy Algorithm - For deciding which ads do better, and which web pages to publish them on.
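As a rough illustration of the inventory forecasting item above, the sketch below forecasts next week's demand from historical sales with a plain moving average and turns it into a reorder quantity. This is not a description of Amazon's actual method; the function names, the four-week window, the safety stock, and the sales figures are all assumptions made for the example.

```python
def moving_average_forecast(weekly_sales, window=4):
    """Forecast next week's demand as the average of the last `window` weeks."""
    if len(weekly_sales) < window:
        raise ValueError("need at least `window` weeks of history")
    recent = weekly_sales[-window:]
    return sum(recent) / window


def reorder_quantity(forecast, on_hand, safety_stock=10):
    """Units to order so that stock covers the forecast plus a safety buffer."""
    return max(0, round(forecast + safety_stock - on_hand))


weekly_sales = [120, 135, 128, 140, 150, 145]  # invented history
forecast = moving_average_forecast(weekly_sales)
order = reorder_quantity(forecast, on_hand=60)
print(forecast, order)
```

A production system would account for seasonality, promotions, and delivery lead times, but the shape is the same: historical data in, a quantity to keep on hand out.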
Watch these YouTube videos for more:
https://www.youtube.com/watch?v=X3paOmcrTjQ&ab_channel=Simplilearn
https://youtu.be/xC-c7E5PK0Y
Now that we have discussed what data scientists do, let’s look into data engineering and see what some of the considerations are, and some of the ways that data engineers deliver their parts of the solutions. What is Data Engineering?
Data Engineering
Let us use the Wikipedia definition again for the sake of comparison:
“Information engineering (IE), also known as Information technology engineering (ITE), information engineering methodology (IEM) or data engineering, is a software engineering approach to designing and developing Information Systems.”
Wow! It has a lot of different names. How does one become a data engineer? What really is a data engineer’s job description? What can an entry-level data engineer expect to do?
Explained in simpler terms: Data engineers design and build the various systems that will actually collect, store, process, and produce some useful output.
Let’s start with some of the standard skills and tools they use to do their jobs.
Main Data Engineering Skills.
Data Engineering.
Basic Language: Python.
Extensive Knowledge of Operating Systems.
Complete Database Knowledge – SQL and NoSQL.
Data Warehousing and Big Data Tools – Hadoop, MapReduce, Hive, Pig, Apache Spark, Kafka.
Basic Machine Learning Familiarity.
Two of the most important elements of a solution that a data engineer creates are storage, and which OS to use.
Whether it’s for an Amazon warehouse or a local accounting firm, it is critical not only to store information, but also to make sure it is automatically backed up using more than one state-of-the-art technique, and to design systems so that various users can retrieve data in a useful and interactive format. What are some of the data engineering definitions? We’ll learn some of them next.
Data storage that is organized for analysis and reporting is often referred to as Data Warehousing.
Another vital part of the scheme is often what is referred to as an ETL process.
ETL stands for Extract, Transform, and Load. Extraction is often performed by sensors: in a production process such as sheet glass-making, for example, a sensor detects flaws in the glass so that they can be dealt with further down the line. The data needs to be sent from the sensor to the storage device (for documentation), and then processed and passed on to whatever mechanism is used to remove the imperfection.
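The glass-making example can be sketched as a tiny ETL pipeline. This is a minimal illustration, not a production design: the reading format, the 0.5 mm flaw threshold, and the table name glass_inspections are all assumptions, and an in-memory SQLite database stands in for the storage device.

```python
import sqlite3

# Hypothetical raw sensor output: (sheet_id, flaw_size_mm as text).
RAW_READINGS = [("S-001", "0.0"), ("S-002", "1.7"), ("S-003", "bad"), ("S-004", "0.4")]
FLAW_THRESHOLD_MM = 0.5  # assumed cut-off for sending a sheet to rework


def extract():
    """Extract: in a real plant this would read from the sensor or a message queue."""
    return list(RAW_READINGS)


def transform(rows):
    """Transform: parse values, flag flawed sheets, set aside unreadable rows."""
    clean, rejected = [], []
    for sheet_id, raw_size in rows:
        try:
            size = float(raw_size)
        except ValueError:
            rejected.append((sheet_id, raw_size))  # flagged for human review
            continue
        clean.append((sheet_id, size, int(size > FLAW_THRESHOLD_MM)))
    return clean, rejected


def load(conn, rows):
    """Load: store the cleaned rows so downstream steps can query them."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS glass_inspections "
        "(sheet_id TEXT PRIMARY KEY, flaw_size_mm REAL, needs_rework INTEGER)"
    )
    conn.executemany("INSERT OR REPLACE INTO glass_inspections VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
clean_rows, rejected_rows = transform(extract())
load(conn, clean_rows)
rework = [r[0] for r in conn.execute(
    "SELECT sheet_id FROM glass_inspections WHERE needs_rework = 1")]
print(rework, rejected_rows)
```

Keeping the three stages as separate functions means each can be tested and replaced on its own, and unreadable readings are set aside for review instead of silently dropped.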
Deciding on the actual sensor and which programming language to use is a main consideration for the Data Engineer. Some languages work better for some things and other languages for others.
Two common choices for ETL are SQL and languages that run on the JVM, such as Java and Scala.
In addition to ETL, the software also needs to be able to flag errors in the detection process itself (in the case of glass-making, for example) for automatic or human intervention.
Some vital considerations for an ETL process are:
Configuration: To accurately program the data pipeline.
UI, Monitoring, Alerts: To catch errors in the ETL process.
Backfilling: To redo whatever the ETL process was recording at the time of the error.
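The three considerations above can be illustrated together. In this sketch, which is hypothetical throughout, the date range acts as the configuration, a printed ALERT line stands in for monitoring, and the failed days are rerun afterwards as a backfill. The function names and the simulated failure are invented for the example.

```python
from datetime import date, timedelta


def run_for_day(day, failing_days):
    """Stand-in for one daily ETL run; raises on days listed as failing."""
    if day in failing_days:
        raise RuntimeError(f"sensor feed unavailable on {day}")
    return f"loaded {day.isoformat()}"


def run_range(start, end, failing_days=frozenset()):
    """Run every day from start to end inclusive; return (successes, failed days)."""
    done, failed = [], []
    day = start
    while day <= end:
        try:
            done.append(run_for_day(day, failing_days))
        except RuntimeError as err:
            print(f"ALERT: {err}")  # monitoring hook; a real system would notify someone
            failed.append(day)
        day += timedelta(days=1)
    return done, failed


# First pass: one day fails and is reported.
start, end = date(2024, 1, 1), date(2024, 1, 3)
done, failed = run_range(start, end, failing_days={date(2024, 1, 2)})

# Backfill: once the fault is fixed, rerun only the days that failed.
backfilled = [run_for_day(day, failing_days=set()) for day in failed]
print(done, backfilled)
```

Recording which days failed, instead of stopping at the first error, is what makes the later backfill cheap: only the missing days are redone.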
Sensors, Data Scientists, Data Engineers, and the IoT
Sensors have been used in production processes for a long time, but in the age of computers many of them have been updated. Instead of producing a surge of electricity, the expansion of a metal to complete a circuit, or the heating and expansion of some fluid (like mercury), they now produce packets of data.
The newest application is the IoT (Internet of Things). If you happen to have been living in a cave for the past ten or twenty years, isolated from the world, let me briefly explain what it is. We now have an internet which is not for people, but for the sensors of things. Smart houses, cars, appliances: whatever runs on electricity and has moving parts can be found in the realm of IoT these days. Many of these applications require Data Scientists and Data Engineers to create the devices and systems that function on it. The Data Scientists will, of course, build the models, and the Data Engineers will build the systems to make them work.
Here is a list of some of the most common sensors that have been digitized, and are used by Data Engineers in today’s projects:
Temperature sensors
Proximity sensors
Water quality sensors
Chemical sensors
Gas sensors
Smoke sensors
Level sensors
Image sensors
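To show what a "packet of data" from one of these sensors might look like, here is a small sketch that turns a reading into a JSON packet and back again. The field names, the device identifier kiln-7, and the choice of JSON are all assumptions made for illustration; real devices use a variety of formats and protocols.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class SensorReading:
    sensor_id: str     # hypothetical device identifier
    kind: str          # e.g. "temperature", "smoke", "level"
    value: float
    unit: str
    timestamp: str     # ISO 8601 text, assumed to be set by the device


def to_packet(reading):
    """Serialize a reading into a compact JSON packet (bytes) for sending."""
    return json.dumps(asdict(reading), separators=(",", ":")).encode("utf-8")


def from_packet(packet):
    """Rebuild the reading on the receiving side."""
    return SensorReading(**json.loads(packet.decode("utf-8")))


reading = SensorReading("kiln-7", "temperature", 812.5, "C", "2024-01-01T08:00:00Z")
packet = to_packet(reading)
restored = from_packet(packet)
print(packet)
```

Agreeing on one packet layout for every sensor type is what lets a single pipeline ingest readings from temperature, gas, and level sensors alike.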
Take a look at this video for more dynamic examples:
https://youtu.be/a_rhr4jtZtY
Using these and other sensors, here are but a few of the things Data Engineers create:
Data Pipelines
Data Modeling for Streaming Platforms
Data Modeling
Data Lakes
Data Warehouses
Examples of Data Engineering Projects by Market Leaders
Now let’s examine some examples of Data Engineering to see what exciting projects are underway at market leaders.
General Motors
Everyone has heard of the robotic assembly line used to produce cars at GM (an example of data engineering in itself), but have you heard about the plans to electrify driving in China by 2025?
In 2010 GM’s vision of the future for China was revealed at the World Expo in Shanghai. The company plans to invest $20M in infrastructure to electrify driving there, not only in Shanghai, but nationwide. The plans include not only the building of roads and the manufacture of cars, but will extend to recharging stations, and databases needed to automate the whole system, including logistics for moving consumer goods around the nation. It’s part of its ‘Drive to 2030’ Vision. Although full autonomous driving capabilities aren’t part of the plan, a form of driver assist is. This is part of China’s commitment to reduce its carbon output.
SpaceX
Virtually every system on SpaceX vehicles such as the Falcon Heavy rocket and the Crew Dragon capsule required a team of Data Scientists and a team of Data Engineers. From the propulsion system to the toilet, everything produces data that is critical and needs to be managed. Some of these are important life and safety issues, and some are health and recycling issues.
Of course, the ultimate goal is to put people on Mars. This will greatly expand the roles of Data Scientists and Data Engineers. For now, the aim is just to test and perfect the systems over short durations, but when humans are able to make the journey to the red planet there is going to be a lot more data that needs to be managed. Everything from food, to psychological issues, to entertainment is going to produce data that will be managed by systems designed by Data Engineers.
https://www.express.co.uk/news/science/-/spacex-crew-dragon-iss-how-do-astronauts-go-to-the-toilet
The Future of Data Science and Engineering
It all really started with the tabulating machine invented by Herman Hollerith for the 1890 US Census. Little did Mr. Hollerith know the world of data dependence that he would help create, and how it would manifest itself 130 years later. The transition from vacuum tubes to transistors, largely complete by the 1960s, marked the puberty of the Data Era. The launch of the ARPANET in 1969, funded by the US Department of Defense, and the expansion of that system in the 1990s marked the coming of age, or the adulthood, of the information age. The need for vast amounts of data that it spawned became the narcotic of our society.
Whatever your position on the modern state of data, its collection and use, for good or for bad (in my mind the good outweighs the bad), there are several undeniable truths. It isn’t going away; Big Data has become ubiquitous and the need to manage it fundamental and critical.
It has created dozens of new job titles since the 1970s, and will continue to do so in the future. With the IoT in full swing, and Quantum Computing on the horizon, surely Data Science and Data Engineering are bound to rush forward hand in hand. A dozen more sub-specialties have already been created in the past decade. They have blended and mixed into an almost homogenized field. Who knows what the sub-categories of the future will be?
In researching this piece, I found many articles in which Data Engineers referred to themselves as Data Scientists and vice versa. However, this was due to the perception that they had to be masters of both, and not to a fusion of the two jobs. Although experts might become both, the two fields, Data Science and Data Engineering, will always rightly be divided into separate disciplines.
I hope this article has explained the differences, and that it has been entertaining, informative, and hopeful as well.