Climate Data Analysis and Prediction Using Machine Learning
Individual Project
[University Name]
[Department Name]
ABSTRACT
Climate change is one of the most pressing global concerns of our time, and
understanding the dynamics of greenhouse gases and their influence on global
temperatures is critical. The goal of this research is to use machine learning
algorithms to assess past climate data and anticipate future trends. The dataset
contains a variety of climate-related information drawn from several sources,
including greenhouse gas concentrations, temperature records, and radiative forcing.
Statement of Originality
This is to certify that, except where specific reference is made, the work described
within this project is the result of an investigation carried out by myself, and that
neither this project, nor any part of it, has been submitted in candidature for any
award other than that presently being studied.
Any material taken from published texts or computerized sources has been fully
referenced, and I fully realize the consequences of plagiarizing any of these sources.
Student Name:
Student Signature:
Registered Course of Study: Computer Science - ...........................
Date of Signing: ...................................
TABLE OF CONTENTS
Number………………………………………………………………...Page
ABSTRACT ................................................................................................. i
Statement of Originality .............................................................................. ii
TABLE OF CONTENTS ...........................................................................iii
LIST OF FIGURES ..................................................................................... v
LIST OF TABLES ...................................................................................... vi
CHAPTER 1: INTRODUCTION ................................................................ 8
1.1 Introduction ........................................................................................ 8
1.2 Problem Statement ............................................................................. 9
1.3 Background & Scope ......................................................................... 9
1.4 Aims & Objectives ........................................................................... 10
1.4.1 Aim ............................................................................................ 10
1.4.2 Objectives .................................................................................. 10
1.4.3 Industrial Context ...................................................................... 11
1.5 Deliverables ...................................................................................... 12
1.6 Tools Used........................................................................................ 12
CHAPTER 2: BACKGROUND AND LITERATURE REVIEW ............ 13
2.1 Persistence of climate changes due to a range of greenhouse gases ... 13
2.2 Biofuels, Greenhouse Gases and Climate Change .............................. 16
CHAPTER 3: METHODOLOGY ............................................................ 18
3.1 Machine Learning ............................................................................ 18
3.2 Development Methods and Tools .................................................... 20
3.2.1 Pandas ........................................................................................ 22
3.2.2 NumPy ....................................................................................... 22
3.2.3 Matplotlib................................................................................... 22
3.2.4 Scikit-Learn ............................................................................... 22
CHAPTER 4: PROJECT MANAGEMENT ............................................. 23
4.1 Milestone Analysis ........................................................................... 23
4.2 Analysis and Deliverables ................................................................ 24
4.3 Physical Resources ........................................................................... 27
CHAPTER 5: IMPLEMENTATION ........................................................ 27
5.1 Label Encoding ................................................................................ 27
5.1.1 Advantages of Label Encoding: ................................................. 27
5.2 Regression Models ........................................................................... 28
5.2.1 Decision Tree Regressor ............................................................ 28
5.2.2 Random Forest Regression ........................................................ 29
5.2.3 Gradient Boosting Machines ..................................................... 30
5.3 Steps of Model Building .................................................................. 31
5.3.1 Dataset ....................................................................................... 31
5.3.2 Splitting the Dataset ................................................................... 31
5.3.3 Training and Testing .................................................................. 31
5.3.4 Supervised Machine Learning ................................................... 32
5.3.5 Regression Models ..................................................................... 32
5.5 Evaluation Methods.......................................................................... 32
5.5.1 Mean Absolute Error (MAE) ..................................................... 32
5.5.2 Mean Squared Error (MSE) ....................................................... 32
5.5.3 Root Mean Squared Error (RMSE) ........................................... 33
5.6 Data Collection ................................................................................. 33
5.7 Data Pre-processing and Feature Engineering ................................. 34
5.8 Data Visualization ............................................................................ 34
5.9 Training and Testing of Different Models on CO2 Dataset ............. 37
5.9.1 Using Support Vector Machine Algorithm ................................ 38
5.9.1.1 Support Vector Machine Algorithm’s Training Evaluation: .. 38
5.9.2 Using Random Forest Algorithm ............................................... 39
5.9.2.1 Random Forest Training Evaluation: ...................................... 39
5.9.3 Using Time Series Model (Simple Moving Average) ............... 40
5.9.3.1 Time Series Model (Simple Moving Average) Training
Evaluation: .......................................................................................... 41
5.10 Training and Testing of Different Models on Methane Dataset .... 41
5.10.1 Using Support Vector Machine Algorithm ............................... 42
5.10.1.1 Support Vector Machine Algorithm’s Training Evaluation: .. 42
5.10.2 Using Random Forest Algorithm .............................................. 43
5.10.2.1 Random Forest Training Evaluation: ..................................... 43
5.10.3 Using Time Series Model (Simple Moving Average) ............... 44
5.10.3.1 Time Series Model (Simple Moving Average) Training
Evaluation: .......................................................................................... 44
5.11 Training and Testing of Different Models on Earth Global
Temperature Dataset .............................................................................. 45
5.11.1 Using Support Vector Machine Algorithm ............................... 46
5.11.1.1 Support Vector Machine Algorithm’s Training Evaluation: .. 46
5.11.2 Using Random Forest Algorithm .............................................. 47
5.11.2.1 Random Forest Training Evaluation: ..................................... 47
5.12 Training and Testing of Different Models on Concentration of Ozone
Dataset .................................................................................................... 48
5.12.1 Using Support Vector Machine Algorithm ............................... 49
5.12.1.1 Support Vector Machine Algorithm’s Training Evaluation: .. 49
5.12.1.2 Support Vector Machine Algorithm’s Training Evaluation: .. 49
5.12.1.3 Support Vector Machine Algorithm’s Training Evaluation: .. 50
5.12.2 Using Random Forest Algorithm .............................................. 51
5.12.2.1 Random Forest Training Evaluation: ..................................... 51
5.12.2.2 Random Forest Training Evaluation: ..................................... 52
5.12.2.3 Random Forest Training Evaluation: ..................................... 52
5.13 Prediction using Trained Models ................................................... 53
5.13.1 Forecasting the Concentration of Carbon Dioxide (CO2) ....... 53
5.13.2 Forecasting the Concentration of Methane (CH4) ................... 56
5.13.3 Forecasting the Earth Global Temperature .............................. 58
5.13.4 Forecasting the Concentration of Ozone.................................. 61
5.14 Limitations of Regression Models ................................................. 65
5.14.1 Limitation of Decision Tree Algorithm ................................... 65
5.14.2 Limitations of Random Forest ................................................. 65
5.14.3 Limitations of Gradient Boosting Machine (GBM)................. 65
5.15 Source Code of the Project ............................................................. 65
CONCLUSION & FUTURE WORK ........................................................ 70
REFERENCES .......................................................................................... 72
LIST OF FIGURES
Number………………………………………………………………………. Page
Figure 2.1. 1 Surface Warming Temperature ........................................................ 15
Figure 3.1. 1 Types of Machine Learning .............................................................. 19
Figure 3.2. 1 Software Development Cycle ........................................................... 21
Figure 3.2. 2 Iterative Model ................................................................................. 21
Figure 4.2. 1 Forecasting System flow diagram. ................................................... 25
Figure 5.2.1. 1 Decision Tree Working Diagram .................................................. 28
Figure 5.2.2. 1 Working of Random Forest ........................................................... 30
Figure 5.2.3. 1 Working of GBM........................................................................... 31
Figure 5.8. 1 Concentration of Greenhouse Gas Radiative Forcing ...................... 35
Figure 5.8. 2 Ozone Layer Data Over the Years .................................................... 36
Figure 5.8. 3 Concentration of Halogen Compounds Over Time .......................... 37
Figure 5.13.1. 1 Prediction of CO2 Concentration for 5 years .............................. 55
Figure 5.13.2. 1 Prediction of CH4 Concentration for 5 years .............................. 58
Figure 5.13.3. 1 Land Average Temperature Prediction Graph ............................. 61
Figure 5.13.4. 1 Total column: SBUV Concentration Graph ................................ 64
LIST OF TABLES
Number ............................................................................................................. Page
Table 4.1. 1 Table Milestone Analysis .................................................................. 24
Table 5.6. 1 Sample of Dataset .............................................................................. 34
Table 5.9.1.1. 1 SVM Training Evaluation for CO2.............................................. 38
Table 5.9.2.1. 1 Random Forest Training Evaluation for CO2 .............................. 39
Table 5.9.3.1. 1 Simple Moving Average Training for CO2 ................................. 41
Table 5.10.1.1. 1 SVM Evaluation for Methane .................................................... 42
Table 5.10.2.1. 1 Random Forest Training Evaluation for Methane ..................... 43
Table 5.10.3.1. 1 Evaluation of Simple Moving Average for Methane ................. 45
Table 5.11.1.1. 1 SVM Training for Global Temperature ..................................... 46
Table 5.11.2.1. 1 Random Forest Evaluation for Global Temperature .................. 47
Table 5.12.1.1. 1 SVM Evaluation for Ozone ....................................................... 49
Table 5.12.1.2. 1 SVM Evaluation for Ozone on Troposphere ............................. 49
Table 5.12.1.3. 1 SVM Evaluation for Ozone on Stratosphere ............................. 50
Table 5.12.2.1. 1 Random Forest Evaluation for Ozone on SBUV ....................... 51
Table 5.12.2.2. 1 Random Forest Evaluation for Ozone on Troposphere ............. 52
Table 5.12.2.3. 1 Random Forest Evaluation for Ozone on Stratosphere.............. 52
Table 5.13.1. 1 Prediction of CO2 concentration levels ........................................ 54
Table 5.13.2. 1 Prediction of CH4 concentration levels ........................................ 57
Table 5.13.3. 1 Land Average Temperature Prediction (2021 to 2025) ................ 59
Table 5.13.4. 1 Total column: SBUV Concentration Prediction ........................... 62
CHAPTER 1: INTRODUCTION
1.1 Introduction
One of the most pressing international problems of the twenty-first century is
climate change, which is fuelled by a variety of environmental causes. Making wise
judgments and developing practical methods to lessen the negative effects of climate
change requires an understanding of the complex interrelationships between
greenhouse gas concentrations, temperature changes, and radiative forcing. This
research explores the analysis and forecasting of climate data, utilizing machine
learning to understand past environmental trends and provide accurate predictions.
Climate change is our era's defining global concern, with far-reaching implications
for the planet's ecosystems, communities, and economies. Making educated decisions
and creating effective climate policies requires an understanding of the intricate
interactions between the many elements that contribute to climate change, such as
greenhouse gas concentrations, temperature changes, and radiative forcing.
Climate change refers to the periodic alteration of the Earth's climate because of
variations in the atmosphere as well as interactions between the atmosphere and
numerous geologic, chemical, biological, and geographic elements that are part of
the Earth system.
The atmosphere is a fluid that is always moving and active. Solar radiation,
continent positions, ocean currents, the location and orientation of mountain ranges,
atmospheric chemistry, and vegetation on the land surface are just a few of the
variables that affect the planet's physical characteristics as well as its rate and
direction of motion.
The goal of this project, "Climate Data Analysis and Prediction Using Machine
Learning," is to conduct a thorough investigation of historical climate data to
identify previous trends and create forecasts for the future. This project attempts to
decipher the complex linkages within climate data and deliver insightful
information about the Earth's changing climate by leveraging the capabilities of
machine learning algorithms.
1.2 Problem Statement
The challenge is to use historical climate records and powerful machine learning
algorithms to assess previous climate patterns, anticipate future climate scenarios,
and give significant insights into the dynamics of climate change.
By developing a data science project in Python using Jupyter Notebook, we will
illustrate in this paper how we intend to approach this problem.
The project, which is focused on the pressing subject of climate change, aims to use
cutting-edge machine learning methods and historical climate information to
address this challenge.
1.3 Background & Scope
Climate change is one of the world's most critical issues today. The general
consensus among scientists is that the Earth's climate is changing significantly
because of human activity, notably the production of greenhouse gases (GHGs)
including carbon dioxide (CO2), methane (CH4), and nitrous oxide (N2O). The
effects of these changes include altered ecosystems, increased frequency and
severity of extreme weather events, melting ice caps, and rising global
temperatures.
Climate science, environmental data analysis, and cutting-edge machine learning
methods must all be used in a multidisciplinary manner to comprehend and solve
climate change. To do this, this research makes full use of data-driven insights and
predictive modelling to analyse past climate data in-depth and anticipate future
climatic conditions.
The project's backbone is the availability of enormous datasets that have been
gathered over many years and encompass key climate indicators including GHG
concentrations, global temperatures, radiative forcing, and more. These databases
give academics and decision-makers crucial insights into the dynamics of climate
change, enabling them to make wise choices and create efficient mitigation and
adaptation plans.
The initiative also acknowledges the value of data visualization in explaining
complex climate information to a wider audience. Visual representations of climate
data and model forecasts are crucial to increase public understanding, educate the
public, and inspire collective action in the fight against climate change.
1.4 Aims & Objectives
1.4.1 Aim
The goal of this project is to apply advanced data analysis and machine learning
techniques to conduct a comprehensive investigation of historical climate data, with
a major focus on greenhouse gas concentrations, global temperature changes, and
radiative forcing. The project takes a diverse approach to obtain better
understanding of the mechanisms behind climate change and its causes, ultimately
advancing knowledge of the planet's changing climate. The project's goal is to offer
useful tools and insights that can help climate scientists, policymakers, and the
public in solving the urgent issues of climate change mitigation and adaptation. This
is done by using the power of data-driven modelling and visualization.
1.4.2 Objectives
The following are the key goals of the project on climate data analysis and
machine learning models:
• Develop an effective climate analysis system using cutting-edge machine
learning techniques. From historical climate data, our system ought to be
able to extract intricate patterns and trends that conventional statistical
methods might not be able to identify.
• Design a flexible climate modeling system capable of integrating critical
climatic variables and factors into the analysis. These factors allow for a
thorough evaluation of climate dynamics and include greenhouse gas
concentrations, radiative forcing, temperature anomalies, oceanic data,
and more.
• Design the climate analysis system to be flexible and adaptable to
evolving climate data and scientific advancements. It should have the
capacity to incorporate new datasets and variables, ensuring that it
remains up-to-date and relevant in a rapidly changing climate research
landscape.
• Develop machine learning models capable of providing highly accurate
climate projections. These models should excel in predicting time-series
climate data by considering historical climate records as crucial input
features, enhancing the precision of future climate forecasts.
Overall, the climate data analysis and machine learning models project aspires
to enhance our understanding of climate change dynamics and provide
critical tools for informed decision-making in the face of this global
challenge.
1.4.3 Industrial Context
The project is being done within the larger industrial framework of climate science
and environmental management. The project's main areas of concentration are
climate data analysis and machine learning models, but it also has applications and
significance in many other industries and sectors that have a stake in
climate-related information and decision-making.
Accurate analysis and forecasting of climate trends are crucial for sectors that
largely rely on climate data, such as renewable energy, agriculture, construction,
insurance, and disaster management. Climate information is used by these
industries to manage building projects, plan agricultural cycles, assess risk, and
respond to natural calamities.
The project advances our understanding of climate science via research and
academics. It may be utilized by researchers and climate scientists as a tool for
investigating climate patterns, confirming climate models, and creating creative
responses to the problems associated with climate change.
This research on climate analysis and machine learning has an industrial context
that spans several industries and sectors. It acts as a fundamental part of the larger
ecosystem of multidisciplinary research, environmental management, policy
formation, and climate science. Its findings and conclusions might have an
influence on a wide range of stakeholders, including corporations, governments,
researchers, and the public. All these parties have a stake in understanding and
tackling the crucial issue of climate change.
1.5 Deliverables
A climate data analysis system based on machine learning that conducts
comprehensive climate data analysis and forecasts future climate patterns and trends
is the project's deliverable.
1.6 Tools Used
Jupyter Notebook, the Python programming language, and several machine learning
packages are the tools utilized for this project.
CHAPTER 2: BACKGROUND AND
LITERATURE REVIEW
2.1 Persistence of climate changes due to a range of greenhouse
gases
Emissions of a broad range of greenhouse gases of varying lifetimes contribute to
global climate change. Carbon dioxide displays exceptional persistence that renders
its warming nearly irreversible for more than 1,000 years. Here we show that the
warming due to non-CO2 greenhouse gases, although not irreversible, persists
notably longer than the anthropogenic changes in the greenhouse gas
concentrations themselves. We explore why the persistence of warming depends
not just on the decay of a given greenhouse gas concentration but also on climate
system behaviour, particularly the timescales of heat transfer linked to the ocean.
For carbon dioxide and methane, nonlinear optical absorption effects also play a
smaller but significant role in prolonging the warming. In effect, dampening factors
that slow temperature increase during periods of increasing concentration also slow
the loss of energy from the Earth’s climate system if radiative forcing is reduced.
Approaches to climate change mitigation through reduction of greenhouse
gas or aerosol emissions therefore should not be expected to decrease climate
change impacts as rapidly as the gas or aerosol lifetime, even for short-lived
species; such actions can have their greatest effect if undertaken soon enough to
avoid transfer of heat to the deep ocean.
Carbon dioxide, methane, nitrous oxide, and other greenhouse gases increased over
the course of the 20th century due to human activities. The human-caused increases
in these gases are the primary forcing that accounts for much of the global
warming of the past fifty years, with carbon dioxide being the most important single
radiative forcing agent (1). Recent studies have shown that the human-caused
warming linked to carbon dioxide is nearly irreversible for more than 1,000 years, even
if emissions of the gas were to cease entirely (2–5). The importance of the ocean
in taking up heat and slowing the response of the climate system to radiative forcing
changes has been noted in many studies (e.g., refs. 6 and 7). The key role of the
ocean’s thermal lag has also been highlighted by recent approaches to proposed
metrics for comparing the warming of different greenhouse gases (8, 9). Among the
observations attesting to the importance of these effects are those showing that
climate changes caused by transient volcanic aerosol loading persist for more than
5 years (7, 10), and a portion can be expected to last more than a century in the
ocean (11–13); clearly these signals persist far longer than the radiative forcing
decay timescale of about 12–18 months for the volcanic aerosol (14, 15). Thus, the
observed climate response to volcanic events suggests that some persistence of
climate change should be expected even for quite short-lived radiative forcing
perturbations. It follows that the climate changes induced by short-lived
anthropogenic greenhouse gases such as methane or hydrofluorocarbons (HFCs)
may not decrease in concert with decreases in concentration if the anthropogenic
emissions of those gases were to be eliminated. In this paper, our primary goal is to
show how different processes and timescales contribute to determining how long
the climate changes due to various greenhouse gases could be expected to remain
if anthropogenic emissions were to cease. Advances in modeling have led to
improved Atmosphere-Ocean General Circulation Models (AOGCMs) as well as
to Earth Models of Intermediate Complexity (EMICs). Although a detailed
representation of the climate system changes on regional scales can only be
provided by AOGCMs, the simpler EMICs have been shown to be useful,
particularly to examine phenomena on a global average basis. In this work, we use
the Bern 2.5CC EMIC (see Materials and Methods and Text), which has been
extensively intercompared to other EMICs and to complex AOGCMs (3, 4). It
should be noted that, although the Bern 2.5CC EMIC includes a representation of
the surface and deep ocean, it does not include processes such as ice sheet losses or
changes in the Earth’s albedo linked to evolution of vegetation. However, it is
noteworthy that this EMIC, although parameterized and simplified, includes 14
levels in the ocean; further, its global ocean heat uptake and climate sensitivity are
near the mean of available complex models, and its computed timescales for uptake
of tracers into the ocean have been shown to compare well to observations (16). A
recent study (17) explored the response of one AOGCM to a sudden stop of all
forcing, and the Bern 2.5CC EMIC shows broad similarities in computed warming
to that study (see Fig. S1), although there are also differences in detail. The climate
sensitivity (which characterizes the long-term absolute warming response to a
doubling of atmospheric carbon dioxide concentrations) is 3 °C for the model used
here. Our results should be considered illustrative and exploratory rather than fully
quantitative given the limitations of the EMIC and the uncertainties in climate
sensitivity.
Figure 2.1. 1 Surface Warming Temperature
Fig. 1 shows the computed future global warming contributions for carbon dioxide,
methane, and nitrous oxide for a midrange scenario (23) of projected future
anthropogenic emissions of these gases to 2050. Radiative forcings for all three of
these gases, and their spectral overlaps, are represented in this work using the
expressions assessed in ref. 24. In 2050, the anthropogenic emissions are stopped
entirely for illustration purposes. The figure shows nearly irreversible warming for
at least 1,000 years due to the imposed carbon dioxide increases, as in previous work.
All published studies to date, which use multiple EMICs and one AOGCM, show
largely irreversible warming due to future carbon dioxide emissions. The figure
presents the computed surface warming obtained in the Bern 2.5CC model due to
CO2, CH4, and N2O emission increases to 2050 following a "midrange" scenario
(called A1B; see ref. 23), followed by zero anthropogenic emissions thereafter. The
gases are changed sequentially in this calculation to explicitly separate the
contributions of each. The bumps shown in the calculated warming are due to
changes in ocean circulation, as in previous studies
(5, 26, 39). The main panel shows the contributions to warming due to CO2, N2O,
and CH4. The inset shows an expanded view of the warming from year 2000 to
2200.
2.2 Biofuels, Greenhouse Gases and Climate
Change
Biofuels are fuels produced from biomass, mostly in liquid form, within a time
frame sufficiently short to consider that their feedstock (biomass) can be renewed,
unlike fossil fuels. This paper reviews current and future biofuel
technologies, and their development impacts (including on the climate) within given
policy and economic frameworks. Current technologies make it possible to provide
first-generation biodiesel, ethanol, or biogas to the transport sector to be blended
with fossil fuels. Second-generation biofuels from lignocellulose, still under
development, should be available on the market by 2020. Research is active on the
improvement of their conversion efficiency. A ten-fold increase compared with
current cost-effective capacities would make them highly competitive. Within
bioenergy policies, emphasis has been put on biofuels for transportation as this
sector is fast-growing and represents a major source of anthropogenic greenhouse
gas emissions. Compared with fossil fuels, biofuel combustion can emit less
greenhouse gases throughout their life cycle, considering that part of the emitted
CO2 returns to the atmosphere from which it was fixed by photosynthesis in the first
place. Life cycle assessment (LCA) is commonly used to assess the potential
environmental impacts of biofuel chains, notably the impact on global warming.
This tool, whose holistic nature is fundamental to avoid pollution trade-offs, is a
standardised methodology that should make comparisons between biofuel and fossil
fuel chains objective and thorough. However, it is a complex and time-consuming
process, which requires lots of data, and whose methodology still lacks
harmonisation. Hence the life-cycle performances of biofuel chains vary widely in
the literature. Furthermore, LCA is a site- and time-independent tool that cannot
consider the spatial and temporal dimensions of emissions and can hardly serve as
a decision-making tool either at local or regional levels. Focusing on greenhouse
gases, emission factors used in LCAs give a rough estimate of the potential average
emissions on a national level. However, they do not consider the types of crops, soil
or management practices, for instance. Modelling the impact of local factors on the
determinism of greenhouse gas emissions can provide better estimates for LCA on
the local level, which would be the relevant scale and degree of reliability for
decision-making purposes. Nevertheless, a deeper understanding of the processes
involved, most notably N2O emissions, is still needed to improve the accuracy of
LCA. Perennial crops are a promising option for biofuels, due to their rapid and
efficient use of nitrogen, and their limited farming operations. However, the main
overall limiting factor to biofuel development will ultimately be land availability.
Given the available land areas, population growth rate and consumption behaviours,
it would be possible to reach by 2030 a global 10% biofuel share in the transport
sector, contributing to lower global greenhouse gas emissions by up to
1 GtCO2 eq per year (IEA, 2006), provided that harmonised policies ensure that
sustainability criteria for the production systems are respected worldwide.
Furthermore, policies should also be more integrative across sectors, so that changes
in energy efficiency, the automotive sector and global consumption patterns
converge towards a drastic reduction of the pressure on resources. Indeed, neither
biofuels nor other energy sources or carriers are likely to mitigate the impacts of
anthropogenic pressure on resources in a range that would compensate for this
pressure growth. Hence, the first step is to reduce this pressure by starting from the
variable that drives it up, i.e., anthropogenic consumption.
CHAPTER 3: METHODOLOGY
3.1 Machine Learning
The development of machine learning has been a breakthrough in the control of
intricate connections between inputs and related outputs. It tackles the difficulties
of handling a wide range of situations that could occur throughout the creation of a
system that analyses data to offer valuable insights. To do this, a thorough
examination and study of the aspects of the input data are necessary to train the
system successfully.
With this process, our system is trained using real-world datasets that include the
necessary elements to predict future responses accurately. The system develops
throughout training by learning from its mistakes, producing more accurate and
trustworthy outcomes.
However, it is essential to comprehend what machine learning includes and its
function before getting into the specifics of this technique. The process of building
a system that can provide outcomes based on trained and learned information is
known as machine learning. It enables algorithms to decide based on knowledge
gained, effectively allowing them to learn from the data that is accessible.
With the use of machine learning, computers may acquire knowledge without
explicit programming, giving them a human-like character by being able to
comprehend circumstances and settings in order to make wise judgments. With
applications expanding to several sectors beyond our expectations, this scientific
field has emerged as one of the most exciting technologies. Machine learning is
being actively used in many fields, making it a breakthrough development in the
study of computers.
Machine learning often comes in four flavours:
● Supervised Machine Learning
● Unsupervised Machine Learning
● Semi-Supervised Machine Learning
● Reinforcement Machine Learning
Figure 3.1. 1 Types of Machine Learning
After a thorough review of the options, I have decided to move forward with the
machine learning model approach for our climate forecasting system. This
strategy, as opposed to previous approaches, enables more convenience and
accuracy in processing the necessary data in today's data-driven environment.
I must consider all the variables and elements that have a substantial impact on
climate variations in order to develop a machine learning model that works. As a
result, we have decided to tackle this challenge using a machine learning-based
strategy.
Three distinct machine learning models, each with specific advantages, will be
tested for our climate forecasting system:
• Decision Trees: A tree-based model that divides data into branches according to
characteristics, allowing for simple interpretation and comprehension of
decision-making processes.
• Random Forest: A form of ensemble learning that blends several decision trees
to improve accuracy and reduce overfitting.
• Gradient Boosting: A method for building several weak learners successively,
each of which fixes the flaws in the one before it.
We'll also look at various models including Support Vector Machines (SVM),
Neural Networks, Time Series Analysis (ARIMA, SARIMA, etc.), Long Short-Term
Memory (LSTM), XGBoost, and Prophet.
A sizable and varied training dataset is required to create a trustworthy machine
learning model. In the coming sections of this report, we will go into more depth
about these machine learning models, highlighting their unique qualities and
possible uses in our climate forecasting system.
3.2 Development Methods and Tools
Using Python as our development platform, we have decided to design our
climate forecasting model using a machine learning-based methodology. Python
is a great option for constructing machine learning solutions and for quick
prototyping because of its widespread use and broad library support for data
science applications.
We will use a methodical, sequential approach to create our machine learning
model:
• Inspection of the dataset: Analyze the historical climate data in detail and
extract the elements necessary for precise prediction.
• Data analysis: Using Python's data science modules, investigate correlations
and trends in the climate data based on the attributes that were extracted.
• Data processing: Convert the data into a numerical data matrix that may be
used to train a machine learning model. Encode any characteristics that are not
numerical into a numerical format before training.
• Pipelines: Build pipelines that feed the processed data into three distinct
machine learning models (Decision Trees, Random Forest, and Gradient
Boosting). Set the training process for each model into motion.
• Model Evaluation: Assess the behavior of the trained models using the training
data.
Figure 3.2. 1 Software Development Cycle
Figure 3.2. 2 Iterative Model
This rigorous approach will help us create a strong climate forecasting model that
optimizes predictions while considering a variety of influencing elements.
Knowledge of Python 3 and the necessary Python libraries for data science and
machine learning is needed for the effective development of our climate analysis
system. The following are the libraries we'll be using in our implementation:
3.2.1 Pandas
A free, open-source toolkit for data manipulation, Pandas makes a variety of
data-related activities easier to do, including data entry, normalization, merging
datasets, visualization, statistical operations, analysis, and more.
3.2.2 NumPy
NumPy is a well-known toolkit that provides effective data structures for working
with arrays. Numerical computing, machine learning, data analytics, and other
applications benefit greatly from its swift and efficient processing of
multidimensional arrays.
3.2.3 Matplotlib
Matplotlib is a cross-platform statistics visualization and graphical plotting package
that easily works with NumPy. It offers an approachable replacement for MATLAB
by enabling the incorporation of visual graphs and illustrations into Python-coded
GUI systems.
3.2.4 Scikit-Learn
Scikit-Learn, sometimes referred to as sklearn, is an effective and flexible Python
toolkit for machine learning applications. It provides several different models and
techniques for classification, regression, clustering, dimensionality reduction, and
other tasks.
The Python interfaces provided by Scikit-Learn for effective data modelling are
clear and consistent. Scikit-Learn is built upon NumPy, SciPy, and Matplotlib.
Feature extraction, selection, cross-validation, ensemble techniques, supervised and
unsupervised learning algorithms, and other crucial capabilities are supported.
These libraries will enable us to efficiently handle data processing, feature
extraction, and machine learning model training. The combination of Python and
these libraries will allow us to build a strong and precise climate analysis and
forecasting system.
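As a brief, hypothetical illustration of how these libraries work together, the sketch
below loads a small table of invented annual CO2 values with Pandas, summarizes it
with NumPy, and plots it with Matplotlib; the column names and figures are
placeholders, not the project's actual data.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Invented sample data; the real project loads its climate datasets from CSV files
df = pd.DataFrame({
    "year": [2011, 2012, 2013, 2014, 2015],
    "co2_ppm": [391.6, 393.9, 396.5, 398.6, 400.8],
})

# NumPy operates directly on the underlying arrays
print("Mean CO2 (ppm):", np.mean(df["co2_ppm"].to_numpy()))

# Matplotlib renders a quick line plot of the series
plt.plot(df["year"], df["co2_ppm"], marker="o")
plt.xlabel("Year")
plt.ylabel("CO2 concentration (ppm)")
plt.title("Sample CO2 trend")
plt.show()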
CHAPTER 4: PROJECT MANAGEMENT
We used a Gantt chart to successfully monitor our progress during this project. The
Gantt chart was chosen since it is a popular project management tool used in many
different sectors. It helped in planning and coordinating project operations by
enabling us to show the timetable of various tasks and their interdependencies
visually.
The milestone system played a significant role in the planning process. Milestones
acted as important checkpoints and objectives, directing the development of our
project. Every milestone marked a significant achievement or the conclusion of an
essential stage, ensuring that we kept on course and adhered to critical deadlines.
A thorough overview of the project's schedule and deliverables was given thanks to
the collaboration between the Gantt chart and milestone system. The Gantt chart
assisted in visualizing the different jobs and their durations, allowing for efficient
resource allocation and the detection of possible bottlenecks. As the project moved
from one stage to the next, the milestone system served as strategic anchors,
denoting significant accomplishments.
This mix of tools helped us stay organized and maintain a clear direction
throughout the project's lifespan. The methodical approach, which enabled
effective communication between team members and allowed timely modifications
to project timetables and priorities, was ultimately key to the successful completion
of the climate data analysis system.
4.1 Milestone Analysis
Finding important turning points or noteworthy occasions within a project's
timetable is known as a milestone analysis. These turning points are significant
benchmarks that show development and the end of crucial stages. Project managers
may maintain the project's timeline, deal with possible concerns quickly, and
recognize achievements as they are made by monitoring and analyzing milestones.
Milestone-1 (Initial Report): In this milestone, we needed to come up with all the
research details of our project and state the methods we had decided to follow to
complete our project.
Milestone-2 (Project Report 2): In this milestone, we had to come up with a
proof-of-concept forecasting model that shows that we are on the correct track to
complete the full project.
Milestone-3 (Final Report Submission): By this milestone, we needed to come up
with the final climate data analysis model and submit our final report for the
project.
Final Presentation: Here, we finally present our work and demonstrate the climate
data analysis model that we have created.
Table 4.1. 1 Table Milestone Analysis
This project was initially divided into a total of four milestones, each with a specific
objective that needed to be accomplished. The table contains the requirements and
milestones that must be met.
The table contains the fundamental summaries that were anticipated at the
conclusion of each milestone. Now that we had established the objectives for each
milestone, we could start working on finishing the project.
4.2 Analysis and Deliverables
In this research, the dataset needed to train the machine learning models was
gathered from the Kaggle public database. It brings together historical climate
records from several sources, including greenhouse gas concentrations,
temperature measurements, and radiative forcing.
We utilize the Pandas, NumPy, and Matplotlib libraries of the Python programming
language to analyze the dataset. We estimate the various parameters and their
relationships using the statistical modeling procedures provided by these libraries.
We then use Matplotlib to visualize all these data in plots. These visually
represented findings were analyzed, and they served as input for the feature
extraction procedure described earlier in this study.
The libraries that will be utilized to create our implementation of the climate
forecasting system are those that were previously discussed. The fundamental
design strategy that was employed to create our forecasting system is described
here.
Figure 4.2. 1 Forecasting System flow diagram.
The illustration shown above depicts the complete procedure. Raw climate data is
first gathered and processed using several climatic variables and parameters. This
raw climate data is then turned into insightful information using the fundamental
algorithm and libraries such as Pandas, NumPy, Matplotlib, and scikit-learn. The
program makes use of these methods and libraries to transform the raw climate
data into a useful dataset, improving the general accuracy of the climate analysis
and machine learning system.
• Climate data here refers to all the information pertaining to the different elements
and variables that make up the climate.
• Climate variables include elements like temperature, humidity, greenhouse gas
concentrations, and other metrics that relate to the climate.
• The term "intensity" describes the magnitude or value of climatic variables at a
particular period.
• Additional information, such as the location, the timing, and certain climatic
occurrences, is included in other data.
• A database is a complete set of climatic information that includes all the variables
and factors.
• The term "raw data" describes the original, unprocessed climatic data, which
might not be useful on its own.
• The raw climatic data is processed and converted into a useful dataset using
Python tools and algorithms.
• Future climatic patterns and trends are predicted in the last stage. This is how the
program works to improve decision-making in machine learning and climate
analysis while also increasing overall accuracy.
For this application to be successful, accurate and trustworthy climatic data are
required. There may be disparities in the final forecasts if the initial data is inaccurate
or contains mistakes. Therefore, acquiring accurate and trustworthy climate data is
essential to improving the overall accuracy and efficiency of the machine learning
and climate analysis system.
4.3 Physical Resources
We will need a thorough historical dataset of climate data that includes a range of
climatic variables and characteristics. This dataset will operate as the fundamental
source of data from which we will extract all the components required for our
machine learning models to be trained.
Our main goal is to build a reliable machine learning and climate analysis system
that can anticipate future climate patterns and trends. We will be able to estimate
and predict a variety of measures connected to the climate using this system, which
is based on machine learning algorithms, improving our understanding of climate
dynamics.
We’ll use the free Jupyter Notebook environment as our Python code editor to make
our research and data analysis easier. In addition, for data processing and
manipulation, we’ll use fundamental Python modules like Pandas and NumPy, along
with Matplotlib for data visualization. With the help of these tools and libraries, we
will be able to display and evaluate the climatic data for the project’s data science
component efficiently.
CHAPTER 5: IMPLEMENTATION
5.1 Label Encoding
Label Encoding refers to converting labels into a numeric, machine-readable form.
Machine learning algorithms can then decide in a better way how those labels
should be handled. It is an important pre-processing step for structured datasets in
supervised learning.
5.1.1 Advantages of Label Encoding:
Scikit-learn provides a very efficient tool for encoding the levels of categorical
features into numeric values. LabelEncoder encodes labels with a value between 0
and n_classes-1, where n_classes is the number of distinct labels. If a label repeats,
it assigns the same value as assigned earlier.
5.2 Regression Models
5.2.1 Decision Tree Regressor
Decision trees may be used for both classification and regression applications and
are non-parametric models. Because the model is non-parametric, the number of
parameters (or weights) is not fixed in advance but is determined by the training
data rather than by the number of features in the dataset.
Decision trees benefit from this property, which makes them very adaptable and
able to handle datasets with many attributes without dramatically increasing
computing complexity.
Decision trees are flexible because, depending on the nature of the issue, they can
generate categorical (discrete) and numerical (continuous) predictions. Decision
trees output class labels for classification problems while producing numerical
values for regression tasks.
Figure 5.2.1. 1 Decision Tree Working Diagram
5.2.2 Random Forest Regression
Random forests or random decision forests is an ensemble learning method for
classification, regression and other tasks that operates by constructing a multitude
of decision trees at training time. For classification tasks, the output of the random
forest is the class selected by most trees. For regression tasks, the mean or average
prediction of the individual trees is returned. Random decision forests correct for
decision trees' habit of overfitting to their training set. Random forests generally
outperform decision trees, but their accuracy is lower than gradient boosted trees.
However, data characteristics can affect their performance.
The first algorithm for random decision forests was created in 1995 by Tin Kam Ho
using the random subspace method, which, in Ho's formulation, is a way to
implement the "stochastic discrimination" approach to classification proposed by
Eugene Kleinberg.
An extension of the algorithm was developed by Leo Breiman and Adele Cutler,
who registered "Random Forests" as a trademark in 2006 (as of 2019, owned by
Minitab, Inc.). The extension combines Breiman's "bagging" idea and random
selection of features, introduced first by Ho and later independently by Amit and
Geman, in order to construct a collection of decision trees with controlled variance.
Random forests are frequently used as "black box" models in businesses, as they
generate reasonable predictions across a wide range of data while requiring little
configuration.
Figure 5.2.2. 1 Working of Random Forest
5.2.3 Gradient Boosting Machines
The sophisticated machine learning approach known as a gradient boosting machine
(GBM) is utilized for both classification and regression problems. A powerful
predictive model is produced by GBMs, an ensemble learning technique that
successively integrates the predictions of several weak learners (usually decision
trees). Gradient Boosting Machines (GBMs) start by fitting a weak learner to the
training data; this weak learner is often a shallow decision tree with a small number
of levels.
Although this initial weak learner performs only moderately well overall, it
generates predictions. The GBM then computes the residuals, the discrepancies
between the actual target values and the predictions given by the first weak learner.
These residuals represent the mistakes or inconsistencies that must be fixed. The
GBM then fits a second weak learner to the residuals obtained in the preceding
phase. To correct the mistakes produced by the first learner, this second learner
focuses on recognizing and learning from the patterns found in the residuals.
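To make this residual-correction idea concrete, here is a small hand-rolled sketch of
a single boosting step using two shallow trees; the synthetic data and learning rate
are illustrative assumptions, not the project's actual configuration.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-dimensional regression data (illustrative only)
rng = np.random.RandomState(42)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=200)

# Stage 1: fit a shallow tree (a weak learner) to the targets
tree1 = DecisionTreeRegressor(max_depth=2).fit(X, y)
residuals = y - tree1.predict(X)  # errors left over from the first learner

# Stage 2: fit a second weak learner to those residuals
tree2 = DecisionTreeRegressor(max_depth=2).fit(X, residuals)

# Combined prediction: stage-1 output plus a damped correction from stage 2
learning_rate = 0.5
y_pred = tree1.predict(X) + learning_rate * tree2.predict(X)

print("Stage 1 MSE:", np.mean((y - tree1.predict(X)) ** 2))
print("Two-stage MSE:", np.mean((y - y_pred) ** 2))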
Figure 5.2.3. 1 Working of GBM
5.3 Steps of Model Building
5.3.1 Dataset
Climate variables, historical climate data, geographic information, emissions data,
atmospheric conditions, climate change metrics, and trend analysis are among the
features of the dataset for climate analysis and machine learning. These factors are
essential for efficient climate research and machine learning as they help us
comprehend historical climate patterns, forecast future trends, evaluate
environmental implications, and take reasoned actions to slow down climate
change.
5.3.2 Splitting the Dataset
We divide the data into training, testing, and validation parts: 80 percent of the
data is used for training, 10 percent for testing, and 10 percent for validating the
results.
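As a sketch, an 80/10/10 partition can be produced with two successive calls to
scikit-learn's train_test_split, assuming X and y hold the prepared features and
target:

from sklearn.model_selection import train_test_split

# First hold out 20% of the data, then split that holdout half-and-half
# into test and validation sets, giving an 80/10/10 partition overall.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)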
5.3.3 Training and Testing
Training is done using the training data drawn from the dataset. This training helps
the system learn the patterns and relationships in the data. Testing is done to check
whether the training phase has been successful: the testing data is used after
training to verify whether the predictions or calculations made by the machine
learning algorithm are right or wrong.
5.3.4 Supervised Machine Learning
Supervised learning is used to train an algorithm to perform the same task on new
data by extracting patterns and relationships from it. Supervised learning provides
the algorithm with example data and the corresponding results for training.
5.3.5 Regression Models
• Decision Tree
• Random Forest Regression
• Gradient Boosting Machine (GBM)
A brief sketch of fitting these three models with scikit-learn follows.
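This is a minimal sketch, assuming the X_train, y_train, X_test, and y_test splits
from above; it fits the three listed regressors with illustrative settings rather than
the project's actual hyperparameters.

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

models = {
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "GBM": GradientBoostingRegressor(n_estimators=100, random_state=42),
}

# Fit each regressor on the training split and score it on the test split
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "R^2 on test set:", model.score(X_test, y_test))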
5.5 Evaluation Methods
These evaluation methods produce numeric scores that quantify the calculations
and predictions made by the various algorithms.
5.5.1 Mean Absolute Error (MAE)
In statistics, mean absolute error (MAE) is a measure of errors between paired
observations expressing the same phenomenon. Examples of Y versus X include
comparisons of predicted versus observed, subsequent time versus initial time, and
one technique of measurement versus an alternative technique of measurement.
MAE is calculated as the sum of absolute errors divided by the sample size.
$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$   (1)
5.5.2 Mean Squared Error (MSE)
The mean squared error (MSE) or mean squared deviation (MSD) of an estimator
(of a procedure for estimating an unobserved quantity) measures the average of the
squares of the errors—that is, the average squared difference between the estimated
values and the actual value. MSE is a risk function, corresponding to the expected
value of the squared error loss.
The fact that MSE is almost always strictly positive (and not zero) is because of
randomness or because the estimator does not account for information that could
produce a more accurate estimate. In machine learning, specifically empirical risk
minimization, MSE may refer to the empirical risk (the average loss on an observed
data set), as an estimate of the true MSE (the true risk: the average loss on the actual
population distribution).
$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$   (2)
5.5.3 Root Mean Squared Error (RMSE)
The root mean square (RMS or rms) is defined as the square root of the mean
square (the arithmetic mean of the squares of a set of numbers). The RMS is also
known as the quadratic mean and is a particular case of the generalized mean with
exponent 2. RMS can also be defined for a continuously varying function in terms
of an integral of the squares of the instantaneous values during a cycle.
$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$   (3)
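As a sketch of how these three metrics are computed with scikit-learn (RMSE is
simply the square root of MSE), assuming y_test holds observed values and y_pred
a model's predictions:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # root mean squared error, per equation (3)

print(f"MAE: {mae:.3f}, MSE: {mse:.3f}, RMSE: {rmse:.3f}")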
5.6 Data Collection
To acquire data, pertinent information and datasets must be obtained from several
sources, including historical climate data, climatic variables, geographic
information, emissions data, atmospheric conditions, and other variables that may
have an influence on climate patterns and trends. A thorough data collection is
needed to train the machine learning models and build a precise system for climate
analysis and machine learning.
Each row of the sampled dataset records, for a given year, the radiative forcing
contribution of each gas (co2, ch4, n2o, cfc12, cfc11, and 15 minor gases), the
combined total, the CO2-equivalent concentration in ppm (co2_eq_ppm_total,
roughly 385 to 394 in the sampled rows), and the Annual Greenhouse Gas Index
relative to 1990 (aggi_1990_1, about 0.785 to 0.859) together with its year-on-year
percentage change (aggi_change).
Table 5.6. 1 Sample of Dataset
5.7 Data Pre-processing and Feature Engineering
Preparing and cleaning raw data so that it is appropriate for further analysis is the
first stage in the data analysis pipeline. It entails activities including dealing with
missing data, getting rid of duplicates, scaling features, and encoding categorical
variables.
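The sketch below illustrates these steps on a Pandas DataFrame; the column names
are hypothetical stand-ins for the dataset's actual fields, and df is assumed to be the
DataFrame loaded in Section 5.6.

from sklearn.preprocessing import LabelEncoder, StandardScaler

# Remove duplicate rows and fill missing values (hypothetical column names)
df = df.drop_duplicates()
df["co2"] = df["co2"].fillna(df["co2"].mean())

# Encode a hypothetical non-numeric column, then scale the numeric features
df["source"] = LabelEncoder().fit_transform(df["source"].astype(str))
df[["co2", "ch4", "n2o"]] = StandardScaler().fit_transform(df[["co2", "ch4", "n2o"]])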
5.8 Data Visualization
In climate analysis and machine learning, the process of using charts, graphs, and
other graphical representations to convey information about climate variables,
emissions, atmospheric conditions, and other relevant factors is known as data
visualization. With the help of this visual representation, stakeholders may get
crucial insights, identify patterns or trends in the climatic data, and draw informed
judgments. The correlation matrix, which shows the connections and dependencies
between different climatic parameters and the concentration of all substances,
provides a visual understanding of how different climate variables are connected
and have an influence on one another in the climate system. Making defensible
judgments on methods for mitigating and adapting to climate change is made easier
thanks to this method, which improves our knowledge of the intricate interactions
that take place inside the climate system.
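A correlation matrix of this kind can be computed and drawn in a few lines, as in
this sketch; df is assumed to be the pre-processed DataFrame of climate variables
described above.

import matplotlib.pyplot as plt

corr = df.corr(numeric_only=True)  # pairwise correlations between climate variables

fig, ax = plt.subplots()
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)), labels=corr.columns, rotation=90)
ax.set_yticks(range(len(corr.columns)), labels=corr.columns)
fig.colorbar(im, label="correlation")
ax.set_title("Correlation matrix of climate variables")
fig.tight_layout()
plt.show()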
Figure 5.8. 1 Concentration of Greenhouse Gas Radiative Forcing
The graph displays, from 1979 to 2015, the radiative forcing, expressed in Watts
per square meter (W/m2), for major greenhouse gases. Each line represents either a
distinct gas, such as CO2, CH4, N2O, CFC-12, CFC-11, and other minor gases, or
the overall radiative forcing.
The following are significant findings from the graph:
• Over time, CO2 shows a continuous rise in radiative forcing, indicating a
considerable contribution to global warming.
• Radiative forcing is affected differently by various gases, some of which
exhibit varied effects.
• The "Total" line depicts the total radiative forcing from all gases, which
sums up the effect on the planet's energy balance.
This graph illustrates how various greenhouse gases affect radiative forcing,
highlighting their contributions to climate change and global warming.
Figure 5.8. 2 Ozone Layer Data Over the Years
The line graph shows the variations in ozone concentration in the entire column,
troposphere, and stratosphere of the Earth's atmosphere over a period of years.
The y-axis shows the ozone concentration, while the x-axis shows the years from
the dataset. Each line on the graph represents one of the atmospheric layers.
These are some important conclusions to draw from the graph:
• Trends in the amount of ozone in the stratosphere, troposphere, and overall
column during the chosen years.
• Any changes or trends in the ozone content of each atmospheric layer.
• Alterations in ozone levels in various atmospheric strata and their
connection.
This representation makes it easier to see how ozone levels have changed over time
in the various layers of the atmosphere, knowledge that is required for studying the
Earth's ozone layer and its possible effects on climatic and environmental conditions.
Figure 5.8. 3 Concentration of Halogen Compounds Over Time
The line plot shows the change in concentration of several halogen compounds
over time. The graph depicts each halogen compound as a distinct line, illustrating
how its concentration has varied over time. The years are shown on the x-axis,
and the concentration of each compound on the y-axis. This depiction makes it
possible to compare concentration patterns for the various halogen compounds
over the given period.
5.9 Training and Testing of Different Models on CO2 Dataset
The dataset is split into two subsets after the data preparation step: the training set
and the testing set. A fraction of the pre-processed data is present in the training set,
which is utilized to train the machine learning model. The trained model's
performance is assessed on the testing set, which also evaluates how well it
generalizes to fresh, unseen data.
The separation into training and testing sets is crucial for avoiding overfitting,
where the model memorizes the training data but fails to perform effectively on
new data. Depending on the size of the dataset, a typical split ratio for training
and testing sets is 70-30 or 80-20.
5.9.1 Using Support Vector Machine Algorithm
Among the most effective machine learning methods, Support Vector Machines
(SVM) are utilized for classification and regression problems. They effectively
separate data points by determining the best hyperplane to optimize the margin
between various kinds of data. SVM can handle non-linear data by applying a kernel
function to translate it into a higher-dimensional space. The maximum-margin
hyperplane is chosen, improving generalization and lowering overfitting. Support
vectors, the data points closest to the hyperplane, are crucial for defining the
decision boundary. The SVM has a hyperparameter called "C" that allows
for flexible model adjustment by balancing margin width and classification
accuracy. SVM is renowned for its resilience and adaptability for both linear and
nonlinear classification problems. It can manage a variety of data sources and
classification tasks.
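As a brief, hedged illustration of the kernel and "C" hyperparameter described above (the specific values are assumptions for the example, not settings tuned for this project):

from sklearn.svm import SVR

# A small C widens the margin and regularizes more strongly;
# a large C fits the training data more tightly.
loose_model = SVR(kernel='linear', C=0.1)

# An 'rbf' kernel maps the inputs into a higher-dimensional space,
# allowing non-linear relationships to be captured.
tight_model = SVR(kernel='rbf', C=100.0)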
Here is the code for the support vector machine algorithm's evaluation.
# Imports assumed from the project setup (see Section 5.15)
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Support Vector Regression
svr_model = SVR(kernel='linear')
svr_model.fit(X_train, y_train)
svr_pred = svr_model.predict(X_test)

# Calculate mean squared error and R-squared for each model
def evaluate_model(y_true, y_pred, model_name):
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{model_name} - Mean Squared Error: {mse:.2f}, R-squared: {r2:.2f}")

evaluate_model(y_test, svr_pred, "Support Vector Regression")
5.9.1.1 Support Vector Machine Algorithm’s Training Evaluation:
Evaluation of support vector machine algorithm training.
Mean Squared Error: 13.23
R-squared: 0.97
Table 5.9.1.1. 1 SVM Training Evaluation for CO2
With a low mean squared error (MSE) of 13.23, the SVM model used in climate
analysis makes precise predictions with few errors. The model also has a high
R-squared (R2) value of 0.97, which shows that it explains 97% of the variance in
the climate data, highlighting its close fit and trustworthy forecasting ability. The
SVM model essentially demonstrates how well it predicts climate-related factors.
5.9.2 Using Random Forest Algorithm
Random Forest is an ensemble learning technique used in supervised learning. It
combines several models to solve complicated problems that might not be amenable
to a single ML model, and it can be used for both classification and regression tasks.
In Random Forest, multiple decision trees are constructed on various subsets of the
dataset. Rather than depending on just one decision tree, the method collects the
predictions from each tree and combines them (majority vote for classification,
averaging for regression) to arrive at the final prediction. The model's accuracy
increases, and the chance of overfitting decreases, as the number of trees increases.
Here is the code for the random forest algorithm's evaluation.
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Random Forest Regression
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)

# Calculate mean squared error and R-squared for each model
def evaluate_model(y_true, y_pred, model_name):
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{model_name} - Mean Squared Error: {mse:.2f}, R-squared: {r2:.2f}")

evaluate_model(y_test, rf_pred, "Random Forest Regression")
5.9.2.1 Random Forest Training Evaluation:
Evaluation report of random forest algorithm.
Mean Squared Error: 3.36
R-squared: 0.99
Table 5.9.2.1. 1 Random Forest Training Evaluation for CO2
With a Mean Squared Error (MSE) of 3.36, the predictions are very accurate, with
few errors. According to the R-squared (R2) value of 0.99, the model explains 99%
of the variation in the climatic data. This denotes a close fit and trustworthy
forecasting ability for climate-related variables.
5.9.3 Using Time Series Model (Simple Moving Average)
A mathematical method for smoothing time-series data is the moving average
algorithm. It entails taking a sequence of consecutive data points and computing
the average over a predetermined number of neighbouring points, referred to as the
"window" or "period". This averaged value is then assigned to the centre data point
inside that window. The same procedure is applied to each data point in the series,
yielding a fresh set of smoothed values.
The moving average is particularly helpful for eliminating noise or volatility in
data, making underlying trends or patterns more obvious. It is frequently used to
analyse and display data over time, aiding the detection of trends and patterns
within noisy datasets in a variety of disciplines, including finance, signal
processing, and climate studies. Depending on the particulars of the data and the
objectives of the study, different moving average variants, such as simple moving
averages (SMA) and exponential moving averages (EMA), might be used.
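To contrast the two variants named above, here is a minimal sketch using pandas on a made-up series (the values are illustrative only):

import pandas as pd

series = pd.Series([1.0, 2.0, 4.0, 3.0, 5.0, 6.0, 8.0])

sma = series.rolling(window=3).mean()  # simple moving average over a 3-point window
ema = series.ewm(span=3).mean()        # exponential moving average, recent points weighted more
print(pd.DataFrame({'value': series, 'SMA(3)': sma, 'EMA(3)': ema}))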
Here is the code for the moving average algorithm's evaluation.
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Time Series Model (Simple Moving Average)
rolling_mean = y_train.rolling(window=5).mean().iloc[-1]
ts_pred = np.full_like(y_test, fill_value=rolling_mean)

# Calculate mean squared error and R-squared for each model
def evaluate_model(y_true, y_pred, model_name):
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{model_name} - Mean Squared Error: {mse:.2f}, R-squared: {r2:.2f}")

evaluate_model(y_test, ts_pred, "Time Series (Moving Average)")
5.9.3.1 Time Series Model (Simple Moving Average) Training
Evaluation:
Evaluation results of simple moving average algorithm.
Mean Squared Error: 544.40
R-squared: -0.08
Table 5.9.3.1. 1 Simple Moving Average Training for CO2
The moving average algorithm's predictions contain large errors, as shown by the
MSE value of 544.40, and the negative R2 value of -0.08 indicates that the model
fails to capture the patterns or trends in the data. This is expected, since the
forecast is a single constant (the last rolling mean) applied to every test point.
These findings suggest that the moving average algorithm is not a good fit for this
dataset and that different modelling approaches could be more suitable for
predicting or smoothing the data.
5.10 Training and Testing of Different Models on Methane Dataset
Using several machine learning methods to create predictive models is what is meant
by "training and testing of different models" in the context of the methane dataset.
These models are trained on historical methane data during the training phase, and
their performance is evaluated during the testing phase by comparing their
predictions to measured methane concentrations, in order to determine the best
model for predicting methane levels in the provided dataset.
5.10.1 Using Support Vector Machine Algorithm
Powerful machine learning methods called Support Vector Machines (SVM) are
used to solve regression issues, such as the interpretation of data relating to methane
concentration. In the area of methane concentration prediction, SVM excels at
finding the ideal hyperplane that optimizes the margin between various data points.
The robustness and adaptability of SVM in handling both linear and nonlinear
regression problems make it appropriate for a variety of scenarios involving the
prediction of methane concentration. It can handle various data sources and
regression tasks related to methane concentration analysis.
Here is the code for the support vector machine algorithm's evaluation.
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Support Vector Regression
svr_model = SVR(kernel='linear')
svr_model.fit(X_train, y_train)
svr_pred = svr_model.predict(X_test)

# Calculate mean squared error and R-squared for each model
def evaluate_model(y_true, y_pred, model_name):
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{model_name} - Mean Squared Error: {mse:.2f}, R-squared: {r2:.2f}")

evaluate_model(y_test, svr_pred, "Support Vector Regression")
5.10.1.1 Support Vector Machine Algorithm’s Training Evaluation:
Evaluation of support vector machine algorithm training.
Mean Squared Error: 86,341,224.65
R-squared: -11,709.77
Table 5.10.1.1. 1 SVM Evaluation for Methane
The SVM model used to analyse methane concentration data shows considerable
prediction errors and a lack of explanatory power, with a strikingly high mean
squared error (MSE) of 86,341,224.65 and a markedly low R-squared (R2) value of
-11,709.77. The increased MSE, which denotes erroneous predictions with
significant departures from actual values, indicates this model performs badly. The
significantly negative R2 value suggests that the model does not satisfactorily
explain the variance in the methane concentration data, raising questions about the
model's accuracy in predicting climate-related variables.
5.10.2 Using Random Forest Algorithm
In supervised learning for the study of methane datasets, Random Forest, an
ensemble learning approach, is used. This approach combines numerous models to
tackle difficult problems that a single machine learning model might not be able to
handle well. Random Forest can be used for both classification and regression tasks
relating to methane concentration analysis.
The accuracy of the model tends to rise as the ensemble size grows while the danger
of overfitting decreases. As a result, Random Forest is particularly useful for
improving the accuracy of methane concentration estimates and decreasing the
likelihood that the methane dataset analysis would overfit.
Here is the code for the random forest algorithm's evaluation.
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Random Forest Regression
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)

# Calculate mean squared error and R-squared for each model
def evaluate_model(y_true, y_pred, model_name):
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{model_name} - Mean Squared Error: {mse:.2f}, R-squared: {r2:.2f}")

evaluate_model(y_test, rf_pred, "Random Forest Regression")
5.10.2.1 Random Forest Training Evaluation:
Evaluation report of random forest algorithm.
Mean Squared Error: 464.53
R-squared: 0.94
Table 5.10.2.1. 1 Random Forest Training Evaluation for Methane
The Random Forest algorithm's predictions for methane concentration in climate
analysis have a comparatively low Mean Squared Error (MSE) of 464.53, indicating
accurate predictions with few errors. The model successfully explains 94% of the
variation in the methane concentration data, as indicated by the R-squared (R2)
value of 0.94, suggesting a good fit and dependable forecasting capabilities for
climate-related variables. For estimating methane concentrations in the context of
climate studies, the Random Forest algorithm seems to be a reliable option.
5.10.3 Using Time Series Model (Simple Moving Average)
A mathematical method called the moving average algorithm is used in methane
concentration analysis to smooth time-series data. It entails taking a sequence of
consecutive data points on methane concentration and averaging a predetermined
number of neighbouring points, referred to as the "window" or "period". This
procedure reduces the noise and oscillations in the methane concentration data,
making it simpler to spot underlying trends and patterns.
Simple moving averages (SMA) and exponential moving averages (EMA) are two
examples of moving average variations that may be used in the analysis of methane
datasets depending on the unique properties of the data on methane concentrations
and the goals of the study. With the use of these moving average approaches, it is
possible to improve the accuracy of predictions for methane concentration as well
as get a better understanding of trends and variations in methane concentration over
time.
Here is the code for the moving average algorithms’ evaluation.
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Time Series Model (Simple Moving Average)
rolling_mean = y_train.rolling(window=5).mean().iloc[-1]
ts_pred = np.full_like(y_test, fill_value=rolling_mean)

# Calculate mean squared error and R-squared for each model
def evaluate_model(y_true, y_pred, model_name):
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{model_name} - Mean Squared Error: {mse:.2f}, R-squared: {r2:.2f}")

evaluate_model(y_test, ts_pred, "Time Series (Moving Average)")
5.10.3.1 Time Series Model (Simple Moving Average) Training
Evaluation:
Evaluation results of simple moving average algorithm.
Mean Squared Error: 8028.06
R-squared: -0.09
Table 5.10.3.1. 1 Evaluation of Simple Moving Average for Methane
With a relatively high Mean Squared Error (MSE) of 8028.06, indicating
considerable prediction errors, the Simple Moving Average method performs
suboptimally in forecasting methane concentration. Additionally, the model's
negative R-squared (R2) value of -0.09 indicates that it fails to capture the
patterns or trends in the data. These findings show that the Simple Moving Average
methodology is not suitable for this dataset, and alternative modelling methods
should be considered to improve trend capture and forecast accuracy for methane
concentration data in climate analysis.
5.11 Training and Testing of Different Models on Earth Global
Temperature Dataset
Using the "berkeley_earth_globaltemperatures" dataset, "training and testing of
different models" refers to the process of creating prediction models using a variety
of machine learning approaches. These models are trained using historical
temperature data, and during the testing phase, their performance is evaluated by
contrasting temperature forecasts with actual observations. The goal is to find the
most accurate model for predicting temperature variations in the
"berkeley_earth_globaltemperatures" dataset.
5.11.1 Using Support Vector Machine Algorithm
The study of climate data frequently makes use of Support Vector Machines (SVM),
especially for datasets like Berkeley Earth Global Temperatures. These machine
learning algorithms perform well on classification and regression problems
involving climatic variables.
SVMs are excellent in classifying data by locating the best hyperplane that optimizes
the margin between various types of climatic data. Their adaptability is
demonstrated by their proficiency in handling many kinds of climate data sources
and by the accuracy with which they identify both linear and nonlinear climate
trends.
SVMs prove to be a reliable and flexible machine learning technique in the context
of analyzing climate data using the Berkeley Earth Global Temperatures dataset.
They are important tools in climate study and prediction because they help identify
and comprehend intricate climatic patterns.
Here is the code for the support vector machine algorithm's evaluation.
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Support Vector Regression
svr_model = SVR(kernel='linear')
svr_model.fit(X_train, y_train)
svr_pred = svr_model.predict(X_test)

# Calculate mean squared error and R-squared for each model
def evaluate_model(y_true, y_pred, model_name):
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{model_name} - Mean Squared Error: {mse:.2f}, R-squared: {r2:.2f}")

evaluate_model(y_test, svr_pred, "Support Vector Regression")
5.11.1.1 Support Vector Machine Algorithm’s Training Evaluation:
Evaluation of support vector machine algorithm training.
Mean Squared Error: 18.61
R-squared: -0.00
Table 5.11.1.1. 1 SVM Training for Global Temperature
The "berkeley_earth_globaltemperatures" dataset was analysed using an SVM
model, although it shows significant prediction errors and a lack of explanatory
ability. The R-squared (R2) score is close to zero (-0.00), and the mean squared error
(MSE) is noticeably large (18.61). A high degree of forecast error and considerable
departures from actual temperature readings are both indicated by the increased
MSE. Additionally, the virtually zero R2 value raises concerns about the model's
capacity to accurately forecast temperature-related variables in the
"berkeley_earth_globaltemperatures" dataset since it suggests that the model does
not adequately explain the variance in the temperature data.
5.11.2 Using Random Forest Algorithm
In supervised learning scenarios, such as the study of climate data utilizing datasets
like Berkeley Earth Global Temperatures, Random Forest, an ensemble learning
approach, finds useful application. It harnesses the potential of combining many
models to handle complicated problems that a single machine learning model might
not be able to solve.
For classification and regression tasks in the context of analysing climate data,
Random Forest is flexible and effective. By combining the predictions of several
decision trees, it excels at increasing model accuracy. The performance of the
model tends to grow with the number of trees in the forest, while the danger of
overfitting, where the model fits noise rather than patterns, decreases. For the
Berkeley Earth Global Temperatures dataset, Random Forest is a useful technique
for improving the precision and resilience of climate data analysis and prediction.
Here is the code for the random forest algorithm's evaluation.
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Random Forest Regression
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)

# Calculate mean squared error and R-squared for each model
def evaluate_model(y_true, y_pred, model_name):
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{model_name} - Mean Squared Error: {mse:.2f}, R-squared: {r2:.2f}")

evaluate_model(y_test, rf_pred, "Random Forest Regression")
5.11.2.1 Random Forest Training Evaluation:
Evaluation report of random forest algorithm.
Mean Squared Error: 22.81
R-squared: -0.23
Table 5.11.2.1. 1 Random Forest Evaluation for Global Temperature
The "berkeley_earth_globaltemperatures" dataset's temperature forecasts made by
the Random Forest method show a remarkably low Mean Squared Error (MSE) of
22.81, suggesting a high degree of accuracy with few prediction mistakes.
Furthermore, the model successfully explains 23% of the temperature data variation,
as shown by the R-squared (R2) value of -0.23. As a result, it is possible that the
Random Forest method may not be the best option for predicting variables linked to
temperature in the "berkeley_earth_globaltemperatures" dataset given its weak
explanatory power and high MSE.
5.12 Training and Testing of Different Models on Concentration of
Ozone Dataset
Using the "concentration of ozone" dataset, the idea of "training and testing of
different models" entails the creation of prediction models using a variety of
machine learning approaches. In the testing phase, the performance of these models
is evaluated by comparing projected ozone levels with actual observations. The
models are trained using historical ozone concentration data. Our understanding of
ozone dynamics and trends will be improved by determining the model that can
anticipate ozone concentration fluctuations in the dataset with the greatest degree of
accuracy.
5.12.1 Using Support Vector Machine Algorithm
When applied to classification and regression tasks using the "concentration of
ozone" dataset, Support Vector Machines (SVM) are among the most potent
machine learning algorithms. SVM is very good at separating data points by finding
the best hyperplane that optimizes the separation between different data categories.
SVM demonstrates robustness and adaptability, addressing both linear and
nonlinear classification problems. When applied to ozone concentration analysis,
it works well for managing a variety of data sources and classification tasks.
Here is the code for the support vector machine algorithm's evaluation.
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Support Vector Regression
svr_model = SVR(kernel='linear')
svr_model.fit(X_train, y_train)
svr_pred = svr_model.predict(X_test)

# Calculate mean squared error and R-squared for each model
def evaluate_model(y_true, y_pred, model_name):
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{model_name} - Mean Squared Error: {mse:.2f}, R-squared: {r2:.2f}")

evaluate_model(y_test, svr_pred, "Support Vector Regression")
5.12.1.1 Support Vector Machine Algorithm’s Training Evaluation:
Evaluation of support vector machine algorithm training on SBUV column.
Mean Squared Error: 5.53
Mean Absolute Error: 1.91
R-squared: -0.74
Table 5.12.1.1. 1 SVM Evaluation for Ozone
An SVM model was used to assess the "concentration of ozone" dataset, in
particular the SBUV column. The results, however, reveal low explanatory ability
and significant prediction errors. The mean squared error (MSE) is noticeably high
at 5.53, indicating forecasts with considerable departures from the actual ozone
concentrations. In addition, the R-squared (R2) score is markedly negative (-0.74),
indicating that the model fails to explain the variation in the ozone concentration
data. These findings cast doubt on the SVM model's ability to predict ozone-related
variables within the "concentration of ozone" dataset.
5.12.1.2 Support Vector Machine Algorithm’s Training Evaluation:
Evaluation of support vector machine algorithm training on Troposphere column.
Mean Squared Error: 0.53
Mean Absolute Error: 0.63
R-squared: 0.64
Table 5.12.1.2. 1 SVM Evaluation for Ozone on Troposphere
With a Mean Squared Error (MSE) of 0.53 and predictions that are relatively
accurate with few mistakes, the SVM model applied to the "concentration of ozone"
dataset, notably in the Troposphere column, produces promising results. Another
indicator of prediction accuracy is the Mean Absolute Error (MAE), which is 0.63.
However, the R-squared (R2) value of 0.64 indicates that the model explains the
variation in the tropospheric ozone concentration data only moderately well.
Although not a perfect fit, this suggests that the SVM model does a respectable job
of forecasting ozone concentrations in this column of the dataset.
5.12.1.3 Support Vector Machine Algorithm’s Training Evaluation:
Evaluation of support vector machine algorithm training on Stratosphere column.
Mean Squared Error: 6.06
Mean Absolute Error: 1.99
R-squared: -0.17
Table 5.12.1.3. 1 SVM Evaluation for Ozone on Stratosphere
The SVM model, namely in the Stratosphere column, produces a noticeably larger
Mean Squared Error (MSE) of 6.06 when applied to the "concentration of ozone"
dataset. This shows that the ozone concentrations in the Stratosphere column
predicted by the model have bigger errors and greater departures from real values.
The Mean Absolute Error (MAE), a measure of the magnitude of prediction errors,
is 1.99.
In addition, the R-squared (R2) value is -0.17, indicating that the model has
difficulty explaining the variation in the data on ozone concentration in the
Stratosphere column. This low R2 value raises questions regarding the SVM model's
precision and potency in forecasting ozone levels in this dataset column. When used
with the Stratosphere column of the "concentration of ozone" dataset, the SVM
model generally seems to perform badly.
5.12.2 Using Random Forest Algorithm
Random Forest is a supervised learning approach for ensemble learning that is
commonly used in classification and regression applications. With this strategy,
numerous models are combined to tackle complicated issues that a single machine
learning model might not be able to handle well.
Using numerous decision trees, Random Forest can improve accuracy in the context
of the "concentration of ozone" dataset. The model's precision tends to rise with the
number of trees in the forest. The danger of overfitting may also be decreased by
using more trees, which improves the model's ability to generalize when forecasting
ozone concentrations.
Here is the code for the random forest algorithm's evaluation.
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Random Forest Regression
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)

# Calculate mean squared error and R-squared for each model
def evaluate_model(y_true, y_pred, model_name):
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{model_name} - Mean Squared Error: {mse:.2f}, R-squared: {r2:.2f}")

evaluate_model(y_test, rf_pred, "Random Forest Regression")
5.12.2.1 Random Forest Training Evaluation:
Evaluation report of random forest algorithm on SBUV column data.
Mean Squared Error: 5.07
Mean Absolute Error: 1.88
R-squared: -0.59
Table 5.12.2.1. 1 Random Forest Evaluation for Ozone on SBUV
For the SBUV column of the "concentration of ozone" dataset, the Random Forest
model displays a Mean Squared Error (MSE) of 5.07 and a Mean Absolute Error
(MAE) of 1.88. The R-squared (R2) score of -0.59, however, indicates that the
model fails to explain the variation in the ozone concentration data, highlighting
its limits in identifying underlying patterns or trends in the dataset. In light of these
findings, alternative modelling techniques may be considered to improve
predictions of ozone concentrations in the SBUV column.
5.12.2.2 Random Forest Training Evaluation:
Evaluation report of random forest algorithm on Troposphere column data.
Mean Squared Error: 0.30
Mean Absolute Error: 0.35
R-squared: 0.80
Table 5.12.2.2. 1 Random Forest Evaluation for Ozone on Troposphere
The Mean Squared Error (MSE) of the Random Forest model applied to the
"concentration of ozone" dataset, especially in the Troposphere column, is 0.30. This
suggests that the model's forecasts of ozone concentrations in the troposphere
column are largely correct and include few errors. The Mean Absolute Error
(MAE) of 0.35 likewise reflects the small size of the prediction errors.
The R-squared (R2) value of 0.80 further indicates that the model describes the
variance in the tropospheric ozone concentration data well. The high R2 value
shows that the Random Forest model captures the underlying patterns and trends
in the dataset, giving it strong predictive ability. Overall, the Random Forest
method appears to be a good fit for predicting ozone concentrations in the
Troposphere column of the "concentration of ozone" dataset.
5.12.2.3 Random Forest Training Evaluation:
Evaluation report of random forest algorithm on Stratosphere column data.
Mean Squared Error: 38.52
Mean Absolute Error: 3.20
R-squared: -6.44
Table 5.12.2.3. 1 Random Forest Evaluation for Ozone on Stratosphere
With a value of 38.52, the Mean Squared Error (MSE) of the Random Forest model
applied to the Stratosphere column of the "concentration of ozone" dataset is
noticeably higher. This suggests significant forecast errors and a departure from the
actual values of ozone concentration in the Stratosphere column. The extent of
prediction mistakes is also shown by the Mean Absolute Error (MAE), which is
3.20.
In addition, the R-squared (R2) value is -6.44, which indicates that the model has
difficulty explaining the variation in the data on ozone concentration in the
Stratosphere column. The Random Forest model performs poorly in this column, as
seen by the significantly negative R2 value, and it is unable to identify any
significant patterns or trends in the dataset. To enhance forecasts of ozone
concentrations in the Stratosphere column of the "concentration of ozone" dataset,
additional modelling strategies may be considered.
5.13 Prediction using Trained Models
5.13.1 Forecasting the Concentration of Carbon Dioxide (CO2)
Future CO2 levels in the atmosphere may be predicted using the trained machine
learning models. Based on input data including historical CO2 levels, potential
influencing variables (e.g., emissions, changes in land use), and other pertinent
information, the models will produce projections of CO2 concentration levels for
certain years.
The expected CO2 concentration levels over the next five years may be seen by
using the machine learning models to make projections. These forecasts are
graphically shown, demonstrating the anticipated changes in CO2 concentration.
The objective is to anticipate trends and changes in CO2 levels and to gain insight
into their possible influence on the climate over the next five years.
# Extend the year range for forecasting (e.g., predict for the next 5 years)
next_years = np.arange(X_test['year'].max() + 1, X_test['year'].max() + 6)
X_forecast = pd.DataFrame({'year': next_years})

# Use the trained models to make forecasts
lr_forecast = lr_model.predict(X_forecast)
rf_forecast = rf_model.predict(X_forecast)
svr_forecast = svr_model.predict(X_forecast)

# Time Series Model (Simple Moving Average) - continue with the rolling mean
rolling_mean = y.rolling(window=5).mean().iloc[-1]
ts_forecast = np.full(5, rolling_mean)

# Create a DataFrame to display and visualize the forecasts for the next 5 years
forecast_df = pd.DataFrame({
    'Year': next_years,
    'Linear Regression': lr_forecast,
    'Random Forest Regression': rf_forecast,
    'Support Vector Regression': svr_forecast,
    'Time Series (Moving Average)': ts_forecast
})
Here is a depiction of the predictions for the upcoming years produced by our
models.

Year | Linear Regression | Random Forest Regression | Support Vector Regression | Time Series (Moving Average)
-    | -                 | -                        | -                         | -

Table 5.13.1. 1 Prediction of CO2 concentration levels
Forecasting CO2 trends and changes will help scientists learn more about how they
could affect the climate over the next five years. Visualization for the CO2
concentration is given below.
# Set the 'Year' column as the index for plotting (as in the full listing, Section 5.15)
forecast_df.set_index('Year', inplace=True)

# Plot the forecasts for the next 5 years
plt.figure(figsize=(12, 6))
plt.plot(forecast_df.index, forecast_df['Linear Regression'], label='Linear Regression', marker='o')
plt.plot(forecast_df.index, forecast_df['Random Forest Regression'], label='Random Forest Regression', marker='o')
plt.plot(forecast_df.index, forecast_df['Support Vector Regression'], label='Support Vector Regression', marker='o')
plt.plot(forecast_df.index, forecast_df['Time Series (Moving Average)'], label='Time Series (Moving Average)', marker='o')
plt.xlabel('Year')
plt.ylabel('CO2 Concentration')
plt.title('CO2 Concentration Forecasts for the Next 5 Years')
plt.legend()
plt.grid(True)
plt.show()
Figure 5.13.1. 1 Prediction of CO2 Concentration for 5 years
The graph shows the results of four different regression techniques (Linear
Regression, Random Forest Regression, Support Vector Regression, and Time
Series Moving Average) for the years 2021 to 2025. It offers a comparison of the
estimates for each approach, each of which produces its own set of predictions.
Evaluating and comparing how well the various regression models work is
important for estimating future CO2 concentration levels.
5.13.2 Forecasting the Concentration of Methane (CH4)
Future methane (CH4) levels in the atmosphere may be predicted using the
developed machine learning models. Forecasts of methane concentration levels are
produced by these models using pertinent data, historical methane levels, and
potential influencing variables (such as emissions sources and environmental
conditions).
We can depict the predicted methane concentration levels over the following five
years by using the machine learning algorithms. These projections visually depict
the anticipated fluctuations in methane concentration. The goal is to anticipate
trends and changes in methane levels and to gain insight into how they could affect
the climate over the next five years.
# Predict for the next 5 years (2021 to 2025)
future_years = np.arange(2021, 2026).reshape(-1, 1)
lr_predictions = lr_model.predict(future_years)
rf_predictions = rf_model.predict(future_years)
svr_predictions = svr_model.predict(future_years)
ts_predictions = np.full((5,), rolling_mean)  # Use the last rolling mean for simplicity

# Create a DataFrame to store the predictions
predictions_df = pd.DataFrame({
    'Year': future_years.flatten(),
    'Linear Regression': lr_predictions,
    'Random Forest Regression': rf_predictions,
    'Support Vector Regression': svr_predictions,
    'Time Series (Moving Average)': ts_predictions
})
Here is a depiction of the predictions for the upcoming years produced by our
models.

Year | Linear Regression | Random Forest Regression | Support Vector Regression | Time Series (Moving Average)
2021 | -                 | -                        | -                         | 547.42
2022 | -                 | -                        | -                         | 547.42
2023 | -                 | -                        | -                         | 547.42
2024 | -                 | -                        | -                         | 547.42
2025 | -                 | -                        | -                         | 547.42

Table 5.13.2. 1 Prediction of CH4 concentration levels
Forecasting CH4 trends and changes will help scientists learn more about how they
could affect the climate over the next five years. Visualization for the CH4
concentration is given below.
# Set the figure size
plt.figure(figsize=(12, 8))

# Plot the predictions for each model
plt.plot(predictions_df['Year'], predictions_df['Linear Regression'], label='Linear Regression', marker='o')
plt.plot(predictions_df['Year'], predictions_df['Random Forest Regression'], label='Random Forest Regression', marker='o')
plt.plot(predictions_df['Year'], predictions_df['Support Vector Regression'], label='Support Vector Regression', marker='o')
plt.plot(predictions_df['Year'], predictions_df['Time Series (Moving Average)'], label='Time Series (Moving Average)', marker='o')

# Set labels and title
plt.xlabel('Year')
plt.ylabel('Methane Concentration')
plt.title('Methane Concentration Prediction (2021 to 2025)')

# Add a legend
plt.legend()

# Show the plot
plt.grid(True)
plt.show()
Figure 5.13.2. 1 Prediction of CH4 Concentration for 5 years
This graph uses four distinct regression methods (linear regression, random forest
regression, support vector regression, and time series moving average) to forecast
CH4 (methane) concentration levels from 2021 to 2025. Each point on the graph
shows the estimated methane concentration for a particular year and regression
technique.
The graph enables comparison of the predictions made by these regression models,
evaluating the accuracy of their forecasts of methane concentration levels. The
insight it offers into how each approach predicts methane levels over the given
period makes it easier to understand and prepare for anticipated changes in
atmospheric methane concentrations.
5.13.3 Forecasting the Earth Global Temperature
Using the created machine learning models, future atmospheric temperature levels
may be projected. These models produce forecasts of temperature levels for the
upcoming five years using pertinent information, historical temperature records,
and potential influencing variables (such as greenhouse gas concentrations and
changes in land use).
With the use of machine learning techniques, we can visualize the expected
temperature levels for the next five years. These predictions give a visual depiction
of the projected temperature swings, allowing us to estimate trends and variations
in temperature and to gain insight into their possible effects on the global climate.
Here is the code for making predictions for the next five years.
# Generate years for the next five years (2021 to 2025)
future_years = np.arange(2021, 2026).reshape(-1, 1)

# Predict temperatures for the next five years using the trained models
lr_predictions = lr_model.predict(future_years)
rf_predictions = rf_model.predict(future_years)
svr_predictions = svr_model.predict(future_years)

# Create a DataFrame to store the predictions
predictions_df = pd.DataFrame({
    'Year': future_years.flatten(),
    'Linear Regression': lr_predictions,
    'Random Forest Regression': rf_predictions,
    'Support Vector Regression': svr_predictions,
})
Here is a depiction of the predictions for the upcoming years produced by our
models.

Year | Linear Regression | Random Forest Regression | Support Vector Regression
-    | -                 | -                        | -

Table 5.13.3. 1 Land Average Temperature Prediction (2021 to 2025)
Forecasting Land Average Temperature trends and changes will help scientists learn
more about how they could affect the climate over the next five years. Visualization
for the Land Average Temperature is given below.
# Set the figure size
plt.figure(figsize=(12, 8))

# Plot the predictions for each model
plt.plot(predictions_df['Year'], predictions_df['Linear Regression'], label='Linear Regression', marker='o')
plt.plot(predictions_df['Year'], predictions_df['Random Forest Regression'], label='Random Forest Regression', marker='o')
plt.plot(predictions_df['Year'], predictions_df['Support Vector Regression'], label='Support Vector Regression', marker='o')

# Set labels and title
plt.xlabel('Year')
plt.ylabel('Land Average Temperature')
plt.title('Land Average Temperature Prediction (2021 to 2025)')

# Add a legend
plt.legend()

# Show the plot
plt.grid(True)
plt.show()
Figure 5.13.3. 1 Land Average Temperature Prediction Graph
Employing three separate regression methods (Linear Regression, Random Forest
Regression, and Support Vector Regression) provides projections of land average
temperature levels from 2021 to 2025. Each point on the graph shows the expected
land average temperature for a particular year and regression method.
The graph enables a comparison of the temperature forecasts produced by these
regression models, evaluating how accurately they estimate average land
temperatures. This information sheds light on how each approach predicts
temperature changes throughout the given period and is useful for understanding
and planning for probable variations in land temperatures in the upcoming years.
5.13.4 Forecasting the Concentration of Ozone
It is possible to predict future ozone concentrations in the atmosphere using the
created machine learning models. These models produce estimates of ozone
concentration levels for the following five years based on pertinent data, historical
records of ozone concentration, and potential influencing variables (such as
emissions and atmospheric conditions).
The projected ozone concentration levels over the following five years can be
visualized using these models. The forecasts give a visual depiction of the
anticipated ozone concentration oscillations, allowing us to anticipate trends and
changes in ozone levels and to gain insight into their prospective effects on
atmospheric composition over the next five years.
Here is the code for making predictions for the next five years.
# Generate years for the next five years (2021 to 2025)
future_years = np.arange(2021, 2026).reshape(-1, 1)

# Predict ozone concentrations for the next five years using the trained models
lr_predictions = lr_model.predict(future_years)
rf_predictions = rf_model.predict(future_years)
svr_predictions = svr_model.predict(future_years)

# Create a DataFrame to store the predictions
predictions_df = pd.DataFrame({
    'Year': future_years.flatten(),
    'Linear Regression': lr_predictions,
    'Random Forest Regression': rf_predictions,
    'Support Vector Regression': svr_predictions,
})
Here is a depiction of the predictions for the upcoming years produced by our
models.

Year | Linear Regression | Random Forest Regression | Support Vector Regression
-    | -                 | -                        | -

Table 5.13.4. 1 Total column: SBUV Concentration Prediction
Forecasting Total column: SBUV Concentration trends and changes will help
scientists learn more about how they could affect the climate over the next five years.
Visualization for the Total column: SBUV Concentration is given below.
# Set the figure size
plt.figure(figsize=(12, 8))

# Plot the predictions for each model
plt.plot(predictions_df['Year'], predictions_df['Linear Regression'], label='Linear Regression', marker='o')
plt.plot(predictions_df['Year'], predictions_df['Random Forest Regression'], label='Random Forest Regression', marker='o')
plt.plot(predictions_df['Year'], predictions_df['Support Vector Regression'], label='Support Vector Regression', marker='o')

# Set labels and title
plt.xlabel('Year')
plt.ylabel('Total column: SBUV Concentration')
plt.title('Total column: SBUV Concentration Prediction of Ozone (2021 to 2025)')

# Add a legend
plt.legend()

# Show the plot
plt.grid(True)
plt.show()
Figure 5.13.4. 1 Total column: SBUV Concentration Graph
The graph predicts the total column concentration of ozone (SBUV) from 2021 to
2025 using three distinct regression methods: Linear Regression, Random Forest
Regression, and Support Vector Regression. Each point on the graph shows the
estimated total column ozone concentration for a given year and regression
approach.
The graph allows for a comparison of the ozone concentrations predicted by these
regression models, providing an evaluation of how well each predicts ozone levels.
The insight it offers into how each technique predicts changes in the total column
concentration of ozone over the selected period is helpful for understanding and
planning for probable fluctuations in ozone levels in the upcoming years.
5.14 Limitations of Regression Models
5.14.1 Limitation of Decision Tree Algorithm
Decision tree algorithms have several drawbacks, such as a propensity to overfit the
training data, sensitivity to small changes in the data, and difficulty capturing
complicated relationships in the data.
5.14.2 Limitations of Random Forest
The main limitation of Random Forest is that a large number of trees can make the
algorithm too slow and ineffective for real-time predictions. In general, these
algorithms are fast to train but quite slow to make predictions once trained.
5.14.3 Limitations of Gradient Boosting Machine (GBM)
Due to the sequential construction of many weak learners, training is
computationally costly and time-consuming.
5.15 Source Code of the Project
Here is the complete source code of this project.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# CO2 (measured in PPM) 800,000BCE-2021 From EPA Climate Change Indicators
hist_co2 = pd.read_csv('/content/drive/MyDrive/Climate_data/ghg_concentrations_co2.csv', na_values="")
hist_co2.head()

# Methane (measured in PPB) 800,000BCE-2021 From EPA Climate Change Indicators
hist_methane = pd.read_csv('/content/drive/MyDrive/Climate_data/ghg_concentrations_methane.csv', na_values="")
hist_methane.head()

# Global Average Temperature - From Berkeley Earth
global_temp = pd.read_csv('/content/drive/MyDrive/Climate_data/berkley_earth_globaltemperatures.csv')
global_temp.head()

# Ozone concentrations
hist_ozone = pd.read_csv('/content/drive/MyDrive/Climate_data/ghg_concentrations_ozone.csv')
hist_ozone.head()
# Create a function to wrangle the 800k BCE Datasets
def hist_df_wrangle(ds, bl, gas):
    # Save station names for later reference
    stations = ds.iloc[5, 1:]
    stations.index = range(len(stations))
    stations.columns = range(len(stations))
    # Remove data set descriptions
    # (hist_n2o is assumed to be loaded in the same way as the other
    # EPA datasets; its read_csv call is not shown in this listing)
    if ds is not hist_n2o:
        ds = ds.iloc[bl:]
    ds = ds.iloc[7:]
    ds.columns = ["year"] + list(ds.columns[1:])
    # Convert to numeric
    ds = ds.apply(pd.to_numeric, errors='coerce')
    # Get one averaged value for each year
    ds["average"] = ds.iloc[:, 1:].mean(axis=1, skipna=True)
    return ds

# Wrangle CO2
hist_co2 = hist_df_wrangle(hist_co2, 1310, "co2")

# Wrangle Methane
hist_methane = hist_df_wrangle(hist_methane, 2183, "methane")
hist_methane = hist_methane.iloc[30:]

# Prepare 1750-to-current sets
hist_50_co2 = hist_co2[hist_co2["year"] >= 1750][["year", "average"]]
hist_50_methane = hist_methane[hist_methane["year"] >= 1750][["year", "average"]]
hist_50_n2o = hist_n2o[hist_n2o["year"] >= 1750][["year", "average"]]

# Global Temp Yearly Averages
global_temp["year"] = pd.to_datetime(global_temp["dt"]).dt.year
global_avg = global_temp[~global_temp["landandoceanaveragetemperature"].isna()]
global_avg = global_avg.groupby("year")["landandoceanaveragetemperature"].mean().reset_index()
global_co2 = hist_50_co2[hist_50_co2["year"] >= global_avg["year"].min()]
# Radiative forcing plot (rad_force is assumed to hold the radiative-forcing
# table used in Section 5.8; its load is not shown in this listing)
fig, ax = plt.subplots(figsize=(12, 6))
sns.set_palette("GnBu_r")

# Plot radiative forcing for each gas separately
sns.lineplot(data=rad_force, x="year", y="co2", label="CO2", ax=ax)
sns.lineplot(data=rad_force, x="year", y="ch4", label="CH4", ax=ax)
sns.lineplot(data=rad_force, x="year", y="n2o", label="N2O", ax=ax)
sns.lineplot(data=rad_force, x="year", y="cfc12", label="CFC-12", ax=ax)
sns.lineplot(data=rad_force, x="year", y="cfc11", label="CFC-11", ax=ax)
sns.lineplot(data=rad_force, x="year", y="15_minor", label="Other 15 Minor Gases", ax=ax)
sns.lineplot(data=rad_force, x="year", y="total", label="Total", ax=ax)

# Customize labels and title
ax.set_xlabel('Year (1979 - 2015)')
ax.set_ylabel('Radiative Forcing (W/m^2)')
ax.set_title('Greenhouse Gas Radiative Forcing')
ax.set_xlim(1979, 2015)
ax.set_ylim(0, 3)
ax.set_xticks(range(1980, 2015, 5))
plt.legend(title="Gas", loc="upper left", frameon=False)
plt.show()

# Ozone layer plot
plt.figure(figsize=(10, 6))
plt.plot(hist_ozone["Year"], hist_ozone["Total column: SBUV"], marker='o', label="Total Column (SBUV)")
plt.plot(hist_ozone["Year"], hist_ozone["Troposphere"], marker='o', label="Troposphere")
plt.plot(hist_ozone["Year"], hist_ozone["Stratosphere"], marker='o', label="Stratosphere")
plt.title("Ozone Layer Data Over the Years")
plt.xlabel("Year")
plt.ylabel("Ozone Concentration")
plt.legend()
plt.grid(True)
plt.show()

# Halogen compound plot (hist_halogen is assumed to hold the halogen
# concentration table used in Section 5.8; its load is not shown in this listing)
fig, ax = plt.subplots(figsize=(12, 6))

# Plot the data for each halogen compound
for column in hist_halogen.columns[1:]:
    ax.plot(hist_halogen["Year"], hist_halogen[column], label=column)

# Customize the plot
ax.set_xlabel('Year')
ax.set_ylabel('Concentration')
ax.set_title('Concentration of Halogen Compounds Over Time')
ax.legend(loc='upper left', bbox_to_anchor=(1, 1))

# Show the plot
plt.grid(True)
plt.tight_layout()
plt.show()
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm

# Load the CO2 concentration dataset (commented out; hist_co2 was loaded above)
# co2_data = pd.read_csv('/content/drive/MyDrive/Climate_data/ghg_concentrations_co2.csv')

# Split the data into features (year) and the target variable (CO2 levels)
hist_co2 = hist_co2.fillna(0)
X = hist_co2[['year']]
y = hist_co2['average']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Linear Regression
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
lr_pred = lr_model.predict(X_test)

# Random Forest Regression
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)

# Support Vector Regression
svr_model = SVR(kernel='linear')
svr_model.fit(X_train, y_train)
svr_pred = svr_model.predict(X_test)

# Time Series Model (Simple Moving Average)
rolling_mean = y_train.rolling(window=5).mean().iloc[-1]
ts_pred = np.full_like(y_test, fill_value=rolling_mean)

# Calculate mean squared error and R-squared for each model
def evaluate_model(y_true, y_pred, model_name):
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{model_name} - Mean Squared Error: {mse:.2f}, R-squared: {r2:.2f}")

evaluate_model(y_test, lr_pred, "Linear Regression")
evaluate_model(y_test, rf_pred, "Random Forest Regression")
evaluate_model(y_test, svr_pred, "Support Vector Regression")
evaluate_model(y_test, ts_pred, "Time Series (Moving Average)")

# Extend the year range for forecasting (e.g., predict for the next 5 years)
next_years = np.arange(X_test['year'].max() + 1, X_test['year'].max() + 6)
X_forecast = pd.DataFrame({'year': next_years})
# Use the trained models to make forecasts
lr_forecast = lr_model.predict(X_forecast)
rf_forecast = rf_model.predict(X_forecast)
svr_forecast = svr_model.predict(X_forecast)

# Time Series Model (Simple Moving Average) - continue with the rolling mean
rolling_mean = y.rolling(window=5).mean().iloc[-1]
ts_forecast = np.full(5, rolling_mean)

# Create a DataFrame to display and visualize the forecasts for the next 5 years
forecast_df = pd.DataFrame({
    'Year': next_years,
    'Linear Regression': lr_forecast,
    'Random Forest Regression': rf_forecast,
    'Support Vector Regression': svr_forecast,
    'Time Series (Moving Average)': ts_forecast
})

# Set the 'Year' column as the index for plotting
forecast_df.set_index('Year', inplace=True)

# Plot the forecasts for the next 5 years
plt.figure(figsize=(12, 6))
plt.plot(forecast_df.index, forecast_df['Linear Regression'], label='Linear Regression', marker='o')
plt.plot(forecast_df.index, forecast_df['Random Forest Regression'], label='Random Forest Regression', marker='o')
plt.plot(forecast_df.index, forecast_df['Support Vector Regression'], label='Support Vector Regression', marker='o')
plt.plot(forecast_df.index, forecast_df['Time Series (Moving Average)'], label='Time Series (Moving Average)', marker='o')
plt.xlabel('Year')
plt.ylabel('CO2 Concentration')
plt.title('CO2 Concentration Forecasts for the Next 5 Years')
plt.legend()
plt.grid(True)
plt.show()
CONCLUSION & FUTURE WORK
This report highlights the development of a climate data analysis system that
employs machine learning techniques to predict future climate variables. The
primary goal of this system is to forecast various climate-related parameters, such
as greenhouse gas concentrations, temperature levels, and ozone concentrations,
with the aim of gaining valuable insights into the Earth's evolving climate.
Our implementation uses historical climate data to train machine learning models,
including details on greenhouse gas emissions, changes in land use, and other
relevant aspects. These models have proven to be capable of predicting future
climate conditions with accuracy, which has improved our knowledge of patterns
and possible effects.
We have demonstrated our capacity to produce exact predictions about factors
linked to climate through the application of machine learning models such as
Support Vector Machines, Random Forest Regression, and Time Series Analysis.
For climate research, policy development, and environmental management, these
forecasts have significant ramifications.
Future research in machine learning prediction and climate data analysis will center
on several crucial elements that will improve the system's functionality and uses. To
increase forecast accuracy, it is first necessary to extend the dataset with more
thorough and high-resolution climatic data. By incorporating data from other
sources and sensors, it will be possible to gain a more thorough knowledge of
climatic factors, which will ultimately enable the creation of projections that are
both exact and complex. It is highly promising to investigate cutting-edge machine
learning methods like deep learning and neural networks. Particularly for
complicated climate-related events, these state-of-the-art models have the potential
to provide even more precise forecasts. The system's forecasting powers may be
greatly improved by utilizing modern machine learning techniques.
Furthermore, the integration of ensemble models—multiple machine learning
algorithms combined—can alleviate the shortcomings of individual models. To
provide forecasts that are more solid and trustworthy, ensemble models combine the
advantages of many methodologies. Real-time monitoring capabilities are essential
for continually updating forecasts and guaranteeing their applicability under
changing environmental circumstances.
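As a hedged sketch of that ensemble idea, assuming the trained models from Section 5.13 (lr_model, rf_model, and svr_model) are available, an equal-weight average of their predictions could look as follows; this is an illustration, not the implementation used in this project.

import numpy as np

def ensemble_predict(X):
    # Stack the three models' predictions column-wise
    preds = np.column_stack([
        lr_model.predict(X),
        rf_model.predict(X),
        svr_model.predict(X),
    ])
    # Equal-weight average across the models
    return preds.mean(axis=1)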
Future work must include teamwork with experts from many sectors, user-friendly
visualization tools, and analyses of how the climate affects ecosystems, agriculture,
and human populations. These developments will encourage a better knowledge of
climate dynamics and enable more powerful solutions to the serious problems
brought on by climate change.
REFERENCES
[1] PNAS. Available at: https://www.pnas.org/doi/epdf/10.1073/pnas- (Accessed: 29 September 2023).
[2] Bessou, C. et al. (1970) Biofuels, Greenhouse Gases, and Climate Change, SpringerLink. Available at: https://link.springer.com/chapter/10.1007/-_20 (Accessed: 29 September 2023).
[3] National Centers for Environmental Information (NCEI) (no date) Climate Data Online: Dataset Discovery, Datasets | Climate Data Online (CDO) | National Climatic Data Center (NCDC). Available at: https://www.ncdc.noaa.gov/cdo-web/datasets (Accessed: 29 September 2023).
[4] Climate change indicators: Atmospheric concentrations of greenhouse gases. Available at: https://www.epa.gov/climate-indicators/climate-change-indicators-atmospheric-concentrations-greenhouse-gases (Accessed: 29 September 2023).
[5] National Centers for Environmental Information (NCEI) (no date) National Centers for Environmental Information (NCEI). Available at: https://www.ncei.noaa.gov/access/paleo-search/ (Accessed: 29 September 2023).
[6] (2022) Time series from scratch - moving averages (MA) theory and implementation, Medium. Available at: https://towardsdatascience.com/time-series-from-scratch-moving-averages-ma-theory-and-implementation-a01b97b60a18 (Accessed: 29 September 2023).
[7] Linear regression in machine learning (2023) GeeksforGeeks. Available at: https://www.geeksforgeeks.org/ml-linear-regression/ (Accessed: 29 September 2023).
[8] Saini, A. (2023) Decision tree algorithm - A complete guide, Analytics Vidhya. Available at: https://www.analyticsvidhya.com/blog/2021/08/decision-tree-algorithm/ (Accessed: 29 September 2023).
[9] Random Forest: A complete guide for machine learning (no date) Built In. Available at: https://builtin.com/data-science/random-forest-algorithm (Accessed: 29 September 2023).
[10] López, O.A.M., López, A.M. and Crossa, J. (1970) Support vector machines and support vector regression, SpringerLink. Available at: https://link.springer.com/chapter/10.1007/-_9 (Accessed: 29 September 2023).
[11] Goyal, S. (2021) Evaluation metrics for regression models, Medium. Available at: https://medium.com/analytics-vidhya/evaluation-metrics-for-regression-models-c91c65d73af (Accessed: 29 September 2023).
[12] Lindsey, R. (no date) Climate change: Atmospheric carbon dioxide, NOAA Climate.gov. Available at: https://www.climate.gov/news-features/understanding-climate/climate-change-atmospheric-carbon-dioxide (Accessed: 29 September 2023).
[13] Methane (2023) NASA. Available at: https://climate.nasa.gov/vital-signs/methane/ (Accessed: 29 September 2023).
[14] Data Overview (2023) Berkeley Earth. Available at: https://berkeleyearth.org/data/ (Accessed: 29 September 2023).
[15] FAQ: What is the greenhouse effect? (no date) NASA. Available at: https://climate.nasa.gov/faq/19/what-is-the-greenhouse-effect/ (Accessed: 29 September 2023).