Climate Data Analysis and Prediction Using Machine Learning
Individual Project
[University Name]
[Department Name]
ABSTRACT
Climate change is one of the most pressing global concerns of our time, and
understanding the dynamics of greenhouse gases and their influence on global
temperatures is critical. The goal of this research is to use machine learning
algorithms to assess past climate data and anticipate future trends. The dataset
contains a variety of climate-related information drawn from several sources,
including greenhouse gas concentrations, temperature records, and radiative forcing.
Statement of Originality
This is to certify that, except where specific reference is made, the work described
within this project is the result of an investigation carried out by myself, and that
neither this project, nor any part of it, has been submitted in candidature for any
award other than that presently being studied.
Any material taken from published texts or computerized sources has been fully
referenced, and I fully realize the consequences of plagiarizing any of these sources.
Student Name:
Student Signature:
Registered Course of Study: Computer Science - ...........................
Date of Signing: ...................................
TABLE OF CONTENTS
Number………………………………………………………………...Page
ABSTRACT ................................................................................................. i
Statement of Originality .............................................................................. ii
TABLE OF CONTENTS ...........................................................................iii
LIST OF FIGURES ..................................................................................... v
LIST OF TABLES ...................................................................................... vi
CHAPTER 1: INTRODUCTION ................................................................ 8
1.1 Introduction ........................................................................................ 8
1.2 Problem Statement ............................................................................. 9
1.3 Background & Scope ......................................................................... 9
1.4 Aims & Objectives ........................................................................... 10
1.4.1 Aim ............................................................................................ 10
1.4.2 Objectives .................................................................................. 10
1.4.3 Industrial Context ...................................................................... 11
1.5 Deliverables ...................................................................................... 12
1.6 Tools Used........................................................................................ 12
CHAPTER 2: BACKGROUND AND LITERATURE REVIEW ............ 13
2.1 Persistence of climate changes due to a range of greenhouse gases ... 13
2.2 Biofuels, Greenhouse Gases and Climate Change .............................. 16
CHAPTER 3: METHODOLOGY ............................................................ 18
3.1 Machine Learning ............................................................................ 18
3.2 Development Methods and Tools .................................................... 20
3.2.1 Pandas ........................................................................................ 22
3.2.2 NumPy ....................................................................................... 22
3.2.3 Matplotlib................................................................................... 22
3.2.4 Scikit-Learn ............................................................................... 22
CHAPTER 4: PROJECT MANAGEMENT ............................................. 23
4.1 Milestone Analysis ........................................................................... 23
4.2 Analysis and Deliverables ................................................................ 24
4.3 Physical Resources ........................................................................... 27
CHAPTER 5: IMPLEMENTATION ........................................................ 27
5.1 Label Encoding ................................................................................ 27
5.1.1 Advantages of Label Encoding: ................................................. 27
5.2 Regression Models ........................................................................... 28
5.2.1 Decision Tree Regressor ............................................................ 28
5.2.2 Random Forest Regression ........................................................ 29
5.2.3 Gradient Boosting Machines ..................................................... 30
5.3 Steps of Model Building .................................................................. 31
5.3.1 Dataset ....................................................................................... 31
5.3.2 Splitting the Dataset ................................................................... 31
5.3.3 Training and Testing .................................................................. 31
5.3.4 Supervised Machine Learning ................................................... 32
5.3.5 Regression Models ..................................................................... 32
5.5 Evaluation Methods.......................................................................... 32
5.5.1 Mean Absolute Error (MAE) ..................................................... 32
5.5.2 Mean Squared Error (MSE) ....................................................... 32
5.5.3 Root Mean Squared Error (RMSE) ........................................... 33
5.6 Data Collection ................................................................................. 33
5.7 Data Pre-processing and Feature Engineering ................................. 34
5.8 Data Visualization ............................................................................ 34
5.9 Training and Testing of Different Models on CO2 Dataset ............. 37
5.9.1 Using Support Vector Machine Algorithm ................................ 38
5.9.1.1 Support Vector Machine Algorithm’s Training Evaluation: .. 38
5.9.2 Using Random Forest Algorithm ............................................... 39
5.9.2.1 Random Forest Training Evaluation: ...................................... 39
5.9.3 Using Time Series Model (Simple Moving Average) ............... 40
5.9.3.1 Time Series Model (Simple Moving Average) Training
Evaluation: .......................................................................................... 41
5.10 Training and Testing of Different Models on Methane Dataset .... 41
5.10.1 Using Support Vector Machine Algorithm ............................... 42
5.10.1.1 Support Vector Machine Algorithm’s Training Evaluation: .. 42
5.10.2 Using Random Forest Algorithm .............................................. 43
5.10.2.1 Random Forest Training Evaluation: ..................................... 43
5.10.3 Using Time Series Model (Simple Moving Average) ............... 44
5.10.3.1 Time Series Model (Simple Moving Average) Training
Evaluation: .......................................................................................... 44
5.11 Training and Testing of Different Models on Earth Global
Temperature Dataset .............................................................................. 45
5.11.1 Using Support Vector Machine Algorithm ............................... 46
5.11.1.1 Support Vector Machine Algorithm’s Training Evaluation: .. 46
5.11.2 Using Random Forest Algorithm .............................................. 47
5.11.2.1 Random Forest Training Evaluation: ..................................... 47
5.12 Training and Testing of Different Models on Concentration of Ozone
Dataset .................................................................................................... 48
5.12.1 Using Support Vector Machine Algorithm ............................... 49
5.12.1.1 Support Vector Machine Algorithm’s Training Evaluation: .. 49
5.12.1.2 Support Vector Machine Algorithm’s Training Evaluation: .. 49
5.12.1.3 Support Vector Machine Algorithm’s Training Evaluation: .. 50
5.12.2 Using Random Forest Algorithm .............................................. 51
5.12.2.1 Random Forest Training Evaluation: ..................................... 51
5.12.2.2 Random Forest Training Evaluation: ..................................... 52
5.12.2.3 Random Forest Training Evaluation: ..................................... 52
5.13 Prediction using Trained Models ................................................... 53
5.13.1 Forecasting the Concentration of Carbon Dioxide (CO2) ....... 53
5.13.2 Forecasting the Concentration of Methane (CH4) ................... 56
5.13.3 Forecasting the Earth Global Temperature .............................. 58
5.13.4 Forecasting the Concentration of Ozone.................................. 61
5.14 Limitations of Regression Models ................................................. 65
5.14.1 Limitation of Decision Tree Algorithm ................................... 65
5.14.2 Limitations of Random Forest ................................................. 65
5.14.3 Limitations of Gradient Boosting Machine (GBM)................. 65
5.15 Source Code of the Project ............................................................. 65
CONCLUSION & FUTURE WORK ........................................................ 70
REFERENCES .......................................................................................... 72
LIST OF FIGURES
Number………………………………………………………………………. Page
Figure 2.1. 1 Surface Warming Temperature ........................................................ 15
Figure 3.1. 1 Types of Machine Learning .............................................................. 19
Figure 3.2. 1 Software Development Cycle ........................................................... 21
Figure 3.2. 2 Iterative Model ................................................................................. 21
Figure 4.2. 1 Forecasting System flow diagram. ................................................... 25
Figure 5.2.1. 1 Decision Tree Working Diagram .................................................. 28
Figure 5.2.2. 1 Working of Random Forest ........................................................... 30
Figure 5.2.3. 1 Working of GBM........................................................................... 31
Figure 5.8. 1 Concentration of Greenhouse Gas Radiative Forcing ...................... 35
Figure 5.8. 2 Ozone Layer Data Over the Years .................................................... 36
Figure 5.8. 3 Concentration of Halogen Compounds Over Time .......................... 37
Figure 5.13.1. 1 Prediction of CO2 Concentration for 5 years .............................. 55
Figure 5.13.2. 1 Prediction of CH4 Concentration for 5 years .............................. 58
Figure 5.13.3. 1 Land Average Temperature Prediction Graph ............................. 61
Figure 5.13.4. 1 Total column: SBUV Concentration Graph ................................ 64
LIST OF TABLES
Number ............................................................................................................. Page
Table 4.1. 1 Table Milestone Analysis .................................................................. 24
Table 5.6. 1 Sample of Dataset .............................................................................. 34
Table 5.9.1.1. 1 SVM Training Evaluation for CO2.............................................. 38
Table 5.9.2.1. 1 Random Forest Training Evaluation for CO2 .............................. 39
Table 5.9.3.1. 1 Simple Moving Average Training for CO2 ................................. 41
Table 5.10.1.1. 1 SVM Evaluation for Methane .................................................... 42
Table 5.10.2.1. 1 Random Forest Training Evaluation for Methane ..................... 43
Table 5.10.3.1. 1 Evaluation of Simple Moving Average for Methane ................. 45
Table 5.11.1.1. 1 SVM Training for Global Temperature ..................................... 46
Table 5.11.2.1. 1 Random Forest Evaluation for Global Temperature .................. 47
Table 5.12.1.1. 1 SVM Evaluation for Ozone ....................................................... 49
Table 5.12.1.2. 1 SVM Evaluation for Ozone on Troposphere ............................. 49
Table 5.12.1.3. 1 SVM Evaluation for Ozone on Stratosphere ............................. 50
Table 5.12.2.1. 1 Random Forest Evaluation for Ozone on SBUV ....................... 51
Table 5.12.2.2. 1 Random Forest Evaluation for Ozone on Troposphere ............. 52
Table 5.12.2.3. 1 Random Forest Evaluation for Ozone on Stratosphere.............. 52
Table 5.13.1. 1 Prediction of CO2 concentration levels ........................................ 54
Table 5.13.2. 1 Prediction of CH4 concentration levels ........................................ 57
Table 5.13.3. 1 Land Average Temperature Prediction (2021 to 2025) ................ 59
Table 5.13.4. 1 Total column: SBUV Concentration Prediction ........................... 62
CHAPTER 1: INTRODUCTION
1.1 Introduction
One of the most pressing international problems of the twenty-first century is
climate change, which is fuelled by a variety of environmental causes. Making wise
judgments and developing practical methods to lessen the negative effects of climate
change requires an understanding of the complex interrelationships between
greenhouse gas concentrations, temperature changes, and radiative forcing. This
research explores the analysis and forecasting of climate data, utilizing machine
learning to understand past environmental trends and provide accurate predictions.
Climate change is our era's defining global concern, with far-reaching implications
for the planet's ecosystems, communities, and economies. Making educated decisions
and creating effective climate policies requires an understanding of the intricate
interactions between the many elements that contribute to climate change, such as
greenhouse gas concentrations, temperature changes, and radiative forcing.
Climate change refers to the periodic alteration of the Earth's climate because of
variations in the atmosphere as well as interactions between the atmosphere and
numerous geologic, chemical, biological, and geographic elements that are part of
the Earth system.
The atmosphere is a fluid that is always moving and active. Solar radiation,
continent positions, ocean currents, the location and orientation of mountain ranges,
atmospheric chemistry, and vegetation on the land surface are just a few of the
variables that affect the planet's physical characteristics as well as its rate and
direction of motion.
The goal of this project, "Climate Data Analysis and Prediction Using Machine
Learning," is to conduct a thorough investigation of historical climate data to
identify previous trends and create forecasts for the future. This project attempts to
decipher the complex linkages within climate data and deliver insightful
information about the Earth's changing climate by leveraging the capabilities of
machine learning algorithms.
1.2 Problem Statement
The challenge is to use historical climate records and powerful machine learning
algorithms to assess previous climate patterns, anticipate future climate scenarios,
and give significant insights into the dynamics of climate change.
By developing a data science project in Python using Jupyter Notebook, we will
illustrate in this paper how we intend to approach this problem.
The project, which is focused on the pressing subject of climate change, aims to use
cutting-edge machine learning methods and historical climate information to
address this challenge.
1.3 Background & Scope
Climate change is one of the world's most critical issues today. The general
consensus among scientists is that the Earth's climate is changing significantly
because of human activity, notably the production of greenhouse gases (GHGs)
including carbon dioxide (CO2), methane (CH4), and nitrous oxide (N2O). The
effects of these changes include altered ecosystems, increased frequency and
severity of extreme weather events, melting ice caps, and rising global
temperatures.
Climate science, environmental data analysis, and cutting-edge machine learning
methods must all be used in a multidisciplinary manner to comprehend and solve
climate change. To do this, this research makes full use of data-driven insights and
predictive modelling to analyse past climate data in-depth and anticipate future
climatic conditions.
The project's backbone is the availability of enormous datasets that have been
gathered over many years and encompass key climate indicators including GHG
concentrations, global temperatures, radiative forcing, and more. These databases
give academics and decision-makers crucial insights into the dynamics of climate
change, enabling them to make wise choices and create efficient mitigation and
adaptation plans.
The initiative also acknowledges the value of data visualization in explaining
complex climate information to a wider audience. Visual representations of climate
data and model forecasts are crucial to increase public understanding, educate the
public, and inspire collective action in the fight against climate change.
1.4 Aims & Objectives
1.4.1 Aim
The goal of this project is to apply advanced data analysis and machine learning
techniques to conduct a comprehensive investigation of historical climate data, with
a major focus on greenhouse gas concentrations, global temperature changes, and
radiative forcing. The project takes a diverse approach to obtain better
understanding of the mechanisms behind climate change and its causes, ultimately
advancing knowledge of the planet's changing climate. The project's goal is to offer
useful tools and insights that can help climate scientists, policymakers, and the
public in solving the urgent issues of climate change mitigation and adaptation. This
is done by using the power of data-driven modelling and visualization.
1.4.2 Objectives
The following are the key goals of the project on climate data analysis and
machine learning models:
• Develop an effective climate analysis system using cutting-edge machine
learning techniques. From historical climate data, our system ought to be
able to extract intricate patterns and trends that conventional statistical
methods might not be able to identify.
• Design a flexible climate modeling system capable of integrating critical
climatic variables and factors into the analysis. These factors allow for a
thorough evaluation of climate dynamics and include greenhouse gas
concentrations, radiative forcing, temperature anomalies, oceanic data,
and more.
• Design the climate analysis system to be flexible and adaptable to
evolving climate data and scientific advancements. It should have the
capacity to incorporate new datasets and variables, ensuring that it
remains up-to-date and relevant in a rapidly changing climate research
landscape.
• Develop machine learning models capable of providing highly accurate
climate projections. These models should excel in predicting time-series
climate data by considering historical climate records as crucial input
features, enhancing the precision of future climate forecasts.
Overall, the climate data analysis and machine learning models project aspires
to enhance our understanding of climate change dynamics and provide
critical tools for informed decision-making in the face of this global
challenge.
1.4.3 Industrial Context
The project is being done within the larger industrial framework of climate science
and environmental management. The project's main areas of concentration are
climate data analysis and machine learning models, but it also has applications and
significance in many other industries and sectors that have a stake in
climate-related information and decision-making.
Accurate analysis and forecasting of climate trends are crucial for sectors that
largely rely on climate data, such as renewable energy, agriculture, construction,
insurance, and disaster management. Climate information is used by these
industries to manage building projects, plan agricultural cycles, assess risk, and
respond to natural calamities.
The project advances our understanding of climate science via research and
academics. It may be utilized by researchers and climate scientists as a tool for
investigating climate patterns, confirming climate models, and creating creative
responses to the problems associated with climate change.
This research on climate analysis and machine learning has an industrial context
that spans several industries and sectors. It acts as a fundamental part of the larger
ecosystem of multidisciplinary research, environmental management, policy
formation, and climate science. Its findings and conclusions might have an
influence on a wide range of stakeholders, including corporations, governments,
researchers, and the public. All these parties have a stake in understanding and
tackling the crucial issue of climate change.
1.5 Deliverables
A climate data analysis system based on machine learning that conducts
comprehensive climate data analysis and forecasts future climate patterns and trends
is the project's deliverable.
1.6 Tools Used
Jupyter Notebook, the Python programming language, and several machine learning
packages are the tools utilized for this project.
CHAPTER 2: BACKGROUND AND
LITERATURE REVIEW
2.1 Persistence of climate changes due to a range of greenhouse
gases
Emissions of a broad range of greenhouse gases of varying lifetimes contribute to
global climate change. Carbon dioxide displays exceptional persistence that renders
its warming nearly irreversible for more than 1,000 years. Here we show that the
warming due to non-CO2 greenhouse gases, although not irreversible, persists
notably longer than the anthropogenic changes in the greenhouse gas
concentrations themselves. We explore why the persistence of warming depends
not just on the decay of a given greenhouse gas concentration but also on climate
system behaviour, particularly the timescales of heat transfer linked to the ocean.
For carbon dioxide and methane, nonlinear optical absorption effects also play a
smaller but significant role in prolonging the warming. In effect, dampening factors
that slow temperature increase during periods of increasing concentration also slow
the loss of energy from the Earth’s climate system if radiative forcing is reduced.
Approaches to climate change mitigation through reduction of greenhouse
gas or aerosol emissions therefore should not be expected to decrease climate
change impacts as rapidly as the gas or aerosol lifetime, even for short-lived
species; such actions can have their greatest effect if undertaken soon enough to
avoid transfer of heat to the deep ocean.
Carbon dioxide, methane, nitrous oxide, and other greenhouse gases increased over
the course of the 20th century due to human activities. The human-caused increases
in these gases are the primary forcing that accounts for much of the global
warming of the past fifty years, with carbon dioxide being the most important single
radiative forcing agent (1). Recent studies have shown that the human-caused
warming linked to carbon dioxide is nearly irreversible for more than 1,000 years, even
if emissions of the gas were to cease entirely (2–5). The importance of the ocean
in taking up heat and slowing the response of the climate system to radiative forcing
changes has been noted in many studies (e.g., refs. 6 and 7). The key role of the
ocean’s thermal lag has also been highlighted by recent approaches to proposed
metrics for comparing the warming of different greenhouse gases (8, 9). Among the
observations attesting to the importance of these effects are those showing that
climate changes caused by transient volcanic aerosol loading persist for more than
5 years (7, 10), and a portion can be expected to last more than a century in the
ocean (11–13); clearly these signals persist far longer than the radiative forcing
decay timescale of about 12–18 months for the volcanic aerosol (14, 15). Thus, the
observed climate response to volcanic events suggests that some persistence of
climate change should be expected even for quite short-lived radiative forcing
perturbations. It follows that the climate changes induced by short-lived
anthropogenic greenhouse gases such as methane or hydrofluorocarbons (HFCs)
may not decrease in concert with decreases in concentration if the anthropogenic
emissions of those gases were to be eliminated. In this paper, our primary goal is to
show how different processes and timescales contribute to determining how long
the climate changes due to various greenhouse gases could be expected to remain
if anthropogenic emissions were to cease. Advances in modeling have led to
improved Atmosphere-Ocean General Circulation Models (AOGCMs) as well as
to Earth Models of Intermediate Complexity (EMICs). Although a detailed
representation of the climate system changes on regional scales can only be
provided by AOGCMs, the simpler EMICs have been shown to be useful,
particularly to examine phenomena on a global average basis. In this work, we use
the Bern 2.5CC EMIC (see Materials and Methods and Text), which has been
extensively intercompared to other EMICs and to complex AOGCMs (3, 4). It
should be noted that, although the Bern 2.5CC EMIC includes a representation of
the surface and deep ocean, it does not include processes such as ice sheet losses or
changes in the Earth’s albedo linked to evolution of vegetation. However, it is
noteworthy that this EMIC, although parameterized and simplified, includes 14
levels in the ocean; further, its global ocean heat uptake and climate sensitivity are
near the mean of available complex models, and its computed timescales for uptake
of tracers into the ocean have been shown to compare well to observations (16). A
recent study (17) explored the response of one AOGCM to a sudden stop of all
forcing, and the Bern 2.5CC EMIC shows broad similarities in computed warming
to that study (see Fig. S1), although there are also differences in detail. The climate
sensitivity (which characterizes the long-term absolute warming response to a
doubling of atmospheric carbon dioxide concentrations) is 3 °C for the model used
here. Our results should be considered illustrative and exploratory rather than fully
quantitative given the limitations of the EMIC and the uncertainties in climate
sensitivity.
Figure 2.1. 1 Surface Warming Temperature
Fig. 1 shows the computed future global warming contributions for carbon dioxide,
methane, and nitrous oxide for a midrange scenario (23) of projected future
anthropogenic emissions of these gases to 2050. Radiative forcings for all three of
these gases, and their spectral overlaps, are represented in this work using the
expressions assessed in ref. 24. In 2050, the anthropogenic emissions are stopped
entirely for illustration purposes. The figure shows nearly irreversible warming for
at least 1,000 years due to the imposed carbon dioxide increases, as in previous work.
All published studies to date, which use multiple EMICs and one AOGCM, show
largely irreversible warming due to future carbon dioxide emissions. The figure
presents the computed surface warming obtained in the Bern 2.5CC model due to
CO2, CH4, and N2O emission increases to 2050 following a "midrange" scenario
(called A1B; see ref. 23), followed by zero anthropogenic emissions thereafter. The
gases are changed sequentially in this calculation to explicitly separate the
contributions of each. The bumps shown in the calculated warming are due to
changes in ocean circulation, as in previous studies
(5, 26, 39). The main panel shows the contributions to warming due to CO2, N2O,
and CH4. The inset shows an expanded view of the warming from year 2000 to
2200.
2.2 Biofuels, Greenhouse Gases and Climate
Change
Biofuels are fuels produced from biomass, mostly in liquid form, within a time
frame sufficiently short to consider that their feedstock (biomass) can be renewed,
unlike fossil fuels. This paper reviews current and future biofuel
technologies, and their development impacts (including on the climate) within given
policy and economic frameworks. Current technologies make it possible to provide
first-generation biodiesel, ethanol, or biogas to the transport sector to be blended
with fossil fuels. Second-generation biofuels from lignocellulose, still under
development, should be available on the market by 2020. Research is active on the
improvement of their conversion efficiency. A ten-fold increase compared with
current cost-effective capacities would make them highly competitive. Within
bioenergy policies, emphasis has been put on biofuels for transportation as this
sector is fast-growing and represents a major source of anthropogenic greenhouse
gas emissions. Compared with fossil fuels, biofuel combustion can emit less
greenhouse gases throughout their life cycle, considering that part of the emitted
CO2 returns to the atmosphere from which it was fixed by photosynthesis in the first
place. Life cycle assessment (LCA) is commonly used to assess the potential
environmental impacts of biofuel chains, notably the impact on global warming.
This tool, whose holistic nature is fundamental to avoid pollution trade-offs, is a
standardised methodology that should make comparisons between biofuel and fossil
fuel chains objective and thorough. However, it is a complex and time-consuming
process, which requires lots of data, and whose methodology still lacks
harmonisation. Hence the life-cycle performances of biofuel chains vary widely in
the literature. Furthermore, LCA is a site- and time-independent tool that cannot
consider the spatial and temporal dimensions of emissions and can hardly serve as
a decision-making tool either at local or regional levels. Focusing on greenhouse
gases, emission factors used in LCAs give a rough estimate of the potential average
emissions on a national level. However, they do not consider the types of crops, soil
or management practices, for instance. Modelling the impact of local factors on the
determinism of greenhouse gas emissions can provide better estimates for LCA on
the local level, which would be the relevant scale and degree of reliability for
decision-making purposes. Nevertheless, a deeper understanding of the processes
involved, most notably N2O emissions, is still needed to improve the accuracy of
LCA. Perennial crops are a promising option for biofuels, due to their rapid and
efficient use of nitrogen, and their limited farming operations. However, the main
overall limiting factor to biofuel development will ultimately be land availability.
Given the available land areas, population growth rate and consumption behaviours,
it would be possible to reach by 2030 a global 10% biofuel share in the transport
sector, contributing to lower global greenhouse gas emissions by up to
1 GtCO2 eq per year (IEA, 2006), provided that harmonised policies ensure that
sustainability criteria for the production systems are respected worldwide.
Furthermore, policies should also be more integrative across sectors, so that changes
in energy efficiency, the automotive sector and global consumption patterns
converge towards a drastic reduction of the pressure on resources. Indeed, neither
biofuels nor other energy sources or carriers are likely to mitigate the impacts of
anthropogenic pressure on resources in a range that would compensate for this
pressure growth. Hence, the first step is to reduce this pressure by starting from the
variable that drives it up, i.e., anthropogenic consumption.
CHAPTER 3: METHODOLOGY
3.1 Machine Learning
The development of machine learning has been a breakthrough in the control of
intricate connections between inputs and related outputs. It tackles the difficulties
of handling a wide range of situations that could occur throughout the creation of a
system that analyses data to offer valuable insights. To do this, a thorough
examination and study of the aspects of the input data are necessary to train the
system successfully.
With this process, our system is trained using real-world datasets that include the
necessary elements to predict future responses accurately. The system develops
throughout training by learning from its mistakes, producing more accurate and
trustworthy outcomes.
However, it is essential to comprehend what machine learning includes and its
function before getting into the specifics of this technique. The process of building
a system that can provide outcomes based on trained and learned information is
known as machine learning. It enables algorithms to decide based on knowledge
gained, effectively allowing them to learn from the data that is accessible.
With the use of machine learning, computers may acquire knowledge without
explicit programming, giving them a human-like character by being able to
comprehend circumstances and settings in order to make wise judgments. With
applications expanding to several sectors beyond our expectations, this scientific
field has emerged as one of the most exciting technologies. Machine learning is
being actively used in many fields, making it a breakthrough development in the
study of computers.
Machine learning often comes in four flavours:
● Supervised Machine Learning
● Unsupervised Machine Learning
● Semi-Supervised Machine Learning
● Reinforcement Machine Learning
Figure 3.1. 1 Types of Machine Learning
After a thorough review of the options, I have decided to move forward with the
machine learning model approach for our climate forecasting system. This
strategy, as opposed to previous approaches, enables more convenience and
accuracy in processing the necessary data in today's data-driven environment.
I must consider all the variables and elements that have a substantial impact on
climate variations in order to develop a machine learning model that works. As a
result, we have decided to tackle this challenge using a machine learning-based
strategy.
Three distinct machine learning models, each with specific advantages, will be
tested for our climate forecasting system:
• Decision Trees: A tree-based model that divides data into branches according to
characteristics, allowing for simple interpretation and comprehension of
decision-making processes.
• Random Forest: A form of ensemble learning that blends several decision trees
to improve accuracy and reduce overfitting.
• Gradient Boosting: A method for building several weak learners successively,
each of which fixes the flaws in the one before it.
We'll also look at various models including Support Vector Machines (SVM),
Neural Networks, Time Series Analysis (ARIMA, SARIMA, etc.), Long Short-Term
Memory (LSTM), XGBoost, and Prophet.
A sizable and varied training dataset is required to create a trustworthy machine
learning model. In the coming sections of this report, we will go into more depth
about these machine learning models, highlighting their unique qualities and
possible uses in our climate forecasting system.
3.2 Development Methods and Tools
Using Python as our development platform, we have decided to design our
climate forecasting model using a machine learning-based methodology. Python
is a great option for constructing machine learning solutions and for quick
prototyping because of its widespread use and broad library support for data
science applications.
We will use a methodical, sequential approach to create our machine learning
model:
• Inspection of the dataset: Analyze the historical climate data in detail and
extract the elements necessary for precise prediction.
• Data analysis: Using Python's data science modules, investigate correlations
and trends in the climate data based on the attributes that were extracted.
• Data processing: Convert the data into a numerical data matrix that may be
used to train a machine learning model. Encode any characteristics that are not
numerical into a numerical format before training.
• Pipelines: Build pipelines that feed the processed data into three distinct
machine learning models (Decision Trees, Random Forest, and Gradient
Boosting). Set the training process for each model into motion.
• Model Evaluation: Assess the behavior of the trained models using the training
data.
Figure 3.2. 1 Software Development Cycle
Figure 3.2. 2 Iterative Model
This rigorous approach will help us create a strong climate forecasting model that
optimizes predictions while considering a variety of influencing elements.
Knowledge of Python 3 and the necessary Python libraries for data science and
machine learning is needed for the effective development of our climate analysis
system. The following are the libraries we'll be using in our implementation:
3.2.1 Pandas
A free, open-source toolkit for data manipulation, Pandas makes a variety of
data-related activities easier to do, including data entry, normalization, merging
datasets, visualization, statistical operations, analysis, and more.
3.2.2 NumPy
NumPy is a well-known toolkit that provides effective data structures for working
with arrays. Numerical computing, machine learning, data analytics, and other
applications benefit greatly from its swift and efficient processing of
multidimensional arrays.
3.2.3 Matplotlib
Matplotlib is a cross-platform statistics visualization and graphical plotting package
that easily works with NumPy. It offers an approachable replacement for MATLAB
by enabling the incorporation of visual graphs and illustrations into Python-coded
GUI systems.
3.2.4 Scikit-Learn
Scikit-Learn, sometimes referred to as sklearn, is an effective and flexible Python
toolkit for machine learning applications. It provides several different models and
techniques for classification, regression, clustering, dimensionality reduction, and
other tasks.
The Python interfaces provided by Scikit-Learn for effective data modelling are
clear and consistent. Scikit-Learn is built upon NumPy, SciPy, and Matplotlib.
Feature extraction, selection, cross-validation, ensemble techniques, supervised and
unsupervised learning algorithms, and other crucial capabilities are supported.
These libraries will enable us to efficiently handle data processing, feature
extraction, and machine learning model training. The combination of Python and
these libraries will allow us to build a strong and precise climate analysis and
forecasting system.
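As a brief, hypothetical illustration of how these libraries work together, the sketch
below loads a small table of invented annual CO2 values with Pandas, summarizes it
with NumPy, and plots it with Matplotlib; the column names and figures are
placeholders, not the project's actual data.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Invented sample data; the real project loads its climate datasets from CSV files
df = pd.DataFrame({
    "year": [2011, 2012, 2013, 2014, 2015],
    "co2_ppm": [391.6, 393.9, 396.5, 398.6, 400.8],
})

# NumPy operates directly on the underlying arrays
print("Mean CO2 (ppm):", np.mean(df["co2_ppm"].to_numpy()))

# Matplotlib renders a quick line plot of the series
plt.plot(df["year"], df["co2_ppm"], marker="o")
plt.xlabel("Year")
plt.ylabel("CO2 concentration (ppm)")
plt.title("Sample CO2 trend")
plt.show()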
CHAPTER 4: PROJECT MANAGEMENT
We used a Gantt chart to successfully monitor our progress during this project. The
Gantt chart was chosen since it is a popular project management tool used in many
different sectors. It helped in planning and coordinating project operations by
enabling us to show the timetable of various tasks and their interdependencies
visually.
The milestone system played a significant role in the planning process. Milestones
acted as important checkpoints and objectives, directing the development of our
project. Every milestone marked a significant achievement or the conclusion of an
essential stage, ensuring that we kept on course and adhered to critical deadlines.
A thorough overview of the project's schedule and deliverables was given thanks to
the collaboration between the Gantt chart and milestone system. The Gantt chart
assisted in visualizing the different jobs and their durations, allowing for efficient
resource allocation and the detection of possible bottlenecks. As the project moved
from one stage to the next, the milestone system served as strategic anchors,
denoting significant accomplishments.
This mix of tools helped us stay organized and maintain a clear direction
throughout the project's lifespan. The methodical approach, which enabled
effective communication between team members and allowed timely modifications
to project timetables and priorities, was ultimately key to the successful completion
of the climate data analysis system.
4.1 Milestone Analysis
Finding important turning points or noteworthy occasions within a project's
timetable is known as a milestone analysis. These turning points are significant
benchmarks that show development and the end of crucial stages. Project managers
may maintain the project's timeline, deal with possible concerns quickly, and
recognize achievements as they are made by monitoring and analyzing milestones.
Milestone-1 (Initial Report): In this milestone, we needed to come up with all the
research details of our project and state the methods we had decided to follow to
complete our project.
Milestone-2 (Project Report 2): In this milestone, we had to come up with a
proof-of-concept forecasting model that shows that we are on the correct track to
complete the full project.
Milestone-3 (Final Report Submission): By this milestone, we needed to come up
with the final climate data analysis model and submit our final report for the
project.
Final Presentation: Here, we finally present our work and demonstrate the climate
data analysis model that we have created.
Table 4.1. 1 Table Milestone Analysis
This project was initially divided into a total of four milestones, each with a specific
objective that needed to be accomplished. The table contains the requirements and
milestones that must be met.
The table contains the fundamental summaries that were anticipated at the
conclusion of each milestone. Now that we had established the objectives for each
milestone, we could start working on finishing the project.
4.2 Analysis and Deliverables
In this research, the dataset needed to train the machine learning models was
gathered from the Kaggle public database. It brings together historical climate
records from several sources, including greenhouse gas concentrations,
temperature measurements, and radiative forcing.
We utilize the Pandas, NumPy, and Matplotlib libraries of the Python programming
language to analyze the dataset. We estimate the various parameters and their
relationships using the statistical modeling procedures provided by these libraries.
We then use Matplotlib to visualize all these data in plots. These visually
represented findings were analyzed, and they served as input for the feature
extraction procedure described earlier in this study.
The libraries that will be utilized to create our implementation of the climate
forecasting system are those that were previously discussed. The fundamental
design strategy that was employed to create our forecasting system is described
here.
Figure 4.2. 1 Forecasting System flow diagram.
The illustration shown above depicts the complete procedure. Raw climate data is
first gathered and processed using several climatic variables and parameters. This
raw climate data is then turned into insightful information using the fundamental
algorithm and libraries such as Pandas, NumPy, Matplotlib, and scikit-learn. The
program makes use of these methods and libraries to transform the raw climate
data into a useful dataset, improving the general accuracy of the climate analysis
and machine learning system.
• Climate data here refers to all the information pertaining to the different elements
and variables that make up the climate.
• Climate variables include elements like temperature, humidity, greenhouse gas
concentrations, and other metrics that relate to the climate.
• The term "intensity" describes the magnitude or value of climatic variables at a
particular period.
• Additional information, such as the location, the timing, and certain climatic
occurrences, is included in other data.
• A database is a complete set of climatic information that includes all the variables
and factors.
• The term "raw data" describes the original, unprocessed climatic data, which
might not be useful on its own.
• The raw climatic data is processed and converted into a useful dataset using
Python tools and algorithms.
• Future climatic patterns and trends are predicted in the last stage. This is how the
program works to improve decision-making in machine learning and climate
analysis while also increasing overall accuracy.
For this application to be successful, accurate and trustworthy climatic data are
required. There may be disparities in the final forecasts if the initial data is inaccurate
or contains mistakes. Therefore, acquiring accurate and trustworthy climate data is
essential to improving the overall accuracy and efficiency of the machine learning
and climate analysis system.
4.3 Physical Resources
We will need a thorough historical dataset of climate data that includes a range of
climatic variables and characteristics. This dataset will operate as the fundamental
source of data from which we will extract all the components required for our
machine learning models to be trained.
Our main goal is to build a reliable machine learning and climate analysis system
that can anticipate future climate patterns and trends. We will be able to estimate
and predict a variety of measures connected to the climate using this system, which
is based on machine learning algorithms, improving our understanding of climate
dynamics.
We’ll use the free Jupyter Notebook environment as our Python code editor to make
our research and data analysis easier. In addition, for data processing and
manipulation, we’ll use fundamental Python modules like Pandas and NumPy, along
with Matplotlib for data visualization. With the help of these tools and libraries, we
will be able to display and evaluate the climatic data for the project’s data science
component efficiently.
CHAPTER 5: IMPLEMENTATION
5.1 Label Encoding
Label Encoding refers to converting labels into a numeric, machine-readable form.
Machine learning algorithms can then decide in a better way how those labels
should be handled. It is an important pre-processing step for structured datasets in
supervised learning.
5.1.1 Advantages of Label Encoding:
Scikit-learn provides a very efficient tool for encoding the levels of categorical
features into numeric values. LabelEncoder encodes labels with a value between 0
and n_classes-1, where n_classes is the number of distinct labels. If a label repeats,
it assigns the same value as assigned earlier.
5.2 Regression Models
5.2.1 Decision Tree Regressor
Decision trees may be used for both classification and regression applications and
are non-parametric models. Because the model is non-parametric, the number of
parameters (or weights) is not fixed in advance but is determined by the training
data rather than by the number of features in the dataset.
Decision trees benefit from this property, which makes them very adaptable and
able to handle datasets with many attributes without dramatically increasing
computing complexity.
Decision trees are flexible because, depending on the nature of the issue, they can
generate categorical (discrete) and numerical (continuous) predictions. Decision
trees output class labels for classification problems while producing numerical
values for regression tasks.
Figure 5.2.1. 1 Decision Tree Working Diagram
5.2.2 Random Forest Regression
Random forests or random decision forests is an ensemble learning method for
classification, regression and other tasks that operates by constructing a multitude
of decision trees at training time. For classification tasks, the output of the random
forest is the class selected by most trees. For regression tasks, the mean or average
prediction of the individual trees is returned. Random decision forests correct for
decision trees' habit of overfitting to their training set. Random forests generally
outperform decision trees, but their accuracy is lower than gradient boosted trees.
However, data characteristics can affect their performance.
The first algorithm for random decision forests was created in 1995 by Tin Kam Ho
using the random subspace method, which, in Ho's formulation, is a way to
implement the "stochastic discrimination" approach to classification proposed by
Eugene Kleinberg.
An extension of the algorithm was developed by Leo Breiman and Adele Cutler,
who registered "Random Forests" as a trademark in 2006 (as of 2019, owned by
Minitab, Inc.). The extension combines Breiman's "bagging" idea and random
selection of features, introduced first by Ho and later independently by Amit and
Geman, in order to construct a collection of decision trees with controlled variance.
Random forests are frequently used as "black box" models in businesses, as they
generate reasonable predictions across a wide range of data while requiring little
configuration.
Figure 5.2.2. 1 Working of Random Forest
5.2.3 Gradient Boosting Machines
The sophisticated machine learning approach known as a gradient boosting machine
(GBM) is utilized for both classification and regression problems. A powerful
predictive model is produced by GBMs, an ensemble learning technique that
successively integrates the predictions of several weak learners (usually decision
trees). Gradient Boosting Machines (GBMs) start by fitting a weak learner to the
training data; this weak learner is often a shallow decision tree with a small number
of levels.
Although this initial weak learner performs only moderately well overall, it
generates predictions. The GBM then computes the residuals, the discrepancies
between the actual target values and the predictions given by the first weak learner.
These residuals represent the mistakes or inconsistencies that must be fixed. The
GBM then fits a second weak learner to the residuals obtained in the preceding
phase. To correct the mistakes produced by the first learner, this second learner
focuses on recognizing and learning from the patterns found in the residuals.
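To make this residual-correction idea concrete, here is a small hand-rolled sketch of
a single boosting step using two shallow trees; the synthetic data and learning rate
are illustrative assumptions, not the project's actual configuration.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-dimensional regression data (illustrative only)
rng = np.random.RandomState(42)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=200)

# Stage 1: fit a shallow tree (a weak learner) to the targets
tree1 = DecisionTreeRegressor(max_depth=2).fit(X, y)
residuals = y - tree1.predict(X)  # errors left over from the first learner

# Stage 2: fit a second weak learner to those residuals
tree2 = DecisionTreeRegressor(max_depth=2).fit(X, residuals)

# Combined prediction: stage-1 output plus a damped correction from stage 2
learning_rate = 0.5
y_pred = tree1.predict(X) + learning_rate * tree2.predict(X)

print("Stage 1 MSE:", np.mean((y - tree1.predict(X)) ** 2))
print("Two-stage MSE:", np.mean((y - y_pred) ** 2))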
Figure 5.2.3. 1 Working of GBM
5.3 Steps of Model Building
5.3.1 Dataset
Climate variables, historical climate data, geographic information, emissions data,
atmospheric conditions, climate change metrics, and trend analysis are among the
features of the dataset for climate analysis and machine learning. These factors are
essential for efficient climate research and machine learning as they help us
comprehend historical climate patterns, forecast future trends, evaluate
environmental implications, and take reasoned actions to slow down climate
change.
5.3.2 Splitting the Dataset
We divide the data into training, testing, and validation parts: 80 percent of the
data is used for training, 10 percent for testing, and 10 percent for validating the
results.
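As a sketch, an 80/10/10 partition can be produced with two successive calls to
scikit-learn's train_test_split, assuming X and y hold the prepared features and
target:

from sklearn.model_selection import train_test_split

# First hold out 20% of the data, then split that holdout half-and-half
# into test and validation sets, giving an 80/10/10 partition overall.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)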
5.3.3 Training and Testing
Training is done using the training data drawn from the dataset. This training helps
the system learn the patterns and relationships in the data. Testing is done to check
whether the training phase has been successful: the testing data is used after
training to verify whether the predictions or calculations made by the machine
learning algorithm are right or wrong.
5.3.4 Supervised Machine Learning
Supervised learning is used to train an algorithm to perform the same task on new
data by extracting patterns and relationships from it. Supervised learning provides
the algorithm with example data and the corresponding results for training.
5.3.5 Regression Models
• Decision Tree
• Random Forest Regression
• Gradient Boosting Machine (GBM)
A brief sketch of fitting these three models with scikit-learn follows.
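This is a minimal sketch, assuming the X_train, y_train, X_test, and y_test splits
from above; it fits the three listed regressors with illustrative settings rather than
the project's actual hyperparameters.

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

models = {
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "GBM": GradientBoostingRegressor(n_estimators=100, random_state=42),
}

# Fit each regressor on the training split and score it on the test split
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "R^2 on test set:", model.score(X_test, y_test))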
5.5 Evaluation Methods
These evaluation methods produce numeric scores that quantify the calculations
and predictions made by the various algorithms.
5.5.1 Mean Absolute Error (MAE)
In statistics, mean absolute error (MAE) is a measure of errors between paired
observations expressing the same phenomenon. Examples of Y versus X include
comparisons of predicted versus observed, subsequent time versus initial time, and
one technique of measurement versus an alternative technique of measurement.
MAE is calculated as the sum of absolute errors divided by the sample size.
$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$   (1)
5.5.2 Mean Squared Error (MSE)
The mean squared error (MSE) or mean squared deviation (MSD) of an estimator
(of a procedure for estimating an unobserved quantity) measures the average of the
squares of the errors—that is, the average squared difference between the estimated
values and the actual value. MSE is a risk function, corresponding to the expected
value of the squared error loss.
The fact that MSE is almost always strictly positive (and not zero) is because of
randomness or because the estimator does not account for information that could
produce a more accurate estimate. In machine learning, specifically empirical risk
minimization, MSE may refer to the empirical risk (the average loss on an observed
data set), as an estimate of the true MSE (the true risk: the average loss on the actual
population distribution).
$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$   (2)
5.5.3 Root Mean Squared Error (RMSE)
The root mean square (RMS or rms) is defined as the square root of the mean
square (the arithmetic mean of the squares of a set of numbers). The RMS is also
known as the quadratic mean and is a particular case of the generalized mean with
exponent 2. RMS can also be defined for a continuously varying function in terms
of an integral of the squares of the instantaneous values during a cycle.
$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$   (3)
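As a sketch of how these three metrics are computed with scikit-learn (RMSE is
simply the square root of MSE), assuming y_test holds observed values and y_pred
a model's predictions:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # root mean squared error, per equation (3)

print(f"MAE: {mae:.3f}, MSE: {mse:.3f}, RMSE: {rmse:.3f}")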
5.6 Data Collection
To acquire data, pertinent information and datasets must be obtained from several
sources, including historical climate data, climatic variables, geographic
information, emissions data, atmospheric conditions, and other variables that may
have an influence on climate patterns and trends. A thorough data collection is
needed to train the machine learning models and build a precise system for climate
analysis and machine learning.
Each row of the sampled dataset records, for a given year, the radiative forcing
contribution of each gas (co2, ch4, n2o, cfc12, cfc11, and 15 minor gases), the
combined total, the CO2-equivalent concentration in ppm (co2_eq_ppm_total,
roughly 385 to 394 in the sampled rows), and the Annual Greenhouse Gas Index
relative to 1990 (aggi_1990_1, about 0.785 to 0.859) together with its year-on-year
percentage change (aggi_change).
Table 5.6. 1 Sample of Dataset
5.7 Data Pre-processing and Feature Engineering
Preparing and cleaning raw data so that it is appropriate for further analysis is the
first stage in the data analysis pipeline. It entails activities including dealing with
missing data, getting rid of duplicates, scaling features, and encoding categorical
variables.
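The sketch below illustrates these steps on a Pandas DataFrame; the column names
are hypothetical stand-ins for the dataset's actual fields, and df is assumed to be the
DataFrame loaded in Section 5.6.

from sklearn.preprocessing import LabelEncoder, StandardScaler

# Remove duplicate rows and fill missing values (hypothetical column names)
df = df.drop_duplicates()
df["co2"] = df["co2"].fillna(df["co2"].mean())

# Encode a hypothetical non-numeric column, then scale the numeric features
df["source"] = LabelEncoder().fit_transform(df["source"].astype(str))
df[["co2", "ch4", "n2o"]] = StandardScaler().fit_transform(df[["co2", "ch4", "n2o"]])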
5.8 Data Visualization
In climate analysis and machine learning, the process of using charts, graphs, and
other graphical representations to convey information about climate variables,
emissions, atmospheric conditions, and other relevant factors is known as data
visualization. With the help of this visual representation, stakeholders may get
crucial insights, identify patterns or trends in the climatic data, and draw informed
judgments. The correlation matrix, which shows the connections and dependencies
between different climatic parameters and the concentration of all substances,
provides a visual understanding of how different climate variables are connected
and have an influence on one another in the climate system. Making defensible
judgments on methods for mitigating and adapting to climate change is made easier
thanks to this method, which improves our knowledge of the intricate interactions
that take place inside the climate system.
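A correlation matrix of this kind can be computed and drawn in a few lines, as in
this sketch; df is assumed to be the pre-processed DataFrame of climate variables
described above.

import matplotlib.pyplot as plt

corr = df.corr(numeric_only=True)  # pairwise correlations between climate variables

fig, ax = plt.subplots()
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)), labels=corr.columns, rotation=90)
ax.set_yticks(range(len(corr.columns)), labels=corr.columns)
fig.colorbar(im, label="correlation")
ax.set_title("Correlation matrix of climate variables")
fig.tight_layout()
plt.show()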
Figure 5.8. 1 Concentration of Greenhouse Gas Radiative Forcing
The graph displays, from 1979 to 2015, the radiative forcing, expressed in Watts
per square meter (W/m2), for major greenhouse gases. Each line represents either a
distinct gas, such as CO2, CH4, N2O, CFC-12, CFC-11, and other minor gases, or
the overall radiative forcing.
The following are significant findings from the graph:
• Over time, CO2 shows a continuous rise in radiative forcing, indicating a
considerable contribution to global warming.
• Radiative forcing is affected differently by various gases, some of which
exhibit varied effects.
• The "Total" line depicts the total radiative forcing from all gases, which
sums up the effect on the planet's energy balance.
This graph illustrates how various greenhouse gases affect radiative forcing,
highlighting their contributions to climate change and global warming.
Figure 5.8. 2 Ozone Layer Data Over the Years
The line graph shows the variations in ozone concentration in the entire column,
troposphere, and stratosphere of the Earth's atmosphere over a period of years.
The y-axis shows the ozone concentration, while the x-axis shows the years from
the dataset. Each line on the graph represents one of the atmospheric layers.
These are some important conclusions to draw from the graph:
• Trends in the amount of ozone in the stratosphere, troposphere, and overall
column during the chosen years.
• Any changes or trends in the ozone content of each atmospheric layer.
• Alterations in ozone levels in various atmospheric strata and their
connection.
This representation makes it easier to see how ozone levels have changed over time
in the various layers of the atmosphere, knowledge that is required for studying the
Earth's ozone layer and its possible effects on climatic and environmental conditions.
Figure 5.8. 3 Concentration of Halogen Compounds Over Time
The line plot shows the change in concentration of several halogen compounds
over time. The graph depicts each halogen compound as a distinct line, illustrating
how its concentration has varied over time. The years are shown on the x-axis,
and the concentration of each compound on the y-axis. This depiction makes it
possible to compare concentration patterns for the various halogen compounds
over the given period.
5.9 Training and Testing of Different Models on CO2 Dataset
The dataset is split into two subsets after the data preparation step: the training set
and the testing set. A fraction of the pre-processed data is present in the training set,
which is utilized to train the machine learning model. The trained model's
performance is assessed on the testing set, which also evaluates how well it
generalizes to fresh, unseen data.
The separation into training and testing sets is crucial for avoiding overfitting,
where the model memorizes the training data but fails to perform effectively on
new data. Depending on the size of the dataset, a typical split ratio for training
and testing sets is 70-30 or 80-20.
5.9.1 Using Support Vector Machine Algorithm
Among the most effective machine learning methods, Support Vector Machines
(SVM) are utilized for classification and regression problems. They effectively
separate data points by determining the best hyperplane to optimize the margin
between various kinds of data. SVM can handle non-linear data by applying a kernel
function to translate it into a higher-dimensional space. The maximum-margin
hyperplane is chosen, improving generalization and lowering overfitting. Support
vectors, the data points closest to the hyperplane, are crucial for defining the
decision boundary. The SVM has a hyperparameter called "C" that allows
for flexible model adjustment by balancing margin width and classification
accuracy. SVM is renowned for its resilience and adaptability for both linear and
nonlinear classification problems. It can manage a variety of data sources and
classification tasks.
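As a brief, hedged illustration of the kernel and "C" hyperparameter described above (the specific values are assumptions for the example, not settings tuned for this project):

from sklearn.svm import SVR

# A small C widens the margin and regularizes more strongly;
# a large C fits the training data more tightly.
loose_model = SVR(kernel='linear', C=0.1)

# An 'rbf' kernel maps the inputs into a higher-dimensional space,
# allowing non-linear relationships to be captured.
tight_model = SVR(kernel='rbf', C=100.0)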
Here is the code for the support vector machine algorithm's evaluation.
# Imports assumed from the project setup (see Section 5.15)
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Support Vector Regression
svr_model = SVR(kernel='linear')
svr_model.fit(X_train, y_train)
svr_pred = svr_model.predict(X_test)

# Calculate mean squared error and R-squared for each model
def evaluate_model(y_true, y_pred, model_name):
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{model_name} - Mean Squared Error: {mse:.2f}, R-squared: {r2:.2f}")

evaluate_model(y_test, svr_pred, "Support Vector Regression")
5.9.1.1 Support Vector Machine Algorithm’s Training Evaluation:
Evaluation of support vector machine algorithm training.
Mean Squared Error: 13.23
R-squared: 0.97
Table 5.9.1.1. 1 SVM Training Evaluation for CO2
With a low mean squared error (MSE) of 13.23, the SVM model used in climate
analysis makes precise predictions with few errors. The model also has a high
R-squared (R2) value of 0.97, which shows that it explains 97% of the variance in
the climate data, highlighting its close fit and trustworthy forecasting ability. The
SVM model essentially demonstrates how well it predicts climate-related factors.
5.9.2 Using Random Forest Algorithm
Random Forest is an ensemble learning technique used in supervised learning. It
combines several models to solve complicated problems that might not be amenable
to a single ML model, and it can be used for both classification and regression tasks.
In Random Forest, multiple decision trees are constructed on various subsets of the
dataset. Rather than depending on just one decision tree, the method collects the
predictions from each tree and combines them (majority vote for classification,
averaging for regression) to arrive at the final prediction. The model's accuracy
increases, and the chance of overfitting decreases, as the number of trees increases.
Here is the code for the random forest algorithm's evaluation.
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Random Forest Regression
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)

# Calculate mean squared error and R-squared for each model
def evaluate_model(y_true, y_pred, model_name):
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{model_name} - Mean Squared Error: {mse:.2f}, R-squared: {r2:.2f}")

evaluate_model(y_test, rf_pred, "Random Forest Regression")
5.9.2.1 Random Forest Training Evaluation:
Evaluation report of random forest algorithm.
Mean Squared Error: 3.36
R-squared: 0.99
Table 5.9.2.1. 1 Random Forest Training Evaluation for CO2
With a Mean Squared Error (MSE) of 3.36, the predictions are very accurate, with
few errors. According to the R-squared (R2) value of 0.99, the model explains 99%
of the variation in the climatic data. This denotes a close fit and trustworthy
forecasting ability for climate-related variables.
5.9.3 Using Time Series Model (Simple Moving Average)
A mathematical method for smoothing time-series data is the moving average
algorithm. It entails taking a sequence of consecutive data points and computing
the average over a predetermined number of neighbouring points, referred to as the
"window" or "period". This averaged value is then assigned to the centre data point
inside that window. The same procedure is applied to each data point in the series,
yielding a fresh set of smoothed values.
The moving average is particularly helpful for eliminating noise or volatility in
data, making underlying trends or patterns more obvious. It is frequently used to
analyse and display data over time, aiding the detection of trends and patterns
within noisy datasets in a variety of disciplines, including finance, signal
processing, and climate studies. Depending on the particulars of the data and the
objectives of the study, different moving average variants, such as simple moving
averages (SMA) and exponential moving averages (EMA), might be used.
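To contrast the two variants named above, here is a minimal sketch using pandas on a made-up series (the values are illustrative only):

import pandas as pd

series = pd.Series([1.0, 2.0, 4.0, 3.0, 5.0, 6.0, 8.0])

sma = series.rolling(window=3).mean()  # simple moving average over a 3-point window
ema = series.ewm(span=3).mean()        # exponential moving average, recent points weighted more
print(pd.DataFrame({'value': series, 'SMA(3)': sma, 'EMA(3)': ema}))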
Here is the code for the moving average algorithm's evaluation.
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Time Series Model (Simple Moving Average)
rolling_mean = y_train.rolling(window=5).mean().iloc[-1]
ts_pred = np.full_like(y_test, fill_value=rolling_mean)

# Calculate mean squared error and R-squared for each model
def evaluate_model(y_true, y_pred, model_name):
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{model_name} - Mean Squared Error: {mse:.2f}, R-squared: {r2:.2f}")

evaluate_model(y_test, ts_pred, "Time Series (Moving Average)")
5.9.3.1 Time Series Model (Simple Moving Average) Training
Evaluation:
Evaluation results of simple moving average algorithm.
Mean Squared Error: 544.40
R-squared: -0.08
Table 5.9.3.1. 1 Simple Moving Average Training for CO2
The moving average algorithm's predictions contain large errors, as shown by the
MSE value of 544.40, and the negative R2 value of -0.08 indicates that the model
fails to capture the patterns or trends in the data. This is expected, since the
forecast is a single constant (the last rolling mean) applied to every test point.
These findings suggest that the moving average algorithm is not a good fit for this
dataset and that different modelling approaches could be more suitable for
predicting or smoothing the data.
5.10 Training and Testing of Different Models on Methane Dataset
Using several machine learning methods to create predictive models is what is meant
by "training and testing of different models" in the context of the methane dataset.
These models are trained on historical methane data during the training phase, and
their performance is evaluated during the testing phase by comparing their
predictions to measured methane concentrations, in order to determine the best
model for predicting methane levels in the provided dataset.
5.10.1 Using Support Vector Machine Algorithm
Powerful machine learning methods called Support Vector Machines (SVM) are
used to solve regression issues, such as the interpretation of data relating to methane
concentration. In the area of methane concentration prediction, SVM excels at
finding the ideal hyperplane that optimizes the margin between various data points.
The robustness and adaptability of SVM in handling both linear and nonlinear
regression problems make it appropriate for a variety of scenarios involving the
prediction of methane concentration. It can handle various data sources and
regression tasks related to methane concentration analysis.
Here is the code for the support vector machine algorithm's evaluation.
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Support Vector Regression
svr_model = SVR(kernel='linear')
svr_model.fit(X_train, y_train)
svr_pred = svr_model.predict(X_test)

# Calculate mean squared error and R-squared for each model
def evaluate_model(y_true, y_pred, model_name):
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{model_name} - Mean Squared Error: {mse:.2f}, R-squared: {r2:.2f}")

evaluate_model(y_test, svr_pred, "Support Vector Regression")
5.10.1.1 Support Vector Machine Algorithm’s Training Evaluation:
Evaluation of support vector machine algorithm training.
Mean Squared Error: 86,341,224.65
R-squared: -11,709.77
Table 5.10.1.1. 1 SVM Evaluation for Methane
The SVM model used to analyse methane concentration data shows considerable
prediction errors and a lack of explanatory power, with a strikingly high mean
squared error (MSE) of 86,341,224.65 and a markedly low R-squared (R2) value of
-11,709.77. The increased MSE, which denotes erroneous predictions with
significant departures from actual values, indicates this model performs badly. The
significantly negative R2 value suggests that the model does not satisfactorily
explain the variance in the methane concentration data, raising questions about the
model's accuracy in predicting climate-related variables.
5.10.2 Using Random Forest Algorithm
In supervised learning for the study of methane datasets, Random Forest, an
ensemble learning approach, is used. This approach combines numerous models to
tackle difficult problems that a single machine learning model might not be able to
handle well. Random Forest can be used for both classification and regression tasks
relating to methane concentration analysis.
The accuracy of the model tends to rise as the ensemble size grows while the danger
of overfitting decreases. As a result, Random Forest is particularly useful for
improving the accuracy of methane concentration estimates and decreasing the
likelihood that the methane dataset analysis would overfit.
Here is the code for the random forest algorithm's evaluation.
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Random Forest Regression
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)

# Calculate mean squared error and R-squared for each model
def evaluate_model(y_true, y_pred, model_name):
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{model_name} - Mean Squared Error: {mse:.2f}, R-squared: {r2:.2f}")

evaluate_model(y_test, rf_pred, "Random Forest Regression")
5.10.2.1 Random Forest Training Evaluation:
Evaluation report of random forest algorithm.
Mean Squared Error: 464.53
R-squared: 0.94
Table 5.10.2.1. 1 Random Forest Training Evaluation for Methane
The Random Forest algorithm's predictions for methane concentration in climate
analysis have a comparatively low Mean Squared Error (MSE) of 464.53, indicating
accurate predictions with few errors. The model successfully explains 94% of the
variation in the methane concentration data, as indicated by the R-squared (R2)
value of 0.94, suggesting a good fit and dependable forecasting capabilities for
climate-related variables. For estimating methane concentrations in the context of
climate studies, the Random Forest algorithm seems to be a reliable option.
5.10.3 Using Time Series Model (Simple Moving Average)
A mathematical method called the moving average algorithm is used in methane
concentration analysis to smooth time-series data. It entails taking a sequence of
consecutive data points on methane concentration and averaging a predetermined
number of neighbouring points, referred to as the "window" or "period". This
procedure reduces the noise and oscillations in the methane concentration data,
making it simpler to spot underlying trends and patterns.
Simple moving averages (SMA) and exponential moving averages (EMA) are two
examples of moving average variations that may be used in the analysis of methane
datasets depending on the unique properties of the data on methane concentrations
and the goals of the study. With the use of these moving average approaches, it is
possible to improve the accuracy of predictions for methane concentration as well
as get a better understanding of trends and variations in methane concentration over
time.
Here is the code for the moving average algorithms’ evaluation.
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Time Series Model (Simple Moving Average)
rolling_mean = y_train.rolling(window=5).mean().iloc[-1]
ts_pred = np.full_like(y_test, fill_value=rolling_mean)

# Calculate mean squared error and R-squared for each model
def evaluate_model(y_true, y_pred, model_name):
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{model_name} - Mean Squared Error: {mse:.2f}, R-squared: {r2:.2f}")

evaluate_model(y_test, ts_pred, "Time Series (Moving Average)")
5.10.3.1 Time Series Model (Simple Moving Average) Training
Evaluation:
Evaluation results of simple moving average algorithm.
Mean Squared Error: 8028.06
R-squared: -0.09
Table 5.10.3.1. 1 Evaluation of Simple Moving Average for Methane
With a relatively high Mean Squared Error (MSE) of 8028.06, indicating
considerable prediction errors, the Simple Moving Average method performs
suboptimally in forecasting methane concentration. Additionally, the model's
negative R-squared (R2) value of -0.09 indicates that it fails to capture the
patterns or trends in the data. These findings show that the Simple Moving Average
methodology is not suitable for this dataset, and alternative modelling methods
should be considered to improve trend capture and forecast accuracy for methane
concentration data in climate analysis.
5.11 Training and Testing of Different Models on Earth Global
Temperature Dataset
Using the "berkeley_earth_globaltemperatures" dataset, "training and testing of
different models" refers to the process of creating prediction models using a variety
of machine learning approaches. These models are trained using historical
temperature data, and during the testing phase, their performance is evaluated by
contrasting temperature forecasts with actual observations. The goal is to find the
most accurate model for predicting temperature variations in the
"berkeley_earth_globaltemperatures" dataset.
5.11.1 Using Support Vector Machine Algorithm
The study of climate data frequently makes use of Support Vector Machines (SVM),
especially for datasets like Berkeley Earth Global Temperatures. These machine
learning algorithms perform well on classification and regression problems
involving climatic variables.
SVMs are excellent in classifying data by locating the best hyperplane that optimizes
the margin between various types of climatic data. Their adaptability is
demonstrated by their proficiency in handling many kinds of climate data sources
and by the accuracy with which they identify both linear and nonlinear climate
trends.
SVMs prove to be a reliable and flexible machine learning technique in the context
of analyzing climate data using the Berkeley Earth Global Temperatures dataset.
They are important tools in climate study and prediction because they help identify
and comprehend intricate climatic patterns.
Here is the code for the support vector machine algorithm's evaluation.
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Support Vector Regression
svr_model = SVR(kernel='linear')
svr_model.fit(X_train, y_train)
svr_pred = svr_model.predict(X_test)

# Calculate mean squared error and R-squared for each model
def evaluate_model(y_true, y_pred, model_name):
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{model_name} - Mean Squared Error: {mse:.2f}, R-squared: {r2:.2f}")

evaluate_model(y_test, svr_pred, "Support Vector Regression")
5.11.1.1 Support Vector Machine Algorithm’s Training Evaluation:
Evaluation of support vector machine algorithm training.
Mean Squared Error: 18.61
R-squared: -0.00
Table 5.11.1.1. 1 SVM Training for Global Temperature
The "berkeley_earth_globaltemperatures" dataset was analysed using an SVM
model, although it shows significant prediction errors and a lack of explanatory
ability. The R-squared (R2) score is close to zero (-0.00), and the mean squared error
(MSE) is noticeably large (18.61). A high degree of forecast error and considerable
departures from actual temperature readings are both indicated by the increased
MSE. Additionally, the virtually zero R2 value raises concerns about the model's
capacity to accurately forecast temperature-related variables in the
"berkeley_earth_globaltemperatures" dataset since it suggests that the model does
not adequately explain the variance in the temperature data.
5.11.2 Using Random Forest Algorithm
In supervised learning scenarios, such as the study of climate data utilizing datasets
like Berkeley Earth Global Temperatures, Random Forest, an ensemble learning
approach, finds useful application. It harnesses the potential of combining many
models to handle complicated problems that a single machine learning model might
not be able to solve.
For classification and regression tasks in the context of analysing climate data,
Random Forest is flexible and effective. By combining the predictions of several
decision trees, it excels at increasing model accuracy. The performance of the
model tends to grow with the number of trees in the forest, while the danger of
overfitting, where the model fits noise rather than patterns, decreases. For the
Berkeley Earth Global Temperatures dataset, Random Forest is a useful technique
for improving the precision and resilience of climate data analysis and prediction.
Here is the code for the random forest algorithm's evaluation.
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Random Forest Regression
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)

# Calculate mean squared error and R-squared for each model
def evaluate_model(y_true, y_pred, model_name):
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{model_name} - Mean Squared Error: {mse:.2f}, R-squared: {r2:.2f}")

evaluate_model(y_test, rf_pred, "Random Forest Regression")
5.11.2.1 Random Forest Training Evaluation:
Evaluation report of random forest algorithm.
Mean Squared Error: 22.81
R-squared: -0.23
Table 5.11.2.1. 1 Random Forest Evaluation for Global Temperature
The "berkeley_earth_globaltemperatures" dataset's temperature forecasts made by
the Random Forest method show a remarkably low Mean Squared Error (MSE) of
22.81, suggesting a high degree of accuracy with few prediction mistakes.
Furthermore, the model successfully explains 23% of the temperature data variation,
as shown by the R-squared (R2) value of -0.23. As a result, it is possible that the
Random Forest method may not be the best option for predicting variables linked to
temperature in the "berkeley_earth_globaltemperatures" dataset given its weak
explanatory power and high MSE.
5.12 Training and Testing of Different Models on Concentration of
Ozone Dataset
Using the "concentration of ozone" dataset, the idea of "training and testing of
different models" entails the creation of prediction models using a variety of
machine learning approaches. In the testing phase, the performance of these models
is evaluated by comparing projected ozone levels with actual observations. The
models are trained using historical ozone concentration data. Our understanding of
ozone dynamics and trends will be improved by determining the model that can
anticipate ozone concentration fluctuations in the dataset with the greatest degree of
accuracy.
5.12.1 Using Support Vector Machine Algorithm
When applied to classification and regression tasks using the "concentration of
ozone" dataset, Support Vector Machines (SVM) are among the most potent
machine learning algorithms. SVM is very good at separating data points by finding
the best hyperplane that optimizes the separation between different data categories.
SVM demonstrates robustness and adaptability, addressing both linear and
nonlinear classification problems. When applied to ozone concentration analysis,
it works well for managing a variety of data sources and classification tasks.
Here is the code for the support vector machine algorithm's evaluation.
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Support Vector Regression
svr_model = SVR(kernel='linear')
svr_model.fit(X_train, y_train)
svr_pred = svr_model.predict(X_test)

# Calculate mean squared error and R-squared for each model
def evaluate_model(y_true, y_pred, model_name):
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{model_name} - Mean Squared Error: {mse:.2f}, R-squared: {r2:.2f}")

evaluate_model(y_test, svr_pred, "Support Vector Regression")
5.12.1.1 Support Vector Machine Algorithm’s Training Evaluation:
Evaluation of support vector machine algorithm training on SBUV column.
Mean Squared Error: 5.53
Mean Absolute Error: 1.91
R-squared: -0.74
Table 5.12.1.1. 1 SVM Evaluation for Ozone
An SVM model was used to assess the "concentration of ozone" dataset, in
particular the SBUV column. The results, however, reveal low explanatory ability
and significant prediction errors. The mean squared error (MSE) is noticeably high
at 5.53, indicating forecasts with considerable departures from the actual ozone
concentrations. In addition, the R-squared (R2) score is markedly negative (-0.74),
indicating that the model fails to explain the variation in the ozone concentration
data. These findings cast doubt on the SVM model's ability to predict ozone-related
variables within the "concentration of ozone" dataset.
5.12.1.2 Support Vector Machine Algorithm’s Training Evaluation:
Evaluation of support vector machine algorithm training on Troposphere column.
Mean Squared Error: 0.53
Mean Absolute Error: 0.63
R-squared: 0.64
Table 5.12.1.2. 1 SVM Evaluation for Ozone on Troposphere
With a Mean Squared Error (MSE) of 0.53 and predictions that are relatively
accurate with few mistakes, the SVM model applied to the "concentration of ozone"
dataset, notably in the Troposphere column, produces promising results. Another
indicator of prediction accuracy is the Mean Absolute Error (MAE), which is 0.63.
However, the R-squared (R2) value of 0.64 indicates that the model explains the
variation in the tropospheric ozone concentration data only moderately well.
Although not a perfect fit, this suggests that the SVM model does a respectable job
of forecasting ozone concentrations in this column of the dataset.
5.12.1.3 Support Vector Machine Algorithm’s Training Evaluation:
Evaluation of support vector machine algorithm training on Stratosphere column.
Mean Squared Error: 6.06
Mean Absolute Error: 1.99
R-squared: -0.17
Table 5.12.1.3. 1 SVM Evaluation for Ozone on Stratosphere
The SVM model, namely in the Stratosphere column, produces a noticeably larger
Mean Squared Error (MSE) of 6.06 when applied to the "concentration of ozone"
dataset. This shows that the ozone concentrations in the Stratosphere column
predicted by the model have bigger errors and greater departures from real values.
The Mean Absolute Error (MAE), a measure of the magnitude of prediction errors,
is 1.99.
In addition, the R-squared (R2) value is -0.17, indicating that the model has
difficulty explaining the variation in the data on ozone concentration in the
Stratosphere column. This low R2 value raises questions regarding the SVM model's
precision and potency in forecasting ozone levels in this dataset column. When used
with the Stratosphere column of the "concentration of ozone" dataset, the SVM
model generally seems to perform badly.
5.12.2 Using Random Forest Algorithm
Random Forest is a supervised learning approach for ensemble learning that is
commonly used in classification and regression applications. With this strategy,
numerous models are combined to tackle complicated issues that a single machine
learning model might not be able to handle well.
Using numerous decision trees, Random Forest can improve accuracy in the context
of the "concentration of ozone" dataset. The model's precision tends to rise with the
number of trees in the forest. The danger of overfitting may also be decreased by
using more trees, which improves the model's ability to generalize when forecasting
ozone concentrations.
Here is the code for the random forest algorithm's evaluation.
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Random Forest Regression
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)

# Calculate mean squared error and R-squared for each model
def evaluate_model(y_true, y_pred, model_name):
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{model_name} - Mean Squared Error: {mse:.2f}, R-squared: {r2:.2f}")

evaluate_model(y_test, rf_pred, "Random Forest Regression")
5.12.2.1 Random Forest Training Evaluation:
Evaluation report of random forest algorithm on SBUV column data.
Mean Squared Error: 5.07
Mean Absolute Error: 1.88
R-squared: -0.59
Table 5.12.2.1. 1 Random Forest Evaluation for Ozone on SBUV
For the SBUV column of the "concentration of ozone" dataset, the Random Forest
model displays a Mean Squared Error (MSE) of 5.07 and a Mean Absolute Error
(MAE) of 1.88. The R-squared (R2) score of -0.59, however, indicates that the
model fails to explain the variation in the ozone concentration data, highlighting
its limits in identifying underlying patterns or trends in the dataset. In light of these
findings, alternative modelling techniques may be considered to improve
predictions of ozone concentrations in the SBUV column.
5.12.2.2 Random Forest Training Evaluation:
Evaluation report of random forest algorithm on Troposphere column data.
Mean Squared Error: 0.30
Mean Absolute Error: 0.35
R-squared: 0.80
Table 5.12.2.2. 1 Random Forest Evaluation for Ozone on Troposphere
The Mean Squared Error (MSE) of the Random Forest model applied to the
"concentration of ozone" dataset, especially in the Troposphere column, is 0.30. This
suggests that the model's forecasts of ozone concentrations in the troposphere
column are largely correct and include few errors. The Mean Absolute Error
(MAE) of 0.35 likewise reflects the small size of the prediction errors.
The R-squared (R2) value of 0.80 further indicates that the model describes the
variance in the tropospheric ozone concentration data well. The high R2 value
shows that the Random Forest model captures the underlying patterns and trends
in the dataset, giving it strong predictive ability. Overall, the Random Forest
method appears to be a good fit for predicting ozone concentrations in the
Troposphere column of the "concentration of ozone" dataset.
5.12.2.3 Random Forest Training Evaluation:
Evaluation report of random forest algorithm on Stratosphere column data.
Mean Squared Error: 38.52
Mean Absolute Error: 3.20
R-squared: -6.44
Table 5.12.2.3. 1 Random Forest Evaluation for Ozone on Stratosphere
With a value of 38.52, the Mean Squared Error (MSE) of the Random Forest model
applied to the Stratosphere column of the "concentration of ozone" dataset is
noticeably higher. This suggests significant forecast errors and a departure from the
actual values of ozone concentration in the Stratosphere column. The extent of
prediction mistakes is also shown by the Mean Absolute Error (MAE), which is
3.20.
In addition, the R-squared (R2) value is -6.44, which indicates that the model has
difficulty explaining the variation in the data on ozone concentration in the
Stratosphere column. The Random Forest model performs poorly in this column, as
seen by the significantly negative R2 value, and it is unable to identify any
significant patterns or trends in the dataset. To enhance forecasts of ozone
concentrations in the Stratosphere column of the "concentration of ozone" dataset,
additional modelling strategies may be considered.
5.13 Prediction using Trained Models
5.13.1 Forecasting the Concentration of Carbon Dioxide (CO2)
Future CO2 levels in the atmosphere may be predicted using the trained machine
learning models. Based on input data including historical CO2 levels, potential
influencing variables (e.g., emissions, changes in land use), and other pertinent
information, the models will produce projections of CO2 concentration levels for
certain years.
The expected CO2 concentration levels over the next five years may be seen by
using the machine learning models to make projections. These forecasts are
graphically shown, demonstrating the anticipated changes in CO2 concentration.
The objective is to anticipate trends and changes in CO2 levels and to gain insight
into their possible influence on the climate over the next five years.
# Extend the year range for forecasting (e.g., predict for the next 5 years)
next_years = np.arange(X_test['year'].max() + 1, X_test['year'].max() + 6)
X_forecast = pd.DataFrame({'year': next_years})

# Use the trained models to make forecasts
lr_forecast = lr_model.predict(X_forecast)
rf_forecast = rf_model.predict(X_forecast)
svr_forecast = svr_model.predict(X_forecast)

# Time Series Model (Simple Moving Average) - continue with the rolling mean
rolling_mean = y.rolling(window=5).mean().iloc[-1]
ts_forecast = np.full(5, rolling_mean)

# Create a DataFrame to display and visualize the forecasts for the next 5 years
forecast_df = pd.DataFrame({
    'Year': next_years,
    'Linear Regression': lr_forecast,
    'Random Forest Regression': rf_forecast,
    'Support Vector Regression': svr_forecast,
    'Time Series (Moving Average)': ts_forecast
})
Here is a depiction of the predictions for the upcoming years produced by our
models.

Year | Linear Regression | Random Forest Regression | Support Vector Regression | Time Series (Moving Average)
-    | -                 | -                        | -                         | -

Table 5.13.1. 1 Prediction of CO2 concentration levels
Forecasting CO2 trends and changes will help scientists learn more about how they
could affect the climate over the next five years. Visualization for the CO2
concentration is given below.
# Set the 'Year' column as the index for plotting (as in the full listing, Section 5.15)
forecast_df.set_index('Year', inplace=True)

# Plot the forecasts for the next 5 years
plt.figure(figsize=(12, 6))
plt.plot(forecast_df.index, forecast_df['Linear Regression'], label='Linear Regression', marker='o')
plt.plot(forecast_df.index, forecast_df['Random Forest Regression'], label='Random Forest Regression', marker='o')
plt.plot(forecast_df.index, forecast_df['Support Vector Regression'], label='Support Vector Regression', marker='o')
plt.plot(forecast_df.index, forecast_df['Time Series (Moving Average)'], label='Time Series (Moving Average)', marker='o')
plt.xlabel('Year')
plt.ylabel('CO2 Concentration')
plt.title('CO2 Concentration Forecasts for the Next 5 Years')
plt.legend()
plt.grid(True)
plt.show()
Figure 5.13.1. 1 Prediction of CO2 Concentration for 5 years
The graph shows the results of four different regression techniques (Linear
Regression, Random Forest Regression, Support Vector Regression, and Time
Series Moving Average) for the years 2021 to 2025. It offers a comparison of the
estimates for each approach, each of which produces its own set of predictions.
Evaluating and comparing how well the various regression models work is
important for estimating future CO2 concentration levels.
5.13.2 Forecasting the Concentration of Methane (CH4)
Future methane (CH4) levels in the atmosphere may be predicted using the
developed machine learning models. Forecasts of methane concentration levels are
produced by these models using pertinent data, historical methane levels, and
potential influencing variables (such as emissions sources and environmental
conditions).
We can depict the predicted methane concentration levels over the following five
years by using the machine learning algorithms. These projections visually depict
the anticipated fluctuations in methane concentration. The goal is to anticipate
trends and changes in methane levels and to gain insight into how they could affect
the climate over the next five years.
# Predict for the next 5 years (2021 to 2025)
future_years = np.arange(2021, 2026).reshape(-1, 1)
lr_predictions = lr_model.predict(future_years)
rf_predictions = rf_model.predict(future_years)
svr_predictions = svr_model.predict(future_years)
ts_predictions = np.full((5,), rolling_mean)  # Use the last rolling mean for simplicity

# Create a DataFrame to store the predictions
predictions_df = pd.DataFrame({
    'Year': future_years.flatten(),
    'Linear Regression': lr_predictions,
    'Random Forest Regression': rf_predictions,
    'Support Vector Regression': svr_predictions,
    'Time Series (Moving Average)': ts_predictions
})
Here is a depiction of the predictions for the upcoming years produced by our
models.

Year | Linear Regression | Random Forest Regression | Support Vector Regression | Time Series (Moving Average)
2021 | -                 | -                        | -                         | 547.42
2022 | -                 | -                        | -                         | 547.42
2023 | -                 | -                        | -                         | 547.42
2024 | -                 | -                        | -                         | 547.42
2025 | -                 | -                        | -                         | 547.42

Table 5.13.2. 1 Prediction of CH4 concentration levels
Forecasting CH4 trends and changes will help scientists learn more about how they
could affect the climate over the next five years. Visualization for the CH4
concentration is given below.
# Set the figure size
plt.figure(figsize=(12, 8))

# Plot the predictions for each model
plt.plot(predictions_df['Year'], predictions_df['Linear Regression'], label='Linear Regression', marker='o')
plt.plot(predictions_df['Year'], predictions_df['Random Forest Regression'], label='Random Forest Regression', marker='o')
plt.plot(predictions_df['Year'], predictions_df['Support Vector Regression'], label='Support Vector Regression', marker='o')
plt.plot(predictions_df['Year'], predictions_df['Time Series (Moving Average)'], label='Time Series (Moving Average)', marker='o')

# Set labels and title
plt.xlabel('Year')
plt.ylabel('Methane Concentration')
plt.title('Methane Concentration Prediction (2021 to 2025)')

# Add a legend
plt.legend()

# Show the plot
plt.grid(True)
plt.show()
Figure 5.13.2. 1 Prediction of CH4 Concentration for 5 years
This graph uses four distinct regression methods (linear regression, random forest
regression, support vector regression, and time series moving average) to forecast
CH4 (methane) concentration levels from 2021 to 2025. Each point on the graph
shows the estimated methane concentration for a particular year and regression
technique.
The graph enables comparison of the predictions made by these regression models,
evaluating the accuracy of their forecasts of methane concentration levels. The
insight it offers into how each approach predicts methane levels over the given
period makes it easier to understand and prepare for anticipated changes in
atmospheric methane concentrations.
5.13.3 Forecasting the Earth Global Temperature
Using the created machine learning models, future atmospheric temperature levels
may be projected. These models produce forecasts of temperature levels for the
upcoming five years using pertinent information, historical temperature records,
and potential influencing variables (such as greenhouse gas concentrations and
changes in land use).
With the use of machine learning techniques, we can visualize the expected
temperature levels for the next five years. These predictions give a visual depiction
of the projected temperature swings, allowing us to estimate trends and variations
in temperature and to gain insight into their possible effects on the global climate.
Here is the code for making predictions for the next five years.
# Generate years for the next five years (2021 to 2025)
future_years = np.arange(2021, 2026).reshape(-1, 1)

# Predict temperatures for the next five years using the trained models
lr_predictions = lr_model.predict(future_years)
rf_predictions = rf_model.predict(future_years)
svr_predictions = svr_model.predict(future_years)

# Create a DataFrame to store the predictions
predictions_df = pd.DataFrame({
    'Year': future_years.flatten(),
    'Linear Regression': lr_predictions,
    'Random Forest Regression': rf_predictions,
    'Support Vector Regression': svr_predictions,
})
Here is a depiction of the predictions for the upcoming years produced by our
models.

Year | Linear Regression | Random Forest Regression | Support Vector Regression
-    | -                 | -                        | -

Table 5.13.3. 1 Land Average Temperature Prediction (2021 to 2025)
Forecasting Land Average Temperature trends and changes will help scientists learn
more about how they could affect the climate over the next five years. Visualization
for the Land Average Temperature is given below.
# Set the figure size
plt.figure(figsize=(12, 8))

# Plot the predictions for each model
plt.plot(predictions_df['Year'], predictions_df['Linear Regression'], label='Linear Regression', marker='o')
plt.plot(predictions_df['Year'], predictions_df['Random Forest Regression'], label='Random Forest Regression', marker='o')
plt.plot(predictions_df['Year'], predictions_df['Support Vector Regression'], label='Support Vector Regression', marker='o')

# Set labels and title
plt.xlabel('Year')
plt.ylabel('Land Average Temperature')
plt.title('Land Average Temperature Prediction (2021 to 2025)')

# Add a legend
plt.legend()

# Show the plot
plt.grid(True)
plt.show()
Figure 5.13.3. 1 Land Average Temperature Prediction Graph
Employing three separate regression methods (Linear Regression, Random Forest
Regression, and Support Vector Regression) provides projections of land average
temperature levels from 2021 to 2025. Each point on the graph shows the expected
land average temperature for a particular year and regression method.
The graph enables a comparison of the temperature forecasts produced by these
regression models, evaluating how accurately they estimate average land
temperatures. This information sheds light on how each approach predicts
temperature changes throughout the given period and is useful for understanding
and planning for probable variations in land temperatures in the upcoming years.
5.13.4 Forecasting the Concentration of Ozone
It is possible to predict future ozone concentrations in the atmosphere using the
created machine learning models. These models produce estimates of ozone
concentration levels for the following five years based on pertinent data, historical
records of ozone concentration, and potential influencing variables (such as
emissions and atmospheric conditions).
The projected ozone concentration levels over the following five years can be
visualized using these models. The forecasts give a visual depiction of the
anticipated ozone concentration oscillations, allowing us to anticipate trends and
changes in ozone levels and to gain insight into their prospective effects on
atmospheric composition over the next five years.
Here is the code for making predictions for the next five years.
# Generate years for the next five years (2021 to 2025)
future_years = np.arange(2021, 2026).reshape(-1, 1)

# Predict ozone concentrations for the next five years using the trained models
lr_predictions = lr_model.predict(future_years)
rf_predictions = rf_model.predict(future_years)
svr_predictions = svr_model.predict(future_years)

# Create a DataFrame to store the predictions
predictions_df = pd.DataFrame({
    'Year': future_years.flatten(),
    'Linear Regression': lr_predictions,
    'Random Forest Regression': rf_predictions,
    'Support Vector Regression': svr_predictions,
})
Here is a depiction of the predictions for the upcoming years produced by our
models.

Year | Linear Regression | Random Forest Regression | Support Vector Regression
-    | -                 | -                        | -

Table 5.13.4. 1 Total column: SBUV Concentration Prediction
Forecasting Total column: SBUV Concentration trends and changes will help
scientists learn more about how they could affect the climate over the next five years.
Visualization for the Total column: SBUV Concentration is given below.
# Set the figure size
plt.figure(figsize=(12, 8))

# Plot the predictions for each model
plt.plot(predictions_df['Year'], predictions_df['Linear Regression'], label='Linear Regression', marker='o')
plt.plot(predictions_df['Year'], predictions_df['Random Forest Regression'], label='Random Forest Regression', marker='o')
plt.plot(predictions_df['Year'], predictions_df['Support Vector Regression'], label='Support Vector Regression', marker='o')

# Set labels and title
plt.xlabel('Year')
plt.ylabel('Total column: SBUV Concentration')
plt.title('Total column: SBUV Concentration Prediction of Ozone (2021 to 2025)')

# Add a legend
plt.legend()

# Show the plot
plt.grid(True)
plt.show()
Figure 5.13.4. 1 Total column: SBUV Concentration Graph
The graph predicts the total column concentration of ozone (SBUV) from 2021 to
2025 using three distinct regression methods: Linear Regression, Random Forest
Regression, and Support Vector Regression. Each point on the graph shows the
estimated total column ozone concentration for a given year and regression
approach.
The graph allows for a comparison of the ozone concentrations predicted by these
regression models, providing an evaluation of how well each predicts ozone levels.
The insight it offers into how each technique predicts changes in the total column
concentration of ozone over the selected period is helpful for understanding and
planning for probable fluctuations in ozone levels in the upcoming years.
5.14 Limitations of Regression Models
5.14.1 Limitation of Decision Tree Algorithm
Decision tree algorithms have several drawbacks, such as a propensity to overfit the
training data, sensitivity to small changes in the data, and difficulty capturing
complicated relationships in the data.
5.14.2 Limitations of Random Forest
The main limitation of Random Forest is that a large number of trees can make the
algorithm too slow and ineffective for real-time predictions. In general, these
algorithms are fast to train but quite slow to make predictions once trained.
5.14.3 Limitations of Gradient Boosting Machine (GBM)
Due to the sequential construction of many weak learners, training is
computationally costly and time-consuming.
5.15 Source Code of the Project
Here is the complete source code of this project.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# CO2 (measured in PPM) 800,000BCE-2021 From EPA Climate Change Indicators
hist_co2 = pd.read_csv('/content/drive/MyDrive/Climate_data/ghg_concentrations_co2.csv', na_values="")
hist_co2.head()

# Methane (measured in PPB) 800,000BCE-2021 From EPA Climate Change Indicators
hist_methane = pd.read_csv('/content/drive/MyDrive/Climate_data/ghg_concentrations_methane.csv', na_values="")
hist_methane.head()

# Global Average Temperature - From Berkeley Earth
global_temp = pd.read_csv('/content/drive/MyDrive/Climate_data/berkley_earth_globaltemperatures.csv')
global_temp.head()

# Ozone concentrations
hist_ozone = pd.read_csv('/content/drive/MyDrive/Climate_data/ghg_concentrations_ozone.csv')
hist_ozone.head()
# Create a function to wrangle the 800k BCE Datasets
def hist_df_wrangle(ds, bl, gas):
    # Save station names for later reference
    stations = ds.iloc[5, 1:]
    stations.index = range(len(stations))
    stations.columns = range(len(stations))
    # Remove data set descriptions
    # (hist_n2o is assumed to be loaded in the same way as the other
    # EPA datasets; its read_csv call is not shown in this listing)
    if ds is not hist_n2o:
        ds = ds.iloc[bl:]
    ds = ds.iloc[7:]
    ds.columns = ["year"] + list(ds.columns[1:])
    # Convert to numeric
    ds = ds.apply(pd.to_numeric, errors='coerce')
    # Get one averaged value for each year
    ds["average"] = ds.iloc[:, 1:].mean(axis=1, skipna=True)
    return ds

# Wrangle CO2
hist_co2 = hist_df_wrangle(hist_co2, 1310, "co2")

# Wrangle Methane
hist_methane = hist_df_wrangle(hist_methane, 2183, "methane")
hist_methane = hist_methane.iloc[30:]

# Prepare 1750-to-current sets
hist_50_co2 = hist_co2[hist_co2["year"] >= 1750][["year", "average"]]
hist_50_methane = hist_methane[hist_methane["year"] >= 1750][["year", "average"]]
hist_50_n2o = hist_n2o[hist_n2o["year"] >= 1750][["year", "average"]]

# Global Temp Yearly Averages
global_temp["year"] = pd.to_datetime(global_temp["dt"]).dt.year
global_avg = global_temp[~global_temp["landandoceanaveragetemperature"].isna()]
global_avg = global_avg.groupby("year")["landandoceanaveragetemperature"].mean().reset_index()
global_co2 = hist_50_co2[hist_50_co2["year"] >= global_avg["year"].min()]
# Radiative forcing plot (rad_force is assumed to hold the radiative-forcing
# table used in Section 5.8; its load is not shown in this listing)
fig, ax = plt.subplots(figsize=(12, 6))
sns.set_palette("GnBu_r")

# Plot radiative forcing for each gas separately
sns.lineplot(data=rad_force, x="year", y="co2", label="CO2", ax=ax)
sns.lineplot(data=rad_force, x="year", y="ch4", label="CH4", ax=ax)
sns.lineplot(data=rad_force, x="year", y="n2o", label="N2O", ax=ax)
sns.lineplot(data=rad_force, x="year", y="cfc12", label="CFC-12", ax=ax)
sns.lineplot(data=rad_force, x="year", y="cfc11", label="CFC-11", ax=ax)
sns.lineplot(data=rad_force, x="year", y="15_minor", label="Other 15 Minor Gases", ax=ax)
sns.lineplot(data=rad_force, x="year", y="total", label="Total", ax=ax)

# Customize labels and title
ax.set_xlabel('Year (1979 - 2015)')
ax.set_ylabel('Radiative Forcing (W/m^2)')
ax.set_title('Greenhouse Gas Radiative Forcing')
ax.set_xlim(1979, 2015)
ax.set_ylim(0, 3)
ax.set_xticks(range(1980, 2015, 5))
plt.legend(title="Gas", loc="upper left", frameon=False)
plt.show()

# Ozone layer plot
plt.figure(figsize=(10, 6))
plt.plot(hist_ozone["Year"], hist_ozone["Total column: SBUV"], marker='o', label="Total Column (SBUV)")
plt.plot(hist_ozone["Year"], hist_ozone["Troposphere"], marker='o', label="Troposphere")
plt.plot(hist_ozone["Year"], hist_ozone["Stratosphere"], marker='o', label="Stratosphere")
plt.title("Ozone Layer Data Over the Years")
plt.xlabel("Year")
plt.ylabel("Ozone Concentration")
plt.legend()
plt.grid(True)
plt.show()

# Halogen compound plot (hist_halogen is assumed to hold the halogen
# concentration table used in Section 5.8; its load is not shown in this listing)
fig, ax = plt.subplots(figsize=(12, 6))

# Plot the data for each halogen compound
for column in hist_halogen.columns[1:]:
    ax.plot(hist_halogen["Year"], hist_halogen[column], label=column)

# Customize the plot
ax.set_xlabel('Year')
ax.set_ylabel('Concentration')
ax.set_title('Concentration of Halogen Compounds Over Time')
ax.legend(loc='upper left', bbox_to_anchor=(1, 1))

# Show the plot
plt.grid(True)
plt.tight_layout()
plt.show()
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm

# Load the CO2 concentration dataset (commented out; hist_co2 was loaded above)
# co2_data = pd.read_csv('/content/drive/MyDrive/Climate_data/ghg_concentrations_co2.csv')

# Split the data into features (year) and the target variable (CO2 levels)
hist_co2 = hist_co2.fillna(0)
X = hist_co2[['year']]
y = hist_co2['average']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Linear Regression
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
lr_pred = lr_model.predict(X_test)

# Random Forest Regression
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)

# Support Vector Regression
svr_model = SVR(kernel='linear')
svr_model.fit(X_train, y_train)
svr_pred = svr_model.predict(X_test)

# Time Series Model (Simple Moving Average)
rolling_mean = y_train.rolling(window=5).mean().iloc[-1]
ts_pred = np.full_like(y_test, fill_value=rolling_mean)

# Calculate mean squared error and R-squared for each model
def evaluate_model(y_true, y_pred, model_name):
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{model_name} - Mean Squared Error: {mse:.2f}, R-squared: {r2:.2f}")

evaluate_model(y_test, lr_pred, "Linear Regression")
evaluate_model(y_test, rf_pred, "Random Forest Regression")
evaluate_model(y_test, svr_pred, "Support Vector Regression")
evaluate_model(y_test, ts_pred, "Time Series (Moving Average)")

# Extend the year range for forecasting (e.g., predict for the next 5 years)
next_years = np.arange(X_test['year'].max() + 1, X_test['year'].max() + 6)
X_forecast = pd.DataFrame({'year': next_years})
# Use the trained models to make forecasts
lr_forecast = lr_model.predict(X_forecast)
rf_forecast = rf_model.predict(X_forecast)
svr_forecast = svr_model.predict(X_forecast)

# Time Series Model (Simple Moving Average) - continue with the rolling mean
rolling_mean = y.rolling(window=5).mean().iloc[-1]
ts_forecast = np.full(5, rolling_mean)

# Create a DataFrame to display and visualize the forecasts for the next 5 years
forecast_df = pd.DataFrame({
    'Year': next_years,
    'Linear Regression': lr_forecast,
    'Random Forest Regression': rf_forecast,
    'Support Vector Regression': svr_forecast,
    'Time Series (Moving Average)': ts_forecast
})

# Set the 'Year' column as the index for plotting
forecast_df.set_index('Year', inplace=True)

# Plot the forecasts for the next 5 years
plt.figure(figsize=(12, 6))
plt.plot(forecast_df.index, forecast_df['Linear Regression'], label='Linear Regression', marker='o')
plt.plot(forecast_df.index, forecast_df['Random Forest Regression'], label='Random Forest Regression', marker='o')
plt.plot(forecast_df.index, forecast_df['Support Vector Regression'], label='Support Vector Regression', marker='o')
plt.plot(forecast_df.index, forecast_df['Time Series (Moving Average)'], label='Time Series (Moving Average)', marker='o')
plt.xlabel('Year')
plt.ylabel('CO2 Concentration')
plt.title('CO2 Concentration Forecasts for the Next 5 Years')
plt.legend()
plt.grid(True)
plt.show()
CONCLUSION & FUTURE WORK
This report highlights the development of a climate data analysis system that
employs machine learning techniques to predict future climate variables. The
primary goal of this system is to forecast various climate-related parameters, such
as greenhouse gas concentrations, temperature levels, and ozone concentrations,
with the aim of gaining valuable insights into the Earth's evolving climate.
Our implementation uses historical climate data to train machine learning models,
including details on greenhouse gas emissions, changes in land use, and other
relevant aspects. These models have proven to be capable of predicting future
climate conditions with accuracy, which has improved our knowledge of patterns
and possible effects.
We have demonstrated our capacity to produce exact predictions about factors
linked to climate through the application of machine learning models such as
Support Vector Machines, Random Forest Regression, and Time Series Analysis.
For climate research, policy development, and environmental management, these
forecasts have significant ramifications.
Future research in machine learning prediction and climate data analysis will center
on several crucial elements that will improve the system's functionality and uses. To
increase forecast accuracy, it is first necessary to extend the dataset with more
thorough and high-resolution climatic data. By incorporating data from other
sources and sensors, it will be possible to gain a more thorough knowledge of
climatic factors, which will ultimately enable the creation of projections that are
both exact and complex. It is highly promising to investigate cutting-edge machine
learning methods like deep learning and neural networks. Particularly for
complicated climate-related events, these state-of-the-art models have the potential
to provide even more precise forecasts. The system's forecasting powers may be
greatly improved by utilizing modern machine learning techniques.
Furthermore, the integration of ensemble models—multiple machine learning
algorithms combined—can alleviate the shortcomings of individual models. To
provide forecasts that are more solid and trustworthy, ensemble models combine the
advantages of many methodologies. Real-time monitoring capabilities are essential
for continually updating forecasts and guaranteeing their applicability under
changing environmental circumstances.
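As a hedged sketch of that ensemble idea, assuming the trained models from Section 5.13 (lr_model, rf_model, and svr_model) are available, an equal-weight average of their predictions could look as follows; this is an illustration, not the implementation used in this project.

import numpy as np

def ensemble_predict(X):
    # Stack the three models' predictions column-wise
    preds = np.column_stack([
        lr_model.predict(X),
        rf_model.predict(X),
        svr_model.predict(X),
    ])
    # Equal-weight average across the models
    return preds.mean(axis=1)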
Future work must include teamwork with experts from many sectors, user-friendly
visualization tools, and analyses of how the climate affects ecosystems, agriculture,
and human populations. These developments will encourage a better knowledge of
climate dynamics and enable more powerful solutions to the serious problems
brought on by climate change.
REFERENCES
[1] PNAS. Available at: https://www.pnas.org/doi/epdf/10.1073/pnas- (Accessed: 29 September 2023).
[2] Bessou, C. et al. (1970) Biofuels, Greenhouse Gases, and Climate Change, SpringerLink. Available at: https://link.springer.com/chapter/10.1007/-_20 (Accessed: 29 September 2023).
[3] National Centers for Environmental Information (NCEI) (no date) Climate Data Online: Dataset Discovery, Datasets | Climate Data Online (CDO) | National Climatic Data Center (NCDC). Available at: https://www.ncdc.noaa.gov/cdo-web/datasets (Accessed: 29 September 2023).
[4] Climate change indicators: Atmospheric concentrations of greenhouse gases. Available at: https://www.epa.gov/climate-indicators/climate-change-indicators-atmospheric-concentrations-greenhouse-gases (Accessed: 29 September 2023).
[5] National Centers for Environmental Information (NCEI) (no date) National Centers for Environmental Information (NCEI). Available at: https://www.ncei.noaa.gov/access/paleo-search/ (Accessed: 29 September 2023).
[6] (2022) Time series from scratch - moving averages (MA) theory and implementation, Medium. Available at: https://towardsdatascience.com/time-series-from-scratch-moving-averages-ma-theory-and-implementation-a01b97b60a18 (Accessed: 29 September 2023).
[7] Linear regression in machine learning (2023) GeeksforGeeks. Available at: https://www.geeksforgeeks.org/ml-linear-regression/ (Accessed: 29 September 2023).
[8] Saini, A. (2023) Decision tree algorithm - A complete guide, Analytics Vidhya. Available at: https://www.analyticsvidhya.com/blog/2021/08/decision-tree-algorithm/ (Accessed: 29 September 2023).
[9] Random Forest: A complete guide for machine learning (no date) Built In. Available at: https://builtin.com/data-science/random-forest-algorithm (Accessed: 29 September 2023).
[10] López, O.A.M., López, A.M. and Crossa, J. (1970) Support vector machines and support vector regression, SpringerLink. Available at: https://link.springer.com/chapter/10.1007/-_9 (Accessed: 29 September 2023).
[11] Goyal, S. (2021) Evaluation metrics for regression models, Medium. Available at: https://medium.com/analytics-vidhya/evaluation-metrics-for-regression-models-c91c65d73af (Accessed: 29 September 2023).
[12] Lindsey, R. (no date) Climate change: Atmospheric carbon dioxide, NOAA Climate.gov. Available at: https://www.climate.gov/news-features/understanding-climate/climate-change-atmospheric-carbon-dioxide (Accessed: 29 September 2023).
[13] Methane (2023) NASA. Available at: https://climate.nasa.gov/vital-signs/methane/ (Accessed: 29 September 2023).
[14] Data Overview (2023) Berkeley Earth. Available at: https://berkeleyearth.org/data/ (Accessed: 29 September 2023).
[15] FAQ: What is the greenhouse effect? (no date) NASA. Available at: https://climate.nasa.gov/faq/19/what-is-the-greenhouse-effect/ (Accessed: 29 September 2023).