Mohamed Ihab Khalifa | Freelancer Web Scraping And Analyzing Stock Data

Web Scraping and Analyzing Stock Data

This project involves web scraping and analyzing stock data for Tesla and GameStop. It was completed as part of a project-based course, 'Python Project for Data Science', offered online by IBM. It uses several popular Python libraries for data scraping and data analysis: 'yfinance' to extract stock data from Yahoo Finance, a widely used platform for financial news and reports; 'Beautiful Soup' for parsing HTML documents; and 'pandas' and 'plotly' for data analysis and visualization. Using these libraries, I'll scrape stock data and quarterly revenue data for both Tesla and GameStop, and then visualize their changes over time on a simple interactive dashboard.

Overall, this project covers a range of data science tasks, including:
- Web scraping stock and revenue data
- Cleaning up and updating data
- Data parsing
- Creating an interactive dashboard for visualizing data

The data analyzed in this project consist of the stock data and quarterly revenue amounts for Tesla and GameStop. The stock data for both companies are scraped from Yahoo Finance, which provides records of Tesla's stock dating from 2010 and GameStop's stock dating from 2002, both up to the present day. Tesla's quarterly revenue is scraped from www.macrotrends.net, a platform that also publishes financial records and reports. GameStop's quarterly revenue, however, is scraped from a special link provided by IBM for the original project course. I've used IBM's link because it provides GameStop's quarterly revenue dating from 2005, whereas websites like 'macrotrends' publish financial records that date no earlier than 2009.
Note: if you are using the executable notebook version, run the first two cells before executing any of the code below so that the Python packages needed for the task are installed, imported, and ready for use. To run a cell, select it and click the 'Run' icon on the notebook toolbar.

In [ ]:
    #Installing the Python modules to be used
    !pip install yfinance
    !pip install pandas
    !pip install requests
    !pip install bs4
    !pip install plotly

In [2]:
    #Importing the modules for use
    import requests
    import yfinance as yf
    import pandas as pd
    from bs4 import BeautifulSoup
    import plotly.graph_objects as go
    from plotly.subplots import make_subplots
    import warnings
    warnings.simplefilter("ignore")

Part One: Using yfinance to Extract Stock Data

For this part, I will use the yfinance module to extract Tesla's and GameStop's stock data from Yahoo Finance. I will set the period to 'max', which returns each stock's full available history (from 2010 for Tesla and from 2002 for GameStop).
In [3]:
    #First, extracting Tesla's stock data from Yahoo Finance
    #Creating a Ticker object for Tesla's data
    tesla = yf.Ticker('TSLA')
    #extracting Tesla's stock data (dated since the beginning)
    tesla_data = tesla.history(period='max')
    #resetting the index (and retrieving dates as a column, 'Date')
    tesla_data.reset_index(inplace=True)
    #to preview the first 5 entries of Tesla's stock data
    tesla_data.head()

Out[3]:
       Date   Open   High    Low  Close  Volume  Dividends  Stock Splits
    0     -  3.800  5.000  3.508  4.778       -          0           0.0
    1     -  5.158  6.084  4.660  4.766       -          0           0.0
    2     -  5.000  5.184  4.054  4.392       -          0           0.0
    3     -  4.600  4.620  3.742  3.840       -          0           0.0
    4     -  4.000  4.000  3.166  3.222       -          0           0.0

In [4]:
    #Now extracting GameStop's stock data
    #Creating a ticker object for GameStop's data
    gme = yf.Ticker('GME')
    #extracting GameStop's stock data
    gme_data = gme.history(period='max')
    #resetting the index (and retrieving dates)
    gme_data.reset_index(inplace=True)
    #to preview the first 5 entries of GameStop's stock data
    gme_data.head()

Out[4]:
       Date  Open  High  Low  Close  Volume  Dividends  Stock Splits
    0     -     -     -    -      -       -        0.0           0.0
    1     -     -     -    -      -       -        0.0           0.0
    2     -     -     -    -      -       -        0.0           0.0
    3     -     -     -    -      -       -        0.0           0.0
    4     -     -     -    -      -       -        0.0           0.0

Part Two: Web Scraping Tesla's and GameStop's Revenue Data

In this section, I will use the Python libraries requests, pandas, and Beautiful Soup to scrape Tesla's and GameStop's historical revenue data.

Retrieving Tesla's historical revenue data

In [5]:
    #First, specifying the web page to scrape the data from
    tesla_url = "https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue"
    #making a get request to extract the html document
    tesla_html_data = requests.get(tesla_url).text
    #creating a beautiful soup object to parse html document
    tesla_soup = BeautifulSoup(tesla_html_data)

Two methods to retrieve the revenue data:

i) Using Python's pandas library: The first technique uses the pandas read_html() function, which takes the HTML document and a piece of text identifying the table, to scrape the revenue table in one step.
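To see how read_html() picks a table by its text, here is a minimal offline sketch. The HTML snippet, its table titles, and its values are made up for illustration and do not come from the scraped page. The snippet assumes an HTML parser backend for pandas is installed (lxml, or bs4 together with html5lib), and it wraps the string in StringIO because recent pandas versions expect a file-like object rather than a literal HTML string.

```python
from io import StringIO

import pandas as pd

# Made-up HTML with two tables, for illustration only (not the real page)
html = """
<table>
  <tr><th>Annual Figures</th><th>Value</th></tr>
  <tr><td>2020</td><td>$10</td></tr>
</table>
<table>
  <tr><th>Quarterly Figures</th><th>Value</th></tr>
  <tr><td>2021-03-31</td><td>$1,234</td></tr>
  <tr><td>2020-12-31</td><td>$987</td></tr>
</table>
"""

# 'match' keeps only the tables whose text matches the given string/regex
tables = pd.read_html(StringIO(html), match="Quarterly Figures")

# read_html always returns a list of dataframes, even for a single match
quarterly = tables[0]

# Give the columns short, descriptive names
quarterly.columns = ["Date", "Revenue"]
print(quarterly)
```

Note that the revenue values come back as strings such as "$1,234", which is why a cleanup step is still needed before plotting.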
In [6]:
    #Extracting revenue table 'Tesla Quarterly Revenue'
    table_list = pd.read_html(tesla_html_data, #specifying the html document
                              match='Tesla Quarterly Revenue', #specifying the table to l
                              flavor='bs4') #specifying the parsing engine
    #Converting the revenue table into a dataframe
    tesla_revenue = table_list[0]
    #Renaming the dataframe columns appropriately
    tesla_revenue.rename(columns={'Tesla Quarterly Revenue(Millions of US $)': 'Date', 'Tesl
    #Displaying the dataframe
    tesla_revenue

Out[6]:
        Date  Revenue
    0      -  $16,934
    1      -  $18,756
    2      -  $17,719
    3      -  $13,757
    4      -  $11,958
    5      -  $10,389
    6      -  $10,744
    7      -   $8,771
    8      -   $6,036
    9      -   $5,985
    10     -   $7,384
    11     -   $6,303
    12     -   $6,350
    13     -   $4,541
    14     -   $7,226
    15     -   $6,824
    16     -   $4,002
    17     -   $3,409
    18     -   $3,288
    19     -   $2,985
    20     -   $2,790
    21     -   $2,696
    22     -   $2,285
    23     -   $2,298
    24     -   $1,270
    25     -   $1,147
    26     -   $1,214
    27     -     $937
    28     -     $955
    29     -     $940
    30     -     $957
    31     -     $852
    32     -     $769
    33     -     $621
    34     -     $615
    35     -     $431
    36     -     $405
    37     -     $562
    38     -     $306
    39     -      $50
    40     -      $27
    41     -      $30
    42     -      $39
    43     -      $58
    44     -      $58
    45     -      $49
    46     -      $36
    47     -      $31
    48     -      $28
    49     -      $21
    50     -      NaN
    51     -      $46
    52     -      $27

ii) Using Python's Beautiful Soup library: Alternatively, we can use Beautiful Soup to parse the HTML document and extract the revenue table, and then load the data into a pandas dataframe.
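A minimal offline sketch of this Beautiful Soup approach follows. The HTML snippet and its values are made up for illustration, and the names rows and revenue are hypothetical. It uses Python's built-in "html.parser" so that no extra parser package is needed. It also collects the rows in a list and builds the dataframe once at the end, because DataFrame.append() was removed in pandas 2.0.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up HTML for illustration only (not the real page)
html = """
<table>
  <thead><tr><th>Date</th><th>Revenue</th></tr></thead>
  <tbody>
    <tr><td>2021-03-31</td><td>$1,234</td></tr>
    <tr><td>2020-12-31</td><td>$987</td></tr>
  </tbody>
</table>
"""

# Parse the document with the parser from the standard library
soup = BeautifulSoup(html, "html.parser")

# Walk the table body row by row and collect one dict per row
rows = []
for row in soup.find("tbody").find_all("tr"):
    cells = row.find_all("td")
    rows.append({"Date": cells[0].text, "Revenue": cells[1].text})

# Build the dataframe in one step from the collected rows
revenue = pd.DataFrame(rows, columns=["Date", "Revenue"])
print(revenue)
```

Building the dataframe once from a list is also faster than growing it row by row, since each append-style call copies the whole dataframe.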
In [7]:
    #First, creating an empty dataframe with the necessary columns
    tesla_revenue = pd.DataFrame(columns=['Date', 'Revenue'])
    #extracting the revenue table using its tag name ('tbody') and index
    tesla_revenue_table = tesla_soup.find_all('tbody')[1]
    #looping through the table rows and extracting the data
    for row in tesla_revenue_table.find_all('tr'):
        #accessing the columns along row
        col = row.find_all('td')
        date = col[0].text
        revenue = col[1].text
        #appending date and revenue to dataframe, 'tesla_revenue'
        tesla_revenue = tesla_revenue.append({'Date': date, 'Revenue': revenue}, ignore_inde
    #displaying the dataframe
    tesla_revenue

Out[7]:
        Date  Revenue
    0      -  $16,934
    1      -  $18,756
    2      -  $17,719
    3      -  $13,757
    4      -  $11,958
    5      -  $10,389
    6      -  $10,744
    7      -   $8,771
    8      -   $6,036
    9      -   $5,985
    10     -   $7,384
    11     -   $6,303
    12     -   $6,350
    13     -   $4,541
    14     -   $7,226
    15     -   $6,824
    16     -   $4,002
    17     -   $3,409
    18     -   $3,288
    19     -   $2,985
    20     -   $2,790
    21     -   $2,696
    22     -   $2,285
    23     -   $2,298
    24     -   $1,270
    25     -   $1,147
    26     -   $1,214
    27     -     $937
    28     -     $955
    29     -     $940
    30     -     $957
    31     -     $852
    32     -     $769
    33     -     $621
    34     -     $615
    35     -     $431
    36     -     $405
    37     -     $562
    38     -     $306
    39     -      $50
    40     -      $27
    41     -      $30
    42     -      $39
    43     -      $58
    44     -      $58
    45     -      $49
    46     -      $36
    47     -      $31
    48     -      $28
    49     -      $21
    50     -
    51     -      $46
    52     -      $27

Retrieving GameStop's historical revenue data

In [8]:
    #Specifying the url to GameStop's revenue data
    gme_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDevelop
    #extracting html document
    gme_html_data = requests.get(gme_url).text
    #creating a beautiful soup object for parsing the document
    gme_soup = BeautifulSoup(gme_html_data)

Once again, we could parse and extract the data in two ways:

i) Using Python's pandas library:

In [9]:
    #Extracting revenue table 'GameStop Quarterly Revenue'
    table_list = pd.read_html(gme_html_data, # specifying the html document
                              match='GameStop Quarterly Revenue', # specifying the table to
                              flavor='bs4') # specifying the parse engine
    #converting the revenue table into a dataframe
    gme_revenue = table_list[0]
    #renaming the dataframe columns appropriately
    gme_revenue.rename(columns={'GameStop Quarterly Revenue(Millions of US $)': 'Date', 'Gam
    #displaying the dataframe
    gme_revenue

Out[9]:
        Date  Revenue
    0      -   $1,021
    1      -   $2,194
    2      -   $1,439
    3      -   $1,286
    4      -   $1,548
    ...  ...      ...
    57     -   $1,667
    58     -     $534
    59     -     $416
    60     -     $475
    61     -     $709

    62 rows × 2 columns

ii) Using Python's Beautiful Soup library:

In [10]:
    #First, creating an empty dataframe with the necessary columns
    gme_revenue = pd.DataFrame(columns=['Date', 'Revenue'])
    #extracting the revenue table using its tag name ('tbody') and index
    gme_revenue_table = gme_soup.find_all('tbody')[1]
    #looping through the table rows and extracting the data
    for row in gme_revenue_table.find_all('tr'):
        #accessing the columns along row
        col = row.find_all('td')
        date = col[0].text
        revenue = col[1].text
        #appending date and revenue to dataframe, gme_revenue
        gme_revenue = gme_revenue.append({'Date': date, 'Revenue': revenue}, ignore_index=Tru
    #displaying the dataframe
    gme_revenue

Out[10]:
        Date  Revenue
    0      -   $1,021
    1      -   $2,194
    2      -   $1,439
    3      -   $1,286
    4      -   $1,548
    ...  ...      ...
    57     -   $1,667
    58     -     $534
    59     -     $416
    60     -     $475
    61     -     $709

    62 rows × 2 columns

Part Three: Cleaning Up the Data

Now that the revenue data for Tesla and GameStop have been extracted, it is clear that some entries are missing and that others contain non-numeric characters (as in the Revenue column). It's time to clean up the data and prepare them for analysis.
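As a minimal sketch of what such a cleanup involves, the snippet below works on a small made-up dataframe (the name sample and all of its values are illustrative, not scraped data). It passes regex=True explicitly, since the default behavior of str.replace() differs between pandas versions, and it finishes by converting the column to integers, which is an optional extra step assumed here for illustration.

```python
import pandas as pd

# Made-up sample rows for illustration only (not scraped data)
sample = pd.DataFrame({
    "Date": ["2021-03-31", "2020-12-31", "2020-09-30", "2020-06-30"],
    "Revenue": ["$1,234", "$987", "", None],
})

# Strip dollar signs and thousands separators using a regular expression
sample["Revenue"] = sample["Revenue"].str.replace(r"[$,]", "", regex=True)

# Drop missing (NaN/None) entries, then drop empty strings
sample = sample.dropna()
sample = sample[sample["Revenue"] != ""].copy()

# Convert the cleaned strings to integers so they can be plotted or summed
sample["Revenue"] = sample["Revenue"].astype(int)
print(sample)
```

After these steps only the two rows with real values remain, and the Revenue column holds numbers rather than text.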
In [11]:
    #First, cleaning up Tesla's revenue data
    #removing the non-numeric/special characters from 'Revenue' column
    #(regex=True is required on newer pandas versions for the pattern to be treated as a regex)
    tesla_revenue["Revenue"] = tesla_revenue['Revenue'].str.replace(r',|\$', "", regex=True)
    #second, removing NaN (not a number) and empty entries
    tesla_revenue.dropna(inplace=True)
    true_revenues = tesla_revenue['Revenue'] != ""
    tesla_revenue = tesla_revenue[true_revenues]
    #previewing the dataframe
    tesla_revenue.head()

Out[11]:
       Date  Revenue
    0     -    16934
    1     -    18756
    2     -    17719
    3     -    13757
    4     -    11958

In [12]:
    #Now cleaning up GameStop's data
    #first, removing special characters from 'Revenue' column
    gme_revenue["Revenue"] = gme_revenue['Revenue'].str.replace(r',|\$', "", regex=True)
    #second, removing NaN (not a number) and empty entries
    gme_revenue.dropna(inplace=True)
    true_revenues = gme_revenue['Revenue'] != ""
    gme_revenue = gme_revenue[true_revenues]
    #previewing the dataframe
    gme_revenue.head()

Out[12]:
       Date  Revenue
    0     -     1021
    1     -     2194
    2     -     1439
    3     -     1286
    4     -     1548

Part Four: Visualizing Tesla and GameStop's Stock Data

For this section, I'll use plotly, a powerful graphing library, to create an interactive dashboard from the stock and revenue data extracted for Tesla and GameStop. The data for each company will be plotted separately. To do so, I'll first define a function for visualizing the data on a subplot. The subplot consists of two scatter plots: the upper one displays a company's historical share price and the lower one displays its historical revenue. A range slider between the two plots lets the user navigate freely and zoom in on any segment of the data within a particular range of time.
In [13]:
    #First, defining the graph function for visualizing the data
    def make_graph(stock_data, revenue_data, stock):
        """This function takes a dataframe with the stock data (must include Date and Close price),
        a dataframe with the revenue data (must include Date and Revenue amounts), and the name of
        the stock, and plots the historical share price and revenue data on a subplot comprising
        two scatter plot graphs, one for each."""
        fig = make_subplots(rows=2, cols=1, shared_xaxes=True, subplot_titles=("Historical S
        stock_data_specific = stock_data
        revenue_data_specific = revenue_data
        fig.add_trace(go.Scatter(x=pd.to_datetime(stock_data_specific.Date, infer_datetime_f
        fig.add_trace(go.Scatter(x=pd.to_datetime(revenue_data_specific.Date, infer_datetime_
        fig.update_xaxes(title_text="Date", row=1, col=1)
        fig.update_xaxes(title_text="Date", row=2, col=1)
        fig.update_yaxes(title_text="Price ($US)", row=1, col=1)
        fig.update_yaxes(title_text="Revenue ($US Millions)", row=2, col=1)
        fig.update_layout(showlegend=False,
                          height=900,
                          title=stock,
                          xaxis_rangeslider_visible=True)
        fig.show()

Visualizing Tesla's Stock and Revenue Data

In [14]:
    #Executing the make_graph() function to plot Tesla's historical data
    make_graph(tesla_data, tesla_revenue, 'Tesla Stock Data')

[Figure: Tesla Stock Data. Upper panel: Historical Share Price, Price ($US) against Date. Lower panel: Historical Revenue, Revenue ($US Millions) against Date, 2010 to 2022.]

Visualizing GameStop's Stock and Revenue Data

In [15]:
    #Again, executing the make_graph() function to plot GameStop's historical data
    make_graph(gme_data, gme_revenue, 'GameStop Stock Data')

[Figure: GameStop Stock Data. Upper panel: Historical Share Price, Price ($US) against Date. Lower panel: Historical Revenue, Revenue ($US Millions) against Date.]

Note: hovering the mouse cursor over any data point on the graph displays the date and the share price or revenue amount (in USD) for that data point.
Also, feel free to use the range slider to zoom in on any particular time interval and preview the data trends within that interval.

In [16]:
    #END