Web Scraping and Analyzing Stock Data
This project involves scraping and analyzing stock data for Tesla and GameStop. It
was completed as part of a project-based course, 'Python Project for Data Science',
offered online by IBM. It uses several popular Python libraries for data scraping and
data analysis: 'yfinance' to extract stock data from Yahoo Finance, a widely used
platform for financial news and reports; the 'beautiful soup' library for parsing HTML
documents; and the 'pandas' and 'plotly' libraries for data analysis and visualization.
Using these libraries, I'll scrape stock and quarterly revenue data for both Tesla and
GameStop, and then visualize their changes over time on a simple interactive dashboard.
Overall, this project covers a range of data science tasks, including:
Web scraping stock and revenue data
Cleaning up and updating data
Data parsing
Creating an interactive dashboard for visualizing data
The data analyzed in this project consist of the stock data and quarterly revenue for Tesla and
GameStop. The stock data for both companies are scraped from Yahoo Finance, which provides
records of Tesla's stock from 2010 and GameStop's stock from 2002, both up to the present day.
Tesla's quarterly revenue is scraped from www.macrotrends.net, a platform that also publishes
financial records and reports. GameStop's quarterly revenue, however, is scraped from a link
provided by IBM for the original project course. I've used IBM's link here because it provides
GameStop's quarterly revenue from 2005 onward, whereas websites like 'macrotrends' publish
financial records that date no earlier than 2009.
Note: if you are using the executable notebook version, run the first two cells before executing
any of the code below so that the necessary Python packages are installed, imported, and ready
for use. To run a cell, select it and click the 'Run' icon on the notebook toolbar.
In [ ]: #Installing the Python modules to be used
!pip install yfinance
!pip install pandas
!pip install requests
!pip install bs4
!pip install plotly
In [2]: #Importing the modules for use
import requests
import yfinance as yf
import pandas as pd
from bs4 import BeautifulSoup
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.simplefilter("ignore")
Part One: Using yfinance to Extract Stock Data
For this part, I will use the yfinance module to extract Tesla's and GameStop's stock data from Yahoo
Finance. I will set the period to 'max', which returns the full available price history for each stock.
In [3]: #First, extracting Tesla's stock data from Yahoo Finance
#Creating a Ticker object for Tesla's data
tesla = yf.Ticker('TSLA')
#extracting Tesla's stock data (dated since the beginning)
tesla_data = tesla.history(period='max')
#resetting the index (and retrieving dates as a column, 'Date')
tesla_data.reset_index(inplace=True)
#to preview the first 5 entries of Tesla's stock data
tesla_data.head()
Out[3]:
   Date   Open   High    Low  Close  Volume  Dividends  Stock Splits
0     -  3.800  5.000  3.508  4.778       -          0           0.0
1     -  5.158  6.084  4.660  4.766       -          0           0.0
2     -  5.000  5.184  4.054  4.392       -          0           0.0
3     -  4.600  4.620  3.742  3.840       -          0           0.0
4     -  4.000  4.000  3.166  3.222       -          0           0.0
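The reset_index() step matters because the price history is indexed by date, so the dates are not available as a regular column until the index is reset. The following is a minimal sketch of that behavior on a small, made-up dataframe; the values and the names prices and flat are illustrative assumptions, not Yahoo Finance data.

```python
import pandas as pd

# Made-up prices indexed by date, mimicking the shape of a price-history dataframe
idx = pd.DatetimeIndex(["2020-01-02", "2020-01-03"], name="Date")
prices = pd.DataFrame({"Close": [10.0, 10.5]}, index=idx)

# Before the reset, 'Date' is the index rather than a column
print(prices.columns.tolist())

# reset_index() moves the index into a regular 'Date' column
flat = prices.reset_index()
print(flat.columns.tolist())
```

Once 'Date' is a regular column, it can be passed to a plotting function like any other column.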
In [4]: #Now extracting GameStop's stock data
#Creating a ticker object for GameStop's data
gme = yf.Ticker('GME')
#extracting GameStop's stock data
gme_data = gme.history(period='max')
#resetting the index (and retrieving dates)
gme_data.reset_index(inplace=True)
#to preview the first 5 entries of GameStop's stock data
gme_data.head()
Out[4]:
   Date  Open  High  Low  Close  Volume  Dividends  Stock Splits
0     -     -     -    -      -       -        0.0           0.0
1     -     -     -    -      -       -        0.0           0.0
2     -     -     -    -      -       -        0.0           0.0
3     -     -     -    -      -       -        0.0           0.0
4     -     -     -    -      -       -        0.0           0.0
Part Two: Web Scraping Tesla's and GameStop's Revenue Data
In this section, I will use the Python libraries requests, pandas, and beautiful soup to scrape Tesla's and
GameStop's historical revenue data.
Retrieving Tesla's historical revenue data
In [5]: #First, specifying the web page to scrape the data from
tesla_url = "https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue"
#making a get request to extract the html document
tesla_html_data = requests.get(tesla_url).text
#creating a beautiful soup object to parse html document
tesla_soup = BeautifulSoup(tesla_html_data)
Two methods to retrieve the revenue data:
i) Using Python's pandas library:
The first technique uses the pandas read_html() function, which takes the HTML document and a
text pattern that the target table must contain, and scrapes the revenue table in one step.
In [6]: #Extracting revenue table 'Tesla Quarterly Revenue'
table_list = pd.read_html(tesla_html_data,                  #specifying the html document
                          match='Tesla Quarterly Revenue',  #text the target table must contain
                          flavor='bs4')                     #specifying the parsing engine
#Converting the revenue table into a dataframe
tesla_revenue = table_list[0]
#Renaming the dataframe columns appropriately
tesla_revenue.rename(columns={'Tesla Quarterly Revenue(Millions of US $)': 'Date', 'Tesl
#Displaying the dataframe
tesla_revenue
Out[6]:
    Date  Revenue
0   -     $16,934
1   -     $18,756
2   -     $17,719
3   -     $13,757
4   -     $11,958
5   -     $10,389
6   -     $10,744
7   -     $8,771
8   -     $6,036
9   -     $5,985
10  -     $7,384
11  -     $6,303
12  -     $6,350
13  -     $4,541
14  -     $7,226
15  -     $6,824
16  -     $4,002
17  -     $3,409
18  -     $3,288
19  -     $2,985
20  -     $2,790
21  -     $2,696
22  -     $2,285
23  -     $2,298
24  -     $1,270
25  -     $1,147
26  -     $1,214
27  -     $937
28  -     $955
29  -     $940
30  -     $957
31  -     $852
32  -     $769
33  -     $621
34  -     $615
35  -     $431
36  -     $405
37  -     $562
38  -     $306
39  -     $50
40  -     $27
41  -     $30
42  -     $39
43  -     $58
44  -     $58
45  -     $49
46  -     $36
47  -     $31
48  -     $28
49  -     $21
50  -     NaN
51  -     $46
52  -     $27
ii) Using Python's Beautiful Soup library:
Alternatively, we can use beautiful soup to parse the HTML document and extract the revenue
table, then load the data into a pandas dataframe.
In [7]: #First, creating an empty dataframe with the necessary columns
tesla_revenue = pd.DataFrame(columns=['Date', 'Revenue'])
#extracting the revenue table using its tag name ('tbody') and index
tesla_revenue_table = tesla_soup.find_all('tbody')[1]
#looping through the table rows and extracting the data
for row in tesla_revenue_table.find_all('tr'):
    #accessing the columns along row
    col = row.find_all('td')
    date = col[0].text
    revenue = col[1].text
    #appending date and revenue to dataframe, 'tesla_revenue'
    tesla_revenue = tesla_revenue.append({'Date': date, 'Revenue': revenue}, ignore_inde
#displaying the dataframe
tesla_revenue
Out[7]:
    Date  Revenue
0   -     $16,934
1   -     $18,756
2   -     $17,719
3   -     $13,757
4   -     $11,958
5   -     $10,389
6   -     $10,744
7   -     $8,771
8   -     $6,036
9   -     $5,985
10  -     $7,384
11  -     $6,303
12  -     $6,350
13  -     $4,541
14  -     $7,226
15  -     $6,824
16  -     $4,002
17  -     $3,409
18  -     $3,288
19  -     $2,985
20  -     $2,790
21  -     $2,696
22  -     $2,285
23  -     $2,298
24  -     $1,270
25  -     $1,147
26  -     $1,214
27  -     $937
28  -     $955
29  -     $940
30  -     $957
31  -     $852
32  -     $769
33  -     $621
34  -     $615
35  -     $431
36  -     $405
37  -     $562
38  -     $306
39  -     $50
40  -     $27
41  -     $30
42  -     $39
43  -     $58
44  -     $58
45  -     $49
46  -     $36
47  -     $31
48  -     $28
49  -     $21
50  -
51  -     $46
52  -     $27
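One caveat worth noting: DataFrame.append() was removed in pandas 2.0, so a row-by-row append loop fails on newer pandas versions. A common alternative is to collect the rows in a plain list and build the dataframe once at the end. The following is a minimal sketch of that pattern on a tiny, hypothetical HTML snippet; the dates and amounts are invented for illustration, and 'html.parser' is an assumed parser choice rather than the one used in this project.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical miniature page: the first table is a decoy, the second holds the data
html = """
<table><tbody><tr><td>2021</td><td>$5,000</td></tr></tbody></table>
<table><tbody>
  <tr><td>2021-01-31</td><td>$1,234</td></tr>
  <tr><td>2020-10-31</td><td>$987</td></tr>
  <tr><td>2020-07-31</td><td></td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
# Selecting the second <tbody>, as in the cell above
table_body = soup.find_all("tbody")[1]

# Collecting each row as a dict instead of appending to the dataframe
rows = []
for tr in table_body.find_all("tr"):
    cells = tr.find_all("td")
    rows.append({"Date": cells[0].text, "Revenue": cells[1].text})

# Building the dataframe once from the collected rows
revenue = pd.DataFrame(rows, columns=["Date", "Revenue"])
print(revenue)
```

Building the dataframe in a single step also avoids copying it on every loop iteration. Note that an empty table cell comes through as an empty string, not NaN, which is why the cleaning step later filters on both.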
Retrieving GameStop's historical revenue data
In [8]: #Specifying the url to GameStop's revenue data
gme_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDevelop
#extracting html document
gme_html_data = requests.get(gme_url).text
#creating a beautiful soup object for parsing the document
gme_soup = BeautifulSoup(gme_html_data)
Once again, we could parse and extract the data in two ways:
i) Using Python's pandas library:
In [9]: #Extracting revenue table 'GameStop Quarterly Revenue'
table_list = pd.read_html(gme_html_data,                       # specifying the html document
                          match='GameStop Quarterly Revenue',  # text the target table must contain
                          flavor='bs4')                        # specifying the parse engine
#converting the revenue table into a dataframe
gme_revenue = table_list[0]
#renaming the dataframe columns appropriately
gme_revenue.rename(columns={'GameStop Quarterly Revenue(Millions of US $)': 'Date', 'Gam
#displaying the dataframe
gme_revenue
Out[9]:
     Date  Revenue
0    -     $1,021
1    -     $2,194
2    -     $1,439
3    -     $1,286
4    -     $1,548
...  ...   ...
57   -     $1,667
58   -     $534
59   -     $416
60   -     $475
61   -     $709
62 rows × 2 columns
ii) Using Python's Beautiful Soup library:
In [10]: #First, creating an empty dataframe with the necessary columns
gme_revenue = pd.DataFrame(columns=['Date', 'Revenue'])
#extracting the revenue table using its tag name ('tbody') and index
gme_revenue_table = gme_soup.find_all('tbody')[1]
#looping through the table rows and extracting the data
for row in gme_revenue_table.find_all('tr'):
    #accessing the columns along row
    col = row.find_all('td')
    date = col[0].text
    revenue = col[1].text
    #appending date and revenue to dataframe, gme_revenue
    gme_revenue = gme_revenue.append({'Date': date, 'Revenue': revenue}, ignore_index=Tru
#displaying the dataframe
gme_revenue
Out[10]:
     Date  Revenue
0    -     $1,021
1    -     $2,194
2    -     $1,439
3    -     $1,286
4    -     $1,548
...  ...   ...
57   -     $1,667
58   -     $534
59   -     $416
60   -     $475
61   -     $709
62 rows × 2 columns
Part Three: Cleaning Up the Data
Having extracted the revenue data for Tesla and GameStop, we can see that some entries are missing or
contain non-numeric characters (as in the Revenue column). It's time to clean up and prepare the data
for analysis.
In [11]: #First, cleaning up Tesla's revenue data
#removing the non-numeric/special characters from 'Revenue' column
#(regex=True is needed in pandas 2.0+, where str.replace treats the pattern literally by default)
tesla_revenue["Revenue"] = tesla_revenue['Revenue'].str.replace(r',|\$', "", regex=True)
#second, removing NaN (not a number) and empty entries
tesla_revenue.dropna(inplace=True)
true_revenues = tesla_revenue['Revenue'] != ""
tesla_revenue = tesla_revenue[true_revenues]
#previewing the dataframe
tesla_revenue.head()
Out[11]:
   Date  Revenue
0  -     16934
1  -     18756
2  -     17719
3  -     13757
4  -     11958
In [12]: #Now cleaning up GameStop's data
#first, removing special characters from 'Revenue' column
#(regex=True is needed in pandas 2.0+, where str.replace treats the pattern literally by default)
gme_revenue["Revenue"] = gme_revenue['Revenue'].str.replace(r',|\$', "", regex=True)
#second, removing NaN (not a number) and empty entries
gme_revenue.dropna(inplace=True)
true_revenues = gme_revenue['Revenue'] != ""
gme_revenue = gme_revenue[true_revenues]
#previewing the dataframe
gme_revenue.head()
Out[12]:
   Date  Revenue
0  -     1021
1  -     2194
2  -     1439
3  -     1286
4  -     1548
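After this cleanup the Revenue values are still stored as strings. If numeric operations such as sums or sorting by amount are needed, one possible further step is to convert the columns to proper types. The sketch below shows the whole cleaning sequence plus that conversion on an invented sample; the dates, amounts, and the names raw and cleaned are assumptions for illustration only.

```python
import pandas as pd

# Invented sample containing the same kinds of problems: '$', ',', a missing value, an empty string
raw = pd.DataFrame({
    "Date": ["2021-01-31", "2020-10-31", "2020-07-31", "2020-04-30"],
    "Revenue": ["$1,234", "$987", None, ""],
})

cleaned = raw.copy()
# Stripping '$' and ',' with an explicit regular expression
cleaned["Revenue"] = cleaned["Revenue"].str.replace(r"[$,]", "", regex=True)
# Dropping missing values, then empty strings
cleaned = cleaned.dropna(subset=["Revenue"])
cleaned = cleaned[cleaned["Revenue"] != ""]
# Optional extra step: converting strings to numbers and dates
cleaned["Revenue"] = pd.to_numeric(cleaned["Revenue"])
cleaned["Date"] = pd.to_datetime(cleaned["Date"])

print(cleaned)
print(cleaned.dtypes)
```

With numeric revenue, calls like cleaned["Revenue"].sum() behave as expected instead of concatenating strings.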
Part Four: Visualizing Tesla's and GameStop's Stock Data
For this section, I'll use plotly, a powerful graphing library, to create an interactive dashboard with the stock
and revenue data extracted for Tesla and GameStop. The data for each company will be plotted separately.
To do so, I'll first define a function for visualizing the data on a subplot. The subplot will consist of two
scatter plots, the first displaying a company's historical share prices (above) and the second displaying its
historical revenue (below). The two are separated by a range slider that lets the user zoom in on any
particular segment of the data within a chosen range of time.
In [13]: #First, defining the graph function for visualizing the data
def make_graph(stock_data, revenue_data, stock):
    """This function takes a dataframe with the stock data (must include Date and Close p
    a dataframe with the revenue data (must include Date and Revenue amounts), and the n
    the stock, and plots the historical share price and revenue data on a subplot compri
    two scatter plot graphs, one for each."""
    fig = make_subplots(rows=2, cols=1, shared_xaxes=True, subplot_titles=("Historical S
    stock_data_specific = stock_data
    revenue_data_specific = revenue_data
    fig.add_trace(go.Scatter(x=pd.to_datetime(stock_data_specific.Date, infer_datetime_f
    fig.add_trace(go.Scatter(x=pd.to_datetime(revenue_data_specific.Date, infer_datetime_
    fig.update_xaxes(title_text="Date", row=1, col=1)
    fig.update_xaxes(title_text="Date", row=2, col=1)
    fig.update_yaxes(title_text="Price ($US)", row=1, col=1)
    fig.update_yaxes(title_text="Revenue ($US Millions)", row=2, col=1)
    fig.update_layout(showlegend=False,
                      height=900,
                      title=stock,
                      xaxis_rangeslider_visible=True)
    fig.show()
Visualizing Tesla's Stock and Revenue Data
In [14]: #Executing the make_graph() function to plot Tesla's historical data
make_graph(tesla_data, tesla_revenue, 'Tesla Stock Data')
[Figure: 'Tesla Stock Data'. Top panel: Historical Share Price (y-axis: Price ($US)). Bottom panel:
Historical Revenue (y-axis: Revenue ($US Millions)). Shared x-axis: Date, 2010 to 2022.]
Visualizing GameStop's Stock and Revenue Data
In [15]: #Again, executing the make_graph() function to plot GameStop's historical data
make_graph(gme_data, gme_revenue, 'GameStop Stock Data')
[Figure: 'GameStop Stock Data'. Top panel: Historical Share Price (y-axis: Price ($US)). Bottom panel:
Historical Revenue (y-axis: Revenue ($US Millions)). Shared x-axis: Date, 2010 to 2020.]
Note that hovering the mouse cursor over any data point on the graph displays the date and the share price
or revenue amount (in USD) for that point. You can also use the range slider to zoom in on any particular
time interval and preview the data trends within it.
In [16]: #END