White Paper: How Alternative Data Powers Predictive Analytics
How Web Data Powers Predictive Analytics in Finance
INTRODUCTION
Section 1: Beating the Market with Alternative Web Data
   Why Investment Managers are Leveraging Alternative Web Data
   How Alternative Web Data Gives Investment Managers an Edge
Section 2: The Challenges of Collecting Alternative Web Data
   Using Pattern Matching and Heuristics to Structure the Web
   Is Web Scraping Legal?
   But What About Copyright?
Section 3: Case Studies
   Pfizer – General Analysis
   Pfizer – ESG Analysis
   Pfizer – Products Analysis
A Look Ahead at Alternative Web Data
INTRODUCTION
In the past, investment management institutions relied mostly on traditional data to gain an edge in investing. Traditional data ranges from SEC filings to earnings reports and pricing information: any type of data produced by the company itself. The rise of the digital age, however, has opened up new sources of data for investors beyond the scope of traditional data. The seemingly infinite scope of alternative data includes data produced from credit cards, satellites, social media and, perhaps most importantly, the web.
With the integration of alternative data, investment management institutions, and hedge funds in particular, that once relied only on traditional data now have an edge in predicting the rise and fall of the markets. As increasing numbers of financial institutions adopt alternative data, spending on it by trading and asset management firms is set to exceed $7 billion by 2020.[1]
What was only a few years ago a question of when institutions should start using alternative data has shifted to a question of how they can organize and structure these mostly unstructured datasets. With roughly 4 billion webpages online and an estimated 1.2 million terabytes of data to be generated globally by 2025, there is no shortage of web data to sort through. As increasing numbers of investment management institutions incorporate alternative web data into their predictive algorithms, it will change the face of investment as we know it.
This white paper is intended as a guide for investment management institutions (IMs) to better understand how alternative web data is quickly becoming an essential component for generating alpha and mitigating investment risk. In addition, it explores different models of web data crawlers and what IMs need to look for as they incorporate alternative web data into their predictive analytics models.
[1] Alternative data for investment decisions: Today's innovation could be tomorrow's requirement. Deloitte Center for Financial Services, 2017.
SECTION 1: BEATING THE MARKET WITH ALTERNATIVE WEB DATA
“Your company’s biggest database isn’t your transaction, CRM, ERP or other internal database. Rather it’s the Web itself… Treat the Internet itself as your organization’s largest data source.”
– Gartner
As previously mentioned, alternative data includes any type of data beyond the scope of traditional data: satellite imagery, social media data, web data (which includes news sites, blogs, discussions and forums) and credit card data. Alternative web data, which falls under the broader category of big data, is typically unstructured and demands a process for structuring it in order to deliver insights.
The Most Popular Types of Alternative Data

Type | Example | Structured or Unstructured
Satellite imagery | Counting the number of oil-storage tanks to calculate inventory; weather data | Unstructured
Social network | Naver.com, Baidu.com, vk.com | Unstructured
Web data (news sites, blogs, discussions, forums) | Marketwatch.com, Fool.com, Seekingalpha.com, finance.yahoo.com | Unstructured
Credit card data | Credit card numbers, dates, financial amounts, phone numbers, addresses, product names, etc. | Structured
WHY INVESTMENT MANAGERS ARE LEVERAGING ALTERNATIVE WEB DATA
Web data, which includes posts from news articles, blogs, discussions and forums, offers advantages over other types of alternative data. The scale and diversity of web data are vast enough to offer highly personalized and relevant datasets for specific industries and use cases. Web monitoring services such as Webhose that also offer archived data can be particularly valuable in producing such datasets. In addition, web data such as news and blogs is constantly updated, offering companies the ability to continually keep up with their industry as well as the competition in near real time, as we will see in a case study later in this document.
Due to the above characteristics, alternative web data can be especially valuable for investment managers in its ability to enhance signal. Hedge funds were the first in the financial industry to take advantage of this type of alternative data a few years ago, but it has since gained traction among the remaining buy-side institutions as well. A WBR survey in the third quarter of 2018 found that 79% of investment institutions use alternative data. Of the non-users surveyed, 82% plan to incorporate alternative data into their trading strategies within the next year. The question is no longer when an investment management firm will start to buy alternative data, but what type of alternative data will help it in its investment strategy.
Total Buy-Side Spend on Alternative Data ($m)

2016: $232 | 2017: $400 | 2018E: $656 | 2019E: $1,088 | 2020E: $1,708

Source: Alternativedata.org
While current figures on buy-side institutions using alternative web data are unavailable, we do know that alternative web data, along with credit card data, is one of the more accurate datasets for these investment institutions.
[Figure: Most Accurate/Insightful Datasets – breakdown across Credit/Debit Card, Web Data (scraping), Web Traffic, Email Receipt and Other. Source: Alternativedata.org]
HOW ALTERNATIVE WEB DATA GIVES INVESTMENT MANAGERS AN EDGE
A recent Greenwich Associates study found that 72% of investment management institutions reported that alternative data of all types has enhanced their signal. A fifth of respondents reported that alternative data was responsible for more than 20% of their alpha. And alternative web data is among the first types of data these institutions invest in, right next to credit card data.
[Figure: % of funds using each type of dataset, ranging from 19% to 43% across dataset categories. Source: Alternativedata.org]
The accuracy of alternative web data and its potential for highly personalized datasets are what deliver investment managers that edge.
Here are a few examples of how alternative web data specifically has helped investors generate greater alpha while at the same time mitigating risk:

• Almost a week before Gilead Sciences bought Kite Pharma, an immunotherapy cancer treatment company, for $11.9 billion, AI technology predicted the acquisition, allowing investors to take advantage of the predicted market movement. Kite Pharma rose 28% in the week of the buyout.[2]

• During the US government shutdown of 2019, important reports from the US Department of Agriculture ceased to be published, making it difficult for farmers and traders to make important decisions about which crops to grow, trade and sell. With the help of publicly available alternative web data from different government and non-governmental sources, in addition to various satellite data, private agencies were able to publish their own crop supply and demand forecasts, which were critical for both farmers and commodities investors.[3]

• After the Tōhoku earthquake hit Japan, web monitoring of news stories identified the link between the disaster and the price of the iPad 2 by finding an additional news story about the destruction of a major manufacturing plant that produces NAND flash, a critical component of the iPad 2.[4]
[2] Ram, Aliya and Wigglesworth, Robin. “When Silicon Valley came to Wall Street.” Financial Times, October 26, 2017.
[3] Meyer, Gregory and Terazono, Emiko. “New Crop Data Advisors Cash in on US Shutdown.” Financial Times, March 8, 2019.
[4] Williams, Janaya. “Solving mysteries using predictive analytics.” April 23, 2014.
SECTION 2: THE CHALLENGES OF COLLECTING ALTERNATIVE WEB DATA
“Getting information off the internet is like taking a drink from a firehose.”
– Mitchell Kapor, founder of Lotus Development Corporation, designer of Lotus 1-2-3, and co-founder of the Electronic Frontier Foundation
Before investors can apply alternative web data for purposes of alpha generation and risk mitigation, however, they must first collect the data, which presents a range of different challenges.
First, the crawling service used by investment management institutions must be solid and reliable. That means it must be:

• Able to easily structure and organize the data. Most web data is unstructured and difficult to access, since it is not usually tagged or labeled. Even when organizations have access to unstructured data, they often have no way of indexing and structuring it so that it delivers insights to their customers. In the financial realm, accessing and organizing data means producing data with a high signal-to-noise ratio; in other words, it means filtering out information that isn't important to investors (the noise), as sketched after this list.

• Comprehensive yet accurate. A superior crawler will collect unstructured data and structure it without missing key details (such as the dates and full titles of a blog post). This is crucial, since inaccurate details plugged into trading algorithms can degrade the signal generated, costing investors millions.

• Open and transparent. Crawling, or scraping, the web involves a number of privacy and safety issues. A data monitoring service that is legal and compliant with a site's Terms of Service (TOS), which may or may not allow web crawling or scraping, is essential.
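To make the signal-to-noise point concrete, here is a minimal sketch, assuming a hypothetical article schema, keyword list and ticker watchlist (this is not Webhose's actual pipeline), of filtering a stream of crawled articles down to investor-relevant items:

```python
# Minimal sketch: keep only crawled articles likely to matter to investors.
# The article fields ("title", "text"), keyword list and tickers are
# illustrative assumptions, not Webhose's actual schema or pipeline.

RELEVANT_TERMS = {"earnings", "acquisition", "lawsuit", "guidance", "recall"}
WATCHLIST = {"pfe", "gild"}  # hypothetical tickers of interest (lowercased)

def is_signal(article: dict) -> bool:
    """True if the article mentions a watched ticker AND a market-moving
    term; everything else is treated as noise and filtered out."""
    text = (article.get("title", "") + " " + article.get("text", "")).lower()
    mentions_ticker = any(ticker in text for ticker in WATCHLIST)
    mentions_event = any(term in text for term in RELEVANT_TERMS)
    return mentions_ticker and mentions_event

articles = [
    {"title": "PFE beats earnings estimates", "text": "..."},
    {"title": "Celebrity gossip roundup", "text": "..."},
]
signal = [a for a in articles if is_signal(a)]
print(f"kept {len(signal)} of {len(articles)} articles")  # kept 1 of 2
```

A production relevance model would be far richer (entity linking, deduplication, source weighting), but the filtering principle is the same.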
USING PATTERN MATCHING AND HEURISTICS TO STRUCTURE THE WEB
The ability of web crawlers to structure web content is critical. Before a system can analyze content, it must first know where the content is. It must be able to map fields and their values. Fields like title, post text, comments, dates and author names must be extracted and tagged.

It's easy to write specific crawlers for a small number of sites that extract those pre-defined fields from those sources. But when you need data from millions of sources you haven't previously crawled, you need an advanced crawler. Webhose's crawlers use sophisticated pattern-matching heuristics to match patterns on newly discovered websites. They leverage knowledge about the structure of previously crawled sites and apply it to sites they have never crawled before. This ability enables structuring the web at scale, as the sketch below illustrates.
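As a rough illustration of the idea, and not Webhose's actual heuristics, the sketch below scores candidate HTML elements for the "title" field using structural cues that tend to generalize across sites; the tag names, class-name cues and weights are all assumptions for illustration:

```python
# Rough illustration of heuristic field mapping (not Webhose's actual code):
# score candidate HTML elements for the "title" field using structural cues
# that generalize across sites. Cues and weights are illustrative assumptions.
import re

def title_score(tag: str, attrs: dict, text: str) -> float:
    score = 0.0
    if tag == "h1":
        score += 3.0                      # titles are very often the lone <h1>
    if "title" in attrs.get("class", "").lower():
        score += 2.0                      # class names like "post-title"
    if 10 <= len(text) <= 120:
        score += 1.0                      # plausible headline length
    if re.search(r"\d{4}-\d{2}-\d{2}", text):
        score -= 2.0                      # looks like a date, not a title
    return score

candidates = [
    ("h1", {"class": "post-title"}, "Pfizer shares climb after earnings"),
    ("div", {"class": "byline"}, "2019-07-30"),
    ("h2", {"class": ""}, "Related stories"),
]
best = max(candidates, key=lambda c: title_score(*c))
print("extracted title:", best[2])
```

Because the cues are structural rather than site-specific, the same scoring can be applied to a page the crawler has never seen before.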
IS WEB SCRAPING LEGAL?
Yes, unless you use it unethically. At Webhose we make sure to crawl only publicly available content. The crawler is very efficient and tries to minimize the resources it takes from any site it crawls.

Keep in mind that an advanced crawler crawls millions of sites. Since it's impossible to contact each site owner and ask for permission, the crawler needs to make itself easy to identify and to give sites the choice to block it in case they don't wish it to access their content.
These are the steps we take to do this (a minimal sketch of the directive check follows the list):

• The crawler automatically lets the websites it crawls identify it by using a user agent: omgili/0.5 +https://omgili.com
• Fixed IPs are used for easier identification in server logs
• The crawler follows standard crawling directives such as robots.txt and HTML meta tags
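For illustration only, here is a minimal sketch of the robots.txt check using Python's standard library; the user agent string is the one documented above, while example.com is a placeholder:

```python
# Minimal sketch: honor robots.txt before fetching, using Python's stdlib.
# The user agent matches the one documented above; the target URLs are
# placeholders for illustration.
from urllib import robotparser

USER_AGENT = "omgili/0.5 +https://omgili.com"

def allowed_to_fetch(page_url: str, robots_url: str) -> bool:
    """Return True only if the site's robots.txt permits this crawler."""
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()  # downloads and parses the site's robots.txt
    return rp.can_fetch(USER_AGENT, page_url)

if allowed_to_fetch("https://example.com/blog/post-1",
                    "https://example.com/robots.txt"):
    print("fetch permitted by robots.txt")
else:
    print("blocked by robots.txt; skipping")
```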
By allowing the crawler to scrape a website, the site owner benefits from:

• Being connected to hundreds of apps, services and marketplaces, which can then link back, potentially sending thousands of relevant visitors to the web property.

• Increased advertising appeal: if the site runs advertisements, being noticed and linked to by these services can increase the site's attractiveness to advertisers and the revenue the site generates.

• Reduced server load: instead of being crawled by hundreds of inefficient crawlers downloading the same data repeatedly, companies tap into an already-crawled repository to download the data in a machine-readable format.
BUT WHAT ABOUT COPYRIGHT?
The content we crawl is converted into an M2M (machine-to-machine) data derivative in a machine-readable format (JSON or XML). It is not presented to humans; it is used as a technological method to transfer content from point A to point B, just as it would be if you wrote your own crawler. It is, of course, forbidden to take the crawled data and present it as your own, on your own website or resource. A sketch of what such a derivative record might look like follows.
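As an illustration of such a derivative, the sketch below builds and serializes one hypothetical post record; the field names are assumptions, not Webhose's actual output schema:

```python
# Illustrative sketch of an M2M data derivative: a crawled post reduced to a
# machine-readable JSON record. Field names are assumptions for illustration,
# not Webhose's actual output schema.
import json

record = {
    "url": "https://example.com/blog/post-1",
    "site_type": "blogs",
    "title": "Pfizer shares climb after earnings",
    "author": "Jane Doe",
    "published": "2019-07-30T09:15:00Z",
    "language": "english",
    "text": "Full extracted body text of the post...",
}

# Serialized form, ready for machine consumption rather than human display.
print(json.dumps(record, indent=2))
```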
SECTION 3: CASE STUDIES
Let's take a deep dive into some examples of how Webhose's web data can be leveraged with software such as TextReveal, SESAMm's dedicated Natural Language Processing (NLP) solution for investing.
The goal of this NLP technology is to draw meaningful insights from several types of documents (social media, news, forums, etc.) on a daily basis and correlate these insights with market movements. The software aggregates and processes textual content in various forms in order to capture comprehensive information. With new insights and deeper information, IMs can enhance their investment decision process. By combining several different analytical approaches to a single company, we will show how web data and NLP together can provide predictive insights for IMs. A minimal sketch of the sentiment-to-market correlation step appears below.
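The following sketch assumes illustrative sentiment scores and daily returns rather than SESAMm's actual data or implementation:

```python
# Minimal sketch (not SESAMm's implementation): aggregate article-level
# sentiment into a daily score and pair it with the NEXT day's stock return,
# the raw material for a sentiment-vs-market correlation. All values are
# illustrative assumptions.
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean

articles = [  # (date, sentiment in [-1, 1]) produced by an NLP scoring step
    (date(2019, 7, 29), -0.6), (date(2019, 7, 29), -0.4),
    (date(2019, 7, 30), -0.7), (date(2019, 7, 31), 0.2),
]
returns = {  # hypothetical daily returns of the stock
    date(2019, 7, 30): -0.021, date(2019, 7, 31): -0.013,
    date(2019, 8, 1): 0.004,
}

# 1) Aggregate per-article sentiment into one score per day.
by_day = defaultdict(list)
for day, score in articles:
    by_day[day].append(score)
daily = {day: mean(scores) for day, scores in by_day.items()}

# 2) Pair each day's sentiment with the next day's return (a lead-lag view);
#    a correlation or regression step would consume these pairs.
pairs = [(s, returns[day + timedelta(days=1)])
         for day, s in sorted(daily.items())
         if day + timedelta(days=1) in returns]
print(pairs)  # [(-0.5, -0.021), (-0.7, -0.013), (0.2, 0.004)]
```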
PFIZER – GENERAL ANALYSIS
In the general analysis below, the goal is to forecast a global price trend in Pfizer stock. We do this by gathering data about Pfizer to take a look at global sentiment. This includes articles not only about the Pfizer company, but also about its CEO, the names of Pfizer-specific drugs and related underlying trends. In this case, the analysis was performed on 75 million articles from blogs, news sites and discussions, mostly in English but also in German, Spanish, French and Dutch.

Using advanced NLP technology, the resulting graph reveals interesting correlations between market data and volumes leading to market movements. The most significant event we see in the general analysis is a global negative sentiment trend in July of 2019 (shown in the blue line on the top) followed by a drop in the market (shown in the red and green bar charts on the bottom). The market later turned bullish, meaning that it began trading upwards.
Figure 1: Pfizer volume, market prices and sentiment data from SESAMm's Markets API dashboard. (Annotations: the past negative sentiment trend ended right before the market drop; the current sentiment trend is positive.)
Figure 2: Volume of articles & messages identified about Pfizer based on a small data sample
Most of the additional spikes in the market can be traced to events related to the company, including market events, major lawsuits and Environmental, Social and Governance (ESG) events.
PFIZER – ESG ANALYSIS
Figure 3: ESG scores based on data sample with specific filters from SESAMm’s Markets API dashboard
Let's take a closer look at the Environmental, Social and Governance (ESG) risk of the company, which reflects the sustainability and social impact of a company for investors.
The NLP technology also automatically detects concepts related to pollution, water management, energy consumption and waste, and combines their mention volumes with sentiment to create a global Environmental score for the company, as roughly sketched below. This analysis shows that Pfizer has a fairly good environmental reputation; it is not mentioned in connection with environmental problems. Investors interested in creating a portfolio with a positive environmental impact or low environmental risk should consider Pfizer.
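Here is a minimal sketch of that concept-based scoring, assuming a hypothetical concept list, toy documents and a simple averaging formula (this is not SESAMm's TextReveal methodology):

```python
# Minimal sketch (not SESAMm's TextReveal): derive a crude Environmental
# score from concept mentions and sentiment. The concept list, documents and
# scoring formula are illustrative assumptions.
ENV_CONCEPTS = {"pollution", "water management", "energy consumption", "waste"}

docs = [  # (text, sentiment in [-1, 1]) from upstream NLP steps
    ("Pfizer cuts energy consumption at its plants", 0.5),
    ("Report praises Pfizer waste reduction program", 0.7),
    ("Quarterly earnings call transcript", 0.1),
]

def environmental_score(documents) -> float:
    """Average sentiment over documents that mention an E-related concept;
    returns 0.0 when the company is never mentioned alongside such concepts."""
    hits = [sent for text, sent in documents
            if any(concept in text.lower() for concept in ENV_CONCEPTS)]
    return sum(hits) / len(hits) if hits else 0.0

print(f"Environmental score: {environmental_score(docs):+.2f}")  # +0.60
```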
Social and Governance scores, on the other hand, are a different story. The Social score detected significant numbers of references linking the company to discrimination, strikes and blackmail. The Governance score found references to the company alongside negative topics such as conflict, anti-trust and fraud in the volume of articles and messages.
As a result of this analysis, investors who wish to distance themselves from stocks with a bad reputation for legal action or treatment of employees should not invest in Pfizer stock. Let's look at one additional type of predictive insight: Product Analysis.
PFIZER – PRODUCTS ANALYSIS
Figure 4: Product analysis scores based on a data sample with specific filters from SESAMm's Markets API dashboard
A Product Analysis entails the automatic detection of Pfizer and its products. Let's look at the analysis of a number of its latest products, including Lyrica and Xeljanz. This includes the number of mentions and their correlation with either positive or negative sentiment toward these products.
For instance, data can be collected from American consumer blogs to identify whether or not these drugs have side effects; many mentions of bad side effects bode poorly for a drug. We can see that both Lyrica and Xeljanz show high growth in the volume of product mentions, and there is a strong likelihood that these mentions are correlated with high growth. These three analyses of Pfizer (General, ESG and Product) are all done by feeding Webhose's high-quality data into SESAMm's advanced NLP solution, TextReveal. A minimal sketch of the mention-growth calculation appears below.
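The following sketch uses invented monthly counts rather than real Lyrica or Xeljanz data:

```python
# Minimal sketch (not TextReveal): track month-over-month growth in product
# mentions as a crude product-interest signal. Counts are illustrative
# assumptions, not real Lyrica/Xeljanz data.
mention_counts = {  # product -> monthly mention counts from crawled web data
    "Lyrica":  [1200, 1450, 1900, 2600],
    "Xeljanz": [800, 950, 1400, 2100],
}

def growth_rates(counts):
    """Month-over-month growth rate for each consecutive pair of months."""
    return [(later - earlier) / earlier
            for earlier, later in zip(counts, counts[1:])]

for product, counts in mention_counts.items():
    rates = growth_rates(counts)
    print(product, ["{:+.0%}".format(r) for r in rates])
```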
A LOOK AHEAD AT ALTERNATIVE WEB DATA
Although it is being rapidly adopted across all types of investment management institutions, alternative data is still an emerging trend, with many new applications yet to be revealed. The sheer amount of web data opens up exciting new types of analysis for investment management institutions, such as product, sentiment and ESG analysis. At the same time, however, it presents specific challenges. Along with the biggest challenge of organizing and identifying important data within the massive amount of data on the web, collecting and crawling alternative web data must also be legal and safe. Inaccurate data can directly distort investment models or the trading signals they generate, resulting in losses of billions of dollars. Unscrupulous crawling practices can destroy a brand's reputation in an instant.
Along with being able to verify the accuracy and relevancy of data, services crawling data on the web must ensure compliance with the Terms of Service (TOS) of the different websites, especially those of corporate giants such as Twitter and Facebook. A web data service that offers comprehensive and safe coverage, the ability to scale and full transparency will be able to withstand the inevitable changes in a fast-paced digital world that is radically transforming the field of investment.
Power Investments with Web Data
Want to learn more about how to crawl the web for alternative data?
Talk to an Expert
About Webhose
Webhose is the leading data collection provider turning unstructured web content into machine-readable data feeds. It delivers comprehensive, up-to-the-minute coverage of the open web that includes millions of news articles and blog posts in addition to vast coverage of online discussions, forums and review sites in all languages. Webhose also offers a dark web monitoring and data breach detection service that provides coverage of dark networks and includes millions of sites, files, marketplaces and messaging platforms crawled daily.
About SESAMm
SESAMm is an innovative fintech company specializing in big data and artificial
intelligence for investment. Its team builds analytics and investment signals by
analyzing billions of web articles and messages using natural language processing and
machine learning. With its NLP platform and quantitative data science platform,
SESAMm addresses the entire value chain of alpha research. SESAMm's 40-person team in Paris, New York, Metz and Luxembourg works with major hedge fund, bank and asset management clients around the world for both fundamental and quantitative use cases.