Python: Network Analysis of Reddit Hyperlinks
Network Analysis of Reddit Hyperlinks Dataset
By LEKE, John Oluwagbemiga
Introduction
Reddit is made up of thousands of communities (subreddits) that connect through shared
discussions and links. This report dives into how these subreddits interact, which ones are the most
influential, and how information flows between them. By using network analysis, we break down key
metrics like degree centrality and betweenness to get a clearer picture of how these communities
are connected and what that means for online conversations.
Objective
The objective of this research is to conduct a network analysis on the Reddit Hyperlinks Dataset
from the Stanford Large Network Dataset Collection. The goal is to explore relationships within the
dataset, determine their connectivity, generate meaningful research questions, and apply network
analysis techniques to extract meaningful insights using NetworkX, Pandas, and Matplotlib.
Dataset Overview
The two datasets selected for this analysis are “soc-redditHyperlinks-body” and “soc-redditHyperlinks-title”. They contain records of hyperlinks posted between different
subreddits on Reddit. The key columns include:
• source_subreddit: The subreddit from which the hyperlink originates.
• target_subreddit: The subreddit that receives the hyperlink.
• post_id: A unique identifier for the Reddit post.
• timestamp: The time the hyperlink was posted.
• link_sentiment: The sentiment score of the link.
• properties: Additional link properties.
• year: The year the link was posted.
Methodology
Tools & Libraries Used
The analysis was conducted using Python, with the following key libraries:
• Pandas: For data manipulation and preprocessing.
• NetworkX: For building and analyzing the subreddit network graph.
• Matplotlib & Seaborn: For data visualization.
• NumPy: For numerical operations.
Research Questions
This analysis seeks to answer the following research questions:
• Are there subreddits that only receive links but never link to others?
• What is the average shortest path length between subreddits in the largest connected
component?
• Which subreddits receive the highest number of negative sentiment links?
• What are the top 10 subreddits with the highest closeness centrality in the subreddit
network?
• Which subreddits have high degree centrality but low betweenness centrality?
• How many unique subreddits exist in the dataset, and which ones are linked the most?
Data Preprocessing
Loading and Inspecting Data
The dataset was loaded using Pandas, and the first few rows were displayed for inspection.
The dataset contains 7 columns, each representing different aspects of the Reddit hyperlink network.
Handling Missing Values
A check for missing values was carried out on every column. The result shows that there are no
missing values in the dataset, so no imputation was required.
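The loading and missing-value check described above can be sketched as follows. This is a minimal illustration, not the report's actual code: it assumes the files are tab-separated with upper-case column names (as in the merged dataset), and it reads a few made-up rows from an in-memory buffer so that it runs without the real files.

```python
import io

import pandas as pd

# Tiny inline stand-in for one of the data files (hypothetical rows).
sample_tsv = (
    "SOURCE_SUBREDDIT\tTARGET_SUBREDDIT\tPOST_ID\tTIMESTAMP\tLINK_SENTIMENT\n"
    "askreddit\tpics\tp1\t2014-01-01 10:00:00\t1\n"
    "gaming\tpics\tp2\t2015-03-02 11:30:00\t-1\n"
    "pics\tgaming\tp3\t2016-07-04 09:15:00\t1\n"
)

# For the real data, pass the file path instead of the StringIO buffer.
df = pd.read_csv(io.StringIO(sample_tsv), sep="\t", parse_dates=["TIMESTAMP"])

# Derive the year column from the timestamp.
df["YEAR"] = df["TIMESTAMP"].dt.year

# Inspect the first rows and count missing values per column.
print(df.head())
missing_per_column = df.isna().sum()
print(missing_per_column)
```

A total of zero in `missing_per_column` corresponds to the "no missing values" result reported above.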
Data Merging Process
The datasets have been successfully merged, combining information from both the body and title
datasets. Below are key observations regarding the merge process:
Total Records: The merged dataset contains 858,488 entries with 7 columns.
Column Structure:
• SOURCE_SUBREDDIT: The subreddit that posted the link.
• TARGET_SUBREDDIT: The subreddit that received the link.
• POST_ID: Unique identifier for each post.
• TIMESTAMP: The time the post was made.
• LINK_SENTIMENT: Indicates sentiment (positive, neutral, or negative).
• PROPERTIES: Additional post metadata.
• DATASET: A new column added to distinguish between the two original datasets (body and
title).
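One plausible way to produce such a merged table is a row-wise concatenation, with the DATASET column filled in before stacking. The sketch below assumes that approach (the report does not state which Pandas call was used) and works on two small made-up frames.

```python
import pandas as pd

# Hypothetical miniature versions of the body and title datasets.
body = pd.DataFrame({
    "SOURCE_SUBREDDIT": ["askreddit", "gaming"],
    "TARGET_SUBREDDIT": ["pics", "pics"],
    "LINK_SENTIMENT": [1, -1],
})
title = pd.DataFrame({
    "SOURCE_SUBREDDIT": ["pics"],
    "TARGET_SUBREDDIT": ["gaming"],
    "LINK_SENTIMENT": [1],
})

# Tag each frame with its origin, then stack the rows.
merged = pd.concat(
    [body.assign(DATASET="body"), title.assign(DATASET="title")],
    ignore_index=True,
)

print(merged.shape)
print(merged["DATASET"].value_counts())
```

With this approach the merged row count is simply the sum of the two input row counts.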
Network Analysis and Visualization Process
Dataset Sampling for Improved Connectivity
To ensure a more interconnected network, a subset of 1,000 data points was randomly selected from
the dataset. This sampling process helps reduce noise while retaining meaningful relationships
between subreddits.
Graph Construction Using NetworkX
An undirected graph was created using the NetworkX library, where each subreddit represents a
node, and the edges between them represent interactions based on link sentiment scores. The edges
were added by iterating through the sampled dataset, establishing connections between source
subreddits and target subreddits with assigned weights.
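The sampling and graph construction steps might look like the sketch below. It assumes the sentiment value is used directly as the edge weight and that a fixed random seed is set for reproducibility; both are illustrative choices, and the toy frame stands in for the merged dataset.

```python
import networkx as nx
import pandas as pd

# Hypothetical stand-in for the merged dataset.
links = pd.DataFrame({
    "SOURCE_SUBREDDIT": ["a", "b", "c", "a", "d"],
    "TARGET_SUBREDDIT": ["b", "c", "a", "b", "e"],
    "LINK_SENTIMENT": [1, -1, 1, 1, 1],
})

# Random subset; the report used 1,000 rows, capped here by the toy data size.
sample = links.sample(n=min(1000, len(links)), random_state=42)

# Undirected graph: one node per subreddit, one edge per linked pair.
G = nx.Graph()
for row in sample.itertuples(index=False):
    # A later row for the same pair overwrites the weight (a simplification).
    G.add_edge(row.SOURCE_SUBREDDIT, row.TARGET_SUBREDDIT, weight=row.LINK_SENTIMENT)

print(G.number_of_nodes(), G.number_of_edges())
```

Because `nx.Graph` is a simple undirected graph, repeated links between the same pair of subreddits collapse into a single edge, and link direction is discarded.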
Refining the Network Structure
To enhance the network's clarity and relevance, the largest connected component was extracted,
ensuring that only the most significant interactions were analyzed. Additionally, a k-core
decomposition (k=2) was applied to remove weakly connected nodes, retaining only those with a
minimum degree of connectivity. These steps eliminated noise, improved visualization, and
highlighted key subreddit relationships.
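A minimal sketch of this refinement, using NetworkX's `connected_components` and `k_core` on a toy graph, is shown below. Removing self-loops first is an added precaution, since core computations in NetworkX reject graphs that contain them; whether the original analysis needed this step is not stated.

```python
import networkx as nx

# Toy graph: a triangle (a, b, c) with a pendant node d, plus a separate pair (x, y).
G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("x", "y")])

# Drop self-loops, which k-core computations do not accept.
G.remove_edges_from(list(nx.selfloop_edges(G)))

# Keep only the largest connected component.
largest_cc = max(nx.connected_components(G), key=len)
G_lcc = G.subgraph(largest_cc).copy()

# 2-core: repeatedly remove nodes with fewer than 2 neighbours.
G_core = nx.k_core(G_lcc, k=2)

print(sorted(G_lcc.nodes()))
print(sorted(G_core.nodes()))
```

In the toy graph the pair (x, y) is dropped with the smaller component, and the pendant node d is dropped by the 2-core step.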
Graph Visualization
To make the network easier to interpret, a spring layout was used, which naturally spaces out the
nodes based on their connections. Each node was colored blue, while edges were shown in gray to
create a clear distinction. The final graph, consisting of 91 nodes and a sampled set of edges,
effectively highlights the relationships and interactions between different subreddits.
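The drawing step can be approximated with `spring_layout` and `draw_networkx`, as in the sketch below. The figure size, node size, seed, and output file name are illustrative assumptions, and a small cycle graph stands in for the refined subreddit network.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; remove this line for interactive use

import matplotlib.pyplot as plt
import networkx as nx

# Stand-in for the refined subreddit graph.
G = nx.cycle_graph(6)

# Force-directed layout; the seed makes the picture reproducible.
pos = nx.spring_layout(G, seed=42)

fig, ax = plt.subplots(figsize=(8, 8))
nx.draw_networkx(
    G,
    pos,
    ax=ax,
    node_color="blue",
    edge_color="gray",
    node_size=300,
    font_size=8,
    font_color="white",
)
ax.set_axis_off()
fig.savefig("subreddit_network.png", dpi=150)  # hypothetical output file name
```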
Analysis of Research Questions
1. Are There Subreddits That Only Receive Links but Never Link to Others?
The analysis identified 11,317 subreddits (32.7%) that only receive links but never link out; these are
referred to here as isolated subreddits. The remaining 67.3% also link out to other subreddits.
A sample of isolated subreddits includes violins, shopping, botcraft, and polyamoryr4r. This suggests
that a significant number of subreddits passively receive links without actively engaging in link sharing.
A pie chart visualizes this distribution, with isolated subreddits in red and connected ones in green.
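One straightforward way to find receive-only subreddits is a set difference between the target and source columns. The sketch below takes that approach on made-up data and uses the count of all unique subreddits as the denominator for the share, which is an assumption; the report does not state how its percentage was computed.

```python
import pandas as pd

# Hypothetical link table.
links = pd.DataFrame({
    "SOURCE_SUBREDDIT": ["a", "b", "a"],
    "TARGET_SUBREDDIT": ["b", "c", "d"],
})

sources = set(links["SOURCE_SUBREDDIT"])
targets = set(links["TARGET_SUBREDDIT"])

# Subreddits that appear as a target but never as a source.
receive_only = targets - sources

# Share relative to all subreddits seen in either column (assumed denominator).
all_subreddits = sources | targets
share = len(receive_only) / len(all_subreddits)

print(sorted(receive_only), f"{share:.1%}")
```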
2. What is the average shortest path length between subreddits in the largest
connected component?
The average shortest path length shows how easily subreddits in the largest connected group are
linked. In this case, the average path length is 3.80, meaning it takes about four steps to get from one
subreddit to another. This helps us understand how closely connected the subreddits are and how
information might flow between them.
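For reference, the metric can be computed with `nx.average_shortest_path_length` once the graph is restricted to its largest connected component, since the function requires a connected graph. The toy graph below is only an illustration.

```python
import networkx as nx

# Toy graph: a path 0-1-2-3 plus a separate pair (10, 11).
G = nx.path_graph(4)
G.add_edge(10, 11)

# The metric is only defined on a connected graph, so restrict to the largest component.
largest_cc = max(nx.connected_components(G), key=len)
G_lcc = G.subgraph(largest_cc)

# Mean of the shortest-path distances over all node pairs in the component.
avg_path_length = nx.average_shortest_path_length(G_lcc)

print(round(avg_path_length, 2))
```

On the full network this calculation runs a shortest-path search from every node, so it can be slow for large components.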
3. Which subreddits receive the highest number of negative sentiment links?
Some subreddits receive more negative sentiment links than others. To identify them, we counted
the number of negative links directed at each subreddit and highlighted the top 10 most affected
ones.
The bar chart visualizes these subreddits, with the one on the far left receiving the highest number
of negative links. This analysis helps in understanding which communities are the most criticized or
controversial within the network.
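The counting step can be sketched with a filter followed by `value_counts`. The example assumes negative links are coded as -1 in LINK_SENTIMENT, and the rows are made up.

```python
import pandas as pd

# Hypothetical link table; -1 is assumed to mark a negative link.
links = pd.DataFrame({
    "TARGET_SUBREDDIT": ["a", "a", "b", "c", "a"],
    "LINK_SENTIMENT": [-1, -1, -1, 1, 1],
})

# Keep negative links only, then count them per receiving subreddit.
negative_links = links[links["LINK_SENTIMENT"] == -1]
top_negative = negative_links["TARGET_SUBREDDIT"].value_counts().head(10)

print(top_negative)
```

`value_counts` sorts in descending order, so the first entry is the subreddit with the most negative incoming links.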
4. What are the top 10 subreddits with the highest closeness centrality in the
subreddit network?
Closeness centrality measures how easily a subreddit can reach others in the network.
The top 10 subreddits with the highest closeness centrality are the most efficiently connected,
meaning they can quickly interact with other communities.
The bar chart highlights these subreddits, showing which ones act as central hubs for communication
and information flow.
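A minimal sketch of the ranking, using `nx.closeness_centrality` on a toy star graph in place of the subreddit network:

```python
import networkx as nx

# Star graph: hub node 0 connected to leaves 1..4.
G = nx.star_graph(4)

# Closeness: (number of reachable nodes) / (sum of distances to them).
closeness = nx.closeness_centrality(G)

# Rank nodes from highest to lowest closeness and keep up to ten.
top10 = sorted(closeness.items(), key=lambda item: item[1], reverse=True)[:10]

for node, score in top10:
    print(node, round(score, 3))
```

The hub is one step from every other node, so it ranks first; each leaf needs two steps to reach the other leaves and scores lower.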
5. Which subreddits have high degree centrality but low betweenness centrality?
This analysis identifies subreddits with high degree centrality (many connections) but low
betweenness centrality (not key bridges). After selecting the top 1,000 most connected subreddits and
computing both centrality measures on them, we kept those with degree centrality > 0.01 and
betweenness centrality < 0.001. A scatter plot visualized these relationships.
The results highlight subreddits that are locally influential but not major connectors in the network,
making them ideal for engagement but not for cross-community influence.
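The filter can be sketched as below. The thresholds are the ones quoted in the text; restricting to the top 1,000 nodes by degree via an induced subgraph is an assumption about how the selection was done, and the toy graph is small enough that every node is kept at that stage.

```python
import networkx as nx

# Toy graph: a 4-clique (a, b, c, d) with a chain d - e - f attached.
G = nx.Graph()
G.add_edges_from([
    ("a", "b"), ("a", "c"), ("a", "d"),
    ("b", "c"), ("b", "d"), ("c", "d"),
    ("d", "e"), ("e", "f"),
])

# Restrict to the most connected nodes (at most 1,000).
by_degree = sorted(G.degree, key=lambda pair: pair[1], reverse=True)
top_nodes = [node for node, _ in by_degree[:1000]]
H = G.subgraph(top_nodes)

# Both measures are normalized by default.
degree_c = nx.degree_centrality(H)
betweenness_c = nx.betweenness_centrality(H)

# Many connections, but not lying on shortest paths between others.
selected = [n for n in H if degree_c[n] > 0.01 and betweenness_c[n] < 0.001]

print(sorted(selected))
```

In the toy graph, d and e are excluded because shortest paths between the clique and the chain pass through them.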
6. How many unique subreddits exist in the dataset, and which ones are linked the
most?
This analysis explores the number of unique subreddits and identifies the most linked ones.
The dataset contains 35,776 unique subreddits, determined by combining the source and target
subreddit lists.
A bar chart visualizes the top 20 most linked subreddits, highlighting the most referenced
communities. These highly linked subreddits likely serve as major discussion hubs or information
sources within the network.
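Both figures can be derived directly from the two subreddit columns. The sketch below treats "most linked" as the number of times a subreddit appears as a target, which is an interpretation of the text, and runs on made-up rows.

```python
import pandas as pd

# Hypothetical link table.
links = pd.DataFrame({
    "SOURCE_SUBREDDIT": ["a", "b", "c", "d"],
    "TARGET_SUBREDDIT": ["b", "b", "b", "a"],
})

# Unique subreddits across both the source and target columns.
all_names = pd.concat([links["SOURCE_SUBREDDIT"], links["TARGET_SUBREDDIT"]])
unique_count = all_names.nunique()

# Most linked: how often each subreddit appears as a link target.
most_linked = links["TARGET_SUBREDDIT"].value_counts().head(20)

print(unique_count)
print(most_linked)
```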
Key Findings
1. Isolated Subreddits: Around 32.7% (11,317) of subreddits only receive links but don’t link
out to others. This suggests that many communities are being talked about rather than
actively engaging in conversations.
2. Average Shortest Path Length: In the largest connected part of the network, it takes an
average of 3.8 steps to get from one subreddit to another, suggesting that information can
travel between communities in only a few hops.
3. Negative Sentiment Links: Certain subreddits receive more negative sentiment links, which
could indicate they are more divisive or criticized by others.
4. Closeness Centrality: The top 10 subreddits with the highest closeness centrality are the key
players, making it easier to reach other parts of the network.
5. Degree vs. Betweenness Centrality: Some subreddits have a lot of direct connections (high
degree centrality) but don’t serve as bridges between different groups (low betweenness
centrality). These subreddits are influential in their niche but don’t necessarily connect
broader communities.
6. Unique Subreddits & Most Linked Ones: There are 35,776 unique subreddits, but only a small
fraction gets the most attention, dominating the conversation flow.
Conclusion
This analysis of the Reddit Hyperlinks Dataset reveals key network patterns, including the
presence of isolated subreddits, influential communication hubs, and varying sentiment
distributions. The insights from this analysis can help in understanding online communities, their
connectivity, and the way information spreads across different subreddits.
References
• Stanford Large Network Dataset Collection (https://snap.stanford.edu/data/)
• NetworkX Documentation
• Matplotlib & Pandas Official Docs.