A program that takes a subreddit and search query as input, and returns the positivity & negativity of the top 30 relevant Reddit posts from the past year.
Code on GitHub:
Impact on Apex Fund:
The Reddit SA program addresses Phase 4 of Apex Fund’s Investment Framework:
Apex Fund’s Fundamental Analysts will use this program to discover significant Reddit posts for prospective securities. They can open the URLs of the posts with the highest positivity or negativity and apply their own judgment to the discussion for more insight. The Wall Street Journal recently wrote an article on the importance of tracking the sentiment of retail investors on social media networks like Reddit, and this program facilitates that.
Technical Challenges & Solutions
As with any Data Science project, the bulk of technical challenges came from fetching and cleaning the raw Reddit comment data to prepare it for the VADER Sentiment Analysis framework.
Fetching the Reddit Post Data
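A minimal sketch of this step, assuming the posts come from PRAW's subreddit search, sorted by relevance and restricted to the past year. The function name `fetch_relevant_posts` is hypothetical, and `reddit` is expected to be an authenticated `praw.Reddit` instance created elsewhere.

```python
def fetch_relevant_posts(reddit, subreddit_name, query, limit=30):
    """Return up to `limit` posts matching `query`, most relevant first.

    Assumption: `reddit` is an authenticated praw.Reddit instance.
    """
    subreddit = reddit.subreddit(subreddit_name)
    # sort="relevance" and time_filter="year" mirror the "top 30 relevant
    # posts from the past year" behavior described at the top of this article.
    results = subreddit.search(
        query, sort="relevance", time_filter="year", limit=limit
    )
    # The search result is lazy; materialize it so it can be reused.
    return list(results)
```

A call might look like `fetch_relevant_posts(reddit, "stocks", "TSLA")`, where the subreddit and query are placeholder inputs.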
Extracting Raw Comment Data from Reddit Posts
After fetching the 30 Reddit Posts that best fit the user’s query, we now have to extract the raw comment data on each Reddit post using the PRAW API.
The Submission object (the abstraction used by the PRAW API for a Reddit post) is pretty unintuitive when it comes to fetching comment data.
The PRAW API returns the comment section of a Submission as a CommentForest object, an unintuitive abstraction that is essentially a list of Comment and MoreComments objects.
Comment objects are what we would like to fetch for our program.
MoreComments objects are simply a pointer to more Comment objects.
CommentForest objects come with ~400 Comment objects already loaded. If we want more Comment objects from the Reddit post, we have to expand the MoreComments objects.
Another complication is that MoreComments objects in the initial CommentForest returned for a Reddit post only represent the top-layer comments. There is a more complex mechanism for retrieving data on inner-layer comments (the comments that are replies to top-layer comments).
This API design sparked a discussion in the Apex Fund Quant Team about what data we really want from the comment section of Reddit posts.
After running some time tests, we found that retrieving inner-layer comment data requires too many API calls, which significantly slows down the overall program. We also noticed that top-layer comments usually have the most upvotes and engagement, so we decided to ignore inner-layer comments for now given our computing constraints (we only have access to the free GPU provided through Google Colab).
We also had to decide how many of the top-layer comments to extract for each Reddit post, since only ~400 Comment objects are loaded in the returned CommentForest object. We did more time tests and found that expanding MoreComments objects requires lots of extra API calls that slow down the entire program. Luckily, we manually inspected the default 400 Comment objects (by comparing the comments to an example post) and realized that they are the most-upvoted of all the top-layer comments. With this observation, we concluded that we still capture most of the top-layer comment sentiment by only using the default 400 Comment objects for each Reddit post.
(We still have an expand_top_layer flag that can be set to True if we acquire more computing power and can afford to expand all MoreComments objects.)
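The selection logic above can be sketched as follows. The function name `top_layer_comments` is hypothetical, and telling Comment objects apart from MoreComments placeholders by checking for a `body` attribute is an assumption made to keep the sketch free of imports. `replace_more` is PRAW's method for resolving MoreComments placeholders.

```python
def top_layer_comments(forest, expand_top_layer=False):
    """Return the top-layer Comment objects from a CommentForest.

    Assumption: `forest` is `submission.comments` from PRAW, or any
    iterable that mixes Comment-like and MoreComments-like objects.
    """
    if expand_top_layer:
        # Resolve every MoreComments placeholder in the forest.
        # Each resolved placeholder costs additional API requests.
        forest.replace_more(limit=None)
    # Iterating the forest yields only top-layer items. Placeholders
    # have no comment text, so they are skipped here (assumed check).
    return [item for item in forest if hasattr(item, "body")]
```

With `expand_top_layer=False`, no extra requests are made and only the comments that arrived with the post are used.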
Cleaning the Raw Comment Data
For each of the 30 Reddit posts, we extracted the following information from each of the default 400 Comment objects:
- Raw Comment Text
- Number of Upvotes
- Total Reddit Coin Value of All Comment Awards
The raw comment text is what needs to be cleaned before we can run any sentiment analysis. The Number of Upvotes and Total Award Value are used later to weight the sentiment of each comment.
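As an illustration, the three fields could be collected into one record per comment. `CommentRecord` and `to_record` are hypothetical names. `body` and `score` are the usual PRAW Comment attributes; reading awards from an `all_awardings` list of dictionaries with `coin_price` and `count` keys is an assumption about the Reddit payload and may not hold for every API version.

```python
from dataclasses import dataclass


@dataclass
class CommentRecord:
    text: str         # raw comment text, not yet cleaned
    upvotes: int      # comment score
    award_value: int  # total coin value of all awards on the comment


def to_record(comment):
    """Build a CommentRecord from a PRAW-style Comment object."""
    # Assumed award layout: [{"coin_price": 500, "count": 2}, ...].
    awards = getattr(comment, "all_awardings", None) or []
    award_value = sum(
        award.get("coin_price", 0) * award.get("count", 1) for award in awards
    )
    return CommentRecord(
        text=comment.body, upvotes=comment.score, award_value=award_value
    )
```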
For each comment’s raw text, we converted the text into a list of words (by splitting on whitespace) and applied the following cleaning procedures:
- Removing emojis
- Tokenizing words
- Removing URL links
- Removing stopwords (words that don’t carry any meaning, like “and” and “the”)
- Lemmatizing words (converting all words to the root form)
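The cleaning steps listed above can be sketched with the standard library alone. This is an illustrative version, not the project's actual code: the regular expressions, the tiny `STOPWORDS` set, and the `clean_comment` name are all assumptions, and lemmatization is left as a pluggable function (for example, NLTK's `WordNetLemmatizer().lemmatize` could be passed in).

```python
import re

# Assumed patterns; a real pipeline would likely use fuller lists.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF]"
)
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")
STOPWORDS = {"a", "an", "and", "the", "is", "are", "of", "to", "in", "it"}


def clean_comment(text, lemmatize=lambda word: word):
    """Return the cleaned list of words for one raw comment."""
    text = URL_RE.sub(" ", text)    # remove URL links
    text = EMOJI_RE.sub(" ", text)  # remove emojis
    tokens = TOKEN_RE.findall(text.lower())  # tokenize into lowercase words
    # Drop stopwords, then reduce each remaining word to its root form.
    return [lemmatize(tok) for tok in tokens if tok not in STOPWORDS]
```

URLs are removed before tokenizing so that fragments of a link do not survive as words.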
Defining “Positivity” and “Negativity” Scores
After all of the data extraction and cleaning, we had to tackle the core data science problem in this program: quantifying positive and negative sentiment.
The first step was to run all the cleaned words from each comment through the VADER Sentiment Analyzer, which returns a Compound Polarity Score for each word. Compound Polarity Score (CPS) is an estimate of the intrinsic positivity/negativity of a word; the scores range from -1 (most negative) to 1 (most positive). Most words have a CPS of 0 (meaning neutral).
These Compound Polarity Scores do not consider the context of the word. VADER is a lexicon- and rule-based analyzer: its lexicon assigns each word a sentiment rating collected from human raters, which approximates the intrinsic sentiment of that English word.
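A sketch of the per-word scoring, assuming the `vaderSentiment` package is installed. `make_vader_scorer` and `score_words` are hypothetical names; `SentimentIntensityAnalyzer.polarity_scores` is the library call that returns a dictionary containing the `compound` score.

```python
def make_vader_scorer():
    """Return a function mapping a word to its VADER compound score."""
    # Imported lazily so the rest of this sketch works without the package.
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    return lambda word: analyzer.polarity_scores(word)["compound"]


def score_words(words, score_word):
    """Pair every cleaned word with its Compound Polarity Score."""
    return [(word, score_word(word)) for word in words]
```

In use, `score_words(cleaned_words, make_vader_scorer())` would return one `(word, score)` pair per cleaned word, with scores between -1 and 1.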
To add context to these scores, we weight each word’s Compound Polarity Score by the Number of Upvotes and Total Award Coin Value of the comment that the word belongs to.
Before applying the weight, we need to normalize Number of Upvotes and Total Award Coin Value by dividing by the global max of these two fields across all 30 posts.
To calculate our overall Positivity Score for each post, we simply add up all the weighted Compound Polarity Scores in the post that are positive. To calculate the overall Negativity Score for each post, we do the same for all negative scores.
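The weighting and aggregation can be sketched as below. How the two normalized weights are combined is not specified here, so adding them is an assumption, as are the names `ScoredComment` and `post_scores` and the choice to clamp negative upvote counts to zero.

```python
from dataclasses import dataclass


@dataclass
class ScoredComment:
    word_scores: list  # Compound Polarity Score of each cleaned word
    upvotes: int
    award_value: int


def post_scores(posts):
    """Map post id -> (positivity, negativity).

    `posts` maps each post id to its list of ScoredComment objects.
    """
    comments = [c for group in posts.values() for c in group]
    # Global maxima across all posts, so scores are comparable between
    # posts. `or 1` avoids dividing by zero when a field is all zeros.
    max_upvotes = max((max(c.upvotes, 0) for c in comments), default=0) or 1
    max_awards = max((c.award_value for c in comments), default=0) or 1

    results = {}
    for post_id, group in posts.items():
        positivity = negativity = 0.0
        for c in group:
            # Assumed combination: sum of the two normalized weights.
            weight = max(c.upvotes, 0) / max_upvotes + c.award_value / max_awards
            for score in c.word_scores:
                if score > 0:
                    positivity += weight * score
                elif score < 0:
                    negativity += weight * score
        results[post_id] = (positivity, negativity)
    return results
```

Because both maxima are taken over every post rather than per post, a Positivity Score of 1.2 on one post means the same thing as 1.2 on another.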
This project forced me to consider important aspects of software development and data science.
On the software development side, I learned to always be cognizant of my end users (Apex Fund’s fundamental analysts). I focused on making this an on-demand tool that can be run by my users within 2–3 minutes. This constrained how much data I allowed the program to analyze due to my limited computing power.
I also had to make sure my program is easily usable by non-technical users, since the fundamental analysts are Finance majors who don’t have much coding experience. As a result, I hid away most of the technical code and only exposed an intuitive code block where users enter inputs like the subreddit and search query.
On the data science side, I learned a lot about picking the right data. The time constraint of 2–3 minutes for the entire program forced me to pick only the highest-leverage data, which happened to be the default ~400 Comment objects that have the most upvotes and engagement in each Reddit post. The data picking process also taught me the tradeoffs between complexity and accuracy for a data science project. For example, we could’ve made this program more accurate by including inner-layer comment data, but this adds the complexity of trying to reason about the sentiment of an inner-layer comment in relation to its parent comment(s).
I also learned about the importance of delivering final results that are actually actionable. Our original approach for positivity and negativity scores was not actionable because we only normalized within each post, and did not do a global normalization like the approach explained in this article. This made it impossible for a user to compare the positivity/negativity scores of one post to another. With the new approach, the users can save time by only investigating Reddit posts with high positivity/negativity scores in the result chart because they know that the scores can be compared.
Thank you for reading!
If you are interested in joining Apex Fund as a Junior Quantitative Analyst, email me your resume at email@example.com! (Must be a Freshman, Sophomore, or Junior attending the University of Maryland — College Park.)