Tutorial: Live Data Collection from Reddit

Communalytic can collect public posts from a given subreddit (including submissions, comments and replies) for up to 7 consecutive days in the Edu version and up to 31 consecutive days in the Pro version. Note: The collection of live posts is available in the Pro version only. The Edu version can only collect historical data.

What is a subreddit, you may ask? Subreddits are online groups or forums on Reddit, each dedicated to one or more specific topics. If this is your first time working with Reddit data, we suggest you watch The Beginner’s Guide to Reddit from Mashable or a somewhat longer introductory video about Reddit from Teknikforce.

Technical Details: Communalytic uses Reddit’s public API to collect data via the PRAW library. As outlined below, Communalytic starts by retrieving the 100 most recent submissions in a given subreddit (Stage 1). The collection then continues by retrieving new submissions until the specified end date at 00:01 UTC (Stage 2). Both Stage 1 and Stage 2 rely on PRAW’s SubredditStream to collect information about submissions. In the last stage (Stage 3), Communalytic goes over all of the submissions collected during Stages 1 and 2 and retrieves comments and replies via PRAW’s Submission.comments call.

Three Stages of Live Data Collection from Reddit

Stage 1: Collect the 100 most recent submissions

Communalytic starts by retrieving the 100 most recent submissions (i.e., thread-starting posts), even if they were posted before the current date.

Stage 2: Collect new submissions until the end date/time (UTC)

The collection continues until the specified end date and time (UTC).

Please note that some submissions in “high-volume” subreddits such as r/all may be missed due to API limitations.

Stage 3: Collect comments and replies

During the final stage, Communalytic attempts to retrieve all comments, and all replies to those comments, for the submissions collected during Stages 1 and 2.

Please note that any comments or replies that have been deleted will not be collected during this stage.
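The three stages above can be sketched with PRAW roughly as follows. This is only an illustration of the general technique, not Communalytic's actual code: the function names, the `now` parameter (included so the loop can be tested without waiting) and the example subreddit are assumptions. The PRAW calls used (`subreddit.stream.submissions`, `comments.replace_more`, `comments.list`) are part of PRAW's public interface.

```python
import time
from datetime import datetime, timezone


def collect_submissions(subreddit, end_time, now=time.time):
    """Stages 1 and 2: read the submission stream until end_time (an aware UTC datetime)."""
    deadline = end_time.timestamp()
    collected = []
    # The stream first yields existing recent submissions (up to 100), then new ones.
    # pause_after=0 makes it yield None after a request that returned nothing new,
    # which gives the loop a chance to check the clock on quiet subreddits.
    for submission in subreddit.stream.submissions(pause_after=0):
        if now() >= deadline:
            break
        if submission is not None:
            collected.append(submission)
    return collected


def collect_comments(submission):
    """Stage 3: return the comments and replies retrievable for one submission."""
    # replace_more(limit=None) expands all "load more comments" placeholders.
    submission.comments.replace_more(limit=None)
    # list() flattens the comment forest into a single list.
    return submission.comments.list()


# Hypothetical usage (requires the praw package and your own API credentials):
#   import praw
#   reddit = praw.Reddit(client_id="...", client_secret="...", user_agent="...")
#   end = datetime(2021, 10, 19, 0, 1, tzinfo=timezone.utc)
#   submissions = collect_submissions(reddit.subreddit("AskReddit"), end)
#   comments = [c for s in submissions for c in collect_comments(s)]
```

Because comments are fetched only after the end date, a thread is read in whatever state it is in at that moment.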

The following steps show how to collect data from Reddit using Communalytic. The procedure for the Edu and Pro versions is similar. The main difference is that the Pro version can collect both live and historical posts from a given subreddit, while the Edu version can only collect historical data.

Step 1

Go to the “My Datasets” page and click on the “Reddit (Live)” button.

Step 2

If you know what subreddit you would like to examine, proceed to Step 4 of this tutorial. Otherwise, click on the “Locate a subreddit” button.

Step 3a

Using the Subreddit Search page, you can locate subreddits that discuss a given topic by using the “Keyword” search bar.

When searching for a subreddit, a space between words is treated as AND. If you would like to search for two keywords separately, put “|” between them.

After typing in your search keyword(s), click the “Search” button.
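To make the keyword syntax in Step 3a concrete, here is a minimal sketch of one way such a query could be interpreted. It assumes that “|” means OR, that matching is case-insensitive, and that a term matches anywhere in the text. None of this is confirmed by Communalytic, and the function names are hypothetical.

```python
def parse_query(query):
    """Split a query into OR-groups, each holding the terms that must all match (AND)."""
    return [group.split() for group in query.split("|") if group.strip()]


def matches(query, text):
    """Return True if any OR-group has all of its terms present in the text."""
    haystack = text.lower()
    return any(
        all(term.lower() in haystack for term in group)
        for group in parse_query(query)
    )


# "covid vaccine | mask" reads as: (covid AND vaccine) OR mask
print(parse_query("covid vaccine | mask"))
print(matches("covid vaccine | mask", "New mask rules announced"))
```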

Step 3b

The Subreddit List page shows a list of public subreddits (with at least 100 comments made in the last 7 days) and sample posts corresponding to the search criteria.

Click “Start Collection on…” (corresponding subreddit) to select the desired subreddit.

Step 4

Before starting your data collection, name your dataset, then enter the name of the selected subreddit and the end date of data collection. You can collect data for up to 7 consecutive days in the Edu version and up to 31 consecutive days in the Pro version from the current date.

You can check the box “Email me once job completes” to receive an email notification. (Note: the Pro version doesn’t have this checkbox, since it will send an email notification automatically.)

As a final step on this page, click the “Start Collection” button.

Data collection time will vary by subreddit. For subreddits with many comments and replies, collection may take up to several hours after the end date to finish.
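The day limits mentioned in Step 4 amount to simple date arithmetic. The sketch below assumes the window is counted as N calendar days after the start date. How Communalytic counts the days exactly is not stated here, and the names are hypothetical.

```python
from datetime import date, timedelta

# Maximum collection window in days, per version (from Step 4).
MAX_DAYS = {"edu": 7, "pro": 31}


def latest_end_date(start, edition):
    """Latest end date allowed for a collection starting on `start`."""
    return start + timedelta(days=MAX_DAYS[edition])


def is_valid_end_date(start, end, edition):
    """True if `end` is after `start` and within the version's limit."""
    return start < end <= latest_end_date(start, edition)


print(latest_end_date(date(2021, 10, 12), "edu"))
print(is_valid_end_date(date(2021, 10, 12), date(2021, 11, 20), "pro"))
```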

Step 5

To confirm that data collection is underway, you should be able to see your new dataset listed on the “My Datasets” page.

When your data collection is complete, it will say “Complete” under Status.

Data Structure

The table below shows the data points available in the dataset, as provided by the Reddit API:

| Field | Description | Sample values |
|---|---|---|
| id | Unique identifier for the post | Submission: q6x0lw; Comment: hgf14zp; Reply: hgkadlb |
| date | The date when the post was created/updated | Submission: 10/12/2021; Comment: 10/12/2021; Reply: 10/13/2021 |
| author | Poster’s unique username | 916farmer_AskMyMom_None |
| title | Submission title | Submission: Saw this genius on the road today. Wouldn’t it be a shame if their email got overwhelmed with vax fax. |
| text | The main body of the post | Wait wait wait. So they don’t want a vaccine card or requirements because they don’t want the government “tracking them”: but will plaster personal information on their car windows? I mean, I watched Donald Duck and Bugs Bunny use reverse psychology on each other, is there any way we can do that with these guys? |
| comment_on | Unique identifier of the parent post. Note: only available for Comment- and Reply-type posts | Comment: q6x0lw; Reply: hgexsg3 |
| type | Post type. Possible values are: Submission = a thread-starting post, Comment = a reply to a submission, Reply = a reply to a comment or to another reply | Submission: Submission; Comment: Comment; Reply: Reply |
| score | The overall engagement score assigned to the post based on the total number of upvotes and downvotes | 11934-2 |
| upvote_ratio | The ratio of upvotes out of all votes received by the post. Note: only provided for Submission-type posts | Submission: 0.94 |
| url | URL shared in the submission, if applicable. Note: only provided for Submission-type posts | Submission: https://i.redd.it/sw3evc3sf3t71.jpg |
| permalink | A persistent URL to the post | Submission: https://www.reddit.com/r/…; Comment: https://www.reddit.com/r/…; Reply: https://www.reddit.com/r/… |
| user_link_karma | User’s link-based karma score | 816244841 |
| user_comment_karma | User’s comment-based karma score | 238653150 |
| user_flair | User’s subreddit-specific “flair” (tag or category). Note: Many subreddits/users don’t use this feature. Also, in some subreddits, only the moderators can assign a flair to a user/post. | Submission: None; Comment: None; Reply: None |
| submission_flair | “Flair” (tag or category) assigned to the submission. Note: In some subreddits, only the moderators can assign a flair to a post. | Submission: None |
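As an illustration of how the id, type and comment_on fields fit together, the sketch below rebuilds the parent-to-children structure of a thread from a list of rows. It assumes each row is available as a dictionary keyed by the field names above (for example, from Python's csv.DictReader over an exported file, which is an assumption about the export format). The ids used are made up for the example.

```python
from collections import Counter, defaultdict


def build_thread_index(rows):
    """Map each parent post id to the ids of the posts that respond to it."""
    children = defaultdict(list)
    for row in rows:
        parent = row.get("comment_on")
        if parent:  # submissions have no parent, so comment_on is empty
            children[parent].append(row["id"])
    return dict(children)


# Hypothetical rows: one submission, two comments, one reply to the first comment.
rows = [
    {"id": "s1", "type": "Submission", "comment_on": ""},
    {"id": "c1", "type": "Comment", "comment_on": "s1"},
    {"id": "c2", "type": "Comment", "comment_on": "s1"},
    {"id": "r1", "type": "Reply", "comment_on": "c1"},
]

index = build_thread_index(rows)
type_counts = Counter(row["type"] for row in rows)
print(index)
print(type_counts)
```

If a parent comment was deleted before Stage 3, its id may appear in comment_on without a matching row, so a lookup of the parent row should allow for a missing entry.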