Tutorial: Live Data Collection from Reddit

Communalytic can collect public posts from a given subreddit (including submissions, comments and replies) for up to 7 consecutive days in the Edu version and up to 31 consecutive days in the Pro version, starting from the date on which you initiate the data collection. Note: The collection of historical posts is available in the Pro version only.

What is a subreddit? Subreddits are online groups (forums) on Reddit, each dedicated to one or more specific topics. If this is your first time working with Reddit data, we suggest you watch The Beginner’s Guide to Reddit from Mashable or a slightly longer introductory video about Reddit from Teknikforce.

Technical Details: Communalytic uses Reddit’s public API to collect data via the PRAW library. As outlined in the three-stage summary below, Communalytic starts by retrieving the 100 most recent submissions in a given subreddit (Stage 1). The collection then continues by retrieving new submissions until the specified end date at 00:01 UTC (Stage 2). Both Stage 1 and Stage 2 rely on PRAW’s SubredditStream to collect information about submissions. In the last stage (Stage 3), Communalytic goes over all of the submissions collected during Stages 1 and 2 and retrieves comments and replies via PRAW’s Submission.comments call.

Three Stages of Live Data Collection from Reddit

Stage 1: Collect the 100 most recent submissions

Communalytic starts by retrieving the 100 most recent submissions (i.e., thread-starting posts), even if they were posted before the current date.

Stage 2: Collect new submissions until the end date/time (UTC)

The collection continues until the specified end date and time (UTC).

Please note that some submissions in “high volume” subreddits such as r/all may be missed due to API limitations.

Stage 3: Collect comments and replies

During the final stage, Communalytic attempts to retrieve all comments, and all replies to those comments, for the submissions collected during Stages 1 and 2.

Please note that any comments or replies that have been deleted will not be collected during this stage.
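
For readers who want to see how the three stages map onto PRAW, here is a minimal sketch. It is not Communalytic's actual implementation: the function names `collect_submissions` and `collect_comments`, the fields kept for each comment, and the use of `pause_after=0` to check the end time between requests are all assumptions made for illustration.

```python
from datetime import datetime, timezone


def collect_submissions(subreddit, end_time):
    """Stages 1 and 2: read the subreddit stream until end_time (UTC).

    By default, PRAW's submission stream first yields up to 100 existing
    submissions (Stage 1) and then new ones as they appear (Stage 2).
    """
    collected = {}
    # pause_after=0 makes the stream yield None whenever a request returns
    # no new items, so the loop can re-check the end time on quiet subreddits.
    for submission in subreddit.stream.submissions(pause_after=0):
        if datetime.now(timezone.utc) >= end_time:
            break
        if submission is not None:
            # Key by ID so a submission is stored only once.
            collected[submission.id] = submission
    return list(collected.values())


def collect_comments(submissions):
    """Stage 3: fetch comments and replies for each collected submission."""
    rows = []
    for submission in submissions:
        # Expand every "load more comments" placeholder; on large threads
        # this can require many API requests.
        submission.comments.replace_more(limit=None)
        # list() flattens the comment tree, so replies are included.
        for comment in submission.comments.list():
            rows.append(
                {
                    "submission_id": submission.id,
                    "comment_id": comment.id,
                    "parent_id": comment.parent_id,
                    "body": comment.body,
                }
            )
    return rows
```

To try this against Reddit, you would create a `praw.Reddit` instance with your own API credentials, pass `reddit.subreddit("name")` and a timezone-aware UTC end time to `collect_submissions`, and then pass the result to `collect_comments`.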

The following steps show how to collect data from Reddit using Communalytic. The procedures for the Edu and Pro versions are similar.

Step 1

Go to the “My Datasets” page and click on the “Reddit (Live)” button.

Step 2

If you already know which subreddit you would like to examine, proceed to Step 4 of this tutorial. Otherwise, click on the “Locate a subreddit” button.

Step 3a

Using the Subreddit Search page, you can locate subreddits that discuss a given topic by using the “Keyword” search bar.

When searching for a subreddit, a space between words is treated as AND, so all of the words must match. If you would like to search for two keywords separately, use “|” to separate them.
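
One way to read these search rules is that a space means AND and “|” means OR. The sketch below illustrates that reading with a hypothetical `matches` helper. It is not Communalytic's search code; the case-insensitive substring matching and the choice to split on “|” before splitting on spaces are assumptions.

```python
def matches(query: str, text: str) -> bool:
    """Return True if text satisfies the query.

    Assumed semantics: "|" separates alternatives (OR); within one
    alternative, every space-separated word must appear (AND).
    """
    haystack = text.lower()
    for alternative in query.lower().split("|"):
        words = alternative.split()
        # An alternative matches only if all of its words are present.
        if words and all(word in haystack for word in words):
            return True
    return False


print(matches("climate change", "Climate policy and change"))  # True
print(matches("climate change", "Climate policy"))             # False
print(matches("vaccine | mask", "Mask mandates"))              # True
```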

After typing in your search keyword(s), click the “Search” button.

Step 3b

The Subreddit List page shows a list of public subreddits (with at least 100 comments made in the last 7 days) and sample posts corresponding to the search criteria.

Click the “Start Collection on…” button for the corresponding subreddit to select the desired subreddit.

Step 4

Before starting your data collection, name your dataset, then enter the name of the selected subreddit and the end date of data collection. You can collect data for up to 7 consecutive days in the Edu version and up to 31 consecutive days in the Pro version, counted from the current date.

You can check the box “Email me once job completes” to receive an email notification. (Note: the Pro version doesn’t have this checkbox, since it will send an email notification automatically.)

As a final step on this page, click the “Start Collection” button.

Data collection time will vary by subreddit. For subreddits with many comments and replies, the collection may take up to several hours after the end date to finish.
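
If you are planning a collection window in advance, the end-date limits from Step 4 can be checked with a few lines of Python. This is only a sketch: the names `MAX_DAYS`, `latest_end_date` and `is_valid_end_date` are hypothetical, and it assumes the limit is counted as "start date plus N days", which may differ from how Communalytic counts the boundary days.

```python
from datetime import date, timedelta

# Limits stated in this tutorial: 7 days (Edu), 31 days (Pro).
MAX_DAYS = {"edu": 7, "pro": 31}


def latest_end_date(start: date, version: str) -> date:
    """Return the latest end date allowed for a collection started on start."""
    return start + timedelta(days=MAX_DAYS[version])


def is_valid_end_date(start: date, end: date, version: str) -> bool:
    """Check that end is after start and within the version's limit."""
    return start < end <= latest_end_date(start, version)


start = date(2024, 3, 1)  # example start date, chosen arbitrarily
print(latest_end_date(start, "edu"))                      # 2024-03-08
print(latest_end_date(start, "pro"))                      # 2024-04-01
print(is_valid_end_date(start, date(2024, 3, 9), "edu"))  # False
```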

Step 5

To confirm that data collection is underway, you should be able to see your new dataset listed on the “My Datasets” page.

When your data collection is complete, “Complete” will appear under Status, as pictured above.