You can now download 1.65 billion Reddit comments

You can now download a dataset of 1.65 billion Reddit comments: Beware the Redditor AI

Once our species’ greatest trove of knowledge was the Library of Alexandria.

Now we have Reddit, a roiling mass of human ingenuity/douchebaggery that has recently focused on tearing itself apart like Tommy Wiseau in the legendarily awful flick ‘The Room.’

But unlike the ancient library, the fruits of Reddit’s labors, good and ill, will not be destroyed in fire.

In fact, thanks to Jason Baumgartner of PushShift.io (aided by The Internet Archive), a dataset of 1.65 billion comments, stretching from October 2007 to May 2015, is now available to download.

The data – pulled using Reddit’s API – is made up of JSON objects, including the comment, score, author, subreddit, position in the comment tree and a range of other fields.

The uncompressed dataset weighs in at over 1TB, meaning it’ll be most useful for major research projects with enough resources to really wrangle it.
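For a sense of what wrangling it might look like, here is a minimal sketch that reads comments one at a time rather than loading everything into memory. It assumes the files hold one JSON object per line, and the field names (`body`, `score`, `author`, `subreddit`, `parent_id`) are assumptions based on Reddit's API conventions; the article itself only names the fields in prose. The sample records are invented for illustration.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_comments(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one comment dict per line, skipping blank or malformed lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate the odd broken line instead of aborting
        if isinstance(record, dict):
            yield record


def comments_per_subreddit(lines: Iterable[str]) -> Counter:
    """Count comments per subreddit in a single streaming pass."""
    counts: Counter = Counter()
    for comment in iter_comments(lines):
        # "subreddit" is an assumed field name
        counts[comment.get("subreddit", "unknown")] += 1
    return counts


# Invented sample records, for illustration only
SAMPLE = [
    '{"author": "alice", "subreddit": "python", "score": 3, "body": "hi", "parent_id": "t3_a"}',
    '{"author": "bob", "subreddit": "python", "score": 1, "body": "hello", "parent_id": "t1_x"}',
    '{"author": "carol", "subreddit": "movies", "score": 7, "body": "oh hi Mark", "parent_id": "t3_b"}',
    'not json at all',
]

counts = comments_per_subreddit(SAMPLE)
print(counts.most_common())
```

Because an open file handle also iterates line by line, the same functions should work on a decompressed dump file without holding more than one comment in memory at a time.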

Technically, the archive is incomplete, but not significantly. After 14 months of work and many API calls, Baumgartner was faced with approximately 350,000 comments that were not available. In most cases that’s because the comment resides in a private subreddit or was simply removed.
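To put "not significantly" in numbers, the quick calculation below uses the two figures quoted above. It treats the 1.65 billion total as the denominator, which is an assumption, since the article does not say whether that total includes the missing comments.

```python
TOTAL_COMMENTS = 1_650_000_000  # size of the dataset, per the article
MISSING_COMMENTS = 350_000      # approximate count of unavailable comments


def missing_share(missing: int, total: int) -> float:
    """Return the missing comments as a fraction of the total."""
    return missing / total


share = missing_share(MISSING_COMMENTS, TOTAL_COMMENTS)
print(f"{share:.4%}")                                          # prints 0.0212%
print(f"about 1 in {TOTAL_COMMENTS / MISSING_COMMENTS:,.0f}")  # prints about 1 in 4,714
```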

Something wicked this way runs

There are plenty of things you could do with that much information – natural language processing, trend prediction, comment score analysis – but one option is particularly perturbing.

With that much data on human interactions, the Reddit dataset could serve as the corpus for an AI project exploring conversational modeling (predicting what will come next in a dialogue).
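As a rough illustration of how such a corpus might be prepared, the sketch below pairs each comment with the comment it replies to, giving (prompt, response) examples. The field names `id`, `parent_id` and `body`, and the `t1_` prefix marking a parent that is itself a comment, are assumptions based on Reddit's API conventions; the helper `reply_pairs` and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple


def reply_pairs(comments: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Return (parent text, reply text) pairs for replies to other comments."""
    by_id = {c["id"]: c for c in comments}
    pairs = []
    for comment in comments:
        parent_id = comment.get("parent_id", "")
        # Assumed convention: "t1_" means the parent is a comment,
        # while other prefixes (such as "t3_") point at the submission itself.
        if parent_id.startswith("t1_"):
            parent = by_id.get(parent_id[len("t1_"):])
            if parent is not None:  # the parent may be one of the missing comments
                pairs.append((parent["body"], comment["body"]))
    return pairs


# Invented sample thread, for illustration only
THREAD = [
    {"id": "c1", "parent_id": "t3_post", "body": "Is the dataset complete?"},
    {"id": "c2", "parent_id": "t1_c1", "body": "Nearly, some comments are missing."},
    {"id": "c3", "parent_id": "t1_c2", "body": "Good enough for me."},
    {"id": "c4", "parent_id": "t1_gone", "body": "Replying to a removed comment."},
]

for prompt, response in reply_pairs(THREAD):
    print(f"{prompt!r} -> {response!r}")
```

The in-memory lookup keeps the idea readable, but it would not scale to the full dataset; at that size the join would more plausibly be done in a database or a distributed processing framework.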

That’s key to understanding natural language and further developing machine intelligence. It’s what Google researchers touched upon recently in their paper about a chatbot that seems to hate children.

Now imagine an AI fed with nearly 1.65 billion interactions between Reddit users – RedditorBot, a technological tick clinging to the Web, bloated with the site’s fascinations, perversions, prejudices and outright arsehole tendencies.

It now occurs to me that Skynet won’t eradicate us because of a desire to remove the illogic of humanity. It’ll just be a supremely pissed-off, all-powerful artificial Redditor with a grudge.

➤ Complete Public Reddit Comments Corpus [Internet Archive]

Image credit: Bubbye on Imgur

Story by Mic Wright

Reporter, TNW

Mic Wright is a journalist specialising in technology, music and popular culture. He lives in Dublin. He is on Twitter at @brokenbottleboy.