1. Get the data

First, we need to gather the data from our chat applications. This section shows how to export chat history from two of the most widely used instant messaging apps: WhatsApp and Telegram.

1.1 WhatsApp export

We have to export one .txt file for each chat we want to include in the final dataset. The steps, as described on the official WhatsApp website, are:

- Open the WhatsApp mobile app.
- Pick an individual chat (e.g. a chat with one friend).
- Tap “More options” (three vertical dots).
- Select More > Export chat.
- Select Without Media in the pop-up.
- Select an email service (e.g. the Gmail app) and add your own email address as the recipient.
- Wait to receive the email with the chat history attached as a .txt file.
- Download the .txt attachment and store it on your computer.
- Repeat these steps for every individual chat you want to include.

Note that only one-to-one (individual) chats are supported. We suggest exporting the chats with the highest number of messages, to build a bigger dataset and get better final results.
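Once the exported files are on your computer, a quick way to see which chats contribute the most text is to count the lines in each file. The sketch below is only a rough check: it assumes the exports were saved in a folder called whatsapp_exports (a hypothetical name), and it treats the number of non-empty lines as an approximation of the number of messages, since a multi-line message spans several lines.

```python
from pathlib import Path


def rank_exports_by_size(folder):
    """Return (file name, non-empty line count) pairs, largest first."""
    counts = []
    for path in sorted(Path(folder).glob("*.txt")):
        with open(path, encoding="utf-8", errors="replace") as handle:
            # Most messages start on their own line, so the number of
            # non-empty lines is a rough proxy for the number of messages.
            n_lines = sum(1 for line in handle if line.strip())
        counts.append((path.name, n_lines))
    return sorted(counts, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # "whatsapp_exports" is a hypothetical folder name; adjust it to your setup.
    for name, n_lines in rank_exports_by_size("whatsapp_exports"):
        print(f"{n_lines:>8}  {name}")
```

Files near the bottom of the ranking add little data, so they are the first candidates to leave out.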

You should now have several files, each with a structure that looks like the snippet below:

Take note of the text that appears in place of the <YourName> placeholder in your exported chats. This is your display name in WhatsApp, and we will use this value later.
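If you are unsure which value to note down, listing the sender names that appear in an export can help. The sketch below is a minimal example under stated assumptions: the line layouts in the comment are common ones, but WhatsApp's export format varies with the phone's operating system and locale, so the regular expression may need adjusting for your files. The names count_senders and MESSAGE_START are ours, not part of any WhatsApp tooling.

```python
import re
from collections import Counter

# Assumed line layouts (they vary with phone OS and locale):
#   31/12/20, 22:15 - Alice: Happy new year!
#   [31/12/20, 22:15:07] Alice: Happy new year!
MESSAGE_START = re.compile(
    r"^\[?\d{1,4}[/.-]\d{1,2}[/.-]\d{1,4},? "  # date
    r"\d{1,2}[:.]\d{2}(?:[:.]\d{2})?"           # time, optional seconds
    r"(?:\s?[APap]\.?[Mm]\.?)?\]?"              # optional AM/PM and closing bracket
    r"(?: -)? "                                  # separator before the sender
    r"([^:]+): "                                 # sender name
)


def count_senders(lines):
    """Count how many messages each sender name starts."""
    senders = Counter()
    for line in lines:
        # Drop invisible marks that some exports put at the start of a line.
        match = MESSAGE_START.match(line.lstrip("\u200e\ufeff"))
        if match:
            senders[match.group(1)] += 1
    return senders


if __name__ == "__main__":
    sample = [
        "31/12/20, 22:15 - Alice: Happy new year!",
        "this line continues the previous message",
        "31/12/20, 22:16 - Bob: Thanks, you too",
    ]
    for name, n_messages in count_senders(sample).most_common():
        print(n_messages, name)
```

To run it on a real export, pass an open file instead of the sample list, for example `count_senders(open(path, encoding="utf-8"))`. In a one-to-one chat the result should contain two names, and the one that is yours is the <YourName> value.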

1.2 Telegram

The process here is faster than for WhatsApp because Telegram exports everything into a single .json file, without the limit of one chat at a time.

So, as described on the official Telegram website:

- Open the Telegram Desktop app.
- Open the menu in the upper left of the screen (three horizontal lines).
- Go to Settings > Advanced > Export Telegram data.
- Select only these fields: Account information, Contact list, Personal chats, Machine-readable JSON.
- Make sure nothing is selected under Media export settings, and set the size limit to the maximum.
- Launch the export and wait for it to finish.
- Rename the output file to “telegram_dump.json”.
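Before moving on, it can be worth confirming that the renamed file really is valid JSON and was not truncated during the export. The sketch below is a minimal check that only loads the file and reports its size and top-level keys. It makes no assumption about Telegram's schema, and summarize_json_dump is a helper name introduced here for illustration.

```python
import json
from pathlib import Path


def summarize_json_dump(path):
    """Load a JSON file and return its size in MB and its top-level shape."""
    with open(path, encoding="utf-8") as handle:
        # json.load raises json.JSONDecodeError if the file is not valid JSON.
        data = json.load(handle)
    size_mb = Path(path).stat().st_size / 1_000_000
    if isinstance(data, dict):
        shape = sorted(data.keys())
    else:
        shape = type(data).__name__
    return size_mb, shape


if __name__ == "__main__":
    dump = Path("telegram_dump.json")
    if dump.exists():
        size_mb, shape = summarize_json_dump(dump)
        print(f"{dump.name}: {size_mb:.1f} MB, top level: {shape}")
    else:
        print(f"{dump.name} not found in the current folder")
```

If loading fails with a decode error, the export was probably interrupted and is worth repeating.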

Now you should have one file named telegram_dump.json with this structure:

2. Parse the data