Susan Potter

Extracting Your Twitter Archive into DuckDB

Sun December 12, 2022

Since I heard a number of people interested in extracting the data from their Twitter archive, I thought it would be beneficial to describe the steps.


  • you have already requested your account's Twitter archive via the Twitter site

  • your Twitter archive is currently available for download

  • you have jq and duckdb command-line tools installed or know how to install them. If you use Nix as your package manager it should be as simple as running nix-shell -p jq duckdb which enters you into a temporary shell with those two command installed and brought into scope.


0. Download and unzip zip archive:

Before you can start working with your Twitter archive, you need to download it from your Twitter account settings. Once you have downloaded the zip archive, you need to unzip it to access the files inside. This will extract the files and folders contained in the zip archive to your current location.

1. Edit data/tweets.js saving pure JSON to tweets.json. Only first line, remove global assignment leaving only JSON:

The first step is to extract the tweets data from the tweets.js file. This file contains your tweets data in JavaScript Object Notation (JSON) format. However, the file also contains some additional code that wraps the JSON data in a global assignment. To extract the pure JSON data, you need to edit the tweets.js file and remove the first line that contains the global assignment. Once you have removed this line, you can save the remaining JSON data to a new file called tweets.json.

2. Flatten tweet structure to essential fields

jq -r '.[].tweet | {id: .id_str, favorite_count, retweet_count, created_at, full_text}' tweets.json > flat_tweets.json

The second step involves using the jq command to extract and filter the tweet data from the tweets.json file. The jq command is a command-line tool for parsing and manipulating JSON data. In this step, the jq command extracts the tweet data from the tweets.json file and saves the flattened tweet JSON structured output to a new file called flat_tweets.json. The jq command uses a filter expression to specify which data to extract. In this case, the filter expression extracts the id, favorite_count, retweet_count, created_at, and full_text fields for each tweet in the tweets.json file.

3. Slurp flat_tweets.json content and create CSV file via jq

jq -r 'map({id, favorite_count: .favorite_count | tonumber, retweet_count: .retweet_count | tonumber, full_text,created_at}) | (first | keys_unsorted) as $keys | map([to_entries[] | .value]) as $rows | $keys,$rows[] | @csv' -s ./flat_tweets.json > flat_tweets.csv

The third step involves using the jq command again to convert the tweet data from flat_tweets.json to a CSV file. CSV stands for Comma Separated Values and is a common format for storing and exchanging tabular data. In this step, the jq command uses a filter expression to convert the tweet data from JSON to CSV format and save it to a new file called flat_tweets.csv.

Here is a breakdown of the filter expressions in this command:

  • map({id, favorite_count: .favorite_count | tonumber, retweet_count: .retweet_count | tonumber, full_text,created_at}): This expression uses the map function to transform each element in the input array (which represents a tweet) into a new object with the specified properties. The favorite_count and retweet_count properties get converted to numbers using the builtin tonumber function.

  • (first | keys_unsorted) as $keys: This expression extracts the keys of the first element in the input array, then assigns them to a variable called $keys. The keys_unsorted builtin function returns the keys in the order they appear in the object, rather than sorting them alphabetically. These will form the CSV file header row.

  • map([to_entries[] | .value]) as $rows: This expression uses the map function to transform each element in the input array into an array of values, by extracting the values of each property using the to_entries builtin. The resulting arrays of values get assigned to a variable named $rows.

  • $keys,$rows[] | @csv: This expression outputs the contents of the $keys and $rows variables as a CSV file. The @csv function converts the input data into CSV format.

  • Finally, the -s jq command-line option tells jq to read the input from a file rather than from standard input, and the > flat_tweets.csv part of the command redirects the standard output to a file called flat_tweets.csv.

4. Create DuckDB database file

duckdb tweets.duckdb

The fourth step involves using the duckdb command to create a new DuckDB database. DuckDB is an embeddable SQL database designed to be embeddable, lightweight and fast. In this step, the duckdb command creates a new DuckDB database file called tweets.duckdb and opens up the duckdb shell immediately after.

5. (In duckdb shell): CREATE TABLE tweets AS SELECT * FROM read_csv_auto('./flat_tweets.csv');

The fifth step involves using the CREATE TABLE SQL command to create a new table in the DuckDB database inserting the data produced by a CSV import SELECT statement. Note that the schema is automatically inferred from the data in the CSV file upon the CSV importing function read_csv_auto.

6. (Still in duckdb shell): select count(1) from tweets;

The output from this should show you the number of tweets in the original tweets.json you started with in step 2 that you can cross reference using the command jq '. | length to validate.

Who cares?

Now you can query the database for tweets that match the criterion you are looking to migrate.

For example, you might want to only republish "popular" tweets so you might write a SELECT query like the following:

SELECT * from tweets WHERE favorite_count > 10 AND retweet_count > 3;

This is totally arbitrary but I wanted to explain why having the tweets in DuckDB is useful. :)

You might also want to extract different data into CSV and import into a new database in the same DuckDB database file that describe the entities so you can do more interesting joins. For me this wasn't an interesting use case yet.