
Extracting Your Twitter Archive into DuckDB

If you're interested in extracting data from your Twitter archive and exploring it in a fun and educational way, I've got you covered!

Assumptions

  • You have already requested your Twitter archive through the Twitter website.
  • Your Twitter archive is available for download.
  • You have the jq and duckdb command-line tools installed, or you know how to install them. If you use Nix as your package manager, you can get a temporary shell environment containing both tools by running nix-shell -p jq duckdb, or you can use the flake.nix I created for this project, shown below:
{
  description = "Extract and analyze your Twitter Archive";

  nixConfig.bash-prompt = "\[analyze-twitter-archive\]> ";

  inputs = {
    nixpkgs.url = "github:nixos/nixpkgs/release-23.05";
    flake-utils.url = "github:numtide/flake-utils";
  };

  outputs = { self, nixpkgs, flake-utils }:
    flake-utils.lib.eachDefaultSystem (system:
      let
        pkgs = nixpkgs.legacyPackages."${system}";
        inherit (pkgs) mkShell;
      in
      {
        devShell = mkShell {
          buildInputs = with pkgs; [
            duckdb
            jq
          ];
        };
      }
    );
}
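
With the flake in place (and flakes enabled in your Nix installation), you can drop into a shell that has both tools available by running:

nix develop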

Importing Your Twitter Archive into DuckDB

0. Download and unzip your Twitter Archive:

Before you can start working with your Twitter archive, you need to download it from your Twitter account settings. Once the zip archive has downloaded, unzip it to extract the files and folders it contains into your current location.
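
For example (the archive filename here is just a placeholder; yours will include your account name and a date):

unzip twitter-archive.zip -d twitter-archive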

1. Edit data/tweets.js saving pure JSON to tweets.json.

The first step is to extract the tweet data from the data/tweets.js file. I was able to do this by copying it to tweets.js.bak and running the following sed command to strip the JavaScript assignment prefix:

sed 's/^window\.YTD\.tweets\.part0 = //g' tweets.js.bak > tweets.json

Now we have a tweets.json file that is pure JSON: it contains your tweet data in JavaScript Object Notation (JSON) format.

However, if the above command doesn't work for you, edit the tweets.js file by hand: remove the first line that contains the global assignment and replace it with just [ to open the JSON array. Once you have removed this line, save the remaining JSON data to a new file called tweets.json.

To check that this file is well-formed JSON, run:

jq -e '.' tweets.json 1>/dev/null

If it isn't, you will usually see a parse error message that tells you the line and column where parsing failed. For instance, if you accidentally run it on tweets.js instead, you will get the following:

parse error: Invalid numeric literal at line 1, column 24

Now let's review the basic JSON schema of the new JSON file:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "tweet": {
        "type": "object",
        "properties": {
          "edit_info": {
            "type": "object",
            "properties": {
              "initial": {
                "type": "object",
                "properties": {
                  "editTweetIds": {
                    "type": "array",
                    "items": {
                      "type": "string"
                    }
                  },
                  "editableUntil": {
                    "type": "string",
                    "format": "date-time"
                  },
                  "editsRemaining": {
                    "type": "string"
                  },
                  "isEditEligible": {
                    "type": "boolean"
                  }
                },
                "required": ["editTweetIds", "editableUntil", "editsRemaining", "isEditEligible"]
              }
            },
            "required": ["initial"]
          },
          "retweeted": {
            "type": "boolean"
          },
          "source": {
            "type": "string"
          },
          "entities": {
            "type": "object",
            "properties": {
              "hashtags": {
                "type": "array",
                "items": {
                  "type": "object",
                  "properties": {
                    "text": {
                      "type": "string"
                    },
                    "indices": {
                      "type": "array",
                      "items": {
                        "type": "string"
                      }
                    }
                  },
                  "required": ["text", "indices"]
                }
              },
              "symbols": {
                "type": "array",
                "items": {
                  "type": "string"
                }
              },
              "user_mentions": {
                "type": "array",
                "items": {
                  "type": "object",
                  "properties": {
                    "name": {
                      "type": "string"
                    },
                    "screen_name": {
                      "type": "string"
                    },
                    "indices": {
                      "type": "array",
                      "items": {
                        "type": "string"
                      }
                    },
                    "id_str": {
                      "type": "string"
                    },
                    "id": {
                      "type": "string"
                    }
                  },
                  "required": ["name", "screen_name", "indices", "id_str", "id"]
                }
              },
              "urls": {
                "type": "array",
                "items": {
                  "type": "object",
                  "properties": {
                    "url": {
                      "type": "string",
                      "format": "uri"
                    },
                    "expanded_url": {
                      "type": "string",
                      "format": "uri"
                    },
                    "display_url": {
                      "type": "string"
                    },
                    "indices": {
                      "type": "array",
                      "items": {
                        "type": "string"
                      }
                    }
                  },
                  "required": ["url", "expanded_url", "display_url", "indices"]
                }
              }
            },
            "required": ["hashtags", "symbols", "user_mentions", "urls"]
          },
          "display_text_range": {
            "type": "array",
            "items": {
              "type": "string"
            }
          },
          "favorite_count": {
            "type": "string"
          },
          "id_str": {
            "type": "string"
          },
          "truncated": {
            "type": "boolean"
          },
          "retweet_count": {
            "type": "string"
          },
          "id": {
            "type": "string"
          },
          "created_at": {
            "type": "string",
            "format": "date-time"
          },
          "favorited": {
            "type": "boolean"
          },
          "full_text": {
            "type": "string"
          },
          "lang": {
            "type": "string"
          },
          "possibly_sensitive": {
            "type": "boolean"
          }
        },
        "required": [
          "edit_info",
          "retweeted",
          "source",
          "entities",
          "display_text_range",
          "favorite_count",
          "id_str",
          "truncated",
          "retweet_count",
          "id",
          "created_at",
          "favorited",
          "full_text",
          "lang"
        ]
      }
    },
    "required": ["tweet"]
  }
}

This JSON schema defines the structure and data types for the provided tweet example data.

In human-readable form it looks like the following table:

Property                                  Type     Description
tweet                                     object   The main tweet object.
tweet.edit_info                           object   Information about tweet edits.
tweet.edit_info.initial                   object   Initial edit information.
tweet.edit_info.initial.editTweetIds      array    An array of edit tweet IDs.
tweet.edit_info.initial.editableUntil     string   The datetime until which the tweet can be edited.
tweet.edit_info.initial.editsRemaining    string   The number of edits remaining for the tweet.
tweet.edit_info.initial.isEditEligible    boolean  Indicates whether the tweet is eligible for editing.
tweet.retweeted                           boolean  Indicates whether the tweet has been retweeted.
tweet.source                              string   The source of the tweet.
tweet.entities                            object   Entities associated with the tweet, such as hashtags, symbols, user mentions, and URLs.
tweet.entities.hashtags                   array    An array of hashtags used in the tweet.
tweet.entities.symbols                    array    An array of symbols mentioned in the tweet.
tweet.entities.user_mentions              array    An array of user mentions in the tweet.
tweet.entities.urls                       array    An array of URLs mentioned in the tweet.
tweet.display_text_range                  array    The range of displayable text within the tweet.
tweet.favorite_count                      string   The number of favorites (likes) for the tweet.
tweet.id_str                              string   The string representation of the tweet ID.
tweet.truncated                           boolean  Indicates whether the tweet is truncated.
tweet.retweet_count                       string   The number of retweets for the tweet.
tweet.id                                  string   The ID of the tweet.
tweet.created_at                          string   The creation datetime of the tweet.
tweet.favorited                           boolean  Indicates whether the tweet has been favorited.
tweet.full_text                           string   The full text of the tweet.
tweet.lang                                string   The language of the tweet.
tweet.possibly_sensitive                  boolean  Indicates whether the tweet may contain sensitive content.

This table provides an overview of the properties, their types, and descriptions of the nested objects within the JSON schema.

2. Flatten tweet structure to essential fields

jq -r \
  '.[].tweet | {id: .id_str, favorite_count, retweet_count, created_at, full_text}' \
  tweets.json > flat_tweets.json

The second step uses the jq command to extract and flatten the tweet data from the tweets.json file. jq is a command-line tool for parsing and manipulating JSON data. The filter expression selects the id (taken from id_str), favorite_count, retweet_count, created_at, and full_text fields of each tweet, and the flattened output is saved to a new file called flat_tweets.json.
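
As a quick sanity check, you can count the flattened records (the output is a stream of objects, so slurp it with -s) and compare that against the length of the original array; both commands should print the same number:

jq -s 'length' flat_tweets.json
jq 'length' tweets.json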

3. Optional: Slurp flat_tweets.json content and create CSV file via jq

This step is purely optional because DuckDB now supports importing JSON directly. If you are using DuckDB rather than something that requires CSV, you can skip to the next step.

jq -r \
  'map({id, favorite_count: .favorite_count | tonumber, retweet_count: .retweet_count | tonumber, full_text,created_at}) | (first | keys_unsorted) as $keys | map([to_entries[] | .value]) as $rows | $keys,$rows[] | @csv' \
    -s ./flat_tweets.json > flat_tweets.csv

The third step uses jq again to convert the tweet data from flat_tweets.json into a CSV file. CSV stands for Comma-Separated Values and is a common format for storing and exchanging tabular data. Here the jq filter expression converts the tweet data from JSON to CSV format and saves it to a new file called flat_tweets.csv.

Here is a breakdown of the filter expressions in this command:

  • map({id, favorite_count: .favorite_count | tonumber, retweet_count: .retweet_count | tonumber, full_text,created_at}): This expression uses the map function to transform each element in the input array (which represents a tweet) into a new object with the specified properties. The favorite_count and retweet_count properties get converted to numbers using the builtin tonumber function.
  • (first | keys_unsorted) as $keys: This expression extracts the keys of the first element in the input array, then assigns them to a variable called $keys. The keys_unsorted builtin function returns the keys in the order they appear in the object, rather than sorting them alphabetically. These will form the CSV file header row.
  • map([to_entries[] | .value]) as $rows: This expression uses the map function to transform each element in the input array into an array of values, by extracting the values of each property using the to_entries builtin. The resulting arrays of values get assigned to a variable named $rows.
  • $keys,$rows[] | @csv: This expression outputs the contents of the $keys and $rows variables as a CSV file. The @csv function converts the input data into CSV format.
  • Finally, the -s (slurp) command-line option tells jq to read the entire input into a single array rather than processing each object separately, and the > flat_tweets.csv part of the command redirects standard output to a file called flat_tweets.csv; the resulting header row can be checked as shown below.
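
Given the key order established by keys_unsorted above, the first line of flat_tweets.csv should be the header row, which you can confirm with:

head -n 1 flat_tweets.csv
# expected: "id","favorite_count","retweet_count","full_text","created_at"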

4. Create duckdb database file

duckdb tweets.duckdb

The fourth step uses the duckdb command to create a new DuckDB database. DuckDB is an embeddable SQL database designed to be lightweight and fast. In this step, the duckdb command creates a new database file called tweets.duckdb and opens the duckdb shell on it immediately afterwards.
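
If you prefer not to work in the interactive shell, you can also pipe SQL into the same database file from your regular shell, for example:

echo "SELECT 42;" | duckdb tweets.duckdb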

5. (In duckdb shell): CREATE TABLE tweets AS SELECT * FROM read_csv_auto('./flat_tweets.csv');

If you are importing directly from JSON you would use the following:

CREATE TABLE tweets AS
SELECT * FROM read_json_auto('./flat_tweets.json');

If you created a CSV file for other purposes you can follow along with the following:

CREATE TABLE tweets AS
SELECT * FROM read_csv_auto('./flat_tweets.csv');

In this fifth step we use the CREATE TABLE ... AS SELECT SQL command to create a new table in the tweets.duckdb database, populated with the rows produced by the import SELECT statement. Note that the schema is automatically inferred from the data by the read_csv_auto (or read_json_auto) importing function.

Note: if you want to use the timestamp fields as timestamps, you will need to look at the later section 'Query for Tweets by Year' for the right way to create the table from the JSON or CSV file.
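
You can check which column types DuckDB inferred at any point (the same command is used again later in this post):

SHOW TABLE tweets;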

6. (Still in duckdb shell): select count(1) from tweets;

The output from this should show the number of tweets in the original tweets.json you created in step 1, which you can cross-reference by running jq '. | length' tweets.json.

If those numbers match, your raw tweet archive has successfully been imported into duckdb (without entities).

Example Queries to Analyze your Twitter Archive

Who cares? Well, now we can query the DuckDB database for tweets that match whatever criteria you are interested in.

Query for Popular Tweets

All of us indulge in a little vanity, so I suspect the most common simple query will be to retrieve tweets that gained significant popularity, with more than, say, 10 favorites and 3 retweets. You can do that with the following SQL query:

SELECT *
   FROM tweets
  WHERE favorite_count > 10
    AND retweet_count > 3;

Adjust the thresholds based on what you deem consequential for each of your Twitter accounts.
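
If you would rather rank tweets than filter by fixed thresholds, a variation like the following also works (the weighting of retweets versus favorites here is arbitrary):

SELECT full_text, favorite_count, retweet_count
  FROM tweets
 ORDER BY favorite_count + 2 * retweet_count DESC
 LIMIT 20;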

Query for Tweets by Year

Sure, you can do this with an advanced Twitter search query in the web UI for your own tweets (and others), but surely that will soon be behind a paywall. If you're interested in analyzing your tweet activity over time, you can use the following SQL query to count the tweets in each calendar year, sorted by descending year:

SELECT
  count(1),
  datepart('year', created_at) AS year
FROM tweets
GROUP BY year
ORDER BY year DESC;

When I first ran this I got the following scary error:

Error: Binder Error: No function matches the given name and argument types 'datepart(VARCHAR, VARCHAR)'.
You might need to add explicit type casts.
        Candidate functions:
        datepart(VARCHAR, DATE) -> BIGINT
        datepart(VARCHAR, TIMESTAMP) -> BIGINT
        datepart(VARCHAR, TIME) -> BIGINT
        datepart(VARCHAR, INTERVAL) -> BIGINT
        datepart(VARCHAR[], DATE) -> STRUCT()
        datepart(VARCHAR[], TIMESTAMP) -> STRUCT()
        datepart(VARCHAR[], TIME) -> STRUCT()
        datepart(VARCHAR[], INTERVAL) -> STRUCT()
        datepart(VARCHAR, TIMESTAMP WITH TIME ZONE) -> BIGINT
        datepart(VARCHAR[], TIMESTAMP WITH TIME ZONE) -> STRUCT()

LINE 1: select count(1), datepart('year', created_at) as year fr...

The reason for this is that when we imported the JSON earlier via the read_json_auto function, DuckDB automatically determined the data types for the columns. If we look at the auto-detected column types, we see that created_at was set to VARCHAR (a text type) instead of a timestamp type.

SHOW TABLE tweets;

The output for me looked like:

│  column_name   │ column_type │  null   │
│    varchar     │   varchar   │ varchar │
├────────────────┼─────────────┼─────────┼
│ id             │ BIGINT      │ YES     │
│ favorite_count │ BIGINT      │ YES     │
│ retweet_count  │ BIGINT      │ YES     │
│ created_at     │ VARCHAR     │ YES     │
│ full_text      │ VARCHAR     │ YES     │

The way to fix this is by telling read_json_auto how to parse timestamps with a set format.

The following is what I used to create a new table, tweets2, which parses created_at as a proper TIMESTAMP:

CREATE TABLE tweets2 AS
SELECT * FROM read_json_auto(
         './flat_tweets.json',
         timestampformat='%a %b %d %H:%M:%S %z %Y'
);

Now for the tweets2 schema we get the following when we run SHOW TABLE tweets2:

│  column_name   │ column_type │  null   │
│    varchar     │   varchar   │ varchar │
├────────────────┼─────────────┼─────────┼
│ id             │ BIGINT      │ YES     │
│ favorite_count │ BIGINT      │ YES     │
│ retweet_count  │ BIGINT      │ YES     │
│ created_at     │ TIMESTAMP   │ YES     │
│ full_text      │ VARCHAR     │ YES     │

Now we can run the original SQL query on the new table like so:

SELECT
  count(1),
  datepart('year', created_at) AS year
FROM tweets2
GROUP BY year
ORDER BY year DESC;

And we will get something like the following:

┌──────────┬───────┐
│ count(1) │ year  │
│  int64   │ int64 │
├──────────┼───────┤
│      952 │  2023 │
│     3721 │  2022 │
│     4203 │  2020 │
│     4475 │  2019 │
│     4128 │  2018 │
│     9402 │  2017 │
│     5689 │  2016 │
│     2321 │  2015 │
│     3618 │  2014 │
│     2320 │  2013 │
│     2831 │  2012 │
│     2303 │  2011 │
│      624 │  2010 │
├──────────┴───────┤
│     13 rows      │
└──────────────────┘

Query for Top Hashtags

For hashtags, we will need to go back to the data shaping phase, probably using jq again.

We will need per-tweet hashtag data from the JSON to import into a new table, but first let's do some spelunking to get a feel for the data:

jq '[.[].tweet.entities | select(.hashtags | length > 0)] | length' tweets.json

Breaking down this query we get:

  • [ expr ]: surrounding another expression in square brackets creates a new array.
  • .[].tweet.entities: This digs into the tweet key of each item in the outer array and returns the entities object.
  • select(.hashtags | length > 0): This keeps only the entities objects that have a non-empty hashtags array.
  • length: This calculates the length of the resulting array, giving you the count of objects that passed the filter.

This query gets us the count of all objects in the outer array that have a non-empty hashtags array.

The result gives us an idea of the number of tweets we used hashtags in. In my Twitter Archive data, I used hashtags in about 6% of my tweets.
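
If you want jq to compute that percentage directly (hashtag-bearing tweets divided by total tweets), something like this does the job:

jq '([.[].tweet.entities | select(.hashtags | length > 0)] | length) / length * 100' tweets.json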

To create a hashtags join table with a tweet id and hashtag text, we can first create a JSON file containing an array of objects that each look like { "id": "123455667", "hashtags": [ "first", "second", "third" ] }. We can do that with the following jq command:

jq '[.[].tweet | select(.entities.hashtags | length > 0) | { "id": .id, "hashtags": [.entities.hashtags[].text] }]' \
  tweets.json > hashtags.json
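
Before importing, it is worth spot-checking the shape of the new file, for example by looking at its first element:

jq '.[0]' hashtags.json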

We can then import the JSON into a table with one row per tweet id and hashtag pair like so:

CREATE TABLE hashtags AS
SELECT id, unnest(hashtags) AS hashtag
FROM read_json_auto('./hashtags.json');

Our new hashtags table has a schema that looks like this:

│ column_name │ column_type │  null   │
│   varchar   │   varchar   │ varchar │
├─────────────┼─────────────┼─────────┼
│ id          │ BIGINT      │ YES     │
│ hashtag     │ VARCHAR     │ YES     │

Now, to stitch together my top 10 hashtags of all time, we can use the following SQL query:

SELECT h.hashtag, COUNT(*) AS count
FROM hashtags AS h
JOIN tweets AS t ON h.id = t.id
GROUP BY h.hashtag
ORDER BY count DESC
LIMIT 10;

Here we are joining the hashtags table with the tweets table by id, grouping by the hashtag column, sorting the results in descending order by the count of each hashtag, and cutting off after the first ten.
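
Once you have the timestamp-aware tweets2 table from earlier, a small variation gives you the top hashtags for a single year (2020 below is just an example):

SELECT h.hashtag, COUNT(*) AS count
FROM hashtags AS h
JOIN tweets2 AS t ON h.id = t.id
WHERE datepart('year', t.created_at) = 2020
GROUP BY h.hashtag
ORDER BY count DESC
LIMIT 10;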

Obviously there is more we could do here to make this a better schema w.r.t. user mentions, edits, urls, symbols, etc.

Concluding thoughts

The above walked you through sanitizing and shaping your tweet archive via jq and then importing it into DuckDB so you can query it with familiar SQL.

Happy exploring with your Twitter archive in DuckDB! Cheers!

If you enjoyed this content, please consider sharing via social media, following my accounts, or subscribing to the RSS feed.