Updates:
- July 2023: I previously used the JSON format from December 2022 which did not account for tweet edits.
If you're interested in extracting data from your Twitter archive and exploring it in a fun and educational way, I've got you covered!
Assumptions
- You have already requested your Twitter archive through the Twitter website.
- Your Twitter archive is available for download.
- You have
jq
andduckdb
command-line tools installed, or you know how to install them. If you use Nix as your package manager, you can easily install them by runningnix-shell -p jq duckdb
, which will provide you with a temporary shell environment containing these two tools or you can use a Nixflake.nix
I created below for this project:
{
description = "Extract and analyze your Twitter Archive";
nixConfig.bash-prompt = "\[analyze-twitter-archive\]> ";
inputs = {
nixpkgs.url = "github:nixos/nixpkgs/release-23.05";
flake-utils.url = "github:numtide/flake-utils";
};
outputs = { self, nixpkgs, flake-utils }:
flake-utils.lib.eachDefaultSystem (system:
let
pkgs = nixpkgs.legacyPackages."${system}";
inherit (pkgs) mkShell;
in
{
devShell = pkgs.mkShell {
buildInputs = with pkgs; [
duckdb
jq
];
};
}
);
}
Importing Your Twitter Archive into DuckDB
0. Download and unzip your Twitter Archive:
Before you can start working with your Twitter archive, you need to download it from your Twitter account settings. Once you have downloaded the zip archive, you need to unzip it to access the files inside. This will extract the files and folders contained in the zip archive to your current location.
1. Edit data/tweets.js saving pure JSON to tweets.json.
The first step is to extract the tweets data from the tweets.js
file. I was able to do this by running the following sed
command:
sed 's/^window\.YTD\.tweets\.part0 = //g' tweets.js.bak > tweets.json
Now we have a tweets.json
which is purely JSON data. This file contains your tweets data in JavaScript Object Notation (JSON) format.
However, if the above command doesn't work for you then you need to edit the tweets.js
file and remove the first line that contains the global assignment and replace with just [
to open the JSON Array. Once you have removed this line, you can save the remaining JSON data to a new file called tweets.json
.
To check this file is well formed JSON run:
jq -e '.' tweets.json 1>/dev/null
If it isn't you will usually see a parse error
message that tells you what line and column the parsing error occurred for pointers. For instance if you run it on your tweets.js
by accident you will get the following:
parse error: Invalid numeric literal at line 1, column 24
Now let's review the basic JSON schema of the new JSON file:
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "array",
"items": {
"type": "object",
"properties": {
"tweet": {
"type": "object",
"properties": {
"edit_info": {
"type": "object",
"properties": {
"initial": {
"type": "object",
"properties": {
"editTweetIds": {
"type": "array",
"items": {
"type": "string"
}
},
"editableUntil": {
"type": "string",
"format": "date-time"
},
"editsRemaining": {
"type": "string"
},
"isEditEligible": {
"type": "boolean"
}
},
"required": ["editTweetIds", "editableUntil", "editsRemaining", "isEditEligible"]
}
},
"required": ["initial"]
},
"retweeted": {
"type": "boolean"
},
"source": {
"type": "string"
},
"entities": {
"type": "object",
"properties": {
"hashtags": {
"type": "array",
"items": {
"type": "object",
"properties": {
"text": {
"type": "string"
},
"indices": {
"type": "array",
"items": {
"type": "string"
}
}
},
"required": ["text", "indices"]
}
},
"symbols": {
"type": "array",
"items": {
"type": "string"
}
},
"user_mentions": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {
"type": "string"
},
"screen_name": {
"type": "string"
},
"indices": {
"type": "array",
"items": {
"type": "string"
}
},
"id_str": {
"type": "string"
},
"id": {
"type": "string"
}
},
"required": ["name", "screen_name", "indices", "id_str", "id"]
}
},
"urls": {
"type": "array",
"items": {
"type": "object",
"properties": {
"url": {
"type": "string",
"format": "uri"
},
"expanded_url": {
"type": "string",
"format": "uri"
},
"display_url": {
"type": "string"
},
"indices": {
"type": "array",
"items": {
"type": "string"
}
}
},
"required": ["url", "expanded_url", "display_url", "indices"]
}
}
},
"required": ["hashtags", "symbols", "user_mentions", "urls"]
},
"display_text_range": {
"type": "array",
"items": {
"type": "string"
}
},
"favorite_count": {
"type": "string"
},
"id_str": {
"type": "string"
},
"truncated": {
"type": "boolean"
},
"retweet_count": {
"type": "string"
},
"id": {
"type": "string"
},
"created_at": {
"type": "string",
"format": "date-time"
},
"favorited": {
"type": "boolean"
},
"full_text": {
"type": "string"
},
"lang": {
"type": "string"
},
"possibly_sensitive": {
"type": "boolean"
}
},
"required": [
"edit_info",
"retweeted",
"source",
"entities",
"display_text_range",
"favorite_count",
"id_str",
"truncated",
"retweet_count",
"id",
"created_at",
"favorited",
"full_text",
"lang"
]
}
},
"required": ["tweet"]
}
}
This JSON schema defines the structure and data types for the provided tweet example data.
In human readable form it looks like the following:
Certainly! Here's a table that describes the JSON schema with nested objects:
Property | Type | Description |
---|---|---|
tweet | object | The main tweet object. |
tweet.edit_info | object | Information about tweet edits. |
tweet.edit_info.initial | object | Initial edit information. |
tweet.edit_info.initial.editTweetIds | array | An array of edit tweet IDs. |
tweet.edit_info.initial.editableUntil | string | The datetime until which the tweet can be edited. |
tweet.edit_info.initial.editsRemaining | string | The number of edits remaining for the tweet. |
tweet.edit_info.initial.isEditEligible | boolean | Indicates whether the tweet is eligible for editing. |
tweet.retweeted | boolean | Indicates whether the tweet has been retweeted. |
tweet.source | string | The source of the tweet. |
tweet.entities | object | Entities associated with the tweet, such as hashtags, symbols, user mentions, and URLs. |
tweet.entities.hashtags | array | An array of hashtags used in the tweet. |
tweet.entities.symbols | array | An array of symbols mentioned in the tweet. |
tweet.entities.user_mentions | array | An array of user mentions in the tweet. |
tweet.entities.urls | array | An array of URLs mentioned in the tweet. |
tweet.display_text_range | array | The range of displayable text within the tweet. |
tweet.favorite_count | string | The number of favorites (likes) for the tweet. |
tweet.id_str | string | The string representation of the tweet ID. |
tweet.truncated | boolean | Indicates whether the tweet is truncated. |
tweet.retweet_count | string | The number of retweets for the tweet. |
tweet.id | string | The ID of the tweet. |
tweet.created_at | string | The creation datetime of the tweet. |
tweet.favorited | boolean | Indicates whether the tweet has been favorited. |
tweet.full_text | string | The full text of the tweet. |
tweet.lang | string | The language of the tweet. |
tweet.possibly_sensitive | boolean | Indicates whether the tweet may contain sensitive content. |
This table provides an overview of the properties, their types, and descriptions of the nested objects within the JSON schema.
2. Flatten tweet structure to essential fields
jq -r \
'.[].tweet | {id: .id_str, favorite_count, retweet_count, created_at, full_text}' \
tweets.json > flat_tweets.json
The second step involves using the jq
command to extract and filter the tweet data from the tweets.json file. The jq
command is a command-line tool for parsing and manipulating JSON data. In this step, the jq
command extracts the tweet data from the tweets.json file and saves the flattened tweet JSON structured output to a new file called flat_tweets.json. The jq
command uses a filter expression to specify which data to extract. In this case, the filter expression extracts the id, favorite_count, retweet_count, created_at, and full_text fields for each tweet in the tweets.json file.
3. Optional: Slurp flat_tweets.json content and create CSV file via jq
This step is purely optional because DuckDB now supports importing JSON directly now. If you are using DuckDB instead of something that requires CSV then you can just skip to the next step.
jq -r \
'map({id, favorite_count: .favorite_count | tonumber, retweet_count: .retweet_count | tonumber, full_text,created_at}) | (first | keys_unsorted) as $keys | map([to_entries[] | .value]) as $rows | $keys,$rows[] | @csv' \
-s ./flat_tweets.json > flat_tweets.csv
The third step involves using the jq
command again to convert the tweet data from flat_tweets.json
to a CSV file. CSV stands for Comma Separated Values and is a common format for storing and exchanging tabular data. In this step, the jq
command uses a filter expression to convert the tweet data from JSON to CSV format and save it to a new file called flat_tweets.csv
.
Here is a breakdown of the filter expressions in this command:
map({id, favorite_count: .favorite_count | tonumber, retweet_count: .retweet_count | tonumber, full_text,created_at})
: This expression uses the map function to transform each element in the input array (which represents a tweet) into a new object with the specified properties. The favorite_count and retweet_count properties get converted to numbers using the builtin tonumber function.(first | keys_unsorted) as $keys
: This expression extracts the keys of the first element in the input array, then assigns them to a variable called $keys. The keys_unsorted builtin function returns the keys in the order they appear in the object, rather than sorting them alphabetically. These will form the CSV file header row.map([to_entries[] | .value]) as $rows
: This expression uses the map function to transform each element in the input array into an array of values, by extracting the values of each property using the to_entries builtin. The resulting arrays of values get assigned to a variable named $rows.$keys,$rows[] | @csv
: This expression outputs the contents of the $keys and $rows variables as a CSV file. The @csv function converts the input data into CSV format.- Finally, the
-s
jq
command-line option tellsjq
to read the input from a file rather than from standard input, and the> flat_tweets.csv
part of the command redirects the standard output to a file calledflat_tweets.csv
.
4. Create duckdb
database file
duckdb tweets.duckdb
The fourth step involves using the duckdb
command to create a new duckdb
database. duckdb
is an embeddable SQL database designed to be embeddable, lightweight and fast. In this step, the duckdb
command creates a new duckdb
database file called tweets.duckdb
and opens up the duckdb
shell immediately after.
5. (In duckdb
shell): CREATE TABLE tweets AS SELECT * FROM read_csv_auto('./flat_tweets.csv');
If you are importing directly from JSON you would use the following:
CREATE TABLE tweets AS
SELECT * FROM read_json_auto('./flat_tweets.json');
If you created a CSV file for other purposes you can follow along with the following:
CREATE TABLE tweets AS
SELECT * FROM read_csv_auto('./flat_tweets.csv');
In this fifth step we use the CREATE TABLE
SQL command to create a new table in the tweets.duckdb
database inserting the data produced by a CSV import SELECT
statement. Note that the schema is automatically inferred from the data in the CSV file upon the CSV importing function read_csv_auto
.
Note: if you want to use the timestamp fields as timestamps you will need to look in the later section 'Query for Top Hashtags' for the bright way to create the table from the JSON or CSV file.
6. (Still in duckdb
shell): select count(1) from tweets;
The output from this should show you the number of tweets in the original tweets.json you started with in step 2 that you can cross reference using the command jq '. | length' tweets.json
to validate.
If those numbers match, your raw tweet archive has successfully been imported into duckdb (without entities).
Example Queries to Analyze your Twitter Archive
Who cares? Now we can query the DuckDB database for tweets that match the criterion you are looking to migrate.
Query for Popular Tweets
All of us indulge in a little vanity, so I suspect the most common simple query will be to retrieve tweets that gained significant popularity, with more than, say, 10 favorites and 3 retweets, where you can use the following SQL query:
SELECT *
FROM tweets
WHERE favorite_count > 10
AND retweet_count > 3;
Adjust the thresholds based on what you deem consequential for each of your Twitter accounts.
Query for Tweets by Year
Sure, you can do this with an advanced Twitter search query in the web UI for your own tweets (and others) but surely soon that will be behind a paywall, so if you're interested in analyzing your tweet activity over time, you can use the following SQL query to retrieve the number of tweets within each calendar year and sort by descending year:
SELECT
count(1),
datepart('year', created_at) AS year
FROM tweets
GROUP BY year
ORDER BY year DESC;
When I first ran this I got the following scary error:
Error: Binder Error: No function matches the given name and argument types 'datepart(VARCHAR, VARCHAR)'.
You might need to add explicit type casts.
Candidate functions:
datepart(VARCHAR, DATE) -> BIGINT
datepart(VARCHAR, TIMESTAMP) -> BIGINT
datepart(VARCHAR, TIME) -> BIGINT
datepart(VARCHAR, INTERVAL) -> BIGINT
datepart(VARCHAR[], DATE) -> STRUCT()
datepart(VARCHAR[], TIMESTAMP) -> STRUCT()
datepart(VARCHAR[], TIME) -> STRUCT()
datepart(VARCHAR[], INTERVAL) -> STRUCT()
datepart(VARCHAR, TIMESTAMP WITH TIME ZONE) -> BIGINT
datepart(VARCHAR[], TIMESTAMP WITH TIME ZONE) -> STRUCT()
LINE 1: select count(1), datepart('year', created_at) as year fr...
The reason for this is that when we imported the JSON previously via read_json_auto
function, DuckDB auto determined the data types for the columns. If we look at the schema for the auto detected column types we see that created_at
was set to VARCHAR
(a text type) instead of TIMESTAMPTZ
.
SHOW TABLE tweets;
The output for me looked like:
│ column_name │ column_type │ null │
│ varchar │ varchar │ varchar │
├────────────────┼─────────────┼─────────┼
│ id │ BIGINT │ YES │
│ favorite_count │ BIGINT │ YES │
│ retweet_count │ BIGINT │ YES │
│ created_at │ VARCHAR │ YES │
│ full_text │ VARCHAR │ YES │
The way to fix this is by telling read_json_auto
how to parse timestamps with a set format.
The following is what I used to create a new table tweets2
which will parse created_at
as a TIMESTAMP with timezone:
CREATE TABLE tweets3 AS
SELECT * FROM read_json_auto(
'./flat_tweets.json',
timestampformat='%a %b %d %H:%M:%S %z %Y'
);
Now for the tweets2
schema we get the following when we run SHOW TABLE tweets2
:
│ column_name │ column_type │ null │
│ varchar │ varchar │ varchar │
├────────────────┼─────────────┼─────────┼
│ id │ BIGINT │ YES │
│ favorite_count │ BIGINT │ YES │
│ retweet_count │ BIGINT │ YES │
│ created_at │ TIMESTAMP │ YES │
│ full_text │ VARCHAR │ YES │
Now we can run the original SQL query on the new table like so:
│ column_name │ column_type │ null │
│ varchar │ varchar │ varchar │
├────────────────┼─────────────┼─────────┼
│ id │ BIGINT │ YES │
│ favorite_count │ BIGINT │ YES │
│ retweet_count │ BIGINT │ YES │
│ created_at │ TIMESTAMP │ YES │
│ full_text │ VARCHAR │ YES │
And we will get something like the following:
┌──────────┬───────┐
│ count(1) │ year │
│ int64 │ int64 │
├──────────┼───────┤
│ 952 │ 2023 │
│ 3721 │ 2022 │
│ 4203 │ 2020 │
│ 4475 │ 2019 │
│ 4128 │ 2018 │
│ 9402 │ 2017 │
│ 5689 │ 2016 │
│ 2321 │ 2015 │
│ 3618 │ 2014 │
│ 2320 │ 2013 │
│ 2831 │ 2012 │
│ 2303 │ 2011 │
│ 624 │ 2010 │
├──────────┴───────┤
│ 13 rows │
└──────────────────┘
Query for Top Hashtags
For hashtags, we will need to go back to the data shaping phase, probably using jq
again.
We will need to get the hashtag per tweet data to import into a new table from the JSON but first let's do some spelunking to get a feel for the data:
jq '[.[].tweet.entities | select(.hashtags | length > 0)] | length' tweets.json
Breaking down this query we get:
[ expr ]
: surrounding another expression in square brackets creates a new array..[].tweet.entities
: This digs into thetweet
key of each item in the outer array and returns theentities
object.map(select(.hashtags | length > 0))
: This applies the select function to each object in the array, filtering out the objects that have a non-empty hashtags array.length
This calculates the length of the resulting array, giving you the count of objects that passed the filter.
This query gets us the count of all objects in the outer array that has a non-empty hashtags
array.
The result here gives us an idea about the number of tweets we used hashtags in. In my Twitter Archive data, I used hashtags in about 6% of my tweets.
To create a hashtags
join table with a tweet id
and hashtag text we can create a JSON file containing an array with an object that looks like { "id": "123455667", "hashtags": [ "first", "second", "third" ] }
. We can do that with the following jq
command:
jq '[.[].tweet | select(.entities.hashtags | length > 0) | { "id": .id, "hashtags": [.entities.hashtags[].text] }]' \
tweets.json > hashtags.json
We can then import the JSON into a table with a 1-1 mapping between tweet id
and hashtag
text like so:
CREATE TABLE hashtags AS
SELECT id, unnest(hashtags) AS hashtag
FROM read_json_auto('./hashtags.json');
Our new hashtags
table has a schema that looks like this:
│ column_name │ column_type │ null │
│ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼
│ id │ BIGINT │ YES │
│ hashtag │ VARCHAR │ YES │
Now to stitch together my top 10 hashtags of all time we can create the following SQL query:
SELECT h.hashtag, COUNT(*) AS count
FROM hashtags AS h
JOIN tweets AS t ON h.id = t.id
GROUP BY h.hashtag
ORDER BY count DESC
LIMIT 10;
Here we are joining the hashtags
table with the tweets
table by id
then grouping by the hashtag
column in the hashtags
table and sorting the results in descending order by the count of each hashtag then cutting off after the first ten.
Obviously there is more we could do here to make this a better schema w.r.t. user mentions, edits, urls, symbols, etc.
Concluding thoughts
The above walked you through santizing and shaping your tweet archive via jq
and then importing it into DuckDB to be able to query with the familiar SQL.
Happy exploring with your Twitter archive in DuckDB! Cheers!
If you enjoyed this content, please consider sharing this link with a friend, following my GitHub or LinkedIn accounts, or subscribing to my RSS feed.