Since I heard from a number of people interested in extracting the data from their Twitter archive, I thought it would be beneficial to describe the steps.
Assumptions
you have already requested your account's Twitter archive via the Twitter site
your Twitter archive is currently available for download
you have the jq and duckdb command-line tools installed or know how to install them. If you use Nix as your package manager it should be as simple as running nix-shell -p jq duckdb, which enters you into a temporary shell with those two commands installed and brought into scope.
Instructions
0. Download and unzip the zip archive:
Before you can start working with your Twitter archive, you need to download it from your Twitter account settings. Once you have downloaded the zip archive, unzip it to access the files inside; this will extract the files and folders contained in the archive to your current location.
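For example, assuming the downloaded file is called twitter-archive.zip (your actual filename will differ and typically includes your account name and a date):
unzip twitter-archive.zip -d twitter-archive
cd twitter-archive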
1. Edit data/tweets.js, saving pure JSON to tweets.json. Only the first line needs changing: remove the global assignment, leaving only JSON:
The first step is to extract the tweets data from the tweets.js file. This file contains your tweets data in JavaScript Object Notation (JSON) format. However, the file also contains a global assignment that wraps the JSON data. To extract the pure JSON, edit the first line of tweets.js and remove the assignment prefix so the line starts with the opening bracket of the JSON array. Once you have removed the prefix, save the remaining JSON data to a new file called tweets.json.
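If you would rather not edit the file by hand, a one-liner like the following should work, assuming the first line begins with a prefix along the lines of window.YTD.tweets.part0 = (check your own archive, as the exact variable name may differ):
sed '1s/^window\.YTD\.tweets\.part0 = //' data/tweets.js > tweets.json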
2. Flatten tweet structure to essential fields
jq -r '.[].tweet | {id: .id_str, favorite_count, retweet_count, created_at, full_text}' tweets.json > flat_tweets.json
The second step involves using the jq command to extract and filter the tweet data from the tweets.json file. The jq command is a command-line tool for parsing and manipulating JSON data. In this step, it extracts the tweet data from the tweets.json file and saves the flattened tweet JSON output to a new file called flat_tweets.json. The jq command uses a filter expression to specify which data to extract; in this case, the filter expression extracts the id (from id_str), favorite_count, retweet_count, created_at, and full_text fields for each tweet in the tweets.json file.
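flat_tweets.json will then contain a stream of flattened tweet objects, each of which looks roughly like this (shown compacted here, with hypothetical values):
{"id": "1234567890123456789", "favorite_count": "12", "retweet_count": "4", "created_at": "Mon Oct 10 20:19:24 +0000 2022", "full_text": "Hello, world!"}
Note that in the archive the counts are strings, which is why the next step pipes them through tonumber.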
3. Slurp flat_tweets.json content and create CSV file via jq
jq -r 'map({id, favorite_count: .favorite_count | tonumber, retweet_count: .retweet_count | tonumber, full_text, created_at}) | (first | keys_unsorted) as $keys | map([to_entries[] | .value]) as $rows | $keys,$rows[] | @csv' -s ./flat_tweets.json > flat_tweets.csv
The third step involves using the jq command again to convert the tweet data from flat_tweets.json to a CSV file. CSV stands for Comma Separated Values and is a common format for storing and exchanging tabular data. In this step, the jq command uses a filter expression to convert the tweet data from JSON to CSV format and save it to a new file called flat_tweets.csv.
Here is a breakdown of the filter expressions in this command:
map({id, favorite_count: .favorite_count | tonumber, retweet_count: .retweet_count | tonumber, full_text, created_at}): This expression uses the map function to transform each element in the input array (which represents a tweet) into a new object with the specified properties. The favorite_count and retweet_count properties get converted to numbers using the builtin tonumber function.
(first | keys_unsorted) as $keys: This expression extracts the keys of the first element in the input array, then assigns them to a variable called $keys. The keys_unsorted builtin function returns the keys in the order they appear in the object, rather than sorting them alphabetically. These will form the CSV file header row.
map([to_entries[] | .value]) as $rows: This expression uses the map function to transform each element in the input array into an array of values, by extracting the values of each property using the to_entries builtin. The resulting arrays of values get assigned to a variable named $rows.
$keys,$rows[] | @csv: This expression outputs the contents of the $keys and $rows variables as CSV rows. The @csv format string converts each input array into a line of CSV.
Finally, the -s (slurp) command-line option tells jq to read the entire input into a single array before applying the filter, which is needed here because flat_tweets.json contains a stream of objects rather than a single array, and the > flat_tweets.csv part of the command redirects the standard output to a file called flat_tweets.csv.
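The first lines of flat_tweets.csv should then look roughly like this (hypothetical values again):
"id","favorite_count","retweet_count","full_text","created_at"
"1234567890123456789",12,4,"Hello, world!","Mon Oct 10 20:19:24 +0000 2022"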
4. Create DuckDB database file
duckdb tweets.duckdb
The fourth step involves using the duckdb command to create a new DuckDB database. DuckDB is an in-process SQL database designed to be embeddable, lightweight and fast. In this step, the duckdb command creates a new DuckDB database file called tweets.duckdb and opens the duckdb shell immediately after.
5. (In duckdb shell): CREATE TABLE tweets AS SELECT * FROM read_csv_auto('./flat_tweets.csv');
The fifth step involves using the CREATE TABLE SQL command to create a new table in the DuckDB database, populated with the data produced by a SELECT over the read_csv_auto CSV import function. Note that the schema is automatically inferred by read_csv_auto from the data in the CSV file.
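You can inspect the inferred schema with DESCRIBE (the exact column types depend on your data; the counts will likely come out as an integer type and the text columns as VARCHAR):
DESCRIBE tweets;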
6. (Still in duckdb shell): select count(1) from tweets;
The output from this should show you the number of tweets in the original tweets.json you started with in step 2, which you can cross-reference using the command jq '. | length' tweets.json to validate.
Who cares?
Now you can query the database for tweets that match the criteria you are looking to migrate.
For example, you might want to only republish "popular" tweets, so you might write a SELECT query like the following:
SELECT * from tweets WHERE favorite_count > 10 AND retweet_count > 3;
This is totally arbitrary but I wanted to explain why having the tweets in DuckDB is useful. :)
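If you then want those rows back out as a file, DuckDB's COPY statement can export a query result to CSV; for example (the output filename here is arbitrary):
COPY (SELECT * FROM tweets WHERE favorite_count > 10 AND retweet_count > 3) TO 'popular_tweets.csv' (HEADER, DELIMITER ',');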
You might also want to extract different data into CSV and import it into additional tables in the same DuckDB database file describing the related entities, so you can do more interesting joins. For me this wasn't an interesting use case yet.
Cheers!
By Susan Potter
Susan is a software development entrepreneur, team lead, systems architect, developer, enabler, wearer of many hats with a career spanning almost 25 years specifically in software development at varying levels. Even though she prefers developing in referentially transparent functional codebases, she is not above working in whatever language is necessary.