
Snapshot data format
Here are the details on where the OpenAlex data lives and how it's structured.
  • All the data is stored in Amazon S3, in the openalex bucket.
  • The data files are gzip-compressed JSON Lines, one row per entity.
  • The bucket contains one prefix (folder) for each entity type: work, author, venue, institution, and concept.
  • Records are partitioned by updated_date. Within each entity type prefix, each object (file) is further prefixed by this date. For example, if an Author has an updated_date of 2021-12-30, it will be prefixed with /data/authors/updated_date=2021-12-30/.
    • If you're initializing a fresh snapshot, the updated_date partitions aren't important yet. You need all the entities, so for Authors you would get /data/authors/*/*.gz
  • There are multiple objects under each updated_date partition. Each is under 2GB.
  • The manifest file is JSON (in Redshift manifest format) and lists all the data files for one entity type. For example, /data/works/manifest lists all the data files for works.
The structure of each entity type is documented here: Work, Author, Venue, Institution, and Concept.
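As a minimal sketch of what "gzip-compressed JSON Lines" means in practice, the function below streams one already-downloaded data file and yields one entity per line. It uses only the Python standard library. The name iter_records and the file name in the usage note are illustrative and not part of OpenAlex.

```python
import gzip
import json
from typing import Any, Dict, Iterator


def iter_records(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one parsed entity per non-empty line of a gzip-compressed JSON Lines file."""
    # "rt" opens the gzip stream in text mode, so the file can be read line by line
    # without decompressing it to disk first.
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines such as a trailing newline
                yield json.loads(line)
```

For example, `for author in iter_records("0000_part_00.gz"): ...` would process one Author at a time, which keeps memory use flat even for files close to the 2GB limit.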

Downloading updated Entities

Once you have a copy of the snapshot, you'll probably want to keep it up to date. The updated_date partitions make this easy, but the way they work may be unfamiliar. Unlike a set of dated snapshots that each contain the full dataset as of a certain date, each partition contains the records that last changed on that date.
If we imagine launching OpenAlex on 2021-12-30 with 1000 Authors, each being newly created on that date, /data/authors/ looks like this:
/data/authors/
├── manifest
└── updated_date=2021-12-30 [1000 Authors]
    ├── 0000_part_00.gz
    ...
    └── 0031_part_00.gz
If, on 2022-01-04, we made changes to 50 of those Authors, they would come out of one of the files in /data/authors/updated_date=2021-12-30 and go into one in /data/authors/updated_date=2022-01-04:
/data/authors/
├── manifest
├── updated_date=2021-12-30 [950 Authors]
│   ├── 0000_part_00.gz
│   ...
│   └── 0031_part_00.gz
└── updated_date=2022-01-04 [50 Authors]
    ├── 0000_part_00.gz
    ...
    └── 0031_part_00.gz
If we also discovered 50 new Authors, they would go in that same partition, so the totals would look like this:
/data/authors/
├── manifest
├── updated_date=2021-12-30 [950 Authors]
│   ├── 0000_part_00.gz
│   ...
│   └── 0031_part_00.gz
└── updated_date=2022-01-04 [100 Authors]
    ├── 0000_part_00.gz
    ...
    └── 0031_part_00.gz
So if you made your copy of the snapshot on 2021-12-30, you would only need to download /data/authors/updated_date=2022-01-04 to get everything that was changed or added since then.
To update a snapshot copy that you created or updated on date X, insert or update the records in objects where updated_date > X.
You never need to go back for a partition you've already downloaded. Anything that changed isn't there anymore; it's in a new partition.
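The rule "updated_date > X" can be sketched as a filter over object keys. This is only an illustration: the function names are hypothetical, and it assumes you already have a list of keys (for example from a bucket listing or the manifest) that follow the updated_date=YYYY-MM-DD layout shown above.

```python
import re
from datetime import date
from typing import Iterable, List, Optional

# Matches the partition component of a key, e.g. "updated_date=2022-01-04"
_PARTITION_RE = re.compile(r"updated_date=(\d{4}-\d{2}-\d{2})")


def partition_date(key: str) -> Optional[date]:
    """Return the updated_date encoded in an object key, or None if there is none."""
    m = _PARTITION_RE.search(key)
    return date.fromisoformat(m.group(1)) if m else None


def keys_updated_after(keys: Iterable[str], last_sync: date) -> List[str]:
    """Keep only keys whose partition date is strictly later than last_sync."""
    selected = []
    for key in keys:
        d = partition_date(key)
        # Keys without a partition (such as the manifest itself) are skipped.
        if d is not None and d > last_sync:
            selected.append(key)
    return selected
```

With the example layout above and a copy made on 2021-12-30, calling `keys_updated_after(keys, date(2021, 12, 30))` would select only the objects under updated_date=2022-01-04.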
At the time of writing, these are the Author partitions and the number of records in each (in the actual dataset):
  • updated_date=2021-12-30/ - 62,573,099
  • updated_date=2021-12-31/ - 97,559,192
  • updated_date=2022-01-01/ - 46,766,699
  • updated_date=2022-01-02/ - 1,352,773
This reflects the creation of the dataset on 2021-12-30 and 145,678,664 combined updates and inserts since then - 1,352,773 of which were on 2022-01-02. Over time, the number of partitions will grow. If we make a change that affects all records, the partitions before the date of the change will disappear.
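The combined figure is simply the sum of every partition after the first one. A quick check, using the counts in the order they are listed above:

```python
# Record counts per Author partition, in the order listed above.
partition_counts = [62_573_099, 97_559_192, 46_766_699, 1_352_773]

created_at_launch = partition_counts[0]        # records still in the first partition
updated_or_inserted = sum(partition_counts[1:])  # everything in later partitions
total_authors = sum(partition_counts)

print(updated_or_inserted)  # 145678664
print(total_authors)        # 208251763
```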

The manifest file

When we start writing a new updated_date partition for an entity, we'll delete that entity's manifest file. When we finish writing the partition, we'll recreate the manifest, including the newly-created objects. So if the manifest is there, all the entities are there too.
The file is in Redshift manifest format. To use it as part of the update process for an entity type (we'll keep using Authors as an example):
  1. Download s3://openalex/data/authors/manifest.
  2. Get the file list from the url property of each item in the entries list.
  3. Download any objects with an updated_date you haven't seen before.
  4. Download s3://openalex/data/authors/manifest again. If it hasn't changed since (1), no records moved around and any date partitions you downloaded are valid.
  5. Decompress the files you downloaded and parse one JSON Author per line. Insert or update into your database of choice, using each entity's ID as a primary key.
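The steps above can be sketched as follows. This is a rough outline under stated assumptions, not an official client: fetch_manifest and download are hypothetical callables you would supply (for example, wrappers around your S3 tool of choice), seen_dates is a set of YYYY-MM-DD strings you have already processed, raising an error when the manifest changes is one possible policy, and the upsert helper assumes each record carries its ID in an "id" field and uses a plain dict as a stand-in for a database.

```python
import json
import re
from typing import Callable, Dict, Iterable, List, Set

_DATE_RE = re.compile(r"updated_date=(\d{4}-\d{2}-\d{2})")


def manifest_urls(manifest_text: str) -> List[str]:
    """Step 2: collect the url property of each item in the entries list."""
    return [entry["url"] for entry in json.loads(manifest_text)["entries"]]


def unseen_urls(urls: Iterable[str], seen_dates: Set[str]) -> List[str]:
    """Step 3: keep objects whose updated_date has not been seen before."""
    out = []
    for url in urls:
        m = _DATE_RE.search(url)
        if m and m.group(1) not in seen_dates:
            out.append(url)
    return out


def run_update(
    fetch_manifest: Callable[[], str],
    download: Callable[[str], None],
    seen_dates: Set[str],
) -> List[str]:
    """Steps 1-4: download unseen objects, then confirm the manifest is unchanged."""
    before = fetch_manifest()                                # step 1
    todo = unseen_urls(manifest_urls(before), seen_dates)    # steps 2 and 3
    for url in todo:
        download(url)
    after = fetch_manifest()                                 # step 4
    if after != before:
        # Records may have moved between partitions while we were downloading.
        raise RuntimeError("manifest changed during download; start over")
    return todo


def upsert(store: Dict[str, dict], records: Iterable[dict]) -> None:
    """Step 5: insert or overwrite each record, keyed by its ID (assumed field: "id")."""
    for record in records:
        store[record["id"]] = record
```

Passing the network operations in as callables keeps the sketch independent of any particular S3 library and makes the control flow easy to test with in-memory stand-ins.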
If you’ve worked with a dataset like this before and have a toolchain picked out, this may be all you need to know. If you want more detailed steps, proceed to download the data.