If an Author has an updated_date of 2021-12-30, it will be prefixed /data/authors/updated_date=2021-12-30/.
When you first download the data, the updated_date partitions aren't important yet. You need all the entities, so for Authors you would get /data/authors/*/*.gz.
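As a rough illustration of this layout, the sketch below pulls the entity type and updated_date out of an object key and checks the key against the /data/authors/*/*.gz pattern. The file name part_000.gz and the helper parse_key are assumptions made for the example, not names confirmed by this page.

```python
import re
from datetime import date
from fnmatch import fnmatch

# Keys shaped like /data/<entity>/updated_date=YYYY-MM-DD/<file>.gz
KEY_RE = re.compile(
    r"^/?data/(?P<entity>[^/]+)/updated_date=(?P<day>\d{4}-\d{2}-\d{2})/[^/]+\.gz$"
)

def parse_key(key):
    """Return (entity, updated_date) for a data object key, or None if it is not one."""
    match = KEY_RE.match(key)
    if match is None:
        return None
    return match.group("entity"), date.fromisoformat(match.group("day"))

# Hypothetical object key; only the prefix layout comes from the text above.
key = "/data/authors/updated_date=2021-12-30/part_000.gz"

print(parse_key(key))
# fnmatch's "*" also matches "/", so this check is looser than a shell glob.
print(fnmatch(key, "/data/authors/*/*.gz"))
```

Keys that do not sit under an updated_date prefix, such as a manifest, return None, which keeps them out of the list of data files.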
Each updated_date partition can hold several objects. Each is under 2GB. /data/works/manifest lists all the objects for works.
The updated_date partitions make it easy to keep a copy up to date, but the way they work may be unfamiliar. Unlike a set of dated snapshots that each contain the full dataset as of a certain date, each partition contains the records that last changed on that date.
Suppose the dataset launched on 2021-12-30 with a set of Authors, each being newly created on that date. /data/authors/ would then contain a single partition, updated_date=2021-12-30.
If some of those Authors were later changed on 2022-01-04, they would come out of one of the files in /data/authors/updated_date=2021-12-30 and go into one in /data/authors/updated_date=2022-01-04.
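The way records move between partitions can be simulated with a toy model. This is only a sketch: the record IDs A1 to A5 are made up, and build_partitions is a hypothetical helper that groups records by the date they last changed.

```python
from collections import defaultdict

def build_partitions(records):
    """Group record IDs by the date on which each record last changed."""
    partitions = defaultdict(set)
    for record_id, updated_date in records.items():
        partitions[updated_date].add(record_id)
    return dict(partitions)

# Hypothetical IDs; every record is created on the launch date.
records = {f"A{i}": "2021-12-30" for i in range(1, 6)}
before = build_partitions(records)

# Two records change on 2022-01-04, so their updated_date moves forward
# and they leave the 2021-12-30 partition.
records["A1"] = "2022-01-04"
records["A2"] = "2022-01-04"
after = build_partitions(records)

print(sorted(before))
print({day: sorted(ids) for day, ids in sorted(after.items())})
```

Each record lives in exactly one partition at a time, which is why an older partition shrinks when its records change.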
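If you made your copy of the snapshot on 2021-12-30, you would only need to download /data/authors/updated_date=2022-01-04 to get everything that was changed or added since then.
To update a copy of the snapshot that you created or last updated on date X, insert or update the records in objects where updated_date > X.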
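The updated_date > X rule can be sketched as a small filter over partition names. The function name partitions_to_fetch and the sample list are assumptions for illustration; only the updated_date=YYYY-MM-DD naming comes from the text.

```python
from datetime import date

def partitions_to_fetch(partition_names, last_sync):
    """Return the partitions whose updated_date is strictly after last_sync."""
    selected = []
    for name in partition_names:
        # "updated_date=2022-01-04/" -> "2022-01-04"
        day = date.fromisoformat(name.rstrip("/").split("=", 1)[1])
        if day > last_sync:
            selected.append(name)
    # ISO dates sort chronologically when sorted as strings.
    return sorted(selected)

available = ["updated_date=2021-12-30/", "updated_date=2022-01-04/"]
print(partitions_to_fetch(available, date(2021, 12, 30)))
```

The comparison is strict, so a partition dated exactly X is not fetched again.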
These are the Author partitions and the number of records in each (in the actual dataset):
updated_date=2021-12-30/ - 62,573,099
updated_date=2021-12-31/ - 97,559,192
updated_date=2022-01-01/ - 46,766,699
updated_date=2022-01-02/ - 1,352,773
The manifest file
When we start writing a new updated_date partition for an entity, we'll delete that entity's manifest file. When we finish writing the partition, we'll recreate the manifest, including the newly-created objects. So if manifest is there, all the entities are there too.
To avoid reading from a partition that is still being written:
1. Download s3://openalex/data/authors/manifest.
2. Get the file list from the url property of each item in the entries list.
3. Download any objects with an updated_date you haven't seen before.
4. Download s3://openalex/data/authors/manifest again. If it hasn't changed since (1), no records moved around and any date partitions you downloaded are valid.
5. Decompress the files you downloaded and read one Author record per line. Insert or update into your database of choice, using each entity's ID as a primary key.
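The manifest check and the final upsert might be sketched as follows. Several things here are assumptions rather than facts from this page: the manifest and the data files are treated as JSON, each record is assumed to carry its ID in an "id" field, and fetch_manifest and fetch_object are hypothetical callables you would supply (for example, thin wrappers around an S3 client). A plain dict stands in for the database.

```python
import gzip
import json

def object_urls(manifest_text):
    """Return the url of every item in the manifest's entries list."""
    manifest = json.loads(manifest_text)
    return [entry["url"] for entry in manifest["entries"]]

def partition_of(url):
    """Extract 'YYYY-MM-DD' from a url containing updated_date=YYYY-MM-DD."""
    return url.split("updated_date=", 1)[1].split("/", 1)[0]

def sync(fetch_manifest, fetch_object, seen_dates, db):
    """Fetch unseen partitions; return True only if the manifest stayed unchanged."""
    first = fetch_manifest()
    # Keep only objects from partitions not downloaded before.
    urls = [u for u in object_urls(first) if partition_of(u) not in seen_dates]
    blobs = [fetch_object(u) for u in urls]
    # Re-read the manifest; any difference means a partition was being written.
    if fetch_manifest() != first:
        return False
    for blob in blobs:
        for line in gzip.decompress(blob).decode("utf-8").splitlines():
            if line.strip():
                record = json.loads(line)
                db[record["id"]] = record  # upsert keyed by the assumed "id" field
    seen_dates.update(partition_of(u) for u in urls)
    return True
```

Holding every object in memory is only for brevity. With files of up to 2GB, a real job would more likely stream each one to disk and load it from there, and would retry later whenever sync returns False.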