
Make CX Corpora dump generation incremental
Open, Stalled, Low, Public

Description

The CX Corpora dumps we provide at https://dumps.wikimedia.org/other/contenttranslation/ are regenerated each time from the entire set of published CX data.
This assumes the cx_corpora table will keep the data for all published translations forever. Due to T183890: Remove very old translation drafts from CX database, we plan to change that assumption and remove old published translations.

This means CX Corpora dump generation will use a start time and an end time to fetch the data, and a user of these dumps will need to collect the dumps for all such time intervals. For example, if we generate dumps monthly, the en-es dump will be cx-corpora.en2es.201706.text.json.gz + cx-corpora.en2es.201707.text.json.gz + cx-corpora.en2es.201708.text.json.gz and so on.
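
Concretely, an incremental run could map each interval to a timestamp window and an output file name. A minimal sketch of that mapping, assuming MediaWiki-style 14-digit timestamps and the file-name pattern shown above (this is illustrative only, not the actual dump script):

```python
# Hypothetical sketch: map a monthly interval to the dump file name and a
# timestamp window that an incremental run could use when selecting rows.
# The file-name pattern follows the example above; the 14-digit timestamps
# and filtering on a save timestamp are assumptions.
from datetime import date

def month_interval(year: int, month: int):
    """Return (start, end) as MediaWiki-style YYYYMMDDHHMMSS timestamps."""
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start.strftime("%Y%m%d000000"), end.strftime("%Y%m%d000000")

def dump_filename(source: str, target: str, year: int, month: int) -> str:
    """Build the per-interval file name, e.g. cx-corpora.en2es.201706.text.json.gz."""
    return f"cx-corpora.{source}2{target}.{year}{month:02d}.text.json.gz"

start, end = month_interval(2017, 6)
print(dump_filename("en", "es", 2017, 6), start, end)
```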

This also implies that the dumps, once generated, should never be deleted.

Event Timeline

santhosh triaged this task as Medium priority. Jan 4 2018, 10:41 AM
santhosh created this task.

Is it acceptable to make the dumps harder to use for end users by splitting them into multiple files over time? Because drafts can change over time, users would need to take into account that newer dumps might contain content for the same drafts, which they would need to merge (see the sketch after the list below).

  1. If not, how do we build a reliable way to do incremental dumps that result in files that are not split over time?
  2. We would no longer use --threshold to reduce the number of files we generate (it makes little sense, as the groupings would appear random to end users). Do we produce only one all2all file, or each pair separately? The former would force people to download all the files regardless of the languages they are interested in; the latter could lead to us producing too many files.
  3. How about a case where we would need to purge some content? Currently we can just remove it from the DB and remove the dumps, because the next dump run would generate a clean dump.
  4. Is monthly frequency enough? We currently do weekly. In my research proposal I argue that we want to build a short cycle of continuous improvements. For that, even weekly dumps could be too slow, if the dumps are going to be used for that at all.
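
To illustrate the merging concern from the first question above: a consumer holding several monthly dump files would need to collapse duplicate entries for the same draft, keeping the newest one. A minimal sketch, assuming each dump is a gzipped JSON array whose entries carry an identifier and a save timestamp (the field names "id" and "timestamp" are assumptions; the real dump schema may differ):

```python
# Hypothetical consumer-side merge of monthly CX corpora dump files.
# Assumption: each file is a gzipped JSON array of objects with an "id"
# identifying the translation unit and a "timestamp" saying when it was
# saved; the newest entry per id wins.
import gzip
import json

def merge_dumps(paths):
    latest = {}
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for entry in json.load(f):
                key = entry["id"]
                current = latest.get(key)
                if current is None or entry["timestamp"] > current["timestamp"]:
                    latest[key] = entry
    return list(latest.values())

merged = merge_dumps([
    "cx-corpora.en2es.201706.text.json.gz",
    "cx-corpora.en2es.201707.text.json.gz",
    "cx-corpora.en2es.201708.text.json.gz",
])
print(len(merged), "unique translation units")
```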

I spoke with Ariel. From his point of view, we are changing the permanent storage of some data from the database to dumps. The dump server(s) are already well utilized, and there is no guarantee that the data will stay there forever. Even now we keep only the last 14 dumps.

We could treat published drafts as secondary data and not care if we lose them, but I don't think that is a good idea. This is useful data, after all, and it cannot be recovered later. We should also explore other options for permanently storing this data.

Nikerabbit changed the task status from Open to Stalled. Dec 10 2019, 9:31 AM
Nikerabbit lowered the priority of this task from Medium to Low.

This is an alternative to T189093: Alternative storage for old published drafts, and I think that approach is preferable to this one. There may be other reasons for doing incremental dumps, such as performance and resource use, but currently I see no need to go this way.