
Make CX Corpora dump generation incremental
Open, Stalled, Low, Public

Description

The CX Corpora dumps we provide at https://dumps.wikimedia.org/other/contenttranslation/ are regenerated each time from the entire set of published CX data.
This assumes the cx_corpora table will keep the data for all published translations forever. Due to T183890: Remove very old translation drafts from CX database, we plan to change that assumption and remove old published translations.

This means CX Corpora dump generation will use a start time and an end time to fetch the data, and a user of these dumps will need to collect the dumps for all such time intervals. For example, if we generate dumps monthly, the en-es dump will be cx-corpora.en2es.201706.text.json.gz + cx-corpora.en2es.201707.text.json.gz + cx-corpora.en2es.201708.text.json.gz and so on.
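
Concretely, an incremental run could map each interval to a timestamp window and an output file name. A minimal sketch of that mapping, assuming MediaWiki-style 14-digit timestamps and the file-name pattern shown above (this is illustrative only, not the actual dump script):

```python
# Hypothetical sketch: map a monthly interval to the dump file name and a
# timestamp window that an incremental run could use when selecting rows.
# The file-name pattern follows the example above; the 14-digit timestamps
# and filtering on a save timestamp are assumptions.
from datetime import date

def month_interval(year: int, month: int):
    """Return (start, end) as MediaWiki-style YYYYMMDDHHMMSS timestamps."""
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start.strftime("%Y%m%d000000"), end.strftime("%Y%m%d000000")

def dump_filename(source: str, target: str, year: int, month: int) -> str:
    """Build the per-interval file name, e.g. cx-corpora.en2es.201706.text.json.gz."""
    return f"cx-corpora.{source}2{target}.{year}{month:02d}.text.json.gz"

start, end = month_interval(2017, 6)
print(dump_filename("en", "es", 2017, 6), start, end)
```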

This also implies that the dumps, once generated, should never be deleted.

Event Timeline

santhosh triaged this task as Medium priority. Jan 4 2018, 10:41 AM
santhosh created this task.

Is it acceptable to make the dumps harder to use for end users by splitting them into multiple files over time? Because drafts can change over time, users would need to take into account that newer dumps might contain content for the same drafts, which they would need to merge (see the sketch after the list below).

  1. If not, how do we build a reliable way to do incremental dumps that result in files that are not split over time?
  2. We would no longer use --threshold to reduce the number of files we generate (it makes little sense, as the groupings would appear random to end users). Do we produce only one all2all file, or each pair separately? The former would force people to download all the files regardless of the languages they are interested in; the latter could lead to us producing too many files.
  3. How about a case where we would need to purge some content? Currently we can just remove it from the DB and remove the dumps, because the next dump run would generate a clean dump.
  4. Is monthly frequency enough? We currently do weekly. In my research proposal I argue that we want to build a short cycle of continuous improvements. For that, even weekly dumps could be too slow, if the dumps are going to be used for that at all.
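
To illustrate the merging concern from the first question above: a consumer holding several monthly dump files would need to collapse duplicate entries for the same draft, keeping the newest one. A minimal sketch, assuming each dump is a gzipped JSON array whose entries carry an identifier and a save timestamp (the field names "id" and "timestamp" are assumptions; the real dump schema may differ):

```python
# Hypothetical consumer-side merge of monthly CX corpora dump files.
# Assumption: each file is a gzipped JSON array of objects with an "id"
# identifying the translation unit and a "timestamp" saying when it was
# saved; the newest entry per id wins.
import gzip
import json

def merge_dumps(paths):
    latest = {}
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for entry in json.load(f):
                key = entry["id"]
                current = latest.get(key)
                if current is None or entry["timestamp"] > current["timestamp"]:
                    latest[key] = entry
    return list(latest.values())

merged = merge_dumps([
    "cx-corpora.en2es.201706.text.json.gz",
    "cx-corpora.en2es.201707.text.json.gz",
    "cx-corpora.en2es.201708.text.json.gz",
])
print(len(merged), "unique translation units")
```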

I spoke with Ariel. From his point of view, we are changing the permanent storage of some data from the database to dumps. The dump server(s) are already well utilized, and there is no guarantee that the data will stay there forever. Even now we keep only the last 14 dumps.

We could treat published drafts as secondary data and not care if we lose them, but I don't think that is a good idea. This is useful data, after all, and it cannot be recovered later. We should also explore other options for permanently storing this data.

Nikerabbit changed the task status from Open to Stalled. Dec 10 2019, 9:31 AM
Nikerabbit lowered the priority of this task from Medium to Low.

This is an alternative to T189093: Alternative storage for old published drafts, and I think that approach is preferable to this one. There may be other reasons for doing incremental dumps, such as performance and resource use, but currently I see no need to go this way.