Integrate template parameter alignments in Content Translation to improve automatic template support
Closed, Resolved · Public

Description

Content translation looks for different metadata to transfer the contents from a template in the source document to the equivalent template in the translation.

A machine learning approach was applied in this context (T221211) to identify the mappings/alignments of parameters for the most used templates and language pairs (generated alignments).

This ticket proposes to integrate the generated alignments in Content translation as an additional criterion to consider during the parameter mapping process. When a template is added to the translation for a language pair with alignment data available, the alignments will be used to identify additional mappings that could not be identified with the default approaches. That is, metadata from templateData and parsoid will still be used, and the alignments will surface additional possible mappings that were not considered before.
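The fallback order described above can be sketched as follows. This is only an illustration, not cxserver's actual code: the function name `mapParameters`, the `via` labels, and the use of `Map` objects for the two sources of mappings are all assumptions made for the example.

```javascript
// Hypothetical sketch: prefer the default mapping (templateData/parsoid based),
// and consult the generated alignments only for parameters left unmapped.
function mapParameters(sourceParams, defaultMapping, alignments) {
  const result = {};
  for (const name of sourceParams) {
    if (defaultMapping.has(name)) {
      result[name] = { target: defaultMapping.get(name), via: 'default' };
    } else if (alignments.has(name)) {
      result[name] = { target: alignments.get(name), via: 'alignment' };
    } else {
      result[name] = { target: null, via: 'unmapped' };
    }
  }
  return result;
}

// Illustrative input for an es -> ca template.
const mapped = mapParameters(
  ['autor', 'fecha', 'formato'],
  new Map([['autor', 'autor']]),  // found by the default approaches
  new Map([['fecha', 'data']])    // found only in the alignment data
);
console.log(mapped);
```

With this ordering the alignments can only add mappings; they never override what the default approaches already found.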

Since the alignment information comes with probability data, we need to define a reasonable threshold. In this case I think it makes more sense to err on the side of the information being copied to the wrong parameter (high coverage) rather than being lost (high accuracy), but we may need to experiment and iterate on the exact value.
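To make the threshold trade-off concrete, here is a minimal sketch. The function name `filterByThreshold` and the threshold values 0.7 and 0.5 are assumptions chosen for illustration; the scores are rounded from the es-to-ca example discussed later in this task.

```javascript
// Hypothetical sketch: keep only alignment candidates at or above a threshold.
function filterByThreshold(candidates, threshold) {
  return candidates.filter((c) => c.score >= threshold);
}

const candidates = [
  { source: 'conferencia', target: 'conferència', score: 0.797 },
  { source: 'título', target: 'títol', score: 0.674 },
  { source: 'fecha', target: 'data', score: 0.562 }
];

// A strict threshold favours accuracy, a lenient one favours coverage.
const strict = filterByThreshold(candidates, 0.7);
const lenient = filterByThreshold(candidates, 0.5);
console.log(strict.length, lenient.length); // 1 3
```

Lowering the threshold keeps more parameters from being lost, at the cost of admitting lower-confidence mappings.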

Regarding metrics, it would be great to measure how many templates can be adapted with this method, in general, and compared to those incomplete or not adapted. Depending on the complexity of this, a separate ticket can be created.

Event Timeline

After a brief analysis of the generated JSON mapping, this is what I am planning to do:

The JSON files are big, so parsing them and doing lookups is going to be slow. Even if we load them once and keep them in cache, that is 12 MB for the 210 files covering 15 languages, and it is expected to grow as we add more languages. So I plan to load all this data into an sqlite database and use queries to check whether a mapping exists and to retrieve it.

I wrote a script to do that:

const fs = require('fs'),
  ArgumentParser = require('argparse').ArgumentParser,
  sqlite = require('sqlite'); // https://github.com/kriasoft/node-sqlite

async function createTemplate(db, from, to, templateName) {
  const mapping = await db.get(`SELECT rowid FROM templates
    WHERE source_lang = ? AND target_lang = ? AND template =?`,
  from, to, templateName);
  if (mapping && mapping.rowid) {
    return mapping.rowid;
  }
  const result = await db.run(`INSERT OR IGNORE INTO templates
    (source_lang, target_lang, template) VALUES(?,?,?)`,
  from, to, templateName);
  return result.lastID;
}

async function main(databaseFile, mapping, from, to) {
  const db = await sqlite.open(databaseFile, { Promise });

  await db.run(`CREATE TABLE IF NOT EXISTS templates (
    source_lang TEXT NOT NULL,
    target_lang TEXT NOT NULL,
    template TEXT NOT NULL,
    UNIQUE(source_lang, target_lang, template)
    )`
  );
  await db.run(`CREATE TABLE IF NOT EXISTS mapping (
    template_mapping_id INTEGER NOT NULL,
    source_param TEXT NOT NULL,
    target_param TEXT NOT NULL,
    score REAL NOT NULL,
    UNIQUE(template_mapping_id, source_param, target_param)
    )`);

  for (const templateName in mapping) {
    let mappingId, mappingData = mapping[templateName];
    mappingId = await createTemplate(db, from, to, templateName);
    console.log(`${mappingId} ${from} ${to} ${templateName}`);
    for (let index in mappingData) {
      let paramMapping = mappingData[index];
      if (!mappingId || !paramMapping[from] || !paramMapping[to]) {
        continue;
      }

      await db.run(`INSERT OR IGNORE INTO mapping
        (template_mapping_id, source_param, target_param, score)
        VALUES(?,?,?,?)`,
      mappingId, paramMapping[from], paramMapping[to], paramMapping.d);
      console.log(`${paramMapping[from]} -> ${paramMapping[to]} [${paramMapping.d}]`);
    }
  }
  await db.close();
}


const argparser = new ArgumentParser({
  addHelp: true,
  description: 'Prepare template mapping database'
});

argparser.addArgument(
  ['-d', '--database'],
  {
    help: 'template mapping database file',
    defaultValue: 'templatemapping.db'
  }
);
argparser.addArgument(
  ['-i', '--input'],
  {
    help: 'JSON file with mapping.',
    required: true
  }
);
argparser.addArgument(
  ['--from'],
  {
    help: 'Source language',
    required: true
  }
);
argparser.addArgument(
  ['--to'],
  {
    help: 'Target language',
    required: true
  }
);
const args = argparser.parseArgs();
const databaseFile = args.database;
const input = args.input;
if (!fs.existsSync(input)) {
  throw Error(`File ${input} does not exist`);
}

const mapping = JSON.parse(fs.readFileSync(input));
main(databaseFile, mapping, args.from, args.to);
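Once the data is loaded, a lookup against these two tables could look roughly like the sketch below. The function name `getParameterMappings` is hypothetical, and it assumes the database handle exposes `all(sql, ...params)` with positional parameters in the same style as the `get` and `run` calls in the script above. A stub stands in for the real handle so the sketch runs without sqlite installed.

```javascript
// Hypothetical lookup: fetch all parameter mappings for one template and
// language pair by joining the two tables created by the loading script.
async function getParameterMappings(db, from, to, templateName) {
  return db.all(
    `SELECT m.source_param, m.target_param, m.score
     FROM mapping m
     JOIN templates t ON t.rowid = m.template_mapping_id
     WHERE t.source_lang = ? AND t.target_lang = ? AND t.template = ?`,
    from, to, templateName
  );
}

// Stand-in for a real database handle (assumption, for demonstration only).
const stubDb = {
  async all(sql, ...params) {
    console.log('query params:', params);
    return [{ source_param: 'fecha', target_param: 'data', score: 0.562465710685373 }];
  }
};

getParameterMappings(stubDb, 'es', 'ca', 'Plantilla:Cita conferencia')
  .then((rows) => console.log(rows));
```

An empty result would mean there is no alignment data for that template and language pair, in which case only the default mapping approaches apply.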

Change 517056 had a related patch set uploaded (by Santhosh; owner: Santhosh):
[mediawiki/services/cxserver@master] Add scripts to load the template mapping json to sqlite database

https://gerrit.wikimedia.org/r/517056

Change 517057 had a related patch set uploaded (by Santhosh; owner: Santhosh):
[mediawiki/services/cxserver@master] Integrate template parameter alignments

https://gerrit.wikimedia.org/r/517057

For the purpose of review/testing, I am going to explain a template and its mapping using the database here. This is the same template used for the unit test of the above patch.

Let us use the Cite Conference template, which is a reference template.

Source template: https://es.wikipedia.org/wiki/Plantilla:Cita_conferencia
Target template: https://ca.wikipedia.org/wiki/Plantilla:Citar_confer%C3%A8ncia

You can see that neither of them has template data.

A real content that uses this template:

{{Cita conferencia |
autor=Naciones Unidas |
enlaceautor=Naciones Unidas |
fecha=18 de marzo de 2015 |
año=2015 |
mes=marzo |
título=Tercera Conferencia Mundial de las Naciones Unidas sobre la Reducción del Riesgo de Desastres (Manual de la Conferencia) |
conferencia=Tercera Conferencia Mundial de las Naciones Unidas |
editor=WCDRR |
ubicación=Sendai, Japón |
páginas=21 |
url=http://www.wcdrr.org/uploads/UN-WCDRR-CH-Es.pdf |
formato=PDF |
fechaacceso=8 de mayo de 2015}}

The template alignment system gave us the following mapping and scores:

Source param | Target param | Score
apellido | cognom | 0.733333677522359
año | any | 0.621203767666562
conferencia | conferència | 0.79699115204052
fecha | data | 0.562465710685373
nombre | nom | 0.770534231312654
título | títol | 0.674103936218413
url | url | 0.599459685454743

Source param (es) | Source value | Target param (ca) | Explanation
autor | Naciones Unidas | autor | Even though these params are the same in source and target, the alignment tool did not give this mapping. CXServer's mapping algorithm was used
enlaceautor | Naciones Unidas | - | Not able to map
fecha | 18 de marzo de 2015 | data | Using the database
año | 2015 | any | Using the database
mes | marzo | mes | CXServer algorithm
título | Tercera Conferencia Mundial de las Naciones Unidas sobre la Reducción del Riesgo de Desastres (Manual de la Conferencia) | títol | Using the database
conferencia | Tercera Conferencia Mundial de las Naciones Unidas | conferència | Using the database
editor | WCDRR | editor | CXServer algorithm. Alignment tool did not give this mapping
ubicación | Sendai, Japón | - | Not able to map. Expected: location
páginas | 21 | - | Not able to map. Expected: pages
url | http://www.wcdrr.org/uploads/UN-WCDRR-CH-Es.pdf | url | CXServer algorithm. Alignment tool also gave the mapping
formato | PDF | - | Not able to map. Expected: format
fechaacceso | 8 de mayo de 2015 | - | Not able to map. Expected: consulta

Great work, and thanks for the clear example, @santhosh. It is great to see that this is automatically providing extra mappings that we were not finding before.

It's interesting that the mapping was found for word pairs that are very different such as apellido/cognom, or fecha/data; but it failed to find common similar words such as formato/format or páginas/pàgines. Maybe @diego can confirm whether these were cut out because of the threshold, because those words were not available in the corpora used, or something else.

Given that there are no false positives in the obtained mappings, if this example were representative, we may even consider making the threshold a bit less strict to get some more mappings.

Change 517056 merged by jenkins-bot:
[mediawiki/services/cxserver@master] Add scripts to load the template mapping json to sqlite database

https://gerrit.wikimedia.org/r/517056

Change 517057 merged by jenkins-bot:
[mediawiki/services/cxserver@master] Integrate template parameter alignments

https://gerrit.wikimedia.org/r/517057

Change 537145 had a related patch set uploaded (by KartikMistry; owner: KartikMistry):
[operations/deployment-charts@master] Update cxserver to 2019-09-16-152511-production

https://gerrit.wikimedia.org/r/537145

Change 537145 merged by KartikMistry:
[operations/deployment-charts@master] Update cxserver to 2019-09-16-152511-production

https://gerrit.wikimedia.org/r/537145

Change 537432 had a related patch set uploaded (by KartikMistry; owner: KartikMistry):
[operations/deployment-charts@master] Add templatemapping to cxserver config

https://gerrit.wikimedia.org/r/537432

Change 537432 merged by KartikMistry:
[operations/deployment-charts@master] Add templatemapping to cxserver config

https://gerrit.wikimedia.org/r/537432

Jpita subscribed.

I tried translating https://es.wikipedia.org/wiki/Biblioteca_Nacional_de_Espa%C3%B1a from es to ca. It uses the template mentioned in the earlier comment:

Source template: https://es.wikipedia.org/wiki/Plantilla:Cita_conferencia
Target template: https://ca.wikipedia.org/wiki/Plantilla:Citar_confer%C3%A8ncia

When I translate reference number 12 (using any MT engine), I get a missing reference warning and no parameters are translated.

Looking at the source article, reference 12 is based on Harvnp, not Cita conferencia.

The reference contents in the source article are as follows:

{{Harvnp|Carrión Gútiez|1996|pp=20-21}}

Maybe you are referring to a different reference.

For some reason on cx it appears as 12

Ok. So it seems that inside Content Translation the reference numbers are different because references inside the infobox are not counted, due to the lack of support from Visual Editor (T52896).
Note that the reference in the first paragraph is rendered as [3] in the article view, but as [1] when edited with Visual Editor.

[Screenshots: the same reference shown in read view and in Visual Editor]

Please @Jpita, check again and capture any problematic case on a separate document (e.g., test page under user namespace) so that it is easy to check whether the new mappings were expected to work or not.
Thanks!!

Translating this page from es to ca throws a missing template warning.
According to this task, it should map the template to the target language.

EDIT:
For some unknown reason, now the template is being mapped to the target language.
But only one parameter is being mapped.

This is what I get:

[Screenshots: the translation view, the source reference, and the target reference]

My analysis:

  • Source document has Spanish Cita conferencia template with some parameters filled (nombre, apellido, fecha, url...) and others empty (mes, formato).
  • When added to the translation, a Catalan Citar conferència is added with one parameter filled (url) and another empty ("mes").
  • The alignment information for Spanish-to-Catalan has the mapping for these templates (shown below), including parameters such as "apellido" or "fecha", which were expected to be transferred to the target template in the example but were not. @santhosh, any idea why that may happen? Is any additional info needed for debugging?
"Plantilla:Cita conferencia": [{
  "d": 0.20300884795954766,
  "ca": "confer\u00e8ncia",
  "es": "conferencia"
}, {
  "d": 0.22946576868734592,
  "ca": "nom",
  "es": "nombre"
}, {
  "d": 0.26666632247764077,
  "ca": "cognom",
  "es": "apellido"
}, {
  "d": 0.3258960637815874,
  "ca": "t\u00edtol",
  "es": "t\u00edtulo"
}, {
  "d": 0.3787962323334376,
  "ca": "any",
  "es": "a\u00f1o"
}, {
  "d": 0.40054031454525685,
  "ca": "url",
  "es": "url"
}, {
  "d": 0.4375342893146268,
  "ca": "data",
  "es": "fecha"
}]
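Comparing this JSON with the score table earlier in the task, the scores there appear to equal 1 - d, up to rounding (for example, 1 - 0.4375342893146268 gives the 0.562465710685373 listed for fecha/data). A minimal sketch of that conversion follows. The function name `toScoredMappings` is an assumption, and where the conversion actually happens in cxserver is not stated in this task.

```javascript
// Two entries copied from the alignment JSON above.
const alignments = [
  { d: 0.20300884795954766, ca: 'conferència', es: 'conferencia' },
  { d: 0.4375342893146268, ca: 'data', es: 'fecha' }
];

// Hypothetical helper: turn distance-style entries into scored mappings,
// treating d as a distance so that a smaller d gives a higher score.
function toScoredMappings(entries, from, to) {
  return entries.map((e) => ({
    source: e[from],
    target: e[to],
    score: 1 - e.d
  }));
}

const scored = toScoredMappings(alignments, 'es', 'ca');
console.log(scored);
```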

In my local testing I see that all params are adapted. The next step is to investigate any cxserver deployment-related issues.

Change 542897 had a related patch set uploaded (by KartikMistry; owner: KartikMistry):
[operations/deployment-charts@master] Also add templatemapping to cxserver prod config

https://gerrit.wikimedia.org/r/542897

For Quick QA: Using the content of

<section id='cxTargetSection41' data-mw-cx-source='undefined'><p id='mwAR8'><span data-segmentid='178' class='cx-segment'>En 2009, la Biblioteca Nacional inició un proyecto en colaboración con <a href='./Internet_Archive' rel='mw:WikiLink' data-linkid='179' class='cx-link' id='mwASA' title='Internet Archive'>Internet Archive</a>, con el objetivo de 'recolectar, archivar, y preservar el dominio.es.'<sup typeof='mw:Extension/ref' data-mw='{"name":"ref","attrs":{"name":"archivoweb"},"body":{"id":"mw-reference-text-cite_note-archivoweb-36","html":"<span about=\"#mwt117\" class=\"citation conferencia\" data-mw=\"{&amp;quot;parts&amp;quot;:[{&amp;quot;template&amp;quot;:{&amp;quot;target&amp;quot;:{&amp;quot;wt&amp;quot;:&amp;quot;Cita conferencia &amp;quot;,&amp;quot;href&amp;quot;:&amp;quot;./Plantilla:Cita_conferencia&amp;quot;},&amp;quot;params&amp;quot;:{&amp;quot;apellido&amp;quot;:{&amp;quot;wt&amp;quot;:&amp;quot;Peréz Morillo&amp;quot;},&amp;quot;nombre&amp;quot;:{&amp;quot;wt&amp;quot;:&amp;quot;Mar&amp;quot;},&amp;quot;enlaceautor&amp;quot;:{&amp;quot;wt&amp;quot;:&amp;quot;&amp;quot;},&amp;quot;coautores&amp;quot;:{&amp;quot;wt&amp;quot;:&amp;quot;Icíar Mugerza López&amp;quot;},&amp;quot;fecha&amp;quot;:{&amp;quot;wt&amp;quot;:&amp;quot;8 de junio de 2011&amp;quot;},&amp;quot;mes&amp;quot;:{&amp;quot;wt&amp;quot;:&amp;quot;&amp;quot;},&amp;quot;título&amp;quot;:{&amp;quot;wt&amp;quot;:&amp;quot;El archivo web de la BNE&amp;quot;},&amp;quot;conferencia&amp;quot;:{&amp;quot;wt&amp;quot;:&amp;quot;Cita en la BNE&amp;quot;},&amp;quot;urlconferencia&amp;quot;:{&amp;quot;wt&amp;quot;:&amp;quot;&amp;quot;},&amp;quot;url&amp;quot;:{&amp;quot;wt&amp;quot;:&amp;quot;http://www.slideshare.net/bne/el-archivo-web-de-la-bne&amp;quot;},&amp;quot;formato&amp;quot;:{&amp;quot;wt&amp;quot;:&amp;quot;&amp;quot;},&amp;quot;fechaacceso&amp;quot;:{&amp;quot;wt&amp;quot;:&amp;quot;18 de mayo de 2012&amp;quot;}},&amp;quot;i&amp;quot;:0}}]}\" 
id=\"CITAREFPeréz_Morillo8_de_junio_de_2011\" typeof=\"mw:Transclusion\" data-ve-no-generated-contents=\"true\">Peréz Morillo, Mar; Icíar Mugerza López (8 de junio de 2011). <a class=\"external text\" href=\"http://www.slideshare.net/bne/el-archivo-web-de-la-bne\" id=\"mwBV0\" rel=\"mw:ExtLink\"><i id=\"mwBV4\">El archivo web de la BNE</i></a>. Cita en la BNE<span class=\"reference-accessdate\" id=\"mwBV8\">. Consultado el 18 de mayo de 2012</span>.</span><span about=\"#mwt117\" class=\"Z3988\" id=\"mwBWA\" title=\"ctx_ver=Z39.88-2004&amp;amp;rfr_id=info%3Asid%2Fes.wikipedia.org%3ABiblioteca+Nacional+de+Espa%C3%B1a&amp;amp;rft.au=Per%C3%A9z+Morillo%2C+Mar&amp;amp;rft.aufirst=Mar&amp;amp;rft.aulast=Per%C3%A9z+Morillo&amp;amp;rft.btitle=El+archivo+web+de+la+BNE&amp;amp;rft.date=8+de+junio+de+2011&amp;amp;rft.genre=book&amp;amp;rft_id=http%3A%2F%2Fwww.slideshare.net%2Fbne%2Fel-archivo-web-de-la-bne&amp;amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook\" data-ve-ignore=\"true\"><span id=\"mwBWE\" style=\"display:none;\"><span id=\"mwBWI\" typeof=\"mw:Entity\">&amp;nbsp;</span></span></span><span about=\"#mwt117\" id=\"mwBWM\" data-ve-ignore=\"true\"> </span><span about=\"#mwt117\" class=\"error citation-comment\" id=\"mwBWQ\" style=\"display:none;font-size:100%\" data-ve-ignore=\"true\">La referencia utiliza el parámetro obsoleto <code id=\"mwBWU\"><span id=\"mwBWY\" typeof=\"mw:Entity\">|</span>coautores=</code> (<a class=\"cx-link\" data-linkid=\"737\" href=\"./Ayuda:Errores_en_las_referencias#deprecated_params\" id=\"mwBWc\" rel=\"mw:WikiLink\" title=\"Ayuda:Errores en las referencias\">ayuda</a>)</span><link about=\"#mwt117\" href=\"./Categoría:Wikipedia:Páginas_con_referencias_con_parámetros_obsoletos\" id=\"mwBWg\" rel=\"mw:PageProp/Category\" data-ve-ignore=\"true\">"}}' class='mw-ref' data-cx='{}' about='#mwt328' id='cite_ref-archivoweb_36-0' rel='dc:references'><a href='./Biblioteca_Nacional_de_España#cite_note-archivoweb-36' id='mwASE' style='counter-reset: 
mw-Ref 1;'><span class='mw-reflink-text' id='mwASI'>[1]</span></a></sup> </span><span data-segmentid='180' class='cx-segment'>En octubre de 2010, la BNE inauguró el 'Quijote Interactivo', una versión digitalizada e interactiva de la obra de <a href='./Cervantes' rel='mw:WikiLink' data-linkid='181' class='mw-redirect cx-link' id='mwASM' title='Cervantes'>Cervantes</a>, que incluye contenidos que ayudan a contextualizar la lectura, como un mapa con las aventuras del Quijote y apartados sobre la vida en el siglo XVII.<sup typeof='mw:Extension/ref' data-mw='{"name":"ref","attrs":{"name":"QuijoteInteractivo"},"body":{"id":"mw-reference-text-cite_note-QuijoteInteractivo-37","html":"<span about=\"#mwt118\" class=\"citation publicación\" data-mw=\"{&amp;quot;parts&amp;quot;:[{&amp;quot;template&amp;quot;:{&amp;quot;target&amp;quot;:{&amp;quot;wt&amp;quot;:&amp;quot;cita publicación &amp;quot;,&amp;quot;href&amp;quot;:&amp;quot;./Plantilla:Cita_publicación&amp;quot;},&amp;quot;params&amp;quot;:{&amp;quot;apellidos&amp;quot;:{&amp;quot;wt&amp;quot;:&amp;quot;Medina&amp;quot;},&amp;quot;nombre&amp;quot;:{&amp;quot;wt&amp;quot;:&amp;quot;Luis&amp;quot;},&amp;quot;enlaceautor&amp;quot;:{&amp;quot;wt&amp;quot;:&amp;quot;&amp;quot;},&amp;quot;año&amp;quot;:{&amp;quot;wt&amp;quot;:&amp;quot;2010&amp;quot;},&amp;quot;título&amp;quot;:{&amp;quot;wt&amp;quot;:&amp;quot;La aventura interactiva de Don Quijote&amp;quot;},&amp;quot;publicación&amp;quot;:{&amp;quot;wt&amp;quot;:&amp;quot;El País&amp;quot;},&amp;quot;url&amp;quot;:{&amp;quot;wt&amp;quot;:&amp;quot;http://cultura.elpais.com/cultura/2010/10/26/actualidad/1288044008_850215.html&amp;quot;},&amp;quot;fechaacceso&amp;quot;:{&amp;quot;wt&amp;quot;:&amp;quot;17 de mayo de 2012&amp;quot;}},&amp;quot;i&amp;quot;:0}}]}\" id=\"CITAREFMedina2010\" typeof=\"mw:Transclusion\" data-ve-no-generated-contents=\"true\">Medina, Luis (2010). 
<a class=\"external text\" href=\"http://cultura.elpais.com/cultura/2010/10/26/actualidad/1288044008_850215.html\" id=\"mwBWs\" rel=\"mw:ExtLink\">«La aventura interactiva de Don Quijote»</a>. <i id=\"mwBWw\">El País</i><span class=\"reference-accessdate\" id=\"mwBW0\">. Consultado el 17 de mayo de 2012</span>.</span><span about=\"#mwt118\" class=\"Z3988\" id=\"mwBW4\" title=\"ctx_ver=Z39.88-2004&amp;amp;rfr_id=info%3Asid%2Fes.wikipedia.org%3ABiblioteca+Nacional+de+Espa%C3%B1a&amp;amp;rft.atitle=La+aventura+interactiva+de+Don+Quijote&amp;amp;rft.au=Medina%2C+Luis&amp;amp;rft.aufirst=Luis&amp;amp;rft.aulast=Medina&amp;amp;rft.date=2010&amp;amp;rft.genre=article&amp;amp;rft.jtitle=El+Pa%C3%ADs&amp;amp;rft_id=http%3A%2F%2Fcultura.elpais.com%2Fcultura%2F2010%2F10%2F26%2Factualidad%2F1288044008_850215.html&amp;amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal\" data-ve-ignore=\"true\"><span id=\"mwBW8\" style=\"display:none;\"><span id=\"mwBXA\" typeof=\"mw:Entity\">&amp;nbsp;</span></span></span>"}}' class='mw-ref' data-cx='{}' about='#mwt318' id='cite_ref-QuijoteInteractivo_37-0' rel='dc:references'><a href='./Biblioteca_Nacional_de_España#cite_note-QuijoteInteractivo-37' id='mwASQ' style='counter-reset: mw-Ref 2;'><span class='mw-reflink-text' id='mwASU'>[2]</span></a></sup></span></p>
</section>
with https://cxserver.wikimedia.org/v2/?doc#!/Machine_translation/post_v2_translate_from_to_provider and es->ca as the language pair should give a result that contains the following string. The content is a minimal sample based on https://es.wikipedia.org/wiki/Usuario:Jpita23/t224721

<sup about=\"#mwt328\" class=\"mw-ref\" data-cx=\"{&#34;adapted&#34;:true,&#34;partial&#34;:false}\"
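For anyone scripting this check, the `data-cx` attribute can be decoded and inspected with something like the sketch below. The function name `readAdaptationStatus` is hypothetical. It assumes the HTML has already been extracted from the JSON response (so quotes are no longer backslash-escaped), and it only looks at the first `data-cx` attribute in the string.

```javascript
// Hypothetical QA helper: pull the first data-cx attribute out of an HTML
// string, decode the quote entities, and parse the JSON it carries.
function readAdaptationStatus(html) {
  const match = html.match(/data-cx="([^"]*)"/);
  if (!match) {
    return null;
  }
  const json = match[1].replace(/&#34;/g, '"').replace(/&quot;/g, '"');
  return JSON.parse(json);
}

const sample = '<sup about="#mwt328" class="mw-ref" ' +
  'data-cx="{&#34;adapted&#34;:true,&#34;partial&#34;:false}">';
const status = readAdaptationStatus(sample);
console.log(status.adapted, status.partial); // true false
```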

Change 542897 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] Also add templatemapping to cxserver prod config

https://gerrit.wikimedia.org/r/542897