(Go: >> BACK << -|- >> HOME <<)

Page MenuHomePhabricator

SVG image files, switch systemLanguage, lang argument, and mixed-case IETF langtags
Closed, ResolvedPublic

Description

Mixed-case langtags are present and not properly handled.

Go to a Commons SVG file such as File:First Ionization Energy.svg

https://commons.wikimedia.org/wiki/File:First_Ionization_Energy.svg

Use the "Render this image in" dropdown to select simplified Chinese (zh-Hans). Click Go.

The image is displayed in English rather than Chinese. The page url is

https://commons.wikimedia.org/w/index.php?lang=zh-Hans&title=File%3AFirst_Ionization_Energy.svg

So the user interface for showing different language versions of switch-translated SVG files does not work.

The URL is generated by a form that manages the dropdown. The input comes from a select with name="lang" and an option with value="zh-Hans"; the dropdown result is the URL argument ?lang=zh-Hans.

If the URL is modified to ?lang=zh-hans, then Chinese is displayed. A working (lowercase langtag) URL is:

https://commons.wikimedia.org/w/index.php?lang=zh-hans&title=File%3AFirst_Ionization_Energy.svg

The problem is more widespread than just the Commons file page interface.

SVG files that inserted using mixed-case IETF langtags such as

[[File:First Ionization Energy.svg|thumb|lang=zh-Hans]]

do not work (ie, display English) while lowercase-only IETF langtags such as

[[File:First Ionization Energy.svg|thumb|lang=zh-hans]]

do work (display Chinese). Same bug in a different place.

With the mixed-case langtag, the generated HTML goes directly to the English version; apparently the wiki markup did an API call with the mixed-case langtag, the mixed-case langtag was not found, so the English default was returned. Providing a lowercase langtag found the Chinese version and the HTML accesses the *-langzh-hans-*.png.

The "Render this image in" problem could be fixed by lowercasing the value= in the option elements when the page built.

More generally, the lang= argument should accept mixed-case IETF langtags and lowercase them to canonize them.

Giving mixed-case langtags used to work....

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

It is not only a problem of the case, the problem is that librsvg uses only the system POSIX language and does not have a parameter for the IETF BCP 47 language code.

Anomie subscribed.

This does not seem to have anything to do with the MediaWiki Action API. Removing tag.

I'm a newbie who doesn't know many details.

The user File GUI at Commons should work with PhiLiP's patch to ImagePage.php mentioned at T125710:

https://gerrit.wikimedia.org/r/269108

I don't know enough to find out if the patch was committed or subsequently reverted. It seems to enforce lowercase canonization of URL arguments; it explicitly lowercases the select/option to feed lowercase so it would at least display a Chinese image rather than a default English image.

Interestingly, a review patch comment states, "Since MediaWiki are always use all lowercase language code in the URL." I don't know if that is in the MediaWiki spec. Since mixed-case (IETF preferred) langtags don't work, making URL langs lowercase is an expedient fix. Certainly IETF allows all lowercase langtags.

That patch does not fix everything. A user should be able to include a file with [[File:xyz.svg|lang=zh-Hans]] (ie, using the preferred langtag) rather than requiring the user to input a lowercase langtag.

@Fomafix seems to refer to a separate problem with librsvg not handling langtags correctly. That problem seems independent because by using lowercase lang arguments in wiki markup, one can ask for and get a Chinese version of the image rather than the default English version. That suggests that wiki and librsvg handle lowercase langtags quasi-correctly. I suspect that librsvg has langtag matching bugs, but that seems to be outside of the URL and wiki markup issues mentioned here.

https://gerrit.wikimedia.org/r/269108 hasn't been merged into the repository, is still pending review (status: Needs Code-Review Label).

I'm glad that this issue is brought back to table by someone else again, finally.

@Shizhao Please explain the SVG wikisyntax issue. If it cannot use "lang=zh-hant", then it seems it cannot use "lang=zh-Hant" either.

From some rsvg snippets I've seen, I can believe that rsvg does not do systemLanguage correctly. That issue seems separable.

I think wiki uses lowercase to canonize the filenames.

If I use wikimarkup with lowercase lang= zh-hans and zh-hant versions, the HTML has img elements with

src="//upload.wikimedia.org/wikipedia/commons/thumb/1/1d/First_Ionization_Energy.svg/langzh-hans-440px-First_Ionization_Energy.svg.png"

src="//upload.wikimedia.org/wikipedia/commons/thumb/1/1d/First_Ionization_Energy.svg/langzh-hant-440px-First_Ionization_Energy.svg.png"

Even if the images are incorrect, it looks like wiki has the mechanism to handle lowercase, hyphenated, langtags.

I don't know PHP, but I see nothing wrong with @PhiLiP patch to ImagePage.php.

It lowercases the langtag when it pulls it out of the URL lang arg and determines it is not NULL at 335.

(Presumably handler->validateParam('lang', 'zh-Hant') will fail. I haven't looked.)

It lowercases the lang= option at 1092 and 1105 and 1116.

I'd appreciate it if someone could check it an cause the commit. It would help to diagnose other issues with sublangs: sr-Cyrl and sr-Latn.

@PhiLiP yes, I would like this picked up and resolved. I didn't know about this patch until recently.

I don't know what I'm doing, but researching the lang issue I found...

https://doc.wikimedia.org/mediawiki-core/master/php/classSvgHandler.html

which gives a link to description of validateParam which points to

https://doc.wikimedia.org/mediawiki-core/master/php/SVG_8php_source.html

and line 472

Language::isValidBuiltInCode( $value )

That takes me to

https://doc.wikimedia.org/mediawiki-core/master/php/classLanguage.html#a48746d6fef809bf42bc33c4179a22ed8

"Returns true if a language code is of a valid form for the purposes of internal customisation of MediaWiki, via Messages*.php or *.json."

source at

https://doc.wikimedia.org/mediawiki-core/master/php/Language_8php_source.html

line 374 checks the lang $code with a regex:

return (bool)preg_match( '/^[a-z0-9-]{2,}$/', $code );

I don't know PHP, but I read this as beginning-to-end match; lowercase letters, digits, hypen at least two characters long. Uppercase letters are not in the pattern, so mixed case/uppercase lang arg is not legal internally for wiki.

So I see "zh-hant" as completely legal.

@Shizhao has built a test page at

https://zh.wikipedia.org/wiki/User:Shizhao/T154237

where zh-hans and zh-hant display the same text even though the file has different text for the two languages. That could lead Shizhao to believe "wikisyntax cannot use zh-hant", but I think it is a flawed matching algorithm in rsvg that causes lang zh-XXXX to match the first zh-YYYY in an SVG switch. Since zh-Hans is first in the switch clauses, only zh-Hans is displayed -- even if zh-hant is requested.

Method makeParamString in SVG.php explicitly lowercases the 'lang' parameter at line 492 (but lang='EN' slips through!), so some lowercasing in PhiLiP's patch is defensive only.

getAvailableLanguages() should return lowercased langtags rather than arbitrary case. That the typical JSON convention. Without forcing lower case, the method could return an assortment of equivalent langtags.

When an SVG file is processed, the systemLanguage attributes are split and processed.

The relevant code is in

https://doc.wikimedia.org/mediawiki-core/master/php/SVGMetadataExtractor_8php_source.html

traverses the SVG tree

animateFilterAndLang() on line 247 finds systemLanguage attr, explodes it and trims it into $langItem

(there may be an issue with default lang processing: systemLanguage=",en" may be legit)

Problem: $langItem is not canonized at line 270

270                         $langItem = trim( $langItem );

should be

270                         $langItem = strtolower( trim( $langItem ) );

There is no requirement that SVG files use consistent names.
systemLanguage="en-us,en-US,EN-us,En-Us,en-Us" are all equivalent but produce separate PHP keys

$langItem or a substr are used as keys in the dictionaries:
$this->languages
$this->languagePrefixes

The intent is set union, so the keys should be lowercased at line 270 to canonize them.

Language::isWellFormedLanguageTag() is good method for langtag checking; it could be added to line 472 of SVG to cull "&lang=--" (but original subr needs to stay to enforce all lowercase).
But isWellFormedLanguageTag() also has screwiness in it. For example, it has a regex for grandfathered "en-GB-oed" and "sgn-BE-fr" which cannot possibly match because the input argument has been strtolower()'d at line 325.
https://doc.wikimedia.org/mediawiki-core/master/php/Language_8php_source.html#l00283

Dirty Union at line 189
https://doc.wikimedia.org/mediawiki-core/master/php/SVGMetadataExtractor_8php_source.html#l00189
189  $this->metadata['translations'] = $this->languages + $this->languagePrefixes;
assume systemLanguage="en,en-GB"
$this->languages would have a key 'en'
$this->languagePrefixes would have a key 'en'
test for equality of value to LANG_PREFIX_MATCH would fail; duplicate keys in second argument are tossed, so the union prefers LANG_FULL_MATCH.
That's OK at SVG.php line 99 (which just wants $this->languages anyway).

The bottom line is I can make an SVG file with 16 different capitalizations of "en-GB" which will cause the ImagePage.php to produce a language select dropdown with 16 options that all say British English.

Failure to canonize in animateFilterAndLang() means getAvailableLanguages() returns multiple spellings of equivalent langtags.

Go to Commons [[File:System language attribute bug demo.svg]]
https://commons.wikimedia.org/wiki/File:System_language_attribute_bug_demo.svg

and look at the "Render this image in" dropdown; there will be 16 entries for American English..

File also shows selection does not work for hyphenated en-XXX varieties.

Another twist is MediaWiki assumes the default langtag is en rather than use the xml:lang attribute on the svg element.

Consequently, a German SVG file (one whose switch defaults are all German) but has systemLanguage=en clauses does not do the language dropdown box correctly. The dropdown just displays one language option: English. German is not presented.

Given that MediaWiki always uses en as the default, I guess all SVG files should assume en is the default. That's too bad, because the file I was using was originally generated on the de.WP.

Shizhao:

Yes, I noticed on 6 September that lang= zh-hans and zh-hant stopped working.

See https://en.wikipedia.org/w/index.php?title=User:Glrx/sandbox&diff=799299323&oldid=799292424

But I also noticed that lang=zh will work (and given the broken nature of librsvg langtag handling) would be just as effective.

I suspect the code for grabbing img src= (I added https: so you can click the URL)

https://upload.wikimedia.org/wikipedia/commons/thumb/1/1d/First_Ionization_Energy.svg/langzh-hans-440px-First_Ionization_Energy.svg.png

no longer works. (If you click it, it gives 404 not found.)

The code for grabbing

https://upload.wikimedia.org/wikipedia/commons/thumb/1/1d/First_Ionization_Energy.svg/langzh-440px-First_Ionization_Energy.svg.png

still works (access with browser displays).

In times past, the Commons File: page for a switch-translated file would show a "default" language option along with the explicit systemLanguages used. I don't see that behavior anymore.

[[:File:Globe valve diagram.svg]] is a switch-translated file; it does not specify an xml:lang on the svg element. The switch elements default to displaying numbered labels, but they have systemLanguage clauses for cs, de, en, and ru. The Commons language dropdown does not show default. It shows the en option, but that option still displays the numbers rather than English text.

The display is for

https://upload.wikimedia.org/wikipedia/commons/thumb/e/e5/Globe_valve_diagram.svg/409px-Globe_valve_diagram.svg.png

When it should be:

https://upload.wikimedia.org/wikipedia/commons/thumb/e/e5/Globe_valve_diagram.svg/langen-409px-Globe_valve_diagram.svg.png

This behavior suggests a poor notion of language defaulting. The first URL should be when lang is not specified; the second should be when lang is specified. My guess is that when lang=en, the second form is bypassed and the first is used.

Except it is even crazier than that. Going to the Commons page shows the numbered illustration:

https://commons.wikimedia.org/wiki/File:Globe_valve_diagram.svg (link to Commons page)

That's fine, but if I click "edit" on the Summary heading and then click "show preview", Wikimedia displays the English labels (lang=en version) rather than the numbered version!

@PhiLiP I have a new patch ready for this that I think addresses everything in the mediawiki code. Doesn't fix the rsvg bug, but I think it's the best we can do until that does get fixed. Is it ok with you if I mark your patch as abandoned and push my own?

Change 384052 had a related patch set uploaded (by Cparle; owner: Cparle):
[mediawiki/core@master] Use language codes from SVG without modification

https://gerrit.wikimedia.org/r/384052

I'm not a MW programmer and ignorant of gerrit.

I'm not sure that SVG.php's makeParamString() edit is right (delete line 502). I think all file parameter strings should use lowercase parameter strings, so lowercasing lang is a good thing to do. My take is MW should be enforcing/passing lowercase langtags all the time; it is a cheap canonization. I don't know if this code impacts file inclusions. A user should be able to enter [[File:Test.svg|lang=zh-Hans]] and have it request a lowercase langzh-hans-... Today, if I specify lang=zh-hans, I get nothing; if I specify lang=zh-Hans, then I get the default file. That's not the right thing. Furthermore, not lowercasing may introduce a caching / multiple copy issue.

I'm also leery of the && $params['lang'] !== 'en' condition (on line 501) , but it may have other ramifications. I think the condition should be deleted. In an SVG file, there is a difference between the default language and specifying 'en'. If lang=en is specified, the SVG file may be different from its default. Default may display numbers, but en may display English text. IIRC, an SVG file from a German editor had bizarre behavior because the default language was German; I had to switch the default to en.

The line 514 change to case insensitive match seems to catch both "zh-Hans" and "zh-hans", but it does not canonize the match on line 515: it should lower case $m[1].

As far as I can decipher it, the changes at ImagePage.php look OK except that line 1069 relies on $langChoices being a canonical set, but it is not.
Line 1075 might use $code, the preferred langtag.
Line 1078 assumes $curLang is $lang normal; that may not be true.
Line 1081 assumes $defaultLang is $lang normal; that may not be true.

That array comes from $this->displayImg->getAvailableLanguages(); (possibly line 545; I don't know how to traverse the source tree)

SVG's getAvailableLanguages() (possibly line 52) depends on $metadata['translations'] (set on line 187) and depends on $this->languages.

but line 268 of SVGMetadataExtractor.php only trims $langItem; it does not canonize it before using it as key for $this->languages (on line 270).

That means the getAvailableLanguages() may include IETF langtag semantic duplicates of "zh-hans" and "zh-Hans".

I don't see these patches fixing the failure to display some image for ".../langzh-hans-400px-Test.svg.png". The path is lowercase, so the /I change would not affect the match. I have no idea where the failure to display happens.

Ach!

ImagePage.php
line 1057 makes $curLang IETF canonical
line 1058 makes $defaultLang IETF canonical

line 1069 makes $lang not canonical and not IETF canonical
line 1070 makes $code IETF canonical

therefore 1078 and 1071 must compare to $code rather than $lang!

All of this means that the simple cases of two-letter IETF langtags (e.g., "en", "de", "es") will work, but script langtags will fail (e.g., "zh-hans" becomes IETF canonical "zh-Hans").

Also, I don't know the case of $request->getVal( 'lang' ) on line 313.

I'd downvote the patch.

@Glrx thanks for the feedback

My take is MW should be enforcing/passing lowercase langtags all the time; it is a cheap canonization

I don't understand why you think they should be canonized. We don't enforce any rules on the systemLanguage attributes when the SVG file is uploaded, so they're essentially arbitrary strings, and therefore IMO we should treat them as arbitrary strings and pass them around unmodified.

I don't know how the lang param in a [[File:xyz|lang=abc]] tag is handled internally, but I suspect that the expected value is an internal wikimedia language identifier that won't necessarily match the (arbitrary) systemLanguage attribute. I'll look into that but I'm not sure it'll be possible to solve

therefore 1078 and 1071 must compare to $code rather than $lang!

Aha! Yes you are right on this. Will fix on Monday, finishing for the day now

My take on MW is that the system code uses lowercase langtags everywhere. If I look at language objects, then they will have lowercase langtags such as "zh-hant" rather than "zh-Hant". Only in the display of a langtag to the user is it made IETF canonical, and that is only because it is prettier.

When somebody made the dropdown box pretty with IETF langtags, he got carried away and started using IETF canonization in too many places.

Line 1057 should not IETF canonize. It should just lowercase.
Line 1058 should not IETF canonize. It should just lowercase.
(Neither of the above two lines should be needed; the caller of doRenderLangOp() should guarantee lowercase, but they don't hurt.)

The contract for getAvailableLanguages() should be it all its keys are lowercase. Lowercase would follow MW convention. I will explain duplicate keys below.

Consequently, $lang at line 1069 would be lowercase canonical.
$code at line 1070 is made IETF canonical. That variable is for display only; $code should not be compared to anything. It is only a pretty string for display.

Line 1078 should compare $curLang to $lang
Line 1081 should compare $defaultLang to $lang

Do not make your changes to SVG.php. There is no reason that MW should accept all these pathnames:

https://upload.wikimedia.org/wikipedia/commons/thumb/e/e5/Globe_valve_diagram.svg/langde-409px-Globe_valve_diagram.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/e/e5/Globe_valve_diagram.svg/langDE-409px-Globe_valve_diagram.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/e/e5/Globe_valve_diagram.svg/langdE-409px-Globe_valve_diagram.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/e/e5/Globe_valve_diagram.svg/langDe-409px-Globe_valve_diagram.svg.png

MW should only accept the first.

Accepting many variations will confuse caching. The paths are generated by MW, and they should be canonical. Think of it as there is only one true pathname. That means SVG.php must insure a lowercase langtag.

SVGMetadataExtractor.php must canonize $langItem.

"I don't understand why you think they should be canonized. We don't enforce any rules on the systemLanguage attributes when the SVG file is uploaded, so they're essentially arbitrary strings, and therefore IMO we should treat them as arbitrary strings and pass them around unmodified."

The spec for SVG does not require canonical langtags, but it does require case-insensitive matching. An SVG file might have systemLanguage="en-us" in one place but systemLanguage="en-US" in another. MW must treat those capitalizations as the same, and the cheapest way to do that is just lowercase them. Objects in PHP are case sensitive; "en-us" and "en-US" would be separate keys. We don't want the set of language tags used in the SVG file to contain such semantic duplicates.

For a concrete example of the problem, go to
https://commons.wikimedia.org/wiki/File:System_language_attribute_bug_demo.svg

and look at the "Render this image in " dropdown. It has 16 identical entries for "American English (en-US)". I made that happen by using the 16 different ways to capitalize "eN-uS".

The rule should be that MW system software ALWAYS uses lowercase langtags. When MW reads an SVG file, then it should convert all those langtags to lowercase to be compatible with the rest of MW. Only when MW want to paint a pretty picture should it go to the trouble of calling LanguageCode::bcp47( $langtag ).

In ImagePage.php, you've removed $handler->validateParam() which is overloaded at SVG.php. I'm not sure that is a wise thing to do because it upsets the previous architecture. validateParam() for SVG can get the available languages and check them.

Change 385352 had a related patch set uploaded (by Cparle; owner: Cparle):
[mediawiki/core@master] Treat svg systemLanguage attribs as arbitrary strings

https://gerrit.wikimedia.org/r/385352

Change 384052 abandoned by Cparle:
Use language codes from SVG without modification

Reason:
See https://gerrit.wikimedia.org/r/#/c/385352/ instead

https://gerrit.wikimedia.org/r/384052

Please abandon 385352.

It has changes that break the architecture and intent.
There's a reason for LANG_FULL_MATCH and LANG_PREFIX_MATCH.
The File: UI page is only interested in LANG_FULL_MATCH.
If the SVG file has zh-Hans, then the UI should offer to display zh-Hans but not offer just the prefix zh. There is no just zh to display.
The image insertion code could be interested in LANG_PREFIX_MATCH.
If an article asks for zh, then the system should recognize zh is available and generate the "langzh-440px..." path to display it.
If there's not zh- langtag, then the system recognize that zh is not available and generate the default path "440px.." to display it.
I don't know where it does it, but MW seems to have that pathname behavior.

You're trying to replace that functionality with getMatchedLanguage, but
list ( $svgLangPrefix, ) = explode( '-', $svgLang );
does not do SVG lang matching semantics.
$userPreferredLang of "zh-Hans" will not match $svgLang of "zh-Hans-CN".
Yes, the existing code uses a gross first hyphen hack.
Futhermore, I don't see getMatchedLanguage tying into image insertion code. Maybe it does.

The patch has code that does not achieve its goal.
SVGMetadataExtractor.php does not even generate an appropriate object; array_unique does not cut it.

Language::IsValidBuiltInCode is the right thing: we should insist on lowercase IETF. Don't change it.

"SVG does not have a way of specifying which language is default, therefore return the user's interface language"
SVG does have a way of specifying the default language for an image; it is xml:lang.
In addition, an SVG image should always have a default language for display; that is, there should always be a (default) in the option list. Even if there's a match to the requested language.
I'm not chasing the rest of the default/requested language; changing ImagePage.php seems far afield for fixing SVG problems.

Abandon the patch.

A primary confusion is a desire to preserve case in IETF langtags. That is the wrong thing.

Start out small. Do the one line fix to SVGMetadataExtractor.php at line 268 that lowercases and trims the language tag $langItem.

Do that as one patch. It will fix a bug.

Then fix the option list in another patch.

@Glrx let's address your concerns one at a time

There's a reason for LANG_FULL_MATCH and LANG_PREFIX_MATCH.
The File: UI page is only interested in LANG_FULL_MATCH.
If the SVG file has zh-Hans, then the UI should offer to display zh-Hans but not offer just the prefix zh. There is no just zh to display.
The image insertion code could be interested in LANG_PREFIX_MATCH.
If an article asks for zh, then the system should recognize zh is available and generate the "langzh-440px..." path to display it.
If there's not zh- langtag, then the system recognize that zh is not available and generate the default path "440px.." to display it.
I don't know where it does it, but MW seems to have that pathname behavior.

With this patch the file page UI will display a dropdown containing all the systemLanguage attributes from the SVG file. It does not create prefixes, as you seem to be suggesting.

You can use the a prefix rather than the full code in the image insertion code if you wish - it gets passed on directly to librsvg and matched by the library as appropriate

$userPreferredLang of "zh-Hans" will not match $svgLang of "zh-Hans-CN"

Ok. I had interpreted "a prefix" in the SVG spec as meaning "the string before the first hyphen" but I see how it could be interpreted your way. I'll make a change so that it covers both interpretations

SVGMetadataExtractor.php does not even generate an appropriate object; array_unique does not cut it.

I don't understand what you mean. The original and the patched code both write language data to arrays

Language::IsValidBuiltInCode is the right thing: we should insist on lowercase IETF

I don't understand your reasoning. If we are going to insist on anything we need to do our insisting when the user uploads the SVG file in the first place, not allow them to upload arbitrary strings as systemLanguage attributes and then second-guess their intentions afterwards

SVG does have a way of specifying the default language for an image; it is xml:lang

Hmm ok. I'm not sure how that ties in with the systemLanguage switch though. On a File: page the default language can only reasonably be the user's interface language, falling back as specified within the <switch> in SVG.

When including the file in an article, however, it's not so clear. Are you suggesting that if we have no lang=xyz value we should try and match the xml:lang value (if it exists) against the systemLanguage attributes? This would probably need to be a separate patch, as it would need some investigation

... and I think that's it. Have I missed anything?

"With this patch the file page UI will display a dropdown containing all the systemLanguage attributes from the SVG file."

It is not just about the file page UI dropdown. The prefixes may be used elsewhere. Why did the original programmer include the prefix matches in the table? We know right now that [[File:Test.svg|lang=zh-hans]] does not display any diagram.

"You can use the a prefix rather than the full code in the image insertion code if you wish - it gets passed on directly to librsvg and matched by the library as appropriate"

That may be true, but it does not explain the failure to serve an image bug. Somewhere in MW, prefixed langtags such as "zh-hans" and "sr-ec" are being rejected. To get any image, I must use just the prefix. I'm hesitant about any significant changes without understanding that bug.

To put it more bluntly, why are you changing code that does not need to be changed?

"Ok. I had interpreted "a prefix" in the SVG spec as meaning "the string before the first hyphen" but I see how it could be interpreted your way. I'll make a change so that it covers both interpretations"

There is only one interpretation.

The spec is http://www.w3.org/TR/SVG/struct.html#ConditionalProcessingSystemLanguageAttribute

It states, "Evaluates to "true" if one of the languages indicated by user preferences exactly equals one of the languages given in the value of this parameter, or if one of the languages indicated by user preferences exactly equals a prefix of one of the languages given in the value of this parameter such that the first tag character following the prefix is "-".

"I don't understand what you mean. The original and the patched code both write language data to arrays."

SVGMetadataExtractor used an object to avoid duplicates (there would be only one "en-us" key) but has the mixed case bug. array_unique will remove same case dupes from an array, but it will not fix the case bug: it will keep both "en-us" and "en-US".

"I don't understand your reasoning. If we are going to insist on anything we need to do our insisting when the user uploads the SVG file in the first place, not allow them to upload arbitrary strings as systemLanguage attributes and then second-guess their intentions afterwards"

Language::IsValidBuiltInCode insists on lowercase subset of IETF langtags. Language::IsValidCode allows a lot of other crap.

WTF? I don't want to insist and the spec prohibits me from insisting that the SVG file contain uniform case language tags. That does not mean MW should treat "zh-hans" differently from "zh-Hans". IETF says case does not matter, and the simplest way of enforcing that is for MW to lowercase every langtag that comes from the outside world. Instead you're on some mission to preserve the case of the SVG langtags. The langtag case has no meaning. The user expresses no intention by using a different case.

"On a File: page the default language can only reasonably be the user's interface language, falling back as specified within the <switch> in SVG."

You are confusing the user's default language and the SVG's default language. Just because my default language is "en" does not mean the SVG's default language is "en". The dropdown is the list of languages in the SVG file; the default should be the SVG's default language. I've seen files whose default language is "de"; last night I was looking at SVG files whose default language is "fr".

https://commons.wikimedia.org/wiki/File:Globe_valve_diagram.svg

is an example of how broken the file page UI is. Go to that URL and it displays English. The dropdown box does not have a "default".

The image's default language is not "en"; it has no default language: the last clause in the switch statements display numbers rather than any particular language. Here's what the "default language" display should look like:

https://commons.wikimedia.org/wiki/File:Globe_valve_diagram.svg?lang=tlh

I should be able to click default language and get the equivalent of

https://commons.wikimedia.org/wiki/File:Globe_valve_diagram.svg?lang=

and that should not display English (but librsvg will not cooperate). There's a lot of other code that is broken, so I don't expect a full fix anytime soon, but the dropdown list should offer to show the default display. (A workaround to see the default now is ?lang=qqq or ?lang=tlh in the dropdown.)

I am not suggesting that the SVG's xml:lang be used for article inclusions; that's clearly nonsense. An SVG file should declare its default language with xml:lang or lang, but almost none of them do that. It is not the case today, but in the future article insertions of [[File:Test.svg]] should have its |lang= parameter default to the Wikipedia's language. So if I were on de.WP, then [[File:Test.svg]] would be equivalent to [[File:Test.svg|lang=de]]. That's not done now because most SVGs are not switch translated, so there would be pointless duplicate svg.pngs of the file. But that is clearly the goal.

Abandon the patch. It is too extensive and too risky.

Fix line 268 in a single patch.

Then let's talk about fixing ImagePage options to be consistent.

General request: Please mark quotes as such (> prefix), as it's hard to follow long comments and as many different aspects are discussed. I'd also suggest to phrase requests and proposals as such and not using an imperative mood, and to criticize ideas instead of people. Thanks for keeping Phabricator a respectful place.

you're on some mission to preserve the case of the SVG langtags.

Hahaha. It's just simplest not to mess with the case. I don't see any reason to change it

The langtag case has no meaning. The user expresses no intention by using a different case.

Why would the user use a different case if not to express an intention? If a user uploads something like this

<text systemLanguage="en-us">Some text</text>
<text systemLanguage="en-US">Some other text</text>

then one of them cannot be displayed if we have lowercased everything. I've no idea why a user might do this, but if we were going to prevent them we would need to enforce compliance with BCP47 on upload (and allow them to take action based on that), rather than on display

Somewhere in MW, prefixed langtags such as "zh-hans" and "sr-ec" are being rejected. To get any image, I must use just the prefix. I'm hesitant about any significant changes without understanding that bug.

There are a number parts to that bug.

First, for language strings containing uppercase characters

  1. the language string fails validation as it's not a built-in language code, so an <img> tag gets generated such that the url contains no language information
  2. when there is no language information in the url the png with the English text gets rendered

Second, for language strings containing only lowercase characters

  1. the language string passes validation, and an <img> tag gets generated such that the url contains the language information
  2. the regular expression in the ImagesHandler class on the thumbor server (which generates the thumbnail) fails to match a language string containing a dash, and a 404 gets returned

This patch addresses everything except the ImagesHandler class, which is covered by https://phabricator.wikimedia.org/T178974

To put it more bluntly, why are you changing code that does not need to be changed?

With this comment I think you are referring to the changes to the code here https://gerrit.wikimedia.org/r/#/c/385352/4/includes/media/SVGMetadataExtractor.php I made the change because AFAICS that code doesn't serve any useful function, and IMO a responsible developer removes cruft when she/he can do so safely. I added the author of the code I changed as a reviewer to make sure it was safe.

... however I've since noticed that the SVG metadata is cached, and because I don't know how to go about purging caches when there's a release I've reverted that particular change

I should be able to click default language ...

Sorry, I can't really follow what it is you think should happen in the comments above. I've updated the patch a little so that both an included file and the file page itself will show the language that matches the user's interface language (so if I'm browsing in French then I'll get the French png), or the default text if there is no match. Also a 'default language' option will display in the dropdown on the File page UI.

Does this seem incorrect to you? This approach will generate more files than are generated currently, I will check with ops if this is a problem

@Glrx I've had some input from @Gilles and there are a few more changes on the way ...

Hahaha. It's just simplest not to mess with the case. I don't see any reason to change it

There's an important reason to mess with case. SVG does not care whether a user type "en-US" or "en-us"; it makes no difference to the semantic interpretation of the SVG.

It does make a difference to the MW UI. The UI should not display redundant options. As has been repeated many times above, the language dropdown box at

https://commons.wikimedia.org/wiki/File:System_language_attribute_bug_demo.svg

should not have 16 options in the dropdown box that all say "American English (en-US)" and do exactly the same thing when clicked.

It's a one line fix.

Furthermore, as stated above, MW probably should not bother to handle mixed-case language arguments in URLs:

https://upload.wikimedia.org/wikipedia/commons/thumb/e/e5/Globe_valve_diagram.svg/langde-409px-Globe_valve_diagram.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/e/e5/Globe_valve_diagram.svg/langDE-409px-Globe_valve_diagram.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/e/e5/Globe_valve_diagram.svg/langdE-409px-Globe_valve_diagram.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/e/e5/Globe_valve_diagram.svg/langDe-409px-Globe_valve_diagram.svg.png

because they would all produce the same image. KISS. Just because IETF langtags may be mixed case does not mean MW must accept mixed case in URLs.

Now I don't know how MW dispatches the above URLs. I don't know the code, and I don't have the time to learn it. My guess had been that

Thumb.php 549: $extraParams = $handler->parseParamString( $paramString );

was involved and invoked parseParamString in SVG.php which has the quite reasonable lowercase-enforcing-with-hyphen lang extraction:

SVG.php 514: if ( preg_match( '/^lang([a-z]+(?:-[a-z]+)*)-(\d+)px$/', $str, $m ) )

The regex reasonable rejects "langDE-400px" and accepts "langzh-hans-400px".

From the comment above, the hyphen error is fixed by @Gilles with a recent edit to Python code that uses the far more lenient regex

images.py 135: r'(?:lang(?P<lang>[0-9a-zA-Z-]+)-)?'

Now, I don't really care if MW accepts "en-US" as a valid language tag in a URL as long as it canonizes/lowercases it further down the road. MW need not and should not generate 16 versions of .svg.png if some dweebs happen to ask for all 16 capitalization variations of "eN-uS". Building one "en-us" version is sufficient, and despite "It's just simplest not to mess with the case", it is good engineering to use a canonical case to avoid extra work and duplicated files. MW doesn't need to be ready to build 64 .svg.png of "zh-hans", either.

Well, maybe I do care. Insisting on lowercase langtags in URLs means caching at my end works a little better. I won't have two copies of the same crap in my cache.

From a simplicity standpoint, I don't think images.py should accept A-Z. Just reject mixed case URLs. Users can say [[File:Test.svg|lang=zh-hans]]. It would be better for the parser to lowercase langtags. I think that's what makeParamString SVG.php:502 is trying to do. The recent Python edit also added 0-9. I'm not going to complain about that because some rare IETF langtags may have digits. I will say the SVG.php lang regex is the better one because it insists the first block be entirely alphabetic (which is an IETF requirement). The Python regex accepts "lang---300px"; the PHP regex insists on nonempty subtags.

Why would the user use a different case if not to express an intention? If a user uploads something like this

<text systemLanguage="en-us">Some text</text>
<text systemLanguage="en-US">Some other text</text>

then one of them cannot be displayed if we have lowercased everything. I've no idea why a user might do this, but if we were going to prevent them we would need to enforce compliance with BCP47 on upload (and allow them to take action based on that), rather than on display

Once again, SVG makes no distinction about case.

First, if we presume the given clauses are inside a switch element, then the second clause can NEVER be displayed. The only intention that the writer expresses is avoiding the second clause.

Second, if we presume the given clauses are not inside a switch element (rare, and the text will overwrite), then either both text elements will display or neither will display.

Saying it again, telling an SVG user agent that the preferred language is "en-US" will match both "en-us" and "en-US" (and 14 siblings and crap like "en-us-x-monarch"). SVG language matching does not care about case.

I've no idea why a user might do this, but if we were going to prevent them we would need to enforce compliance with BCP47 on upload (and allow them to take action based on that), rather than on display

This comment shows a fundamental misunderstanding of the task. Langtag case does not matter. It cannot influence SVG display. Any langtag case is BCP-47 compliant. The langtags "zh-Hans", "zh-hans", and "ZH-HANS" comply with BCP-47, are unambiguous, and are identical under BCP-47. They must be treated the same.

There are a number parts to that bug.

No, there is only one part to the bug. The examples I gave were lowercase hyphenated language tags: "zh-hans" and "sr-ec". The complaint was MW stopped processing the hyphenated lowercase langtags. I showed that appropriate language tags were being generated by wikimarkup. It was just that the "zh-hans" tags were 404ing.

Now, I hope that the hyphenated language tag issue has been fixed.

With this comment I think you are referring to the changes to the code here

Yes and no. Your patch proposed changing 200+ lines of code in 7 different files. That is outrageous. Not only did you want to change the systemLanguage parsing (something that was working), but you were also making changes in File and MediaHandler.

The code mostly worked before; it should be fixed with small changes. Gilles fix is a one liner. Fixing the option dropdown box can be a one- or two-liner. Fixing the multiset problem is a one-liner.

Sorry, I can't really follow what it is you think should happen in the comments above. I've updated the patch a little so that both an included file and the file page itself will show the language that matches the user's interface language (so if I'm browsing in French then I'll get the French png), or the default text if there is no match. Also a 'default language' option will display in the dropdown on the File page UI.

I don't see an updated patch, so I don't know what changes you've proposed.

I don't care if the semantics of the ImagePage change a little, but that is not an important issue. I'd tend to defer them to a second patch because it is far more important to get the basics working.

The basics are the dropdown box on ImagePage should use lowercase langtags instead of the "zh-Hans" it currently uses. Thumb URLs with uppercase langtags would not display, so uppercase langtags should not be stuffed into URLs.

After that, we get into the little niceties that could be / should be deferred. I really hesitate to go into this because I don't want to see changes until bugs have been resolved.

If somebody comes to a page with ?uselang=fr, for example

https://commons.wikimedia.org/wiki/File:First_Ionization_Energy.svg?uselang=fr

then they get a French UI with an English image displayed ("anglais (en-US)"). (Maybe it is different for Accept-Language access.) Fundamentally, that is not a bad result. It could be left that way. In fact, that way may be the most appropriate for "international" illustrations that use numbers as their default.

One could argue that it would be better if an unspecified "lang" argument defaulted to "uselang" so the above would be the same as

https://commons.wikimedia.org/wiki/File:First_Ionization_Energy.svg?uselang=fr&lang=fr

That defaulting has the problem mentioned above. It also has to be careful because the French user and the English user should be able to view the default image even if their language is present in the SVG file. How MW specifies view-the-default-image is a problem because it must get around MW conventions and librsvg conventions.

It would be nice if

https://commons.wikimedia.org/wiki/File:First_Ionization_Energy.svg?uselang=fr&lang=

did the trick, but it doesn't.

The simple trick is to always provide a default option that uses lang="qqq", a private langtag.

https://commons.wikimedia.org/wiki/File:Globe_valve_diagram.svg?uselang=fr&lang=qqq

But such a move is a hack; it can be done better by other methods. It is also not necessary to fix now.

The langtags "zh-Hans", "zh-hans", and "ZH-HANS" comply with BCP-47, are unambiguous, and are identical under BCP-47. They must be treated the same.

Aha! Now I understand what you're getting at - https://tools.ietf.org/html/bcp47#section-2.1.1

Latest revision of the patch - https://gerrit.wikimedia.org/r/#/c/385352/9 ... matches lang tags case-insensitively, and removes redundant lang tags

I've removed the defaulting of the language to the user's interface language on @Gilles advice - he reckons the community is used to needing to specify the language in the mediawiki tag, and unlikely to want existing behaviour to change

Abandon this patch. It is still confused and far too complicated for issues.

Poor addressing of the langtag multiset issue. Lowercasing was done in the wrong place (too late); array_unique not needed if done right.

The patch apparently deletes a reasonable UI sort of the langtags. (If the sort were ignore case, then all American English (en-US) options would have been together. Case is irrelevant for a canonical list.)

The patch removes a lowercase langtag paramlist enforcement. That runs a risk of damaging code elsewhere. Don't fix something that ain't broke.

It strangely wants to implement SVG user agent language matching at GetMatchedLanguage().
MW is not an SVG user agent.
GetMatchedLanguage has a dubious inner loop; the SVG superset match is trivial and described in spec.
If previous discussions are being read, then they are not being understood.
The apparent intent of GetMatchedLanguage() will destroy SVG display semantics.
Test case (en, [en-GB, en-US]) returning en-GB confuses user request with user agent.
Furthermore, there is no reason for MW to turn "en" into "en-GB".
GetMatchedLanguage could reasonably assert() if fed most test cases at SVGTest.php.

There is no reason to create createXmlOptionStringForLanguage(). The routine is used once, and where it is called still has similar code for creating another XmlOptionString. Silly changes like that hide diffs.

The patch still switches to the wrong langtag validation. And apparently knows that change caused problems because the unit test of embedded periods was changed to embedded colons. Embedded periods are not allowed in a langtag. parserTests.txt:15049. Why would anybody think that

[[File:Foobar.svg|lang=invalid.code]]

should pass as using a valid BCP47 langtag. When a unit test fails, that means something went wrong. Tripping on a valid unit test should not result in changing the test so you can ignore it. It means something is wrong with your code. Especially if the code has been passing the test before. Failed regression..

Kill this patch.

If @Gilles' patch fixes the hyphen, the only serious issues are a one-line fix for the multiset in SVG metadata extractor code and a few lines to fix mixed-case langtags being generated and shoved into select.option[].value attributes on ImagePage. Yes, I see the qqq option flourish, but there are deeper issues there. Everything else should not be touched.

@Gilles @Shizhao

Hyphenated langtags such as zh-hans are working again at least on en.WP and meta (well, modulo librsvg's limitations). I imagine that implies everywhere because they use the same URL host.

URLs with "langde-DE-400px" also work, but I think that is unnecessary. Are the mixed case URLs generating duplicate files on the servers?

The new URL flexibility does not help ImagePage. It's mixed case langtags display the default image. Lowercasing the hyphenated langtag in the URL works.

@Gilles @Shizhao

Hyphenated langtags such as zh-hans are working again at least on en.WP and meta (well, modulo librsvg's limitations). I imagine that implies everywhere because they use the same URL host.

URLs with "langde-DE-400px" also work, but I think that is unnecessary. Are the mixed case URLs generating duplicate files on the servers?

The new URL flexibility does not help ImagePage. It's mixed case langtags display the default image. Lowercasing the hyphenated langtag in the URL works.

see https://zh.wikipedia.org/wiki/User:Shizhao/T154237
zh-hans work on zhwiki; but zh-hant can't work, zh-hant display as zh-hans

@Shizhao

That's why I said "modulo librsvg's limitations". Librsvg IETF langtag matching only uses the first subtag, so "zh-hans" and "zh-hant" are both treated as if they were "zh". If the "zh-hans" clauses always precede the "zh-hant" clauses, then only "zh-hans" will display.

I submitted a blind patch to gnome to fix that, but I don't have the dev tools to test it.

Right, under the hood it's based on rsvg-convert (the command-line utility for librsvg). It uses the language defined by the LANG environment variable (the only method I know to make rsvg-convert handle multilingual SVGs):

https://phabricator.wikimedia.org/diffusion/THMBREXT/browse/master/wikimedia_thumbor/engine/svg/svg.py;4423244bd67195e91da2c0e0a1606923eab3b373$49-53

env = None
if hasattr(self.context.request, 'lang'):
   env = {'LANG': self.context.request.lang.upper()}

png = self.command(command, env)

URLs with "langde-DE-400px" also work, but I think that is unnecessary. Are the mixed case URLs generating duplicate files on the servers?

Yes, but invalid URIs are even more taxing, as they traverse all our caching layers to the backend. Which means that rejecting mixed case URIs at the Thumbor level could be more costly infrastructure-wise than allowing the cache/thumbnail storage split, because it has a recurrent processing cost (errors are only cached for a few minutes).

That kind of cache split is really not a big deal as long as the URIs generated by Mediawiki are consistent. It's not the only part of thumbnails URI you can modify arbitrarily and generate cache splits with. What matters is that the main source of those URI hits (Mediawiki) doesn't generate a wide variety of URIs resulting in the same content.

Since other parts of the thumbnail URI are case-sensitive for "valid reasons" (like the file name), we can't just lowercase the whole thing at the edge. And we're trying to avoid too much logic at the edge (which causes overhead for every request), so lowercasing just the language part isn't an option. We could avoid the split at the backend, but really the volume we're talking about here would be negligible if Mediawiki is consistent.

Just make sure that Mediawiki generates only lowercase lang parameters consistently if that's what's desirable and that's enough, imho.

Just make sure that Mediawiki generates only lowercase lang parameters consistently if that's what's desirable and that's enough, imho

Cool, that's already done

Change 385352 merged by jenkins-bot:
[mediawiki/core@master] Treat langtags in SVG switch case-insensitively

https://gerrit.wikimedia.org/r/385352

Will somebody who knows how to use gerrit please pull this abomination back.

Most of the changes are stupid and done in the wrong place.

For example, the edit neuters some unit tests.

Change 250 lines of code when it can be done in 10.

The coders do not understand the SVG specification.

Will somebody who knows how to use gerrit please pull this abomination back.

Most of the changes are stupid and done in the wrong place.

For example, the edit neuters some unit tests.

Change 250 lines of code when it can be done in 10.

The coders do not understand the SVG specification.

[process note, I came here due to a message on my commons talk page]

Please avoid using heated and personal language such as "stupid" and "abomination".

I haven't been following this bug super closely, but I'm happy to take a second look if need be. Can you describe specificly and concisely what your technical concerns with the patch are?

The present bug report is about a few SVG problems. MW supports SVG files with switch element translations. That facility was added a few years ago. It had a few bugs (some not its fault), but it mostly worked. The bugs should be simple to fix. It should not be 250 lines of code.

A number of editors have invested time in getting SVG switch illustrations to work.

The implementation has been degrading over time. Stuff that worked now doesn't.

As of three weeks ago, there were five issues. One is out of reach, and one has been solved and is on the servers.

  1. MW uses librsvg to convert SVG to bitmap, but that library has problems that are beyond WM reach. For langtags, librsvg essentially only uses the part before the first hyphen. Consequently, librsvg treats "zh-Hans", "zh-Hant", and "zh-Hans-CN" as if they were just "zh". There's a trivial fix for that problem, but that is in Gnome's court and out of WMF's reach. WMF cannot fix it.
  1. MW does not canonize the langtags it reads from SVG files. Consequently, MW believes "en-US" and "en-us" are different languages and offers to display both in the language dropdown box. For an exaggerated version of the problem, go to Commons [[File:System language attribute bug demo.svg]]

https://commons.wikimedia.org/wiki/File:System_language_attribute_bug_demo.svg

and look at the "Render this image in" dropdown; there will be 16 entries for American English (all possible capitalizations of "en-us"). The options will all look the same because the different capitalizations have been canonized to the IETF capitalization.

I'll refer to this issue as the multiset problem.

That's a one line fix to SVGMetadataExtractor.php at line 268. Just lowercase and trim $langItem.

$langItem = trim( $langItem );

should be

$langItem = strtolower( trim( $langItem ) );

This one-line fix to the multiset problem has been described a couple times above but ignored. There is no reason for the SVGMetadataExtractor code to distinguish different langtag capitalizations. The canonization should happen in the extractor.

Cparle's patch does not fix the multiset problem at its origin in the extractor. Instead, it does the lowercasing at SVG.php line 100 and then reduces the multiset with array_unique at SVG.php:105. Fix it where it is broken and SVG.php can be left alone.

Cparle's patch does other work in SVG.php, but that work seems to be inappropriate or unneeded.

Large changes to the code do not have a clear purpose or are just confused. The getMatchedLanguage() code is both not needed and wrong. The SVG langtag superset comparison is a simple test rather than the nested loop at 129-133. The SVG spec gives an English version of the code. Some of the unit tests for the function show that the function does not implement SVG semantics. The logic of gML("en", ["en-gb", "en-us"]) returning "en-gb" is confused. There is no guarantee that switch clauses always use systemLanguage in the same order. It is easy to make SVG("en") look different from SVG("en-gb") and different from SVG("en-us") even when systemLanuage="en" does not appear in the SVG file: just swap the order of the langtags in two switch statements. Fundamentally, if the user specifies "en", then the user agent is not allowed to change that preference to "en-gb" or "en-us".

The change at line 482 is not the appropriate test. The original code looked for well-formed IETF langtags. That's the right thing. The change does not help the code and apparently permits langtags with periods in them. Consequently, the patch dikes out the reject-langtags-with-periods unit test at parserTests.txt line 15049. The unit test was valid before and should not have been changed. The parser should reject langtags with periods. IETF langtags are letters, digits, and hyphens. No periods, no colons, no underscores; no hash marks.

At line 502, the patch removes a lowercase side-effect on $params['lang']. Other code could reasonably use that side-effect; MW wants its langtags to be lower case. The side effect may not be used later, but it seems to be a reasonable practice. Why dike it out?

  1. The dropdown option values (which are langtags) in ImagePage should be lowercase. IIRC, when it was first implemented, the MW switch processing sensibly used all lowercase langtags, and the desired result happened: the form URLs would have lowercase langtags. A user would select the option for "zh-hans" on the file page, click go, and the browser would go to new URL for that option:

https://commons.wikimedia.org/w/index.php?lang=zh-hans&title=File%3AFirst_Ionization_Energy.svg

And the user would see a diagram in Chinese (it might be the wrong Chinese, zh-hant, due to the librsvg bug that only looks up to the first hyphen).

Then somebody got cute with the language tags. Instead of displaying lowercase langtags, they wanted to display the preferred IETF capitalization. Instead of "zh-hans", IETF prefers "zh-Hans". Somewhere, that fix derailed ImagePage.php from just displaying the preferred capitalization to actually putting "&lang=zh-Hans" into the URL instead of "&lang=zh-hans".

The end result would be a URL such as

https://commons.wikimedia.org/w/index.php?lang=zh-Hans&title=File%3AFirst_Ionization_Energy.svg

Instead of displaying Chinese, that URL would just display English. MW parsing wanted to see a lowercase lang; when the lang didn't match due to an uppercase character. (If the URL were hand modified to use lowercase, then Chinese would display. Well, that used to be the case until bug #4 came along. Today, the file page with "lang=zh-Hans" displays the no-lang version (img src=".../800px-First_Ionization_Energy.svg.png") and "lang=zh-hans" displays a 404 alt (img src="/langzh-hans-800px-First_Ionization_Energy.svg.png" fails to load so alt= displayed).)

The fix is a few tweaks to ImagePage.php. The right thing is to clean up the handling of the prettified langtag (IETF capitalization should only be used in select option label=), but a quick and dirty solution is just lowercase the strings fed to the option value= attributes. It's fixing a few lines of code.

Consequently, a few lines of code should change in ImagePage.php. Instead, there's extensive movement of code and unclear logical changes.

I agree that displaying a language option that isn't actually mentioned in the SVG file is confusing. Yes, it could be removed, but it is not the fundamental issue. The problem is the user selects zh-Hans and gets English instead of Chinese. That problem is fixed by downcasing the option value.

Guiding principle: MW should only generate URLs with lowercase langtags. It need not process mixed case langtags.

  1. By early September 2017, MW image display stopped accepting URLs with hyphenated langtags such as "zh-hans" and "sr-ec". If you clicked on this "zh-hans" link, you will not get an image.

https://upload.wikimedia.org/wikipedia/commons/thumb/1/1d/First_Ionization_Energy.svg/langzh-hans-440px-First_Ionization_Energy.svg.png

Instead, it would give 404 not found.

Gilles did a one line fix for this display hyphenated-language problem. The production code is currently displaying an image rather 404'ing.

That's what should be happening to most of these issues. The code worked before, but now it is busted. Some simple edits should get it working again.

#5 There's a subtle issue about an SVG file's default language (language of the unconditional clause in a switch element). MW assumes the default is "en". Essentially, MW does not know the SVG's default language: it might be "en", but it might be "de" or "fr" or something else. A user should be able to view the SVG's. The simple way to do that (since libsrv will assume "en") is always include a default language select option whose value is an unlikely langtag such as "qqq" (private langtag) or "tlh" (Klingon).

To illustrate the point, this URL (no lang specified) should be the language default, but it is "en":

https://upload.wikimedia.org/wikipedia/commons/thumb/e/e5/Globe_valve_diagram.svg/409px-Globe_valve_diagram.svg.png

The Klingon version will show the default (which is numbers):

https://upload.wikimedia.org/wikipedia/commons/thumb/e/e5/Globe_valve_diagram.svg/langtlh-409px-Globe_valve_diagram.svg.png

The logic is a fix to one file, ImagePage.php, but "en" is so embedded in MW logic right now that fixing it right now is not important.

All the patch must do is always add value="qqq" (or some other appropriate private-use (qaa-qtz) langtag but not mul), option="<Default message>". That can be done in a few lines.

But I'd leave the default out for now because ImagePage.php patch changes 113 lines, and that is too much. Some verification has been diked out, some code has moved, and lots of logic has changed.

The patch is too extensive for what it needs to do. It need only fix one line in SVGMetadataExtractor.php and a few lines in ImagePage.php.

It should not be adding getMatchedLanguage() to File.php. That's not what this bug report is about. When a handler needs to do language matching, then it can look at $params['lang']. The File interface does not need to be augmented.

Throw everything away. Kill the whole patch and start over.

Then do a patch that fixes the problems simply and directly in ten lines or less.

I would do it as two patches. Patch one is the multiset with a one line edit to the metadata extractor. Patch two is just to ImagePage to make sure the option values are lowercase and whatever other simple window dressing is added.

[snip background]

This one-line fix to the multiset problem has been described a couple times above but ignored. There is no reason for the >SVGMetadataExtractor code to distinguish different langtag capitalizations. The canonization should happen in the extractor.

Cparle's patch does not fix the multiset problem at its origin in the extractor. Instead, it does the lowercasing at SVG.php line 100 and then >reduces the multiset with array_unique at SVG.php:105. Fix it where it is broken and SVG.php can be left alone.

Which stage to canonicalize the language tag at is a minor issue, which quite frankly I don't really think matters, and certainly not something worth reverting the patch over.

Cparle's patch does other work in SVG.php, but that work seems to be inappropriate or unneeded.

Large changes to the code do not have a clear purpose or are just confused. The getMatchedLanguage() code is both not needed and wrong. The SVG langtag superset comparison is a simple test rather than the nested loop at 129-133. The SVG spec gives an English version of the code. Some of the unit tests for the function show that the function does not implement SVG semantics. The logic of gML("en", ["en-gb", "en-us"]) returning "en-gb" is confused. There is no guarantee that switch clauses always use systemLanguage in the same order. It is easy to make SVG("en") look different from SVG("en-gb") and different from SVG("en-us") even when systemLanuage="en" does not appear in the SVG file: just swap the order of the langtags in two switch statements. Fundamentally, if the user specifies "en", then the user agent is not allowed to change that preference to "en-gb" or "en-us".

The code is used in the commit, and seems needed in order that thumbnail names have only a single canonical name. Thus in my view the function is neccessary.

The algorithm for getMatchedLanguage() is correct. Per the svg spec, if the user specifies "en", then the implementation is allowed to use either "en-gb" or "en-us". The spec is ambigious as to what to do when multiple tags match. So there is no error in respect to that.

The loop is a valid way to implement the spec in regards to tags with multiple dashes in them (e.g. user preference en-US and switch systemLanguage of en-US-x-piglatin). Its not the only way to do it, but this is not a hot code path, so I think its a totally fine way to implement the algorithm.

if the user specifies "en", then the user agent is not allowed to change that preference to "en-gb" or "en-us".

That's literally the opposite of what the spec says. The spec is very clear that en user preference matches en-GB systemLanguage attribute.

The change at line 482 is not the appropriate test. The original code looked for well-formed IETF langtags. That's the right thing. The change does not help the code and apparently permits langtags with periods in them. Consequently, the patch dikes out the reject-langtags-with-periods unit test at parserTests.txt line 15049. The unit test was valid before and should not have been changed. The parser should reject langtags with periods. IETF langtags are letters, digits, and hyphens. No periods, no colons, no underscores; no hash marks.

I'm not sure about the rationale for the change in SVG::validateParam from Language::isValidBuiltInCode() to Language::isValidCode(). You're right that bcp47 bans periods and colons (Underscores are an interesting case historically due to locale names. Many implementations convert between the two forms like libicu).

At line 502, the patch removes a lowercase side-effect on $params['lang']. Other code could reasonably use that side-effect; MW wants its >langtags to be lower case. The side effect may not be used later, but it seems to be a reasonable practice. Why dike it out?

I have no idea what this is referring to. Line 502 of what file?

The dropdown option values (which are langtags) in ImagePage should be lowercase. IIRC, when it was first implemented, the MW switch processing sensibly used all lowercase langtags, and the desired result happened: the form URLs would have lowercase langtags. A user would select the option for "zh-hans" on the file page, click go, and the browser would go to new URL for that option:

I don't see any reason to prefer all lowercase over bcp47 canonicalization. If I was doing the change, I would probably lowercase it in urls, but it really doesn't matter, and I think its 100% fine to use bcp47 canonicalization in urls.

Then somebody got cute with the language tags. Instead of displaying lowercase langtags, they wanted to display the preferred IETF capitalization. Instead of "zh-hans", IETF prefers "zh-Hans". Somewhere, that fix derailed ImagePage.php from just displaying the preferred capitalization to actually putting "&lang=zh-Hans" into the URL instead of "&lang=zh-hans".

The end result would be a URL such as

https://commons.wikimedia.org/w/index.php?lang=zh-Hans&title=File%3AFirst_Ionization_Energy.svg

Instead of displaying Chinese, that URL would just display English. MW parsing wanted to see a lowercase lang; when the lang didn't match due to an uppercase character. (If the URL were hand modified to use lowercase, then Chinese would display. Well, that used to be the case until bug #4 came along. Today, the file page with "lang=zh-Hans" displays the no-lang version (img src=".../800px-First_Ionization_Energy.svg.png") and "lang=zh-hans" displays a 404 alt (img src="/langzh-hans-800px-First_Ionization_Energy.svg.png" fails to load so alt= displayed).)

The fix is a few tweaks to ImagePage.php. The right thing is to clean up the handling of the prettified langtag (IETF capitalization should only be used in select option label=), but a quick and dirty solution is just lowercase the strings fed to the option value= attributes. It's fixing a few lines of code.

Consequently, a few lines of code should change in ImagePage.php. Instead, there's extensive movement of code and unclear logical changes.

I agree that displaying a language option that isn't actually mentioned in the SVG file is confusing. Yes, it could be removed, but it is not the fundamental issue. The problem is the user selects zh-Hans and gets English instead of Chinese. That problem is fixed by downcasing the option value.

This seems to be about an unrelated bug to the task at hand.

#5 There's a subtle issue about an SVG file's default language (language of the unconditional clause in a switch element). MW assumes the default is "en". Essentially, MW does not know the SVG's default language: it might be "en", but it might be "de" or "fr" or something else. A user should be able to view the SVG's. The simple way to do that (since libsrv will assume "en") is always include a default language select option whose value is an unlikely langtag such as "qqq" (private langtag) or "tlh" (Klingon).

We are not going to use qqq or tlh for that purpose. If this is really an issue (I kind of doubt it), we should just use no language tag to specify the "not english" condition and require an explicit "en" language tag when we want english. Nonetheless, not an issue with this specific task, and at worst probably an upstream issue in librsvg

It should not be adding getMatchedLanguage() to File.php. That's not what this bug report is about. When a handler needs to do language matching, then it can look at $params['lang']. The File interface does not need to be augmented.

I probably would not have added that to File.php (And instead I would have just done $file->getHandler()->getMatchedLanguage()) but the approach of adding getMatchedLanguage() to file is a valid one, and I have no objections to it.


Anyways, My conclusion

Of all the issues you bring up, swaping Language::isValidBuiltInCode() to Language::isValidCode() is in my opinion the only valid (albeit minor) issue. Perhaps @Cparle could clarify his reasoning on that particular part of the change.

Quite frankly, the language you (@Glrx) use to describe this patch is way over the top, to the point of marking it harder to take you seriously. Calling for the patch to be reverted over minor differences of opinion as to the proper approach is a massive overeaction.

The docblock for Language::isValidBuiltInCode() says "Returns true if a language code is of a valid form for the purposes of internal customisation of MediaWiki", which suggests to me that it's not an appropriate test for a language code from a non-MW source.

AFAICS we validate the language code in order to prevent a security hole (rather than to provide an SVG validator), and Language::isValidCode() is adequate for this

@Glrx: While comments on technical aspects in neutral language are very welcome, please read and follow the Code of Conduct when it comes to the tone of language if you would like to continue interacting here. Also see T154132#3707368. Thanks.

The docblock for Language::isValidBuiltInCode() says "Returns true if a language code is of a valid form for the purposes of internal customisation of MediaWiki", which suggests to me that it's not an appropriate test for a language code from a non-MW source.

AFAICS we validate the language code in order to prevent a security hole (rather than to provide an SVG validator), and Language::isValidCode() is adequate for this

Its really more about distinguishing between, things that are valid page titles, and then can be customized by creating pages MediaWiki:Foo/somestuff and using ?uselang=somestuff (The so called uselang hack). For example https://commons.wikimedia.org/wiki/special:upload?uselang=hu_(old) https://commons.wikimedia.org/wiki/special:upload?uselang=ownwork vs things that can be valid subclasses of the Language class.

The former is basically any valid MediaWiki title that's not a path traversal issue, where the latter is more restrictive, and more closely matches a valid bcp47 language tag (except it expects fully lower cased). But both a very tied to MediaWiki's internal notions of language as opposed to external notions. I would say that Language::isValidBuiltInCode() more closely matches the definition of what the SVG spec considers a valid language tag (if you lowercase it first) where isValidCode() is basically just anything that the uselang url parameter will consider valid, which is basically anything.

On the other hand, you're right that we are mostly just passing things on to librsvg to deal with as it will, and are not generally attempting to validate the svg.

For futher context see 1e67922842.

Ok ... so I can see nothing there that suggests I should revert this. Passing it over for QA if that's ok with you @Bawolff

I am having trouble finding a file on beta commons that has similar settings/attributes. Is there an example problem or can we get a copy of this file on beta (https://commons.wikimedia.org/wiki/File:First_Ionization_Energy.svg)?

@ABorbaWMF anyone can upload files to Beta. You can just download the original of that file from Commons and upload it to Beta: https://commons.wikimedia.beta.wmflabs.org/wiki/Special:UploadWizard remember to copy/paste the licensing information found on Commons (license and list of authors).

Thanks all. @Ramsey-WMF also added a version of the article yesterday. We found issue with the non-latin characters (both Chinese variants, Cyrillic Serbian, etc). BUT, that seems to be fixed now. I am seeing all the variants working:

Screen Shot 2017-11-28 at 10.47.26 AM.png (1×1 px, 297 KB)

Screen Shot 2017-11-28 at 10.48.37 AM.png (1×1 px, 299 KB)