
SVG image wikisyntax can't use "lang=zh-hant"
Open, Stalled, Needs Triage | Public | BUG REPORT

Description

SVG image wikisyntax can't use "lang=zh-hant". For example:
https://commons.wikimedia.org/wiki/File:First_Ionization_Energy.svg

In the SVG, the systemLanguage values include en, fr, de, zh-hans, zh-hant, etc.

[[File:First Ionization Energy.svg|thumb|lang=en]] displayed in English;
[[File:First Ionization Energy.svg|thumb|lang=fr]] displayed in French;
[[File:First Ionization Energy.svg|thumb|lang=zh-hans]] displayed in Simplified Chinese;
but [[File:First Ionization Energy.svg|thumb|lang=zh-hant]] is incorrectly displayed in Simplified Chinese; it should be displayed in Traditional Chinese.

see test case:

PS: As in T154132, "lang=zh-Hant" does not work either:

[[File:First Ionization Energy.svg|thumb|lang=zh-Hant]] is incorrectly displayed in English.

Event Timeline

This is the same problem as in T125710. The language code from the SVG is in BCP 47 format, and this value is set as the environment variable LANG. The environment variable LANG must be in POSIX format, which has a different syntax. rsvg tries to convert this value back to BCP 47 format and compares it with the SVG languages. As far as I know, it uses the first matching value, and therefore it uses zh-hans. If the order in the SVG is different, another language may be selected.

The only way to solve this problem is to separate the language variables. The LANG variable must be in POSIX format and should be only used for system messages in rsvg. rsvg should get a new variable or parameter for selecting the language for the SVG.

There is a problem, but it is not with wikisyntax but rather librsvg. See T154132 which shows that different PNG filenames are generated for zh-hans and zh-hant, so the implication is librsvg mishandled the lang argument. @Fomafix is correct that this is also discussed at T125710, but I don't know enough about librsvg to assess the details.

IIRC, SVG.php notices there is a lang param and sets the LANG environment variable before calling librsvg.

That should be a typecheck violation because SVG.php is putting an IETF langtag string (which uses hyphens) into something that should be a Unix locale string (which uses underscores and may have other dirt at the end).

But there's also a typecheck violation inside librsvg because it, from what I've seen, plays fast and loose with langtag strings and locale strings.

The two may balance out.

Sometime back, @PhiLiP submitted a patch for librsvg:
https://bugzilla.gnome.org/attachment.cgi?id=320316&action=diff

That patch is ill advised, but it also shows that the unpatched librsvg is improperly matching langtags. Look at the left side diff around rsvg-cond.c line 105 to 111. The code does an improper substr match. Let a = "zh-Hans" and b = "zh-Hant". The strings should not match, but the code sees a hyphen at position 2 in b, so it checks the case-insensitive match up to the hyphen at line 111, sees "zh" equals "zh", and declares erroneously the two match. The matching algorithm does not follow the rules laid out in the SVG 1.1 specification. I gave an example of what the code should be, but I can barely use phabricator and don't know Bugzilla or even have an account there.

librsvg doing an improper hyphen match explains why @Shizhao only saw zh-Hans when he asked for zh-Hant.

librsvg may do what we want it to do with the LANG environment variable. I suspect it converts the underscores to hyphens and may even strip off the Unicode charset locale info, but wiki feeds it with a string that already has hyphens, so librsvg does not have to do any character manipulation. The matching bug could also be explained with a="zh_Hans" and b="zh-Hant", but that would mean librsvg took the input langtag and changed its hyphen to an underscore. That just seems implausible, but I have not examined the librsvg code beyond looking at PhiLiP's patch diff.

Right now, I think this issue is just an upstream librsvg bug. It's dirty, but I doubt that MediaWiki needs to convert the langtag to a UNIX/POSIX locale string for librsvg even though that would make the LANG environment variable the proper type according to the operating system specification. Moreover, feeding librsvg a locale string would not fix librsvg's langtag matching bug.

Ideally, librsvg should take a command line argument that is an acceptLanguages preference string. The SVG systemLanguage matching algorithm is more sophisticated than a single user agent language. With SVG 2.0 allowReorder-style processing, we could have language dependent preferences. If the image does not have uk, then it might be better to fall back to ru rather than en.
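As a rough illustration of preference-ordered selection, here is a minimal sketch in C. The names lang_matches and pick_language are hypothetical and are not part of librsvg or any SVG library; the matching rule is the SVG prefix rule (equal, or a prefix followed by a hyphen), compared case-insensitively as an assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/* Hypothetical helper: true if pref equals name, or pref is a prefix of
 * name and the next character in name is '-'. */
static bool lang_matches(const char *pref, const char *name)
{
    size_t pref_len = strlen(pref);
    size_t name_len = strlen(name);

    if (pref_len == 0 || pref_len > name_len)
        return false;
    if (strncasecmp(pref, name, pref_len) != 0)
        return false;
    return pref_len == name_len || name[pref_len] == '-';
}

/* Hypothetical selector: walk the user's preferences in order and return
 * the index in available[] of the first entry that matches, or -1 if
 * none of the preferences match anything. */
int pick_language(const char *const prefs[], size_t n_prefs,
                  const char *const available[], size_t n_available)
{
    for (size_t i = 0; i < n_prefs; i++) {
        for (size_t j = 0; j < n_available; j++) {
            if (lang_matches(prefs[i], available[j]))
                return (int)j;
        }
    }
    return -1;
}
```

With preferences uk, ru, en and an image that offers en, ru, de (in that document order), this sketch picks ru, whereas plain document-order switch processing would stop at en.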

The good news: Gnome #131 and Gnome #256 have been closed.

https://gitlab.gnome.org/GNOME/librsvg/blob/master/rsvg_internals/src/cond.rs

The possibly bad news: the fix uses a library that may want locale strings and may throw exceptions for invalid locale strings or invalid langtags (e.g., "ru-1"). That will have to be checked.

Previously, librsvg figured out the agent's language by editing the LANG environment variable to make it look like a langtag. Now it will use a locale library to do the langtag matching against a locale.

In the past (before Thumbor), MW setenv LANG to $lang and then exec'd librsvg. See https://doc.wikimedia.org/mediawiki-core/master/php/SvgHandler_8php_source.html at 319. I suspect Thumbor does the same thing.

We will need to check if that will still work. Will from_unix("en-US") work? Otherwise, MW will need to convert the $lang langtag (e.g., "en-US") into a locale string (e.g., "en_US").
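To show what such a conversion could and could not cover, here is a minimal sketch in C. The function langtag_to_posix_locale is hypothetical (it is not MediaWiki or librsvg code), and it assumes only the two simplest langtag shapes are convertible: a bare language, or language plus a two-letter region. Anything with a script subtag, such as zh-Hant, is rejected because a POSIX-style locale string has no obvious slot for it.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Length of the leading run of letters in s (ASCII letters in the
 * default "C" locale). */
static size_t alpha_run(const char *s)
{
    size_t n = 0;
    while (isalpha((unsigned char)s[n]))
        n++;
    return n;
}

/* Hypothetical converter: "en-US" -> "en_US", "de" -> "de".
 * Returns false for any tag that is not language or language-REGION,
 * or if the output buffer is too small. */
bool langtag_to_posix_locale(const char *tag, char *out, size_t outsz)
{
    size_t lang_len = alpha_run(tag);
    size_t pos = 0;

    if (lang_len < 2 || lang_len > 3 || lang_len + 1 > outsz)
        return false;
    for (size_t i = 0; i < lang_len; i++)
        out[pos++] = (char)tolower((unsigned char)tag[i]);

    if (tag[lang_len] == '\0') {          /* bare language */
        out[pos] = '\0';
        return true;
    }
    if (tag[lang_len] != '-')
        return false;

    const char *region = tag + lang_len + 1;
    if (alpha_run(region) != 2 || region[2] != '\0')
        return false;                     /* script, variant, etc. */
    if (pos + 4 > outsz)
        return false;

    out[pos++] = '_';
    out[pos++] = (char)toupper((unsigned char)region[0]);
    out[pos++] = (char)toupper((unsigned char)region[1]);
    out[pos] = '\0';
    return true;
}
```

Under these assumptions "en-US" becomes "en_US" and "zh-tw" becomes "zh_TW", while "zh-Hant" and "zh-Hans" are refused, which is exactly the case this task is about.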

Or wait until librsvg will take the preferred languages as a list of langtags:

https://gitlab.gnome.org/GNOME/librsvg/issues/356 "Provide a way to specify the user's preferred languages"

SVG uses BCP 47 language codes, and MediaWiki uses BCP 47 language codes as well (at least there is a mapping from the internal MediaWiki language code to a BCP 47-conformant language code). The Unix environment variable LANG contains a language code that follows a different standard. It is not possible to convert a BCP 47 language code to a Unix language code and back to a BCP 47 language code without losing information. Therefore the language code must not be transferred from MediaWiki to librsvg via the environment variable LANG. librsvg needs a separate parameter that takes a BCP 47 language code.

Currently, Thumbor takes whatever was passed to it by MW in the URL, upper-cases it, and sets LANG to that value.

Here's a test case with zh-Hant, zh-Hans, zh-cn, zh-hk, zh-tw, zh-sg and a fallback option.

rsvg-convert 2.40.21 will use zh-Hans for all values of zh-foo (zh-Hans, zh-Hant, zh-cn, zh-hk, etc.). Fontconfig will complain about it, but zh-Hans text will be rendered.

$ LANG=ZH-foo rsvg-convert -f png -u -o zhtest.svg.20.png zhtest.svg

zhtest.svg.20.png (500×500 px, 2 KB)

Using rsvg-convert 2.44.10 in a fresh Debian Buster container (with fonts-noto installed):

$ LANG=ZH-HANT rsvg-convert -f png -u -o zhtest.svg.png zhtest.svg
Fontconfig warning: ignoring ZH-HANT: not a valid region tag
$ LANG=ZH-HANS rsvg-convert -f png -u -o zhtest.svg.png zhtest.svg
Fontconfig warning: ignoring ZH-HANS: not a valid region tag

zh-hant and zh-hans both produce a warning because Fontconfig uses and is expecting a lang-territory tag, but it got a language tag (See "Lang Tags" in https://www.freedesktop.org/software/fontconfig/fontconfig-user.html). So, let's try the four possible zh locales:

$ LANG=ZH-TW rsvg-convert -f png -u -o zhtest.svg.png zhtest.svg
$ LANG=ZH-CN rsvg-convert -f png -u -o zhtest.svg.png zhtest.svg
$ LANG=ZH-HK rsvg-convert -f png -u -o zhtest.svg.png zhtest.svg
$ LANG=ZH-SG rsvg-convert -f png -u -o zhtest.svg.png zhtest.svg

No errors. That's good, right? Not so fast. Even though fontconfig could parse the locales, rsvg couldn't find any text for them. All six commands resulted in the same file with no translated text. Most importantly, that means that librsvg did not fall back to the switch option without a systemLanguage.

zhtest.svg.png (500×500 px, 1 KB)

$ LANG=ZH rsvg-convert -f png -u -o First_Ionization_Energy.svg.png First_Ionization_Energy.svg 

With just the bare zh tag, we do actually get zh-Hans text.

zhtest.svg.png (500×500 px, 2 KB)

To get the fallback text, LANG has to be blank.

Finally, with rsvg-convert 2.50.1, we get the fallback text instead of nothing.

$ LANG=zh-hant rsvg-convert -f png -u -o zhtest.svg.1.png zhtest.svg          
Fontconfig warning: ignoring zh-hant: not a valid region tag
$ LANG=zh-hans rsvg-convert -f png -u -o zhtest.svg.1.png zhtest.svg          
Fontconfig warning: ignoring zh-hans: not a valid region tag
$ LANG=zh-cn rsvg-convert -f png -u -o zhtest.svg.1.png zhtest.svg

zhtest.svg.1.png (500×500 px, 2 KB)

LANG=zh still results in zh-Hans.

So while some upstream bugs have been fixed, the core problem that <switch> SVG translation doesn't work for zh-Hant has not been fixed. https://gitlab.gnome.org/GNOME/librsvg/-/issues/356 seems to be the most likely to produce a usable solution, but this depends on whether it's more of a Fontconfig bug or a librsvg bug.

SVG language name handling is a mess on many levels.

The short version is that installing the latest version of librsvg may break multilingual SVG file display on MediaWiki.

SVG Spec

The SVG specification is clear about how systemLanguage should be processed. SVG 1.1 wants the attributes to hold a comma-separated list of BCP 47 langtags. However, the specification is permissive in that it does not require conforming langtags; in SVG 1.1, the strings are "language names" rather than langtags. SVG specifies how the provided language name tokens are matched. The matching does not involve langtag canonization or langtag field awareness. Consequently, en-US does not SVG match en-Latn-US or en-*-US. Neither does en-Latn-US match en-US.

See https://www.w3.org/TR/SVG11/struct.html#SystemLanguageAttribute

The SVG 2 recommendation makes a stronger statement about the tokens being langtags, but it still uses the same string-matching algorithm. See https://svgwg.org/svg2-draft/struct.html#SystemLanguageAttribute .

In particular, SVG requires simple string matching of langtags. It does not seek IETF canonization or extended filtering.
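A minimal sketch in C of that simple string matching over a comma-separated systemLanguage value follows. The function names are hypothetical and this is not librsvg's or any browser's implementation; the case-insensitive comparison is an assumption (BCP 47 tags are case-insensitive), and whitespace around the commas is skipped.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/* True if pref equals the token name[0..name_len), or is a prefix of it
 * and the character right after the prefix is '-'. No subtag parsing,
 * no canonicalization. */
static bool token_matches(const char *pref, const char *name, size_t name_len)
{
    size_t pref_len = strlen(pref);

    if (pref_len == 0 || pref_len > name_len)
        return false;
    if (strncasecmp(pref, name, pref_len) != 0)
        return false;
    return pref_len == name_len || name[pref_len] == '-';
}

/* Hypothetical check: does the user language pref match any language
 * name in the systemLanguage attribute value attr? */
bool system_language_matches(const char *pref, const char *attr)
{
    const char *p = attr;

    while (*p != '\0') {
        /* Skip commas and whitespace between tokens. */
        while (*p == ',' || isspace((unsigned char)*p))
            p++;
        const char *start = p;
        while (*p != '\0' && *p != ',' && !isspace((unsigned char)*p))
            p++;
        if (p > start && token_matches(pref, start, (size_t)(p - start)))
            return true;
    }
    return false;
}
```

Because the comparison is purely textual, a preference of en matches en-US, but en-US does not match en-Latn-US and en-Latn-US does not match en-US.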

Also, SVG specifies that the language preferences for the user agent (e.g., the browser) are matched against the language names in systemLanguage. SVG does not use the lang or xml:lang attributes of the including document. Consequently, an HTML page written in German (lang="de") that includes a multilingual SVG document will display the SVG in the user agent's preferred language rather than the HTML document's language. Thus a user viewing the page might see German text with English illustrations rather than German throughout.

Agents

Browsers had been confused about how SVG systemLanguage matching worked. A few years ago, Chrome, Edge, and Firefox did not do it correctly. A couple years ago, Chrome and Firefox conformed. Edge recently switched to Chromium, so it may follow the rules. In any event, the major browsers offer better SVG support than the version of librsvg that MediaWiki currently uses.

librsvg is more tortured. Its initial string matching algorithm was flawed so it only matched to the first hyphen. Consequently, zh-Hans would match zh-Hant because zh is equal to zh. These were Gnome bug reports 131 and 256.

https://gitlab.gnome.org/GNOME/librsvg/-/issues/131
https://gitlab.gnome.org/GNOME/librsvg/-/issues/256

It's been a couple years since I last looked at this. IIRC, here is what is going on.

Gnome closed those bugs, but instead of using SVG's string matching algorithm, Gnome decided to use a locale library to do the matches, and that brings about confused types. A langtag has a specific text format. A Unix locale string does not have an absolute specification. Typically, the first part of a Unix locale string looks like a langtag but has underscores instead of hyphens: "en_US" instead of "en-US". A Unix locale string may have many other components. It might be, for example, "en_GB.UTF-8@euro".
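To make the difference in string shape concrete, here is a minimal sketch in C of the naive conversion from a locale string to something that looks like a langtag: cut at the first '.' or '@' and swap underscores for hyphens. The function name posix_locale_to_langtag is hypothetical, and this is not the algorithm used by librsvg or its locale library; it only illustrates the formats involved.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical converter: keep the part of the locale string before any
 * ".codeset" or "@modifier" and replace '_' with '-'.
 * Returns false if that part is empty or does not fit in out. */
bool posix_locale_to_langtag(const char *locale, char *out, size_t outsz)
{
    size_t n = strcspn(locale, ".@");   /* length up to '.' or '@' */

    if (n == 0 || n + 1 > outsz)
        return false;
    for (size_t i = 0; i < n; i++)
        out[i] = (locale[i] == '_') ? '-' : locale[i];
    out[n] = '\0';
    return true;
}
```

With this sketch "en_GB.UTF-8@euro" becomes "en-GB", and a value that is already a langtag, such as "zh-Hant", passes through unchanged. It does nothing to recover a script subtag from a territory, so "zh_TW" stays "zh-TW" rather than becoming "zh-Hant".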

The SVG agent librsvg can only change its preferred language via the LANG environment variable. In the ordinary Unix world, LANG is a locale string rather than a langtag. Prior to the bug fixes, librsvg would just use the string in the environment variable as its preferred language. After the fixes, the LANG environment variable is parsed as a locale string to get a locale library object. That locale library object is then queried for its IETF langtag, and that langtag is used for the systemLanguage comparisons. The locale library is called to do those matches. It is possible that the library canonizes the systemLanguage langtags before doing its comparison. The comparison routines may also detect invalid langtags.

Consequently, if librsvg users want to specify an IETF langtag, then they must supply librsvg with a Unix locale string in the LANG environment variable that the locale library will resolve to the desired IETF langtag. That may be an impossible task. Some simple locale strings may easily work: "en_US" probably resolves to "en-US", and "en_GB" probably resolves to "en-GB". It may not be the same for "de_DE"; that might resolve to just "de". The zh (Chinese) macrolanguage may be impossible. Consider "zh-Hans" (simplified Chinese). Mainland China uses simplified Chinese, so "zh_CN" might produce a langtag of "zh-Hans", but it more likely produces the langtag "zh-Hans-CN". Alternatively, it may produce "cmn-Hans-CN".

Even if an appropriate langtag can be generated, librsvg may not follow the string matching semantics that SVG 1.1 and 2.0 require.

From the previous post, the font matcher is also expecting well-formed LANG environment variables.

Bottom line: librsvg users can no longer put IETF langtags into the LANG environment variable.

MediaWiki

MediaWiki also has its own confused aspects.

MW would be subject to SVG's confusion between user agent preferences and including document preferences if it serves SVG files. MW currently uses the document preference. If possible, multilingual SVG files are converted to PNGs using the Wikipedia's default langtag (or the lang argument in the transclusion). Consequently, the semantics of multilingual files will change if WMF starts serving SVG files directly. Directly served SVG files will default to the user agent's (browser's) preference rather than the Wikipedia's default langtag or the lang transclusion argument. If MW wants to keep the same semantics, it will need to localize SVG files before serving them.

MW is confused about the lang inclusion argument and the agent role. It reasonably checks for langtags within an SVG file to avoid duplicate PNGs, but it goes too far in other aspects.

MW inserts an IETF langtag into the LANG environment variable. That is not what librsvg will expect in the future. From the tests performed above, it seems like multilingual files may not display Chinese text (e.g., text that has systemLanguage="zh-Hans"). That would be a step backward from this bug report (where the first instance of "zh-" would be displayed).

MW needs issue 356 fixed, and then MW needs to change its code to use that fix.

Alternatively, MW could do its PNG conversion in a two step process. The first step localizes the SVG to the desired langtag. The second step passes the localized SVG file to librsvg.

librsvg 2.52.x will have a new --accept-language parameter, which will make it possible to specify the user's preferred languages by passing the HTTP Accept-Language header to librsvg: https://gitlab.gnome.org/GNOME/librsvg/-/issues/356 (Not sure whether it will get backported to the 2.50.x series.)

JoKalliauer changed the task status from Open to Stalled. Jun 30 2021, 2:28 PM
JoKalliauer moved this task from Backlog to librsvg (upstream) on the Wikimedia-SVG-rendering board.
Shizhao changed the subtype of this task from "Task" to "Bug Report".
Shizhao added a subscriber: Dringsim.

T310230 is the same as this bug, so I will close it.

The C version of librsvg has a backward langtag matching algorithm.

https://github.com/GNOME/librsvg/blob/librsvg-2-40/rsvg-cond.c

implements the language test with

if (locale) {
    for (i = 0; (i < nb_elems) && !permitted; i++) {
        if (rsvg_locale_compare (locale, elems[i]))
            permitted = TRUE;
    }

    g_free (locale);
}

where locale is gleaned from the $LANG environment variable and elems[i] are the systemLanguage attributes (obtained by calling rsvg_css_parse_list()). Notice that locale is passed as the first argument to rsvg_locale_compare. Here's the code:

/* http://www.w3.org/TR/SVG/struct.html#SystemLanguageAttribute */
static gboolean
rsvg_locale_compare (const char *a, const char *b)
{
    const char *hyphen;

    /* check for an exact-ish match first */
    if (!g_ascii_strncasecmp (a, b, strlen (b)))
        return TRUE;

    /* check to see if there's a hyphen */
    hyphen = strstr (b, "-");
    if (!hyphen)
        return FALSE;

    /* compare up to the hyphen */
    return !g_ascii_strncasecmp (a, b, (hyphen - b));
}

Notice that if the locale is tlh and one of the systemLanguage langtags is tl, then the first conditional will succeed incorrectly because the test will be

  • g_ascii_strncasecmp("tlh", "tl", 2)

The test here should be using the length of a rather than b; it has the test backwards.

This test is also the reason that zh-hans and zh-hant match incorrectly. In that case, the first conditional fails, but a hyphen is found at b+2. The relevant test then becomes

  • g_ascii_strncasecmp("zh-hans", "zh-hant", 2)

which succeeds incorrectly. It should never truncate the a argument.

So the bugs are the same: an incorrect SVG langtag test.
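Both false matches can be reproduced outside librsvg. The following is a minimal sketch that mirrors the logic of the quoted rsvg_locale_compare but swaps the GLib call g_ascii_strncasecmp for POSIX strncasecmp so that it compiles on its own; the name legacy_locale_compare is mine, and the swap is only equivalent for plain ASCII tags in the default "C" locale.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/* Mirror of the old comparison: a is the locale from LANG,
 * b is one systemLanguage entry. */
bool legacy_locale_compare(const char *a, const char *b)
{
    /* "exact-ish" match: only the first strlen(b) characters are compared */
    if (strncasecmp(a, b, strlen(b)) == 0)
        return true;

    /* look for a hyphen in the systemLanguage entry */
    const char *hyphen = strchr(b, '-');
    if (hyphen == NULL)
        return false;

    /* compare only up to that hyphen, ignoring the rest of both strings */
    return strncasecmp(a, b, (size_t)(hyphen - b)) == 0;
}
```

Calling legacy_locale_compare("tlh", "tl") and legacy_locale_compare("zh-hant", "zh-hans") both return true, which is the behavior described above.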

The locale compare test is fundamentally incorrect because it does not process hyphens correctly. The a langtag must be a prefix of the systemLanguage langtag. To follow the spec, the test should be

  • strcasecmp(a, b) == 0 OR
  • (a.len < b.len AND
  • b[a.len] == '-' AND
  • strncasecmp(a, b, a.len) == 0)
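The conditions listed above can be written in C roughly as follows. This is a sketch, not a patch for librsvg: the function name svg_lang_matches is hypothetical, it uses POSIX strncasecmp rather than GLib, and it adds a guard against an empty locale that the list does not mention.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/* a is the user's langtag, b is one systemLanguage entry.
 * True if a equals b, or a is a prefix of b and the next character
 * of b is '-'. The a argument is never truncated. */
bool svg_lang_matches(const char *a, const char *b)
{
    size_t a_len = strlen(a);
    size_t b_len = strlen(b);

    if (a_len == 0 || a_len > b_len)
        return false;               /* a cannot be a prefix of b */
    if (strncasecmp(a, b, a_len) != 0)
        return false;               /* the first a_len characters differ */

    /* Either the strings are equal, or the prefix ends at a hyphen. */
    return a_len == b_len || b[a_len] == '-';
}
```

With this test, zh-hant no longer matches zh-hans and tlh no longer matches tl, while en still matches en-US.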