u8_strconv_to_locale() misbehaves on OSX (Travis CI runner)

Discussion:

Tim Rühsen

2018-02-08 15:59:25 UTC

Trying to find out why the to_unicode tests of libidn2 have been failing
for a few months...

It happens on the OSX Travis CI runner; all the information I have is:

$ clang --version
Apple LLVM version 8.1.0 (clang-802.0.42)
Target: x86_64-apple-darwin16.7.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

locale_charset() returns with "UTF-8".

u8_strconv_to_locale() and u8_strconv_from_locale() seem not to work as
expected:

One problem seems to be that u8_strconv_to_locale() outputs decomposed
characters, e.g. u8_strconv_to_locale(bücher.de) returns b"ucher.de.

Hex/u32:

Result: U+0062 U+0022 U+0075 U+0063 U+0068 U+0065 U+0072 U+002e U+0064 U+0065

Expected: U+0062 U+00fc U+0063 U+0068 U+0065 U+0072 U+002e U+0064 U+0065
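
For anyone who wants to reproduce the two dumps above without relying on a terminal's rendering, here is a minimal sketch of a UTF-8 code point dumper. The function names decode_utf8 and dump_code_points are hypothetical and are not part of libunistring or libidn2. The decoder is deliberately simple and does not reject overlong forms or surrogates.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Decode one UTF-8 sequence at s (n bytes available) into *cp.
   Returns the number of bytes consumed, or 0 if the sequence is
   malformed or truncated. Sketch only: overlong forms and surrogates
   are not rejected. */
size_t decode_utf8(const unsigned char *s, size_t n, unsigned long *cp)
{
    size_t len, i;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        *cp = s[0] & 0x1F;
        len = 2;
    } else if ((s[0] & 0xF0) == 0xE0) {
        *cp = s[0] & 0x0F;
        len = 3;
    } else if ((s[0] & 0xF8) == 0xF0) {
        *cp = s[0] & 0x07;
        len = 4;
    } else {
        return 0;
    }
    if (len > n)
        return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                     /* not a continuation byte */
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return len;
}

/* Write the code points of str into out as "U+0062 U+00fc ...".
   Returns 0 on success, -1 on malformed input or if out is too small. */
int dump_code_points(const char *str, char *out, size_t outsize)
{
    const unsigned char *s = (const unsigned char *)str;
    size_t n = strlen(str);
    size_t used = 0;

    if (outsize == 0)
        return -1;
    out[0] = '\0';
    while (n > 0) {
        unsigned long cp;
        size_t len = decode_utf8(s, n, &cp);
        int w;

        if (len == 0)
            return -1;
        w = snprintf(out + used, outsize - used, "%sU+%04lx",
                     used ? " " : "", cp);
        if (w < 0 || (size_t)w >= outsize - used)
            return -1;
        used += (size_t)w;
        s += len;
        n -= len;
    }
    return 0;
}
```

Passing the UTF-8 bytes of the input and of the value returned by the conversion through dump_code_points() should make it easy to see whether U+00fc survived or was replaced by U+0022 U+0075.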

The second problem is that characters beyond 255 are translated into ?
(U+003f).

Do you have any hints how to fix these problems?
I would expect u8_strconv_to_locale() to work in a defined manner on
UTF-8 locales - but maybe I am wrong. I could apply a normalization step
in the test itself, but I am not sure that is the correct solution.

For problem 2 I see no solution right now.

With Best Regards, Tim

Bruno Haible

2018-02-08 17:05:34 UTC


Hi Tim,

Post by Tim Rühsen
locale_charset() returns with "UTF-8".

That is as it should be on Mac OS X.

Post by Tim Rühsen
u8_strconv_to_locale() and u8_strconv_from_locale() seem not to work as
expected:
One problem seems to be that u8_strconv_to_locale() outputs decomposed
characters, e.g. u8_strconv_to_locale(bücher.de) returns b"ucher.de.
Result: U+0062 U+0022 U+0075 U+0063 U+0068 U+0065 U+0072 U+002e U+0064 U+0065
Expected: U+0062 U+00fc U+0063 U+0068 U+0065 U+0072 U+002e U+0064 U+0065

This would indicate that locale_charset() returns "ASCII".
What happens then is this: u8_strconv_to_locale invokes
u8_strconv_to_encoding, which invokes mem_iconveha with transliterate=true,
which appends '//TRANSLIT' when invoking iconv_open. So you get the
transliteration, e.g. from 'ü' to '"u'.
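
The mechanism described above can be sketched with plain iconv calls. This is not libunistring's actual code; the function names translit_name and utf8_to_charset_translit are hypothetical, and the sketch assumes an iconv implementation that accepts the '//TRANSLIT' suffix (GNU libiconv and glibc do). The replacement chosen for an unrepresentable character depends on the implementation, so no particular output is promised here. On systems where iconv lives in a separate library, linking with -liconv may be needed.

```c
#include <iconv.h>
#include <stdio.h>
#include <string.h>

/* Build "<charset>//TRANSLIT" in buf.
   Returns 0 on success, -1 if buf is too small. */
int translit_name(const char *charset, char *buf, size_t bufsize)
{
    int n = snprintf(buf, bufsize, "%s//TRANSLIT", charset);
    return (n < 0 || (size_t)n >= bufsize) ? -1 : 0;
}

/* Convert a NUL-terminated UTF-8 string to to_charset, asking iconv to
   transliterate characters that the target charset cannot represent.
   Returns 0 on success, -1 on any failure. */
int utf8_to_charset_translit(const char *to_charset, const char *utf8,
                             char *out, size_t outsize)
{
    char name[64];
    iconv_t cd;
    char *in = (char *)utf8;       /* iconv does not modify the input */
    size_t inleft = strlen(utf8);
    char *outp = out;
    size_t outleft;
    size_t rc;

    if (outsize == 0 || translit_name(to_charset, name, sizeof name) != 0)
        return -1;
    outleft = outsize - 1;         /* keep room for the terminating NUL */

    cd = iconv_open(name, "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;

    rc = iconv(cd, &in, &inleft, &outp, &outleft);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &outp, &outleft);  /* flush shift state */
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;

    *outp = '\0';
    return 0;
}
```

Calling utf8_to_charset_translit("ASCII", ...) on a string containing 'ü' is a quick way to see what the local iconv substitutes when the target charset really is ASCII, and to compare that with the output seen in the failing test.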

Post by Tim Rühsen
The second problem is that characters beyond 255 are translated into ?
(U+003f).

This would indicate that locale_charset() returns "ISO-8859-1". The
question marks then come from the transliteration, again.

Post by Tim Rühsen
Do you have any hints how to fix these problems?

I would compile without -O and with -ggdb, then single-step through the code,
paying particular attention to the value of locale_charset() and to
the arguments of iconv_open().

Post by Tim Rühsen
I would expect u8_strconv_to_locale() to work in a defined manner on
UTF-8 locales

That's certainly how it is intended to be.

Bruno

Tim Ruehsen

2018-02-08 19:22:02 UTC


Hi Bruno,

thanks for your answer... after thinking about the ambiguous output of
locale_charset() this might be an explanation:

Libidn2 (and the tests) use libunistring installed from Homebrew, while
my direct call to locale_charset() is from gnulib.

So my build correctly says UTF-8, but the Homebrew libunistring has
been built on some unknown (OSX?) system with its own version of
locale_charset() returning ASCII. I said I get ? for characters > 255,
but I didn't verify that. Maybe it is characters > 127.

The bad thing is, I only experience this on a Travis CI build and so
can't use gdb for single stepping.
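
Where no debugger is available, a small diagnostic compiled and run as part of the CI job can stand in for single-stepping by writing locale facts to the build log. The sketch below uses only POSIX calls; the function names environment_codeset and print_locale_diagnostics are hypothetical. It reports what the C library says about the codeset, which is an approximation: it does not show what the locale_charset() inside a separately built libunistring returns.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

/* Activate the locale from the environment and return the codeset name
   reported by the C library, or "(unknown)" if none is reported. */
const char *environment_codeset(void)
{
    const char *codeset;

    setlocale(LC_ALL, "");
    codeset = nl_langinfo(CODESET);
    return (codeset != NULL && codeset[0] != '\0') ? codeset : "(unknown)";
}

/* Print locale facts to fp so that they show up in the CI log. */
void print_locale_diagnostics(FILE *fp)
{
    const char *codeset = environment_codeset();
    const char *locale_name = setlocale(LC_CTYPE, NULL);  /* query only */

    fprintf(fp, "LC_CTYPE locale: %s\n",
            locale_name != NULL ? locale_name : "(null)");
    fprintf(fp, "nl_langinfo(CODESET): %s\n", codeset);
}
```

Calling print_locale_diagnostics(stderr) at the start of the failing test, next to a print of the value returned by locale_charset(), would show in the log whether the two disagree.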

But an option is to build libunistring from sources in the CI and
link/test with that.

Regards, Tim
