threads + utf8::all == boom #42

schwern · 2016-08-08T21:33:14Z

As demonstrated in the code below, utf8::all and threads == segfault for Perl < 5.24.

It's a long, long, long standing bug that :encoding(utf-8) + threads == segfault. It was only just fixed. See https://rt.perl.org/Public/Bug/Display.html?id=31923

This is the root cause of a long standing issue in perl5i.
evalEmpire/perl5i#271

Since fork() on Windows uses threads, this also means utf8::all + fork() == boom on Windows.

One work around would be to switch to :utf8. This is one of my preferred options. It makes utf8::all just work (its goal). OTOH this introduces a backwards compat problem. OT3H most people don't know the difference between the two modes and will never notice.

Another is to use :utf8 when Perl < 5.24. This would also be my preferred option.

Another is to use :utf8 when threads are on. I don't like this because A) it introduces an ordering between utf8::all and threads and B) now utf8::all is acting differently depending on whether threads are on or off.

Another is to add an option to use :utf8. I don't like this because A) it throws the problem onto the user; they probably don't even know there is a problem and will have to track down a frustrating segfault. B) STD* and @ARGV are effectively global state, what happens when use utf8::all and use utf8::all ":utf8" are in the same process?

LMK what you'd like and I'll code it up.

#!perl

use strict;
use warnings;

use Test::More 0.96;
use Config;

BEGIN {
    plan skip_all => "Requires threads"
      if !$Config{usethreads};
}

use threads;
use threads::shared;
use utf8::all;

my $ok :shared = 0;
my $t = threads->create(sub { $ok = 1; });
$t->join();
ok $ok, "threads ok with utf8::all";

done_testing;

The text was updated successfully, but these errors were encountered:

HayoBaan · 2016-08-09T08:52:53Z

Hi Michael,
Thanks for checking this out. I would like utf8::all to be thread OK so your suggestions are great. We just need to think about the best way to configure this…
The reason we use :encoding(UTF-8) is because it is the strictest and “safest” method so I really would like that to still be the default somehow, also for backwards compatibility.

So I am now thinking an option sounds best, but indeed has the problems you mention. Regardless of the way we implement this, we sure must write a warning in the documentation about threading on < v5.24.

If we decide not to go for an option, I think I like your second option best, but then I would suggest to add another check (e.g., if $Config{usethreads}).

Your thoughts please 😃

schwern · 2016-08-09T16:49:32Z

Yes to clear docs about this.

I don't like breaking backwards compat, but I can't think of a scenario where :utf8 will break any production code. I'm not talking about code that deliberately looks at IO layers and assumes utf8::all will turn on strict-utf8. AFAIK running, working code will continue to work. So that's something.

Another backwards compatibility consideration is utf8::all does not have a large chain of dependent modules relying on its behavior. Very few CPAN modules depend on utf8::all, and they're mostly apps which have no dependencies. So I don't see there being a large cascade of hidden consequences.

How about something along the lines of...

Perl's Unicode implementation has bugs, particularly with threads. utf8::all will try to use the strictest encoding available, probably :encoding(utf-8), but may choose a less strict utf-8 encoding, probably :utf8, if it detects your Perl is vulnerable to Unicode bugs.

The target audience for utf8::all is people who don't need or want to know the details, they just want to safely work with utf8 data. By leaving the published details of exactly what encoding it will choose vague, users are discouraged from relying on it. And it allows utf8::all to change the details of how it chooses the encoding, like adding a check for $Config{usethreads}, or dealing with Yet Another Unicode Bug, or using a new utf-8 encoding.

I deliberately left out an option for users to choose their own encoding. This would lead to two pieces of code in the same process wanting two different encodings. Because utf8::all has global effects (@argv, STDOUT, STDERR, STDIN) that will be difficult to safely swap out lexically.

Leont · 2016-08-10T16:47:10Z

There is a third option here: use PerlIO::utf8_strict. It's about as performant as :utf8, but actually validates the data (similar to :encoding(utf-8) with FB_CROAK).

That means we don't have to sacrifice correctness and security for stability.

schwern · 2016-08-10T18:28:36Z

@Leont Sounds like a good option, but I can't get it to work with use open.

$ perl -wle 'use PerlIO::utf8_strict; use open ":std", ":utf8_strict"'
Unknown PerlIO layer class ':utf8_strict' (need IN, OUT or IO) at -e line 1.
BEGIN failed--compilation aborted at -e line 1.

Leon suggested using PerlIO::utf8_strict as the default. doherty#42 (comment) It seems to work. I'm assuming there's still value in trying to use :encoding(UTF-8)? Maybe it's faster?

Leont · 2016-08-10T22:00:46Z

Sounds like a good option, but I can't get it to work with use open.

Apparently open.pm is special-cased so that for :utf8, :encoding() and :locale, the IO argument isn't necessary, but for all other layers it is. This works fine for me:

perl -wle 'use open ":std", "IO", ":utf8_strict"'

Leon suggested using PerlIO::utf8_strict as the default. doherty#42 (comment) It seems to work. I'm assuming there's still value in trying to use :encoding(UTF-8)? Maybe it's faster?

schwern mentioned this issue Aug 9, 2016

Choose a utf-8 encoding that will not break threads. #43

Merged

HayoBaan closed this as completed in #43 Aug 9, 2016

schwern mentioned this issue Aug 10, 2016

Use PerlIO::utf8_strict as the default. #44

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

threads + utf8::all == boom #42

threads + utf8::all == boom #42

schwern commented Aug 8, 2016 •

edited

Loading

HayoBaan commented Aug 9, 2016

schwern commented Aug 9, 2016 •

edited

Loading

Leont commented Aug 10, 2016 •

edited

Loading

schwern commented Aug 10, 2016

Leont commented Aug 10, 2016 •

edited

Loading

threads + utf8::all == boom #42

threads + utf8::all == boom #42

Comments

schwern commented Aug 8, 2016 • edited Loading

HayoBaan commented Aug 9, 2016

schwern commented Aug 9, 2016 • edited Loading

Leont commented Aug 10, 2016 • edited Loading

schwern commented Aug 10, 2016

Leont commented Aug 10, 2016 • edited Loading

schwern commented Aug 8, 2016 •

edited

Loading

schwern commented Aug 9, 2016 •

edited

Loading

Leont commented Aug 10, 2016 •

edited

Loading

Leont commented Aug 10, 2016 •

edited

Loading