Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

threads + utf8::all == boom #42

Closed
schwern opened this issue Aug 8, 2016 · 5 comments
Closed

threads + utf8::all == boom #42

schwern opened this issue Aug 8, 2016 · 5 comments

Comments

@schwern
Copy link
Contributor

schwern commented Aug 8, 2016

As demonstrated in the code below, utf8::all and threads == segfault for Perl < 5.24.

It's a long, long, long standing bug that :encoding(utf-8) + threads == segfault. It was only just fixed. See https://rt.perl.org/Public/Bug/Display.html?id=31923

This is the root cause of a long standing issue in perl5i.
evalEmpire/perl5i#271

Since fork() on Windows uses threads, this also means utf8::all + fork() == boom on Windows.

One work around would be to switch to :utf8. This is one of my preferred options. It makes utf8::all just work (its goal). OTOH this introduces a backwards compat problem. OT3H most people don't know the difference between the two modes and will never notice.

Another is to use :utf8 when Perl < 5.24. This would also be my preferred option.

Another is to use :utf8 when threads are on. I don't like this because A) it introduces an ordering between utf8::all and threads and B) now utf8::all is acting differently depending on whether threads are on or off.

Another is to add an option to use :utf8. I don't like this because A) it throws the problem onto the user; they probably don't even know there is a problem and will have to track down a frustrating segfault. B) STD* and @ARGV are effectively global state, what happens when use utf8::all and use utf8::all ":utf8" are in the same process?

LMK what you'd like and I'll code it up.

#!perl

use strict;
use warnings;

use Test::More 0.96;
use Config;

BEGIN {
    plan skip_all => "Requires threads"
      if !$Config{usethreads};
}

use threads;
use threads::shared;
use utf8::all;

my $ok :shared = 0;
my $t = threads->create(sub { $ok = 1; });
$t->join();
ok $ok, "threads ok with utf8::all";

done_testing;
@HayoBaan
Copy link
Collaborator

HayoBaan commented Aug 9, 2016

Hi Michael,
Thanks for checking this out. I would like utf8::all to be thread OK so your suggestions are great. We just need to think about the best way to configure this…
The reason we use :encoding(UTF-8) is because it is the strictest and “safest” method so I really would like that to still be the default somehow, also for backwards compatibility.

So I am now thinking an option sounds best, but indeed has the problems you mention. Regardless of the way we implement this, we sure must write a warning in the documentation about threading on < v5.24.

If we decide not to go for an option, I think I like your second option best, but then I would suggest to add another check (e.g., if $Config{usethreads}).

Your thoughts please 😃

@schwern
Copy link
Contributor Author

schwern commented Aug 9, 2016

Yes to clear docs about this.

I don't like breaking backwards compat, but I can't think of a scenario where :utf8 will break any production code. I'm not talking about code that deliberately looks at IO layers and assumes utf8::all will turn on strict-utf8. AFAIK running, working code will continue to work. So that's something.

Another backwards compatibility consideration is utf8::all does not have a large chain of dependent modules relying on its behavior. Very few CPAN modules depend on utf8::all, and they're mostly apps which have no dependencies. So I don't see there being a large cascade of hidden consequences.

How about something along the lines of...

Perl's Unicode implementation has bugs, particularly with threads. utf8::all will try to use the strictest encoding available, probably :encoding(utf-8), but may choose a less strict utf-8 encoding, probably :utf8, if it detects your Perl is vulnerable to Unicode bugs.

The target audience for utf8::all is people who don't need or want to know the details, they just want to safely work with utf8 data. By leaving the published details of exactly what encoding it will choose vague, users are discouraged from relying on it. And it allows utf8::all to change the details of how it chooses the encoding, like adding a check for $Config{usethreads}, or dealing with Yet Another Unicode Bug, or using a new utf-8 encoding.

I deliberately left out an option for users to choose their own encoding. This would lead to two pieces of code in the same process wanting two different encodings. Because utf8::all has global effects (@argv, STDOUT, STDERR, STDIN) that will be difficult to safely swap out lexically.

@Leont
Copy link
Contributor

Leont commented Aug 10, 2016

There is a third option here: use PerlIO::utf8_strict. It's about as performant as :utf8, but actually validates the data (similar to :encoding(utf-8) with FB_CROAK).

That means we don't have to sacrifice correctness and security for stability.

@schwern
Copy link
Contributor Author

schwern commented Aug 10, 2016

@Leont Sounds like a good option, but I can't get it to work with use open.

$ perl -wle 'use PerlIO::utf8_strict; use open ":std", ":utf8_strict"'
Unknown PerlIO layer class ':utf8_strict' (need IN, OUT or IO) at -e line 1.
BEGIN failed--compilation aborted at -e line 1.

schwern added a commit to schwern/utf8-all that referenced this issue Aug 10, 2016
Leon suggested using PerlIO::utf8_strict as the default.
doherty#42 (comment)

It seems to work.

I'm assuming there's still value in trying to use :encoding(UTF-8)?
Maybe it's faster?
@Leont
Copy link
Contributor

Leont commented Aug 10, 2016

Sounds like a good option, but I can't get it to work with use open.

Apparently open.pm is special-cased so that for :utf8, :encoding() and :locale, the IO argument isn't necessary, but for all other layers it is. This works fine for me:

perl -wle 'use open ":std", "IO", ":utf8_strict"'

schwern added a commit to schwern/utf8-all that referenced this issue Aug 10, 2016
Leon suggested using PerlIO::utf8_strict as the default.
doherty#42 (comment)

It seems to work.

I'm assuming there's still value in trying to use :encoding(UTF-8)?
Maybe it's faster?
schwern added a commit to schwern/utf8-all that referenced this issue Aug 10, 2016
Leon suggested using PerlIO::utf8_strict as the default.
doherty#42 (comment)

It seems to work.

I'm assuming there's still value in trying to use :encoding(UTF-8)?
Maybe it's faster?
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants