Discussion:
[perl #125760] RFC remove strange behaviour of sysread()/syswrite() on UTF-8 streams
Tony Cook (via RT)
2015-08-06 07:00:42 UTC
# New Ticket Created by Tony Cook
# Please include the string: [perl #125760]
# in the subject line of all future correspondence about this issue.
# <URL: https://rt.perl.org/Ticket/Display.html?id=125760 >


One of the few remaining warts[1] in Perl's Unicode support is how
sysread() and syswrite() behave on streams with a unicode layer.

First sysread():

For example:

open my $fh, "<:utf8", "filewithutf8.txt" or die;
my $buf;
sysread $fh, $buf, 1000;

will read up to 1000 unvalidated UTF-8[2] *characters* from the stream.
That seems all fine and good, but the following:

open my $fh, "<:encoding(UCS-2BE)", "filewithucs2be.txt" or die;
my $buf;
sysread $fh, $buf, 1000;

does exactly the same thing - only the fact that the stream is unicode
flagged (i.e., has the PERLIO_F_UTF8 flag) is referenced, the actual
layers are ignored.

This behaviour is mostly documented by:

Note that if the filehandle has been marked as C<:utf8> Unicode
characters are read instead of bytes (the LENGTH, OFFSET, and the
return value of sysread() are in Unicode characters).
The C<:encoding(...)> layer implicitly introduces the C<:utf8> layer.
See L</binmode>, L</open>, and the C<open> pragma, L<open>.

which skips mentioning that the "Unicode characters" read are always
UTF-8 encoded.

This, beyond the broken :utf8 layer itself, is one of the few pure
perl vectors for badly encoded SVf_UTF8 strings in the perl
interpreter.

It can also be confusing: even an experienced CPAN author managed to
get it wrong[3].

My suggestion is that (eventually) sysread() on a file with the
PERLIO_F_UTF8 flag on should either do a simple octet read, as it does
without that flag, or fail.

For the transition sysread() would warn when passed a handle with
PERLIO_F_UTF8, presumably something like "sysread() on a unicode
handle is deprecated".

So what's the desired behaviour after the transition?

1) sysread() would act as if the flag was not there, completely
ignoring the layers rather than ignoring the layers *except* for
the flag.

This has the advantage that sysread() behaves consistently after
the change. It may however make code that depends on the old
behaviour silently misbehave.

2) sysread() fails, probably with EINVAL.

While sysread() becomes no longer useful on handles with the flag,
mixing low- and high-level I/O is generally unsafe anyway, and
PerlIO layers are pretty much a high-level construct, so there
isn't much lost.

It prevents most silent mis-behaviour while remaining true to
sysread()'s contract to read bytes from a file.

3) sysread() croaks.

Similar to 2), but with more emphasis.
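
Whichever of the three options is chosen, code that wants the "simple
octet read" can already get it by keeping the flag off the handle and
decoding explicitly. The following is only a minimal sketch of that
idea, not part of the proposal: read_utf8_octets() is a hypothetical
helper name, and the temporary file stands in for filewithutf8.txt.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);
use File::Temp qw(tempfile);

# Hypothetical helper: read up to $len *octets* from a handle with no
# unicode flag, then decode them explicitly.
sub read_utf8_octets {
    my ($path, $len) = @_;
    open my $fh, '<:raw', $path or die "open $path: $!";
    my $n = sysread $fh, my $octets, $len;
    die "sysread $path: $!" unless defined $n;
    close $fh;
    # FB_CROAK makes decode() die on malformed UTF-8 instead of
    # silently producing a badly encoded string.
    return decode('UTF-8', $octets, FB_CROAK);
}

# Stand-in input file containing the UTF-8 octets for U+0101.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "\xC4\x81";
close $tmp;

my $text = read_utf8_octets($path, 1000);
printf "%d character(s), U+%04X\n", length($text), ord($text);
# prints "1 character(s), U+0101"
```

Note that LENGTH counts octets here, so a short read can end in the
middle of a multi-byte sequence; with FB_CROAK that shows up as an
exception rather than as corrupt data.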

Then syswrite():

Unlike sysread(), syswrite() doesn't act as a vector for producing
corrupt internal perl data structures, but it does have the same issue
that it pays attention to only part of the layer state for the handle.

For example:

open my $fh, ">:utf8", "filetobeutf8.txt" or die;
my $data = "\x{101}";
syswrite $fh, $data;

will write UTF-8 encoded data to the file, which is fine, but:

open my $fh, ">:encoding(UCS-2BE)", "filetobeucs2.txt" or die;
my $data = "\x{101}";
syswrite $fh, $data;

does the same thing.

I believe if syswrite() is going to ignore any of the layer state of
the handle, it should ignore it all, so the examples above would throw
an exception, just as they do for handles without the flag when
syswrite() is called with wide characters.
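
As an illustration only, here is how that looks today on a handle
without the flag, together with the explicit-encoding alternative. It
is a sketch under assumptions: the temporary file stands in for
filetobeucs2.txt, and the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh;    # plain octet handle, no PERLIO_F_UTF8 flag

my $data = "\x{101}";

# Wide characters on a handle without the flag: syswrite() throws.
my $ok  = eval { syswrite $fh, $data; 1 };
my $err = $ok ? '' : $@;
print "syswrite refused: $err" if $err;

# Encoding explicitly hands syswrite() octets in the encoding the
# caller actually asked for, independent of any layer state.
my $octets  = encode('UCS-2BE', $data);
my $written = syswrite $fh, $octets;
die "syswrite: $!" unless defined $written;
printf "wrote %d octets\n", $written;

close $fh;
```

The first call dies with "Wide character in syswrite"; the second
writes the two UCS-2BE octets for U+0101.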

Again, for the transition, syswrite() should produce a deprecation
warning.

Tony

[1] the other I can think of is that the :utf8 PerlIO layer doesn't
validate, it's just a flag

[2] or utf8, perl's internal encoding

[3] https://rt.cpan.org/Public/Bug/Display.html?id=83126 and a few
other tickets for the same distribution, and
https://rt.perl.org/Ticket/Display.html?id=121870
Peter Rabbitson
2015-08-06 07:12:26 UTC
Post by Tony Cook (via RT)
# New Ticket Created by Tony Cook
# Please include the string: [perl #125760]
# in the subject line of all future correspondence about this issue.
# <URL: https://rt.perl.org/Ticket/Display.html?id=125760 >
One of the few remaining warts[1] in Perl's Unicode support is how
sysread() and syswrite() behave on streams with a unicode layer.
Having read the excellent analysis, my 2c is that both failure cases for
sysread and syswrite should ultimately croak.

I do not have an informed opinion on what the deprecation cycle would
look like, as it is likely very beneficial to exercise the croaking as
early as 5.23.x on a CPAN smoke, yet it is clearly too early for code in
the wild.
Ricardo Signes
2015-08-06 23:11:59 UTC
Post by Peter Rabbitson
Having read the excellent analysis, my 2c is that both failure cases for
sysread and syswrite should ultimately croak.
Yes. Thanks, Tony, and I agree.
Post by Peter Rabbitson
I do not have an informed opinion on what the deprecation cycle would look
like, as it is likely very beneficial to exercise the croaking as early as
5.23.x on a CPAN smoke, yet it is clearly too early for code in the wild.
We should definitely get the warnings in place soon.

I think it would be beneficial if we had a way to mark any deprecation warning
as fatal, process wide, for the purpose of smoking (and other places, like
integration testing), but I think it needs more thought than a "hey it would be
neat" from me.

But if we're going to make it croak in 5.28, time to make it warn now.
--
rjbs
Tony Cook via RT
2015-08-10 06:23:05 UTC
Post by Ricardo Signes
But if we're going to make it croak in 5.28, time to make it warn now.
Patch attached.

As chansen mentioned in #p5p, send() and recv() have the same issue, so the patch
also deprecates them on :utf8 handles.

Tony


---
via perlbug: queue: perl5 status: open
https://rt.perl.org/Ticket/Display.html?id=125760

Jarkko Hietaniemi via RT
2015-08-06 13:10:49 UTC
I think I am the original vector of these vectors... and I think I was wrong. Just make them croak.

