Tony Cook (via RT)
2015-08-06 07:00:42 UTC
# New Ticket Created by Tony Cook
# Please include the string: [perl #125760]
# in the subject line of all future correspondence about this issue.
# <URL: https://rt.perl.org/Ticket/Display.html?id=125760 >
One of the few remaining warts[1] in Perl's Unicode support is how
sysread() and syswrite() behave on streams with a unicode layer.
First sysread():
For example:
    open my $fh, "<:utf8", "filewithutf8.txt" or die;
    my $buf;
    sysread $fh, $buf, 1000;
will read up to 1000 unvalidated UTF-8[2] *characters* from the stream.
That seems all fine and good, but the following:
    open my $fh, "<:encoding(UCS-2BE)", "filewithucs2be.txt" or die;
    my $buf;
    sysread $fh, $buf, 1000;
does exactly the same thing: only the fact that the stream is unicode
flagged (i.e., has the PERLIO_F_UTF8 flag) is consulted; the actual
layers are ignored.
This behaviour is mostly documented by:
    Note that if the filehandle has been marked as C<:utf8> Unicode
    characters are read instead of bytes (the LENGTH, OFFSET, and the
    return value of sysread() are in Unicode characters).
    The C<:encoding(...)> layer implicitly introduces the C<:utf8> layer.
    See L</binmode>, L</open>, and the C<open> pragma, L<open>.
which skips mentioning that the "Unicode characters" read are always
UTF-8 encoded.
This, beyond the broken :utf8 layer itself, is one of the few pure
perl vectors for badly encoded SVf_UTF8 strings in the perl
interpreter.
It can also be confusing: even an experienced CPAN author managed to get
it wrong[3].
My suggestion is that (eventually) sysread() on a file with the
PERLIO_F_UTF8 flag on should either do a simple octet read, as it does
without that flag, or fail.
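As a rough sketch of what the "simple octet read" case looks like from
the caller's side (this is not a proposed patch): the handle carries no
unicode flag, sysread() returns octets, and the decoding is done
explicitly. The temporary file and the variable names are made up for
illustration; Encode and File::Temp are core modules.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use File::Temp qw(tempfile);

# Create a small UCS-2BE file using octets only (illustrative data).
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
my $octets = encode("UCS-2BE", "\x{101}b");
syswrite($out, $octets) == length($octets) or die "syswrite: $!";
close $out or die "close: $!";

# Read octets back through a handle without PERLIO_F_UTF8 ...
open my $in, "<:raw", $path or die "open: $!";
my $buf = '';
my $n = sysread $in, $buf, 1000;
defined $n or die "sysread: $!";
close $in or die "close: $!";

# ... and decode explicitly. decode() with a CHECK argument modifies
# its input, so work on a copy and leave $buf untouched.
my $copy = $buf;
my $text = decode("UCS-2BE", $copy, Encode::FB_CROAK);
```

Here LENGTH and the return value are counted in octets, and malformed
input is reported by decode() rather than ending up in an SVf_UTF8
string.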
For the transition, sysread() would warn when passed a handle with
PERLIO_F_UTF8 set, presumably with something like "sysread() on a
unicode handle is deprecated".
So what should the behaviour be after the transition? I see three options:
1) sysread() would act as if the flag was not there, completely
   ignoring the layers rather than ignoring the layers *except* for
   the flag.

   This has the advantage that sysread() behaves consistently after
   the change. It may however make code that depends on the old
   behaviour silently misbehave.

2) sysread() fails, probably with EINVAL.

   While sysread() would no longer be useful on handles with the flag,
   mixing low- and high-level I/O is generally unsafe anyway, and
   PerlIO layers are pretty much a high-level construct, so there
   isn't much lost.

   It prevents most silent misbehaviour while remaining true to
   sysread()'s contract to read bytes from a file.

3) sysread() croaks.

   Similar to 2), but with more emphasis.
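To make option 2 concrete, here is a user-level sketch that emulates it
with a wrapper. The real change would live inside perl; has_utf8_flag()
and octet_sysread() are hypothetical names used only for this
illustration. It relies on PerlIO::get_layers() reporting "utf8" as a
pseudo-layer when the flag is set on a handle.

```perl
use strict;
use warnings;
use Errno qw(EINVAL);
use File::Temp qw(tempfile);

# Hypothetical helper: true if the handle has the utf8 flag set.
sub has_utf8_flag {
    my ($fh) = @_;
    return scalar grep { $_ eq 'utf8' } PerlIO::get_layers($fh);
}

# Hypothetical wrapper emulating option 2: fail with EINVAL on a
# flagged handle, otherwise do a plain octet read into the caller's
# buffer (aliased through $_[1]).
sub octet_sysread {
    my ($fh, undef, $len) = @_;
    if (has_utf8_flag($fh)) {
        $! = EINVAL;
        return undef;
    }
    return sysread $fh, $_[1], $len;
}

# Illustrative file containing the two octets of UTF-8 encoded U+0101.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
syswrite $out, "\xC4\x81" or die "syswrite: $!";
close $out or die "close: $!";

open my $flagged, "<:utf8", $path or die "open: $!";
open my $plain,   "<:raw",  $path or die "open: $!";

my $buf;
my $r1    = octet_sysread($flagged, $buf, 1000);  # refused
my $errno = $! + 0;                               # save errno at once
my $r2    = octet_sysread($plain, $buf, 1000);    # octet read
```

Option 3 would be the same check with a croak() in place of setting $!
and returning undef.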
Then syswrite():
Unlike sysread(), syswrite() doesn't act as a vector for producing
corrupt internal perl data structures, but it does have the same issue:
it pays attention to only part of the layer state for the handle.
For example:
    open my $fh, ">:utf8", "filetobeutf8.txt" or die;
    my $data = "\x{101}";
    syswrite $fh, $data;
will write UTF-8 encoded data to the file, which is fine, but:
    open my $fh, ">:encoding(UCS-2BE)", "filetobeucs2.txt" or die;
    my $data = "\x{101}";
    syswrite $fh, $data;
does the same thing.
I believe if syswrite() is going to ignore any of the layer state of
the handle, it should ignore it all, so the examples above would throw
an exception, just as they do for handles without the flag when
syswrite() is called with wide characters.
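For reference, a small sketch of the existing behaviour on a handle
without the flag, which is what the examples above would also do under
this proposal. The temporary file is illustrative only; on the perls I
know of, the exception text begins with "Wide character in syswrite".

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh;    # no layers of interest, no PERLIO_F_UTF8

my $data = "\x{101}";

# A wide character cannot be written as octets, so syswrite() croaks.
my $ok  = eval { syswrite $fh, $data; 1 };
my $err = $ok ? '' : $@;

# Encoding explicitly first gives syswrite() plain octets to write.
my $written = syswrite $fh, encode("UTF-8", $data);
defined $written or die "syswrite: $!";
close $fh or die "close: $!";
```

The caller, not the handle's flag, decides which encoding ends up in
the file.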
Again, for the transition, syswrite() should produce a deprecation
warning.
Tony
[1] the other one I can think of is that the :utf8 PerlIO layer doesn't
validate; it's just a flag
[2] or utf8, perl's internal encoding
[3] https://rt.cpan.org/Public/Bug/Display.html?id=83126 and a few
other tickets for the same distribution, and
https://rt.perl.org/Ticket/Display.html?id=121870