Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

M::I doesn't read utf8-encoded files correctly #37

Open
sjn opened this issue Oct 20, 2014 · 12 comments
Open

M::I doesn't read utf8-encoded files correctly #37

sjn opened this issue Oct 20, 2014 · 12 comments

Comments

@sjn
Copy link

sjn commented Oct 20, 2014

Module/Install.pm's _read() and _write() functions don't set the encoding correctly, leading to utf8-bugs in the author field in META.yml. Here's how to reproduce this error:

Create two files; Makefile.PL...

use inc::Module::Install;
all_from 'lib/Broken.pm';

requires();
auto_install();
&WriteAll;

...and lib/Broken.pm

package Broken;

use 5.006;
our $VERSION = '1.0';

1;

__END__

=encoding utf8

=head1 NAME

Broken - A broken module

=head1 DESCRIPTION

This module is only here to show a bug

=head1 AUTHORS

Philippe "BooK" Bruhat, Éric Cholet, Olivier Mengué

=head1 LICENSE

This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

=cut

Make sure the AUTHORS field is saved correctly as utf8. You can check this with xxd -g1 -s 179 -l 70 lib/Broken.pm, which should give the following output:

00000b3: 3d 68 65 61 64 31 20 41 55 54 48 4f 52 53 0a 0a  =head1 AUTHORS..
00000c3: 50 68 69 6c 69 70 70 65 20 22 42 6f 6f 4b 22 20  Philippe "BooK" 
00000d3: 42 72 75 68 61 74 2c 20 c3 89 72 69 63 20 43 68  Bruhat, ..ric Ch
00000e3: 6f 6c 65 74 2c 20 4f 6c 69 76 69 65 72 20 4d 65  olet, Olivier Me
00000f3: 6e 67 75 c3 a9 0a                                ngu...

Then, run perl Makefile.PL, using Module::Install version 1.12. The following META.yml will then be generated:


---
abstract: 'A broken module'
author:
  - "Philippe \"BooK\" Bruhat, �\x89ric Cholet, Olivier Mengué"
build_requires:
  ExtUtils::MakeMaker: 6.59
configure_requires:
  ExtUtils::MakeMaker: 6.59
distribution_type: module
dynamic_config: 1
generated_by: 'Module::Install version 1.12'
license: perl
meta-spec:
  url: http://module-build.sourceforge.net/META-spec-v1.4.html
  version: 1.4
module_name: Broken
name: Broken
no_index:
  directory:
    - inc
requires:
  perl: 5.6.0
resources:
  license: http://dev.perl.org/licenses/
version: '1.0'

The error is in the author field, specifically �\x89ric Cholet.

The following patch to Module::Install with fix the issue:

--- lib/Module/Install.pm.orig  2014-10-20 15:35:13.731849265 +0200
+++ lib/Module/Install.pm   2014-10-20 15:36:52.326525499 +0200
@@ -18,6 +18,8 @@

 use 5.006;
 use strict 'vars';
+use utf8;
+use open ':encoding(utf8)';
 use Cwd        ();
 use File::Find ();
 use File::Path ();

This parch will make one test in the Module::Install's test suite fail. Not sure what's going on there.

@ribasushi
Copy link
Contributor

I think the patch as-is is going to break a lot of stuff. The premise behind M::I's read/write functions has always been "I ready/write byts, encoding is your own business". If anything - the issue is that the author_from extractor does not return bytes.

@sjn
Copy link
Author

sjn commented Oct 20, 2014

BTW, if you add Éric to the NAME pod field, you'll get a similar error in the META.yml "abstract" field. If you have a fix for author_from, it has also to be applied other places!

@karenetheridge
Copy link
Member

"use utf8" in the MI code is totally unnecessary. That's only used to indicate that there are utf8-encoded bytes in the physical file currently being parsed.

@karenetheridge
Copy link
Member

I think the patch as-is is going to break a lot of stuff

I don't think it will break any more things than the other mojibake'd META files we've been dealing with for the past few years as we sort out encoding issues elsewhere in the toolchain.

The premise behind M::I's read/write functions has always been "I ready/write byts, encoding is your own business".

I disagree. I think it's closer to "unicode -- what's that?"

If anything - the issue is that the author_from extractor does not return bytes.

This is wrong. Interfaces not dealing with the edges (writing to/from disk or the network) should always deal in (decoded) characters, not (encoded) octets.

@mohawk2
Copy link
Member

mohawk2 commented Oct 20, 2014

@sjn Could you PR your test case as a *.t file? As a/the Designated Unicode Victim, I'll have a go at it.

@ribasushi
Copy link
Contributor

On 10/20/2014 06:22 PM, Karen Etheridge wrote:

I think the patch as-is is going to break a lot of stuff

I don't think it will break any more things than the other mojibake'd
META files we've been dealing with for the past few years as we sort
out encoding issues elsewhere in the toolchain.

I much rather have a mojibaked meta rather than an "I have a new M::I on
this 5.8.8 and now I can't install dists". Let's not lose perspective
here please.

If anything - the issue is that the author_from extractor does not
return bytes.

This is wrong. Interfaces not dealing with the edges (writing to/from
disk or the network) should always deal in (decoded) characters, not
(encoded) octets.

For greenfield code - you are absolutely correct. M::I is a tool with a
very specific fail-case, one needs to thread very carefully.

@karenetheridge
Copy link
Member

much rather have a mojibaked meta rather than an "I have a new M::I on this 5.8.8 and now I can't install dists".

Of course. :) But I don't think we were far enough through evaluating a patch to assess whether lack of backwards compatibility was even a possibility here. Let's fix the code as we would like it to be, and then be aggressive about testing various install combinations to see what the potential ramifications are. (And, of course, an alpha release is necessary before anything goes stable.)

@dagolden
Copy link

perhaps "all_from" need to take arguments?

all_from 'lib/Broken.pm', { encoding => 'UTF-8' };

Or perhaps "all_from" should look for "use utf8;" or a POD "=encoding" section and DWIM? Assuming the author actually went to the trouble of telling Perl or perldoc about encodings, it would be a shame for M::I to ignore it.

@ribasushi
Copy link
Contributor

@dagolden agreed, an extra flag (and perhaps some internal guesswork) is the way to go.

To clarify - I would be delighted to get this fixed (and not just for UTF8)

My initial answer #37 (comment) was only pointing out the incorrectness of the proposed one-line patch

And the second paragraph of #37 (comment) was specifically to counter the dangerous attitude "let's rewrite everything the modern way, redefining all interfaces in the process, and then see what the fallout is"

@karenetheridge
Copy link
Member

karenetheridge commented Oct 21, 2014

On Mon, Oct 20, 2014 at 11:58:43PM -0700, Peter Rabbitson wrote:

... specifically to counter the dangerous attitude "let's rewrite everything the modern way, redefining all interfaces in the process, and then see what the fallout is"

No one said that, so I think you're inferring too much here.

However, it's perfectly reasonable to say "please don't merge this yet -
more tests are needed, as well as consideration of the potential fallout",
just in case someone zealously clicks the 'merge' button followed by the
'release' button (but I think we've already whacked with a cluebat anyone
who might have done that in the past).

@adamkennedy
Copy link
Member

Indeed. It was written during "the back compatibility target doesn't
support Unicode, META.yml doesn't support Unicode, we don't support
Unicode"
On Oct 20, 2014 9:22 AM, "Karen Etheridge" [email protected] wrote:

I think the patch as-is is going to break a lot of stuff

I don't think it will break any more things than the other mojibake'd META
files we've been dealing with for the past few years as we sort out
encoding issues elsewhere in the toolchain.

The premise behind M::I's read/write functions has always been "I
ready/write byts, encoding is your own business".

I disagree. I think it's closer to "unicode -- what's that?"

If anything - the issue is that the author_from extractor does not return
bytes.

This is wrong. Interfaces not dealing with the edges (writing to/from disk
or the network) should always deal in (decoded) characters, not (encoded)
octets.


Reply to this email directly or view it on GitHub
#37 (comment)
.

@ambs
Copy link
Contributor

ambs commented Aug 23, 2016

#55 should fix this issue (??)

akhuettel added a commit to lab-measurement/Lab-VISA that referenced this issue Apr 22, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants