M::I doesn't read utf8-encoded files correctly #37

sjn · 2014-10-20T13:48:17Z

Module/Install.pm's _read() and _write() functions don't set the encoding correctly, leading to utf8-bugs in the author field in META.yml. Here's how to reproduce this error:

Create two files; Makefile.PL...

use inc::Module::Install;
all_from 'lib/Broken.pm';

requires();
auto_install();
&WriteAll;

...and lib/Broken.pm

package Broken;

use 5.006;
our $VERSION = '1.0';

1;

__END__

=encoding utf8

=head1 NAME

Broken - A broken module

=head1 DESCRIPTION

This module is only here to show a bug

=head1 AUTHORS

Philippe "BooK" Bruhat, Éric Cholet, Olivier Mengué

=head1 LICENSE

This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

=cut

Make sure the AUTHORS field is saved correctly as utf8. You can check this with xxd -g1 -s 179 -l 70 lib/Broken.pm, which should give the following output:

00000b3: 3d 68 65 61 64 31 20 41 55 54 48 4f 52 53 0a 0a  =head1 AUTHORS..
00000c3: 50 68 69 6c 69 70 70 65 20 22 42 6f 6f 4b 22 20  Philippe "BooK" 
00000d3: 42 72 75 68 61 74 2c 20 c3 89 72 69 63 20 43 68  Bruhat, ..ric Ch
00000e3: 6f 6c 65 74 2c 20 4f 6c 69 76 69 65 72 20 4d 65  olet, Olivier Me
00000f3: 6e 67 75 c3 a9 0a                                ngu...

Then, run perl Makefile.PL, using Module::Install version 1.12. The following META.yml will then be generated:


---
abstract: 'A broken module'
author:
  - "Philippe \"BooK\" Bruhat, �\x89ric Cholet, Olivier Mengué"
build_requires:
  ExtUtils::MakeMaker: 6.59
configure_requires:
  ExtUtils::MakeMaker: 6.59
distribution_type: module
dynamic_config: 1
generated_by: 'Module::Install version 1.12'
license: perl
meta-spec:
  url: http://module-build.sourceforge.net/META-spec-v1.4.html
  version: 1.4
module_name: Broken
name: Broken
no_index:
  directory:
    - inc
requires:
  perl: 5.6.0
resources:
  license: http://dev.perl.org/licenses/
version: '1.0'

The error is in the author field, specifically �\x89ric Cholet.

The following patch to Module::Install with fix the issue:

--- lib/Module/Install.pm.orig  2014-10-20 15:35:13.731849265 +0200
+++ lib/Module/Install.pm   2014-10-20 15:36:52.326525499 +0200
@@ -18,6 +18,8 @@

 use 5.006;
 use strict 'vars';
+use utf8;
+use open ':encoding(utf8)';
 use Cwd        ();
 use File::Find ();
 use File::Path ();

This parch will make one test in the Module::Install's test suite fail. Not sure what's going on there.

The text was updated successfully, but these errors were encountered:

ribasushi · 2014-10-20T13:51:44Z

I think the patch as-is is going to break a lot of stuff. The premise behind M::I's read/write functions has always been "I ready/write byts, encoding is your own business". If anything - the issue is that the author_from extractor does not return bytes.

sjn · 2014-10-20T14:00:37Z

BTW, if you add Éric to the NAME pod field, you'll get a similar error in the META.yml "abstract" field. If you have a fix for author_from, it has also to be applied other places!

karenetheridge · 2014-10-20T16:20:03Z

"use utf8" in the MI code is totally unnecessary. That's only used to indicate that there are utf8-encoded bytes in the physical file currently being parsed.

karenetheridge · 2014-10-20T16:22:17Z

I think the patch as-is is going to break a lot of stuff

I don't think it will break any more things than the other mojibake'd META files we've been dealing with for the past few years as we sort out encoding issues elsewhere in the toolchain.

The premise behind M::I's read/write functions has always been "I ready/write byts, encoding is your own business".

I disagree. I think it's closer to "unicode -- what's that?"

If anything - the issue is that the author_from extractor does not return bytes.

This is wrong. Interfaces not dealing with the edges (writing to/from disk or the network) should always deal in (decoded) characters, not (encoded) octets.

mohawk2 · 2014-10-20T16:24:40Z

@sjn Could you PR your test case as a *.t file? As a/the Designated Unicode Victim, I'll have a go at it.

ribasushi · 2014-10-20T16:25:57Z

On 10/20/2014 06:22 PM, Karen Etheridge wrote:

I think the patch as-is is going to break a lot of stuff
I don't think it will break any more things than the other mojibake'd
META files we've been dealing with for the past few years as we sort
out encoding issues elsewhere in the toolchain.

I much rather have a mojibaked meta rather than an "I have a new M::I on
this 5.8.8 and now I can't install dists". Let's not lose perspective
here please.

If anything - the issue is that the author_from extractor does not
return bytes.
This is wrong. Interfaces not dealing with the edges (writing to/from
disk or the network) should always deal in (decoded) characters, not
(encoded) octets.

For greenfield code - you are absolutely correct. M::I is a tool with a
very specific fail-case, one needs to thread very carefully.

karenetheridge · 2014-10-20T16:45:47Z

much rather have a mojibaked meta rather than an "I have a new M::I on this 5.8.8 and now I can't install dists".

Of course. :) But I don't think we were far enough through evaluating a patch to assess whether lack of backwards compatibility was even a possibility here. Let's fix the code as we would like it to be, and then be aggressive about testing various install combinations to see what the potential ramifications are. (And, of course, an alpha release is necessary before anything goes stable.)

dagolden · 2014-10-21T02:25:10Z

perhaps "all_from" need to take arguments?

all_from 'lib/Broken.pm', { encoding => 'UTF-8' };

Or perhaps "all_from" should look for "use utf8;" or a POD "=encoding" section and DWIM? Assuming the author actually went to the trouble of telling Perl or perldoc about encodings, it would be a shame for M::I to ignore it.

ribasushi · 2014-10-21T06:58:41Z

@dagolden agreed, an extra flag (and perhaps some internal guesswork) is the way to go.

To clarify - I would be delighted to get this fixed (and not just for UTF8)

My initial answer #37 (comment) was only pointing out the incorrectness of the proposed one-line patch

And the second paragraph of #37 (comment) was specifically to counter the dangerous attitude "let's rewrite everything the modern way, redefining all interfaces in the process, and then see what the fallout is"

karenetheridge · 2014-10-21T23:11:30Z

On Mon, Oct 20, 2014 at 11:58:43PM -0700, Peter Rabbitson wrote:

... specifically to counter the dangerous attitude "let's rewrite everything the modern way, redefining all interfaces in the process, and then see what the fallout is"

No one said that, so I think you're inferring too much here.

However, it's perfectly reasonable to say "please don't merge this yet -
more tests are needed, as well as consideration of the potential fallout",
just in case someone zealously clicks the 'merge' button followed by the
'release' button (but I think we've already whacked with a cluebat anyone
who might have done that in the past).

adamkennedy · 2014-10-21T23:51:33Z

Indeed. It was written during "the back compatibility target doesn't
support Unicode, META.yml doesn't support Unicode, we don't support
Unicode"
On Oct 20, 2014 9:22 AM, "Karen Etheridge" [email protected] wrote:

I think the patch as-is is going to break a lot of stuff

I don't think it will break any more things than the other mojibake'd META
files we've been dealing with for the past few years as we sort out
encoding issues elsewhere in the toolchain.

The premise behind M::I's read/write functions has always been "I
ready/write byts, encoding is your own business".

I disagree. I think it's closer to "unicode -- what's that?"

If anything - the issue is that the author_from extractor does not return
bytes.

This is wrong. Interfaces not dealing with the edges (writing to/from disk
or the network) should always deal in (decoded) characters, not (encoded)
octets.

—
Reply to this email directly or view it on GitHub
#37 (comment)
.

ambs · 2016-08-23T16:51:26Z

#55 should fix this issue (??)

... since Module::Install is b0rken. Perl-Toolchain-Gang/Module-Install#37

karenetheridge mentioned this issue Apr 26, 2015

Cannot create META files containing non-ascii characters #44

Open

ambs mentioned this issue Aug 19, 2016

first try fixing utf8 issues #52

Closed

akhuettel added a commit to lab-measurement/Lab-VISA that referenced this issue Apr 22, 2017

For the moment no unicode author names

2f96f40

... since Module::Install is b0rken. Perl-Toolchain-Gang/Module-Install#37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

M::I doesn't read utf8-encoded files correctly #37

M::I doesn't read utf8-encoded files correctly #37

sjn commented Oct 20, 2014

ribasushi commented Oct 20, 2014

sjn commented Oct 20, 2014

karenetheridge commented Oct 20, 2014

karenetheridge commented Oct 20, 2014

mohawk2 commented Oct 20, 2014

ribasushi commented Oct 20, 2014

karenetheridge commented Oct 20, 2014

dagolden commented Oct 21, 2014

ribasushi commented Oct 21, 2014

karenetheridge commented Oct 21, 2014 •

edited

Loading

adamkennedy commented Oct 21, 2014

ambs commented Aug 23, 2016

M::I doesn't read utf8-encoded files correctly #37

M::I doesn't read utf8-encoded files correctly #37

Comments

sjn commented Oct 20, 2014

ribasushi commented Oct 20, 2014

sjn commented Oct 20, 2014

karenetheridge commented Oct 20, 2014

karenetheridge commented Oct 20, 2014

mohawk2 commented Oct 20, 2014

ribasushi commented Oct 20, 2014

karenetheridge commented Oct 20, 2014

dagolden commented Oct 21, 2014

ribasushi commented Oct 21, 2014

karenetheridge commented Oct 21, 2014 • edited Loading

adamkennedy commented Oct 21, 2014

ambs commented Aug 23, 2016

karenetheridge commented Oct 21, 2014 •

edited

Loading