Editing NET/www/webプログラミング

**Perl [#s7554a38]

&aname(urlencode);

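-(The easy case first.) When handling URL-encoded strings (for example, when analyzing Apache log files), URL-decoding alone is not enough. Immediately after URL-decoding, decode the resulting bytes into Perl's internal representation. Otherwise the string is inconsistent with every other string in the program. For example, all strings are normally encoded on output. If a URL-decoded string is encoded at that point without having been decoded first, it ends up double-encoded and comes out garbled.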
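The ordering described above can be sketched as follows. This is a minimal illustration, not code from the original notes: the helper name url_unescape and the variable names are hypothetical. It shows the byte and character counts at each step and what happens when undecoded bytes are encoded on output.

```perl
use strict;
use warnings;
use utf8;                                   # literals in this script are character strings
use Encode qw(decode_utf8 encode_utf8);

# Hypothetical helper: turn %XX escapes back into raw bytes.
sub url_unescape {
    my ($s) = @_;
    $s =~ tr/+/ /;
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
    return $s;
}

my $escaped = '%E6%97%85%E8%A1%8C';         # 旅行 as percent-encoded UTF-8
my $bytes   = url_unescape($escaped);       # raw UTF-8 bytes
my $chars   = decode_utf8($bytes);          # characters in Perl's internal form

my $ok      = encode_utf8($chars);          # encoded once on output
my $double  = encode_utf8($bytes);          # undecoded bytes encoded: double encoding

printf "bytes=%d chars=%d ok=%d double=%d match=%s\n",
    length($bytes), length($chars), length($ok), length($double),
    ($chars eq '旅行' ? 'yes' : 'no');
```

Only the decoded string compares equal to the literal, and only it round-trips to the original six bytes. The doubly encoded value is twice as long, which is what shows up as garbled output.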

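-(This one surprised me.) When reading a file that contains URL-encoded strings (for example, an Apache log file), decoding on input with &pre(open(IN, '<:utf8'...); or &pre(use open IN => ":utf8";); makes the strings stop matching literals in the script (assuming &pre(use utf8;);), both in regular expressions and with string comparison operators, regardless of whether Encode::decode_utf8 is applied afterwards. Regular expressions and string comparisons against literals become unusable. Nothing appears garbled, so this is harder to notice than the case above. I suspect it is not uncommon to keep use open IN or binmode IN in a script template as a convenient idiom.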

--Using &pre(<:encoding(UTF-8)); instead of &pre(<:utf8); makes no difference. ":utf8 just marks the data as UTF-8 without further checking, while :encoding(UTF-8) checks the data for actually being valid UTF-8." [[perlfunc - Perl 組み込み関数 - perldoc.jp:http://perldoc.jp/docs/perl/5.24.1/perlfunc.pod#open]]
--binmode IN, ":utf8"; breaks in the same way.

--The code below can be used to try this.
#pre{{
use utf8;
use strict;
use warnings;
use Encode;
use Devel::Peek;

my $str = '%E6%97%85%E8%A1%8C'; # 旅行 ("travel")
my $file = 'test.bin';
my $str_decoded = decode_utf8( &urldecode ($str));
print "basis:$str_decoded\n";
Devel::Peek::Dump($str_decoded);
print qq(\n);

open (OUT, ">$file");
binmode OUT;
print OUT $str;
close OUT;

# -------------------------------------------------------
# Read without decoding
open (IN, "<$file");
#binmode IN;
my $str1_from_file = <IN>;
close IN;
&compare($str,$str1_from_file);
print qq(\n);

# -------------------------------------------------------
# Read with decoding (:utf8 layer)
open(IN, '<:utf8', $file);
#binmode IN;
my $str2_from_file = <IN>;
close IN;
&compare($str,$str2_from_file);

# -------------------------------------------------------
# URL-decode
sub urldecode{
	my $uri = shift(@_);
	$uri =~ tr/+/ /;
	$uri =~ s/%([0-9A-Fa-f][0-9A-Fa-f])/pack('H2', $1)/eg;
	return $uri;
}
# Compare
sub compare{
	my @sample = @_;
	# Before URL-decoding
	&sameornot(@sample);

	# URL-decode
	my @sample_urldecoded = map {urldecode($_)} @sample;
	&sameornot(@sample_urldecoded);

	# Decode into the internal representation
	my @sample_decoded = map {decode_utf8($_)} @sample_urldecoded;
	&sameornot(@sample_decoded);

	sub sameornot {
		$_[0] eq $_[1] ? print "same\n" : print "differnet\n";
	}
	# Regular expression
	if ($sample_decoded[1] =~ /旅行/) {
		print "1) regex works\n";
	}
	# Just in case, try it before decode_utf8
	if ($sample_urldecoded[1] =~ /旅行/) {
		print "2) regex works\n";
	}

	print "compared:$sample_decoded[1]\n";
	print "before decode_utf8\n";
	Devel::Peek::Dump($sample_urldecoded[1]);
	print "after decode_utf8\n";
	Devel::Peek::Dump($sample_decoded[1]);
}
}}
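--Along the way Perl warns "Wide character in print", which is correct. If anything, it was more surprising that no such warning appears at the third print. (The dump in the result shows why: the string read through :utf8 contains only code points below U+0100, which print writes out as single bytes without a warning.)
--The result looks like this.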
#pre{{
$ perl test2.pl
Wide character in print at test2.pl line 12.
basis:旅行
SV = PV(0x1018ee0) at 0x1037980
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x110a940 "\346\227\205\350\241\214"\0 [UTF8 "\x{65c5}\x{884c}"]
  CUR = 6
  LEN = 16

same
same
same
1) regex works
Wide character in print at test2.pl line 73.
compared:旅行
before decode_utf8
SV = PVMG(0x10d7a20) at 0x1037188
  REFCNT = 1
  FLAGS = (POK,pPOK)
  IV = 0
  NV = 0
  PV = 0x1122220 "\346\227\205\350\241\214"\0
  CUR = 6
  LEN = 16
after decode_utf8
SV = PV(0x1019740) at 0x1037218
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x1097030 "\346\227\205\350\241\214"\0 [UTF8 "\x{65c5}\x{884c}"]
  CUR = 6
  LEN = 16

same
same
differnet
compared:旅行
before decode_utf8
SV = PVMG(0x10d7a50) at 0x10370e0
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  IV = 0
  NV = 0
  PV = 0x110acb0 "\303\246\302\227\302\205\303\250\302\241\302\214"\0 [UTF8 "\x{e6}\x{97}\x{85}\x{e8}\x{a1}\x{8c}"]
  CUR = 12
  LEN = 16
after decode_utf8
SV = PVMG(0x10d7960) at 0x10371e8
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  IV = 0
  NV = 0
  PV = 0x1030240 "\303\246\302\227\302\205\303\250\302\241\302\214"\0 [UTF8 "\x{e6}\x{97}\x{85}\x{e8}\x{a1}\x{8c}"]
  CUR = 12
  LEN = 16
}}
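Going by the first case in the output above, where the file is read without any decoding layer and both the comparison and the regular expression succeed, one possible arrangement is to read the log as raw bytes, URL-decode, and only then call decode_utf8. The following is a sketch under that assumption. The sample log line, the temporary file, and the variable names are hypothetical and not part of the original test.

```perl
use strict;
use warnings;
use utf8;                                   # so /旅行/ below is a character literal
use Encode qw(decode_utf8);
use File::Temp qw(tempfile);

# Write a sample line resembling an access-log entry (hypothetical content).
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "GET /search?q=%E6%97%85%E8%A1%8C HTTP/1.1\n";
close $out or die "close: $!";

# Read as raw bytes: no :utf8 or :encoding layer on the handle.
open my $in, '<:raw', $path or die "open $path: $!";
my $line = <$in>;
close $in;

# Unescape %XX to bytes first, then decode those bytes exactly once.
(my $bytes = $line) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
my $text = decode_utf8($bytes);

my $matched = $text =~ /旅行/ ? 1 : 0;
print "matched=$matched\n";
```

The point of the design is that the percent-escapes are expanded while the string is still a byte string, so the decoding step sees the original UTF-8 byte sequence rather than characters already promoted by an input layer.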
