I'm trying to figure out why this approach doesn't work:
my $url = 'www880740.com';
use Mojo::UserAgent;
my $ua = Mojo::UserAgent->new->max_redirects(3);
$ua->transactor->name( "Mozilla/5.0 (Windows; U; Windows NT 5.1; pl; rv:1.9; Gecko/2008052906 Firefox/3.0" );
my $tx = $ua->get(
$url =>
{ 'Accept-Charset' => 'ISO-8859-1,utf-8;q=0.7,*;q=0.7' }
);
my $page_title = $tx->result->dom->at( 'title' )->text;
print "GOT: $page_title \n";
foreach my $type (qw/Arabic Armenian Bengali Bopomofo Braille Buhid Canadian_Aboriginal Cherokee Cyrillic Devanagari Ethiopic Georgian Greek Gujarati Gurmukhi Han Hangul Hanunoo Hebrew Hiragana Inherited Kannada Katakana Khmer Lao Limbu Malayalam Mongolian Myanmar Ogham Oriya Runic Sinhala Syriac Tagalog Tagbanwa TaiLe Tamil Telugu Thaana Thai Tibetan/) {
if ($page_title =~ /\p{$type}/) {
print "$page_title seems to be $type!\n";
last;
}
}
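Basically, I want to fetch the title from a URL and check whether it matches any of those Unicode scripts. I assume it fails because I need to decode the title into something the regex can match against. When I load a curl'd copy of the page into memory instead, it works fine. Devel::Peek::Dump gives me: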
SV = PV(0x55cd8264d650) at 0x55cd824c4b10
REFCNT = 1
FLAGS = (POK,IsCOW,pPOK)
PV = 0x55cd82655d80 "\301\371\272\317\264\253\306\34644181.com/\301\371\272\317\264\253\306\346\313\304\262\273\317\361/\302\355\273\341\277\252\275\261\275\341\271\373/\317\343\270\333\301\371\272\317\264\253\306\346/\302\355\273\341\277\252\275\261\274\307\302\274/\317\343\270\333\271\322\305\306|\310\374\302\355\273\341\327\312\301\317"\0
CUR = 91
LEN = 96
COW_REFCNT = 0
Update: I finally got it working:
my $page_title = $tx->result->dom->at( 'title' )->text;
use Encode;
use Encode::Detect;
use Encode::HanExtra;
$page_title = decode("Detect", $page_title);
print "GOT: $page_title \n";
foreach my $type (qw/Arabic Armenian Bengali Bopomofo Braille Buhid Canadian_Aboriginal Cherokee Cyrillic Devanagari Ethiopic Georgian Greek Gujarati Gurmukhi Han Hangul Hanunoo Hebrew Hiragana Inherited Kannada Katakana Khmer Lao Limbu Malayalam Mongolian Myanmar Ogham Oriya Runic Sinhala Syriac Tagalog Tagbanwa TaiLe Tamil Telugu Thaana Thai Tibetan/) {
if ($page_title =~ /\p{Script_Extensions=$type}/) {
print "$page_title seems to be $type!\n";
last;
}
}
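This bit:
$page_title = decode("Detect", $page_title);
Detect tries to work out the encoding and then converts the bytes into Perl's internal representation (ready for my regex to work). I tried to post sample output, but for some reason it triggered the spam filter?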
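For readers without Encode::Detect installed, a rough core-only approximation is to try a short list of candidate encodings strictly and keep the first one that decodes cleanly. This is only a sketch under assumptions: `decode_first_valid` is a hypothetical helper, the candidate list is a guess, and this is not how Encode::Detect itself works. The sample bytes are the first eight bytes of the title from the Devel::Peek dump above.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of Encode): try each candidate encoding
# strictly and return the decoded text plus the encoding that worked.
sub decode_first_valid {
    my ($bytes, @candidates) = @_;
    for my $enc (@candidates) {
        my $copy = $bytes;    # decode() with a CHECK value may modify its input
        my $text = eval { decode($enc, $copy, Encode::FB_CROAK()) };
        return ($text, $enc) if defined $text;
    }
    return;                   # nothing decoded cleanly
}

# First eight bytes of the title, copied from the dump above.
my $bytes = "\301\371\272\317\264\253\306\346";

# Assumed candidate order: strict UTF-8 first, then the GB2312 (euc-cn) guess.
my ($title, $used) = decode_first_valid($bytes, 'UTF-8', 'euc-cn');

print "decoded with $used\n" if defined $used;
```

Order matters here: a single-byte encoding such as ISO-8859-1 accepts any byte sequence, so it should only ever be tried last, if at all.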
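The title is served with charset=gb2312, so it needs to be decoded into Perl's internal representation first.
The following code demonstrates decoding that particular site's title and printing it to the console.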
use strict;
use warnings;
use feature 'say';
use utf8;
use Mojo::UserAgent;
use Encode qw/encode decode/;
binmode STDOUT, ':encoding(UTF-8)';
my $url = 'www880740.com';
my $ua = Mojo::UserAgent->new->max_redirects(3);
$ua->transactor->name( 'Mozilla/5.0 (Windows; U; Windows NT 5.1; pl; rv:1.9; Gecko/2008052906 Firefox/3.0' );
my $res = $ua->get( $url )->result;
my $page_title = decode('euc-cn',$res->dom->at('title')->text);
say 'GOT: ' . $page_title;
# exit;    # uncomment to stop here, before the script check below
my @langs = qw/Arabic Armenian Bengali Bopomofo Braille Buhid
Canadian_Aboriginal Cherokee Cyrillic Devanagari
Ethiopic Georgian Greek Gujarati Gurmukhi Han
Hangul Hanunoo Hebrew Hiragana Inherited Kannada
Katakana Khmer Lao Limbu Malayalam Mongolian
Myanmar Ogham Oriya Runic Sinhala Syriac Tagalog
Tagbanwa TaiLe Tamil Telugu Thaana Thai Tibetan/;
for( @langs ) {
say "$page_title matches $_!" if $page_title =~ /\p{$_}/;
}
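The difference decoding makes can be checked without any network access. The sketch below reuses the first eight bytes of the title from the Devel::Peek dump in the question and assumes, as above, that they are GB2312 (euc-cn); the shortened script list is only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# First eight bytes of the title, copied from the dump in the question.
my $bytes = "\301\371\272\317\264\253\306\346";

# Undecoded, each byte is treated as a Latin-1 code point, so no Han match.
my $raw_matches = $bytes =~ /\p{Han}/ ? 1 : 0;

# Decoded (assuming euc-cn), the same data becomes real Unicode characters.
my $text = decode('euc-cn', $bytes);

# Collect every script from a shortened, illustrative list that matches.
my @scripts = grep { $text =~ /\p{Script_Extensions=$_}/ }
              qw(Cyrillic Greek Han Hangul Hiragana);

binmode STDOUT, ':encoding(UTF-8)';
print "raw bytes match Han: $raw_matches\n";
print "decoded text matches: @scripts\n";
```

Collecting all matches with grep, rather than stopping at the first, also makes it easy to see when a title mixes several scripts.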