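I am scraping HTML with SimpleHtmlDom. It returns the HTML exactly as it was written, so many image and script links are broken because they do not contain the full URL of the resource, and the page renders with errors.
I have already corrected the root-relative links by replacing strings such as src="/ with src="http://example.com/, but it gets tricky when the link has no leading slash, because it is hard to tell whether it is a local link or a full URL.
For example: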
<img src="images/pic.jpg">
needs to be found and corrected to read:
<img src="http://example.com/images/pic.jpg">
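Is there a regular expression or function I can use to find src=" values that have no leading slash? It also needs to cover all kinds of links, e.g. a href, script, and so on.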
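To make the ambiguity concrete, here is a rough sketch of how the different kinds of src/href values can be told apart with PHP's built-in parse_url() instead of a regular expression. The function name classify_url and its return labels are made up for this illustration, and the type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper for illustration only: sorts an attribute value
// into one of the link kinds discussed above.
function classify_url(string $value): string
{
    if ($value === '') {
        return 'empty';
    }
    if ($value[0] === '#') {
        return 'fragment';           // "#top": stays on the current page
    }
    if (strpos($value, '//') === 0) {
        return 'protocol-relative';  // "//cdn.example.com/app.js": only the scheme is missing
    }
    if (parse_url($value, PHP_URL_SCHEME)) {
        return 'absolute';           // "http://...", "https://...", "mailto:..."
    }
    if ($value[0] === '/') {
        return 'root-relative';      // "/images/pic.jpg": relative to the site root
    }
    return 'path-relative';          // "images/pic.jpg": relative to the current directory
}
```

Only the last two kinds need the host prepended; the case without a leading slash additionally needs the directory of the page that was fetched.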
If you are using Simple HTML DOM, you can use the following snippet to adjust the URLs:
<?php
require 'simple_html_dom.php';

class Parser {
    protected $url;
    protected $url_parts;
    protected $html_dom = null;
    protected $path = null;

    public function __construct($url) {
        $this->setUrl($url);
    }

    protected function setUrl($url) {
        $this->url = $url;
        $this->url_parts = parse_url($url);
        return $this;
    }

    protected function getUrl() {
        return $this->url;
    }

    protected function getUrlParts() {
        return $this->url_parts;
    }

    protected function getHtmlDom() {
        if ($this->html_dom === null) {
            $this->html_dom = file_get_html($this->getUrl());
            // file_get_html() returns false when the page could not be loaded
            if (!$this->html_dom) throw new RuntimeException('Could not load '.$this->getUrl());
        }
        return $this->html_dom;
    }

    /**
     * Directory part of the page's path, used as the base for relative links:
     * - path ends with /, e.g. /foo/bar/foo/, so relative links resolve against /foo/bar/foo
     * - path doesn't end with /, e.g. /foo/bar/foo, so relative links resolve against /foo/bar
     */
    public function getPath() {
        if ($this->path === null) $this->path = isset($this->getUrlParts()['path']) ? implode('/', explode('/', $this->getUrlParts()['path'], -1)) : '';
        return $this->path;
    }

    // Scheme and host of the page, e.g. https://www.darkbee.be (defaults to http when the URL has no scheme)
    public function getHost() {
        return (isset($this->getUrlParts()['scheme']) ? $this->getUrlParts()['scheme'] : 'http').'://'.$this->getUrlParts()['host'];
    }
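    public function adjust($tag, $attribute) {
        foreach ($this->getHtmlDom()->find($tag) as $element) {
            $value = $element->$attribute;
            // Skip elements without the attribute (e.g. inline <script>) and fragment-only links such as "#top"
            if (!is_string($value) || $value === '' || $value[0] === '#') continue;
            // Only rewrite URLs without a scheme; http:, https:, mailto:, data: etc. are left alone
            if (parse_url($value, PHP_URL_SCHEME) !== null) continue;
            if (strpos($value, '//') === 0) {
                // Protocol-relative URL (//cdn.example.com/x.js): only the scheme is missing
                $element->$attribute = parse_url($this->getHost(), PHP_URL_SCHEME).':'.$value;
            } elseif (strpos($value, '/') === 0) {
                // Starts with /, so it is relative to the site root: prepend only the host part
                $element->$attribute = $this->getHost().$value;
            } else {
                // No leading slash, so it is relative to the directory of the current page
                $element->$attribute = $this->getHost().$this->getPath().'/'.$value;
            }
        }
        return $this;
    }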
    public function getHtml() {
        return (string)$this->getHtmlDom();
    }
}

$parser = new Parser('https://www.darkbee.be/stack/images/index.html');
$parser->adjust('img', 'src')
       ->adjust('a', 'href')
       ->adjust('link', 'href')
       ->adjust('script', 'src');

echo $parser->getHtml();
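The resolution rule used by adjust() can also be tried on its own, without fetching a page. Below is a minimal sketch of that rule as a plain function; resolve_url is a hypothetical name, not part of Simple HTML DOM, and the code assumes PHP 7 or later and a base URL that contains a host. It also collapses ./ and ../ segments in a simplified way (no special handling of a trailing dot segment), which the snippet above leaves to the browser.

```php
<?php
// Hypothetical helper, not part of simple_html_dom: resolves $relative against $base.
function resolve_url(string $base, string $relative): string
{
    // Already absolute (http:, https:, mailto:, ...): nothing to do
    if (parse_url($relative, PHP_URL_SCHEME)) {
        return $relative;
    }
    // Empty or fragment-only ("#top"): refers to the base document itself
    if ($relative === '' || $relative[0] === '#') {
        return strtok($base, '#') . $relative;
    }

    $parts  = parse_url($base);
    $scheme = $parts['scheme'] ?? 'http';

    // Protocol-relative ("//cdn.example.com/app.js"): only the scheme is missing
    if (strpos($relative, '//') === 0) {
        return $scheme . ':' . $relative;
    }

    $origin   = $scheme . '://' . $parts['host'] . (isset($parts['port']) ? ':' . $parts['port'] : '');
    $basePath = $parts['path'] ?? '/';

    if ($relative[0] === '/') {
        // Root-relative: replaces the whole base path
        $path = $relative;
    } else {
        // Path-relative: keep the base directory, drop the last segment (e.g. index.html)
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $relative;
    }

    // Split off any query string or fragment so that only the path is normalised
    $suffix = '';
    $pos = strcspn($path, '?#');
    if ($pos < strlen($path)) {
        $suffix = substr($path, $pos);
        $path   = substr($path, 0, $pos);
    }

    // Collapse "." and ".." segments (simplified)
    $segments = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($segments);
        } elseif ($segment !== '.') {
            $segments[] = $segment;
        }
    }

    return $origin . '/' . ltrim(implode('/', $segments), '/') . $suffix;
}

// The example from the question
echo resolve_url('http://example.com/', 'images/pic.jpg'), "\n";
```

For the example in the question this prints http://example.com/images/pic.jpg. A function like this could replace the three branches inside adjust() if ../ links or URLs with a port need to be handled.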