I am using apache-storm 1.2.3 and elasticsearch 7.5.0. I have successfully extracted data from 3k news websites and visualized it in Grafana and Kibana. However, I am getting a lot of junk (e.g. advertisements) in the content; I have attached a screenshot of the content field. Can anyone suggest how to filter this out? I was thinking of feeding the HTML content from ES to some Python package. Am I on the right track? If not, please suggest a good solution. Thanks in advance.
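To make the Python idea concrete, here is a minimal sketch of what I had in mind. It assumes the raw HTML (or at least the indexed text) is available in an ES field; the index name, field names, and the trafilatura package are only placeholders, not a fixed choice.

# Sketch of the post-processing idea: pull documents from ES and strip
# boilerplate with a Python library. Index/field names are placeholders.
from elasticsearch import Elasticsearch          # pip install "elasticsearch>=7,<8"
from elasticsearch.helpers import scan
import trafilatura                               # pip install trafilatura

es = Elasticsearch(["localhost:9200"])

# iterate over every indexed page ("content" index and field are placeholders)
for hit in scan(es, index="content", query={"query": {"match_all": {}}}):
    html = hit["_source"].get("content", "")     # assumes the raw HTML is stored here
    if not html:
        continue
    # trafilatura tries to keep the main article text and drop ads/navigation
    cleaned = trafilatura.extract(html)
    if cleaned:
        # write the cleaned text back into a separate field
        es.update(index="content", id=hit["_id"], body={"doc": {"clean_text": cleaned}})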
Here is the crawler-conf.yaml file:
config:
  topology.workers: 1
  topology.message.timeout.secs: 300
  topology.max.spout.pending: 100
  topology.debug: false
  fetcher.threads.number: 50
  # override the JVM parameters for the workers
  topology.worker.childopts: "-Xmx2g -Djava.net.preferIPv4Stack=true"
  # mandatory when using Flux
  topology.kryo.register:
    - com.digitalpebble.stormcrawler.Metadata
  # metadata to transfer to the outlinks
  # used by Fetcher for redirections, sitemapparser, etc...
  # these are also persisted for the parent document (see below)
  # metadata.transfer:
  #  - customMetadataName
  # lists the metadata to persist to storage
  # these are not transfered to the outlinks
  metadata.persist:
    - _redirTo
    - error.source
    - isSitemap
    - isFeed
  http.agent.name: "Nitesh Singh"
  http.agent.version: "1.0"
  http.agent.description: "built with StormCrawler Elasticsearch Archetype 1.16"
  http.agent.url: "http://someorganization.com/"
  http.agent.email: "[email protected]"
  # The maximum number of bytes for returned HTTP response bodies.
  # The fetched page will be trimmed to 65KB in this case
  # Set -1 to disable the limit.
  http.content.limit: 65536
  # FetcherBolt queue dump => comment out to activate
  # if a file exists on the worker machine with the corresponding port number
  # the FetcherBolt will log the content of its internal queues to the logs
  # fetcherbolt.queue.debug.filepath: "/tmp/fetcher-dump-{port}"
  parsefilters.config.file: "parsefilters.json"
  urlfilters.config.file: "urlfilters.json"
  # revisit a page daily (value in minutes)
  # set it to -1 to never refetch a page
  fetchInterval.default: 1440
  # revisit a page with a fetch error after 2 hours (value in minutes)
  # set it to -1 to never refetch a page
  fetchInterval.fetch.error: 120
  fetchInterval.error: -1
  # text extraction for JSoupParserBolt
  textextractor.include.pattern:
    - DIV[id="maincontent"]
    - DIV[itemprop="articleBody"]
    - ARTICLE
  textextractor.exclude.tags:
    - STYLE
    - SCRIPT
  # custom fetch interval to be used when a document has the key/value in its metadata
  # and has been fetched successfully (value in minutes)
  # fetchInterval.FETCH_ERROR.isFeed=true: 30
  # fetchInterval.isFeed=true: 10
  # configuration for the classes extending AbstractIndexerBolt
  # indexer.md.filter: "someKey=aValue"
  indexer.url.fieldname: "url"
  indexer.text.fieldname: "content"
  indexer.canonical.name: "canonical"
  indexer.md.mapping:
    - parse.title=title
    - parse.keywords=keywords
    - parse.description=description
    - domain=domain
  # Metrics consumers:
  topology.metrics.consumer.register:
    - class: "org.apache.storm.metric.LoggingMetricsConsumer"
      parallelism.hint: 1
Have you configured the text extractor? For example:
# text extraction for JSoupParserBolt
textextractor.include.pattern:
- DIV[id="maincontent"]
- DIV[itemprop="articleBody"]
- ARTICLE
textextractor.exclude.tags:
- STYLE
- SCRIPT
If those elements are found, this restricts the extracted text to them and/or removes the elements listed in the exclusions.
Most news sites use some form of markup to delimit the main content.
For the elements you showed in your screenshot, you can add corresponding patterns to the include list.
You can also embed various boilerplate-removal libraries in a ParseFilter, but their accuracy varies widely.
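The ParseFilter route is Java-side; as a rough illustration of that variance using the Python post-processing the question mentions instead, you could run the same stored page through two such packages and compare the output. The packages below are just common examples, not a recommendation.

# Compare two boilerplate-removal packages on one saved page to see which
# suits your sites better; both packages are examples, not the only options.
import trafilatura                          # pip install trafilatura
from readability import Document            # pip install readability-lxml
from bs4 import BeautifulSoup               # pip install beautifulsoup4

html = open("sample_page.html", encoding="utf-8").read()   # any page saved locally

# readability-lxml returns the main block as HTML, so strip the tags afterwards
readability_text = BeautifulSoup(Document(html).summary(), "html.parser").get_text(" ", strip=True)

# trafilatura returns plain text directly (or None when it cannot extract anything)
trafilatura_text = trafilatura.extract(html) or ""

print("readability :", readability_text[:200])
print("trafilatura :", trafilatura_text[:200])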