How to filter StormCrawler data from Elasticsearch

Black Hawk

I am using Apache Storm 1.2.3 and Elasticsearch 7.5.0. I have successfully crawled data from 3k news websites and visualized it in Grafana and Kibana. I am getting a lot of junk (e.g. ads) in the content; I have attached a screenshot of the content. Can someone suggest how to filter it out? I am thinking of feeding the HTML content from ES to some Python package. Am I on the right track? If not, please suggest a good solution. Thanks in advance.

Here is the crawler-conf.yaml file:

config:
  topology.workers: 1
  topology.message.timeout.secs: 300
  topology.max.spout.pending: 100
  topology.debug: false

  fetcher.threads.number: 50

  # override the JVM parameters for the workers
  topology.worker.childopts: "-Xmx2g -Djava.net.preferIPv4Stack=true"

  # mandatory when using Flux
  topology.kryo.register:
    - com.digitalpebble.stormcrawler.Metadata

  # metadata to transfer to the outlinks
  # used by Fetcher for redirections, sitemapparser, etc...
  # these are also persisted for the parent document (see below)
  # metadata.transfer:
  # - customMetadataName

  # lists the metadata to persist to storage
  # these are not transfered to the outlinks
  metadata.persist:
   - _redirTo
   - error.source
   - isSitemap
   - isFeed

  http.agent.name: "Nitesh Singh"
  http.agent.version: "1.0"
  http.agent.description: "built with StormCrawler Elasticsearch Archetype 1.16"
  http.agent.url: "http://someorganization.com/"
  http.agent.email: "[email protected]"

  # The maximum number of bytes for returned HTTP response bodies.
  # The fetched page will be trimmed to 65KB in this case
  # Set -1 to disable the limit.
  http.content.limit: 65536

  # FetcherBolt queue dump => comment out to activate
  # if a file exists on the worker machine with the corresponding port number
  # the FetcherBolt will log the content of its internal queues to the logs
  # fetcherbolt.queue.debug.filepath: "/tmp/fetcher-dump-{port}"

  parsefilters.config.file: "parsefilters.json"
  urlfilters.config.file: "urlfilters.json"

  # revisit a page daily (value in minutes)
  # set it to -1 to never refetch a page
  fetchInterval.default: 1440

  # revisit a page with a fetch error after 2 hours (value in minutes)
  # set it to -1 to never refetch a page
  fetchInterval.fetch.error: 120
  fetchInterval.error: -1

  # text extraction for JSoupParserBolt
  textextractor.include.pattern:
   - DIV[id="maincontent"]
   - DIV[itemprop="articleBody"]
   - ARTICLE

  textextractor.exclude.tags:
   - STYLE
   - SCRIPT

  # custom fetch interval to be used when a document has the key/value in its metadata
  # and has been fetched successfully (value in minutes)
  # fetchInterval.FETCH_ERROR.isFeed=true: 30
  # fetchInterval.isFeed=true: 10

  # configuration for the classes extending AbstractIndexerBolt
  # indexer.md.filter: "someKey=aValue"
  indexer.url.fieldname: "url"
  indexer.text.fieldname: "content"
  indexer.canonical.name: "canonical"
  indexer.md.mapping:
  - parse.title=title
  - parse.keywords=keywords
  - parse.description=description
  - domain=domain

  # Metrics consumers:
  topology.metrics.consumer.register:
     - class: "org.apache.storm.metric.LoggingMetricsConsumer"
       parallelism.hint: 1

Julien Nioche

Have you configured the text extractor? For example:

  # text extraction for JSoupParserBolt
  textextractor.include.pattern:
   - DIV[id="maincontent"]
   - DIV[itemprop="articleBody"]
   - ARTICLE

  textextractor.exclude.tags:
   - STYLE
   - SCRIPT

This will restrict the text to the elements matching the patterns, if any are found, and/or remove the elements listed in the exclusions.

Most news sites use some form of markup to identify the main content.

Look at the markup the sites you crawl use for their main content and add patterns for those elements, as in the example below.
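
For instance, if one of the sites wrapped its stories in a div with the class "article-body" (an illustrative selector, not one taken from the original post), you could add a matching pattern alongside the existing ones:

  textextractor.include.pattern:
   - DIV[id="maincontent"]
   - DIV[itemprop="articleBody"]
   - DIV[class="article-body"]
   - ARTICLE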

You can embed one of the various boilerplate-removal libraries in a ParseFilter, but their accuracy varies considerably.
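
As a minimal sketch of that approach, assuming boilerpipe as the boilerplate-removal library (the package and class name below are illustrative, not part of StormCrawler):

package com.example.filters;

import com.digitalpebble.stormcrawler.parse.ParseFilter;
import com.digitalpebble.stormcrawler.parse.ParseResult;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import java.nio.charset.StandardCharsets;
import org.w3c.dom.DocumentFragment;

// Hypothetical ParseFilter that replaces the text extracted by
// JSoupParserBolt with boilerpipe's ArticleExtractor output.
public class BoilerpipeParseFilter extends ParseFilter {

    @Override
    public void filter(String url, byte[] content, DocumentFragment doc,
            ParseResult parse) {
        try {
            // boilerpipe works on the raw HTML; UTF-8 is assumed here
            String html = new String(content, StandardCharsets.UTF_8);
            String text = ArticleExtractor.INSTANCE.getText(html);
            if (text != null && !text.trim().isEmpty()) {
                parse.get(url).setText(text);
            }
        } catch (Exception e) {
            // on failure, keep the text the parser bolt already produced
        }
    }
}

To try it, the class would be registered by its fully qualified name in the parsefilters.json referenced by parsefilters.config.file above.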
