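Parsing a very long text

Ryan

I am a complete beginner in Python, but I am building a web scraper as a project. I am using Jupyter Notebook, BeautifulSoup, and lxml.

I managed to get the text that contains all the information I need, but now I don't know what to do with it. I want to extract specific data such as longitude, latitude, site ID, and direction (north, south, etc.), and I want to download the photos and rename them. I need to do this for all 41 sites. If anyone can suggest any packages or methods, I would appreciate it. Thanks!

Here is a small part of the text I scraped (the pattern repeats 41 times):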

{
  "count": 41,
  "message": "success",
  "results": [
    {
      "protocol": "land_covers",
      "measuredDate": "2020-06-13",
      "createDate": "2020-06-13T16:35:04",
      "updateDate": "2020-06-15T14:00:10",
      "publishDate": "2020-07-17T21:06:31",
      "organizationId": 17043304,
      "organizationName": "United States of America Citizen Science",
      "siteId": 202689,
      "siteName": "18TWK294769",
      "countryName": null,
      "countryCode": null,
      "latitude": xx.xxx(edited),
      "longitude": xx.xxx(edited),
      "elevation": 25.4,
      "pid": 163672280,
      "data": {
        "landcoversDownwardPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682247/original.jpg",
        "landcoversEastExtraData": "(source: app, (compassData.horizon: -14.32171587255965))",
        "landcoversEastPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682242/original.jpg",
        "landcoversMucCode": null,
        "landcoversUpwardPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682246/original.jpg",
        "landcoversEastCaption": "",
        "landcoversMeasurementLatitude": xx.xxx(edited),
        "landcoversWestClassifications": null,
        "landcoversNorthCaption": "",
        "landcoversNorthExtraData": "(source: app, (compassData.horizon: -10.817734330181267))",
        "landcoversDataSource": "GLOBE Observer App",
        "landcoversDryGround": true,
        "landcoversSouthClassifications": null,
        "landcoversWestCaption": "",
        "landcoversNorthPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682241/original.jpg",
        "landcoversUpwardCaption": "",
        "landcoversDownwardExtraData": "(source: app, (compassData.horizon: -84.48900393488086))",
        "landcoversEastClassifications": null,
        "landcoversMucDetails": "",
        "landcoversMeasuredAt": "2020-06-13T15:12:00",
        "landcoversDownwardCaption": "",
        "landcoversSouthPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682243/original.jpg",
        "landcoversMuddy": false,
        "landcoversWestPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682245/original.jpg",
        "landcoversStandingWater": false,
        "landcoversLeavesOnTrees": true,
        "landcoversUserid": 67150810,
        "landcoversSouthExtraData": "(source: app, (compassData.horizon: -14.872806403121302))",
        "landcoversSouthCaption": "",
        "landcoversRainingSnowing": false,
        "landcoversUpwardExtraData": "(source: app, (compassData.horizon: 89.09211989270894))",
        "landcoversMeasurementElevation": 24.1,
        "landcoversWestExtraData": "(source: app, (compassData.horizon: -15.47334477111039))",
        "landcoversLandCoverId": 32043,
        "landcoversMeasurementLongitude": xx.xxx(edited),
        "landcoversMucDescription": null,
        "landcoversSnowIce": false,
        "landcoversNorthClassifications": null,
        "landcoversFieldNotes": "(none)"
      }
    },
    {
      "protocol": "land_covers",
      "measuredDate": "2020-06-13",
      "createDate": "2020-06-13T16:35:04",
      "updateDate": "2020-06-15T14:00:10",
      "publishDate": "2020-07-17T21:06:31",
      "organizationId": 17043304,
      "organizationName": "United States of America Citizen Science",
      "siteId": 202689,
      "siteName": "18TWK294769",
      "countryName": null,
      "countryCode": null,
      "latitude": xx.xxx(edited),
      "longitude": xx.xxx(edited),
      "elevation": 25.4,
      "pid": 163672280,
      "data": {
        "landcoversDownwardPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682240/original.jpg",
        "landcoversEastExtraData": "(source: app, (compassData.horizon: -6.06710116543897))",
        "landcoversEastPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682235/original.jpg",
        "landcoversMucCode": null,
        "landcoversUpwardPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682239/original.jpg",
        "landcoversEastCaption": "",
        "landcoversMeasurementLatitude": xx.xxx(edited),
        "landcoversWestClassifications": null,
        "landcoversNorthCaption": "",
        "landcoversNorthExtraData": "(source: app, (compassData.horizon: -9.199031748908894))",
        "landcoversDataSource": "GLOBE Observer App",
        "landcoversDryGround": true,
        "landcoversSouthClassifications": null,
        "landcoversWestCaption": "",
        "landcoversNorthPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682233/original.jpg",
        "landcoversUpwardCaption": "",
        "landcoversDownwardExtraData": "(source: app, (compassData.horizon: -88.86569321651771))",
        "landcoversEastClassifications": null,
        "landcoversMucDetails": "",
        "landcoversMeasuredAt": "2020-06-13T15:07:00",
        "landcoversDownwardCaption": "",
        "landcoversSouthPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682236/original.jpg",
        "landcoversMuddy": false,
        "landcoversWestPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682237/original.jpg",
        "landcoversStandingWater": false,
        "landcoversLeavesOnTrees": true,
        "landcoversUserid": 67150810,
        "landcoversSouthExtraData": "(source: app, (compassData.horizon: -11.615041431350335))",
        "landcoversSouthCaption": "",
        "landcoversRainingSnowing": false,
        "landcoversUpwardExtraData": "(source: app, (compassData.horizon: 86.6284079864236))",
        "landcoversMeasurementElevation": 24,
        "landcoversWestExtraData": "(source: app, (compassData.horizon: -9.251774266832626))",
        "landcoversLandCoverId": 32042,
        "landcoversMeasurementLongitude": xx.xxx(edited),
        "landcoversMucDescription": null,
        "landcoversSnowIce": false,
        "landcoversNorthClassifications": null,
        "landcoversFieldNotes": "(none)"
      }
    },

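Aaron

Seeing some of your code would help. That said, as has already been pointed out, the built-in json library will do what you need. This output is in JSON format; if the format is new to you, an introduction to JSON is worth reading first.

Suppose the output shown here is stored in a variable named data. You can convert this JSON data into a dictionary.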

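Code example

import json
# data is assumed to be a str holding the scraped JSON text, so use json.loads;
# json.load is for an open file object instead.
data_dict = json.loads(data)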

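json.loads takes a string of JSON text and converts it into Python objects; json.load does the same for an open file object. The parser reads the text, checks that it is valid JSON, and uses a conversion table to map each JSON type to a Python type: an object becomes a dict, an array a list, a string a str, a number an int or float, true and false become True and False, and null becomes None. The full table is in the json module documentation.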
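To make the conversion concrete, here is a minimal sketch on a shortened stand-in for the scraped text. The variable names are my own, and the latitude value is a placeholder because the real coordinates were edited out of the question.

```python
import json

# A shortened, hypothetical stand-in for the scraped text.
# The latitude is a placeholder, not real data.
sample_text = """
{
  "count": 1,
  "message": "success",
  "results": [
    {
      "siteId": 202689,
      "countryName": null,
      "latitude": 1.5,
      "data": {"landcoversDryGround": true}
    }
  ]
}
"""

parsed = json.loads(sample_text)   # str -> Python objects
first = parsed["results"][0]       # "results" is a list of dicts

print(type(parsed).__name__)             # dict
print(type(parsed["results"]).__name__)  # list
print(first["countryName"])              # None (JSON null)
print(first["data"]["landcoversDryGround"])  # True (JSON true)
```

If the text is not valid JSON, as with the "xx.xxx(edited)" placeholders in the question, json.loads raises json.JSONDecodeError instead of returning a dictionary.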

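Now that you have a Python dictionary, you can access the data in it. So let's go through longitude, latitude, site ID, and direction (north, south, etc.). I see an opening '[' without a matching ']'. From your description I can only assume that this list contains 41 items, so I will start with the first result. You can easily loop over the list to get all 41 results.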

# Keys are case-sensitive and must match the JSON exactly ("latitude", "siteId").
longitude = data_dict['results'][0]['longitude']
latitude = data_dict['results'][0]['latitude']
site_id = data_dict['results'][0]['siteId']
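Looping over all results and collecting the photo URLs per direction could look roughly like the sketch below. The key names come from the sample in the question; the function names, the DIRECTIONS list, and the file naming scheme are assumptions of mine, and the coordinates are placeholders. Both results in the sample share the same siteId, so the sketch adds the position in the list to keep file names unique.

```python
import urllib.request

# Direction labels as they appear in keys such as "landcoversNorthPhotoUrl".
DIRECTIONS = ["North", "South", "East", "West", "Upward", "Downward"]

def extract_records(data_dict):
    """Return one flat dict per entry in data_dict['results']."""
    records = []
    for result in data_dict["results"]:
        photos = {}
        for direction in DIRECTIONS:
            url = result["data"].get(f"landcovers{direction}PhotoUrl")
            if url:  # skip keys that are missing or null
                photos[direction.lower()] = url
        records.append({
            "site_id": result["siteId"],
            "latitude": result["latitude"],
            "longitude": result["longitude"],
            "photos": photos,
        })
    return records

def photo_filename(record, index, direction):
    # Hypothetical naming scheme: <siteId>_<position in list>_<direction>.jpg
    return f"{record['site_id']}_{index}_{direction}.jpg"

def download_photos(records):
    """Download every photo into the current directory (needs network access)."""
    for index, record in enumerate(records):
        for direction, url in record["photos"].items():
            urllib.request.urlretrieve(url, photo_filename(record, index, direction))

# Tiny stand-in for data_dict; coordinates are placeholders.
sample = {
    "results": [
        {
            "siteId": 202689,
            "latitude": 0.0,
            "longitude": 0.0,
            "data": {
                "landcoversNorthPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682241/original.jpg",
                "landcoversSouthPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682243/original.jpg",
                "landcoversMucCode": None,
            },
        }
    ]
}

records = extract_records(sample)
print(records[0]["site_id"], sorted(records[0]["photos"]))
print(photo_filename(records[0], 0, "north"))
```

Calling download_photos(records) would then fetch the files; it is left out of the example run so the sketch works without a network connection.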

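Tips

  1. I always use Jupyter notebooks as a quick way to try out pulling the specific data I want from a JSON object; it can sometimes take a while to reach exactly the right part. That way, by the time I write the variables, I know I am getting the data I want. JSON objects can be heavily nested and hard to keep track of.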
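When the nesting gets hard to follow, a small helper that outlines the keys and value types can be handy in a notebook cell. This is only a sketch; describe is a hypothetical helper of mine, not part of the json module, and it looks only at the first element of each list because the rest repeat the same pattern.

```python
def describe(value, indent=0):
    """Return lines outlining the keys and value types of nested JSON data."""
    lines = []
    pad = "  " * indent
    if isinstance(value, dict):
        for key, item in value.items():
            lines.append(f"{pad}{key}: {type(item).__name__}")
            if isinstance(item, (dict, list)):
                lines.extend(describe(item, indent + 1))
    elif isinstance(value, list):
        for item in value[:1]:  # first element only
            lines.append(f"{pad}[0]: {type(item).__name__}")
            if isinstance(item, (dict, list)):
                lines.extend(describe(item, indent + 1))
    return lines

# Small stand-in shaped like the scraped data.
example = {
    "count": 2,
    "results": [{"siteId": 202689, "data": {"landcoversMuddy": False}}],
}

outline = describe(example)
print("\n".join(outline))
```

The outline shows at a glance that the site fields live under results, then a list index, then the key, which is the path used in the lookups above.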
