SCOPE: which library to use? urllib vs requests. I am trying to download a log file available at a URL. The URL is hosted on AWS and contains the file name; visiting it serves a .tar.gz file for download. I need to download this file into a directory of my choice, untar and decompress it to reach the JSON file inside, and finally parse that JSON file. While searching online I found the relevant information scattered across many places; in this question I have tried to consolidate it in one spot.
USING THE REQUESTS LIBRARY: a PyPI package that is generally considered superior for making HTTP requests at a high level. References:
Code:
import requests
import urllib.request
import tempfile
import shutil
import tarfile
import json
import os
import re

# respurl is the URL from which the file will be downloaded.
with requests.get(respurl, stream=True) as response:
    # stream=True is required by the iter_content call below.
    with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
        for chunk in response.iter_content(chunk_size=128):
            tmp_file.write(chunk)

with tarfile.open(tmp_file.name, "r:gz") as tf:
    # Save the extracted files in the directory of choice, named after the downloaded file.
    tf.extractall(path)
    # Walk the archive members to find and parse the JSON file inside the tar.gz.
    for tarinfo_member in tf:
        print("tarfilename", tarinfo_member.name, "is", tarinfo_member.size, "bytes in size and is", end="")
        if tarinfo_member.isreg():
            print(" a regular file.")
        elif tarinfo_member.isdir():
            print(" a directory.")
        else:
            print(" something else.")
        if os.path.splitext(tarinfo_member.name)[1] == ".json":
            print("json file name:", os.path.splitext(tarinfo_member.name)[0])
            json_file = tf.extractfile(tarinfo_member)
            # Read the JSON file's contents for further processing.
            content = json_file.read()
            json_file_data = json.loads(content)
            print("Status Code", json_file_data[0]['status_code'])
            print("Response Body", json_file_data[0]['response'])
            # The 'response' field is itself a JSON string, so it has to be decoded a second time.
            print("Errors:", json.loads(json_file_data[0]['response'])['errors'])
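The double-decoding step above can be checked offline. A minimal, self-contained sketch (no network; it builds a throwaway tar.gz in memory with the same double-encoded JSON layout assumed above — the member name and field values are made up):

```python
import io
import json
import os
import tarfile

# Build a tar.gz in memory containing one JSON file whose 'response'
# field is itself a JSON-encoded string (the double-encoded layout).
payload = [{"status_code": 200,
            "response": json.dumps({"errors": []})}]
raw = json.dumps(payload).encode("utf-8")

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tf:
    info = tarfile.TarInfo(name="44301621eb-response.json")
    info.size = len(raw)
    tf.addfile(info, io.BytesIO(raw))
buf.seek(0)

# Parse it back the same way as above: one decode for the file,
# a second decode for the nested 'response' string.
with tarfile.open(fileobj=buf, mode="r:gz") as tf:
    for member in tf:
        if os.path.splitext(member.name)[1] == ".json":
            data = json.loads(tf.extractfile(member).read())
            print("Status Code", data[0]["status_code"])
            print("Errors:", json.loads(data[0]["response"])["errors"])
```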
To save the extracted files in a directory of choice named after the downloaded file, the variable `path` is constructed as below, where the example URL contains the filename "44301621eb-response.tar.gz":
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
PROJECT_NAME = 'your_project_name'
PROJECT_ROOT = os.path.join(BASE_DIR, PROJECT_NAME)
LOG_ROOT = os.path.join(PROJECT_ROOT, 'log')
# respurl is the URL from which the file will be downloaded;
# the second capture group holds the file name from the URL path.
filename = re.split(r"([^?]+)(?:.+/)([^#?]+)(\?.*)?", respurl)
path = os.path.join(LOG_ROOT, filename[2])
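As a quick sanity check of that pattern: splitting a hypothetical pre-signed S3 URL (the host and query string below are made up) puts the file name in the second capture group, since `re.split` includes captured groups in its result list:

```python
import re

# Hypothetical pre-signed URL; only its shape matters here.
respurl = ("https://s3.amazonaws.com/some-bucket/"
           "44301621eb-response.tar.gz?X-Amz-Signature=abc123")

# re.split returns ['', group1, group2, group3, ''] here;
# index 2 is the path component after the last '/'.
filename = re.split(r"([^?]+)(?:.+/)([^#?]+)(\?.*)?", respurl)
print(filename[2])  # 44301621eb-response.tar.gz
```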
COMPARISON WITH urllib

To understand the subtle differences, I implemented the same code using urllib as well.

Note that the tempfile usage differs slightly; this is what worked for me. Since the response objects returned by urllib and requests are different, I had to use the shutil library with urllib: shutil's copyfileobj method did not work for me with the requests response object.
with urllib.request.urlopen(respurl) as File:
    with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
        # Copy the response stream straight into the temporary file.
        shutil.copyfileobj(File, tmp_file)

with tarfile.open(tmp_file.name, "r:gz") as tf:
    print("Temp tf File:", tf.name)
    tf.extractall(path)
    for tarinfo in tf:
        print("tarfilename", tarinfo.name, "is", tarinfo.size, "bytes in size and is", end="")
        if tarinfo.isreg():
            print(" a regular file.")
        elif tarinfo.isdir():
            print(" a directory.")
        else:
            print(" something else.")
        if os.path.splitext(tarinfo.name)[1] == ".json":
            print("json file name:", os.path.splitext(tarinfo.name)[0])
            json_file = tf.extractfile(tarinfo)
            # Read the JSON file's contents for further processing.
            content = json_file.read()
            json_file_data = json.loads(content)
            print("Status Code", json_file_data[0]['status_code'])
            print("Response Body", json_file_data[0]['response'])
            # The 'response' field is itself a JSON string, so it has to be decoded a second time.
            print("Errors:", json.loads(json_file_data[0]['response'])['errors'])
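For what it's worth, shutil.copyfileobj fails on the requests Response object itself because a Response is not a file object; when stream=True is passed, though, its file-like .raw attribute can be used instead. copyfileobj only needs read()/write() methods, which is easy to check offline with in-memory stand-ins:

```python
import io
import shutil

# shutil.copyfileobj only needs .read()/.write(); demonstrate with
# in-memory stand-ins for the HTTP response and the temp file.
source = io.BytesIO(b"pretend this is the .tar.gz payload")
destination = io.BytesIO()
shutil.copyfileobj(source, destination)
print(destination.getvalue() == source.getvalue())  # True

# Sketch of the same idea with requests (network code, not run here):
#
#   with requests.get(respurl, stream=True) as response:
#       with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
#           shutil.copyfileobj(response.raw, tmp_file)
```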