How do I reformat a CSV containing raw HTML into a cleaned dataset CSV?

Aaron posted in Dev

Aaron

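I was given a dataset and need to convert the HTML embedded in its cells into a clean CSV with the HTML stripped out. The expected result is given. Inside the HTML are individually identified files, and each file must become its own row. The columns come from a separate cell that holds individual keywords (also embedded in HTML). Each keyword needs to become a new column, marked TRUE if the keyword is found for that row or FALSE if it is not. The solution needs to be sensitive to keywords that were previously generated and marked TRUE.

I have searched for similar questions to use as examples without success, either because I lack the right technical vocabulary (I am not a data-cleaning expert) or because the requirement is unusual.

Here is a typical row in the CSV...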

    "<div id="categories">
    <h3>Categories</h3>
    <ul>
    <li><a href="">Keyword1</a></li>
    <li><a href="">Keyword2</a></li>
    </ul>
    </div>
    ","<div id="file"><h3>File</h3>, <div id="image">
    <a href="A">A.jpg</a>
    <br/></div>
    ,  <div id="image">
    <a href="B">B.jpg</a>
    <br/></div>
    </div>
    "

The number of keywords and files varies from row to row.

Expected result

File, Keyword1, Keyword2, Keyword3
A.jpg, TRUE, TRUE, FALSE
B.jpg, TRUE, TRUE, FALSE
C.jpg, TRUE, FALSE, TRUE
D.jpg, FALSE, FALSE, TRUE
E.jpg, FALSE, FALSE, TRUE

Chiheb Nexus

Here is one way to get the desired output using BeautifulSoup:

from bs4 import BeautifulSoup as bs


a = '''
    <div id="categories">
        <h3>Categories</h3>
        <ul>
            <li><a href="">Keyword1</a></li>
            <li><a href="">Keyword2</a></li>
        </ul>
    </div>
    ","
    <div id="file">
        <h3>File</h3>,
        <div id="image">
            <a href="A">A.jpg</a>
            <br/>
        </div>
        ,
        <div id="image">
            <a href="B">B.jpg</a>
            <br/>
        </div>
    </div>
'''


def find_elms(soup, tag, attribute):
    """Find the block by its tag and attribute values and return the text of its links"""
    block = soup.find(tag, attribute)
    if block:
        # find_all is the current name; findAll is a deprecated alias
        return [elm.text for elm in block.find_all('a')]
    return []


def pretty_print(master, categories, files):
    """Print a header line, then one line per file with a True/False flag per keyword"""
    print('\t'.join('{:<12}'.format(elm) for elm in master))
    for k in files:
        out = '{:<12}'.format(k)
        # master[0] is the 'File' column, so the keyword columns start at index 1
        cells = '\t'.join('{:<12}'.format(str(j in categories)) for j in master[1:])
        print(out, cells)


master_categories = ['File', 'Keyword1', 'Keyword2', 'Keyword3']
soup = bs(a, 'html.parser')
categories = find_elms(soup, 'div', {'id': 'categories'})
files = find_elms(soup, 'div', {'id': 'file'})
pretty_print(master_categories, categories, files)

Output:

File            Keyword1        Keyword2        Keyword3    
A.jpg        True           True            False       
B.jpg        True           True            False
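
The answer above handles a single hard-coded row and prints the table. As a rough sketch of how the same idea might be extended to a whole file, the code below reads every row, collects keywords in first-seen order so that columns created by earlier rows are reused, and writes TRUE/FALSE cells. It rests on several assumptions that are not confirmed by the question: the input CSV is properly quoted (embedded double quotes are doubled), the first cell holds the categories HTML and the second the files HTML, and every `<a>` in those cells is a keyword or a file name respectively. It also swaps BeautifulSoup for the standard library's `html.parser` to avoid a dependency; `AnchorText`, `anchor_texts`, `clean_rows` and `convert` are hypothetical names.

```python
import csv
import io
from html.parser import HTMLParser


class AnchorText(HTMLParser):
    """Collect the text of every <a> element in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.texts = []
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._in_anchor = True
            self.texts.append('')

    def handle_endtag(self, tag):
        if tag == 'a':
            self._in_anchor = False

    def handle_data(self, data):
        if self._in_anchor:
            self.texts[-1] += data


def anchor_texts(html):
    """Return the stripped, non-empty link texts found in html."""
    parser = AnchorText()
    parser.feed(html)
    parser.close()
    return [t.strip() for t in parser.texts if t.strip()]


def clean_rows(reader):
    """Turn (categories_html, files_html) rows into a header plus one row per file."""
    records = []   # (file name, set of keywords) pairs
    keywords = []  # every keyword seen so far, in first-seen order
    for row in reader:
        if len(row) < 2:
            continue  # assumption: rows without both cells are skipped
        found = anchor_texts(row[0])
        for kw in found:
            if kw not in keywords:
                keywords.append(kw)
        for name in anchor_texts(row[1]):
            records.append((name, set(found)))
    header = ['File'] + keywords
    body = [[name] + ['TRUE' if kw in kws else 'FALSE' for kw in keywords]
            for name, kws in records]
    return [header] + body


def convert(src, dst):
    """Read raw rows from the file object src and write the cleaned CSV to dst."""
    csv.writer(dst).writerows(clean_rows(csv.reader(src)))
```

To use it, `convert(src, dst)` can be called with two file objects opened with `newline=''`, as the `csv` module recommends. Because the keyword columns are only known after all rows have been read, the sketch keeps every record in memory before writing, which should be fine for modest files but may need rethinking for very large ones.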

Edited on 2020-12-27
