BeautifulSoup get text between tags for one line

Mechatrnk

I have a bunch of HTML reports generated by GCOV branch and line coverage tools. The files look like this:

<tr>
<td align="right" class="lineno"><pre>224</pre></td>
<td align="right" class="linebranch"><span class="takenBranch" title="Branch 1 taken 329 times">&check;</span><span class="notTakenBranch" title="Branch 2 not taken">&cross;</span><span class="notTakenBranch" title="Branch 4 not taken">&cross;</span><span class="takenBranch" title="Branch 5 taken 329 times">&check;</span><br/><span class="notTakenBranch" title="Branch 6 not taken">&cross;</span><span class="takenBranch" title="Branch 7 taken 329 times">&check;</span></td>
<td align="right" class="linecount coveredLine"><pre>329</pre></td>
<td align="left" class="src coveredLine"><pre>        line of C++ code</pre></td>
</tr>

<tr>
<td align="right" class="lineno"><pre>225</pre></td>
<td align="right" class="linebranch"></td>
<td align="right" class="linecount uncoveredLine"><pre></pre></td>
<td align="left" class="src uncoveredLine"><pre>   another line of  C++ code;</pre></td>
</tr>

I would like to extract the text "(another) line of C++ code" and ideally also the line number, so that the output looks like this:

224 line of C++ code
225 another line of C++ code

I tried to use BeautifulSoup, but it does not produce the expected output. My code looks like this:

from itertools import islice
import codecs
import glob
from ntpath import join
import os
from bs4 import BeautifulSoup

lineNo = "<td align=\"right\" class=\"lineNo\"><pre>"
linetextCovered = "<td align=\"left\" class=\"src coveredLine\"><pre>"
linetextNotCovered = "<td align=\"left\" class=\"src uncoveredLine\"><pre>"
open('Output.txt', 'w').close() #Erase any content of Output.txt file

for filepath in glob.iglob('path/To/Reports/*.html'):
    with codecs.open(os.path.join(filepath), "r") as inputFile, open('Output.txt',"a") as outputFile:
        for num, line in enumerate(inputFile, 1):
            if lineNo in line:
                inputSoup = BeautifulSoup(line)
                text = inputSoup.getText()
                outputFile.write("".join(islice(text, 1) + "\t"))
            if linetextCovered or linetextNotCovered in line:
                inputSoup = BeautifulSoup(line)
                text = inputSoup.getText()
                outputFile.write("".join(islice(text, 4)))
            outputFile.write("\n")
print("Done")

But the output looks like this:

/* L
a:li
{

colo
text
}

What am I doing wrong? Thank you very much for any help.

mama

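You can do it like this, by parsing the HTML and looking up the cells of each row by class instead of matching raw strings: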

from bs4 import BeautifulSoup

html = '''
<tr>
<td align="right" class="lineno"><pre>224</pre></td>
<td align="right" class="linebranch"><span class="takenBranch" title="Branch 1 taken 329 times">&check;</span><span class="notTakenBranch" title="Branch 2 not taken">&cross;</span><span class="notTakenBranch" title="Branch 4 not taken">&cross;</span><span class="takenBranch" title="Branch 5 taken 329 times">&check;</span><br/><span class="notTakenBranch" title="Branch 6 not taken">&cross;</span><span class="takenBranch" title="Branch 7 taken 329 times">&check;</span></td>
<td align="right" class="linecount coveredLine"><pre>329</pre></td>
<td align="left" class="src coveredLine"><pre>        line of C++ code</pre></td>
</tr>

<tr>
<td align="right" class="lineno"><pre>225</pre></td>
<td align="right" class="linebranch"></td>
<td align="right" class="linecount uncoveredLine"><pre></pre></td>
<td align="left" class="src uncoveredLine"><pre>   another line of  C++ code;</pre></td>
</tr>
'''


# Each <tr> holds one source line; read the two cells by their CSS class
for tr in BeautifulSoup(html, 'html.parser').find_all('tr'):
    lineno = tr.find('td', {'class': 'lineno'}).text.strip()
    src = tr.find('td', {'class': 'src'}).text.strip()
    print(lineno, src)
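
To apply the same idea to the report files on disk, something along these lines might work. This is only a sketch: the helper names `extract_lines` and `write_report` are made up for illustration, and it assumes the reports are UTF-8 and that every row of interest has both a `lineno` and a `src` cell. Rows without both cells (table headers, for example) are skipped.

```python
import glob

from bs4 import BeautifulSoup


def extract_lines(html):
    """Return (line number, source text) pairs found in one report's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        lineno = tr.find("td", class_="lineno")
        # class_="src" matches both "src coveredLine" and "src uncoveredLine"
        src = tr.find("td", class_="src")
        if lineno is None or src is None:
            continue  # not a source-line row, e.g. a header row
        rows.append((lineno.get_text(strip=True), src.get_text(strip=True)))
    return rows


def write_report(pattern, out_path):
    """Write 'number<TAB>source' lines for every file matching pattern."""
    # Assumption: the reports are UTF-8 encoded; adjust if yours are not.
    with open(out_path, "w", encoding="utf-8") as out:
        for filepath in sorted(glob.glob(pattern)):
            with open(filepath, encoding="utf-8") as report:
                for lineno, src in extract_lines(report.read()):
                    out.write(f"{lineno}\t{src}\n")
```

You would call it as `write_report('path/To/Reports/*.html', 'Output.txt')`. As for what went wrong in the original loop: `if linetextCovered or linetextNotCovered in line:` is always true, because a non-empty string is truthy, so every line of the file (including the CSS at the top) was written, and `islice(text, 4)` kept only its first four characters. The `lineNo` marker also never matched, since the HTML uses lowercase `lineno`.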


