How to get orphaned text with Jsoup?

Enuff

I have this HTML:

<span>This is the first text</span>
More text here 
Another line of text
<span>Text in the span</span>
<span>Another text in span</span>
This is another line

I want to get all of the text in document order, something like this array:

[
"Span:This is the first text",
"More text here",
"Another line of text",
"Span:Text in the span",
"Span:Another text in span",
"This is another line",
]
ProgrammersBlock

I would use a recursive method that takes your starting element and iterates over its child nodes. For each TextNode, print its text. For each Element, recurse into its child nodes.

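import java.io.File;
import java.io.IOException;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public static void main(String[] args) throws IOException
{
    //I put your HTML in the body tag in a local file
    Document doc = Jsoup.parse(new File("input/20160505.html"), "UTF-8");
    Element rootTag = doc.body();
    printTextOfTag(rootTag);
}

//Walks the child nodes in document order and prints every piece of text
public static void printTextOfTag(Element currentTag)
{
    List<Node> nodes = currentTag.childNodes();
    for(Node n : nodes)
    {
        if(n instanceof TextNode)
        {
            //Orphaned text, or the text directly inside an element
            System.out.println(((TextNode)n).text());
        }
        else if(n instanceof Element)
        {
            //Descend into the element to reach its text nodes
            printTextOfTag((Element)n);
        }
    }
}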

Output

This is the first text

 More text here Another line of text 

Text in the span



Another text in span

 This is another line
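
The output above puts "More text here" and "Another line of text" on one line because both sit in a single text node, and TextNode.text() normalizes whitespace. One way to get closer to the requested array is to read the raw text with Jsoup's TextNode.getWholeText() instead and split it on line breaks. The sketch below shows only that splitting and labelling step in plain Java; the class name OrphanTextLines and the helpers splitLines and labelled are hypothetical names for illustration, not part of Jsoup.

```java
import java.util.ArrayList;
import java.util.List;

public class OrphanTextLines {

    // Splits the raw text of one text node into trimmed, non-empty lines.
    public static List<String> splitLines(String wholeText) {
        List<String> lines = new ArrayList<>();
        for (String line : wholeText.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return lines;
    }

    // Builds an entry such as "Span:Text in the span" from a tag name and its text.
    // Assumes tagName is non-empty, which holds for real element names.
    public static String labelled(String tagName, String text) {
        String name = tagName.substring(0, 1).toUpperCase() + tagName.substring(1);
        return name + ":" + text.trim();
    }

    public static void main(String[] args) {
        // Stand-in for the raw content of the text node between the first two spans.
        String wholeText = "\nMore text here \nAnother line of text\n";
        System.out.println(splitLines(wholeText));
        System.out.println(labelled("span", "Text in the span"));
    }
}
```

In printTextOfTag, the TextNode branch could then add splitLines(((TextNode) n).getWholeText()) to a result list instead of printing. For a span that contains only text, the Element branch could add labelled(e.tagName(), e.text()) instead of recursing, where e is the node cast to Element. Whitespace-only text nodes then produce no entries, so the blank lines in the output disappear.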
