Question about parsing HTML using Regex and Java

Elham :

I have a question about finding HTML tags using Java and regex.

I am using the code below to find all the tags in the HTML; documentURL is the HTML content.

The find() method returns true, meaning that it can find something in the HTML, but the matches() method always returns false, and I am completely and utterly puzzled about this.

I referred to the Java documentation too but could not find my answer.

What is the correct way of using Matcher?

    Pattern keyLineContents = Pattern.compile("(<.*?>)");

    Matcher keyLineMatcher = keyLineContents.matcher(documentURL);

    boolean result = keyLineMatcher.find();

    boolean matchFound = keyLineMatcher.matches();

Doing something like this throws an exception:

     String abc = keyLineMatcher.group(0);

Thanks.

cletus :

The correct way to loop through matches is:

    Pattern p = Pattern.compile("<.*?>");
    Matcher m = p.matcher(htmlString);
    while (m.find()) {
      System.out.println(m.group());
    }
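To make the difference between find() and matches() concrete, here is a minimal sketch. The class name FindVsMatches and its helper methods are made up for illustration and are not from the original post; only the java.util.regex calls are standard.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class FindVsMatches {
    private static final Pattern TAG = Pattern.compile("<.*?>");

    // find() scans forward for the next substring that matches the pattern.
    static List<String> findAll(String html) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    // matches() succeeds only if the ENTIRE input is a single match.
    static boolean matchesWhole(String input) {
        return TAG.matcher(input).matches();
    }

    // Reproduces the sequence from the question: find(), then matches(), then group(0).
    static boolean groupThrowsAfterFailedMatches(String input) {
        Matcher m = TAG.matcher(input);
        m.find();     // succeeds if the input contains a tag
        m.matches();  // a failed attempt replaces the earlier match state
        try {
            m.group(0);
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(findAll("<p>Hello <b>world</b></p>")); // [<p>, <b>, </b>, </p>]
        System.out.println(matchesWhole("text <b>bold</b> text")); // false
        System.out.println(matchesWhole("<p>"));                   // true
        System.out.println(groupThrowsAfterFailedMatches("text <b>bold</b> text")); // true
    }
}
```

This likely explains both symptoms in the question: matches() returns false because the whole document is not one tag, and group(0) throws because the last match attempt (the failed matches()) left the matcher with no current match.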

That being said, regular expressions are an extremely poor tool for parsing HTML. The reason comes down to this: regular expressions work well for parsing regular languages, but HTML is a context-free language. Regular expressions fall down on things like nested tags, a > inside an attribute value, and so on.
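As a small illustration of the attribute-value problem, the sketch below runs the same pattern over a tag whose title attribute contains a >. The class name AttributePitfall and the sample markup are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class AttributePitfall {
    // Same pattern as in the answer: '<', as few characters as possible, then '>'.
    static List<String> findTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher m = Pattern.compile("<.*?>").matcher(html);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        // The '>' inside the quoted title value ends the first match too early.
        String html = "<a title=\"x > y\" href=\"#\">link</a>";
        for (String tag : findTags(html)) {
            System.out.println(tag);
        }
        // prints:
        // <a title="x >
        // </a>
    }
}
```

The opening tag is cut off at the first >, so the href attribute is lost. A real parser tracks whether it is inside a quoted value and does not have this problem.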

Use a dedicated HTML parser instead, such as HTML Parser.
