Saturday, April 24, 2010

Partial HTML parser

This is a partial HTML parser. It does not parse a tag and value nested inside a parent tag: only the outer tag is parsed, not the one inside it. This is for learning purposes only.

Expression : "\\<(\\S*).*?>.*?"

To make this work in Java, you have to enable DOTALL mode when compiling the pattern.

import java.util.regex.Pattern;

Pattern pattern = Pattern.compile("\\<(\\S*).*?>.*?", Pattern.DOTALL);

*In DOTALL mode, . also matches line terminators.
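As a rough illustration of why the flag matters, here is a minimal sketch that runs the expression with and without DOTALL. The class name DotallDemo, the helper firstMatch and the sample input are made up for this example; only Pattern, Matcher and Pattern.DOTALL come from the JDK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotallDemo {
    // The expression from the post, written as a Java string literal.
    static final String EXPRESSION = "\\<(\\S*).*?>.*?";

    // Hypothetical helper: returns the first match of EXPRESSION in input,
    // or null when nothing matches. flags is passed to Pattern.compile.
    static String firstMatch(String input, int flags) {
        Matcher matcher = Pattern.compile(EXPRESSION, flags).matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // Made-up sample: the opening tag is split across two lines.
        String html = "<a\nhref=\"x\">link";

        // Without DOTALL, . cannot cross the line break, so nothing matches.
        System.out.println(firstMatch(html, 0));

        // With DOTALL, . also matches the line break and the tag is found.
        System.out.println(firstMatch(html, Pattern.DOTALL));
    }
}
```

With this input the first call finds no match at all, while the second returns the whole opening tag, line break included.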

Here \\< matches the < character, (\\S*) captures a run of non-whitespace characters as group 1, .*? matches any characters including whitespace (the ? makes the quantifier reluctant, so it matches as few characters as possible), > matches the > character, and the final .*? again reluctantly matches any characters including whitespace.
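To see what the reluctant ? changes, the sketch below compares the expression with a greedy variant in which both ? marks are removed. The greedy variant, the class name ReluctantDemo, the helper firstMatch and the sample input are assumptions added for illustration, not part of the original expression.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReluctantDemo {
    // Hypothetical helper: returns {whole match, group 1} for the first
    // match of regex in input, or null when nothing matches.
    static String[] firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        return matcher.find()
                ? new String[] { matcher.group(), matcher.group(1) }
                : null;
    }

    public static void main(String[] args) {
        // Made-up sample input.
        String html = "<b class=\"x\">bold</b>";

        // Expression from the post: .*? stops at the first > it can reach.
        String[] reluctant = firstMatch("\\<(\\S*).*?>.*?", html);

        // Greedy variant (an assumption for comparison): .* runs to the
        // end of the input and backtracks to the last >.
        String[] greedy = firstMatch("\\<(\\S*).*>.*", html);

        System.out.println(reluctant[0] + " | group 1: " + reluctant[1]);
        System.out.println(greedy[0] + " | group 1: " + greedy[1]);
    }
}
```

For this input the reluctant form matches only the opening tag, while the greedy form swallows everything up to the last >. In both cases group 1 is b, because \\S* stops at the first whitespace character.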
