
Parsing HTML documents is never easy, and some languages support the task better than others. I thought Groovy wasn't one of them, but I was wrong. I had to parse an HTML document that wasn't always well-formed, which made the task harder. Dennis's post was very useful when I was getting started: it introduced me to the TagSoup parser, a very useful Java library for parsing HTML.

In the end I did not use TagSoup and went with NekoHTML instead. The HTML I had to parse wasn't always structured the same way. For example,

<head>
   <title>Hiya!</title>
</head>
<body>
   <table>
   <tr>
   <th colspan='3'>Settings</th>
   <td>First cell r1</td>
   <td>Second cell r1</td>
   </tr>
   </table>
   <table>
   <tr>
   <th colspan='3'>Other Settings</th>
   <td>First cell r2</td>
   <td>Second cell r2</td>
   </tr>
   </table>
</body>

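This HTML isn't well-formed (for a start, it has no root html element), and my parsing code would break the moment someone decided to insert a TBODY tag into the table. This is where NekoHTML comes in. It converts this HTML into well-formed XML that can then be read by XmlSlurper. With its default settings it also upper-cases element names and fills in missing elements such as HTML and TBODY, which is why the code further down uses names like BODY and TBODY.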

First we define the parser.

def parser = new org.cyberneko.html.parsers.SAXParser()
parser.setFeature('http://xml.org/sax/features/namespaces', false)

We switch off namespace processing because we don't really care about namespaces here. The parser has a host of other options, including the ability to remove certain elements. I haven't tested this (I couldn't get it to work with Groovy), but I can imagine it being very useful when you want to strip text formatting out of HTML.

Next, we define our slurper, giving it our newly created parser. We then ask the slurper to parse the text, and we get back a page.

def slurper = new XmlSlurper(parser)
def page = slurper.parseText(html)

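Because our slurper is Groovy, we can now reach the body of the HTML document directly through property access such as page.BODY. Strictly speaking, that dotted navigation is itself a GPath expression, so there is no need for a separate query language such as XPath, although that route remains possible.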
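To make the property access concrete, here is a minimal sketch that puts the steps so far into one script and reads two values from the sample page. The @Grab coordinates and version, the groovy.xml import (Groovy 3 or newer) and the variable names title and tableCount are my assumptions, not part of the original setup.

```groovy
// Assumption: NekoHTML is fetched through Grape; remove the @Grab line
// if the NekoHTML jar is already on your classpath.
@Grab('net.sourceforge.nekohtml:nekohtml:1.9.22')
import org.cyberneko.html.parsers.SAXParser
// Assumption: Groovy 3 or newer. Older versions ship groovy.util.XmlSlurper instead.
import groovy.xml.XmlSlurper

// The sample page from above, including its missing root element
def html = '''\
<head>
   <title>Hiya!</title>
</head>
<body>
   <table>
   <tr>
   <th colspan='3'>Settings</th>
   <td>First cell r1</td>
   <td>Second cell r1</td>
   </tr>
   </table>
   <table>
   <tr>
   <th colspan='3'>Other Settings</th>
   <td>First cell r2</td>
   <td>Second cell r2</td>
   </tr>
   </table>
</body>
'''

// Same parser and slurper setup as in the post
def parser = new SAXParser()
parser.setFeature('http://xml.org/sax/features/namespaces', false)
def page = new XmlSlurper(parser).parseText(html)

// page is the root element; element names are upper-case by default
def title = page.HEAD.TITLE.text()
def tableCount = page.BODY.TABLE.size()

println title       // prints "Hiya!"
println tableCount  // prints 2
```

Every step in a path like page.HEAD.TITLE returns another node set, so the same style of navigation keeps working however deep you go.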

As an example, I wanted to find the first table with a particular heading. In the HTML above, this is the table with the heading Settings. Here is how to do it. (Disclaimer: I had help from SO, where I asked how to do this.)

def settingsTableNode = page.BODY.TABLE.find { table ->
    table.TBODY.TR.TH.text() == 'Settings'
}

Because I now hold the table node, I can access all the other rows of the table. This makes scraping extremely easy: I can read just the parts of the table I want, or run further find calls on the table node to get at other sub-entities.
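As a rough illustration of reading from the table node, the sketch below collects the data cells of the Settings table and looks up an attribute on its heading. It repeats the parser setup so that it runs on its own. The @Grab coordinates, the groovy.xml import (Groovy 3 or newer) and the names cells and span are assumptions of mine, and the sample HTML is squeezed onto fewer lines.

```groovy
// Assumption: NekoHTML is fetched through Grape; remove the @Grab line
// if the NekoHTML jar is already on your classpath.
@Grab('net.sourceforge.nekohtml:nekohtml:1.9.22')
import org.cyberneko.html.parsers.SAXParser
// Assumption: Groovy 3 or newer. Older versions ship groovy.util.XmlSlurper instead.
import groovy.xml.XmlSlurper

// Compact copy of the sample page
def html = '''\
<head><title>Hiya!</title></head>
<body>
   <table>
   <tr><th colspan='3'>Settings</th><td>First cell r1</td><td>Second cell r1</td></tr>
   </table>
   <table>
   <tr><th colspan='3'>Other Settings</th><td>First cell r2</td><td>Second cell r2</td></tr>
   </table>
</body>
'''

def parser = new SAXParser()
parser.setFeature('http://xml.org/sax/features/namespaces', false)
def page = new XmlSlurper(parser).parseText(html)

// Locate the table by its heading, as in the post
def settingsTableNode = page.BODY.TABLE.find { table ->
    table.TBODY.TR.TH.text() == 'Settings'
}

// Gather the text of every data cell in that table
def cells = settingsTableNode.TBODY.TR.TD.collect { it.text() }

// Attributes are reached with the @ prefix
def span = settingsTableNode.TBODY.TR.TH.@colspan.text()

println cells
println span
```

Swapping 'Settings' for 'Other Settings' in the find closure should select the second table in the same way.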

Groovy makes not only XML parsing easy, but HTML parsing too.
