Agile Ajax

101 Ideas for JSONP - Idea #3: Scraping HTML With TagSoup and XQuery

Now that we have two ways (here and here) to pull XML data into our Ajax apps via JSONP, it's time to talk about how we can get our hands on more interesting data. Our two techniques allow us to proxy services that already produce valid XML, but not everything is available in this form. Lots of groovy data is still locked away in traditional web applications.

You can see an example of this problem even over at the leader in JSONP web services, Yahoo!. While Yahoo! provides an excellent delayed stock quote service, it lacks a corresponding ticker lookup service. They do, however, have a ticker lookup web app: http://finance.yahoo.com/lookup?t=S&m=US&s=IBM

Transforming this app into a service involves the usual calisthenics: transform the crappy HTML into well-formed XHTML, then extract and transform our desired data into valid XML. Now we could use XSLT to crack this nut, but every time I break out the XSLT I get flashbacks to my Recursive Function Theory course and think that I'm doing a proof. Truth be told, I've never been a big fan of functional programming; FLWOR expressions just seem so much easier to write. Yes, there are times when performance dictates not using XQuery, but for this example I'm sticking with my preference.

Below is a query that will perform our desired extraction and transformation. (XQuery syntax is a little beyond the scope of this post, but for a quick tutorial see here.) Like all screen scraping, it is a little brittle. The query finds the results table by the background color of its header row, so if Yahoo! decides to change the format or even the color of that row in the results page, the thing breaks. There are other queries that will do the trick and maybe be a little less brittle, but you should build some sort of regression testing and notification into these sorts of screen scrapers for precisely these reasons. When it breaks, you know about it and you can fix it quickly.

    <entities>{
        for $t in //*:table[*:tr[@bgcolor="dcdcdc"]]
        for $r at $position in $t/*:tr[count(child::*:td)>1]
        where $position != 1
        return
            <entity>
                <symbol>{data($r/*:td[1])}</symbol>
                <name>{data($r/*:td[2])}</name>
                <market>{data($r/*:td[3])}</market>
                <industry>{data($r/*:td[4])}</industry>
            </entity>
    }</entities>
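The regression check mentioned above doesn't need to be fancy. Here is a minimal sketch, assuming you run a known query (say, IBM) on a schedule and have the scraped rows as a list of objects with the four fields the query produces. The function name `checkScrape` and the notification step are my own placeholders, not part of the actual service.

```javascript
// Hypothetical regression check for the scraper's output (names are illustrative).
// `entities` is the list of rows scraped for a query whose answer we already know.
function checkScrape(entities, expectedSymbol) {
  var problems = [];
  if (!Array.isArray(entities) || entities.length === 0) {
    problems.push("no entities returned");
    return problems;
  }
  var fields = ["symbol", "name", "market", "industry"];
  var found = false;
  entities.forEach(function (entity, i) {
    // Every row should carry all four non-empty text fields.
    fields.forEach(function (field) {
      var value = entity[field];
      if (typeof value !== "string" || value.trim() === "") {
        problems.push("entity " + i + " is missing " + field);
      }
    });
    if (entity.symbol === expectedSymbol) {
      found = true;
    }
  });
  if (!found) {
    problems.push("expected symbol " + expectedSymbol + " not found");
  }
  return problems;
}

// Usage: an empty problem list means the page layout still matches the query.
var sample = [
  { symbol: "IBM", name: "INTL BUSINESS MACH", market: "NYSE",
    industry: "Diversified Computer Systems" }
];
var problems = checkScrape(sample, "IBM");
if (problems.length > 0) {
  // Swap in whatever notification you prefer (email, pager, log alert).
  console.error("Scraper regression: " + problems.join("; "));
}
```

If the header row changes and the query stops matching, the first check fires; if Yahoo! reshuffles the columns, the field checks or the expected-symbol check should catch it.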

To apply this to the Yahoo! ticker symbol lookup, we use a combination of the Nux XML processing toolkit and the TagSoup parser. TagSoup is the weapon of choice for converting gnarly HTML to well-formed XHTML, and Nux gives us the XQuery machinery we need. Here is the code snippet for scraping the ticker symbol information from the results page:

    private String applyScript(String uri) {
        GetMethod get = new GetMethod(uri);
        get.setFollowRedirects(true);
        try {
            // Fetch the raw HTML results page.
            httpClient.executeMethod(get);
            InputStream in = get.getResponseBodyAsStream();

            // TagSoup cleans up the HTML so XOM can build a document from it.
            XMLReader parser = new org.ccil.cowan.tagsoup.Parser();
            Document doc = new Builder(parser).build(in);

            // Run the XQuery; the first result node is the <entities> element.
            Nodes results = XQueryUtil.xquery(doc, xqueryScript);
            if (results.size() < 1) {
                return JSONIdeasConstants.ERROR_XML;
            }
            return results.get(0).toXML();
        } catch (IOException e) {
            log.error("Problem getting uri feed " + uri, e);
            return JSONIdeasConstants.ERROR_XML;
        } catch (ParsingException e) { // also covers its subclass ValidityException
            log.error("Problem parsing uri feed " + uri, e);
            return JSONIdeasConstants.ERROR_XML;
        } finally {
            // Always hand the connection back to HttpClient.
            get.releaseConnection();
        }
    }

Here xqueryScript is a String containing the above XQuery script text. The ERROR_XML is just a constant string, "<Error></Error>", that we send back to the caller in case of an error. (You may want to send back a more informative error message to the browser. We'll tackle that in a later post.) This code will end up transforming the following app output...

[Image: yahoolookup.PNG, screenshot of the Yahoo! lookup results page]

...into the following XML:

    <entities>
        <entity>
            <symbol>IBM</symbol>
            <name>INTL BUSINESS MACH</name>
            <market>NYSE</market>
            <industry>Diversified Computer Systems</industry>
        </entity>
        <entity>
            <symbol>HZK</symbol>
            <name>CORTS TR VI IBM DEB</name>
            <market>NYSE</market>
            <industry>N/A</industry>
        </entity>
        <entity>
            <symbol>HZD</symbol>
            <name>6.40% CORPORATE-BACK</name>
            <market>NYSE</market>
            <industry>N/A</industry>
        </entity>
        <entity>
            <symbol>KVM</symbol>
            <name>STR PD 7.0 CORTS IBM</name>
            <market>NYSE</market>
            <industry>N/A</industry>
        </entity>
        <entity>
            <symbol>GJI</symbol>
            <name>SYNTHETIC FXD INC</name>
            <market>NYSE</market>
            <industry>N/A</industry>
        </entity>
    </entities>

Submitting the query is the same old thing we've done before: grab the input field contents and pass them to our servlet by inserting a script tag:

    JSONPIdeas.lookupSymbol = function() {
        // Escape the input so spaces, "&" and the like survive inside the URL.
        var query = encodeURIComponent($("#lookup").val());
        JSONPIdeas.addScript("http://labs.pathf.com/JSONIdeas/TickerLookup?callback=JSONPIdeas.renderSymbols&query=" + query);
    };
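In case you missed the earlier posts, the script insertion itself is only a few lines. What follows is a sketch of what a helper like `JSONPIdeas.addScript` might look like, not the actual source; the optional `doc` parameter is my own addition so the function can be exercised outside a browser.

```javascript
// Hypothetical stand-in for JSONPIdeas.addScript (the real one is not shown here).
// Appends a <script> element to <head>; the browser then fetches `url` and runs
// the response, which for JSONP is a call to the named callback function.
function addScript(url, doc) {
  doc = doc || document; // `doc` is injectable purely to make testing easier
  var script = doc.createElement("script");
  script.type = "text/javascript";
  script.src = url;
  doc.getElementsByTagName("head")[0].appendChild(script);
  return script;
}
```

In a browser you would call it with just the URL. The servlet's response is then expected to be a snippet of JavaScript that calls the function named in the `callback` parameter with the converted data.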

Give the example below a try. You can see the JavaScript source code involved in this here. Note that if there is only one entity, the JSON lib that converts the XML into JSON returns a single object rather than an array, so we check for that.
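The single-entity check can be sketched roughly like this. I'm assuming here that the converted JSON looks like `{ entities: { entity: ... } }`; the exact shape depends on the XML-to-JSON library, and the helper names `asArray` and `symbolsFrom` are made up for illustration.

```javascript
// Wrap a lone value in an array, pass arrays through, treat null/undefined as empty.
function asArray(value) {
  if (value === undefined || value === null) {
    return [];
  }
  return Array.isArray(value) ? value : [value];
}

// Assumed JSON shape: { entities: { entity: {...} or [{...}, ...] } }.
function symbolsFrom(data) {
  var entities = data && data.entities ? data.entities.entity : undefined;
  return asArray(entities).map(function (entity) {
    return entity.symbol;
  });
}
```

With the result normalized up front, the rendering code can loop over the entities the same way whether the lookup matched one company or twenty.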

---

I hope I've inspired you to write some of your own JSONP web services. Next time I'll tackle some of the security issues that JSON and JSONP raise.

 

Comments: 1 so far

  1. I tried this for my own site. It seems to work but I am having a problem. When I use this script it doesn’t find certain things (seems to be two-word company names, e.g. General Motors). But if I go to the Yahoo Symbol Search page and search for the same thing it will find the stock. Any ideas why this would be different? Thanks for the help and thanks for this nice script.

    Comment by Jonathan, Thursday, May 24, 2007 @ 12:27 pm
