101 Ideas for JSONP - Idea #3: Scraping HTML With TagSoup and XQuery
Now that we have two ways (here and here) to pull XML data into our Ajax apps via JSONP, it's time to talk about how we can get our hands on more interesting data. Our two techniques allow us to proxy services that already produce valid XML, but not everything is available in this form. Lots of groovy data is still locked away in traditional web applications.
You can see an example of this problem even over at the leader in JSONP web services, Yahoo!. While Yahoo! provides an excellent delayed stock quote service, one thing that is lacking is a corresponding ticker lookup service. They do however have a ticker lookup web app: http://finance.yahoo.com/lookup?t=S&m=US&s=IBM
Transforming this app into a service involves the usual calisthenics -- transform the crappy HTML into well-formed XHTML and then extract and transform our desired data into valid XML. Now we could use XSLT to crack this nut, but every time I break out the XSLT I get flashbacks to my Recursive Function Theory course and think that I'm doing a proof. Truth be told, I've never been a big fan of functional programming; FLWOR expressions just seem so much easier to write. Yes, there are times when performance dictates not using XQuery, but for this example I'm sticking with my preference.
Below is a query that will perform our desired extraction and transformation. (XQuery syntax is a little beyond the scope of this post, but for a quick tutorial see here.) Like all screen scraping, it is a little brittle. If Yahoo! decides to change the format or even the color of the header row in this return page, the thing breaks. There are other queries that will do the trick and maybe be a little less brittle, but you should build some sort of regression testing and notification into these sorts of screen scrapers for precisely these reasons. When it breaks, you know about it and you can fix it quickly.
<entities>{
    for $t in //*:table[*:tr[@bgcolor="dcdcdc"]]
    for $r at $position in $t/*:tr[count(child::*:td)>1]
    where $position != 1
    return
        <entity>
            <symbol>{data($r/*:td[1])}</symbol>
            <name>{data($r/*:td[2])}</name>
            <market>{data($r/*:td[3])}</market>
            <industry>{data($r/*:td[4])}</industry>
        </entity>
}
</entities>
To apply this to the Yahoo! ticker symbol lookup, we use a combination of the Nux XML processing toolkit and the TagSoup parser. TagSoup is the weapon of choice for converting gnarly HTML to well-formed XHTML, and Nux gives us the XQuery machinery we need. Here is the code snippet for scraping the ticker symbol information from the results page:
private String applyScript(String uri) {
    GetMethod get = new GetMethod(uri);
    get.setFollowRedirects(true);
    try {
        // Fetch the results page
        httpClient.executeMethod(get);
        InputStream in = get.getResponseBodyAsStream();

        // TagSoup parser: turns the messy HTML into well-formed XHTML
        XMLReader parser = new org.ccil.cowan.tagsoup.Parser();
        Document doc = new Builder(parser).build(in);

        // Run the XQuery script against the cleaned-up document
        Nodes results = XQueryUtil.xquery(doc, xqueryScript);
        if (results.size() < 1) {
            return JSONIdeasConstants.ERROR_XML;
        }
        return results.get(0).toXML();
    } catch (IOException e) {
        log.error("Problem getting uri feed " + uri, e);
        return JSONIdeasConstants.ERROR_XML;
    } catch (ParsingException e) {
        // ParsingException also covers its subclass ValidityException
        log.error("Problem getting uri feed " + uri, e);
        return JSONIdeasConstants.ERROR_XML;
    } finally {
        // Always hand the connection back to HttpClient
        get.releaseConnection();
    }
}
Here xqueryScript is a String containing the above XQuery script text. The ERROR_XML is just a constant string, "<Error></Error>", that we send back to the caller in case of an error. (You may want to send back a more informative error message to the browser. We'll tackle that in a later post.) This code will end up transforming the following app output...
...into the following XML:
<entities>
    <entity>
        <symbol>IBM</symbol>
        <name>INTL BUSINESS MACH</name>
        <market>NYSE</market>
        <industry>Diversified Computer Systems</industry>
    </entity>
    <entity>
        <symbol>HZK</symbol>
        <name>CORTS TR VI IBM DEB</name>
        <market>NYSE</market>
        <industry>N/A</industry>
    </entity>
    <entity>
        <symbol>HZD</symbol>
        <name>6.40% CORPORATE-BACK</name>
        <market>NYSE</market>
        <industry>N/A</industry>
    </entity>
    <entity>
        <symbol>KVM</symbol>
        <name>STR PD 7.0 CORTS IBM</name>
        <market>NYSE</market>
        <industry>N/A</industry>
    </entity>
    <entity>
        <symbol>GJI</symbol>
        <name>SYNTHETIC FXD INC</name>
        <market>NYSE</market>
        <industry>N/A</industry>
    </entity>
</entities>
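The regression testing recommended earlier can start out very small. Here is a minimal sketch, not the checker we actually run: it assumes the scraped entities have already been turned into plain JavaScript objects whose fields mirror the XML above, and the names checkScrape and REQUIRED_FIELDS are made up for illustration.

```javascript
// Field names mirror the <entity> children in the XML above; the object
// shape itself is an assumption about the converted JSON.
var REQUIRED_FIELDS = ["symbol", "name", "market", "industry"];

// Hypothetical helper: returns a list of human-readable problems.
// An empty list means the scraper output still looks healthy.
function checkScrape(entities, expectedSymbol) {
  if (!Array.isArray(entities) || entities.length === 0) {
    return ["no entities returned"];
  }
  var problems = [];
  var found = false;
  entities.forEach(function (entity, i) {
    // Every row should still carry all four columns as non-empty text.
    REQUIRED_FIELDS.forEach(function (field) {
      var value = entity[field];
      if (typeof value !== "string" || value.trim() === "") {
        problems.push("entity " + i + " is missing " + field);
      }
    });
    if (entity.symbol === expectedSymbol) {
      found = true;
    }
  });
  if (!found) {
    problems.push("expected symbol " + expectedSymbol + " not found");
  }
  return problems;
}
```

Run something like this on a schedule against a query whose answer you know (say, IBM), and send yourself a notification whenever the returned list is not empty.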
Submitting the query is the same old thing we've done before: grab the input field contents and pass them to our servlet via an insert script operation:
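JSONPIdeas.lookupSymbol = function() {
    var query = $("#lookup").val();
    // Encode the user input so spaces, "&" and similar characters survive in the URL
    JSONPIdeas.addScript(
        "http://labs.pathf.com/JSONIdeas/TickerLookup?callback=JSONPIdeas.renderSymbols&query="
        + encodeURIComponent(query));
};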
Give the example below a try. You can see the JavaScript source code involved here. Note that if the lookup returns only one entity, the JSON library that converts the XML into JSON doesn't return an array, so we check for that.
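The single-entity check can be isolated in a small helper. This is only a sketch under an assumption: that the converted response looks like {entities: {entity: ...}}, with entity being either one object or an array of them. The real shape depends on the XML-to-JSON library, and the name toEntityArray is made up for illustration.

```javascript
// Hypothetical helper: the response shape {entities: {entity: ...}} is an
// assumption about what the XML-to-JSON conversion produces.
function toEntityArray(response) {
  var entities = response && response.entities;
  if (!entities || entities.entity === undefined || entities.entity === null) {
    return []; // no matches at all
  }
  var entity = entities.entity;
  // A lone <entity> arrives as a bare object, several arrive as an array.
  return Array.isArray(entity) ? entity : [entity];
}
```

A callback such as JSONPIdeas.renderSymbols could then loop over toEntityArray(data) without caring how many matches came back.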
I hope I've inspired you to write some of your own JSONP web services. Next time I'll tackle some of the security issues that JSON and JSONP raise.
Comments: 1 so far
I tried this for my own site. It seems to work, but I am having a problem. When I use this script it doesn't find certain things (it seems to be two-word company names, e.g. General Motors). But if I go to the Yahoo Symbol Search page and search for the same thing, it will find the stock. Any ideas why this would be different? Thanks for the help and thanks for this nice script.
Comment by Jonathan, Thursday, May 24, 2007 @ 12:27 pm