Parsing HTML with innerHTML
Yes, yes, innerHTML is "the debil." Direct DOM manipulation is much to be preferred. It's faster, smells less, and gives you shiny white teeth. But innerHTML, when combined with XHR, can allow you to scrape data from the existing HTML pages of a site.
    var div = document.createElement('div');
    // do the XHR thing...
    div.innerHTML = response.responseText; // contains the full HTML of a page
    // voila, div is now the root of an HTML DOM tree that can be traversed for screen scraping
Now remember, don't hook the div node into the page with appendChild. Yes, it is a hack that makes use of the browser's built-in parser, but it is a useful hack nonetheless. I have seen only a few mentions of this technique (see the references). Maybe it's because this seems like such a contrary thing to do. If you are in control of an application or site, you have control of the server and what runs on it, after all. You could just write a quick XML service that can be requested and parsed by an Ajax client, right?
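Here is a slightly fuller sketch of the same idea, for illustration only. The helper names (parseDetached, extractLinks, scrape) and the URL in the usage note are assumptions, not anything from a real library. The optional doc argument exists only so the parsing helper can be exercised outside a browser.

```javascript
// Parse a page's HTML into a detached DOM tree using the browser's own parser.
// `doc` is optional and defaults to the global document.
function parseDetached(html, doc) {
  var div = (doc || document).createElement('div');
  div.innerHTML = html; // the browser parses the markup here
  return div;           // deliberately never appended to the page
}

// Walk the parsed tree and collect the text and href of every link.
function extractLinks(root) {
  var anchors = root.getElementsByTagName('a');
  var links = [];
  for (var i = 0; i < anchors.length; i++) {
    links.push({
      text: anchors[i].textContent,
      href: anchors[i].getAttribute('href')
    });
  }
  return links;
}

// Fetch a page with XHR and hand the parsed tree to the callback.
function scrape(url, callback) {
  var xhr = new XMLHttpRequest();
  xhr.open('GET', url, true);
  xhr.onreadystatechange = function () {
    if (xhr.readyState !== 4) return; // not finished yet
    if (xhr.status === 200) {
      callback(null, parseDetached(xhr.responseText));
    } else {
      callback(new Error('Request failed with status ' + xhr.status));
    }
  };
  xhr.send(null);
}
```

Usage might look like scrape('/some/page.html', function (err, root) { if (!err) console.log(extractLinks(root)); }), where the path is hypothetical. Keep in mind that XHR is subject to the same-origin policy, so this works for pages on the site the script is running on unless the other server explicitly allows cross-origin requests.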
Well, that's not always possible when integrating widgets with third-party or other closed-source packages. A common solution has been to proxy and scrape an application with a combination of XQuery and TagSoup (to fix the ugly, broken HTML, dontcha know), but it is possible to do this purely in the browser. And browsers don't need TagSoup, as they are masters at wrangling broken HTML.
Another use of this technique is for injecting a new user interface on top of web sites and applications via bookmarklets or Greasemonkey. I've been tinkering with SuperCraig, a makeover of the Craig's List interface that makes use of this technique.
There are limits to this technique. For example, scripts in the parsed markup are not executed, and whether images are loaded while the div is not hooked into the document depends on the browser, so a site that depends heavily on JavaScript for its content may withhold its secrets from the screen scraper. That sort of site is likely to have HTTP requests that produce XML, however, so it will give up its secrets in other ways.
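When a site does expose an XML-producing request, a minimal sketch of reading it directly could look like the following. The function names (fetchXml, textOfAll) and any endpoint or tag name you pass in are assumptions for illustration; the real ones depend entirely on the site being scraped.

```javascript
// Fetch an XML resource and hand the parsed XML document to the callback.
// responseXML is only populated when the response is served as XML.
function fetchXml(url, callback) {
  var xhr = new XMLHttpRequest();
  xhr.open('GET', url, true);
  xhr.onreadystatechange = function () {
    if (xhr.readyState !== 4) return; // not finished yet
    if (xhr.status === 200 && xhr.responseXML) {
      callback(null, xhr.responseXML);
    } else {
      callback(new Error('No XML document (status ' + xhr.status + ')'));
    }
  };
  xhr.send(null);
}

// Collect the text content of every element with the given tag name.
function textOfAll(xmlDoc, tagName) {
  var nodes = xmlDoc.getElementsByTagName(tagName);
  var values = [];
  for (var i = 0; i < nodes.length; i++) {
    values.push(nodes[i].textContent);
  }
  return values;
}
```

Here there is no need for the innerHTML trick at all, since the browser has already parsed the response into a document that can be traversed with the usual DOM methods.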
Anyhow, that's the story on tinkering with bookmarklets and interface injection. I wanted to finish SuperCraig for today, but the grind of work sometimes gets in the way.
References
Topics: BJAX, Javascript
Comments: 2 so far

Another reason to use innerHTML is that it’s so much faster. I have a case where I use Ajax to paginate a table’s content (thousands of rows, so it calls back to the server). Building the table manually with the DOM is very unwieldy and slow; the easier and faster solution is just to return an updated TABLE element and toss that in with innerHTML.
A browser’s parsing engine is generally going to be much, much faster than its JavaScript engine.
The end-user doesn’t care about coding niceness, only about perceived speed of the site.
Building a TABLE on the server side is also much easier than trying to build one in the DOM.
Comment by Anonymous, Monday, September 10, 2007 @ 6:35 pm
A very good reason for using innerHTML is that it avoids duplicate code.
You already know how to render data to a table on the server side.
Implementing this a second time in JavaScript on the client side, and keeping both implementations in sync, blows up your codebase unnecessarily and is tedious and error-prone.
Comment by Jan, Wednesday, September 12, 2007 @ 5:08 am
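
The approach both commenters describe (render the table once on the server, then swap it into the page with innerHTML) can be sketched roughly as below. Neither comment says what the server is written in, so using JavaScript on both sides is purely an assumption, as are the function names (escapeHtml, renderTable, showPage) and the row format.

```javascript
// Escape a value so it cannot be interpreted as markup.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Server side (hypothetical): render one page of rows as a complete TABLE.
// `rows` is an array of arrays of cell values; `page` is zero-based.
function renderTable(rows, page, pageSize) {
  var start = page * pageSize;
  var slice = rows.slice(start, start + pageSize);
  var html = '<table><tbody>';
  for (var i = 0; i < slice.length; i++) {
    html += '<tr>';
    for (var j = 0; j < slice[i].length; j++) {
      html += '<td>' + escapeHtml(slice[i][j]) + '</td>';
    }
    html += '</tr>';
  }
  return html + '</tbody></table>';
}

// Client side: replace the contents of a wrapper element with the new table.
function showPage(container, tableHtml) {
  container.innerHTML = tableHtml;
}
```

The client sets innerHTML on a wrapper around the table, not on the TABLE itself, because older versions of Internet Explorer treated innerHTML as read-only on table elements. The rendering logic lives in one place, which is the point of the second comment.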