How to strip tags, scripts and styles from HTML

Haven’t you sometimes wished that all the HTML tags, scripts and styles would just vanish from a page, leaving you with nothing but pure text (for fun and profit)? Well, here’s how I manage it :)

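require "open-uri"
require "hpricot"
require "sanitize"

# Fetch the page and parse it with Hpricot
html = open("http://www.google.com")
hp = Hpricot(html.read)

# Remove <script> and <style> elements along with their contents
hp.search("script").remove
hp.search("style").remove

# Strip the remaining tags; an empty okTags should leave no tags allowed
sanitize(hp.innerHTML, okTags="")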

And the output?

“GoogleWeb Images News Orkut Groups Gmail more ▼ Books Scholar Blogs YouTube Calendar Photos Documents Reader even more » iGoogle | Sign inIndia   Advanced Search  Preferences  Language ToolsSearch: the web pages from India Google.co.in offered in: Hindi Bengali Telugu Marathi Tamil Gujarati Kannada Malayalam Punjabi Advertising Programs – About Google – Go to Google.com©2008 – Privacy”

Now you can put this text to any imaginable use – as I mentioned earlier, maybe fun & profit :)

Libraries – hpricot, sanitize, open-uri
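
If those gems are not at hand, the same two steps (drop script and style blocks, then drop the remaining tags) can be roughly approximated with plain Ruby. This is only a sketch: strip_html is a hypothetical helper, not part of hpricot or sanitize, and it swaps the real HTML parser for crude regular expressions, so it assumes reasonably well-formed markup and does not decode entities.

```ruby
# Hypothetical helper (not from hpricot or sanitize): a crude,
# regex-based approximation of the two steps used above.
def strip_html(html)
  text = html.dup
  # Step 1: drop <script> and <style> elements together with their contents.
  text.gsub!(%r{<(script|style)\b[^>]*>.*?</\1\s*>}mi, " ")
  # Step 2: drop every remaining tag, keeping the text between the tags.
  text.gsub!(/<[^>]+>/, " ")
  # Collapse the whitespace left behind by the removed markup.
  text.gsub(/\s+/, " ").strip
end

# A small made-up page, so the example runs without network access.
sample = <<~HTML
  <html>
    <head>
      <title>Demo</title>
      <style>body { color: red; }</style>
      <script type="text/javascript">alert("hi");</script>
    </head>
    <body><h1>Hello</h1><p>plain <b>text</b> only</p></body>
  </html>
HTML

puts strip_html(sample)
# prints: Demo Hello plain text only
```

Each removed tag is replaced with a space here, so words from neighbouring elements do not run together the way “GoogleWeb” does in the output above.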

Have fun!


  • makuchaku

    hp.inner_text after removing the script and style tags