WEB SCRAPER TESTING GROUND

INVALID HTML

Not every web publisher pays much attention to the validity of their HTML code. Although most browsers are able to digest broken markup, some mistakes in a web page may cause scraping errors that prevent you from getting relevant results.

To test web scrapers against invalid markup, we suggest scraping this page, which contains the following markup mistakes:

  1. Unescaped characters (& and > instead of &amp; and &gt;)
  2. Non-HTML tags (<nonHTML>)
  3. Unclosed tags (<span<span/>)
  4. Unmatched quotes (<a href="scrapetools.com'>)
  5. Missed spaces (<a id="test"href="scrapetools.com">)
  6. Invalid tag nesting (<div><span></div></span>)
  7. The charset specified in the META tag or HTTP header does not match the real document encoding
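
As a rough illustration of why such mistakes matter, the sketch below feeds three fragments to a strict XML parser and to Python's lenient html.parser. The fragments are hypothetical strings modelled on mistakes 1, 2 and 6, not the actual source of this page, and the helper names (TextCollector, tolerant_text, strict_parse_ok) are our own. The strict parser merely stands in for a scraper that gives up on malformed input.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TextCollector(HTMLParser):
    """Gathers text nodes; html.parser does not raise on malformed markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def tolerant_text(markup):
    """Return the text content as seen by the lenient parser."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks).strip()


def strict_parse_ok(markup):
    """Return True only if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Hypothetical fragments modelled on mistakes 1, 2 and 6 from the list.
SAMPLES = {
    "unescaped characters": "<p>2>1 & 1<2</p>",
    "non-HTML tag": "<nonHTML>nonHTML</nonHTML>",
    "invalid nesting": "<div><span>bad nesting</div></span>",
}

if __name__ == "__main__":
    for name, markup in SAMPLES.items():
        print(name, strict_parse_ok(markup), repr(tolerant_text(markup)))
```

Unmatched quotes, missed spaces and unclosed tags are left out on purpose: lenient parsers tend to disagree on how to recover from them, so the result is best checked against the specific scraper under test.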

In other words, after scraping the invalid HTML presented below, the scraper should output the following values:

Here is the invalid HTML itself:

2>1 & 1<2
nonHTML
unclosed
millepah.com
bad nesting
(windows-1251) wrong meta
проверка (utf-8) wrong header
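
The last two samples concern a declared charset that disagrees with the real encoding. One possible way to cope with this, sketched below, is to try strict UTF-8 first (invalid byte sequences make it fail loudly), then the declared charset, then a short fallback list. The function name decode_body and the windows-1251 fallback are assumptions chosen to match the two cases on this page; this is a heuristic, not a general encoding detector.

```python
def decode_body(body, declared, fallbacks=("windows-1251",)):
    """Decode bytes whose declared charset may be wrong.

    Strict UTF-8 is tried first because it rejects most non-UTF-8 input,
    whereas single-byte charsets accept almost any byte sequence.
    """
    for encoding in ("utf-8", declared, *fallbacks):
        try:
            return body.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: move on to the next candidate.
            continue
    # Last resort: keep what can be read and mark the rest.
    return body.decode("utf-8", errors="replace")


if __name__ == "__main__":
    word = "проверка"

    # Real encoding is UTF-8, but the declaration says windows-1251.
    utf8_bytes = word.encode("utf-8")
    print(utf8_bytes.decode("windows-1251"))         # trusting the label garbles the text
    print(decode_body(utf8_bytes, "windows-1251"))

    # Real encoding is windows-1251, but the declaration says UTF-8.
    cp1251_bytes = word.encode("windows-1251")
    print(decode_body(cp1251_bytes, "utf-8"))
```

The order of candidates is the design choice here: putting the declared charset first would silently accept UTF-8 bytes as windows-1251 and return garbled text without raising an error.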