Screen scraping

From Wikipedia, the free encyclopedia.

Screen scraping is the act of extracting data from a system or program by capturing and interpreting the contents of a display that was not intended for data transport or inspection by programs. Around 1980 the term referred to practices such as reading the display memory of a smart terminal through its auxiliary port, though other methods exist.

More recently, screen scraping often refers to parsing the HTML of generated web pages with programs designed to find particular patterns or pieces of content. In either guise, screen scraping is very much an ad hoc technique that depends entirely on the collected data keeping a consistent format.
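As an illustration of this kind of HTML scraping, the following is a minimal sketch in Python using only the standard library's html.parser module. The page fragment, the `price` class name, and the names `PriceScraper` and `scrape_prices` are hypothetical, chosen only for the example.

```python
from html.parser import HTMLParser


class PriceScraper(HTMLParser):
    """Collect the text of every <span class="price"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "span" and ("class", "price") in attrs:
            self._in_price = True
            self.prices.append("")

    def handle_data(self, data):
        # Accumulate text only while inside a matching <span>.
        if self._in_price:
            self.prices[-1] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False


def scrape_prices(html):
    """Return the stripped text of each price span found in the page."""
    parser = PriceScraper()
    parser.feed(html)
    parser.close()
    return [price.strip() for price in parser.prices]


# Invented sample page standing in for a generated web page.
SAMPLE_PAGE = """
<html><body>
  <div class="item"><b>Widget</b> <span class="price">$4.99</span></div>
  <div class="item"><b>Gadget</b> <span class="price">$12.50</span></div>
</body></html>
"""

print(scrape_prices(SAMPLE_PAGE))
```

The dependence on a consistent format is visible here: if the site switched from a span to some other element, or renamed the class, the scraper would silently return an empty list.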

Despite the inelegance of consuming data from a web page by screen scraping, the emergence of web services has prompted the creation of technologies that turn web page screen scraping into a science (though still a very imperfect one). Microsoft, for example, has built into its implementation of web services the ability to create a web service that extracts its data from a web page, with the help of an extension to the WSDL standard and the use of regular expressions. For more information on this technique, see the MSDN document Creating XML Web Services That Parse the Contents of a Web Page.

Regular expressions are a traditional and very powerful tool for screen scraping, which requires intensive text parsing. Programming languages with strong support for regular expressions (e.g. Perl) are a popular choice for writing screen scraping programs.
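A regular-expression scraper can be sketched as follows, here in Python rather than Perl. The table layout, the pattern, and the name `scrape_temperatures` are assumptions made up for the example; a real page would need a pattern written against its actual markup.

```python
import re

# Matches one table row of the assumed form:
#   <tr><td>City</td><td>-3&deg;C</td></tr>
# allowing whitespace between the tags.
ROW_PATTERN = re.compile(
    r"<tr>\s*<td>(?P<name>[^<]+)</td>\s*<td>(?P<temp>-?\d+)&deg;C</td>\s*</tr>",
    re.IGNORECASE,
)


def scrape_temperatures(html):
    """Return a dict mapping each city name to its temperature as an int."""
    return {
        match.group("name").strip(): int(match.group("temp"))
        for match in ROW_PATTERN.finditer(html)
    }


# Invented sample markup standing in for a generated web page.
SAMPLE_TABLE = """
<table>
<tr><td>Oslo</td><td>-3&deg;C</td></tr>
<tr>
  <td>Cairo</td> <td>24&deg;C</td>
</tr>
</table>
"""

print(scrape_temperatures(SAMPLE_TABLE))
```

The pattern tolerates whitespace changes between tags but nothing more: an added attribute on a td element would be enough to make the row stop matching.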

Also in recent years, PHP has gained features well suited to building screen scraping applications. PHP 5 added many XML and DOM features, including functions that parse badly formed HTML documents into DOM trees so that they can be worked on as if they were well-formed XML. Having the web page in a tree form makes it easier for programs to locate the content they need. Java programs can take a similar approach by querying the parsed document through implementations of the W3C's XQuery specification.
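The idea of turning badly formed HTML into a tree and then querying it like XML can be sketched in Python with the standard library alone. This is not PHP's DOM extension but a much simplified stand-in: the class `TolerantTreeBuilder`, the function `parse_html`, and the sample markup are hypothetical, and the sketch does not implement HTML's implicit-close rules (for example, a new li element closing the previous one).

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have content or a closing tag (partial list).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TolerantTreeBuilder(HTMLParser):
    """Build an ElementTree from HTML, tolerating unclosed and stray tags."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("document")  # synthetic root element
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        attributes = {name: value or "" for name, value in attrs}
        element = ET.SubElement(self._stack[-1], tag, attributes)
        if tag not in VOID_TAGS:
            self._stack.append(element)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, which also closes
        # anything left open inside it. Stray end tags are ignored.
        for depth in range(len(self._stack) - 1, 0, -1):
            if self._stack[depth].tag == tag:
                del self._stack[depth:]
                break

    def handle_data(self, data):
        current = self._stack[-1]
        if len(current):
            # Text after a child element belongs to that child's tail.
            last = current[-1]
            last.tail = (last.tail or "") + data
        else:
            current.text = (current.text or "") + data


def parse_html(html):
    """Parse possibly malformed HTML and return the synthetic root element."""
    builder = TolerantTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


# Unquoted attribute, unclosed <b>, stray </span>, unclosed second <p>.
BROKEN_PAGE = "<div id=main><p>First <b>bold</p></span><p>Second <br>line</div>"

root = parse_html(BROKEN_PAGE)
for paragraph in root.findall(".//p"):
    print("".join(paragraph.itertext()))
```

Once the page is a tree, content is located with path queries such as `.//p` or `.//div[@id='main']` instead of with text patterns, so the scraper is less sensitive to whitespace and attribute order.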

With the prevalence of screen scraping, many website owners have begun developing anti-screen-scraping techniques, including blocking and banning individual IP addresses and ranges of IP addresses, which stops the majority of "cookie cutter" screen scraping applications.
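Blocking by individual address or by range can be sketched with Python's ipaddress module. The blocklist entries below are reserved documentation addresses, and the names `BLOCKED_NETWORKS` and `is_blocked` are hypothetical; in practice such a check would typically sit in the web server or firewall configuration rather than in application code.

```python
import ipaddress

# Hypothetical blocklist: one single address (/32) and one range (/24).
BLOCKED_NETWORKS = [
    ipaddress.ip_network("203.0.113.7/32"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(client_ip, blocked=BLOCKED_NETWORKS):
    """Return True if client_ip falls inside any blocked address or range."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in blocked)


print(is_blocked("198.51.100.42"))
print(is_blocked("192.0.2.1"))
```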