
Publication

Round-trip HTML Rendering and Analysis for Testing, Indexing, and Security

Thomas Breuel; Daniel Keysers
In: 7th IAPR Workshop on Document Analysis Systems (DAS). IAPR International Workshop on Document Analysis Systems (DAS), Nelson, DAS 2006, DAS-IAPR, 2/2006.

Abstract

The widespread adoption of HTML, DHTML, and web technologies has had many benefits, but a number of undesirable uses and problems have emerged as well. These problems include unreliable cross-platform rendering of web pages, attempts to create web pages that deceive either web users or search engines, and poor accessibility of some web pages for users with vision impairments or users with small-screen devices. Standard approaches to addressing these problems rely on syntactic and semantic analysis of the web page source; for example, to determine whether a page is likely to render correctly, a style checker may check for the absence of certain tags or constructs known to cause problems on some browsers. Source-based methods are fast, conceptually easy to implement, and can be built using standard parsing and text analysis tools, but they also have significant limitations. For example, the presence of style sheets, JavaScript, and other HTML and plug-in features makes it hard to make statements about the final, rendered form of a web page based on an analysis of its source text. Cross-platform browser problems can only be detected by such methods if the cause of the problem is known and understood, and if appropriate patterns have been formulated that can detect these problems in web page sources; such rules are likely to remain incomplete and their coverage spotty given the evolution of web standards. Similarly, detecting phishing or search engine spam is a co-evolutionary process between adversaries and tool creators: phishers and spammers will develop new attacks in response to each countermeasure. As part of the image-based personal computing project in our laboratory, we are developing round-trip rendering and analysis methods for addressing these problems. The foundation of our approach is the observation that the image presented to the end user is ultimately what determines the meaning of a piece of HTML (see also Breuel, 2004; Lopresti, 2005).
In this talk, we report on ongoing work in our laboratory on developing systems that address cross-platform browser and web page design testing, fight phishing and search engine spam, and improve accessibility.
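
The source-based style check mentioned in the abstract, and the limitation attributed to it, can be illustrated with a minimal sketch. This is not the authors' tooling: the rule set `PROBLEM_TAGS` and the names `TagChecker` and `check_source` are hypothetical, chosen only for illustration, and the sketch uses Python's standard `html.parser` module as one example of a "standard parsing tool".

```python
from html.parser import HTMLParser

# Hypothetical rule set: tags assumed, for illustration only, to render
# inconsistently across browsers. Not taken from the paper.
PROBLEM_TAGS = {"blink", "marquee", "layer"}


class TagChecker(HTMLParser):
    """Collects occurrences of tags listed in PROBLEM_TAGS."""

    def __init__(self):
        super().__init__()
        self.findings = []  # (tag, line_number) pairs

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag in PROBLEM_TAGS:
            line, _ = self.getpos()
            self.findings.append((tag, line))


def check_source(html_text):
    """Return the flagged (tag, line) pairs found in the page source."""
    checker = TagChecker()
    checker.feed(html_text)
    checker.close()
    return checker.findings


# A tag written directly in the source is flagged.
static_page = "<html><body>\n<marquee>Sale!</marquee>\n</body></html>"
print(check_source(static_page))

# The same tag emitted by a script is invisible to the source check,
# because script content is treated as opaque text by the parser.
scripted_page = (
    "<html><body>\n"
    "<script>document.write('<marquee>Sale!</marquee>');</script>\n"
    "</body></html>"
)
print(check_source(scripted_page))
```

The second call returns an empty list even though a browser would display the same element as in the first page. This is the kind of gap between source text and rendered result that motivates analyzing the rendered image instead.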