Skip to content

Instantly share code, notes, and snippets.

@robertknight
Last active August 29, 2015 14:22
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save robertknight/7d87a504df1fd7be672b to your computer and use it in GitHub Desktop.
Save robertknight/7d87a504df1fd7be672b to your computer and use it in GitHub Desktop.
Design of Automatically Adaptable Web Wrappers - TeX source, LaTeXML output
<?xml version="1.0" encoding="UTF-8"?>
<?latexml searchpaths="/home/robert/projects/seed/content-mine/latex-xml"?>
<?latexml class="article" options="a4paper,twoside"?>
<?latexml package="amssymb"?>
<?latexml package="amstext"?>
<?latexml package="amsmath"?>
<?latexml package="multicol"?>
<?latexml package="pslatex"?>
<?latexml package="apalike"?>
<?latexml package="fancyhdr"?>
<?latexml package="scitepress"?>
<?latexml package="caption" options="small"?>
<?latexml package="graphicx"?>
<?latexml package="algorithm"?>
<?latexml package="algorithmic" options="noend"?>
<?latexml package="qtree"?>
<?latexml package="color"?>
<?latexml package="inputenc" options="utf8"?>
<?latexml package="xcolor" options="table"?>
<?latexml RelaxNGSchema="LaTeXML"?>
<document xmlns="http://dlmf.nist.gov/LaTeXML" class="ltx_authors_1line">
<resource src="LaTeXML.css" type="text/css"/>
<resource src="ltx-article.css" type="text/css"/>
<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%**** cr-icaart2011.tex Line 25 ****-->
<section refnum="1" xml:id="S1">
<title><tag close=" ">1</tag>INTRODUCTION</title>
<para xml:id="S1.p1">
<p>The World Wide Web today contains an exterminated amount of information, mostly unstructured, under the form of Web pages, but also documents of various nature.
During last years big efforts have been conducted to develop techniques of information extraction on top of the Web.
Approaches adopted spread in several fields of Mathematics and Computer Science, including, for example, logic-programming and machine learning.
Several projects, initially developed in academic settings, evolved in commercial products, and it is possible to identify different methodologies to face the problem of Web data extraction.
A widely adopted approach is to define <emph>Web wrappers</emph>, procedures relying on analyzing the structure of HTML Web pages (i.e. DOM tree) to extract required information.
Wrappers can be defined in several ways, e.g. most advanced tools let users to design them in a visual way, for example selecting elements of interest in a Web page and defining rules for their extraction and validation, semi-automatically; regardless of their generation process, wrappers intrinsically refer to the HTML structure of the Web page at the time of their creation.
Thus, introducing not negligible problems of robustness, wrappers could fail in their tasks of data extraction if the underlying structure of the Web page changes, also slightly.
Moreover, it could happens that the process of extraction does not fail but extracted data are corrupted.</p>
</para>
<para xml:id="S1.p2">
<p>All these aspects clarify the following scenarios: during their definition, wrappers should be as much elastic as possible, in order to intrinsically handle minor modifications on the structure of Web pages (this kind of small local changes are much more frequent than heavy modifications); although elastic wrappers could efficiently react to minor changes, maintenance is required for the whole wrapper life-cycle.
Wrapper maintenance is expensive because it requires highly qualified personnel, specialized in defining wrappers, to spend their time in rebuilding or fixing wrappers whenever they stop working properly.
For improving this aspect, several commercial tools include notification features, reporting warnings or errors during wrappers execution.
Moreover, to increase their reliability, data extracted by wrappers could be subject to validation processes, and also data cleaning is a fundamental step; some tools provide caching services to store the last working copy of Web pages involved in data extraction processes.
Sometimes, it is even more convenient to rewrite <emph>ex novo</emph> a wrapper, instead of trying to find causes of malfunctioning and fixing them, because debugging wrapper executions can be not trivial.
The unpredictability of what changes will occur in a specific Web page and, consequently, the impossibility to establish when a wrappers will stop working properly, requires a smart approach to wrapper maintenance.</p>
</para>
<para xml:id="S1.p3">
<p>Our purpose in this paper is to describe the realization and to investigate performances of an automatic process of wrapper adaptation to structural modifications of Web pages.
We designed and implemented a system relying on the possibility of storing, during the wrapper definition step, a <emph>snapshot</emph> of the DOM tree of the original Web page, namely a <emph>tree-gram</emph>.
If, during the wrapper execution, problems occur, this sample is compared to the new DOM structure, finding similarities on trees and sub-trees, to automatically try adaptating the wrapper with a custom degree of accuracy.
Briefly, the paper is structured as follows: Section 2 focuses on related work, covering the literature about wrapper generation and adaptation.
In Section 3 we explain some concepts related to the tree similarity algorithm implemented, to prove the correctness of our approach. Section 4 shows details about our implementation of the automatic wrapper adaptation. Most important results, obtained by our experimentation, are reported in Section 5. Finally, Section 6 concludes providing some remarks for future work.</p>
</para>
<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%**** cr-icaart2011.tex Line 50 ****-->
</section>
<section refnum="2" xml:id="S2">
<title><tag close=" ">2</tag>RELATED WORK</title>
<para xml:id="S2.p1">
<p>The concept of analyzing similarities between trees, widely adopted in this work, was introduced by Tai <cite>[<bibref bibrefs="Tai1979" separator="," show="Refnum" yyseparator=","/>]</cite>; he defined the notion of <emph>distance</emph> between two trees as the measure of the dissimilarity between them.
The problem of transforming trees into other similar trees, namely <emph>tree edit distance</emph>, can be solved applying elementary transformations to nodes, step-by-step.
The minimum cost for this operation represents the tree edit distance between the two trees.
This technique shows high computational requirements and complex implementations <cite>[<bibref bibrefs="Bille2005" separator="," show="Refnum" yyseparator=","/>]</cite>, and do not represents the optimal solution to our problem of finding similarities between two trees.
The <emph>simple tree matching</emph> technique <cite>[<bibref bibrefs="StanleyM.Selkow1977" separator="," show="Refnum" yyseparator=","/>]</cite> represents a turning point: it is a light-weight recursive top-down algorithm which evaluates position of nodes to measure the degree of isomorphism between two trees, analyzing and comparing their sub-trees.
Several improvements to this technique have been suggested: Ferrara and Baumgartner <cite>[<bibref bibrefs="Ferrara2010" separator="," show="Refnum" yyseparator=","/>]</cite>, extending the concept of weights introduced by Yang <cite>[<bibref bibrefs="Yang1991" separator="," show="Refnum" yyseparator=","/>]</cite>, developed a variant of this algorithm with the capability of discovering clusters of similar sub-trees.
An interesting evaluation of the simple tree matching and its weighed version, brought by Kim et al. <cite>[<bibref bibrefs="Kim2007" separator="," show="Refnum" yyseparator=","/>]</cite>, was performed exploiting these two algorithms for extracting information from HTML Web pages; we found their achievements very useful to develop automatically adaptable wrappers.</p>
</para>
<para xml:id="S2.p2">
<p>Web data extraction and adaptation rely especially on algorithms working with DOM trees.
Related work, in particular regarding Web wrappers and their maintenance, is intricate: Laender et al. <cite>[<bibref bibrefs="Laender2002" separator="," show="Refnum" yyseparator=","/>]</cite> presented a taxonomy of wrapper generation methodologies, while Ferrara et al. <cite>[<bibref bibrefs="Baumgartner2010" separator="," show="Refnum" yyseparator=","/>]</cite> discussed a comprehensive survey about techniques and fields of application of Web data extraction and adaptation.
Some novel wrapper adaptation techniques have been introduced during last years: a valid hybrid approach, mixing logic-based and grammar rules, has been presented by Chidlovskii <cite>[<bibref bibrefs="Chidlovskii2001" separator="," show="Refnum" yyseparator=","/>]</cite>.
Also machine-learning techniques have been investigated, e.g. Lerman et al. <cite>[<bibref bibrefs="Lerman2003" separator="," show="Refnum" yyseparator=","/>]</cite> exploited their know-how in this field to develop a system for wrapper verification and re-induction.
Meng et al. <cite>[<bibref bibrefs="20" separator="," show="Refnum" yyseparator=","/>]</cite> developed the SG-WRAM (Schema-Guided WRApper Maintenance), for wrapper maintenance, starting from the observation that, changes in Web pages, even substantial, always preserve syntactic features (i.e. syntactic characteristics of data items like data patterns, string lengths, etc.), hyperlinks and annotations (e.g. descriptive information representing the semantic meaning of a piece of information in its context).
This system has been implemented in their Web data extraction platform: wrappers are defined providing both HTML Web pages and their XML schemes, describing a mappings between them.
When the system executes the wrapper, data are extracted under the XML format reflecting the previously specified XML Schema; the wrapper maintainer verifies any issue and, eventually, provides protocols for the automatic adaptation of the problematic wrapper.
The XML Schema is a DTD (Document Type Definition) while the HTML Web page is represented by its DOM tree.
The framework described by Wong and Lam <cite>[<bibref bibrefs="Wong" separator="," show="Refnum" yyseparator=","/>]</cite> performs the adaptation of wrappers previously learned, applying them to Web pages never seen before; they assert that this platform is also able to discover, eventually, new attributes in the Web page, using a probabilistic approach, exploiting the extraction knowledge acquired through previous wrapping tasks.
Also Raposo et al. <cite>[<bibref bibrefs="Raposo2005" separator="," show="Refnum" yyseparator=","/>]</cite> suggested the possibility of exploiting previously acquired information, e.g. results of queries, to ensure a reliable degree of accuracy during the wrapper adaptation process.
Concluding, Kowalkiewicz et al. <cite>[<bibref bibrefs="Kowalkiewicz2006" separator="," show="Refnum" yyseparator=","/>]</cite> investigate the possibility of increasing the robustness of wrappers based on the identification of HTML elements, inside Web pages, through their XPath, adopting relative XPath, instead of absolute ones.</p>
</para>
<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
</section>
<section refnum="3" xml:id="S3">
<title><tag close=" ">3</tag>MATCHING UP HTML TREES</title>
<!-- %**** cr-icaart2011.tex Line 75 **** -->
<para xml:id="S3.p1">
<p>Our idea of automatic adaptation of wrappers can be explained as follows: first of all, outlining how to extract information from Web pages (i.e. in our case, how a Web wrapper works); then, describing how it is possible to recover information previously extracted from a different Web page (i.e. how to compare structural information between the two versions of the Web page, finding similarities); finally, defining how to automatize this process (i.e. how to build reliable, robust automatically adaptable wrappers).</p>
</para>
<para xml:id="S3.p2">
<p>Our solution has been implemented in a commercial product <note mark="1" role="footnote">Lixto Suite, www.lixto.com</note>; Baumgartner et al. <cite>[<bibref bibrefs="baumgartner2009scalable" separator="," show="Refnum" yyseparator=","/>]</cite> described details about its design.
This platform provides tools to design Web wrappers in a visual way, selecting elements to be extracted from Web pages.
During the wrapper execution, selected elements, identified through their XPath(s) in the DOM tree of the Web page, are automatically extracted.
Although the wrapper design process lets users to define several restricting or generalizing conditions to build wrappers as much elastic as possible, wrappers are strictly interconnected with the structure of the Web page on top of they are built.
Usually, also slight modifications to this structure could alter the wrapper execution or corrupt extracted data.</p>
</para>
<para xml:id="S3.p3">
<p>In this section we discuss some theoretical foundations on which our solution relies; in details, we show an efficient algorithm to find similar elements within different Web pages.</p>
</para>
<!-- %%%%%%%%%%%%%%%%%%%%%%% -->
<subsection refnum="3.1" xml:id="S3.SS1">
<title><tag close=" ">3.1</tag>Methodology</title>
<para xml:id="S3.SS1.p1">
<p>A simple measure of similarity between two trees, once defined their comparable elements, can be established applying the <emph>simple tree matching</emph> algorithm <cite>[<bibref bibrefs="StanleyM.Selkow1977" separator="," show="Refnum" yyseparator=","/>]</cite>, introduced in Section 2.
We define <emph>comparable elements</emph> among HTML Web pages, nodes, representing HTML elements (or, otherwise, free text) identified by tags, belonging to the DOM tree of these pages.
Similarly, we intend for <emph>comparable attributes</emph> all the attributes, both generic (e.g. <emph>class</emph>, <emph>id</emph>, etc.) and type-specific (e.g. <emph>href</emph> for anchor texts, <emph>src</emph> for images, etc.), shown by HTML elements; it is possible to exploit these properties to introduce additional comparisons to refine the similarity measure.
Several implementations of the <emph>simple tree matching</emph> have been proposed; our solution exploits an improved version, namely <emph>clustered tree matching</emph> <cite>[<bibref bibrefs="Baumgartner2010" separator="," show="Refnum" yyseparator=","/>]</cite>, designed to match up HTML trees, identifying clusters of sub-trees with similar structures, satisfying a custom degree of accuracy.</p>
</para>
<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
</subsection>
<subsection refnum="3.2" xml:id="S3.SS2" labels="LABEL:alg1">
<title><tag close=" ">3.2</tag>Tree Matching Algorithms</title>
<para xml:id="S3.SS2.p1">
<p>Previous studies proved the effectiveness of the <emph>simple tree matching</emph> algorithm applied to Web data extraction tasks <cite>[<bibref bibrefs="Kim2007,19" separator="," show="Refnum" yyseparator=","/>]</cite>; it measures the similarity degree between two HTML trees, producing the maximum matching through dynamic programming, ensuring an acceptable compromise between precision and recall.</p>
</para>
<para xml:id="S3.SS2.p2">
<p>As improvement to this algorithm, this is a possible implementation of <emph>clustered tree matching</emph>: let <emph>d(n)</emph> to be the degree of a node <emph>n</emph> (i.e. the number of first-level children); let T(i) to be the i-<emph>th</emph> sub-tree of the tree rooted at node T; let <Math mode="inline" xml:id="S3.SS2.p2.m1" tex="t(n)" text="t * n"><XMath><XMApp><XMTok meaning="times" role="MULOP">⁢</XMTok><XMTok possibleFunction="yes" role="UNKNOWN" font="italic">t</XMTok><XMTok close=")" open="(" role="UNKNOWN" font="italic">n</XMTok></XMApp></XMath></Math> to be the number of total siblings of a node <emph>n</emph> including itself.</p>
</para>
<ERROR class="undefined">{algorithm}</ERROR>
<para xml:id="S3.SS2.p3">
<p><text class="ltx_caption">ClusteredTreeMatching(<Math mode="inline" xml:id="S3.SS2.p3.m1" tex="T^{{}^{\prime}}" text="T ^ ^prime"><XMath><XMApp><XMTok role="SUPERSCRIPTOP" scriptpos="post2"/><XMTok role="UNKNOWN" font="italic">T</XMTok><XMApp role="FLOATSUPERSCRIPT" scriptpos="3"><XMTok name="prime" role="SUPOP">′</XMTok></XMApp></XMApp></XMath></Math>, <Math mode="inline" xml:id="S3.SS2.p3.m2" tex="T^{{}^{\prime\prime}}" text="T ^ ^prime2"><XMath><XMApp><XMTok role="SUPERSCRIPTOP" scriptpos="post2"/><XMTok role="UNKNOWN" font="italic">T</XMTok><XMApp role="FLOATSUPERSCRIPT" scriptpos="3"><XMTok name="prime2">′′</XMTok></XMApp></XMApp></XMath></Math>)</text>
<!-- %**** cr-icaart2011.tex Line 100 **** -->
<ERROR class="undefined">{algorithmic}</ERROR>[1]
<ERROR class="undefined">\IF</ERROR><Math mode="inline" xml:id="S3.SS2.p3.m3" tex="T^{{}^{\prime}}" text="T ^ ^prime"><XMath><XMApp><XMTok role="SUPERSCRIPTOP" scriptpos="post3"/><XMTok role="UNKNOWN" font="italic">T</XMTok><XMApp role="FLOATSUPERSCRIPT" scriptpos="4"><XMTok name="prime" role="SUPOP">′</XMTok></XMApp></XMApp></XMath></Math> has the same label of <Math mode="inline" xml:id="S3.SS2.p3.m4" tex="T^{{}^{\prime\prime}}" text="T ^ ^prime2"><XMath><XMApp><XMTok role="SUPERSCRIPTOP" scriptpos="post3"/><XMTok role="UNKNOWN" font="italic">T</XMTok><XMApp role="FLOATSUPERSCRIPT" scriptpos="4"><XMTok name="prime2">′′</XMTok></XMApp></XMApp></XMath></Math>
<ERROR class="undefined">\STATE</ERROR><Math mode="inline" xml:id="S3.SS2.p3.m5" tex="m\leftarrow" text="m leftarrow absent"><XMath><XMApp><XMTok name="leftarrow" role="ARROW">←</XMTok><XMTok role="UNKNOWN" font="italic">m</XMTok><XMTok meaning="absent"/></XMApp></XMath></Math> <Math mode="inline" xml:id="S3.SS2.p3.m6" tex="d(T^{{}^{\prime}})" text="d * T ^ ^prime"><XMath><XMApp><XMTok meaning="times" role="MULOP">⁢</XMTok><XMTok possibleFunction="yes" role="UNKNOWN" font="italic">d</XMTok><XMApp close=")" open="("><XMTok role="SUPERSCRIPTOP" scriptpos="post2"/><XMTok role="UNKNOWN" font="italic">T</XMTok><XMApp role="FLOATSUPERSCRIPT" scriptpos="3"><XMTok name="prime" role="SUPOP">′</XMTok></XMApp></XMApp></XMApp></XMath></Math>
<ERROR class="undefined">\STATE</ERROR><Math mode="inline" xml:id="S3.SS2.p3.m7" tex="n\leftarrow" text="n leftarrow absent"><XMath><XMApp><XMTok name="leftarrow" role="ARROW">←</XMTok><XMTok role="UNKNOWN" font="italic">n</XMTok><XMTok meaning="absent"/></XMApp></XMath></Math> <Math mode="inline" xml:id="S3.SS2.p3.m8" tex="d(T^{{}^{\prime\prime}})" text="d * T ^ ^prime2"><XMath><XMApp><XMTok meaning="times" role="MULOP">⁢</XMTok><XMTok possibleFunction="yes" role="UNKNOWN" font="italic">d</XMTok><XMApp close=")" open="("><XMTok role="SUPERSCRIPTOP" scriptpos="post2"/><XMTok role="UNKNOWN" font="italic">T</XMTok><XMApp role="FLOATSUPERSCRIPT" scriptpos="3"><XMTok name="prime2">′′</XMTok></XMApp></XMApp></XMApp></XMath></Math>
<ERROR class="undefined">\FOR</ERROR><Math mode="inline" xml:id="S3.SS2.p3.m9" tex="i=0" text="i = 0"><XMath><XMApp><XMTok meaning="equals" role="RELOP">=</XMTok><XMTok role="UNKNOWN" font="italic">i</XMTok><XMTok meaning="0" role="NUMBER">0</XMTok></XMApp></XMath></Math> to <Math mode="inline" xml:id="S3.SS2.p3.m10" tex="m" text="m"><XMath><XMTok role="UNKNOWN" font="italic">m</XMTok></XMath></Math>
<ERROR class="undefined">\STATE</ERROR><Math mode="inline" xml:id="S3.SS2.p3.m11" tex="M[i][0]\leftarrow 0" text="M * i * 0 leftarrow 0"><XMath><XMApp><XMTok name="leftarrow" role="ARROW">←</XMTok><XMApp><XMTok meaning="times" role="MULOP">⁢</XMTok><XMTok possibleFunction="yes" role="UNKNOWN" font="italic">M</XMTok><XMTok close="]" open="[" role="UNKNOWN" font="italic">i</XMTok><XMTok close="]" meaning="0" open="[" role="NUMBER">0</XMTok></XMApp><XMTok meaning="0" role="NUMBER">0</XMTok></XMApp></XMath></Math>;
<ERROR class="undefined">\ENDFOR</ERROR><ERROR class="undefined">\FOR</ERROR><Math mode="inline" xml:id="S3.SS2.p3.m12" tex="j=0" text="j = 0"><XMath><XMApp><XMTok meaning="equals" role="RELOP">=</XMTok><XMTok role="UNKNOWN" font="italic">j</XMTok><XMTok meaning="0" role="NUMBER">0</XMTok></XMApp></XMath></Math> to <Math mode="inline" xml:id="S3.SS2.p3.m13" tex="n" text="n"><XMath><XMTok role="UNKNOWN" font="italic">n</XMTok></XMath></Math>
<ERROR class="undefined">\STATE</ERROR><Math mode="inline" xml:id="S3.SS2.p3.m14" tex="M[0][j]\leftarrow 0" text="M * 0 * j leftarrow 0"><XMath><XMApp><XMTok name="leftarrow" role="ARROW">←</XMTok><XMApp><XMTok meaning="times" role="MULOP">⁢</XMTok><XMTok possibleFunction="yes" role="UNKNOWN" font="italic">M</XMTok><XMTok close="]" meaning="0" open="[" role="NUMBER">0</XMTok><XMTok close="]" open="[" role="UNKNOWN" font="italic">j</XMTok></XMApp><XMTok meaning="0" role="NUMBER">0</XMTok></XMApp></XMath></Math>;
<ERROR class="undefined">\ENDFOR</ERROR><ERROR class="undefined">\FORALL</ERROR><Math mode="inline" xml:id="S3.SS2.p3.m15" tex="i" text="i"><XMath><XMTok role="UNKNOWN" font="italic">i</XMTok></XMath></Math> such that <Math mode="inline" xml:id="S3.SS2.p3.m16" tex="1\leq i\leq m" text="1 less= i less= m"><XMath><XMApp><XMTok meaning="multirelation"/><XMTok meaning="1" role="NUMBER">1</XMTok><XMTok meaning="less-than-or-equals" name="leq" role="RELOP">≤</XMTok><XMTok role="UNKNOWN" font="italic">i</XMTok><XMTok meaning="less-than-or-equals" name="leq" role="RELOP">≤</XMTok><XMTok role="UNKNOWN" font="italic">m</XMTok></XMApp></XMath></Math>
<ERROR class="undefined">\FORALL</ERROR><Math mode="inline" xml:id="S3.SS2.p3.m17" tex="j" text="j"><XMath><XMTok role="UNKNOWN" font="italic">j</XMTok></XMath></Math> such that <Math mode="inline" xml:id="S3.SS2.p3.m18" tex="1\leq j\leq n" text="1 less= j less= n"><XMath><XMApp><XMTok meaning="multirelation"/><XMTok meaning="1" role="NUMBER">1</XMTok><XMTok meaning="less-than-or-equals" name="leq" role="RELOP">≤</XMTok><XMTok role="UNKNOWN" font="italic">j</XMTok><XMTok meaning="less-than-or-equals" name="leq" role="RELOP">≤</XMTok><XMTok role="UNKNOWN" font="italic">n</XMTok></XMApp></XMath></Math>
<ERROR class="undefined">\STATE</ERROR><Math mode="inline" xml:id="S3.SS2.p3.m19" tex="M[i][j]\leftarrow" text="M * i * j leftarrow absent"><XMath><XMApp><XMTok name="leftarrow" role="ARROW">←</XMTok><XMApp><XMTok meaning="times" role="MULOP">⁢</XMTok><XMTok possibleFunction="yes" role="UNKNOWN" font="italic">M</XMTok><XMTok close="]" open="[" role="UNKNOWN" font="italic">i</XMTok><XMTok close="]" open="[" role="UNKNOWN" font="italic">j</XMTok></XMApp><XMTok meaning="absent"/></XMApp></XMath></Math> Max(<Math mode="inline" xml:id="S3.SS2.p3.m20" tex="M[i][j-1]" text="M * i * (j - 1)"><XMath><XMApp><XMTok meaning="times" role="MULOP">⁢</XMTok><XMTok possibleFunction="yes" role="UNKNOWN" font="italic">M</XMTok><XMTok close="]" open="[" role="UNKNOWN" font="italic">i</XMTok><XMApp close="]" open="["><XMTok meaning="minus" role="ADDOP">-</XMTok><XMTok role="UNKNOWN" font="italic">j</XMTok><XMTok meaning="1" role="NUMBER">1</XMTok></XMApp></XMApp></XMath></Math>, <Math mode="inline" xml:id="S3.SS2.p3.m21" tex="M[i-1][j]" text="M * (i - 1) * j"><XMath><XMApp><XMTok meaning="times" role="MULOP">⁢</XMTok><XMTok role="UNKNOWN" font="italic">M</XMTok><XMApp close="]" open="["><XMTok meaning="minus" role="ADDOP">-</XMTok><XMTok role="UNKNOWN" font="italic">i</XMTok><XMTok meaning="1" role="NUMBER">1</XMTok></XMApp><XMTok close="]" open="[" role="UNKNOWN" font="italic">j</XMTok></XMApp></XMath></Math>, <Math mode="inline" xml:id="S3.SS2.p3.m22" tex="M[i-1][j-1]+W[i][j]" text="M * (i - 1) * (j - 1) + W * i * j"><XMath><XMApp><XMTok meaning="plus" role="ADDOP">+</XMTok><XMApp><XMTok meaning="times" role="MULOP">⁢</XMTok><XMTok role="UNKNOWN" font="italic">M</XMTok><XMApp close="]" open="["><XMTok meaning="minus" role="ADDOP">-</XMTok><XMTok role="UNKNOWN" font="italic">i</XMTok><XMTok meaning="1" role="NUMBER">1</XMTok></XMApp><XMApp close="]" open="["><XMTok meaning="minus" role="ADDOP">-</XMTok><XMTok role="UNKNOWN" font="italic">j</XMTok><XMTok meaning="1" role="NUMBER">1</XMTok></XMApp></XMApp><XMApp><XMTok meaning="times" role="MULOP">⁢</XMTok><XMTok possibleFunction="yes" role="UNKNOWN" font="italic">W</XMTok><XMTok close="]" open="[" role="UNKNOWN" font="italic">i</XMTok><XMTok close="]" open="[" role="UNKNOWN" font="italic">j</XMTok></XMApp></XMApp></XMath></Math>) where <Math mode="inline" xml:id="S3.SS2.p3.m23" tex="W[i][j]" text="W * i * j"><XMath><XMApp><XMTok meaning="times" role="MULOP">⁢</XMTok><XMTok possibleFunction="yes" role="UNKNOWN" font="italic">W</XMTok><XMTok close="]" open="[" role="UNKNOWN" font="italic">i</XMTok><XMTok close="]" open="[" role="UNKNOWN" font="italic">j</XMTok></XMApp></XMath></Math> = ClusteredTreeMatching(<Math mode="inline" xml:id="S3.SS2.p3.m24" tex="T^{{}^{\prime}}(i-1)" text="T ^ ^prime * (i - 1)"><XMath><XMApp><XMTok meaning="times" role="MULOP">⁢</XMTok><XMApp><XMTok role="SUPERSCRIPTOP" scriptpos="post2"/><XMTok role="UNKNOWN" font="italic">T</XMTok><XMApp role="FLOATSUPERSCRIPT" scriptpos="3"><XMTok name="prime" role="SUPOP">′</XMTok></XMApp></XMApp><XMApp close=")" open="("><XMTok meaning="minus" role="ADDOP">-</XMTok><XMTok role="UNKNOWN" font="italic">i</XMTok><XMTok meaning="1" role="NUMBER">1</XMTok></XMApp></XMApp></XMath></Math>, <Math mode="inline" xml:id="S3.SS2.p3.m25" tex="T^{{}^{\prime\prime}}(j-1)" text="T ^ ^prime2 * (j - 1)"><XMath><XMApp><XMTok meaning="times" role="MULOP">⁢</XMTok><XMApp><XMTok role="SUPERSCRIPTOP" scriptpos="post2"/><XMTok role="UNKNOWN" font="italic">T</XMTok><XMApp role="FLOATSUPERSCRIPT" scriptpos="3"><XMTok name="prime2">′′</XMTok></XMApp></XMApp><XMApp close=")" open="("><XMTok meaning="minus" role="ADDOP">-</XMTok><XMTok role="UNKNOWN" font="italic">j</XMTok><XMTok meaning="1" role="NUMBER">1</XMTok></XMApp></XMApp></XMath></Math>)
<ERROR class="undefined">\ENDFOR</ERROR><ERROR class="undefined">\ENDFOR</ERROR></p>
</para>
<ERROR class="undefined">\IF</ERROR>
<para xml:id="S3.SS2.p4">
<p><Math mode="inline" xml:id="S3.SS2.p4.m1" tex="m&gt;0" text="m &gt; 0"><XMath><XMApp><XMTok meaning="greater-than" role="RELOP">&gt;</XMTok><XMTok role="UNKNOWN" font="italic">m</XMTok><XMTok meaning="0" role="NUMBER">0</XMTok></XMApp></XMath></Math> AND <Math mode="inline" xml:id="S3.SS2.p4.m2" tex="n&gt;0" text="n &gt; 0"><XMath><XMApp><XMTok meaning="greater-than" role="RELOP">&gt;</XMTok><XMTok role="UNKNOWN" font="italic">n</XMTok><XMTok meaning="0" role="NUMBER">0</XMTok></XMApp></XMath></Math>
<ERROR class="undefined">\STATE</ERROR>return M[m][n] * 1 / Max(<Math mode="inline" xml:id="S3.SS2.p4.m3" tex="t(T^{{}^{\prime}})" text="t * T ^ ^prime"><XMath><XMApp><XMTok meaning="times" role="MULOP">⁢</XMTok><XMTok possibleFunction="yes" role="UNKNOWN" font="italic">t</XMTok><XMApp close=")" open="("><XMTok role="SUPERSCRIPTOP" scriptpos="post2"/><XMTok role="UNKNOWN" font="italic">T</XMTok><XMApp role="FLOATSUPERSCRIPT" scriptpos="3"><XMTok name="prime" role="SUPOP">′</XMTok></XMApp></XMApp></XMApp></XMath></Math>, <Math mode="inline" xml:id="S3.SS2.p4.m4" tex="t(T^{{}^{\prime\prime}})" text="t * T ^ ^prime2"><XMath><XMApp><XMTok meaning="times" role="MULOP">⁢</XMTok><XMTok possibleFunction="yes" role="UNKNOWN" font="italic">t</XMTok><XMApp close=")" open="("><XMTok role="SUPERSCRIPTOP" scriptpos="post2"/><XMTok role="UNKNOWN" font="italic">T</XMTok><XMApp role="FLOATSUPERSCRIPT" scriptpos="3"><XMTok name="prime2">′′</XMTok></XMApp></XMApp></XMApp></XMath></Math>)
<ERROR class="undefined">\ELSE</ERROR><ERROR class="undefined">\STATE</ERROR>return M[m][n] + 1 / Max(<Math mode="inline" xml:id="S3.SS2.p4.m5" tex="t(T^{{}^{\prime}})" text="t * T ^ ^prime"><XMath><XMApp><XMTok meaning="times" role="MULOP">⁢</XMTok><XMTok possibleFunction="yes" role="UNKNOWN" font="italic">t</XMTok><XMApp close=")" open="("><XMTok role="SUPERSCRIPTOP" scriptpos="post2"/><XMTok role="UNKNOWN" font="italic">T</XMTok><XMApp role="FLOATSUPERSCRIPT" scriptpos="3"><XMTok name="prime" role="SUPOP">′</XMTok></XMApp></XMApp></XMApp></XMath></Math>, <Math mode="inline" xml:id="S3.SS2.p4.m6" tex="t(T^{{}^{\prime\prime}})" text="t * T ^ ^prime2"><XMath><XMApp><XMTok meaning="times" role="MULOP">⁢</XMTok><XMTok possibleFunction="yes" role="UNKNOWN" font="italic">t</XMTok><XMApp close=")" open="("><XMTok role="SUPERSCRIPTOP" scriptpos="post2"/><XMTok role="UNKNOWN" font="italic">T</XMTok><XMApp role="FLOATSUPERSCRIPT" scriptpos="3"><XMTok name="prime2">′′</XMTok></XMApp></XMApp></XMApp></XMath></Math>)
<ERROR class="undefined">\ENDIF</ERROR><ERROR class="undefined">\ELSE</ERROR><ERROR class="undefined">\STATE</ERROR>return 0
<ERROR class="undefined">\ENDIF</ERROR><!-- %**** cr-icaart2011.tex Line 125 **** --></p>
</para>
<figure frefnum="Figure 1" refnum="1" xml:id="S3.F1" labels="LABEL:fig1">
<p>
<text fontsize="small">(A)
<ERROR class="undefined">\Tree</ERROR>[.a<break/>N1 [.b<break/>N2<break/><Math mode="inline" xml:id="S3.F1.m1" tex="\frac{1}{4}\cdot"><XMath><XMApp><XMTok mathstyle="text" meaning="divide" role="MULOP"/><XMTok meaning="1" role="NUMBER" fontsize="normal">1</XMTok><XMTok meaning="4" role="NUMBER" fontsize="normal">4</XMTok></XMApp><XMTok name="cdot" role="MULOP" fontsize="normal">⋅</XMTok></XMath></Math><break/>(N6+N7) [.d<break/>N6<break/><Math mode="inline" xml:id="S3.F1.m2" tex="\frac{1}{2}" text="1 / 2"><XMath><XMApp><XMTok mathstyle="text" meaning="divide" role="MULOP"/><XMTok meaning="1" role="NUMBER" fontsize="normal">1</XMTok><XMTok meaning="2" role="NUMBER" fontsize="normal">2</XMTok></XMApp></XMath></Math> ] [.e<break/>N7<break/><Math mode="inline" xml:id="S3.F1.m3" tex="\frac{1}{2}" text="1 / 2"><XMath><XMApp><XMTok mathstyle="text" meaning="divide" role="MULOP"/><XMTok meaning="1" role="NUMBER" fontsize="normal">1</XMTok><XMTok meaning="2" role="NUMBER" fontsize="normal">2</XMTok></XMApp></XMath></Math> ] ] [.c<break/>N3<break/><Math mode="inline" xml:id="S3.F1.m4" tex="\frac{1}{4}\cdot"><XMath><XMApp><XMTok mathstyle="text" meaning="divide" role="MULOP"/><XMTok meaning="1" role="NUMBER" fontsize="normal">1</XMTok><XMTok meaning="4" role="NUMBER" fontsize="normal">4</XMTok></XMApp><XMTok name="cdot" role="MULOP" fontsize="normal">⋅</XMTok></XMath></Math>N8 [.f<break/>N8<break/><Math mode="inline" xml:id="S3.F1.m5" tex="\frac{1}{2}" text="1 / 2"><XMath><XMApp><XMTok mathstyle="text" meaning="divide" role="MULOP"/><XMTok meaning="1" role="NUMBER" fontsize="normal">1</XMTok><XMTok meaning="2" role="NUMBER" fontsize="normal">2</XMTok></XMApp></XMath></Math> ] ] [.b<break/>N4<break/><Math mode="inline" xml:id="S3.F1.m6" tex="\frac{1}{4}\cdot"><XMath><XMApp><XMTok mathstyle="text" meaning="divide" role="MULOP"/><XMTok meaning="1" role="NUMBER" fontsize="normal">1</XMTok><XMTok meaning="4" role="NUMBER" fontsize="normal">4</XMTok></XMApp><XMTok name="cdot" role="MULOP" fontsize="normal">⋅</XMTok></XMath></Math><break/>(N9+N10) [.e<break/>N9<break/><Math mode="inline" xml:id="S3.F1.m7" tex="\frac{1}{2}" text="1 / 2"><XMath><XMApp><XMTok mathstyle="text" meaning="divide" role="MULOP"/><XMTok meaning="1" role="NUMBER" fontsize="normal">1</XMTok><XMTok meaning="2" role="NUMBER" fontsize="normal">2</XMTok></XMApp></XMath></Math> ] [.d<break/>N10<break/><Math mode="inline" xml:id="S3.F1.m8" tex="\frac{1}{2}" text="1 / 2"><XMath><XMApp><XMTok mathstyle="text" meaning="divide" role="MULOP"/><XMTok meaning="1" role="NUMBER" fontsize="normal">1</XMTok><XMTok meaning="2" role="NUMBER" fontsize="normal">2</XMTok></XMApp></XMath></Math> ] ] [.c<break/>N5<break/><Math mode="inline" xml:id="S3.F1.m9" tex="\frac{1}{4}\cdot"><XMath><XMApp><XMTok mathstyle="text" meaning="divide" role="MULOP"/><XMTok meaning="1" role="NUMBER" fontsize="normal">1</XMTok><XMTok meaning="4" role="NUMBER" fontsize="normal">4</XMTok></XMApp><XMTok name="cdot" role="MULOP" fontsize="normal">⋅</XMTok></XMath></Math>N11 [.g<break/>N11<break/><Math mode="inline" xml:id="S3.F1.m10" tex="\frac{1}{2}\cdot"><XMath><XMApp><XMTok mathstyle="text" meaning="divide" role="MULOP"/><XMTok meaning="1" role="NUMBER" fontsize="normal">1</XMTok><XMTok meaning="2" role="NUMBER" fontsize="normal">2</XMTok></XMApp><XMTok name="cdot" role="MULOP" fontsize="normal">⋅</XMTok></XMath></Math><break/>(N12+N13+N14) [.h<break/>N12<break/><Math mode="inline" xml:id="S3.F1.m11" tex="\frac{1}{3}" text="1 / 3"><XMath><XMApp><XMTok mathstyle="text" meaning="divide" role="MULOP"/><XMTok meaning="1" role="NUMBER" fontsize="normal">1</XMTok><XMTok meaning="3" role="NUMBER" fontsize="normal">3</XMTok></XMApp></XMath></Math> ] [.i<break/>N13<break/><Math mode="inline" xml:id="S3.F1.m12" tex="\frac{1}{3}" text="1 / 3"><XMath><XMApp><XMTok mathstyle="text" meaning="divide" role="MULOP"/><XMTok meaning="1" role="NUMBER" fontsize="normal">1</XMTok><XMTok meaning="3" role="NUMBER" fontsize="normal">3</XMTok></XMApp></XMath></Math> ] [.j<break/>N14<break/><Math mode="inline" xml:id="S3.F1.m13" tex="\frac{1}{3}" text="1 / 3"><XMath><XMApp><XMTok mathstyle="text" meaning="divide" role="MULOP"/><XMTok meaning="1" role="NUMBER" fontsize="normal">1</XMTok><XMTok meaning="3" role="NUMBER" fontsize="normal">3</XMTok></XMApp></XMath></Math> ] ] ] ]
(B)
<ERROR class="undefined">\Tree</ERROR>[.a<break/>N15 [.b<break/>N16<break/><Math mode="inline" xml:id="S3.F1.m14" tex="\frac{1}{4}\cdot"><XMath><XMApp><XMTok mathstyle="text" meaning="divide" role="MULOP"/><XMTok meaning="1" role="NUMBER" fontsize="normal">1</XMTok><XMTok meaning="4" role="NUMBER" fontsize="normal">4</XMTok></XMApp><XMTok name="cdot" role="MULOP" fontsize="normal">⋅</XMTok></XMath></Math><break/>(N18+N19) [.d<break/>N18<break/><Math mode="inline" xml:id="S3.F1.m15" tex="\frac{1}{2}" text="1 / 2"><XMath><XMApp><XMTok mathstyle="text" meaning="divide" role="MULOP"/><XMTok meaning="1" role="NUMBER" fontsize="normal">1</XMTok><XMTok meaning="2" role="NUMBER" fontsize="normal">2</XMTok></XMApp></XMath></Math> ] [.e<break/>N19<break/><Math mode="inline" xml:id="S3.F1.m16" tex="\frac{1}{2}" text="1 / 2"><XMath><XMApp><XMTok mathstyle="text" meaning="divide" role="MULOP"/><XMTok meaning="1" role="NUMBER" fontsize="normal">1</XMTok><XMTok meaning="2" role="NUMBER" fontsize="normal">2</XMTok></XMApp></XMath></Math> ] ] [.c<break/>N17<break/><Math mode="inline" xml:id="S3.F1.m17" tex="\frac{1}{4}\cdot"><XMath><XMApp><XMTok mathstyle="text" meaning="divide" role="MULOP"/><XMTok meaning="1" role="NUMBER" fontsize="normal">1</XMTok><XMTok meaning="4" role="NUMBER" fontsize="normal">4</XMTok></XMApp><XMTok name="cdot" role="MULOP" fontsize="normal">⋅</XMTok></XMath></Math><break/>(N20+N21) [.g<break/>N20<break/><Math mode="inline" xml:id="S3.F1.m18" tex="\frac{1}{2}\cdot"><XMath><XMApp><XMTok mathstyle="text" meaning="divide" role="MULOP"/><XMTok meaning="1" role="NUMBER" fontsize="normal">1</XMTok><XMTok meaning="2" role="NUMBER" fontsize="normal">2</XMTok></XMApp><XMTok name="cdot" role="MULOP" fontsize="normal">⋅</XMTok></XMath></Math>N22 [.h<break/>N22<break/><Math mode="inline" xml:id="S3.F1.m19" tex="\frac{1}{3}" text="1 / 3"><XMath><XMApp><XMTok mathstyle="text" meaning="divide" role="MULOP"/><XMTok meaning="1" role="NUMBER" fontsize="normal">1</XMTok><XMTok meaning="3" role="NUMBER" fontsize="normal">3</XMTok></XMApp></XMath></Math> ] ] [.f<break/>N21<break/><Math mode="inline" xml:id="S3.F1.m20" tex="\frac{1}{2}" text="1 / 2"><XMath><XMApp><XMTok mathstyle="text" meaning="divide" role="MULOP"/><XMTok meaning="1" role="NUMBER" fontsize="normal">1</XMTok><XMTok meaning="2" role="NUMBER" fontsize="normal">2</XMTok></XMApp></XMath></Math> ] ] ]</text>
</p>
<toccaption>
<tag close=" ">
<text fontsize="small">1</text>
</tag>
<text fontsize="small">Two labeled trees, </text>
<emph fontsize="small">A</emph>
<text fontsize="small"> and </text>
<emph fontsize="small">B</emph>
<text fontsize="small">, which show similarities in their structures.</text>
</toccaption>
<caption fontsize="small"><tag close=": ">Figure 1</tag>Two labeled trees, <emph>A</emph> and <emph>B</emph>, which show similarities in their structures.</caption>
</figure>
<para xml:id="S3.SS2.p5">
<p>The main difference between the <emph>simple</emph> and the <emph>clustered</emph> tree matching is the way of assigning values to matching elements. The first, adopts a fixed matching value of 1; the latter, instead, computes some additional information, retrieved in the sub-trees of matched nodes.</p>
</para>
<para xml:id="S3.SS2.p6">
<p>Omitting detail, provided in <cite>[<bibref bibrefs="Baumgartner2010" separator="," show="Refnum" yyseparator=","/>]</cite>, the <emph>clustered tree matching</emph> algorithm assigns a weighted value equal to 1, divided by the greater number of siblings, computed between the two compared nodes (also considering themselves).</p>
</para>
<para xml:id="S3.SS2.p7">
<p>Figure <ref labelref="LABEL:fig1"/> shows two similar simple rooted, labeled trees, and the way of assignment of weights that would be calculated by applying the <emph>clustered tree matching</emph> between them.</p>
</para>
<figure frefnum="Figure 2" placement="!ht" refnum="2" xml:id="S3.F2" labels="LABEL:diagram">
<!-- %**** cr-icaart2011.tex Line 150 **** -->
<graphics graphic="graph.png" options="width=360.0pt" xml:id="S3.F2.g1" class="ltx_centering"/>
<toccaption class="ltx_centering"><tag close=" ">2</tag>State diagram of Web wrappers design and adaptation in the Lixto Visual Developer.</toccaption>
<caption class="ltx_centering"><tag close=": ">Figure 2</tag>State diagram of Web wrappers design and adaptation in the Lixto Visual Developer.</caption>
</figure>
<subsubsection refnum="3.2.1" xml:id="S3.SS2.SSS1">
<title><tag close=" ">3.2.1</tag>Motivations</title>
<para xml:id="S3.SS2.SSS1.p1">
<p>Several motivations lead us to bring these improvements.
For example, considering common characteristics shown by Web pages, provides some useful tips: usually, rich sub-levels (i.e. sub-levels with several nodes) represent list items, table rows, menu, etc., more frequently affected by modifications than other elements of Web pages; moreover, analyzing which kind of modifications usually affect Web pages suggests to assign less importance to slight changes happening in deep sub-levels of the DOM tree, this because these are commonly related to missing/added details to elements, etc.</p>
</para>
<para xml:id="S3.SS2.SSS1.p2">
<p>On the one hand, <emph>simple tree matching</emph> ignores these important aspects, on the other <emph>clustered tree matching</emph> exploits information like position and number of mismatches to produce a more accurate result.</p>
</para>
</subsubsection>
<subsubsection refnum="3.2.2" xml:id="S3.SS2.SSS2">
<title><tag close=" ">3.2.2</tag>Advantages and limitations</title>
<para xml:id="S3.SS2.SSS2.p1">
<p>The main advantage of our <emph>clustered tree matching</emph> is its capability to calculate an absolute measure of similarity, while <emph>simple tree matching</emph> produces the mapping value between two trees.
Moreover, the more the structure of considered trees is complex and similar, the more the measure of similarity established by this algorithm will be accurate.
It fits particularly well to matching up HTML Web pages, this because they often own rich and variegated structures.</p>
</para>
<para xml:id="S3.SS2.SSS2.p2">
<p>One important limitation of algorithms based on the tree matching is that they can not match permutations of nodes.
Intuitively, this happens because of the dynamic programming technique used to face the problem with computational efficiency; both the algorithms execute recursive calls, scanning sub-trees in a sequential manner, so as to reduce the number of required iterations (e.g. in Figure <ref labelref="LABEL:fig1"/>, permutation of nodes [c,b] in <emph>A</emph> with [b,c] in <emph>B</emph> is not computed).
It is possible to modify the algorithm introducing the analysis of permutations of sub-trees, but this would heavily affect performances.
Despite this intrinsic limit, this technique appears to fit very well to our purpose of measuring HTML trees similarity.</p>
</para>
<!-- %**** cr-icaart2011.tex Line 175 **** -->
<para xml:id="S3.SS2.SSS2.p3">
<p>It is important to remark that, applying <emph>simple tree matching</emph> to compare simple and quite different trees will produce a more accurate result.
Despite that, because of the most of modifications in Web pages are usually slight changes, <emph>clustered tree matching</emph> is far and away the best algorithm to be adopted in building automatically adaptable wrappers.
Moreover, this algorithm makes it possible to establish a custom level of accuracy of the process of matching, defining a similarity threshold required to match two trees.</p>
</para>
<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
</subsubsection>
</subsection>
</section>
<section refnum="4" xml:id="S4">
<title><tag close=" ">4</tag>ADAPTABLE WEB WRAPPERS</title>
<para xml:id="S4.p1">
<p>Based on the adaptation algorithms described above, a proof-of-concept extension to the Lixto Visual Developer (VD) has been implemented.
Wrappers are automatically adapted based on given configuration settings, integrity constraints, and triggers.</p>
</para>
<para xml:id="S4.p2">
<p>Usually, wrapper generation in VD is a hierarchical top-down process, e.g. first, a “hotel record” is characterized, and inside the hotel record, entities such as “rating” and “room types”.
Such entities are referred to as <emph>patterns</emph>.
To define a pattern, the wrapper designer visually selects an example and together with system suggestions generalizes the rule configuration until the desired instances are matched.</p>
</para>
<para xml:id="S4.p3">
<p>In this extension, to support the automatic adaptation process during runtime, the wrapper designer further specifies what it means that extraction failed.
In general, this means wrong or missing data, and with integrity constraints one can give indications how correct results look like.
Typical <emph>integrity constraints</emph> are:</p>
</para>
<para xml:id="S4.p4">
<itemize xml:id="I1">
<item xml:id="I1.i1">
<tag>•</tag>
<para xml:id="I1.i1.p1">
<p><emph>Occurrence restrictions</emph>: e.g. minimum and/or maximum number of allowed occurrence of a pattern instance, minimum and/or maximum number of child pattern instances;</p>
</para>
</item>
<item xml:id="I1.i2">
<tag>•</tag>
<para xml:id="I1.i2.p1">
<p><emph>Data types</emph>: e.g. all results of a “price” pattern need to be of data type integer.</p>
</para>
</item>
</itemize>
</para>
<para xml:id="S4.p5">
<p>Integrity constraints can be specified with each pattern individually or be based on a data model (in our case, a given XML Schema).
<!-- %**** cr-icaart2011.tex Line 200 **** -->In case integrity constraints are violated during runtime, the adaptation process for this particular pattern is started.</p>
</para>
<para xml:id="S4.p6">
<p>During wrapper creation, the application designer provides a number of configuration settings to this process.
This includes:</p>
</para>
<para xml:id="S4.p7">
<itemize xml:id="I2">
<item xml:id="I2.i1">
<tag>•</tag>
<para xml:id="I2.i1.p1">
<p>Threshold values;</p>
</para>
</item>
<item xml:id="I2.i2">
<tag>•</tag>
<para xml:id="I2.i2.p1">
<p>Priorities/order of adaptation algorithms used;</p>
</para>
</item>
<item xml:id="I2.i3">
<tag>•</tag>
<para xml:id="I2.i3.p1">
<p>Flags of the chosen algorithm (e.g. using HTML element name as node label, using id/class attributes as node labels, etc.);</p>
</para>
</item>
<item xml:id="I2.i4">
<tag>•</tag>
<para xml:id="I2.i4.p1">
<p>Triggers for bottom-up, top-down and process flow adaptation bubbling;</p>
</para>
</item>
<item xml:id="I2.i5">
<tag>•</tag>
<para xml:id="I2.i5.p1">
<p>Whether stored tree-grams and XPath statements are updated based on adaptation results to be additionally used as inputs in future adaptation procedures (reflecting and addressing regular slight changes of a Web page over time).</p>
</para>
</item>
</itemize>
</para>
<para xml:id="S4.p8">
<p>Used algorithms for adaptations rely on two inputs (stored example tree-gram(s), DOM tree of current page) and provide as output sub-trees that are sufficiently similar to the original (example) ones, and in consequence a generated XPath statement that matches the nodes (Fig. <ref labelref="LABEL:diagram"/> summarizes the process from design time and execution time perspective).</p>
</para>
<para xml:id="S4.p9">
<p>Algorithms under consideration include the clustered tree matching discussed above, as well as tree-based variants of the Bigram <cite>[<bibref bibrefs="Collins1996bigram" separator="," show="Refnum" yyseparator=","/>]</cite> and Jaro-Winkler similarity <cite>[<bibref bibrefs="winkler1999state" separator="," show="Refnum" yyseparator=","/>]</cite> (which are of advantage when one assumes that permutations in the tree nodes are likely over time).
Moreover, for extraction of leaf nodes which exhibit no inherent tree structure, we rely on string similarity metrics.
Finally, triggers in adaptation settings can be used to force adaptation of further fragments of the wrapper:</p>
</para>
<para xml:id="S4.p10">
<itemize xml:id="I3">
<item xml:id="I3.i1">
<tag>•</tag>
<para xml:id="I3.i1.p1">
<p>Top-down: forcing adaptation of all/some descendant patterns (e.g. adapt the “price” pattern as well to identify prices within a record if the “record” pattern was adapted).</p>
</para>
</item>
<item xml:id="I3.i2">
<tag>•</tag>
<para xml:id="I3.i2.p1">
<p>Bottom-up: forcing adaptation of a parent pattern in case adaptation of a particular pattern was not successful. Experimental evaluation pointed out that in such cases it is often the problem that the parent pattern already provides wrong or missing results (even if matched by the integrity constraints) and has to be adapted first.</p>
</para>
</item>
<item xml:id="I3.i3">
<tag>•</tag>
<para xml:id="I3.i3.p1">
<p>Process flow: it might happen that particular patterns are no longer detected because the wrapper evaluates on the wrong page. Hence, there is the need to use variations in the deep web navigation processes. A simple approach explored at this time is to use a switch window or back step action to check if the previous window or another tab/pop-up provides the required information.</p>
</para>
</item>
</itemize>
</para>
<!-- %**** cr-icaart2011.tex Line 225 ****
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%-->
</section>
<section refnum="5" xml:id="S5">
<title><tag close=" ">5</tag>EXPERIMENTAL RESULTS</title>
<para xml:id="S5.p1">
<p>The best way of measuring reliability of automatically adaptable wrappers is to test their behavior in real world use-cases.
Several common areas of application of Web wrappers have been identified: social networks and bookmarking, retail market and comparison shopping, Web search and information distribution, and, finally, Web communities.
For each of these fields, we designed a test using a representative Website, studying a total of 7 use-cases, defining wrappers applied to 70 Web pages.
Websites like Facebook, Google News, Ebay, etc. are usually subjected to countless, although often invisible, structural modifications; thus, altering the correct behavior of Web wrappers.
Table <ref labelref="LABEL:tab-res"/> summarizes results: each wrapper automatically tries to adapt itself using both the algorithms described in Section 3.
Column referred as <emph>thresh.</emph> means the threshold value of similarity required to match two elements. Columns <emph>tp</emph>, <emph>fp</emph> and <emph>fn</emph> represent true and false positive, and false negative, measures usually adopted to evaluate precision and recall of these kind of tasks.</p>
</para>
<table frefnum="Table 1" placement="!ht" refnum="1" xml:id="S5.T1" labels="LABEL:tab-res">
<tabular vattach="middle">
<tbody>
<tr>
<td align="center" border="l r" colspan="2" thead="true">
<text fontsize="small"></text>
</td>
<td align="center" border="r t" colspan="3">
<text fontsize="small">Simple T. M.</text>
</td>
<td align="center" border="r t" colspan="3">
<text fontsize="small">Clustered T. M.</text>
</td>
</tr>
<tr>
<td border="r" colspan="2" thead="true"/>
<td align="center" border="r t" colspan="3">
<text fontsize="small">Precision/Recall</text>
</td>
<td align="center" border="r t" colspan="3">
<text fontsize="small">Precision/Recall</text>
</td>
</tr>
<tr>
<td align="center" border="l" colspan="8" thead="true">
<text fontsize="small"></text>
<p class="ltx_intertext"/>
</td>
</tr>
<tr>
<td align="center" border="l t" thead="true">
<text fontsize="small">Scenario</text>
</td>
<td align="center" border="r t" thead="true">
<text fontsize="small">thresh.</text>
</td>
<td align="center" border="t">
<text fontsize="small">tp</text>
</td>
<td align="center" border="t">
<text fontsize="small">fp</text>
</td>
<td align="center" border="r t">
<text fontsize="small">fn</text>
</td>
<td align="center" border="t">
<text fontsize="small">tp</text>
</td>
<td align="center" border="t">
<text fontsize="small">fp</text>
</td>
<td align="center" border="r t">
<text fontsize="small">fn</text>
</td>
</tr>
<tr>
<td align="center" border="l t" thead="true">
<text fontsize="small">Delicious</text>
</td>
<td align="center" border="r t" thead="true">
<text fontsize="small">40%</text>
</td>
<td align="center" border="t">
<text fontsize="small">100</text>
</td>
<td align="center" border="t">
<text fontsize="small">4</text>
</td>
<td align="center" border="r t">
<text fontsize="small">-</text>
</td>
<td align="center" border="t">
<text fontsize="small">100</text>
</td>
<td align="center" border="t">
<text fontsize="small">-</text>
</td>
<td align="center" border="r t">
<text fontsize="small">-</text>
</td>
</tr>
<tr>
<td align="center" border="l" thead="true">
<text fontsize="small">Ebay</text>
</td>
<td align="center" border="r" thead="true">
<text fontsize="small">85%</text>
</td>
<td align="center">
<text fontsize="small">200</text>
</td>
<td align="center">
<text fontsize="small">12</text>
</td>
<td align="center" border="r">
<text fontsize="small">-</text>
</td>
<td align="center">
<text fontsize="small">196</text>
</td>
<td align="center">
<text fontsize="small">-</text>
</td>
<td align="center" border="r">
<text fontsize="small">4</text>
</td>
</tr>
<tr>
<td align="center" border="l" thead="true">
<text fontsize="small">Facebook</text>
</td>
<td align="center" border="r" thead="true">
<text fontsize="small">65%</text>
</td>
<td align="center">
<text fontsize="small">240</text>
</td>
<td align="center">
<text fontsize="small">72</text>
</td>
<td align="center" border="r">
<text fontsize="small">-</text>
</td>
<td align="center">
<text fontsize="small">240</text>
</td>
<td align="center">
<text fontsize="small">12</text>
</td>
<td align="center" border="r">
<text fontsize="small">-</text>
</td>
</tr>
<tr>
<td align="center" border="l" thead="true">
<text fontsize="small">Google news</text>
</td>
<td align="center" border="r" thead="true">
<text fontsize="small">90%</text>
</td>
<td align="center">
<text fontsize="small">604</text>
</td>
<td align="center">
<text fontsize="small">-</text>
</td>
<td align="center" border="r">
<text fontsize="small">52</text>
</td>
<td align="center">
<text fontsize="small">644</text>
</td>
<td align="center">
<text fontsize="small">-</text>
</td>
<td align="center" border="r">
<text fontsize="small">12</text>
</td>
</tr>
<tr>
<td align="center" border="l" thead="true">
<text fontsize="small">Google.com</text>
</td>
<td align="center" border="r" thead="true">
<text fontsize="small">80%</text>
</td>
<td align="center">
<text fontsize="small">100</text>
</td>
<td align="center">
<text fontsize="small">-</text>
</td>
<td align="center" border="r">
<text fontsize="small">60</text>
</td>
<td align="center">
<text fontsize="small">136</text>
</td>
<td align="center">
<text fontsize="small">-</text>
</td>
<td align="center" border="r">
<text fontsize="small">24</text>
</td>
</tr>
<tr>
<td align="center" border="l" thead="true">
<text fontsize="small">Kelkoo</text>
</td>
<td align="center" border="r" thead="true">
<text fontsize="small">40%</text>
</td>
<td align="center">
<text fontsize="small">60</text>
</td>
<td align="center">
<text fontsize="small">4</text>
</td>
<td align="center" border="r">
<text fontsize="small">-</text>
</td>
<td align="center">
<text fontsize="small">58</text>
</td>
<td align="center">
<text fontsize="small">-</text>
</td>
<td align="center" border="r">
<text fontsize="small">2</text>
</td>
</tr>
<tr>
<td align="center" border="l" thead="true">
<text fontsize="small">Techcrunch</text>
</td>
<td align="center" border="r" thead="true">
<text fontsize="small">85%</text>
</td>
<td align="center">
<text fontsize="small">52</text>
</td>
<td align="center">
<text fontsize="small">-</text>
</td>
<td align="center" border="r">
<text fontsize="small">28</text>
</td>
<td align="center">
<text fontsize="small">80</text>
</td>
<td align="center">
<text fontsize="small">-</text>
</td>
<td align="center" border="r">
<text fontsize="small">-</text>
</td>
</tr>
<tr>
<td align="center" border="l t" thead="true">
<text fontsize="small">Total</text>
</td>
<td align="center" border="r t" thead="true">
<text fontsize="small">-</text>
</td>
<td align="center" border="t">
<text fontsize="small">1356</text>
</td>
<td align="center" border="t">
<text fontsize="small">92</text>
</td>
<td align="center" border="r t">
<text fontsize="small">140</text>
</td>
<td align="center" border="t">
<text fontsize="small">1454</text>
</td>
<td align="center" border="t">
<text fontsize="small">12</text>
</td>
<td align="center" border="r t">
<text fontsize="small">42</text>
</td>
</tr>
<tr>
<td align="center" border="l tt" thead="true">
<text fontsize="small">Recall</text>
</td>
<td align="center" border="r tt" thead="true">
<text fontsize="small">-</text>
</td>
<td align="center" border="r tt" colspan="3">
<text fontsize="small">90.64%</text>
</td>
<td align="center" border="r tt" colspan="3">
<text fontsize="small">97.19%</text>
</td>
</tr>
<tr>
<td align="center" border="l" thead="true">
<text fontsize="small">Precision</text>
</td>
<td align="center" border="r" thead="true">
<text fontsize="small">-</text>
</td>
<td align="center" border="r" colspan="3">
<text fontsize="small">93.65%</text>
</td>
<td align="center" border="r" colspan="3">
<text fontsize="small">99.18%</text>
</td>
</tr>
<tr>
<td align="center" border="l" thead="true">
<text fontsize="small">F-Measure</text>
</td>
<td align="center" border="r" thead="true">
<text fontsize="small">-</text>
</td>
<td align="center" border="r" colspan="3">
<text fontsize="small">92.13%</text>
</td>
<td align="center" border="r" colspan="3">
<text fontsize="small">98.18%</text>
</td>
</tr>
<tr>
<td align="center" border="l t" thead="true"/>
<td border="t" thead="true"/>
<td border="t"/>
<td border="t"/>
<td border="t"/>
<td border="t"/>
<td border="t"/>
<td border="t"/>
</tr>
</tbody>
</tabular>
<toccaption>
<tag close=" ">
<text fontsize="small">1</text>
</tag>
<text fontsize="small">Evaluation of the reliability of automatically adaptable wrappers applied to real world scenarios.</text>
</toccaption>
<caption fontsize="small"><tag close=": ">Table 1</tag>Evaluation of the reliability of automatically adaptable wrappers applied to real world scenarios.</caption>
</table>
<para xml:id="S5.p2">
<p>Performances obtained using the <emph>simple</emph> and the <emph>clustered</emph> tree matching are, respectively, good and excellent; <emph>clustered tree matching</emph> definitely is a viable solution to automatically adapt wrappers with a high degree of reliability (F-Measure greater than 98%).
This system provides also the possibility of improving these results including additional checks on <emph>comparable attributes</emph> (e.g. <emph>id</emph>, <emph>name</emph> or <emph>class</emph>). The role of the required accuracy degree is fundamental; experimental results help to highlight the following considerations: very high values of threshold could result in false negatives (e.g. Google news and Google.com scenarios), while low values could result in false positives (e.g. the Facebook scenario). Our solution exploiting the <emph>clustered tree matching</emph> algorithm, designed by us, helps to reduce wrapper maintenance tasks, keeping in mind that, in cases of deep structural changes, it could be required to manually intervene to fix a specific wrapper, since it is impossible to automatically face all the possible malfunctionings.
<!-- %**** cr-icaart2011.tex Line 275 **** --></p>
</para>
<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
</section>
<section refnum="6" xml:id="S6">
<title><tag close=" ">6</tag>CONCLUSION</title>
<para xml:id="S6.p1">
<p>In this paper we described several novel ideas, investigating the possibility of designing smart Web wrappers which automatically react to structural modifications of underlying Web pages and adapting themselves to avoid malfunctionings or corrupting extracted data.
After explaining the core algorithms on which this system relies, we shown how to implement this feature in Web wrappers.
Finally, we analyzed performances of this system through a rigorous testing of the behavior of automatically adaptable wrappers in real world use-cases.</p>
</para>
<para xml:id="S6.p2">
<p>This work opens new scenarios on wrapper adaptation techniques and is liable to several improvements: first of all, avoiding some limitations of the matching algorithms, for example the inability of handling permutations on nodes previously explained, with computationally efficient solutions could be important to improve the robustness of wrappers.
One limitation of adopted tree matching algorithms is also that they do not work very well if new levels of nodes are added or node levels are removed.
We already investigated the possibility of adopting different tree similarity algorithms, working better in such cases.
We could try to “generalize” other similarity metrics on strings, such as the n-gram distance and the Jaro-Winkler distance.
Implementing these two metrics do not require dynamic programming and might be computationally efficient; in particular, variants of the Bigram distance on trees might work well with permutations of groups of nodes and the Jaro-Winkler distance could better reflect missing or added node levels.
Another idea is investigating the possibility of improving matching criteria including additional information to be compared during the tree match up process (e.g. full path information, all attributes, etc.); then, exploiting logic-based rules (e.g. regular expressions, string edit distance, and so on) to analyze textual properties.</p>
</para>
<para xml:id="S6.p3">
<p>Finally, the tree-grammar, already exploited to store a light-weight signature of the structure of elements identified by the wrapper, could be extended for classifying topologies of templates frequently shown by Web pages, in order to define <emph>standard protocols</emph> of automatic adaptation in these particular contexts.
Adaptation in the deep web navigation is a different topic than adaptation on a particular page, but also extremely important for wrapper adaptation.
Future work will comprise to investigate focused spidering techniques: instead of explicit modeling of a work flow on a Web page (form fill-out, button clicks, etc.) we develop a tree-grammar based approach that decides for a given Web page which template it matches best and executes the data extraction rules defined for this template.
Navigation steps are carried out implicitly by following all links and DOM events that have been defined as interesting, crawling a site in a focused way to find the relevant information.</p>
</para>
<para xml:id="S6.p4">
<p>Concluding, the system of designing automatically adaptable wrappers described in this paper has been proved to be robust and reliable.
The <emph>clustered tree matching</emph> algorithm is very extensible and it could be adopted for different tasks, also not strictly related to Web wrappers (e.g. operations that require matching up trees could exploit this algorithm).</p>
</para>
<!-- %**** cr-icaart2011.tex Line 300 **** -->
</section>
<bibliography bibstyle="apalike" citestyle="numbers" files="biblio" xml:id="bib">
<title fontsize="small">References</title>
</bibliography>
</document>
\documentclass[a4paper,twoside]{article}
\usepackage{amssymb}
\usepackage{amstext}
\usepackage{amsmath}
\usepackage{multicol}
\usepackage{pslatex}
\usepackage{apalike}
\usepackage{fancyhdr}
\usepackage{scitepress}
\usepackage[small]{caption}
\usepackage{graphicx}
\usepackage{algorithm}
\usepackage[noend]{algorithmic}
\usepackage{qtree}
\usepackage{color}
\usepackage[utf8]{inputenc} % set input encoding (not needed with XeLaTeX)
\usepackage[table]{xcolor}
\begin{document}
\onecolumn \maketitle \normalsize \vfill
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{\uppercase{Introduction}}
\noindent The World Wide Web today contains an exterminated amount of information, mostly unstructured, under the form of Web pages, but also documents of various nature.
During last years big efforts have been conducted to develop techniques of information extraction on top of the Web.
Approaches adopted spread in several fields of Mathematics and Computer Science, including, for example, logic-programming and machine learning.
Several projects, initially developed in academic settings, evolved in commercial products, and it is possible to identify different methodologies to face the problem of Web data extraction.
A widely adopted approach is to define \emph{Web wrappers}, procedures relying on analyzing the structure of HTML Web pages (i.e. DOM tree) to extract required information.
Wrappers can be defined in several ways, e.g. most advanced tools let users to design them in a visual way, for example selecting elements of interest in a Web page and defining rules for their extraction and validation, semi-automatically; regardless of their generation process, wrappers intrinsically refer to the HTML structure of the Web page at the time of their creation.
Thus, introducing not negligible problems of robustness, wrappers could fail in their tasks of data extraction if the underlying structure of the Web page changes, also slightly.
Moreover, it could happens that the process of extraction does not fail but extracted data are corrupted.
All these aspects clarify the following scenarios: during their definition, wrappers should be as much elastic as possible, in order to intrinsically handle minor modifications on the structure of Web pages (this kind of small local changes are much more frequent than heavy modifications); although elastic wrappers could efficiently react to minor changes, maintenance is required for the whole wrapper life-cycle.
Wrapper maintenance is expensive because it requires highly qualified personnel, specialized in defining wrappers, to spend their time in rebuilding or fixing wrappers whenever they stop working properly.
For improving this aspect, several commercial tools include notification features, reporting warnings or errors during wrappers execution.
Moreover, to increase their reliability, data extracted by wrappers could be subject to validation processes, and also data cleaning is a fundamental step; some tools provide caching services to store the last working copy of Web pages involved in data extraction processes.
Sometimes, it is even more convenient to rewrite \emph{ex novo} a wrapper, instead of trying to find causes of malfunctioning and fixing them, because debugging wrapper executions can be not trivial.
The unpredictability of what changes will occur in a specific Web page and, consequently, the impossibility to establish when a wrappers will stop working properly, requires a smart approach to wrapper maintenance.
Our purpose in this paper is to describe the realization and to investigate performances of an automatic process of wrapper adaptation to structural modifications of Web pages.
We designed and implemented a system relying on the possibility of storing, during the wrapper definition step, a \emph{snapshot} of the DOM tree of the original Web page, namely a \emph{tree-gram}.
If, during the wrapper execution, problems occur, this sample is compared to the new DOM structure, finding similarities on trees and sub-trees, to automatically try adaptating the wrapper with a custom degree of accuracy.
Briefly, the paper is structured as follows: Section 2 focuses on related work, covering the literature about wrapper generation and adaptation.
In Section 3 we explain some concepts related to the tree similarity algorithm implemented, to prove the correctness of our approach. Section 4 shows details about our implementation of the automatic wrapper adaptation. Most important results, obtained by our experimentation, are reported in Section 5. Finally, Section 6 concludes providing some remarks for future work.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{\uppercase{Related Work}}
\noindent The concept of analyzing similarities between trees, widely adopted in this work, was introduced by Tai \cite{Tai1979}; he defined the notion of \emph{distance} between two trees as the measure of the dissimilarity between them.
The problem of transforming trees into other similar trees, namely \emph{tree edit distance}, can be solved applying elementary transformations to nodes, step-by-step.
The minimum cost for this operation represents the tree edit distance between the two trees.
This technique shows high computational requirements and complex implementations \cite{Bille2005}, and do not represents the optimal solution to our problem of finding similarities between two trees.
The \emph{simple tree matching} technique \cite{StanleyM.Selkow1977} represents a turning point: it is a light-weight recursive top-down algorithm which evaluates position of nodes to measure the degree of isomorphism between two trees, analyzing and comparing their sub-trees.
Several improvements to this technique have been suggested: Ferrara and Baumgartner \cite{Ferrara2010}, extending the concept of weights introduced by Yang \cite{Yang1991}, developed a variant of this algorithm with the capability of discovering clusters of similar sub-trees.
An interesting evaluation of the simple tree matching and its weighed version, brought by Kim et al. \cite{Kim2007}, was performed exploiting these two algorithms for extracting information from HTML Web pages; we found their achievements very useful to develop automatically adaptable wrappers.
Web data extraction and adaptation rely especially on algorithms working with DOM trees.
Related work, in particular regarding Web wrappers and their maintenance, is intricate: Laender et al. \cite{Laender2002} presented a taxonomy of wrapper generation methodologies, while Ferrara et al. \cite{Baumgartner2010} discussed a comprehensive survey about techniques and fields of application of Web data extraction and adaptation.
Some novel wrapper adaptation techniques have been introduced during last years: a valid hybrid approach, mixing logic-based and grammar rules, has been presented by Chidlovskii \cite{Chidlovskii2001}.
Also machine-learning techniques have been investigated, e.g. Lerman et al. \cite{Lerman2003} exploited their know-how in this field to develop a system for wrapper verification and re-induction.
Meng et al. \cite{20} developed the SG-WRAM (Schema-Guided WRApper Maintenance), for wrapper maintenance, starting from the observation that, changes in Web pages, even substantial, always preserve syntactic features (i.e. syntactic characteristics of data items like data patterns, string lengths, etc.), hyperlinks and annotations (e.g. descriptive information representing the semantic meaning of a piece of information in its context).
This system has been implemented in their Web data extraction platform: wrappers are defined providing both HTML Web pages and their XML schemes, describing a mappings between them.
When the system executes the wrapper, data are extracted under the XML format reflecting the previously specified XML Schema; the wrapper maintainer verifies any issue and, eventually, provides protocols for the automatic adaptation of the problematic wrapper.
The XML Schema is a DTD (Document Type Definition) while the HTML Web page is represented by its DOM tree.
The framework described by Wong and Lam \cite{Wong} performs the adaptation of wrappers previously learned, applying them to Web pages never seen before; they assert that this platform is also able to discover, eventually, new attributes in the Web page, using a probabilistic approach, exploiting the extraction knowledge acquired through previous wrapping tasks.
Also Raposo et al. \cite{Raposo2005} suggested the possibility of exploiting previously acquired information, e.g. results of queries, to ensure a reliable degree of accuracy during the wrapper adaptation process.
Concluding, Kowalkiewicz et al. \cite{Kowalkiewicz2006} investigate the possibility of increasing the robustness of wrappers based on the identification of HTML elements, inside Web pages, through their XPath, adopting relative XPath, instead of absolute ones.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{\uppercase{Matching Up HTML Trees}}
\noindent Our idea of automatic adaptation of wrappers can be explained as follows: first of all, outlining how to extract information from Web pages (i.e. in our case, how a Web wrapper works); then, describing how it is possible to recover information previously extracted from a different Web page (i.e. how to compare structural information between the two versions of the Web page, finding similarities); finally, defining how to automatize this process (i.e. how to build reliable, robust automatically adaptable wrappers).
Our solution has been implemented in a commercial product \footnote{Lixto Suite, www.lixto.com}; Baumgartner et al. \cite{baumgartner2009scalable} described details about its design.
This platform provides tools to design Web wrappers in a visual way, selecting elements to be extracted from Web pages.
During the wrapper execution, selected elements, identified through their XPath(s) in the DOM tree of the Web page, are automatically extracted.
Although the wrapper design process lets users to define several restricting or generalizing conditions to build wrappers as much elastic as possible, wrappers are strictly interconnected with the structure of the Web page on top of they are built.
Usually, also slight modifications to this structure could alter the wrapper execution or corrupt extracted data.
In this section we discuss some theoretical foundations on which our solution relies; in details, we show an efficient algorithm to find similar elements within different Web pages.
%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Methodology}
\noindent A simple measure of similarity between two trees, once defined their comparable elements, can be established applying the \emph{simple tree matching} algorithm \cite{StanleyM.Selkow1977}, introduced in Section 2.
We define \emph{comparable elements} among HTML Web pages, nodes, representing HTML elements (or, otherwise, free text) identified by tags, belonging to the DOM tree of these pages.
Similarly, we intend for \emph{comparable attributes} all the attributes, both generic (e.g. \emph{class}, \emph{id}, etc.) and type-specific (e.g. \emph{href} for anchor texts, \emph{src} for images, etc.), shown by HTML elements; it is possible to exploit these properties to introduce additional comparisons to refine the similarity measure.
Several implementations of the \emph{simple tree matching} have been proposed; our solution exploits an improved version, namely \emph{clustered tree matching} \cite{Baumgartner2010}, designed to match up HTML trees, identifying clusters of sub-trees with similar structures, satisfying a custom degree of accuracy.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Tree Matching Algorithms}
\noindent Previous studies proved the effectiveness of the \emph{simple tree matching} algorithm applied to Web data extraction tasks \cite{Kim2007,19}; it measures the similarity degree between two HTML trees, producing the maximum matching through dynamic programming, ensuring an acceptable compromise between precision and recall.
As improvement to this algorithm, this is a possible implementation of \emph{clustered tree matching}: let \emph{d(n)} to be the degree of a node \emph{n} (i.e. the number of first-level children); let T(i) to be the i-\emph{th} sub-tree of the tree rooted at node T; let $t(n)$ to be the number of total siblings of a node \emph{n} including itself.
\begin{algorithm}
\caption{ClusteredTreeMatching($T^{'}$, $T^{''}$)}
\label{alg1}
\begin{algorithmic}[1]
\IF{$T^{'}$ has the same label of $T^{''}$}
\STATE $m \leftarrow$ $d(T^{'})$
\STATE $n \leftarrow$ $d(T^{''})$
\FOR{$i = 0$ to $m$}
\STATE $M[i][0] \leftarrow 0$;
\ENDFOR
\FOR{$j = 0$ to $n$}
\STATE $M[0][j] \leftarrow 0$;
\ENDFOR
\FORALL{$i$ such that $1\leq i\leq m$}
\FORALL{$j$ such that $1\leq j \leq n$}
\STATE $M[i][j] \leftarrow$ Max($M[i][j-1]$, $M[i-1][j]$, $M[i-1][j-1] + W[i][j]$) where $W[i][j]$ = ClusteredTreeMatching($T^{'}(i-1)$, $T^{''}(j-1)$)
\ENDFOR
\ENDFOR
\IF{$m > 0$ AND $n > 0$}
\STATE return M[m][n] * 1 / Max($t(T^{'})$, $t(T^{''})$)
\ELSE
\STATE return M[m][n] + 1 / Max($t(T^{'})$, $t(T^{''})$)
\ENDIF
\ELSE
\STATE return 0
\ENDIF
\end{algorithmic}
\end{algorithm}
\begin{figure*}
\small (A)
\Tree [.{a\\\small N1} [.{b\\\small N2\\$\frac{1}{4}\cdot$\\\small (N6+N7)} [.{d\\\small N6\\ $\frac{1}{2}$} ] [.{e\\\small N7\\ $\frac{1}{2}$} ] ] [.{c\\\small N3\\$\frac{1}{4}\cdot$\small N8} [.{f\\\small N8\\ $\frac{1}{2}$} ] ] [.{b\\\small N4\\$\frac{1}{4}\cdot$\\\small (N9+N10)} [.{e\\\small N9\\ $\frac{1}{2}$} ] [.{d\\\small N10\\ $\frac{1}{2}$} ] ] [.{c\\\small N5\\$\frac{1}{4}\cdot$\small N11} [.{g\\\small N11\\$\frac{1}{2}\cdot$\\\small (N12+N13+N14)} [.{h\\\small N12\\ $\frac{1}{3}$} ] [.{i\\\small N13\\ $\frac{1}{3}$} ] [.{j\\\small N14\\ $\frac{1}{3}$} ] ] ] ]
\small (B)
\Tree [.{a\\\small N15} [.{b\\\small N16\\$\frac{1}{4}\cdot$\\\small (N18+N19)} [.{d\\\small N18\\ $\frac{1}{2}$} ] [.{e\\\small N19\\ $\frac{1}{2}$} ] ] [.{c\\\small N17\\$\frac{1}{4}\cdot$\\\small (N20+N21)} [.{g\\\small N20\\$\frac{1}{2}\cdot$\small N22} [.{h\\\small N22\\ $\frac{1}{3}$} ] ] [.{f\\\small N21\\ $\frac{1}{2}$} ] ] ]
\caption{Two labeled trees, \emph{A} and \emph{B}, which show similarities in their structures.}
\label{fig1}
\end{figure*}
\noindent The main difference between the \emph{simple} and the \emph{clustered} tree matching is the way of assigning values to matching elements. The first, adopts a fixed matching value of 1; the latter, instead, computes some additional information, retrieved in the sub-trees of matched nodes.
Omitting detail, provided in \cite{Baumgartner2010}, the \emph{clustered tree matching} algorithm assigns a weighted value equal to 1, divided by the greater number of siblings, computed between the two compared nodes (also considering themselves).
Figure \ref{fig1} shows two similar simple rooted, labeled trees, and the way of assignment of weights that would be calculated by applying the \emph{clustered tree matching} between them.
\begin{figure*}[!ht]%
\begin{center}
\includegraphics[width=360px]{graph.png}%
\caption{State diagram of Web wrappers design and adaptation in the Lixto Visual Developer.}%
\label{diagram}%
\end{center}
\end{figure*}
\subsubsection{Motivations}
\noindent Several motivations lead us to bring these improvements.
For example, considering common characteristics shown by Web pages, provides some useful tips: usually, rich sub-levels (i.e. sub-levels with several nodes) represent list items, table rows, menu, etc., more frequently affected by modifications than other elements of Web pages; moreover, analyzing which kind of modifications usually affect Web pages suggests to assign less importance to slight changes happening in deep sub-levels of the DOM tree, this because these are commonly related to missing/added details to elements, etc.
On the one hand, \emph{simple tree matching} ignores these important aspects, on the other \emph{clustered tree matching} exploits information like position and number of mismatches to produce a more accurate result.
\subsubsection{Advantages and limitations}
\noindent The main advantage of our \emph{clustered tree matching} is its capability to calculate an absolute measure of similarity, while \emph{simple tree matching} produces the mapping value between two trees.
Moreover, the more the structure of considered trees is complex and similar, the more the measure of similarity established by this algorithm will be accurate.
It fits particularly well to matching up HTML Web pages, this because they often own rich and variegated structures.
One important limitation of algorithms based on the tree matching is that they can not match permutations of nodes.
Intuitively, this happens because of the dynamic programming technique used to face the problem with computational efficiency; both the algorithms execute recursive calls, scanning sub-trees in a sequential manner, so as to reduce the number of required iterations (e.g. in Figure \ref{fig1}, permutation of nodes [c,b] in \emph{A} with [b,c] in \emph{B} is not computed).
It is possible to modify the algorithm introducing the analysis of permutations of sub-trees, but this would heavily affect performances.
Despite this intrinsic limit, this technique appears to fit very well to our purpose of measuring HTML trees similarity.
It is important to remark that, applying \emph{simple tree matching} to compare simple and quite different trees will produce a more accurate result.
Despite that, because of the most of modifications in Web pages are usually slight changes, \emph{clustered tree matching} is far and away the best algorithm to be adopted in building automatically adaptable wrappers.
Moreover, this algorithm makes it possible to establish a custom level of accuracy of the process of matching, defining a similarity threshold required to match two trees.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{\uppercase{Adaptable Web Wrappers}}
\noindent Based on the adaptation algorithms described above, a proof-of-concept extension to the Lixto Visual Developer (VD) has been implemented.
Wrappers are automatically adapted based on given configuration settings, integrity constraints, and triggers.
Usually, wrapper generation in VD is a hierarchical top-down process, e.g. first, a ``hotel record'' is characterized, and inside the hotel record, entities such as ``rating'' and ``room types''.
Such entities are referred to as \emph{patterns}.
To define a pattern, the wrapper designer visually selects an example and together with system suggestions generalizes the rule configuration until the desired instances are matched.
In this extension, to support the automatic adaptation process during runtime, the wrapper designer further specifies what it means that extraction failed.
In general, this means wrong or missing data, and with integrity constraints one can give indications how correct results look like.
Typical \emph{integrity constraints} are:
\begin{itemize}
\item \emph{Occurrence restrictions}: e.g. minimum and/or maximum number of allowed occurrence of a pattern instance, minimum and/or maximum number of child pattern instances;
\item \emph{Data types}: e.g. all results of a ``price'' pattern need to be of data type integer.
\end{itemize}
\noindent Integrity constraints can be specified with each pattern individually or be based on a data model (in our case, a given XML Schema).
In case integrity constraints are violated during runtime, the adaptation process for this particular pattern is started.
During wrapper creation, the application designer provides a number of configuration settings to this process.
This includes:
\begin{itemize}
\item Threshold values;
\item Priorities/order of adaptation algorithms used;
\item Flags of the chosen algorithm (e.g. using HTML element name as node label, using id/class attributes as node labels, etc.);
\item Triggers for bottom-up, top-down and process flow adaptation bubbling;
\item Whether stored tree-grams and XPath statements are updated based on adaptation results to be additionally used as inputs in future adaptation procedures (reflecting and addressing regular slight changes of a Web page over time).
\end{itemize}
\noindent Used algorithms for adaptations rely on two inputs (stored example tree-gram(s), DOM tree of current page) and provide as output sub-trees that are sufficiently similar to the original (example) ones, and in consequence a generated XPath statement that matches the nodes (Fig. \ref{diagram} summarizes the process from design time and execution time perspective).
Algorithms under consideration include the clustered tree matching discussed above, as well as tree-based variants of the Bigram \cite{Collins1996bigram} and Jaro-Winkler similarity \cite{winkler1999state} (which are of advantage when one assumes that permutations in the tree nodes are likely over time).
Moreover, for extraction of leaf nodes which exhibit no inherent tree structure, we rely on string similarity metrics.
Finally, triggers in adaptation settings can be used to force adaptation of further fragments of the wrapper:
\begin{itemize}
\item Top-down: forcing adaptation of all/some descendant patterns (e.g. adapt the ``price'' pattern as well to identify prices within a record if the ``record'' pattern was adapted).
\item Bottom-up: forcing adaptation of a parent pattern in case adaptation of a particular pattern was not successful. Experimental evaluation pointed out that in such cases it is often the problem that the parent pattern already provides wrong or missing results (even if matched by the integrity constraints) and has to be adapted first.
\item Process flow: it might happen that particular patterns are no longer detected because the wrapper evaluates on the wrong page. Hence, there is the need to use variations in the deep web navigation processes. A simple approach explored at this time is to use a switch window or back step action to check if the previous window or another tab/pop-up provides the required information.
\end{itemize}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{\uppercase{Experimental Results}}
\noindent The best way of measuring reliability of automatically adaptable wrappers is to test their behavior in real world use-cases.
Several common areas of application of Web wrappers have been identified: social networks and bookmarking, retail market and comparison shopping, Web search and information distribution, and, finally, Web communities.
For each of these fields, we designed a test using a representative Website, studying a total of 7 use-cases, defining wrappers applied to 70 Web pages.
Websites like Facebook, Google News, Ebay, etc. are usually subjected to countless, although often invisible, structural modifications; thus, altering the correct behavior of Web wrappers.
Table \ref{tab-res} summarizes results: each wrapper automatically tries to adapt itself using both the algorithms described in Section 3.
Column referred as \emph{thresh.} means the threshold value of similarity required to match two elements. Columns \emph{tp}, \emph{fp} and \emph{fn} represent true and false positive, and false negative, measures usually adopted to evaluate precision and recall of these kind of tasks.
\begin{table}[!ht]
\small
\begin{tabular}{|@{}c@{} c@{}|@{}c c c@{}|@{}c c c@{}|}
\cline{3-8}
\multicolumn{2}{r|}{} & \multicolumn{3}{c|}{Simple T. M.} & \multicolumn{3}{c|}{Clustered T. M.} \\
\cline{3-8}
\multicolumn{2}{r|}{} & \multicolumn{3}{c|}{Precision/Recall} & \multicolumn{3}{c|}{Precision/Recall}\\
\cline{3-8}
\noalign{\smallskip}
\hline
Scenario & thresh. & tp & fp & fn & tp & fp & fn \\
\hline
Delicious & 40\% & 100 & 4 & - & 100 & - & - \\
Ebay & 85\% & 200 & 12 & - & 196 & - & 4 \\
Facebook & 65\% & 240& 72 & - & 240&12 & - \\
Google news & 90\% & 604 & - & 52 & 644 & - & 12\\
Google.com & 80\% & 100 & - & 60 & 136 & - & 24 \\
Kelkoo & 40\% & 60 & 4 & - & 58 & - & 2 \\
Techcrunch & 85\% & 52 & - & 28 & 80 & - & - \\
\hline
Total & - & 1356 & 92 & 140 & 1454 & 12 & 42\\
\hline
\hline
Recall & - & \multicolumn{3}{c|}{90.64\%} & \multicolumn{3}{c|}{97.19\%}\\
Precision & - & \multicolumn{3}{c|}{93.65\%} & \multicolumn{3}{c|}{99.18\%}\\
F-Measure & - & \multicolumn{3}{c|}{92.13\%} & \multicolumn{3}{c|}{98.18\%}\\
\hline
\end{tabular}
\caption{Evaluation of the reliability of automatically adaptable wrappers applied to real world scenarios.}
\label{tab-res}
\end{table}
\noindent Performances obtained using the \emph{simple} and the \emph{clustered} tree matching are, respectively, good and excellent; \emph{clustered tree matching} definitely is a viable solution to automatically adapt wrappers with a high degree of reliability (F-Measure greater than 98\%).
This system provides also the possibility of improving these results including additional checks on \emph{comparable attributes} (e.g. \emph{id}, \emph{name} or \emph{class}). The role of the required accuracy degree is fundamental; experimental results help to highlight the following considerations: very high values of threshold could result in false negatives (e.g. Google news and Google.com scenarios), while low values could result in false positives (e.g. the Facebook scenario). Our solution exploiting the \emph{clustered tree matching} algorithm, designed by us, helps to reduce wrapper maintenance tasks, keeping in mind that, in cases of deep structural changes, it could be required to manually intervene to fix a specific wrapper, since it is impossible to automatically face all the possible malfunctionings.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{\uppercase{Conclusion}}
\noindent In this paper we described several novel ideas, investigating the possibility of designing smart Web wrappers which automatically react to structural modifications of underlying Web pages and adapting themselves to avoid malfunctionings or corrupting extracted data.
After explaining the core algorithms on which this system relies, we shown how to implement this feature in Web wrappers.
Finally, we analyzed performances of this system through a rigorous testing of the behavior of automatically adaptable wrappers in real world use-cases.
This work opens new scenarios on wrapper adaptation techniques and is liable to several improvements: first of all, avoiding some limitations of the matching algorithms, for example the inability of handling permutations on nodes previously explained, with computationally efficient solutions could be important to improve the robustness of wrappers.
One limitation of adopted tree matching algorithms is also that they do not work very well if new levels of nodes are added or node levels are removed.
We already investigated the possibility of adopting different tree similarity algorithms, working better in such cases.
We could try to ``generalize'' other similarity metrics on strings, such as the n-gram distance and the Jaro-Winkler distance.
Implementing these two metrics do not require dynamic programming and might be computationally efficient; in particular, variants of the Bigram distance on trees might work well with permutations of groups of nodes and the Jaro-Winkler distance could better reflect missing or added node levels.
Another idea is investigating the possibility of improving matching criteria including additional information to be compared during the tree match up process (e.g. full path information, all attributes, etc.); then, exploiting logic-based rules (e.g. regular expressions, string edit distance, and so on) to analyze textual properties.
Finally, the tree-grammar, already exploited to store a light-weight signature of the structure of elements identified by the wrapper, could be extended for classifying topologies of templates frequently shown by Web pages, in order to define \emph{standard protocols} of automatic adaptation in these particular contexts.
Adaptation in the deep web navigation is a different topic than adaptation on a particular page, but also extremely important for wrapper adaptation.
Future work will comprise to investigate focused spidering techniques: instead of explicit modeling of a work flow on a Web page (form fill-out, button clicks, etc.) we develop a tree-grammar based approach that decides for a given Web page which template it matches best and executes the data extraction rules defined for this template.
Navigation steps are carried out implicitly by following all links and DOM events that have been defined as interesting, crawling a site in a focused way to find the relevant information.
Concluding, the system of designing automatically adaptable wrappers described in this paper has been proved to be robust and reliable.
The \emph{clustered tree matching} algorithm is very extensible and it could be adopted for different tasks, also not strictly related to Web wrappers (e.g. operations that require matching up trees could exploit this algorithm).
\renewcommand{\baselinestretch}{0.98}
\bibliographystyle{apalike}
{\small
\bibliography{biblio}}
\renewcommand{\baselinestretch}{1}
\end{document}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment