Towards completely automatized HTML form discovery on the web

Moraes, Maurício Coutinho

dc.contributor.advisor	Heuser, Carlos Alberto	pt_BR
dc.contributor.author	Moraes, Maurício Coutinho	pt_BR
dc.date.accessioned	2013-04-11T01:47:42Z	pt_BR
dc.date.issued	2013	pt_BR
dc.identifier.uri	http://hdl.handle.net/10183/70194	pt_BR
dc.description.abstract	The discovery of HTML forms is one of the main challenges in Deep Web crawling. Automatic solutions for this problem perform two main tasks. The first is locating HTML forms on the Web, which is done through the use of traditional/focused crawlers. The second is identifying which of these forms are indeed meant for querying, which also typically involves determining a domain for the underlying data source (and thus for the form as well). This problem has attracted a great deal of interest, resulting in a long list of algorithms and techniques. Some methods submit requests through the forms and then analyze the data retrieved in response, typically requiring a great deal of knowledge about the domain as well as semantic processing. Others do not employ form submission, to avoid such difficulties, although some techniques rely to some extent on semantics and domain knowledge. We offer an up-to-date review of 19 methods for the discovery of domain-specific query forms that do not involve form submission. This thesis details these methods and discusses how form discovery has become increasingly automated over time, providing the context in which we propose a novel method to advance the current state of the art in domain-specific structured HTML form discovery. The current state of the art in domain-specific structured HTML form discovery consists mainly of methods that directly or indirectly depend heavily on human intervention. This thesis proposes and evaluates a method capable of discovering domain-specific structured HTML forms on the Web with very little effort from a human expert, who is required only to define the name of the domain of interest (i.e., the domain for which the discovery should be made). The forms discovered by our proposal can be directly used as training data by some form classifiers.
Our experimental validation used thousands of real Web forms, divided into six domains, including a representative subset of the publicly available DeepPeep form base (DEEPPEEP, 2010; DEEPPEEP REPOSITORY, 2011). Our results show that it is feasible to mitigate the demanding manual work required by two cutting-edge form classifiers (i.e., GFC and DSFC (BARBOSA; FREIRE, 2007a)), at the cost of a relatively small loss in effectiveness.	en
dc.format.mimetype	application/pdf	pt_BR
dc.language.iso	eng	pt_BR
dc.rights	Open Access	en
dc.subject	Recuperacao : Informacao	pt_BR
dc.subject	Deep web	en
dc.subject	Hidden web	en
dc.subject	HTML (Linguagem de marcação)	pt_BR
dc.subject	Crawling	en
dc.subject	Serviços Web	pt_BR
dc.subject	Domain-specific search	en
dc.subject	Banco : Dados	pt_BR
dc.subject	Query form discovery	en
dc.title	Towards completely automatized HTML form discovery on the web	pt_BR
dc.type	Tese	pt_BR
dc.contributor.advisor-co	Moreira, Viviane Pereira	pt_BR
dc.identifier.nrb	000875012	pt_BR
dc.degree.grantor	Universidade Federal do Rio Grande do Sul	pt_BR
dc.degree.department	Instituto de Informática	pt_BR
dc.degree.program	Programa de Pós-Graduação em Computação	pt_BR
dc.degree.local	Porto Alegre, BR-RS	pt_BR
dc.degree.date	2013	pt_BR
dc.degree.level	doutorado	pt_BR

Files in this item

Name:: 000875012.pdf
Size:: 855.0Kb
Format:: PDF
Description:: Full text (English)

This item is licensed under a Creative Commons License
