Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data

摘要：

Inferring an appropriate DTD or XML Schema Denition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, there is no algorithm capable of learning the complete class of deter-ministic regular expressions from positive examples only, as we will show. The regular expressions occurring in practical DTDs and XSDs, however, are such that every alphabet symbol occurs only a small number of times. As such, in practice it suces to learn the subclass of regular expressions in which each alphabet symbol occurs at most k times, for some small k. We refer to such expressions as k-occurrence regular expressions (k-OREs for short). Motivated by this observation, we provide a probabilistic algorithm that learns k-OREs for increasing values of k, and selects the one that best describes the sample based on a Minimum Description Length argument. The effctiveness of the method is empirically validated both on real world and synthetic data. Fur-thermore, the method is shown to be conservative over the simpler classes of expressions considered in previous work.

关键词： XML regular expressions schema inference

作者: Geert Jan Bex Wouter Gelade Frank Neven Stijn Vansummereny

作者单位: Hasselt University/Transnational University of Limburg, Belgium

会议类型: 国际会议

会议名称: 第十七届国际万维网大会(the 17th International World Wide Web Conference)(WWW08)

会议地点: 北京

会议语种:英文

在线出版日期: 2008-04-21（万方平台首次上网日期，不代表论文的发表时间）

会议专题

Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data