Page Query Language Generation for Structural Extraction
Information on the Web is usually designed to be understood by human users rather than by machines, so it is not easy to catalogue and extract Web information automatically with a software agent alone. Based on these observations, we present an approach that uses human-guided operations to automatically generate a PQL query, which extracts the information fragments of interest from Web pages. PQL is an SQL-like query language for Web pages, and a PQL query uses XPath expressions to locate the target HTML nodes. We develop a K-Medoid clustering algorithm that processes PQL queries to generate the structural extractions. The extracted information is structured as a relational table (in CSV format), which can be manipulated easily with spreadsheet software or a relational DBMS.
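The XPath-to-table step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and it does not use PQL syntax, which this record does not give. It assumes a well-formed page fragment, and the page content, the helper names `extract_rows` and `to_csv`, and the column names are all hypothetical. The K-Medoid clustering step is omitted.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment. Real-world HTML is often not
# valid XML and would need a tolerant HTML parser instead.
PAGE = """<html><body>
<table id="books">
  <tr><td>Database Systems</td><td>59.00</td></tr>
  <tr><td>Web Data Mining</td><td>72.50</td></tr>
</table>
</body></html>"""


def extract_rows(page, row_xpath, column_xpaths):
    """Locate record nodes with row_xpath, then one cell per column XPath."""
    root = ET.fromstring(page)
    rows = []
    for node in root.findall(row_xpath):
        row = []
        for xp in column_xpaths:
            cell = node.find(xp)
            # A missing cell becomes an empty string so rows stay rectangular.
            row.append("" if cell is None else "".join(cell.itertext()).strip())
        rows.append(row)
    return rows


def to_csv(header, rows):
    """Serialize the extracted rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


# ElementTree supports only a subset of XPath; these expressions stay within it.
rows = extract_rows(PAGE, ".//table[@id='books']/tr", ["td[1]", "td[2]"])
csv_text = to_csv(["title", "price"], rows)
print(csv_text)
```

The resulting CSV text can be opened in spreadsheet software or loaded into a relational table, which is the output form the abstract describes.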
PQL; Structural Extraction; Browser Extension
He Hu, Xiaoyong Du
School of Information, Renmin University of China; Key Laboratory of Data Engineering and Knowledge Engineering, MOE; Beijing, China
International conference
Xiamen
English
614-618
2014-08-19 (date first made available online on the Wanfang platform; not the publication date of the paper)