Conference Topic

Mining, Indexing, and Searching for Textual Chemical Molecule Information on the Web

Current search engines do not support user searches for chemical entities (chemical names and formulae) beyond simple keyword searches. A chemical molecule can usually be represented in multiple textual ways, but a simple keyword search retrieves only exact matches and misses the other representations. We show how to build a search engine that enables searches for chemical entities and demonstrate empirically that it improves the relevance of returned documents. Our search engine first extracts chemical entities from text, performs novel indexing suitable for chemical names and formulae, and supports different query models that a scientist may require. We propose a model of hierarchical conditional random fields for chemical formula tagging that considers long-term dependencies at the sentence level. To support substring searches of chemical names, a search engine must index substrings of chemical names, but indexing all possible subsequences is not feasible in practice. We propose an algorithm for independent frequent subsequence mining to discover sub-terms of chemical names with their probabilities. We then propose an unsupervised hierarchical text segmentation (HTS) method that represents a sequence as a tree structure based on the discovered independent frequent subsequences, so that only the sub-terms on the HTS tree are indexed. Query models with corresponding ranking functions are introduced for chemical name searches. Experiments show that our approaches to chemical entity tagging perform well. Furthermore, we show that index pruning can reduce the index size and query time without significantly changing the returned ranked results. Finally, experiments show that our approaches outperform traditional methods for document search with ambiguous chemical terms.

Entity extraction, conditional random fields, independent frequent subsequence, hierarchical text segmentation, index pruning, substring search, similarity search, ranking

Bingjun Sun, Prasenjit Mitra, C. Lee Giles

Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA 168; College of Information Sciences and Technology, Pennsylvania State University, University Park, PA 168

International conference

The 17th International World Wide Web Conference (WWW08)

Beijing

English

2008-04-21 (date first posted online on the Wanfang platform; not the paper's publication date)