Representing a Web Page as Sets of Named Entities of Multiple Types – A Model and Some Preliminary Applications
As opposed to representing a document as a “bag of words”, as in most information retrieval applications, we propose a model that represents a web page as sets of named entities of multiple types. Specifically, four types of named entities are extracted: person, geographic location, organization, and time. Moreover, the relations among these entities are also extracted, weighted, classified, and marked by labels. On top of this model, some interesting applications are demonstrated. In particular, we introduce the notion of person-activity, which contains four elements: person, location, time, and activity. With this notion, and based on a reasonably large set of web pages, we are able to show how one person's activities can be described by time and location, which gives a good idea of the mobility of the person in question.
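The model described in the abstract can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the class names (`PageModel`, `Relation`, `PersonActivity`), the `mobility` helper, and the use of sortable date strings for time are all hypothetical, and the extraction step itself is not shown.

```python
from dataclasses import dataclass, field

# The four entity types named in the abstract.
ENTITY_TYPES = ("person", "location", "organization", "time")


@dataclass
class Relation:
    """A weighted, labeled relation between two entities (hypothetical layout)."""
    source: str
    target: str
    weight: float
    label: str


@dataclass
class PageModel:
    """A web page represented as one set of named entities per type."""
    url: str
    entities: dict = field(
        default_factory=lambda: {t: set() for t in ENTITY_TYPES}
    )
    relations: list = field(default_factory=list)

    def add_entity(self, etype: str, name: str) -> None:
        if etype not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {etype}")
        self.entities[etype].add(name)


@dataclass(frozen=True)
class PersonActivity:
    """The four elements of a person-activity: person, location, time, activity."""
    person: str
    location: str
    time: str  # assumption: a sortable string such as "2008-04"
    activity: str


def mobility(activities, person):
    """Return the distinct (time, location) pairs for one person, ordered by time."""
    return sorted(
        {(a.time, a.location) for a in activities if a.person == person}
    )
```

Given a list of `PersonActivity` records collected across many pages, `mobility(records, name)` yields a time-ordered trace of the places associated with that person, which is one simple reading of the mobility view mentioned in the abstract.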
Web Content Mining; Web Page Model; Named Entity
Nan Di, Conglei Yao, Mengcheng Duan, Jonathan J. H. Zhu, Xiaoming Li
Dept of Computer Science and Technology, Peking University, Beijing, 100871, P.R. China; Dept of Media and Communication, City University of Hong Kong, Kowloon, Hong Kong; State Key Laboratory of Advanced Optical Communication Systems & Networks, Peking University, Beijing,
International conference
The 17th International World Wide Web Conference (WWW08)
Beijing
English
2008-04-21 (date first posted online on the Wanfang platform; does not represent the paper's publication date)