The Trade-off between Bottom-up and Top-down Methods in Systems for Data-Intensive Scientific Discovery
It seems like everyone connected with information technology is concerned with big data these days, but this is within a broad and speculative landscape where there are almost as many descriptions and definitions as interests. Our characterization of big data is based on examples of data accumulation processes where it seems, in principle, impossible to store the data stream; and so the analytics, whether OLAP or RTAP or any variant, is not just a strategy for extracting interesting signals, but is in fact a necessary component of the data management infrastructure. The three cases noted here arise from the areas of radio astronomy, genome sequencing, and carbon flux sensors. In these examples, the consensus is that the volume and velocity of the data are so high that full capture is not an option. But there remains the scientific challenge of confirming anticipated hypotheses within these data. The underlying analytics tasks are large-scale scientific challenges, where the quest is not just for expected artifacts in the data, but for scientific insight for which all uninterpreted data may be crucial. With this as background, the hypothesis sketched here is as follows. Given the view that big data means unstorable data, the challenge is to compress as much of a large data stream as possible into abstraction labels, and then to retain as much as feasible of the remaining data stream as the basis for deeper analysis and the detection of new scientific concepts. This process of semantic compression rests on two foundations of artificial intelligence: 1) knowledge representation, especially for multiscale scientific modelling, and 2) machine learning methods for constructing classifiers that label or chunk portions of the data stream into identified components of such multiscale models. It will come as no surprise that recent developments in stream data mining are not only necessary for such large and unstorable data sources, but also offer advantages for scientific modelling in cases where data can be captured and stored, yet remains challenging along the recently popular dimensions of volume, variety, velocity, veracity, value, and vulnerability. In this case, further development of big data stream methods must consider a balance between bottom-up pattern detection and top-down, model-based methods.
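To make the semantic-compression hypothesis concrete, the following is a minimal sketch, not the author's implementation: it assumes a hypothetical process_stream routine and a stand-in classify function, and shows one way a stream processor could keep only abstraction labels for high-confidence, recognised chunks while retaining a bounded buffer of uninterpreted raw data for deeper analysis.

    from collections import deque
    from dataclasses import dataclass
    from typing import Deque, List, Tuple
    import random


    @dataclass
    class LabelledChunk:
        """An abstraction label standing in for a compressed stream chunk."""
        label: str
        confidence: float


    def process_stream(chunks, classify,
                       confidence_threshold: float = 0.9,
                       retention_budget: int = 1000):
        """Semantic compression sketch: store labels for recognised chunks,
        retain a bounded buffer of unexplained raw data for later analysis."""
        labels: List[LabelledChunk] = []
        retained: Deque = deque(maxlen=retention_budget)  # oldest raw data dropped first

        for chunk in chunks:
            label, confidence = classify(chunk)
            if confidence >= confidence_threshold:
                # Expected pattern: keep only the abstraction label (compression).
                labels.append(LabelledChunk(label, confidence))
            else:
                # Uninterpreted data: retain the raw chunk within the storage budget,
                # as candidate evidence for new scientific concepts.
                retained.append(chunk)

        return labels, list(retained)


    if __name__ == "__main__":
        # Toy stand-in classifier: values near an integer count as "expected" signals.
        def toy_classifier(x: float) -> Tuple[str, float]:
            nearest = round(x)
            return f"peak_{nearest}", 1.0 - abs(x - nearest)

        random.seed(0)
        stream = (random.uniform(0, 10) for _ in range(10_000))
        labels, residue = process_stream(stream, toy_classifier, 0.95, retention_budget=200)
        print(f"{len(labels)} chunks compressed to labels, {len(residue)} raw chunks retained")

The confidence threshold and retention budget stand in for the trade-off discussed in the abstract: the top-down model determines what can be labelled away, while the retained residue is what remains available for bottom-up pattern detection.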
big data; stream mining; information extraction; modelling; hypothesis management
Randy Goebel
Alberta Innovates Centre for Machine Learning, Department of Computing Science, University of Alberta, Edmonton, Alberta T6G 2E8, Canada
International conference
Shanghai
English
160-167
2013-08-01 (date first posted on the Wanfang platform; not necessarily the paper's publication date)