A Practical Machine Learning Study on Big Data: Taming the Unstructured Data in E&P Industry
Large amounts of data have accumulated in the petroleum E&P industry, much of it in unstructured or semi-structured forms such as text. Data mining techniques are required to process and analyze these data and to discover knowledge from them. Numerous machine learning libraries and frameworks, such as Mahout and Apache Spark, have matured considerably in recent years; they build on Hadoop, the core distributed processing model and de facto standard of big data. The objective of this paper is to illustrate how to extract valuable information and discover knowledge from a large volume of unstructured text using Apache Spark, an open-source, lightning-fast cluster computing technology. In this study, more than 180,000 paper abstracts were crawled from the online OnePetro library, then cleansed, transformed, and loaded into the Hadoop Distributed File System (HDFS). Apache Spark was used to conduct data analytics and machine learning on the unstructured text. Spark SQL was used to compute statistics on the papers and to identify the most popular papers and most influential authors in the OnePetro library. In addition, the 330 most popular papers were manually classified into 8 categories: (1) general; (2) drilling, perforation, completion, casing and cementing; (3) modeling and simulation; (4) production and performance; (5) EOR; (6) reservoir management practices; (7) fluid; (8) reservoir. These labeled papers serve as the training corpus for supervised text classification. A Naive Bayes model from Spark MLlib was constructed and then applied to all papers. An additional small test dataset showed the performance and accuracy of the classification to be acceptable.
Y.S. Kang, W.K. Wu, Y.Y. Li, Q. Yang
China University of Petroleum - Beijing; C&C Reservoirs Ltd.
Domestic conference (China)
5th Digital Oilfield International Academic Conference, 2017 (DOFIAC2017)
Qingdao
English
256-261
2017-10-15 (date first posted online on the Wanfang platform; does not represent the paper's publication date)
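
The supervised classification step described in the abstract (a small hand-labeled corpus used to train a Naive Bayes model, which is then applied to unlabeled abstracts) can be illustrated with a minimal sketch. This is not the authors' code and does not use Spark MLlib: it is a plain-Python multinomial Naive Bayes with Laplace smoothing, standing in for the MLlib model. The function names (`tokenize`, `train`, `predict`), the smoothing value `alpha=1.0`, and the four training sentences are all hypothetical; only the two category names are taken from the abstract.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(labeled_docs, alpha=1.0):
    """Fit a multinomial Naive Bayes model with Laplace smoothing.

    labeled_docs is an iterable of (label, text) pairs.
    """
    class_doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in labeled_docs:
        tokens = tokenize(text)
        class_doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)

    total_docs = sum(class_doc_counts.values())
    return {
        "alpha": alpha,
        "vocab": vocab,
        "word_counts": word_counts,
        "total_words": {c: sum(word_counts[c].values()) for c in class_doc_counts},
        "log_prior": {c: math.log(n / total_docs) for c, n in class_doc_counts.items()},
    }


def predict(model, text):
    """Return the label with the highest log posterior for the text."""
    alpha = model["alpha"]
    vocab_size = len(model["vocab"])
    # Words never seen in training carry no evidence and are skipped.
    tokens = [t for t in tokenize(text) if t in model["vocab"]]

    best_label, best_score = None, -math.inf
    for label, log_prior in model["log_prior"].items():
        denom = model["total_words"][label] + alpha * vocab_size
        score = log_prior
        for tok in tokens:
            score += math.log((model["word_counts"][label][tok] + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training sentences; the real corpus is 330 labeled papers.
TRAINING = [
    ("drilling", "casing and cementing design for a deviated well drilling program"),
    ("drilling", "perforation and completion practices after drilling"),
    ("EOR", "polymer flooding and surfactant injection for enhanced oil recovery"),
    ("EOR", "CO2 injection pilot for enhanced recovery"),
]

model = train(TRAINING)
print(predict(model, "cementing a casing string while drilling"))
print(predict(model, "surfactant injection recovery"))
```

In a Spark setting the same idea would run on distributed feature vectors rather than in-memory counters, but the scoring rule (class log prior plus the sum of smoothed log word likelihoods) is the same.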