Conference topic

A Conditional-Probability Zone Transformation Coding Method for Categorical Features

  Encoding categorical features so that machine learning models can solve problems efficiently has long been a key issue. One-hot coding, the state-of-the-art approach, is a widely accepted method for converting categorical features into numerical values. However, it produces a sparse feature space and code values that carry no meaning. We propose a novel coding method based on conditional probability computed after dividing the features into zones, called Conditional-probability-based Zone Transformation (CZT) coding. CZT coding calculates the conditional probability of each feature, then divides the features into several zones according to that probability, and finally codes the features in each zone. We prove mathematically that, compared with the state-of-the-art method, CZT coding reduces the code length by at least the mean of the feature space, and that the problem becomes easier for the downstream machine learning model after CZT coding. Finally, using the same neural network as the classifier, we compare the performance of CZT coding and one-hot coding on the Titanic dataset, in which most of the features are categorical. The result is that CZT coding makes the classifier perform better in both accuracy and stability.
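The three steps named in the abstract (estimate a conditional probability per feature value, divide the values into zones by that probability, code each zone) can be illustrated with a minimal sketch. This is only one possible reading of the abstract, not the authors' algorithm: the record does not specify how zones are delimited or how values are coded within a zone, so the equal-width probability bins, the use of the zone index as the code, the function names, and the toy data below are all assumptions made for illustration.

```python
from collections import defaultdict


def conditional_probabilities(values, labels):
    """Estimate P(label == 1 | value) for each distinct category value."""
    counts = defaultdict(int)
    positives = defaultdict(int)
    for v, y in zip(values, labels):
        counts[v] += 1
        positives[v] += y
    return {v: positives[v] / counts[v] for v in counts}


def zone_codes(probs, n_zones):
    """Assign each category value to a zone (assumed: equal-width bins)."""
    codes = {}
    for v, p in probs.items():
        # p == 1.0 would fall into bin n_zones, so clamp to the last zone.
        codes[v] = min(int(p * n_zones), n_zones - 1)
    return codes


def encode(values, labels, n_zones=3):
    """Replace every category value with the index of its zone."""
    probs = conditional_probabilities(values, labels)
    codes = zone_codes(probs, n_zones)
    return [codes[v] for v in values], probs, codes


# Made-up toy data: a categorical column and a binary target.
toy_values = ["S", "C", "Q", "S", "C", "S", "Q", "C", "S"]
toy_labels = [0, 1, 0, 0, 1, 1, 1, 1, 0]

encoded, probs, codes = encode(toy_values, toy_labels, n_zones=3)
print(probs)
print(codes)
print(encoded)
```

In this sketch each value is replaced by a single integer, whereas one-hot coding would need one column per distinct value. Whether the paper's code-length bound corresponds to this kind of comparison cannot be confirmed from the record alone.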

feature engineering; categorical features; conditional probability; formatting; feature extraction

Liang He; Chao Shen; Yun Li

National Key Laboratory of Science and Technology on Blind Signal Processing, Chengdu, Sichuan, China; MOE Key Lab for Intelligent Networks and Network Security, Xi'an Jiaotong University, Shaanxi, Xi'an, Ch

International conference

2019 China Turing Conference (ACM Turing Celebration Conference - China 2019)

Chengdu

English

655-660

2019-05-17 (date first made available online on the Wanfang platform; not the paper's publication date)