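Machine Learning, Feature Selection: Computing MIC (Maximal Information Coefficient) in Python
- Abstract
- Python implementation
- Worked examples
Abstract
MIC (maximal information coefficient) can detect nonlinear relationships between variables and is commonly used for feature selection in feature engineering. The idea is to compute the MIC between each feature and the dependent variable, keep the features that affect the dependent variable most, and drop the features that carry little information, so that the variables used for modeling are more representative. The method generally needs a fairly large data sample. This post implements the MIC computation in Python and wraps the code in a class so that readers can call it easily.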
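To illustrate why a measure that captures nonlinear dependence is useful, here is a minimal sketch (not from the original post) showing that the Pearson correlation coefficient can be close to zero even when one variable is completely determined by the other. The data, `x` and `y = x ** 2`, are an assumed toy example.

```python
import numpy as np

# Assumed toy data: x is symmetric around zero, y is a deterministic
# but nonlinear function of x.
x = np.linspace(-1.0, 1.0, 201)
y = x ** 2

# Pearson correlation only measures linear association.
pearson_r = np.corrcoef(x, y)[0, 1]
print("Pearson r = {:.3e}".format(pearson_r))
```

Because the positive and negative halves of `x` cancel out, the linear correlation vanishes here, while a grid-based measure such as MIC still sees that knowing the bin of `x` narrows down the bin of `y`.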
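Python implementation
The CyrusMIC class below discretizes two variables on an equal-width grid and computes their mutual information (information gain), normalized by the grid size:
.x_num: number of bins along the x direction; a minimum and a maximum can be given, or both can be left unspecified
.y_num: number of bins along the y direction; a minimum and a maximum can be given, or both can be left unspecified
.cal_mut_info(): computes the normalized mutual information from a probability matrix
.divide_bin(): computes the probability matrix for a given bin grid
.cal_MIC(): computes the maximal information coefficient
Usage: call cal_MIC() directly to compute the MIC between two variables.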
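For reference, the quantities computed by cal_mut_info() and cal_MIC() can be written out as follows. This is a sketch read off the code in this post; the symbols $p_{ij}$, $n_x$, $n_y$ and $n$ are notation introduced here, not the author's.

```latex
% p_{ij}: fraction of samples in grid cell (i, j) of an n_x-by-n_y grid
% n: number of samples
\[
I(X;Y) \;=\; \sum_{i=1}^{n_x} \sum_{j=1}^{n_y}
  p_{ij} \,\log_2 \frac{p_{ij}}{\bigl(\sum_{j'} p_{ij'}\bigr)\bigl(\sum_{i'} p_{i'j}\bigr)},
\qquad
\mathrm{MIC}(X,Y) \;=\; \max_{n_x,\,n_y}\;
  \frac{I(X;Y)}{\log_2 \min(n_x, n_y)}
\]
% By default the code searches 2 <= n_x, n_y <= round(n^{0.3}),
% and cells with p_{ij} = 0 are skipped in the sum.
```

Dividing by $\log_2 \min(n_x, n_y)$ keeps each grid's score within $[0, 1]$, which is what makes scores from grids of different sizes comparable before taking the maximum.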
# -*- coding: utf-8 -*-
# @Time : 2020/12/3 13:44
# @Author : CyrusMay WJ
# @FileName: MIC.py
# @Software: PyCharm
# @Blog :https://blog.csdn.net/Cyrus_May
import numpy as np
import logging
import sys


class CyrusMIC(object):
    # Log to stdout at INFO level (note: this configures the root logger).
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    screen_handler = logging.StreamHandler(sys.stdout)
    screen_handler.setLevel(logging.INFO)
    formatter = logging.Formatter('%(asctime)s - %(module)s.%(funcName)s:%(lineno)d - %(levelname)s - %(message)s')
    screen_handler.setFormatter(formatter)
    logger.addHandler(screen_handler)

    def __init__(self, x_num=(None, None), y_num=(None, None)):
        # x_num / y_num are (min, max) bin counts; None means "use the default".
        self.x_min_num, self.x_max_num = x_num
        self.y_min_num, self.y_max_num = y_num
        self.x = None
        self.y = None

    def cal_mut_info(self, p_matrix):
        """
        Compute the normalized mutual information.
        :param p_matrix: joint probability matrix of variables X and Y
        :return: mutual information divided by log2(min(rows, columns))
        """
        mut_info = 0
        p_matrix = np.array(p_matrix)
        for i in range(p_matrix.shape[0]):
            for j in range(p_matrix.shape[1]):
                if p_matrix[i, j] != 0:  # empty cells contribute nothing
                    mut_info += p_matrix[i, j] * np.log2(
                        p_matrix[i, j] / (p_matrix[i, :].sum() * p_matrix[:, j].sum()))
        info_coef = mut_info / np.log2(min(p_matrix.shape[0], p_matrix.shape[1]))
        self.logger.info("Information coefficient: {}".format(info_coef))
        return info_coef

    def divide_bin(self, x_num, y_num):
        """
        Split both variables into the given number of bins and return the probability matrix.
        :param x_num: number of bins along x
        :param y_num: number of bins along y
        :return: p_matrix
        """
        p_matrix = np.zeros([x_num, y_num])
        # Equal-width bin edges. The upper edge is padded by 1 so that the
        # maximum value falls inside the last half-open bin [low, high).
        x_bin = np.linspace(self.x.min(), self.x.max() + 1, x_num + 1)
        y_bin = np.linspace(self.y.min(), self.y.max() + 1, y_num + 1)
        for i in range(x_num):
            in_x = (self.x >= x_bin[i]) & (self.x < x_bin[i + 1])
            for j in range(y_num):
                in_y = (self.y >= y_bin[j]) & (self.y < y_bin[j + 1])
                # Fraction of samples that fall in cell (i, j).
                p_matrix[i, j] = np.count_nonzero(in_x & in_y) / self.x.shape[0]
        return p_matrix

    def cal_MIC(self, x, y):
        """
        Compute the maximal information coefficient between x and y.
        """
        self.x = np.array(x).reshape((-1,))
        self.y = np.array(y).reshape((-1,))
        # Default search range: from 2 up to round(n ** 0.3) bins per axis.
        default_max = int(round(self.x.shape[0] ** 0.3, 0))
        x_min = self.x_min_num or 2
        x_max = self.x_max_num or default_max
        y_min = self.y_min_num or 2
        y_max = self.y_max_num or default_max
        mics = []
        for i in range(x_min, x_max + 1):
            for j in range(y_min, y_max + 1):
                self.logger.info("Number of bins: [{},{}]".format(i, j))
                mics.append(self.cal_mut_info(self.divide_bin(i, j)))
        self.logger.info("Maximal information coefficient: {}".format(max(mics)))
        return max(mics)
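The grid search in the class can also be written in a more compact, vectorized form. The following is a minimal sketch under stated assumptions, not the author's code: `normalized_mi` and `grid_mic` are hypothetical helper names, and the binning is delegated to `np.histogram2d`, whose equal-width bins span exactly [min, max] instead of padding the upper edge by 1, so the scores can differ slightly from those of the class.

```python
import numpy as np


def normalized_mi(p):
    """Mutual information of a joint probability matrix, divided by log2(min(shape))."""
    p = np.asarray(p, dtype=float)
    px = p.sum(axis=1, keepdims=True)   # marginal of X, shape (nx, 1)
    py = p.sum(axis=0, keepdims=True)   # marginal of Y, shape (1, ny)
    mask = p > 0                        # empty cells contribute nothing
    ratio = p[mask] / (px @ py)[mask]   # p_ij / (p_i. * p_.j)
    mi = float(np.sum(p[mask] * np.log2(ratio)))
    return mi / np.log2(min(p.shape))


def grid_mic(x, y, max_bins=None):
    """Largest normalized MI over equal-width nx-by-ny grids (hypothetical helper)."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    if max_bins is None:
        max_bins = int(round(x.size ** 0.3))   # same default as the class
    best = 0.0
    for nx in range(2, max_bins + 1):
        for ny in range(2, max_bins + 1):
            counts, _, _ = np.histogram2d(x, y, bins=[nx, ny])
            best = max(best, normalized_mi(counts / x.size))
    return best


# Usage: rank two assumed toy features against a target by their score.
rng = np.random.default_rng(0)
x = np.linspace(0, 6, 3000)
target = np.sin(x)
features = {"x": x, "noise": rng.random(x.size)}
scores = {name: grid_mic(col, target) for name, col in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Feature selection as described in the abstract then amounts to keeping the top entries of `ranking`. Both this sketch and the class use fixed equal-width bins, which is a simplification: it is cheap to compute, but the score depends on where the bin edges happen to fall.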
Worked examples
MIC of linearly related variables with added noise
if __name__ == '__main__':
    import matplotlib.pyplot as plt
    x = np.arange(0, 100)
    # Linear relationship plus uniform noise in [0, 1).
    y = x + 5 + np.random.random(x.shape[0])
    plt.scatter(x, y, c='g')
    mic_tool = CyrusMIC()
    mic_tool.cal_MIC(x, y)
    plt.show()
2020-12-03 17:27:06,617 - MIC.cal_mut_info:41 - INFO - Information coefficient: 0.7193485237183258
2020-12-03 17:27:06,618 - MIC.cal_MIC:71 - INFO - Number of bins: [4,2]
2020-12-03 17:27:06,621 - MIC.cal_mut_info:41 - INFO - Information coefficient: 1.0
2020-12-03 17:27:06,621 - MIC.cal_MIC:71 - INFO - Number of bins: [4,3]
2020-12-03 17:27:06,631 - MIC.cal_mut_info:41 - INFO - Information coefficient: 0.714608689855715
2020-12-03 17:27:06,631 - MIC.cal_MIC:71 - INFO - Number of bins: [4,4]
2020-12-03 17:27:06,643 - MIC.cal_mut_info:41 - INFO - Information coefficient: 0.9694248603634986
2020-12-03 17:27:06,643 - MIC.cal_MIC:73 - INFO - Maximal information coefficient: 1.0

MIC of variables with a sinusoidal relationship
if __name__ == '__main__':
    import matplotlib.pyplot as plt
    x = np.arange(0, 6, 0.002)
    y = np.sin(x) + 5
    plt.scatter(x, y, c='g')
    mic_tool = CyrusMIC()
    mic_tool.cal_MIC(x, y)
    plt.show()
2020-12-03 17:32:17,002 - MIC.cal_MIC:71 - INFO - Number of bins: [11,9]
2020-12-03 17:32:17,221 - MIC.cal_mut_info:41 - INFO - Information coefficient: 0.5534001973540179
2020-12-03 17:32:17,221 - MIC.cal_MIC:71 - INFO - Number of bins: [11,10]
2020-12-03 17:32:17,477 - MIC.cal_mut_info:41 - INFO - Information coefficient: 0.540981036470426
2020-12-03 17:32:17,477 - MIC.cal_MIC:71 - INFO - Number of bins: [11,11]
2020-12-03 17:32:17,755 - MIC.cal_mut_info:41 - INFO - Information coefficient: 0.5571694750793418
2020-12-03 17:32:17,755 - MIC.cal_MIC:73 - INFO - Maximal information coefficient: 0.9204753790747687

by CyrusMay 2020 12 03
