EduCoder平臺：人工智能之決策樹演算法

第1關：決策樹演算法求解分類預測問題

編程要求：

本關的編程任務是補全右側代碼片段 build、predict、parse_data、calc_all_gain、calc_attr_gain、calc_bool_gain、get_targ 和 is_leaf 中 Begin 至 End 中間的代碼，具體要求如下：

在build中，創建一棵決策樹，輸入引數為根結點；
在predict中，根據歸納好的決策樹預測輸入樣例x的謂詞 WillWait 狀態（Yes 或者 No）；
在_parse_data_中，決議輸入矩陣資料（在 Python 里以二維串列資料存盤），各引數詳見代碼中函式注解，然后回傳資訊增益最大的屬性名稱及其屬性值串列；
在_calc_all_gain_中，計算所有樣本的資訊熵并回傳，各引數詳見代碼中函式注解；
在 calc_attr_gain 中，計算某一特征屬性的資訊熵并回傳，各引數詳見代碼中函式注解；
在_calc_bool_gain_中，計算二值隨機變數的資訊熵并回傳，各引數詳見代碼中函式注解；
在_get_targ_中，計算葉子結點的決策分類標簽并回傳，各引數詳見代碼中函式注解；
在_is_leaf_中，判斷該結點是否為葉子結點，若是則回傳 True，否則回傳 False，

測驗說明：

平臺將自動編譯補全后的代碼，并生成若干組測驗資料，接著根據程式的輸出判斷程式是否正確，

以下是平臺的測驗樣例：

測驗輸入：

[[example, Alt, Bar, Fri, Hun, Pat, Price, Rain, Res, Type, Est],[x1,
Yes, No, No, Yes, Some, $$$, No, Yes, French, 0-10]]

預期輸出：

Yes

代碼如下：


# -*- coding: UTF-8 -*-
import math
class TreeNode:
    '''決策樹結點資料結構
    成員變數：
    row - int 串列資料的行數，初始13
    col - int 串列資料的列數，初始12
    data - list[[]] 二維串列資料，初始資料形式在testDecisionTree.py里
                    第0行：[第0列：example(樣本名字) 中間各列(1-10)：各個特征屬性名稱 第11列：WillW ait(目標分類) ]
                    第1-12行：[樣本名字，具體屬性值，分類目標]
        data = [
        ['example', 'Alt', 'Bar', 'Fri', 'Hun', 'Pat',  'Price', 'Rain', 'Res', 'Type',   'Est',   'WillW ait'],
        ['x1',      'Yes', 'No',  'No',  'Yes', 'Some', '$$$',   'No',   'Yes', 'French', '0-10',  'y1=Yes'   ],
        ['x2',      'Yes', 'No',  'No',  'Yes', 'Full', '$',     'No',   'No',  'Thai',   '30-60', 'y2=No'    ],
            ........            .....       .....       .........           ............
        ['x12',     'Yes', 'Yes', 'Yes', 'Yes', 'Full', '$',     'No',   'No',  'Burger', '30-60', 'y12=Yes'  ] ]
    targ - string 分類結果 Yes No
    name - string 結點名字：特征屬性名稱
    attr - list[string] 該特征屬性下的各個屬性值
    children - list[GameNode] 該特征屬性下的各個決策樹子結點，與 attr 一一對應
    '''
    def __init__(self, row, col, data):
        self.row = row
        self.col = col
        self.data = data
        self.targ = ''          # target result
        self.name = ''          # attribute name
        self.attr = []          # attribute value list
        self.child = []         # attribute - TreeNode List
class DecisionTree:
    '''決策樹
    成員變數：
    root - TreeNode 博弈樹根結點
    成員函式：
    buildTree - 創建決策樹
    predict - 預測樣本分類標簽
    _parse_data_ - 決議資料中最大資訊增益的特性屬性
    _calc_all_gain_ - 計算整個樣本的資訊熵
    _calc_attr_gain_ - 計算某一特征屬性的資訊熵
    _calc_bool_gain_ - 通用計算函式：計算二值隨機變數的資訊熵
    _get_targ_ - 獲取葉子結點的決策分類標簽
    _is_leaf_ - 判斷該結點是否為葉子結點
    '''
    def __init__(self, row, col, data):
        self.root = TreeNode(row, col, data)
    def build(self, root):
        '''遞回法創建博弈樹
        引數：
        root - TreeNode 初始為決策樹根結點
        '''
        #請在這里補充代碼，完成本關任務
        #********** Begin **********#
        if self._is_leaf_(root):
            root.targ = self._get_targ_(root)
            return
        root.name, root.attr = self._parse_data_(root.row, root.col, root.data)
        idj = [j for j in range(root.col) if root.data[0][j] == root.name][0]
        for attr in root.attr:
            row = 0
            col = root.col - 1
            data = []
            for i in range(root.row):
                if i!=0 and root.data[i][idj] != attr:
                    continue
                tmp = []
                for j in range(root.col):
                    if j == idj:
                        continue
                    tmp.append(root.data[i][j])
                data.append(tmp)
                row += 1
            node = TreeNode(row, col, data)
            root.child.append(node)
        for node in root.child:
            self.build(node)
        #********** End **********#
    def predict(self, root, x):
        '''分類預測
        引數：
        root - TreeNode 決策樹根結點
        x - [[]] 測驗資料，形如：
           [ ['example', 'Alt', 'Bar', 'Fri', 'Hun', 'Pat', 'Price', 'Rain', 'Res', 'Type',  'Est'],
             ['x1',      'Yes', 'No',  'No',  'Yes', 'Some', '$$$',  'No',   'Yes', 'French','0-10'] ]
        回傳值：
        clf - string 分類標簽 Yes No
        '''
        #請在這里補充代碼，完成本關任務
        #********** Begin **********#
        if self._is_leaf_(root):
            return root.targ
        id_name = x[0].index(root.name)
        for id_attr, attr in enumerate(root.attr):
            if attr == x[1][id_name]:
                return self.predict(root.child[id_attr], x)
        #********** End **********#
    def _parse_data_(self, row, col, data):
        '''決議資料：計算資料中最大資訊增益的特性屬性
        引數：
        row - int 串列資料的行數
        col - int 串列資料的列數
        data - list[[]] 二維串列資料，形如：
                第0行：[第0列：example(樣本名字) 中間各列(1-10)：各個特征屬性名稱 第11列：WillW ait(目標分類) ]
                第1-12行：[樣本名字，具體屬性值，分類目標]
        data = [
        ['example', 'Alt', 'Bar', 'Fri', 'Hun', 'Pat',  'Price', 'Rain', 'Res', 'Type',   'Est',   'WillW ait'],
        ['x1',      'Yes', 'No',  'No',  'Yes', 'Some', '$$$',   'No',   'Yes', 'French', '0-10',  'y1=Yes'   ],
        ['x2',      'Yes', 'No',  'No',  'Yes', 'Full', '$',     'No',   'No',  'Thai',   '30-60', 'y2=No'    ],
            ........            .....       .....       .........           ............
        ['x12',     'Yes', 'Yes', 'Yes', 'Yes', 'Full', '$',     'No',   'No',  'Burger', '30-60', 'y12=Yes'  ] ]
        回傳值：
        clf - string, list[] 資訊增益最大的屬性名稱 及其 屬性值串列
        '''
        #請在這里補充代碼，完成本關任務
        #********** Begin **********#
        max_gain = -float('inf')
        max_name = ''
        max_attr = []
        max_idj = -1
        all_gain = self._calc_all_gain_(row-1, [x[-1] for x in data[1:]])      # col = 1
        #print('all_gain: ', all_gain)
        for j in range(1, col-1, 1):
            tmp_data = []
            for i in range(1, row, 1):
                tmp_data.append([data[i][j], data[i][-1]])
            tmp_gain = self._calc_attr_gain_(row-1, tmp_data)       # col = 2
            if (all_gain - tmp_gain) > max_gain:
                max_gain = all_gain - tmp_gain
                max_name = data[0][j]
                max_idj = j
            #print(max_gain, max_name, max_idj, tmp_gain, data[0][j], all_gain - tmp_gain)
        for i in range(1, row, 1):
            if data[i][max_idj] not in max_attr:
                max_attr.append(data[i][max_idj])
        return max_name, max_attr
        #********** End **********#
    def _calc_all_gain_(self, row, data):
        '''計算整個樣本的資訊熵
        引數：
        row - int 串列資料的行數
        data - list[] 一維串列資料，形如：[分類目標]
                data = ['y1=Yes', 'y2=No', ........, 'y12=Yes']
        回傳值：
        clf - float 資訊熵
        '''
        #請在這里補充代碼，完成本關任務
        #********** Begin **********#
        dict_ = {'yes':0.0, 'no':0.0}
        for i in range(row):
            if data[i][-1] == 's':  # 'Yes'
                dict_['yes'] += 1.0
            else:               # 'No'
                dict_['no'] += 1.0
        sum = 0.0
        for key_ in dict_:
            sum += (1.0 * dict_[key_] / float(row)) * math.log(1.0 * dict_[key_] / float(row), 2)
        return -sum
        #********** End **********#
    def _calc_attr_gain_(self, row, data):
        '''計算某一特征屬性的資訊熵
        引數：
        row - int 串列資料的行數
        data - list[[]] 二維串列資料（2列），形如：[[某一屬性值，分類目標]]
                  [ ['0-10',  'y1=Yes'   ],
                    ['30-60', 'y2=No'    ],
                      ........
                    ['30-60', 'y12=Yes'  ] ]
        回傳值：
        clf - float 資訊熵
        '''
        #請在這里補充代碼，完成本關任務
        #********** Begin **********#
        # attributes
        dict_ = {}
        for i in range(row):
            if data[i][0] not in dict_:
                dict_[data[i][0]] = [0.0, 0.0]  # [yes, no]
            # attribute : yes or no
            if data[i][1][-1] == 's':   # yes
                dict_[data[i][0]][0] += 1.0
            else:                       # no
                dict_[data[i][0]][1] += 1.0
        sum = 0.0
        for key_ in dict_:
            p = 1.0 * dict_[key_][0] / (dict_[key_][0] + dict_[key_][1])
            sum += (1.0 * (dict_[key_][0] + dict_[key_][1]) / float(row)) * self._calc_bool_gain_(p)
        return sum
        #********** End **********#
    def _calc_bool_gain_(self, p):
        '''通用計算函式：計算二值隨機變數的資訊熵
        引數：
        p - float 二值隨機變數的概率 在[0, 1]之間
        回傳值：
        clf - float 資訊熵
        '''
        #請在這里補充代碼，完成本關任務
        #********** Begin **********#
        if p == 1 or p == 0:
            return 0.0
        return -(p * math.log(p, 2) + (1-p) * math.log((1-p), 2))
        #********** End **********#
    def _get_targ_(self, node):
        '''計算葉子結點的決策分類標簽
        引數：
        node - TreeNode 決策樹結點
        回傳值：
        clf - string 分類標簽 Yes No
        '''
        #請在這里補充代碼，完成本關任務
        #********** Begin **********#
        yes = 0
        no = 0
        for i in range(1, node.row, 1):
            if node.data[i][-1][-1] == 's':     # 'Yes'
                yes += 1
            else:                               # 'No'
                no += 1
        if yes > no:
            return 'Yes'
        else:
            return 'No'
        #********** End **********#
    def _is_leaf_(self, node):
        '''判斷該結點是否為葉子結點
        引數：
        node - TreeNode 決策樹結點
        回傳值：
        clf - bool 葉子結點True 非葉子結點False
        '''
        #請在這里補充代碼，完成本關任務
        #********** Begin **********#
        if node.col == 2:               # [ x* , y* ] without any attributes
            return True
        targ = node.data[-1][-1][-1]    # [ x* , attr , y* ] attributes
        for i in range(node.row):
            if i == 0:
                continue
            if node.data[i][-1][-1] != targ:
                return False
        return True                     # all y* are Yes or No

        #********** End **********#

轉載請註明出處，本文鏈接：https://www.uj5u.com/qita/394374.html

標籤：AI

上一篇：資訊安全技術真難啊

下一篇：sqoop匯入資料到mysql時報錯：ERROR tool.ExportTool: Error during export: Export job failed

頭歌平臺-人工智能之決策樹演算法

EduCoder平臺：人工智能之決策樹演算法

第1關：決策樹演算法求解分類預測問題