Tianchi Beginner's Introduction to Financial Risk Control: Loan Default Prediction, TOP6 Solution Walkthrough

Detailed feature engineering plus my own take

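- Preface
- 1 Understanding the Task
- 2 Data Preprocessing
  - 2.1 Missing-Value Analysis
  - 2.2 Choosing an Encoding
- 3 Feature Engineering
  - 3.1 Interpretable Features
  - 3.2 Combined and Cross Features
  - 3.3 Brute-Force Features
- 4 Model Ensembling
  - 4.1 Model Selection
  - 4.2 Feature Selection
  - 4.3 Diverse Models and Stacking
- 5 Summary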

Preface

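Hello everyone, I'm 廬州小火鍋, a member of the coggle open-source group. This article introduces a TOP6 solution for the Tianchi learning competition on loan default prediction. The competition link: Tianchi learning competition, loan default prediction.
What is shared here applies broadly to data mining and financial risk-control competitions. The corresponding long-running competition has just started, and I hope this gives you a few ideas.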

1 Understanding the Task

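The task is to predict whether a user will default on a loan, and the output is each user's default probability (between 0 and 1). The data comes from the loan records of a credit platform. It totals more than 1.2 million rows and contains 47 columns of variables, 15 of which are anonymous. From it, 800,000 rows are drawn as the training set, 200,000 as test set A, and 200,000 as test set B. Fields such as employmentTitle, purpose, postCode, and title are desensitized.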

2 Data Preprocessing

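The raw data, training set plus test set, has 1 million rows. That is quite large, and the feature engineering that follows uses even more memory, so memory optimization is necessary. The function below reduces the storage the data occupies.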

import numpy as np


def reduce_mem_usage(df):
    """ Iterate through all the columns of a dataframe and downcast each
        numeric column to the smallest dtype that holds its value range.
        Note: float16 keeps only about 3 significant decimal digits,
        so large values lose precision after the downcast.
    """
    start_mem = df.memory_usage().sum() / 1024 ** 2  # bytes -> MB
    print('Memory usage: {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                # anything larger stays int64
            elif str(col_type)[:5] == 'float':
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                # anything larger stays float64

    end_mem = df.memory_usage().sum() / 1024 ** 2  # bytes -> MB
    print('Memory usage after optimization: {:.2f} MB'.format(end_mem))
    print('Memory usage reduced by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
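
One thing worth checking before trusting the float16 branch above is how much precision it costs. The sketch below is not part of the original solution: it uses a made-up two-column frame and pandas' built-in `pd.to_numeric(..., downcast=...)`, which never goes below float32, as a more conservative alternative.

```python
import numpy as np
import pandas as pd

# Toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "term": np.array([3, 5, 3, 5], dtype=np.int64),
    "loanAmnt": np.array([35000.0, 18000.0, 12000.0, 11000.0], dtype=np.float64),
})
before = toy.memory_usage(deep=True).sum()

small = toy.copy()
# downcast="integer" picks the smallest signed integer type that fits.
small["term"] = pd.to_numeric(small["term"], downcast="integer")
# downcast="float" stops at float32, so it avoids the float16 rounding shown below.
small["loanAmnt"] = pd.to_numeric(small["loanAmnt"], downcast="float")
after = small.memory_usage(deep=True).sum()

# float16 spacing between 32768 and 65536 is 32, so 35001 is rounded.
lossy = float(np.float16(35001.0))  # -> 35008.0

print(small.dtypes.to_dict(), before, after, lossy)
```

For tree models the rounding is often harmless, but it can matter for exact-match features such as counts or IDs stored as floats.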

2.1 Missing-Value Analysis

[Figure: missing values in the training and test sets]

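The figure above shows the missing values in the training and test sets. The fields with missing values are 'employmentTitle', 'employmentLength', 'postCode', 'dti', 'pubRecBankruptcies', 'revolUtil', 'title', 'n0', 'n1', 'n2', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n11', 'n12', 'n13', 'n14'.
There are three ways to handle missing values: 1. filling (with the mean, the mode, or a value clearly distinct from the existing ones, such as -999); 2. deletion (dropping samples that contain missing values or have a high missing rate); 3. missing-value prediction (using the other columns to predict the missing column, which can bring pleasant surprises; not covered here). The n0-n14 features, where missingness is most severe, are missing in both train and test, so I simply fill them with -999 and afterwards derive statistical features from the missing data.
The code is as follows: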

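cols = ['employmentTitle', 'employmentLength', 'postCode', 'dti', 'pubRecBankruptcies', 'revolUtil', 'title',
            'n0', 'n1', 'n2', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n11', 'n12', 'n13', 'n14']
# Step 1: mark every missing value with the sentinel string '\N'.
# Assign the result back instead of using inplace=True on a column selection.
for col in cols:
    data[col] = data[col].fillna(r'\N')
# Step 2: every column except employmentLength gets -999 in place of the sentinel.
cols = [f for f in cols if f not in ['employmentLength']]
for col in cols:
    data[col] = data[col].replace({r'\N': -999})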
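
The "statistical features from the missing data" mentioned above are not shown in the post. A minimal sketch of one common form, assuming a toy frame and hypothetical output column names (`n_missing_count`, `dti_is_missing`), is to count missing values per row before filling them:

```python
import numpy as np
import pandas as pd

# Toy data; real column values are not reproduced here.
toy = pd.DataFrame({
    "n0": [0.0, np.nan, 2.0],
    "n1": [np.nan, np.nan, 1.0],
    "dti": [12.5, 30.1, np.nan],
})
anon_cols = ["n0", "n1"]

# Count missing values BEFORE filling, otherwise the information is gone.
toy["n_missing_count"] = toy[anon_cols].isna().sum(axis=1)
# A per-column indicator is another simple missingness feature.
toy["dti_is_missing"] = toy["dti"].isna().astype(int)

# Only now replace the remaining NaN values with the sentinel.
filled = toy.fillna(-999)
print(filled)
```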

2.2 Choosing an Encoding

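Non-numeric features in the raw data have to be converted into numeric symbols the model can recognize, which means choosing among several encodings. Two questions guide the choice: (1) does the encoding represent the feature as accurately as possible, and (2) does it avoid introducing noise or misleading information. Common encoding methods are as follows:
[Figure: overview of common encoding methods]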

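Count encoder, one-hot encoder, and label encoder mainly target low-cardinality unordered features such as gender. For high-cardinality unordered features such as region or postal code, a target encoder or mean encoder can be used.
Note that many people think target encoding is the same thing as mean encoding. That is not correct, although the two work on very similar principles. The commonly used target encoder replaces each category with the expectation of the labels for that category, where the expectation can simply be understood as the mean (the mean form of target encoder). A target encoder can also use the median, standard deviation, maximum, and so on of the labels for the category. The mean encoder is similar in principle to the mean-form target encoder, but adds some special measures to avoid overfitting. For details, see the related write-up: a summary of feature encoding methods. Code for both encodings follows.
First, the mean encoder code: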

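import numpy as np
from itertools import product
from sklearn.model_selection import StratifiedKFold, KFold


# Mean encoder class definition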
class MeanEncoder:
    def __init__(self, categorical_features, n_splits=5, target_type='classification', prior_weight_func=None):
        """
        :param categorical_features: list of str, the name of the categorical columns to encode

        :param n_splits: the number of splits used in mean encoding

        :param target_type: str, 'regression' or 'classification'

        :param prior_weight_func:
        a function that takes in the number of observations, and outputs prior weight
        when a dict is passed, the default exponential decay function will be used:
        k: the number of observations needed for the posterior to be weighted equally as the prior
        f: larger f --> smaller slope
        """

        self.categorical_features = categorical_features
        self.n_splits = n_splits
        self.learned_stats = {}

        if target_type == 'classification':
            self.target_type = target_type
            self.target_values = []
        else:
            self.target_type = 'regression'
            self.target_values = None

        if isinstance(prior_weight_func, dict):
            # Build the exponential decay function from the given k and f (no eval needed)
            k, f = prior_weight_func['k'], prior_weight_func['f']
            self.prior_weight_func = lambda x: 1 / (1 + np.exp((x - k) / f))
        elif callable(prior_weight_func):
            self.prior_weight_func = prior_weight_func
        else:
            self.prior_weight_func = lambda x: 1 / (1 + np.exp((x - 2) / 1))

    @staticmethod
    def mean_encode_subroutine(X_train, y_train, X_test, variable, target, prior_weight_func):
        X_train = X_train[[variable]].copy()
        X_test = X_test[[variable]].copy()

        if target is not None:
            nf_name = '{}_pred_{}'.format(variable, target)
            X_train['pred_temp'] = (y_train == target).astype(int)  # classification
        else:
            nf_name = '{}_pred'.format(variable)
            X_train['pred_temp'] = y_train  # regression
        prior = X_train['pred_temp'].mean()

        # Named aggregation; the dict-renaming form of agg was removed in pandas 1.0
        col_avg_y = X_train.groupby(variable)['pred_temp'].agg(mean='mean', beta='size')
        col_avg_y['beta'] = prior_weight_func(col_avg_y['beta'])
        col_avg_y[nf_name] = col_avg_y['beta'] * prior + (1 - col_avg_y['beta']) * col_avg_y['mean']
        col_avg_y.drop(['beta', 'mean'], axis=1, inplace=True)

        nf_train = X_train.join(col_avg_y, on=variable)[nf_name].values
        nf_test = X_test.join(col_avg_y, on=variable).fillna(prior, inplace=False)[nf_name].values

        return nf_train, nf_test, prior, col_avg_y

    def fit_transform(self, X, y):
        """
        :param X: pandas DataFrame, n_samples * n_features
        :param y: pandas Series, n_samples (it is indexed with .iloc below, so wrap a numpy array in pd.Series first)
        :return X_new: the transformed pandas DataFrame containing mean-encoded categorical features
        """
        X_new = X.copy()
        if self.target_type == 'classification':
            skf = StratifiedKFold(self.n_splits)
        else:
            skf = KFold(self.n_splits)

        if self.target_type == 'classification':
            self.target_values = sorted(set(y))
            self.learned_stats = {'{}_pred_{}'.format(variable, target): [] for variable, target in
                                  product(self.categorical_features, self.target_values)}
            for variable, target in product(self.categorical_features, self.target_values):
                nf_name = '{}_pred_{}'.format(variable, target)
                X_new.loc[:, nf_name] = np.nan
                for large_ind, small_ind in skf.split(y, y):
                    nf_large, nf_small, prior, col_avg_y = MeanEncoder.mean_encode_subroutine(
                        X_new.iloc[large_ind], y.iloc[large_ind], X_new.iloc[small_ind], variable, target,
                        self.prior_weight_func)
                    X_new.iloc[small_ind, -1] = nf_small
                    self.learned_stats[nf_name].append((prior, col_avg_y))
        else:
            self.learned_stats = {'{}_pred'.format(variable): [] for variable in self.categorical_features}
            for variable in self.categorical_features:
                nf_name = '{}_pred'.format(variable)
                X_new.loc[:, nf_name] = np.nan
                for large_ind, small_ind in skf.split(y, y):
                    nf_large, nf_small, prior, col_avg_y = MeanEncoder.mean_encode_subroutine(
                        X_new.iloc[large_ind], y.iloc[large_ind], X_new.iloc[small_ind], variable, None,
                        self.prior_weight_func)
                    X_new.iloc[small_ind, -1] = nf_small
                    self.learned_stats[nf_name].append((prior, col_avg_y))
        return X_new

    def transform(self, X):
        """
        :param X: pandas DataFrame, n_samples * n_features
        :return X_new: the transformed pandas DataFrame containing mean-encoded categorical features
        """
        X_new = X.copy()

        if self.target_type == 'classification':
            for variable, target in product(self.categorical_features, self.target_values):
                nf_name = '{}_pred_{}'.format(variable, target)
                X_new[nf_name] = 0
                for prior, col_avg_y in self.learned_stats[nf_name]:
                    X_new[nf_name] += X_new[[variable]].join(col_avg_y, on=variable).fillna(prior, inplace=False)[
                        nf_name]
                X_new[nf_name] /= self.n_splits
        else:
            for variable in self.categorical_features:
                nf_name = '{}_pred'.format(variable)
                X_new[nf_name] = 0
                for prior, col_avg_y in self.learned_stats[nf_name]:
                    X_new[nf_name] += X_new[[variable]].join(col_avg_y, on=variable).fillna(prior, inplace=False)[
                        nf_name]
                X_new[nf_name] /= self.n_splits

        return X_new
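
The core of `mean_encode_subroutine` is the blend between the global prior and the per-category mean. The short sketch below isolates that step using the class's default weight function (k=2, f=1); the helper names `prior_weight` and `smoothed_mean` and the numbers are mine, chosen only to show how rare categories are pulled toward the prior.

```python
import numpy as np

def prior_weight(n, k=2, f=1):
    """Weight given to the global prior for a category observed n times."""
    return 1 / (1 + np.exp((n - k) / f))

def smoothed_mean(cat_mean, n, prior, k=2, f=1):
    """Blend of prior and category mean, as in mean_encode_subroutine."""
    w = prior_weight(n, k, f)
    return w * prior + (1 - w) * cat_mean

prior = 0.2  # hypothetical overall default rate
# A category seen once with a 100% default rate is pulled strongly toward the prior.
rare = smoothed_mean(1.0, n=1, prior=prior)
# The same rate observed 50 times is trusted almost completely.
common = smoothed_mean(1.0, n=50, prior=prior)
print(round(rare, 4), round(common, 4))
```

With n equal to k the prior and the category mean are weighted equally, which matches the description of k in the class docstring.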

Apply mean encoding to the high-cardinality unordered features:

class_list = ['postCode', 'purpose', 'regionCode', 'grade', 'subGrade', 'homeOwnership', 'employmentTitle','title']
MeanEnocodeFeature = class_list  # features to be mean-encoded
ME = MeanEncoder(MeanEnocodeFeature, target_type='classification')  # instantiate the mean encoder

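A plain target encoder easily overfits, so we improve it with 5-fold cross-validation. The samples are split into 5 folds and, as in the figure below, each category of the high-cardinality unordered feature in a given fold is replaced by the mean label of that category computed from the other 4 folds:
[Figure: 5-fold target encoding]
The code is as follows: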

from sklearn.model_selection import StratifiedKFold


def kfold_stats_feature(train, test, feats, k):
    folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=6666)  # ideally the same split as the model's K-fold cross-validation later

    train['fold'] = None
    for fold_, (trn_idx, val_idx) in enumerate(folds.split(train, train['isDefault'])):
        train.loc[val_idx, 'fold'] = fold_

    kfold_features = []
    for feat in feats:
        nums_columns = ['isDefault']
        for f in nums_columns:
            colname = feat + '_' + f + '_kfold_mean'
            kfold_features.append(colname)
            train[colname] = None
            for fold_, (trn_idx, val_idx) in enumerate(folds.split(train, train['isDefault'])):
                tmp_trn = train.iloc[trn_idx]
                order_label = tmp_trn.groupby([feat])[f].mean()
                tmp = train.loc[train.fold == fold_, [feat]]
                train.loc[train.fold == fold_, colname] = tmp[feat].map(order_label)
                # fillna
                global_mean = train[f].mean()
                train.loc[train.fold == fold_, colname] = train.loc[train.fold == fold_, colname].fillna(global_mean)
            train[colname] = train[colname].astype(float)

        for f in nums_columns:
            colname = feat + '_' + f + '_kfold_mean'
            test[colname] = None
            order_label = train.groupby([feat])[f].mean()
            test[colname] = test[feat].map(order_label)
            # fillna
            global_mean = train[f].mean()
            test[colname] = test[colname].fillna(global_mean)
            test[colname] = test[colname].astype(float)
    del train['fold']
    return train, test
    
target_encode_cols = ['postCode', 'regionCode', 'homeOwnership', 'employmentTitle','title']
kflod_num = 5  # 5-fold cross-validation
train, test = kfold_stats_feature(train, test, target_encode_cols, kflod_num)
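
To see the out-of-fold idea on something small, here is a self-contained sketch on invented data. `oof_target_mean` is a hypothetical helper, not the function above; like the original it falls back to the global label mean when a category never appears in the other folds.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

def oof_target_mean(df, feat, label, n_splits=5, seed=0):
    """Encode `feat` with the label mean computed on the other folds only."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    global_mean = df[label].mean()
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for trn_idx, val_idx in folds.split(df, df[label]):
        # Category means come from the training part of this split only.
        fold_means = df.iloc[trn_idx].groupby(feat)[label].mean()
        encoded.iloc[val_idx] = df[feat].iloc[val_idx].map(fold_means).to_numpy()
    # Categories unseen in the other folds get the global mean.
    return encoded.fillna(global_mean)

toy = pd.DataFrame({
    "regionCode": ["a", "a", "a", "b", "b", "b", "c", "a", "b", "a"],
    "isDefault":  [1, 0, 1, 0, 0, 1, 1, 1, 0, 0],
})
toy["regionCode_te"] = oof_target_mean(toy, "regionCode", "isDefault")
print(toy)
```

Category "c" occurs only once, so whenever that row is in the validation fold no mean is available for it and it receives the global mean of 0.5 instead of its own label.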

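For features that carry an ordering, target encoding and mean encoding cannot reflect the order and lose part of the information. Two very important features, grade (loan grade) and subGrade (loan sub-grade), have an obvious order, so we encode them with a custom mapping: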


# Shared letter -> rank mapping (avoids shadowing the builtin name dict)
GRADE_MAP = {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6, 'G': 7}


def gradeTrans(x):
    # Map the loan grade letter to its ordinal rank
    return GRADE_MAP[x]


def subGradeTrans(x):
    # e.g. 'B3' -> 2 * 5 + 3 = 13: grade rank first, then the sub-grade digit
    return GRADE_MAP[x[0]] * 5 + int(x[1])


data['grade'] = data['grade'].apply(gradeTrans)
data['subGrade'] = data['subGrade'].apply(subGradeTrans)
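
On a large frame the same mapping can be done without a Python-level `apply`. The following is a sketch under the assumption that every subGrade value is one letter A-G followed by one digit; `encode_sub_grade` is a name I made up.

```python
import pandas as pd

GRADE_ORDER = {g: i for i, g in enumerate("ABCDEFG", start=1)}

def encode_sub_grade(s: pd.Series) -> pd.Series:
    """Vectorized version of the letter-rank * 5 + digit rule."""
    letter = s.str[0].map(GRADE_ORDER)
    digit = s.str[1].astype(int)
    return letter * 5 + digit

sub = pd.Series(["A1", "A5", "B1", "G5"])
codes = encode_sub_grade(sub)
print(codes.tolist())  # [6, 10, 11, 40]
```

The codes increase strictly from A1 to G5, so the ordering of sub-grades is preserved.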

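3 Feature Engineering

Sample quality and features determine the upper bound of the metric; the model only determines how close you get to that bound. Feature engineering is therefore the top priority. Below are some commonly used combined, cross, and brute-force features.

3.1 Interpretable Features

One way to build features in financial risk-control competitions is to measure a user's value and ability to generate profit. Below are some custom interpretable features built on that idea.

# Custom features that measure user value and profit-generating ability
# (this assumes employmentLength has already been converted to a number)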
data['avg_income'] = data['annualIncome'] / data['employmentLength']
data['total_income'] = data['annualIncome'] * data['employmentLength']
data['avg_loanAmnt'] = data['loanAmnt'] / data['term']
data['mean_interestRate'] = data['interestRate'] / data['term']
data['all_installment'] = data['installment'] * data['term']

data['rest_money_rate'] = data['avg_loanAmnt'] / (data['annualIncome'] + 0.1)  # 287 rows have zero income
data['rest_money'] = data['annualIncome'] - data['avg_loanAmnt']

data['closeAcc'] = data['totalAcc'] - data['openAcc']
data['ficoRange_mean'] = (data['ficoRangeHigh'] + data['ficoRangeLow']) / 2
del data['ficoRangeHigh'], data['ficoRangeLow']

data['rest_pubRec'] = data['pubRec'] - data['pubRecBankruptcies']
data['rest_Revol'] = data['loanAmnt'] - data['revolBal']
data['dis_time'] = data['issueDate_year'] - (2020 - data['earliesCreditLine_year'])
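
Several of the ratios above divide by employmentLength, which the post does not show being converted to a number. The sketch below is one possible preparation step, assuming the raw values look like '10+ years', '< 1 year', and '3 years' (an assumption about the data format, not something stated in the post). `parse_employment_length` and `safe_ratio` are hypothetical helpers.

```python
import numpy as np
import pandas as pd

def parse_employment_length(s: pd.Series) -> pd.Series:
    """Turn strings such as '10+ years' into a float number of years."""
    s = s.replace({"10+ years": "10 years", "< 1 year": "0 years"})
    # Unparseable values (for example a missing-value sentinel) become NaN.
    return s.str.extract(r"(\d+)")[0].astype(float)

def safe_ratio(num: pd.Series, den: pd.Series) -> pd.Series:
    """Divide, returning NaN instead of inf when the denominator is 0."""
    return num / den.replace(0, np.nan)

toy = pd.DataFrame({
    "annualIncome": [60000.0, 30000.0, 45000.0],
    "employmentLength": ["10+ years", "< 1 year", "3 years"],
})
years = parse_employment_length(toy["employmentLength"])
toy["avg_income"] = safe_ratio(toy["annualIncome"], years)
print(toy)
```

Leaving the zero-denominator rows as NaN lets tree libraries such as xgboost and lightgbm treat them as missing rather than as an extreme value.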

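3.2 Combined and Cross Features

First-order crosses between discrete (categorical) features and continuous features:

# Define the categorical and continuous features.
# Note: 'ficoRangeHigh' is deleted in section 3.1, so run this step before that
# deletion or substitute 'ficoRange_mean' here.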
col_cat = ['subGrade', 'grade', 'employmentLength', 'term', 'homeOwnership', 'postCode', 'regionCode','employmentTitle','title']
col_num = ['dti', 'revolBal','revolUtil', 'ficoRangeHigh', 'interestRate', 'loanAmnt', 'installment', 'annualIncome', 'n14',
             'n2', 'n6', 'n9', 'n5', 'n8']
             
from tqdm import tqdm


# Cross statistics between categorical and continuous features
def cross_cat_num(df, num_col, cat_col):
    for f1 in tqdm(cat_col):
        g = df.groupby(f1)
        for f2 in tqdm(num_col):
            # Named aggregation; the dict-renaming form of agg was removed in pandas 1.0
            feat = g[f2].agg(**{
                '{}_{}_max'.format(f1, f2): 'max', '{}_{}_min'.format(f1, f2): 'min',
                '{}_{}_median'.format(f1, f2): 'median',
            }).reset_index()
            df = df.merge(feat, on=f1, how='left')
    return df
    
data = cross_cat_num(data, col_num, col_cat)  # first-order crosses
print('After first-order cross features:', data.shape)
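
For a quick feel of what these cross features contain, here is a toy sketch that uses `groupby(...).transform(...)` instead of aggregate-then-merge. It is an alternative formulation on invented data, not the code used in the solution.

```python
import pandas as pd

toy = pd.DataFrame({
    "grade": ["A", "A", "B", "B", "B"],
    "loanAmnt": [1000, 3000, 2000, 8000, 5000],
})

# Each row receives the statistic of its own grade group.
for stat in ["max", "min", "median"]:
    toy[f"grade_loanAmnt_{stat}"] = toy.groupby("grade")["loanAmnt"].transform(stat)

print(toy)
```

`transform` returns a result aligned with the original rows, so no merge is needed; on wide frames this can also avoid repeatedly copying the whole frame.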

Second-order crosses between categorical features:

from scipy.stats import entropy
from tqdm import tqdm


def cross_qua_cat_num(df):
    for f_pair in tqdm([
        ['subGrade', 'regionCode'], ['grade', 'regionCode'], ['subGrade', 'postCode'], ['grade', 'postCode'], ['employmentTitle','title'],
        ['regionCode','title'], ['postCode','title'], ['homeOwnership','title'], ['homeOwnership','employmentTitle'],['homeOwnership','employmentLength'],
        ['regionCode', 'postCode']
    ]):
        # Co-occurrence count of the pair
        df['_'.join(f_pair) + '_count'] = df.groupby(f_pair)['id'].transform('count')
        # nunique and entropy of one feature within each value of the other
        # (named aggregation; the dict-renaming form of agg was removed in pandas 1.0)
        df = df.merge(df.groupby(f_pair[0])[f_pair[1]].agg(**{
            '{}_{}_nunique'.format(f_pair[0], f_pair[1]): 'nunique',
            '{}_{}_ent'.format(f_pair[0], f_pair[1]): lambda x: entropy(x.value_counts() / x.shape[0])
        }).reset_index(), on=f_pair[0], how='left')
        df = df.merge(df.groupby(f_pair[1])[f_pair[0]].agg(**{
            '{}_{}_nunique'.format(f_pair[1], f_pair[0]): 'nunique',
            '{}_{}_ent'.format(f_pair[1], f_pair[0]): lambda x: entropy(x.value_counts() / x.shape[0])
        }).reset_index(), on=f_pair[1], how='left')
        # Proportion preference; this relies on single-feature '<feature>_count'
        # columns that must already exist (their creation is not shown in this post)
        df['{}_in_{}_prop'.format(f_pair[0], f_pair[1])] = df['_'.join(f_pair) + '_count'] / df[f_pair[1] + '_count']
        df['{}_in_{}_prop'.format(f_pair[1], f_pair[0])] = df['_'.join(f_pair) + '_count'] / df[f_pair[0] + '_count']
    return df
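
The three kinds of second-order features (co-occurrence count, nunique and entropy, proportion) are easier to read on a tiny example. The sketch below uses invented data and a hand-written `shannon_entropy` helper with the natural logarithm, so it does not depend on scipy; it also creates the single-feature count column that the proportion needs.

```python
import numpy as np
import pandas as pd

def shannon_entropy(values: pd.Series) -> float:
    """Entropy (natural log) of the value distribution in `values`."""
    p = values.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log(p)).sum())

toy = pd.DataFrame({
    "grade":      ["A", "A", "A", "B", "B", "B"],
    "regionCode": [1, 1, 2, 3, 3, 3],
})

# How often each (grade, regionCode) pair occurs, and how often each grade occurs.
toy["grade_regionCode_count"] = toy.groupby(["grade", "regionCode"])["grade"].transform("count")
toy["grade_count"] = toy.groupby("grade")["grade"].transform("count")
# Share of the grade's rows that fall in this region.
toy["regionCode_in_grade_prop"] = toy["grade_regionCode_count"] / toy["grade_count"]

# Per grade: number of distinct regions and how evenly they are spread.
per_grade = toy.groupby("grade")["regionCode"].agg(nunique="nunique", ent=shannon_entropy)
print(per_grade)
```

Grade B appears in a single region, so its entropy is 0, while grade A is split 2:1 across two regions and gets an entropy of about 0.64.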

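3.3 Brute-Force Features

For the anonymous features n0-n14, a combination of row-wise statistics is used to extract brute-force features.

# Entropy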
def myEntro(x):
    """
        Calculate the Shannon entropy of x.
    """
    x = np.array(x)
    x_value_list = set([x[i] for i in range(x.shape[0])])
    ent = 0.0
    for x_value in x_value_list:
        p = float(x[x == x_value].shape[0]) / x.shape[0]
        logp = np.log2(p)
        ent -= p * logp
    #     print(x_value,p,logp)
    # print(ent)
    return ent

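# Root mean square
def myRms(records):
    """
    The root mean square reflects the effective value rather than the mean.
    """
    records = list(records)
    # np.math is deprecated and absent from recent NumPy releases; np.sqrt works everywhere
    return np.sqrt(sum([x ** 2 for x in records]) / len(records))

# Mode (the mean of the modes when there are ties)
def myMode(x):
    return np.mean(pd.Series.mode(x))
    
# Quantiles at 10%, 25%, 75% and 90%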
def myQ25(x):
    return x.quantile(0.25)
    
def myQ75(x):
    return x.quantile(0.75)

def myQ10(x):
    return x.quantile(0.1)
    
def myQ90(x):
    return x.quantile(0.9)
    
# Range of the values
def myRange(x):
    return pd.Series.max(x) - pd.Series.min(x)

n_feat = ['n0', 'n1', 'n2', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n11', 'n12', 'n13', 'n14', ]
nameList = ['min', 'max', 'sum', 'mean', 'median', 'skew', 'std', 'mode', 'range', 'Q25','Q75']
statList = ['min', 'max', 'sum', 'mean', 'median', 'skew', 'std', myMode, myRange, myQ25, myQ75]

for i in range(len(nameList)):
    data['n_feat_{}'.format(nameList[i])] = data[n_feat].agg(statList[i], axis=1)
print('After n-feature processing:', data.shape)
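
Row-wise `agg` with Python callables can be slow on a million rows. For the statistics that pandas already provides, a vectorized sketch like the one below gives the same kind of features; the three-column toy frame is invented and only some of the statistics are shown.

```python
import pandas as pd

toy = pd.DataFrame({"n0": [0, 2], "n1": [1, 2], "n2": [5, 8]})
n_cols = ["n0", "n1", "n2"]
block = toy[n_cols]

# Built-in reductions along axis=1 compute one value per row.
stats = pd.DataFrame({
    "n_feat_min": block.min(axis=1),
    "n_feat_max": block.max(axis=1),
    "n_feat_mean": block.mean(axis=1),
    "n_feat_range": block.max(axis=1) - block.min(axis=1),
    "n_feat_Q75": block.quantile(0.75, axis=1),
})
print(stats)
```

Note that if missing values were filled with -999 beforehand, these row statistics will be dominated by the sentinel, so it may be worth computing them on the unfilled columns.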

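4 Model Ensembling

4.1 Model Selection

In financial risk control, most solutions pick tree models from classical machine learning, such as Random Forests, lightgbm, xgboost, and catboost, rather than deep learning models. I think the main reasons are: 1. the number of samples is small; 2. the classes are imbalanced; 3. deep learning models learn best from features with a particular structure (such as text and images), whereas for financial features that carry real-world meaning, interpretable features fed to traditional tree models work remarkably well.

4.2 Feature Selection

Below is the xgboost model code. The feature importance it produces can be used to filter features further; it is saved in feature_importance.csv.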

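import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def xgb_model(train, target, test, k):

    feats = [f for f in train.columns if f not in ['id', 'isDefault']]
    feaNum = len(feats)
    print('Number of features used for training:', len(feats))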
#     seeds = [6666,2020]
    seeds = [2020]
    output_preds = 0
    xgb_oof_probs = np.zeros(train.shape[0])

    for seed in seeds:
        folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
        oof_probs = np.zeros(train.shape[0])

        offline_score = []
        feature_importance_df = pd.DataFrame()
        params = {'booster': 'gbtree',
                  'objective': 'binary:logistic',
                  'eval_metric': 'auc',
                  'min_child_weight': 5,
                  'max_depth': 8,
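                  'subsample': ss,  # row subsample ratio; ss is defined outside this snippet (value not given in the post)
                  'colsample_bytree': fs,  # column subsample ratio; fs is likewise defined elsewhere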
                  'eta': 0.01,
                  # 'scale_pos_weight': 0.2,
                  'seed': seed,
                  'nthread': -1,
                  'tree_method': 'gpu_hist'
                  }
        for i, (train_index, test_index) in enumerate(folds.split(train, target)):
            
            train_y, test_y = target[train_index], target[test_index]
            train_X, test_X = train[feats].iloc[train_index, :], train[feats].iloc[test_index, :]
            train_matrix = xgb.DMatrix(train_X, label=train_y, missing=np.nan)
            valid_matrix = xgb.DMatrix(test_X, label=test_y, missing=np.nan)
            test_matrix = xgb.DMatrix(test[feats], missing=np.nan)
            watchlist = [(train_matrix, 'train'), (valid_matrix, 'eval')]
            model = xgb.train(params, train_matrix, num_boost_round=30000, evals=watchlist, verbose_eval=100,
                              early_stopping_rounds=600)
            # ntree_limit was removed in XGBoost 2.0; iteration_range is its replacement
            best_range = (0, model.best_iteration + 1)
            val_pred = model.predict(valid_matrix, iteration_range=best_range)
            train_pred = model.predict(train_matrix, iteration_range=best_range)
            xgb_oof_probs[test_index] += val_pred / len(seeds)
            # oof_probs[test_index] += val_pred
            test_pred = model.predict(test_matrix, iteration_range=best_range)

            # Plot the ROC curve (plotroc is a helper that is not shown in this post)
            train_auc_value, valid_auc_value = plotroc(train_y, train_pred, test_y, val_pred)
            print('train_auc:{},valid_auc:{}'.format(train_auc_value, valid_auc_value))
            offline_score.append(valid_auc_value)
            print(offline_score)
            output_preds += test_pred / k / len(seeds)
         
            fold_importance_df = pd.DataFrame()
            fold_importance_df["Feature"] = model.get_fscore().keys()
            fold_importance_df["importance"] = model.get_fscore().values()
            fold_importance_df["fold"] = i + 1

            feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)



        # Score the accumulated out-of-fold predictions; oof_probs is never filled above
        print('all_auc:', roc_auc_score(target.values, xgb_oof_probs))
        print('OOF-MEAN-AUC:%.6f, OOF-STD-AUC:%.6f' % (np.mean(offline_score), np.std(offline_score)))
        feature_sorted = feature_importance_df.groupby(['Feature'])['importance'].mean().sort_values(ascending=False)
        feature_sorted.to_csv('feature_importance.csv')
        top_features = feature_sorted.index
        print(feature_importance_df.groupby(['Feature'])['importance'].mean().sort_values(ascending=False).head(50))
    return output_preds, xgb_oof_probs, np.mean(offline_score), feaNum
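
How the importance file is turned into a feature list is not shown in the post. A minimal sketch, assuming the importances are available as a pandas Series indexed by feature name (the values below are invented, and `select_top_features` is a hypothetical helper), could look like this:

```python
import pandas as pd

def select_top_features(importance: pd.Series, top_n=None, min_importance=0.0):
    """Return feature names ranked by importance, optionally truncated."""
    ranked = importance.sort_values(ascending=False)
    # Drop features at or below the threshold.
    ranked = ranked[ranked > min_importance]
    if top_n is not None:
        ranked = ranked.head(top_n)
    return ranked.index.tolist()

# Invented importances for illustration only.
importance = pd.Series(
    {"interestRate": 120.0, "subGrade": 95.5, "n_feat_sum": 3.0, "policyCode": 0.0},
    name="importance",
)
kept = select_top_features(importance, top_n=2)
nonzero = select_top_features(importance)
print(kept, nonzero)
```

Keep in mind that `get_fscore()` only lists features that were used in at least one split, so features absent from feature_importance.csv are already candidates for removal.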

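4.3 Diverse Models and Stacking

[Figure: model ensembling diagram]
The purpose of building diverse models is to improve the stability and robustness of the system and to prevent jitter. In the end I built a pipeline of lightgbm, catboost, and xgboost models, used the Pearson correlation coefficient to analyze how much the model outputs differ, selected the result files that differ most, and improved the result further with a two-layer stacking ensemble whose second layer is an RF.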
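
The two steps described above, checking how correlated the model outputs are and then stacking them with a random forest, can be sketched on synthetic data as follows. The prediction columns are randomly generated stand-ins for out-of-fold outputs of the three models, and the RF settings are arbitrary; none of this reproduces the actual pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)

# Synthetic stand-ins for out-of-fold predictions of three first-layer models.
oof = pd.DataFrame({
    "lgb": np.clip(y * 0.6 + rng.normal(0.2, 0.2, 200), 0, 1),
    "xgb": np.clip(y * 0.6 + rng.normal(0.2, 0.2, 200), 0, 1),
    "cat": np.clip(y * 0.5 + rng.normal(0.25, 0.25, 200), 0, 1),
})

# Pearson correlation between model outputs; lower values mean more diversity.
corr = oof.corr(method="pearson")
print(corr.round(3))

# Second layer: a small random forest trained on the first-layer predictions.
meta = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0)
meta.fit(oof, y)
stacked = meta.predict_proba(oof)[:, 1]
```

In a real run the second layer would be fitted on the training-set out-of-fold predictions and then applied to the first-layer predictions for the test set, so that no label information leaks into the final output.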

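5 Summary

This is my first time sharing competition experience, and I hope you find it useful. As a newcomer to the competition scene, I look forward to exchanging ideas with you: caohuan8@mail.ustc.edu.cn. The related code is on my GitHub homepage: code link.
If you are interested in competitions, you are welcome to follow the WeChat official accounts Coggle資料科學 and DataWhale.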

Please credit the source when reposting. Link to this article: https://www.uj5u.com/qita/240038.html
