背景

關于用戶留存有這樣一個觀點，如果將用戶流失率降低5%，公司利潤將提升25%-85%，如今高居不下的獲客成本讓電信運營商遭遇“天花板”，甚至陷入獲客難的窘境，隨著市場飽和度上升，電信運營商亟待解決增加用戶黏性，延長用戶生命周期的問題，因此，電信用戶流失分析與預測至關重要，
資料集來自kesci中的“電信運營商客戶資料集”
資料集：添加鏈接描述

本文將從以下方面進行分析：

1.背景
2.提出問題
3.理解資料
4.資料清洗
5.可視化分析
6.用戶流失預測
7.結論和建議

提出問題

1.分析用戶特征與流失的關系，
2.從整體情況看，流失用戶的普遍具有哪些特征？
3.嘗試找到合適的模型預測流失用戶，
4.針對性給出增加用戶黏性、預防流失的建議，

理解資料

根據介紹，該資料集有21個欄位，共7043條記錄，每條記錄包含了唯一客戶的特征，
我們目標就是發現前20列特征和最后一列客戶是否流失特征之間的關系，

資料清洗

資料清洗的“完全合一”規則：

完整性：單條資料是否存在空值，統計的欄位是否完善，
全面性：觀察某一列的全部數值，通過常識來判斷該列是否有問題，比如：資料定義、單位標識、資料本身，
合法性：資料的型別、內容、大小的合法性，比如資料中是否存在非ASCII字符，性別存在了未知，年齡超過了150等，
唯一性：資料是否存在重復記錄，因為資料通常來自不同渠道的匯總，重復的情況是常見的，行資料、列資料都需要是唯一的，
匯入工具包，

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
customerDF = pd.read_csv('/home/kesci/input/yidong4170/WA_Fn-UseC_-Telco-Customer-Churn.csv')
# 查看資料集大小
customerDF.shape
# 運行結果：(7043, 21)

# 設定查看列不省略
pd.set_option('display.max_columns',None)

# 查看前10條資料
customerDF.head(10)

在這里插入圖片描述

# Null計數
pd.isnull(customerDF).sum()
# 查看資料型別
customerDF.info()
#customerDf.dtypes

#將‘TotalCharges’總消費額的資料型別轉換為浮點型，發現錯#誤：字串無法轉換為數字，

依次檢查各個欄位的資料型別、欄位內容和數量，最后發現“TotalCharges”（總消費額）列有11個用戶資料缺失，

# 查看每一列資料取值
for x in customerDF.columns:
    test=customerDF.loc[:,x].value_counts()
    print('{0} 的行數是：{1}'.format(x,test.sum()))
    print('{0} 的資料型別是：{1}'.format(x,customerDF[x].dtypes))
    print('{0} 的內容是：\n{1}\n'.format(x,test))

在這里插入圖片描述
采用強制轉換，將“TotalCharges”（總消費額）轉換為浮點型資料，

#強制轉換為數字，不可轉換的變為NaN
customerDF['TotalCharges']=customerDF['TotalCharges'].convert_objects(convert_numeric=True)

#強制轉換為數字，不可轉換的變為NaN
customerDF[‘TotalCharges’]=customerDF[‘TotalCharges’].convert_objects(convert_numeric=True)

test=customerDF.loc[:,'TotalCharges'].value_counts().sort_index()
print(test.sum())
#運行結果：7032

print(customerDF.tenure[customerDF['TotalCharges'].isnull().values==True])
#運行結果：11

#將總消費額填充為月消費額
customerDF.loc[:,'TotalCharges'].replace(to_replace=np.nan,value=customerDF.loc[:,'MonthlyCharges'],inplace=True)
#查看是否替換成功
print(customerDF[customerDF['tenure']==0][['tenure','MonthlyCharges','TotalCharges']])

# 將‘tenure’入網時長從0修改為1
customerDF.loc[:,'tenure'].replace(to_replace=0,value=1,inplace=True)
print(pd.isnull(customerDF['TotalCharges']).sum())
print(customerDF['TotalCharges'].dtypes)

查看資料的描述統計資訊，根據一般經驗，所有資料正常，

查看資料的描述統計資訊，根據一般經驗，所有資料正常，

在這里插入圖片描述

可視化分析

根據一般經驗，將用戶特征劃分為用戶屬性、服務屬性、合同屬性，并從這三個維度進行可視化分析，

查看流失用戶數量和占比，

plt.rcParams['figure.figsize']=6,6
plt.pie(customerDF['Churn'].value_counts(),labels=customerDF['Churn'].value_counts().index,autopct='%1.2f%%',explode=(0.1,0))
plt.title('Churn(Yes/No) Ratio')
plt.show()

在這里插入圖片描述

churnDf=customerDF['Churn'].value_counts().to_frame()
x=churnDf.index
y=churnDf['Churn']
plt.bar(x,y,width = 0.5,color = 'c')

#用來正常顯示中文標簽（需要安裝字庫）
plt.title('Churn(Yes/No) Num')
plt.show()

在這里插入圖片描述

屬于不平衡資料集，流失用戶占比達26.54%，

（1）用戶屬性分析

def barplot_percentages(feature,orient='v',axis_name="percentage of customers"):
    ratios = pd.DataFrame()
    g = (customerDF.groupby(feature)["Churn"].value_counts()/len(customerDF)).to_frame()
    g.rename(columns={"Churn":axis_name},inplace=True)
    g.reset_index(inplace=True)

    #print(g)
    if orient == 'v':
        ax = sns.barplot(x=feature, y= axis_name, hue='Churn', data=g, orient=orient)
        ax.set_yticklabels(['{:,.0%}'.format(y) for y in ax.get_yticks()])
        plt.rcParams.update({'font.size': 13})
        #plt.legend(fontsize=10)
    else:
        ax = sns.barplot(x= axis_name, y=feature, hue='Churn', data=g, orient=orient)
        ax.set_xticklabels(['{:,.0%}'.format(x) for x in ax.get_xticks()])
        plt.legend(fontsize=10)
    plt.title('Churn(Yes/No) Ratio as {0}'.format(feature))
    plt.show()
barplot_percentages("SeniorCitizen")
barplot_percentages("gender")

在這里插入圖片描述

customerDF['churn_rate'] = customerDF['Churn'].replace("No", 0).replace("Yes", 1)
g = sns.FacetGrid(customerDF, col="SeniorCitizen", height=4, aspect=.9)
ax = g.map(sns.barplot, "gender", "churn_rate", palette = "Blues_d", order= ['Female', 'Male'])
plt.rcParams.update({'font.size': 13})
plt.show()

在這里插入圖片描述

小結：
用戶流失與性別基本無關；
年老用戶流失占顯著高于年輕用戶，

fig, axis = plt.subplots(1, 2, figsize=(12,4))
axis[0].set_title("Has Partner")
axis[1].set_title("Has Dependents")
axis_y = "percentage of customers"

# Plot Partner column
gp_partner = (customerDF.groupby('Partner')["Churn"].value_counts()/len(customerDF)).to_frame()
gp_partner.rename(columns={"Churn": axis_y}, inplace=True)
gp_partner.reset_index(inplace=True)
ax1 = sns.barplot(x='Partner', y= axis_y, hue='Churn', data=gp_partner, ax=axis[0])
ax1.legend(fontsize=10)
#ax1.set_xlabel('伴侶')


# Plot Dependents column
gp_dep = (customerDF.groupby('Dependents')["Churn"].value_counts()/len(customerDF)).to_frame()
#print(gp_dep)
gp_dep.rename(columns={"Churn": axis_y} , inplace=True)
#print(gp_dep)
gp_dep.reset_index(inplace=True)
#print(gp_dep)

ax2 = sns.barplot(x='Dependents', y= axis_y, hue='Churn', data=gp_dep, ax=axis[1])
#ax2.set_xlabel('家屬')


#設定字體大小
plt.rcParams.update({'font.size': 20})
ax2.legend(fontsize=10)

#設定
plt.show()

在這里插入圖片描述

# Kernel density estimaton核密度估計
def kdeplot(feature,xlabel):
    plt.figure(figsize=(9, 4))
    plt.title("KDE for {0}".format(feature))
    ax0 = sns.kdeplot(customerDF[customerDF['Churn'] == 'No'][feature].dropna(), color= 'navy', label= 'Churn: No', shade='True')
    ax1 = sns.kdeplot(customerDF[customerDF['Churn'] == 'Yes'][feature].dropna(), color= 'orange', label= 'Churn: Yes',shade='True')
    plt.xlabel(xlabel)
    #設定字體大小
    plt.rcParams.update({'font.size': 20})
    plt.legend(fontsize=10)
kdeplot('tenure','tenure')
plt.show()

在這里插入圖片描述

小結：
有伴侶的用戶流失占比低于無伴侶用戶；
有家屬的用戶較少；
有家屬的用戶流失占比低于無家屬用戶;
在網時長越久，流失率越低，符合一般經驗；
在網時間達到三個月，流失率小于在網率，證明用戶心理穩定期一般是三個月，

（2）服務屬性分析

plt.figure(figsize=(9, 4.5))
barplot_percentages("MultipleLines", orient='h')

在這里插入圖片描述

plt.figure(figsize=(9, 4.5))
barplot_percentages("InternetService", orient="h")

在這里插入圖片描述

cols = ["PhoneService","MultipleLines","OnlineSecurity", "OnlineBackup", "DeviceProtection", "TechSupport", "StreamingTV", "StreamingMovies"]
df1 = pd.melt(customerDF[customerDF["InternetService"] != "No"][cols])
df1.rename(columns={'value': 'Has service'},inplace=True)
plt.figure(figsize=(20, 8))
ax = sns.countplot(data=df1, x='variable', hue='Has service')
ax.set(xlabel='Internet Additional service', ylabel='Num of customers')
plt.rcParams.update({'font.size':20})
plt.legend( labels = ['No Service', 'Has Service'],fontsize=15)
plt.title('Num of Customers as Internet Additional Service')
plt.show()

在這里插入圖片描述

plt.figure(figsize=(20, 8))
df1 = customerDF[(customerDF.InternetService != "No") & (customerDF.Churn == "Yes")]
df1 = pd.melt(df1[cols])
df1.rename(columns={'value': 'Has service'}, inplace=True)
ax = sns.countplot(data=df1, x='variable', hue='Has service', hue_order=['No', 'Yes'])
ax.set(xlabel='Internet Additional service', ylabel='Churn Num')
plt.rcParams.update({'font.size':20})
plt.legend( labels = ['No Service', 'Has Service'],fontsize=15)
plt.title('Num of Churn Customers as Internet Additional Service')
plt.show()

在這里插入圖片描述
電話服務整體對用戶流失影響較小，
單光纖用戶的流失占比較高；
光纖用戶系結了安全、備份、保護、技術支持服務的流失率較低；
光纖用戶附加流媒體電視、電影服務的流失率占比較高，

（3）合同屬性分析?

plt.figure(figsize=(9, 4.5))
barplot_percentages("PaymentMethod",orient='h')

在這里插入圖片描述

g = sns.FacetGrid(customerDF, col="PaperlessBilling", height=6, aspect=.9)
ax = g.map(sns.barplot, "Contract", "churn_rate", palette = "Blues_d", order= ['Month-to-month', 'One year', 'Two year'])
plt.rcParams.update({'font.size':18})
plt.show()

在這里插入圖片描述

kdeplot('MonthlyCharges','MonthlyCharges')
kdeplot('TotalCharges','TotalCharges')
plt.show()

在這里插入圖片描述

小結：

采用電子支票支付的用戶流失率最高，推測該方式的使用體驗較為一般；
簽訂合同方式對客戶流失率影響為：按月簽訂 > 按一年簽訂 > 按兩年簽訂，證明長期合同最能保留客戶；
月消費額大約在70-110之間用戶流失率較高；
長期來看，用戶總消費越高，流失率越低，符合一般經驗，

用戶流失預測

對資料集進一步清洗和提取特征，通過特征選取對資料進行降維，采用機器學習模型應用于測驗資料集，然后對構建的分類模型準確性進行分析

（1）資料清洗

customerID=customerDF['customerID']
customerDF.drop(['customerID'],axis=1, inplace=True)
cateCols = [c for c in customerDF.columns if customerDF[c].dtype == 'object' or c == 'SeniorCitizen']
dfCate = customerDF[cateCols].copy()
dfCate.head(3)

#進行特征編碼，

for col in cateCols:
    if dfCate[col].nunique() == 2:
        dfCate[col] = pd.factorize(dfCate[col])[0]
    else:
        dfCate = pd.get_dummies(dfCate, columns=[col])
dfCate['tenure']=customerDF[['tenure']]
dfCate['MonthlyCharges']=customerDF[['MonthlyCharges']]
dfCate['TotalCharges']=customerDF[['TotalCharges']]

##查看關聯關系

plt.figure(figsize=(16,8))
dfCate.corr()['Churn'].sort_values(ascending=False).plot(kind='bar')
plt.show()

在這里插入圖片描述

（2）特征選取

# 特征選擇
dropFea = ['gender','PhoneService',
           'OnlineSecurity_No internet service', 'OnlineBackup_No internet service',
           'DeviceProtection_No internet service', 'TechSupport_No internet service',
           'StreamingTV_No internet service', 'StreamingMovies_No internet service',
           #'OnlineSecurity_No', 'OnlineBackup_No',
           #'DeviceProtection_No','TechSupport_No',
           #'StreamingTV_No', 'StreamingMovies_No',
           ]
dfCate.drop(dropFea, inplace=True, axis =1) 
#最后一列是作為標識
target = dfCate['Churn'].values
#串列：特征和1個標識
columns = dfCate.columns.tolist()

（3）構建模型

# 構造各種分類器
classifiers = [
    SVC(random_state = 1, kernel = 'rbf'),    
    DecisionTreeClassifier(random_state = 1, criterion = 'gini'),
    RandomForestClassifier(random_state = 1, criterion = 'gini'),
    KNeighborsClassifier(metric = 'minkowski'),
    AdaBoostClassifier(random_state = 1),   
]
# 分類器名稱
classifier_names = [
            'svc', 
            'decisiontreeclassifier',
            'randomforestclassifier',
            'kneighborsclassifier',
            'adaboostclassifier',
]
# 分類器引數
#注意分類器的引數，字典鍵的格式，GridSearchCV對調優的引數格式是"分類器名"+"__"+"引數名"
classifier_param_grid = [
            {'svc__C':[0.1], 'svc__gamma':[0.01]},
            {'decisiontreeclassifier__max_depth':[6,9,11]},
            {'randomforestclassifier__n_estimators':range(1,11)} ,
            {'kneighborsclassifier__n_neighbors':[4,6,8]},
            {'adaboostclassifier__n_estimators':[70,80,90]}
]

(4）模型引數調優和評估

對分類器進行引數調優和評估，最后得到試用AdaBoostClassifier(n_estimators=80)效果最好，

# 對具體的分類器進行 GridSearchCV 引數調優
def GridSearchCV_work(pipeline, train_x, train_y, test_x, test_y, param_grid, score = 'accuracy_score'):
    response = {}
    gridsearch = GridSearchCV(estimator = pipeline, param_grid = param_grid, cv=3, scoring = score)
    # 尋找最優的引數 和最優的準確率分數
    search = gridsearch.fit(train_x, train_y)
    print("GridSearch 最優引數：", search.best_params_)
    print("GridSearch 最優分數： %0.4lf" %search.best_score_)
    #采用predict函式（特征是測驗資料集）來預測標識，預測使用的引數是上一步得到的最優引數
    predict_y = gridsearch.predict(test_x)
    print(" 準確率 %0.4lf" %accuracy_score(test_y, predict_y))
    response['predict_y'] = predict_y
    response['accuracy_score'] = accuracy_score(test_y,predict_y)
    return response
 
for model, model_name, model_param_grid in zip(classifiers, classifier_names, classifier_param_grid):
    #采用 StandardScaler 方法對資料規范化：均值為0，方差為1的正態分布
    pipeline = Pipeline([
            #('scaler', StandardScaler()),
            #('pca',PCA),
            (model_name, model)
    ])
    result = GridSearchCV_work(pipeline, train_x, train_y, test_x, test_y, model_param_grid , score = 'accuracy')

結論和建議

根據以上分析，得到高流失率用戶的特征：

用戶屬性：老年用戶，未婚用戶，無親屬用戶更容易流失；
服務屬性：在網時長小于半年，有電話服務，光纖用戶/光纖用戶附加流媒體電視、電影服務，無互聯網增值服務；
合同屬性：簽訂的合同期較短，采用電子支票支付，是電子賬單，月租費約70-110元的客戶容易流失；
其它屬性對用戶流失影響較小，以上特征保持獨立，
針對上述結論，從業務角度給出相應建議：

根據預測模型，構建一個高流失率的用戶串列，通過用戶調研推出一個最小可行化產品功能，并邀請種子用戶進行試用，
用戶方面：針對老年用戶、無親屬、無伴侶用戶的特征退出定制服務如親屬套餐、溫暖套餐等，一方面加強與其它用戶關聯度，另一方對特定用戶提供個性化服務，
服務方面：針對新注冊用戶，推送半年優惠如贈送消費券，以渡過用戶流失高峰期，針對光纖用戶和附加流媒體電視、電影服務用戶，重點在于提升網路體驗、增值服務體驗，一方面推動技術部門提升網路指標，另一方面對用戶承諾免費網路升級和贈送電視、電影等包月服務以提升用戶黏性，針對在線安全、在線備份、設備保護、技術支持等增值服務，應重點對用戶進行推廣介紹，如首月/半年免費體驗，
合同方面：針對單月合同用戶，建議推出年合同付費折扣活動，將月合同用戶轉化為年合同用戶，提高用戶在網時長，以達到更高的用戶留存，針對采用電子支票支付用戶，建議定向推送其它支付方式的優惠券，引導用戶改變支付方式，

轉載請註明出處，本文鏈接：https://www.uj5u.com/qita/374640.html

標籤：其他

上一篇：奧賽一本通1026：空格分隔輸出

下一篇：元宇宙大比拼：英偉達、facebook、iwemeta

【資料分析與挖掘實戰】電信用戶流失分析與預測

背景

本文將從以下方面進行分析：

提出問題

理解資料

資料清洗

可視化分析

（1）用戶屬性分析

（2）服務屬性分析

（3）合同屬性分析?

用戶流失預測

（1）資料清洗

（2）特征選取

（3）構建模型

(4）模型引數調優和評估

結論和建議