一、賽題決議

【阿里云天池演算法挑戰賽】零基礎入門NLP - 新聞文本分類-Day1-賽題理解_202xxx的博客-CSDN博客

二、資料讀取

下載完成資料后推薦使用anaconda，python3.8進行資料讀取與模型訓練

首先安裝需要用到的模塊包：

pip版本：

pip添加國內源，增加下載速度_202xxx的博客-CSDN博客

pip install pandas -i https://pypi.tuna.tsinghua.edu.cn/simple

conda版本：

新建conda虛擬環境：

conda create --name py38 python=3.8
conda activate py38

conda添加國內源_202xxx的博客-CSDN博客

conda install pandas

用讀取資料，train_dir為訓練集的存盤路徑，nrows為讀取資料的行數

import pandas as pd
train_dir = '../data/train_set.csv'
nrows=None
train_df = pd.read_csv(train_dir, sep='\t', nrows=nrows)

查看前5條新聞資料

train_df.head()

	label	text
0	2	2967 6758 339 2021 1854 3731 4109 3792 4149 15...
1	11	4464 486 6352 5619 2465 4802 1452 3137 5778 54...
2	3	7346 4068 5074 3747 5681 6093 1777 2226 7354 6...
3	2	7159 948 4866 2109 5520 2490 211 3956 5520 549...
4	3	3646 3055 3055 2490 4659 6065 3370 5814 2465 5...

三、資料分析

句子長度分布

統計每篇新聞的長度（詞的數量）

import matplotlib.pyplot as plt
train_df['text_len'] = train_df['text'].apply(lambda x: len(x.split(' ')))
print(train_df['text_len'].describe())

用describe()函式得到新聞長度的統計特征值

count    200000.000000
mean        907.207110
std         996.029036
min           2.000000
25%         374.000000
50%         676.000000
75%        1131.000000
max       57921.000000
Name: text_len, dtype: float64

得到訓練資料新聞總數為200000篇，其中最短的新聞字數為2，最長的新聞字數為57921，平均紫薯為907.2

繪制句子長度的直方圖：

_ = plt.hist(train_df['text_len'], bins=200)
plt.xlabel('Text char count')
plt.title("Histogram of char count")

可見大部分句子長度都在6000以內

新聞類別分布

繪制新聞類別標簽的統計直方圖

train_df['label'].value_counts().astype("object").plot(kind='bar')
plt.title('News class count')
plt.xlabel("category")

在資料集中標簽的對應的關系如下：{'科技': 0, '股票': 1, '體育': 2, '娛樂': 3, '時政': 4, '社會': 5, '教育': 6, '財經': 7, '家居': 8, '游戲': 9, '房產': 10, '時尚': 11, '彩票': 12, '星座': 13}

可以看出數量最多的新聞為科技類，最少的為星座類，

字符分布統計

將訓練資料所有字符進行合并，統計出每個字符出現的頻數

from collections import Counter
all_lines = ' '.join(list(train_df['text']))
word_count = Counter(all_lines.split(" "))
word_count = sorted(word_count.items(), key=lambda d:d[1], reverse = True)

print(len(word_count))

print(word_count[0])

print(word_count[-1])

可以得到，訓練資料用到的字符數為6869個，其中用的最多的字符編號為“3750”，用了748萬次，最少字符編號為“3133”，用了2次：

6869
('3750', 7482224)
('3133', 1)

現對每篇文章字符進行去重再統計

from collections import Counter
train_df['text_unique'] = train_df['text'].apply(lambda x: ' '.join(list(set(x.split(' ')))))
all_lines = ' '.join(list(train_df['text_unique']))
word_count = Counter(all_lines.split(" "))
word_count = sorted(word_count.items(), key=lambda d:int(d[1]), reverse = True)

print(word_count[0])

print(word_count[1])

print(word_count[2])

可以得到用到編號“3750”的文章數量有197997篇，總文章數量為200000篇，因此可以猜測其為標點符號，

('3750', 197997)
('900', 197653)
('648', 191975)

四、作業思路

1. 假設字符3750，字符900和字符648是句子的標點符號，請分析賽題每篇新聞平均由多少個句子構成？

答：分別統計每篇新聞中包含有字符3750，字符900和字符648的數量，每篇新聞的句子數量=標點符號出現的數量，因此求平均可以得到每篇新聞平均由多少個句子構成，等價于已知道200000篇新聞中出現字符3750的數量為7482224次，字符900的數量為3262544，字符648的數量為4924890，求平均得到每篇新聞的句子數量為78.

#作業1
from collections import Counter
all_lines = ' '.join(list(train_df['text']))
word_count = Counter(all_lines.split(" "))

print(word_count["3750"])
print(word_count["900"])
print(word_count["648"])
print((word_count["3750"] + word_count["900"] + word_count["648"])/train_df.shape[0])

2.統計每類新聞中出現次數最多的字符？

答：根據label對df中的text新聞進行分組拼接，分別計算每個label中的新聞最大字符和數量

train_df_label = train_df[["label", "text"]].groupby("label").apply(lambda x:" ".join(x["text"])).reset_index()
train_df_label.columns = [["label", "text"]]
from collections import Counter
train_df_label['text_max'] = train_df_label["text"].apply(lambda x:sorted(Counter(x["text"].split(" ")).items(), key=lambda d:int(d[1]), reverse = True)[0], axis = 1)
train_df_label

	label	text	text_max
0	0	3659 3659 1903 1866 4326 4744 7239 3479 4261 4...	(3750, 1267331)
1	1	4412 5988 5036 4216 7539 5644 1906 2380 2252 6...	(3750, 1200686)
2	2	2967 6758 339 2021 1854 3731 4109 3792 4149 15...	(3750, 1458331)
3	3	7346 4068 5074 3747 5681 6093 1777 2226 7354 6...	(3750, 774668)
4	4	3772 4269 3433 6122 2035 4531 465 6565 498 358...	(3750, 360839)
5	5	2827 2444 7399 3528 2260 6127 1871 119 3615 57...	(3750, 715740)
6	6	5284 1779 2109 6248 7039 5677 1816 5430 3154 1...	(3750, 469540)
7	7	6469 1066 1623 1018 3694 4089 3809 4516 6656 3...	(3750, 428638)
8	8	2087 730 5166 3300 7539 1722 5305 913 4326 669...	(3750, 242367)
9	9	3819 4525 1129 6725 6485 2109 3800 5264 1006 4...	(3750, 178783)
10	10	26 4270 1866 5977 3523 3764 4464 3659 4853 517...	(3750, 180259)
11	11	4464 486 6352 5619 2465 4802 1452 3137 5778 54...	(3750, 83834)
12	12	2708 2218 5915 4559 886 1241 4819 314 4261 166...	(3750, 87412)
13	13	1903 2112 3019 3607 7539 3864 4939 4768 3420 2...	(3750, 33796)