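News classification with a TF-IDF bag-of-words model

A bag-of-words (BoW) model ignores word order; it is the unigram (1-gram) model.
The bigram (2-gram) model counts pairs of adjacent words.
The n-gram model generalizes this to sequences of n adjacent words.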
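
To make the difference concrete, here is a minimal sketch in plain Python. The helper name `ngrams` and the toy sentences are made up for this illustration and are not part of the original code. Two sentences with the same words in a different order have identical unigram counts, while their bigram counts differ.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

a = ["dog", "bites", "man"]
b = ["man", "bites", "dog"]

# Unigram (bag-of-words) counts ignore order, so the two sentences match.
same_unigrams = Counter(ngrams(a, 1)) == Counter(ngrams(b, 1))
# Bigram counts keep adjacent-word order, so they differ.
same_bigrams = Counter(ngrams(a, 2)) == Counter(ngrams(b, 2))
print(same_unigrams, same_bigrams)  # True False
```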

# Create the output directory for saving the trained model
import os  # file and directory operations
output_dir = u'output'
if not os.path.exists(output_dir):
    os.mkdir(output_dir)

Loading the data

import numpy as np  # a widely used data-analysis library; its arrays are more efficient than Python's built-in data structures
import pandas as pd

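1. Pandas is an open-source Python library built on NumPy. It is widely used for fast data analysis as well as data cleaning and preparation. The name comes from "panel data", an econometrics term. Put simply, you can think of Pandas as Excel for Python.
2. Pandas handles data from many different sources, such as Excel spreadsheets, CSV files, and SQL databases, and can even read data stored on web pages.
3. Pandas is built on NumPy and is often used together with NumPy and matplotlib.
4. Pandas has two main data structures:
   Series: one-dimensional
   DataFrame: two-dimensional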

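A Python list stores pointers to objects. For example, [0, 1, 2] has to hold 3 pointers plus 3 separate integer objects, which wastes memory.

A NumPy array stores its elements in one contiguous block of memory, which saves both memory and computation.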
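
The memory difference can be sketched with the standard library alone. This is only an illustration: the stdlib `array` type is used here as a stand-in for a NumPy buffer (an assumption made so the example has no dependencies), and the exact size of the list depends on the Python build.

```python
import sys
from array import array

values = [0, 1, 2]

# A list holds one pointer per element, and each element is a separate int object.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# array('q') packs 8-byte signed integers back to back in a single buffer,
# the same layout idea as a NumPy int64 array.
packed = array("q", values)
packed_bytes = packed.itemsize * len(packed)

print(list_bytes, packed_bytes)
```

The packed payload is 24 bytes (3 values of 8 bytes each), while the list version is several times larger.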

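# Inspect the training data (columns: 頻道 = channel, 文章 = article)
# on_bad_lines='skip' replaces the deprecated error_bad_lines=False (pandas >= 1.3);
# sep and delimiter are aliases, so only sep is passed
train_data = pd.read_csv('sohu_train.txt', sep='\t', header=None, dtype=np.str_, encoding='utf8', on_bad_lines='skip', names=[u'頻道', u'文章'])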
train_data.head()

# Load the stop words
stopwords = set()
with open('stopwords.txt', 'r',encoding='utf8') as infile:
    for line in infile:
        line = line.rstrip('\n')
        if line:
            stopwords.add(line.lower())

Computing TF-IDF features for each article

import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

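min_df drops terms whose document frequency (df) is below the threshold; such terms are usually highly specialized or rare words and act as noise. An integer value, such as min_df=50 here, is an absolute number of documents.
max_df drops terms whose df is above the threshold; such terms are common words that carry little information. A float value, such as max_df=0.3 here, is a proportion, so terms appearing in more than 30% of the documents are removed.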
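
As a rough illustration of how these two thresholds act, the following sketch filters a toy vocabulary by document frequency. It is plain Python rather than scikit-learn's implementation, and the helper name `filter_by_df` and the toy documents are made up for this example.

```python
from collections import Counter

def filter_by_df(docs, min_df, max_df):
    """Keep terms whose document count is >= min_df and whose document share is <= max_df."""
    n_docs = len(docs)
    # Document frequency: the number of documents in which a term occurs at least once.
    df = Counter(term for doc in docs for term in set(doc))
    return sorted(t for t, c in df.items() if c >= min_df and c / n_docs <= max_df)

docs = [
    ["the", "match", "score"],
    ["the", "stock", "price"],
    ["the", "match", "result"],
    ["the", "stock", "market"],
]
print(filter_by_df(docs, min_df=2, max_df=0.5))  # ['match', 'stock']
```

Here "the" is dropped by max_df because it occurs in every document, and the words that occur only once are dropped by min_df.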

tfidf = TfidfVectorizer(tokenizer=jieba.lcut, stop_words=list(stopwords), min_df=50, max_df=0.3)  # instantiate TfidfVectorizer; recent scikit-learn versions expect stop_words as a list
x = tfidf.fit_transform(train_data[u'文章'])

· Output

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\10248\AppData\Local\Temp\jieba.cache
Loading model cost 0.550 seconds.
Prefix dict has been built successfully.
E:\ANACODAN\lib\site-packages\sklearn\feature_extraction\text.py:388: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['&', ',', '.', ';', 'e', 'g', 'nbsp', '—', '\u3000', '儻', '兼', '前', '唷', '啪', '啷', '喔', '始', '漫', '然', '特', '竟', '若果', '莫', '見', '設', '說', '達', '非'] not in stop_words.
  warnings.warn('Your stop_words may be inconsistent with '

print(u'Vocabulary size: {}'.format(len(tfidf.vocabulary_)))

Vocabulary size: 14516
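
For intuition about the weights behind this vocabulary, here is a hand-rolled sketch of TF-IDF. It assumes scikit-learn's documented defaults (raw term counts, smoothed idf = ln((1 + n) / (1 + df)) + 1, and L2-normalized rows); it is not the library's code, and the helper name `tfidf_rows` and the toy documents are hypothetical. It also assumes every document is non-empty.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """TF-IDF with smoothed idf and L2-normalized rows, one dict per document."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for doc in docs:
        # Raw term count times idf, then scale the row to unit length.
        weights = {t: tf * idf[t] for t, tf in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

docs = [["goal", "match", "goal"], ["stock", "match"]]
for row in tfidf_rows(docs):
    print({t: round(w, 3) for t, w in row.items()})
```

In this toy corpus "match" appears in both documents, so it gets a lower weight than "goal" or "stock", which each appear in only one.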

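Training the classifier

Encode the target variable: our labels are strings, so we map them to integer class ids. scikit-learn classifiers can also take string labels directly, but an explicit encoder makes it easy to map predictions back to channel names later.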

from sklearn.preprocessing import LabelEncoder  # LabelEncoder maps class labels to integers
y_encoder = LabelEncoder()
y = y_encoder.fit_transform(train_data[u'頻道'])  # convert the channels to 0, 1, 2, ..., 11
y[:10]

· Output

array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3])
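
The mapping itself is simple. The sketch below mimics it in plain Python with made-up English labels; it assumes, as LabelEncoder does, that class ids follow the sorted order of the unique labels. The ten leading 3s above presumably just mean that the first ten training rows belong to the same channel.

```python
labels = ["sports", "finance", "sports", "travel", "finance"]

# Sorted unique labels define the class order, like LabelEncoder.classes_.
classes = sorted(set(labels))
to_id = {label: i for i, label in enumerate(classes)}

encoded = [to_id[label] for label in labels]  # like fit_transform
decoded = [classes[i] for i in encoded]       # like inverse_transform
print(classes, encoded)  # ['finance', 'sports', 'travel'] [1, 0, 1, 2, 0]
```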

Encode the X variable (this repeats the transform already done by fit_transform, so x is unchanged)
x = tfidf.transform(train_data[u'文章'])

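# Split into training and test data
from sklearn.model_selection import train_test_split  # dataset splitting
# Stratified sampling on y; the test set is 20% of the data
# The feature matrix is large, so we split the row indices rather than the matrix itself
train_idx, test_idx = train_test_split(range(len(y)), test_size=0.2, stratify=y)
train_x = x[train_idx, :]  # training set
train_y = y[train_idx]
test_x = x[test_idx, :]  # test set
test_y = y[test_idx]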
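
What stratifying on y means can be sketched as follows. This is a simplified stand-in for train_test_split, not its actual algorithm (the rounding in particular is simplified), and the function name `stratified_split_indices` is made up for this example. The point is that each class contributes the same share of rows to the test set.

```python
import random
from collections import defaultdict

def stratified_split_indices(y, test_size, seed=0):
    """Split row indices so that every class contributes the same test share."""
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # simplified rounding
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

y = [0] * 10 + [1] * 5
train_idx, test_idx = stratified_split_indices(y, test_size=0.2)
print(len(train_idx), len(test_idx))  # 12 3
```

With 10 rows of class 0 and 5 rows of class 1, the test set receives 2 and 1 rows respectively, so the class ratio is preserved.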

Train a logistic regression model. With 12 classes, this is a multiclass problem.

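Common parameters
penalty: type of regularization term, l1 or l2
C: inverse of the regularization strength; the larger C is, the weaker the penalty
fit_intercept: whether to fit an intercept (constant) term
max_iter: maximum number of iterations
multi_class: how the multiclass model is trained
    ovr = one-vs-rest: train one binary classifier per label
    multinomial = train a single multiclass (softmax) model directly; among the solvers listed here, supported only when solver is one of {newton-cg, sag, lbfgs}
solver: the optimization method; options include {liblinear, newton-cg, sag, lbfgs}
    liblinear works well on small datasets; sag is faster on large datasets
    for multiclass problems, liblinear supports only ovr; the others support both ovr and multinomial
    of these solvers, liblinear supports l1 regularization; the others support only l2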
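
The difference between the two multi_class modes shows up in how per-class scores become probabilities. The sketch below is an illustration in plain Python, not scikit-learn's code, and the scores are hypothetical. Multinomial applies a softmax over all class scores jointly, while one-vs-rest applies an independent sigmoid per class and then renormalizes.

```python
import math

def softmax(scores):
    """Multinomial: one joint probability distribution over all classes."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def ovr_probs(scores):
    """One-vs-rest: an independent sigmoid per class, then renormalized to sum to 1."""
    sig = [1 / (1 + math.exp(-s)) for s in scores]
    total = sum(sig)
    return [p / total for p in sig]

scores = [2.0, 0.5, -1.0]  # hypothetical per-class linear scores w_k . x + b_k
print([round(p, 3) for p in softmax(scores)])
print([round(p, 3) for p in ovr_probs(scores)])
```

Both variants pick the same class here, but the probabilities differ because the softmax makes the classes compete directly.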

from sklearn.linear_model import LogisticRegression  # import logistic regression
model = LogisticRegression(multi_class='multinomial', solver='lbfgs')  # solver='lbfgs': the optimization method
model.fit(train_x, train_y)

· Output

E:\ANACODAN\lib\site-packages\sklearn\linear_model\_logistic.py:763: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

LogisticRegression(multi_class='multinomial')

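Evaluating the model

from sklearn.metrics import confusion_matrix, precision_recall_fscore_support
# Evaluate the model on the test set
test_y_pred = model.predict(test_x)
# Compute the confusion matrix
pd.DataFrame(confusion_matrix(test_y, test_y_pred), columns=y_encoder.classes_, index=y_encoder.classes_)

Rows are the true channels and columns are the predicted channels: 體育 = sports, 健康 = health, 女人 = women, 娛樂 = entertainment, 房地產 = real estate, 教育 = education, 文化 = culture, 新聞 = news, 旅游 = travel, 汽車 = autos, 科技 = technology, 財經 = finance.

· Output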

	體育	健康	女人	娛樂	房地產	教育	文化	新聞	旅游	汽車	科技	財經
體育	193	1	0	1	0	0	3	2	0	0	0	0
健康	0	165	9	0	0	4	0	7	3	0	4	8
女人	1	5	167	4	0	0	13	5	3	0	1	1
娛樂	0	1	9	164	0	5	17	2	0	0	1	1
房地產	0	1	4	0	180	0	0	3	0	0	1	11
教育	0	0	3	2	0	185	2	6	1	0	1	0
文化	0	3	13	17	0	1	153	8	2	1	2	0
新聞	1	4	6	5	1	12	4	124	5	2	11	25
旅游	0	2	8	0	6	1	8	8	163	0	1	3
汽車	1	1	3	0	0	0	0	4	2	182	1	6
科技	0	1	0	0	0	2	2	12	5	1	164	13
財經	1	4	3	0	12	0	4	19	2	4	11	140
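
Per-class precision and recall can be read straight off this matrix: recall divides a diagonal entry by its row sum, precision divides it by its column sum. The short check below copies the numbers from the output above (rows are true labels, columns are predicted labels); the variable names are only for this illustration.

```python
# Confusion matrix copied from the output above (rows: true, columns: predicted).
cm = [
    [193, 1, 0, 1, 0, 0, 3, 2, 0, 0, 0, 0],
    [0, 165, 9, 0, 0, 4, 0, 7, 3, 0, 4, 8],
    [1, 5, 167, 4, 0, 0, 13, 5, 3, 0, 1, 1],
    [0, 1, 9, 164, 0, 5, 17, 2, 0, 0, 1, 1],
    [0, 1, 4, 0, 180, 0, 0, 3, 0, 0, 1, 11],
    [0, 0, 3, 2, 0, 185, 2, 6, 1, 0, 1, 0],
    [0, 3, 13, 17, 0, 1, 153, 8, 2, 1, 2, 0],
    [1, 4, 6, 5, 1, 12, 4, 124, 5, 2, 11, 25],
    [0, 2, 8, 0, 6, 1, 8, 8, 163, 0, 1, 3],
    [1, 1, 3, 0, 0, 0, 0, 4, 2, 182, 1, 6],
    [0, 1, 0, 0, 0, 2, 2, 12, 5, 1, 164, 13],
    [1, 4, 3, 0, 12, 0, 4, 19, 2, 4, 11, 140],
]

n = len(cm)
# Recall: correct predictions for class k over all true members of class k (row sum).
recall = [cm[k][k] / sum(cm[k]) for k in range(n)]
# Precision: correct predictions for class k over all predictions of class k (column sum).
precision = [cm[k][k] / sum(row[k] for row in cm) for k in range(n)]
# Accuracy: the diagonal over the total number of test documents.
accuracy = sum(cm[k][k] for k in range(n)) / sum(map(sum, cm))

print(round(precision[0], 6), round(recall[0], 3), round(accuracy, 3))  # 0.979695 0.965 0.825
```

For 體育 (row 0), 193 of the 200 true documents are recovered, and 193 of the 197 documents predicted as 體育 are correct.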

# Compute the evaluation metrics
def eval_model(y_true, y_pred, labels):
    # Per-class precision, recall, F1, and support
    p, r, f1, s = precision_recall_fscore_support(y_true, y_pred)
    # Support-weighted averages of precision, recall, and F1, plus the total support
    tot_p = np.average(p, weights=s)
    tot_r = np.average(r, weights=s)
    tot_f1 = np.average(f1, weights=s)
    tot_s = np.sum(s)
    res1 = pd.DataFrame({
        u'Label': labels,
        u'Precision': p,
        u'Recall': r,
        u'F1': f1,
        u'Support': s
    })
    res2 = pd.DataFrame({
        u'Label': [u'總體'],  # 總體 = "overall"
        u'Precision': [tot_p],
        u'Recall': [tot_r],
        u'F1': [tot_f1],
        u'Support': [tot_s]
    })
    res2.index = [999]
    res = pd.concat([res1, res2])
    return res[[u'Label', u'Precision', u'Recall', u'F1', u'Support']]

eval_model(test_y, test_y_pred, y_encoder.classes_)

· Output

	Label	Precision	Recall	F1	Support
0	體育	0.979695	0.965	0.972292	200
1	健康	0.877660	0.825	0.850515	200
2	女人	0.742222	0.835	0.785882	200
3	娛樂	0.849741	0.820	0.834606	200
4	房地產	0.904523	0.900	0.902256	200
5	教育	0.880952	0.925	0.902439	200
6	文化	0.742718	0.765	0.753695	200
7	新聞	0.620000	0.620	0.620000	200
8	旅游	0.876344	0.815	0.844560	200
9	汽車	0.957895	0.910	0.933333	200
10	科技	0.828283	0.820	0.824121	200
11	財經	0.673077	0.700	0.686275	200
999	總體	0.827759	0.825	0.825831	2400

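Saving the model

# Save the model to a file (requires dill: pip install dill)
# Note: the TF-IDF feature extractor, the label encoder, and the classifier must all be saved
!pip install dill
import dill
import pickle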
model_file = os.path.join(output_dir, u'model.pkl')
with open(model_file, 'wb') as outfile:
    dill.dump({
        'y_encoder': y_encoder,
        'tfidf': tfidf,
        'lr': model
    }, outfile)

· Output

Requirement already satisfied: dill in e:\anacodan\lib\site-packages (0.3.4)
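
All three objects go into one file because prediction needs the same vocabulary and the same label mapping that were used in training. The round trip can be sketched with the standard pickle module, whose dump and load calls dill mirrors. The dictionary contents below are hypothetical stand-ins, not the fitted scikit-learn objects.

```python
import os
import pickle
import tempfile

# Hypothetical stand-ins for the fitted vectorizer, label encoder, and classifier.
bundle = {
    "tfidf": {"vocabulary": {"match": 0, "stock": 1}},
    "y_encoder": ["finance", "sports"],
    "lr": {"coef": [[0.5, -0.5]]},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    # Write the whole bundle to a single file.
    with open(path, "wb") as outfile:
        pickle.dump(bundle, outfile)
    # Read it back; all three parts come back together.
    with open(path, "rb") as infile:
        restored = pickle.load(infile)

print(restored == bundle)  # True
```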

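Testing the model: predicting on new documents

# Load the new documents (on_bad_lines='skip' replaces the deprecated error_bad_lines=False, pandas >= 1.3)
new_data = pd.read_csv('sohu_test.txt', sep='\t', header=None, dtype=np.str_, encoding='utf8', on_bad_lines='skip', names=[u'頻道', u'文章'])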
new_data.head()

# Load the model (it was saved with dill.dump, so dill.load is the matching loader)
import dill
model_file = os.path.join(output_dir, u'model.pkl')
with open(model_file, 'rb') as infile:
    model = dill.load(infile)

# Predict on the new documents (only the first 50 here)
# 1. Convert to the TF-IDF bag-of-words representation
new_x = model['tfidf'].transform(new_data[u'文章'][:50])

· Output

E:\ANACODAN\lib\site-packages\sklearn\feature_extraction\text.py:388: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['&', ',', '.', ';', 'e', 'g', 'nbsp', '—', '\u3000', '儻', '兼', '前', '唷', '啪', '啷', '喔', '始', '漫', '然', '特', '竟', '若果', '莫', '見', '設', '說', '達', '非'] not in stop_words.
  warnings.warn('Your stop_words may be inconsistent with '

# 2. Predict the classes
new_y_pred = model['lr'].predict(new_x)
new_y_pred

· Output

array([3, 0, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3])

# 3. Map the class ids back to channel names (預測頻道 = predicted channel, 實際頻道 = actual channel)
pd.DataFrame({u'預測頻道': model['y_encoder'].inverse_transform(new_y_pred), u'實際頻道': new_data[u'頻道'][:50]})

· Output

	預測頻道	實際頻道
0	娛樂	娛樂
1	體育	娛樂
2	娛樂	娛樂
3	娛樂	娛樂
4	教育	娛樂
5	娛樂	娛樂
6	娛樂	娛樂
7	娛樂	娛樂
8	娛樂	娛樂
9	娛樂	娛樂
10	娛樂	娛樂
11	娛樂	娛樂
12	娛樂	娛樂
13	娛樂	娛樂
14	娛樂	娛樂
15	娛樂	娛樂
16	娛樂	娛樂
17	娛樂	娛樂
18	娛樂	娛樂
19	娛樂	娛樂
20	娛樂	娛樂
21	娛樂	娛樂
22	娛樂	娛樂
23	娛樂	娛樂
24	娛樂	娛樂
25	娛樂	娛樂
26	娛樂	娛樂
27	娛樂	娛樂
28	娛樂	娛樂
29	娛樂	娛樂
30	娛樂	娛樂
31	娛樂	娛樂
32	娛樂	娛樂
33	娛樂	娛樂
34	娛樂	娛樂
35	娛樂	娛樂
36	娛樂	娛樂
37	娛樂	娛樂
38	娛樂	娛樂
39	娛樂	娛樂
40	娛樂	娛樂
41	娛樂	娛樂
42	娛樂	娛樂
43	娛樂	娛樂
44	娛樂	娛樂
45	娛樂	娛樂
46	娛樂	娛樂
47	娛樂	娛樂
48	娛樂	娛樂
49	娛樂	娛樂
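
The table can be summarized in one number. The snippet below re-enters the predicted class ids from the array shown earlier and counts the matches; it relies on the table above, where id 3 corresponds to 娛樂 and all 50 actual labels are 娛樂. Because these 50 rows come from a single channel, the figure says little about the other 11 classes.

```python
# Predicted class ids copied from the array above: 3, 0, 3, 3, 5, then 45 more 3s.
new_y_pred = [3, 0, 3, 3, 5] + [3] * 45
true_id = 3  # all 50 actual labels are 娛樂, which the encoder maps to 3

correct = sum(1 for p in new_y_pred if p == true_id)
accuracy = correct / len(new_y_pred)
print(correct, accuracy)  # 48 0.96
```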

Main script: calling the model to classify news

# Load the model once, before the input loop
import os
import dill  # the model was saved with dill.dump
import pandas as pd

output_dir = u'output'
model_file = os.path.join(output_dir, u'model.pkl')
with open(model_file, 'rb') as infile:
    model = dill.load(infile)

while True:
    text = input()  # one article per line
    if not text.strip():
        break  # an empty line ends the loop
    # 1. Convert the article to its TF-IDF representation
    new1_x = model['tfidf'].transform([text])
    # 2. Predict the class
    new1_y_pred = model['lr'].predict(new1_x)
    # 3. Map the class id back to a channel name (預測頻道 = predicted channel)
    print(pd.DataFrame({u'預測頻道': model['y_encoder'].inverse_transform(new1_y_pred)}))


