Implementing an entity annotation model with spaCy

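Named entity recognition (NER) is the task of identifying the names of real-world objects in text. Like part-of-speech tagging, it is one of the basic techniques of natural language processing. Its main use is to have a model pick out the entities of interest in a text, and it can also be used to infer relationships between entities (entity disambiguation).
This article uses Python to train a spaCy model from scratch that recognizes the name of the winning company in bid-award announcements. About 200 announcements were collected with a web crawler (the crawling step is omitted here). The winning company was labeled by hand in 150 of them, which form the training set. spaCy is then used to train an entity extraction model, which is saved locally. The saved model is loaded again and run on the remaining 50 or so announcements to check how accurately it extracts the winning company.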

1. Getting and cleaning the data

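First, the crawled announcement files need to be cleaned: duplicates are dropped and web formatting residue (such as &nbsp) is removed. A before-and-after comparison:
[Figure: data before and after cleaning]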
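
The article does not show its cleaning code, so the following is only a minimal sketch of the two steps described above (removing web residue such as &nbsp and dropping duplicates). The function names `clean_announcement` and `clean_corpus` are made up for illustration, and collapsing all whitespace into single spaces is an assumption that fits the one-announcement-per-line file used later.

```python
import html
import re

def clean_announcement(raw):
    """Remove HTML entity residue from one announcement and tidy whitespace."""
    text = re.sub(r"&nbsp;?", " ", raw)       # "&nbsp;" or a bare "&nbsp" becomes a space
    text = html.unescape(text)                # decode remaining entities such as "&amp;"
    text = text.replace("\xa0", " ")          # non-breaking spaces that were already decoded
    return re.sub(r"\s+", " ", text).strip()  # assumption: one announcement per line

def clean_corpus(raw_docs):
    """Clean every announcement and drop exact duplicates, keeping first-seen order."""
    seen = set()
    cleaned = []
    for raw in raw_docs:
        text = clean_announcement(raw)
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned
```

Deduplicating after cleaning rather than before means two copies that differ only in leftover markup are still treated as the same announcement.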

2. Annotating entities

After annotation, the cleaned dataset must be stored in the format shown below (text + character offsets of the entity + entity label):

TRAIN_DATA = [
    ('Who is Shaka Khan?', {
        'entities': [(7, 17, 'PERSON')]  
        # entity offsets are 0-based; 17 is the index of the last character + 1
    }),
    ('I like London and Berlin.', {
        'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]
    })
]
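
Because the end offset is exclusive, a quick way to sanity-check annotations in this format is to slice the text with them. The helper below is an illustrative sketch (the name `check_offsets` is not from the article); it relies only on Python string slicing.

```python
def check_offsets(train_data):
    """Return the (substring, label) pair that each (start, end, label) span selects."""
    spans = []
    for text, annotations in train_data:
        for start, end, label in annotations["entities"]:
            if not 0 <= start < end <= len(text):
                raise ValueError(f"bad span ({start}, {end}) for text {text!r}")
            spans.append((text[start:end], label))
    return spans

SAMPLE = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]
print(check_offsets(SAMPLE))
# [('Shaka Khan', 'PERSON'), ('London', 'LOC'), ('Berlin', 'LOC')]
```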

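In other words, the start and end offsets of each entity have to be found. There are 204 announcements and each one needs manual annotation. Because both the data volume and the effort of looking up offsets are large, a small Python program was written and the work was split across team members: each member types in the winning company's name, the program finds its offsets in the text, and the result is stored in the format above. The annotation code: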

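import re
from IPython.display import clear_output  # assumes a Jupyter/IPython session

def find_company_name(name, text):
    """Return (text, {'entities': [(start, end, label)]}) for the first match, else None."""
    if not name:
        return None
    match = re.search(re.escape(name), text)  # escape so "(", ")" etc. match literally
    if match is None:
        return None
    start, end = match.span()
    # Winning companies are labeled "LOC" here; the label only has to be
    # the same during annotation and training.
    return (text, {'entities': [(start, end, "LOC")]})

# One announcement per line in the cleaned file; blank lines are skipped.
with open("home_work_clear.txt", encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f if line.strip()]

start = int(input("First line to annotate: "))
end = int(input("Last line to annotate (exclusive): "))

textall = []
stop = False
for line in lines[start:end]:
    clear_output()  # clear the previous announcement from the output cell
    print(line)
    while True:
        company_name = input("Winning company name (type 'break' to stop): ")
        if company_name == "break":
            stop = True
            break
        result = find_company_name(company_name, line)
        if result:
            print(result)
            textall.append(result)
            break
        print("Company name not found in the text, please try again")
    if stop:
        break

# Save the annotations as a Python literal so they can be read back later.
with open("result%d-%d.txt" % (start, end), "w", encoding="utf-8") as f:
    f.write(str(textall))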

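(1) First enter the start and end positions of the range to annotate:

[Figure: entering the annotation range]

(2) Then enter the winning company's name for each announcement:

[Figure: entering the winning company name]

(3) Finally, the annotated results are saved to a txt file:

[Figure: annotation result]
In the figure above, offsets 121-133 of the text mark the winning company.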
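
Since the result files contain the printed form of a Python list, they can be read back with `ast.literal_eval` and merged into one dataset. This is a sketch under assumptions: the function name `load_annotations` is hypothetical, and the `encoding` argument has to match whatever encoding the files were written with.

```python
import ast
from pathlib import Path

def load_annotations(paths, encoding="utf-8"):
    """Read annotation files written with str(list) and merge them into one list."""
    data = []
    for path in paths:
        content = Path(path).read_text(encoding=encoding)
        # literal_eval only parses Python literals, so it is safer than eval()
        data.extend(ast.literal_eval(content))
    return data
```

Each team member's file can be passed in one call, for example `load_annotations(["result0-50.txt", "result50-100.txt"])` (file names shown only as an example of the naming pattern).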

3. Splitting into training and test sets

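After annotation there are 204 labeled announcements in total; 150 are assigned to the training set and 54 to the test set:
[Figure: total size of the dataset]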
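
The article does not say how the 150/54 split was chosen. One common option, shown here purely as an assumption, is a shuffled split with a fixed seed so that the same split can be reproduced later; `split_dataset` is an illustrative name.

```python
import random

def split_dataset(data, n_train=150, seed=0):
    """Shuffle a copy of the data reproducibly and cut it into train and test parts."""
    shuffled = list(data)                  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # private RNG, does not disturb global state
    return shuffled[:n_train], shuffled[n_train:]
```

With 204 annotated announcements and the default `n_train=150`, this yields 150 training items and 54 test items.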

4. Training the spaCy model

The prepared training set is fed into spaCy for training, and the trained model is saved. The code:

import random
from pathlib import Path

import spacy
from spacy.training import Example  # spaCy v3 API

def main(model=None, output_dir=None, n_iter=100):
    """Load the model, set up the pipeline and train the entity recognizer.

    model: existing model to keep training (None starts from a blank one);
    output_dir: where to save the trained model; n_iter: number of iterations.
    """
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model and keep training it
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank('zh')  # create blank Language class
        print("Created blank 'zh' model")

    # add the NER component if the pipeline has none,
    # otherwise, get it so we can add labels
    if 'ner' not in nlp.pipe_names:
        ner = nlp.add_pipe('ner', last=True)
    else:
        ner = nlp.get_pipe('ner')

    # add labels
    for _, annotations in TRAIN_DATA:      # register every label used in TRAIN_DATA
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])

    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    # Train only the NER component; without this, the other components would be updated too.
    # Retraining a downloaded spaCy model may break it, so a blank model is the safer choice.
    with nlp.select_pipes(disable=other_pipes):
        # fresh weights for a blank model, existing weights for a loaded one
        optimizer = nlp.initialize() if model is None else nlp.resume_training()
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)     # reshuffle the training data every iteration
            losses = {}                    # accumulates the loss per component
            for text, annotations in TRAIN_DATA:
                # wrap text and annotations in the Example object that nlp.update expects
                example = Example.from_dict(nlp.make_doc(text), annotations)
                nlp.update(
                    [example],  # batch of annotations
                    drop=0.5,  # dropout - make it harder to memorise data
                    sgd=optimizer,  # optimizer that updates the weights
                    losses=losses)
            print(losses)

    # save the model
    if output_dir is not None:
        output_dir = Path(output_dir)
        output_dir.mkdir(parents=True, exist_ok=True)
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

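Timing the training run showed that the code is slow: in a trial, training on just 5 samples took almost 1 minute:
[Figure: time taken to train on 5 samples]
The number of iterations in the code was 100. To speed things up it was reduced from 100 to 50, after which training on the 150 training announcements took about 47 minutes:
[Figure: time taken with 50 iterations]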
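
The article does not show how the run was timed. A small context manager around `time.perf_counter` is one simple way to do it; the sketch below is an assumption about tooling, not the author's code, and `timed` is a made-up name.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results=None):
    """Print, and optionally record in `results`, the wall-clock time of a block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed:.1f} s")

# Stand-in workload; a real run would wrap the training call instead.
with timed("demo"):
    sum(range(100_000))
```

Wrapping the training call in `with timed("50 iterations"):` would give the kind of figure quoted above.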

The trained model is saved locally so that later tests can load it directly and produce predictions.

5. Testing the model on the test set

Once the spaCy model is trained, the test data is loaded and run through it. The code:

import spacy

def load_model_test(path, text):
    """Load the saved model from `path` and print the entities it finds in `text`."""
    print("Loading from", path)
    nlp = spacy.load(path)
    doc = nlp(text)
    for ent in doc.ents:
        print(ent.text, ent.label_)

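The test results are as follows:
[Figure: predictions on the test set]
Judging from the results shown above, the predictions are good overall and the model finds the winning company's name correctly in most cases. A few predictions fail, however, which shows the model is not yet very precise. It could be improved by increasing the number of iterations and the amount of training data.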
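
The results above are judged by eye. To turn them into a number, one option is exact-match accuracy over the annotated test set. The sketch below is an assumption about how that could be measured, not the article's method: `exact_match_accuracy` is an illustrative name and `predict` is any callable that returns the entity strings found in a text.

```python
def exact_match_accuracy(predict, test_data):
    """Fraction of texts whose predicted entity strings equal the annotated ones."""
    if not test_data:
        return 0.0
    correct = 0
    for text, annotations in test_data:
        gold = {text[start:end] for start, end, _ in annotations["entities"]}
        if set(predict(text)) == gold:
            correct += 1
    return correct / len(test_data)

# With a loaded spaCy pipeline `nlp`, the predictor could be:
#   predict = lambda text: [ent.text for ent in nlp(text).ents]
```

Exact match is strict: a prediction that misses one character of the company name counts as wrong, so this gives a lower bound on how useful the model is.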
