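Part-of-speech (POS) tagging is the process of assigning a part-of-speech label to each word in the input text. It helps predict the part of speech of the following word and lays the groundwork for tasks such as syntactic parsing and information extraction. Common approaches include HMMs (hidden Markov models) and deep learning models (RNN, LSTM, and so on). Chinese, however, is harder to tag: the language has almost no morphological inflection, so there is no surface cue to decide from directly; many common words belong to more than one word class; and researchers differ in their subjective judgments about word classes.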
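Since the paragraph above names HMMs as a common approach, here is a minimal sketch of Viterbi decoding over a toy HMM, showing how a word such as 研究 (which can be a noun or a verb) is disambiguated by its neighbours. The two-tag set, all probabilities, and the sample words are invented for illustration. A real tagger would estimate them from an annotated corpus.

```python
import math

# Toy HMM parameters. Every number here is made up for illustration.
STATES = ["N", "V"]
START = {"N": 0.6, "V": 0.4}
TRANS = {"N": {"N": 0.4, "V": 0.6}, "V": {"N": 0.8, "V": 0.2}}
EMIT = {
    "N": {"我": 0.4, "研究": 0.3, "語言": 0.3},
    "V": {"研究": 0.7, "愛": 0.3},
}
UNSEEN = 1e-6  # small probability for words a tag never emitted


def viterbi(words):
    """Return the most probable tag sequence for `words` under the toy HMM."""
    if not words:
        return []

    def emit(tag, word):
        return math.log(EMIT[tag].get(word, UNSEEN))

    # scores[t][tag]: best log-probability of any tag path ending in `tag` at position t
    scores = [{tag: math.log(START[tag]) + emit(tag, words[0]) for tag in STATES}]
    back = [{}]
    for t in range(1, len(words)):
        scores.append({})
        back.append({})
        for tag in STATES:
            best_prev = max(
                STATES,
                key=lambda p: scores[t - 1][p] + math.log(TRANS[p][tag]),
            )
            scores[t][tag] = (
                scores[t - 1][best_prev]
                + math.log(TRANS[best_prev][tag])
                + emit(tag, words[t])
            )
            back[t][tag] = best_prev

    # Follow the back-pointers from the best final tag.
    path = [max(STATES, key=lambda tag: scores[-1][tag])]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


print(viterbi(["我", "研究", "語言"]))  # ['N', 'V', 'N']
```

With these toy numbers, 研究 is tagged as a verb because the transition from a noun to a verb and the verb emission together outweigh the noun reading.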
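This article shows how to implement POS tagging in Python and how to train a Chinese POS tagging model with spaCy:
1. POS-tag the training text
First, given the training data: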

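Process each line with spaCy, initialize a label list and a text string, join the segmented tokens with "/", and store each token's POS label in the label list. The code is as follows: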
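import spacy

def train_data(train_path):
    """Tag every line of the training file with a pretrained spaCy pipeline."""
    nlp = spacy.load('zh_core_web_sm')
    # Read the training file, one sentence per line, skipping blank lines.
    with open(train_path, "r", encoding="utf8") as f:
        train_list = [line.strip() for line in f if line.strip()]
    result = []
    train_dict = {}
    for line in train_list:
        doc = nlp(line)
        label = []
        text = ""
        for token in doc:
            # Join the segmented tokens with "/".
            text += token.text + "/"
            # The label is the first letter of the coarse POS tag, so tags
            # that share a first letter (e.g. NOUN and NUM) become one label.
            label.append(token.pos_[0])
            train_dict[token.pos_[0]] = {"pos": token.pos_}
        # Drop the trailing "/" and pair the text with its labels.
        result.append((text[:-1], {'tags': label}))
    return result, train_dict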
This produces roughly the following result:
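The exact output depends on the training file and on the spaCy model. As a rough illustration of the shape only, the sketch below builds the same `(text, {'tags': [...]})` structure from hand-written (word, POS) pairs instead of calling spaCy. The sample sentence, its tags, and the helper names `to_training_example` and `is_aligned` are hypothetical.

```python
def to_training_example(tagged_words):
    """Build a (text, {'tags': [...]}) pair from (word, POS) tuples."""
    words = [word for word, _ in tagged_words]
    # Same convention as train_data: keep only the first letter of the POS name.
    tags = [pos[0] for _, pos in tagged_words]
    return "/".join(words), {"tags": tags}


def is_aligned(example):
    """Check that there is exactly one tag per '/'-separated token."""
    text, annotations = example
    return len(text.split("/")) == len(annotations["tags"])


# Hand-written sample, not real spaCy output.
sample = [("我", "PRON"), ("喜歡", "VERB"), ("語言", "NOUN"), ("。", "PUNCT")]
example = to_training_example(sample)
print(example)              # ('我/喜歡/語言/。', {'tags': ['P', 'V', 'N', 'P']})
print(is_aligned(example))  # True
```

A check like `is_aligned` is a cheap sanity test before training, since a tagger needs one label per token. It would misfire on tokens that themselves contain "/".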

2. Train the model with spaCy
Next, train the model:
import random
from pathlib import Path

import plac
import spacy
from spacy.training import Example

# `result` and `train_dict` are the values returned by train_data() in step 1.

@plac.annotations(
    lang=("ISO Code of language to use", "option", "l", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int))
def main(lang='zh', output_dir=None, n_iter=25):
    nlp = spacy.blank(lang)  # create a blank pipeline; 'zh' selects Chinese
    tagger = nlp.add_pipe('tagger')
    # Add the tags. This needs to be done before you start training.
    for tag, values in train_dict.items():
        print("tag:", tag)
        print("values:", values)
        tagger.add_label(tag)
    print("3:", tagger)
    # Initialize the model weights (begin_training is the deprecated name).
    optimizer = nlp.initialize()
    for i in range(n_iter):
        random.shuffle(result)  # shuffle the training examples each iteration
        losses = {}
        for text, annotations in result:
            # Note: make_doc() re-tokenizes the "/"-joined text, so the token
            # count may differ from the number of tags.
            example = Example.from_dict(nlp.make_doc(text), annotations)
            nlp.update([example], sgd=optimizer, losses=losses)
        print(losses)
Each iteration prints the accumulated tagger loss.
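One detail worth checking before training is how many distinct labels the tagger actually receives. The labels passed to `tagger.add_label` are the first letters of the coarse POS tags. The sketch below assumes those tags are the 17 Universal POS tags and groups them by first letter to show which ones collapse into the same label. The helper name `group_by_first_letter` is hypothetical.

```python
from collections import defaultdict

# The 17 Universal POS tags (assumed here to be the values of token.pos_).
UPOS_TAGS = [
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
]


def group_by_first_letter(tags):
    """Map each first letter to the list of tags that start with it."""
    groups = defaultdict(list)
    for tag in tags:
        groups[tag[0]].append(tag)
    return dict(groups)


groups = group_by_first_letter(UPOS_TAGS)
collisions = {letter: tags for letter, tags in groups.items() if len(tags) > 1}
print(len(groups))  # 9
print(collisions)
```

Under this assumption only 9 labels remain. For example, ADJ, ADP, ADV and AUX all become "A", so the trained tagger cannot tell them apart. Using the full tag name as the label would avoid this.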

3. Validate the model on the test set
Finally, process the test data in the same way:

The code is as follows:
    # (continues inside main() from step 2)
    test_path = r"E:\1\Study\大三下\自然語言處理\第五章作業\test.txt"
    # Read the test file, one sentence per line.
    with open(test_path, "r", encoding="utf8") as f:
        test_list = [line for line in f]
    for z in test_list:
        txt = nlp(z)
        test_text = ""
        for word in txt:
            test_text += word.text + "/"
        print('test_data:', [(t.text, t.tag_, t.pos_) for t in txt])
    # After the loop, test_text holds the tokens of the last test line,
    # each followed by "/".
    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)
        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        doc = nlp2(test_text)
        print('Tags', [(t.text, t.tag_, t.pos_) for t in doc])

if __name__ == '__main__':
    plac.call(main)
The script prints the predicted tags for each test sentence:
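Printed tags are hard to judge by eye. One way to quantify the result, assuming gold tags are available for the test sentences, is token-level accuracy. The sketch below is a plain-Python helper with made-up tag lists. The name `tag_accuracy` is hypothetical and not part of spaCy.

```python
def tag_accuracy(gold, predicted):
    """Return the fraction of positions where the predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Made-up tags for a four-token sentence.
gold = ["P", "V", "N", "P"]
predicted = ["P", "V", "V", "P"]
print(tag_accuracy(gold, predicted))  # 0.75
```

In the script above, the predicted list could be collected as `[t.tag_ for t in doc]`, provided the gold tags use the same tokenization.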

Please credit the source when reposting. Article link: https://www.uj5u.com/houduan/280867.html
