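I am trying to create a training dataset for NER. For that, I have a large amount of data that needs to be tagged, and unnecessary sentences have to be removed. When the unnecessary sentences are removed, the index positions of the annotations must be updated. Yesterday I saw some incredible code segments from other users, but I can't find them now. By adapting their code segments I can briefly describe my problem.
Let's take this training sample data as an example: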
data = [{"content":'''Hello we are hans and john. I enjoy playing Football.
I love eating grapes. Hanaan is great.''',"annotations":[{"id":1,"start":13,"end":17,"tag":"name"},
{"id":2,"start":22,"end":26,"tag":"name"},
{"id":3,"start":68,"end":74,"tag":"fruit"},
{"id":4,"start":76,"end":82,"tag":"name"}]}]
This can be visualized with the following spaCy displacy code:
import json
import spacy
from spacy import displacy
data = [{"content":'''Hello we are hans and john. I enjoy playing Football.
I love eating grapes. Hanaan is great.''',"annotations":[{"id":1,"start":13,"end":17,"tag":"name"},
{"id":2,"start":22,"end":26,"tag":"name"},
{"id":3,"start":68,"end":74,"tag":"fruit"},
{"id":4,"start":76,"end":82,"tag":"name"}]}]
data_index = 0  # index of the record to visualize
annot_tags = data[data_index]["annotations"]
# Convert the annotations into (start, end, tag) tuples
entities = []
for j in annot_tags:
    start = j["start"]
    end = j["end"]
    tag = j["tag"]
    entitie = (start, end, tag)
    entities.append(entitie)
data_gen = (data[data_index]["content"], {"entities": entities})
data_one = []
data_one.append(data_gen)
nlp = spacy.blank('en')
raw_text = data_one[0][0]
doc = nlp.make_doc(raw_text)
spans = data_one[0][1]["entities"]
ents = []
for span_start, span_end, label in spans:
    ent = doc.char_span(span_start, span_end, label=label)
    # char_span returns None when the offsets do not align with token boundaries
    if ent is None:
        continue
    ents.append(ent)
doc.ents = ents
displacy.render(doc, style="ent", jupyter=True)
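The output will be
[image: Output 1]
Now I want to remove the untagged sentences and update the index values. So the desired output looks like
[image: Desired output]
Also, the data must be in the following format. Untagged sentences are removed and the index values must be updated so that I can get the output shown above.
Desired output data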
[{"content":'''Hello we are hans and john.
I love eating grapes. Hanaan is great.''',"annotations":[{"id":1,"start":13,"end":17,"tag":"name"},
{"id":2,"start":22,"end":26,"tag":"name"},
{"id":3,"start":42,"end":48,"tag":"fruit"},
{"id":4,"start":50,"end":56,"tag":"name"}]}]
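One quick way to confirm that shifted offsets are still valid is to slice the content with each start/end pair and check that the expected word comes back. A minimal sketch on the desired output data above; the helper name `spans_text` is made up for illustration and the content is written with an explicit newline:

```python
# Desired output data from the question (the line break is written as "\n").
desired = [{"content": "Hello we are hans and john.\nI love eating grapes. Hanaan is great.",
            "annotations": [{"id": 1, "start": 13, "end": 17, "tag": "name"},
                            {"id": 2, "start": 22, "end": 26, "tag": "name"},
                            {"id": 3, "start": 42, "end": 48, "tag": "fruit"},
                            {"id": 4, "start": 50, "end": 56, "tag": "name"}]}]


def spans_text(record):
    """Return the substring each annotation points at (hypothetical helper)."""
    content = record["content"]
    return [content[a["start"]:a["end"]] for a in record["annotations"]]


print(spans_text(desired[0]))  # ['hans', 'john', 'grapes', 'Hanaan']
```

If any slice returns something other than the tagged word, the offsets were not updated correctly.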
I was following a post yesterday and ended up with code that nearly works.
Code
import re
data = [{"content":'''Hello we are hans and john. I enjoy playing Football.
I love eating grapes. Hanaan is great.''',"annotations":[{"id":1,"start":13,"end":17,"tag":"name"},
{"id":2,"start":22,"end":26,"tag":"name"},
{"id":3,"start":68,"end":74,"tag":"fruit"},
{"id":4,"start":76,"end":82,"tag":"name"}]}]
# Store the annotated word alongside each annotation
for idx, each in enumerate(data[0]['annotations']):
    start = each['start']
    end = each['end']
    word = data[0]['content'][start:end]
    data[0]['annotations'][idx]['word'] = word
# Naive sentence split on '.'
sentences = [{'sentence': x.strip() + '.', 'checked': False} for x in data[0]['content'].split('.')]
new_data = [{'content': '', 'annotations': []}]
for idx, each in enumerate(data[0]['annotations']):
    for idx_alpha, sentence in enumerate(sentences):
        if sentence['checked'] == True:
            continue
        temp = each.copy()
        check_word = temp['word']
        if check_word in sentence['sentence']:
            start_idx = re.search(r'\b({})\b'.format(check_word), sentence['sentence']).start()
            end_idx = start_idx + len(check_word)
            current_len = len(new_data[0]['content'])
            new_data[0]['content'] += sentence['sentence'] + ' '
            temp.update({'start': start_idx + current_len, 'end': end_idx + current_len})
            new_data[0]['annotations'].append(temp)
            sentences[idx_alpha]['checked'] = True
            break
print(new_data)
Output
[{'content': 'Hello we are hans and john. I love eating grapes. Hanaan is great. ',
'annotations': [{'id': 1,
'start': 13,
'end': 17,
'tag': 'name',
'word': 'hans'},
{'id': 3, 'start': 42, 'end': 48, 'tag': 'fruit', 'word': 'grapes'},
{'id': 4, 'start': 50, 'end': 56, 'tag': 'name', 'word': 'Hanaan'}]}]
Here the name john is lost. If a sentence contains more than one tag, I can't afford to lose any of them.
I know this is a lot to ask, but any bit of help is appreciated.
Thanks in advance.
Please upvote the question: since I am a beginner, that would give me access to more features on Stack Overflow.
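A uj5u.com user replied:
This is a fairly complicated task because you need to identify sentences, and a simple split on '.' may not work, since it would also split on things like 'Mr.'.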
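To illustrate the point about splitting on '.', here is a small sketch using only plain Python string methods. The sample text is taken from the end of the data used in this answer; the variable names are arbitrary:

```python
text = "Hanaan is great. Mr. Jones is nice."

# Naive approach: split on every period and drop empty pieces.
pieces = [p.strip() for p in text.split(".") if p.strip()]
print(pieces)  # ['Hanaan is great', 'Mr', 'Jones is nice']

# The entity "Mr. Jones" no longer sits inside a single piece.
print(any("Mr. Jones" in p for p in pieces))  # False
```

Because the abbreviation is treated as a sentence end, an annotation covering 'Mr. Jones' can no longer be matched against any one sentence.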
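Since you are using spaCy, why not let it identify the sentences, then iterate over them and work out the new start and end indices, leaving out any sentence that has no entities. Then rebuild the content.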
import json
import spacy
from spacy import displacy
import re
data = [{"content":'''Hello we are hans and john. I enjoy playing Football. \
I love eating grapes. Hanaan is great. Mr. Jones is nice.''',"annotations":[{"id":1,"start":13,"end":17,"tag":"name"},
{"id":2,"start":22,"end":26,"tag":"name"},
{"id":3,"start":68,"end":74,"tag":"fruit"},
{"id":4,"start":76,"end":82,"tag":"name"},
{"id":5,"start":93,"end":102,"tag":"name"}]}]
# Store the annotated word alongside each annotation
for idx, each in enumerate(data[0]['annotations']):
    start = each['start']
    end = each['end']
    word = data[0]['content'][start:end]
    data[0]['annotations'][idx]['word'] = word
text = data[0]['content']
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('sentencizer')
doc = nlp(text)
sentences = [i for i in doc.sents]
annotations = data[0]['annotations']
new_data = [{"content": '',
             'annotations': []}]
for sentence in sentences:
    idx_to_remove = []
    for idx, annotation in enumerate(annotations):
        if annotation['word'] in sentence.text:
            temp = annotation.copy()
            start_idx = re.search(r'\b({})\b'.format(annotation['word']), sentence.text).start()
            end_idx = start_idx + len(annotation['word'])
            # Offset of this sentence within the rebuilt content
            current_len = len(new_data[0]['content'])
            temp.update({'start': start_idx + current_len, 'end': end_idx + current_len})
            new_data[0]['annotations'].append(temp)
            idx_to_remove.append(idx)
    # Keep the sentence only if it contained at least one annotation
    if len(idx_to_remove) > 0:
        new_data[0]['content'] += sentence.text + ' '
        # Drop the annotations already handled (they are in document order)
        for x in range(0, len(idx_to_remove)):
            del annotations[0]
Output:
print(new_data)
[{'content': 'Hello we are hans and john. I love eating grapes. Hanaan is great. Mr. Jones is nice. ',
'annotations': [
{'id': 1, 'start': 13, 'end': 17, 'tag': 'name', 'word': 'hans'},
{'id': 2, 'start': 22, 'end': 26, 'tag': 'name', 'word': 'john'},
{'id': 3, 'start': 42, 'end': 48, 'tag': 'fruit', 'word': 'grapes'},
{'id': 4, 'start': 50, 'end': 56, 'tag': 'name', 'word': 'Hanaan'},
{'id': 5, 'start': 67, 'end': 76, 'tag': 'name', 'word': 'Mr. Jones'}]}]
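As a variation on the approach above (not part of the original answer), the new offsets can also be derived from the sentence's own character span rather than by searching for each word again, which may matter when the same word occurs more than once. This is only a sketch under assumptions: `drop_unlabeled_sentences` and `naive_sentence_spans` are hypothetical names, and the regex splitter is a stand-in so the example runs without a model. With spaCy, the spans could instead be taken from `sent.start_char` and `sent.end_char` for each `sent` in `doc.sents`.

```python
import re


def drop_unlabeled_sentences(content, annotations, sentence_spans):
    """Keep only sentences that contain an annotation and shift the offsets.

    sentence_spans: (start_char, end_char) pairs in document order.
    Annotations that cross a sentence boundary are dropped in this sketch.
    """
    new_content = ""
    new_annotations = []
    for sent_start, sent_end in sentence_spans:
        inside = [a for a in annotations
                  if sent_start <= a["start"] and a["end"] <= sent_end]
        if not inside:
            continue  # unlabeled sentence: leave it out
        if new_content:
            new_content += " "  # kept sentences are joined by one space
        shift = len(new_content) - sent_start
        for a in inside:
            new_annotations.append({**a, "start": a["start"] + shift,
                                    "end": a["end"] + shift})
        new_content += content[sent_start:sent_end]
    return {"content": new_content, "annotations": new_annotations}


def naive_sentence_spans(text):
    """Stand-in splitter (assumption): a sentence ends at . ! or ?
    followed by whitespace or the end of the text."""
    pattern = r"\S.*?[.!?](?=\s|$)"
    return [(m.start(), m.end()) for m in re.finditer(pattern, text, flags=re.S)]


content = ("Hello we are hans and john. I enjoy playing Football.\n"
           "I love eating grapes. Hanaan is great.")
annotations = [{"id": 1, "start": 13, "end": 17, "tag": "name"},
               {"id": 2, "start": 22, "end": 26, "tag": "name"},
               {"id": 3, "start": 68, "end": 74, "tag": "fruit"},
               {"id": 4, "start": 76, "end": 82, "tag": "name"}]

result = drop_unlabeled_sentences(content, annotations, naive_sentence_spans(content))
print(result["content"])
print([(a["start"], a["end"]) for a in result["annotations"]])
# [(13, 17), (22, 26), (42, 48), (50, 56)]
```

The kept sentences are joined with a single space, so the offsets match the desired output from the question, although the original line break is not preserved. The stand-in splitter would still break on abbreviations such as 'Mr.', which is why real sentence spans from spaCy are the safer input.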
A uj5u.com user replied:
Just remove (comment out) these two lines:
#sentences[idx_alpha]['checked'] = True
#break
Output
[{'content': 'Hello we are hans and john. Hello we are hans and john. I love eating grapes. Hanaan is great. ',
'annotations':
[{'id': 1, 'start': 13, 'end': 17, 'tag': 'name', 'word': 'hans'},
{'id': 2, 'start': 50, 'end': 54, 'tag': 'name', 'word': 'john'},
{'id': 3, 'start': 70, 'end': 76, 'tag': 'fruit', 'word': 'grapes'},
{'id': 4, 'start': 78, 'end': 84, 'tag': 'name', 'word': 'Hanaan'}]}]
Tags: python string nlp nltk spacy
