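I found a code block that is useful in my project, but I cannot get it to build a DataFrame (2 columns) in the same format it produces when printed.
The code block and the desired output: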
import nltk
import pandas as pd
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
# Step Two: Load Data
sentence = "Martin Luther King Jr. (born Michael King Jr.; January 15, 1929 – April 4, 1968) was an American Baptist minister and activist who became the most visible spokesman and leader in the American civil rights movement from 1955 until his assassination in 1968. King advanced civil rights through nonviolence and civil disobedience, inspired by his Christian beliefs and the nonviolent activism of Mahatma Gandhi. He was the son of early civil rights activist and minister Martin Luther King Sr."
# Step Three: Tokenise, find parts of speech and chunk words
for sent in nltk.sent_tokenize(sentence):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        # Named-entity chunks are Tree objects with a label; other tokens are plain tuples
        if hasattr(chunk, 'label'):
            print(chunk.label(), ' '.join(c[0] for c in chunk))
Output, with the label cleanly in one column and the entity in the other:
PERSON Martin
PERSON Luther King
PERSON Michael King
ORGANIZATION American
GPE American
GPE Christian
PERSON Mahatma Gandhi
PERSON Martin Luther
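I tried something like this, but the result is not as clean.
df = []  # a plain list, despite the name
for sent in nltk.sent_tokenize(sentence):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):
            df.append(chunk)  # appends the whole Tree, not its label and words
Output: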
[Tree('PERSON', [('Martin', 'NNP')]),
Tree('PERSON', [('Luther', 'NNP'), ('King', 'NNP')]),
Tree('PERSON', [('Michael', 'NNP'), ('King', 'NNP')]),
Tree('ORGANIZATION', [('American', 'JJ')]),
Tree('GPE', [('American', 'NNP')]),
Tree('GPE', [('Christian', 'JJ')]),
Tree('PERSON', [('Mahatma', 'NNP'), ('Gandhi', 'NNP')]),
Tree('PERSON', [('Martin', 'NNP'), ('Luther', 'NNP')])]
Is there a simple way to turn the printed format into a df with only 2 columns?
Answer from a uj5u.com user:
Build a nested list and convert it to a DataFrame:
L = []
for sent in nltk.sent_tokenize(sentence):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):
            # One row per entity: [label, words joined by spaces]
            L.append([chunk.label(), ' '.join(c[0] for c in chunk)])
df = pd.DataFrame(L, columns=['a','b'])
print(df)
              a               b
0        PERSON          Martin
1        PERSON     Luther King
2        PERSON    Michael King
3  ORGANIZATION        American
4           GPE        American
5           GPE       Christian
6        PERSON  Mahatma Gandhi
7        PERSON   Martin Luther
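Why this works: each entity returned by nltk.ne_chunk is an nltk Tree whose children are (word, tag) tuples, so the label and the words have to be pulled out before they can become a row. Below is a minimal sketch of that step on a single hand-built Tree (copied from the second entry of the question's output, so no model downloads are needed); the variable names are only illustrative.

```python
from nltk.tree import Tree

# Hand-built stand-in for one chunk returned by nltk.ne_chunk
chunk = Tree('PERSON', [('Luther', 'NNP'), ('King', 'NNP')])

label = chunk.label()                          # the entity type
text = ' '.join(word for word, _tag in chunk)  # children are (word, tag) tuples
row = [label, text]                            # one DataFrame row
print(row)
```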
The list comprehension solution is:
L = [[chunk.label(), ' '.join(c[0] for c in chunk)]
     for sent in nltk.sent_tokenize(sentence)
     for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)))
     if hasattr(chunk, 'label')]
df = pd.DataFrame(L, columns=['a','b'])
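If the same conversion is needed in several places, it could be wrapped in a small helper. The following is only a sketch: the function name entities_to_frame and the column names 'label' and 'entity' are assumptions chosen for illustration, and the input is a hand-built list standing in for one nltk.ne_chunk result so that it runs without downloading any models.

```python
import pandas as pd
from nltk.tree import Tree


def entities_to_frame(chunks, columns=('label', 'entity')):
    """Build a two-column DataFrame from the items of an ne_chunk result.

    Hypothetical helper; the name and default column names are illustrative.
    """
    rows = [
        (chunk.label(), ' '.join(word for word, _tag in chunk))
        for chunk in chunks
        if hasattr(chunk, 'label')  # skip plain (word, tag) tuples
    ]
    return pd.DataFrame(rows, columns=list(columns))


# Hand-built stand-in for one chunked sentence (entities mixed with plain tokens)
chunked = [
    Tree('PERSON', [('Mahatma', 'NNP'), ('Gandhi', 'NNP')]),
    ('was', 'VBD'),
    Tree('GPE', [('American', 'NNP')]),
]
df = entities_to_frame(chunked)
print(df)
```

With real text, the chunks of every sentence can be chained together and passed to the helper in one call.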
