Matching a list of sentences (nltk-tokenized) against a column in a pandas DataFrame

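I'm new to Python and still learning the basics, but I'm stuck on this problem, and any help would be much appreciated. I have a long DataFrame with several hundred rows, each holding the text extracted from specific PDF pages of a medical exam; each row is a different person.

I managed to extract the text (using pymupdf), iterate over each row, and clean the text as much as I could. I ended up with a DataFrame like the one below, with multiple rows and a column of sentences obtained with nltk's sent_tokenize.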

import pandas as pd
from nltk.tokenize import sent_tokenize

df = pd.DataFrame({"text":["hello, this is a sentence. the sun shines. the the night is beautiful",
              "the sun shines",
              "the night is beautiful. tomorrow i work"]})

df["token"] = df["text"].apply(sent_tokenize)

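The last part of my task is to match specific sentences from a list of medical phrases (specific to the exam) against the sentences in my DataFrame, and keep only the matches in a new column. For this, I found the thread "Loop through lists and rows for keyword match in pandas dataframe", where @furas's solution is clean and looks like it could do the job. So, in the end, I have a pandas column of sentences (nltk-tokenized) and a list of medical phrases (also nltk-tokenized), and I need to match them.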

specific_sent = "the sun shines. hello, this is a sentence."
query = sent_tokenize(''.join(specific_sent))

df["query_match"] = df["token"].str.contains(query) 
df["word"] = df["token"].str.extract('({})'.format(query))

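When I run this code I get the error "TypeError: unhashable type: 'list'". It's not an uncommon error and I understand it to some extent, but I'm struggling to get past it. Any help on how to overcome this error in this particular example, and how to prevent it in the future, would be much appreciated. Thanks!

Here is an example of the desired output: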

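text	token	query_match	word
hello, this is a sentence. the sun shines. the night is beautiful	[hello, this is a sentence., the sun shines., the night is beautiful]	True	the sun shines., hello, this is a sentence.
the sun shines.	[the sun shines.]	True	the sun shines.
the night is beautiful. tomorrow i work	[the night is beautiful., tomorrow i work]	False	NaN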

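Answer from a uj5u.com user:

Once you tokenize each text in the DataFrame and the specific sentences, you have lists, from which you can find the common elements and build the column word. After that, you can also fill the column query_match by checking whether the resulting list of common elements is empty.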
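Some background on the error itself. As far as I can tell, .str.contains and .str.extract hand their pattern to Python's re module, and a regex pattern has to be a string, not a list. The sketch below reproduces the failure with re alone and shows one string-based workaround, assuming you search the raw text rather than the token lists. The query list is hard-coded as a stand-in for the sent_tokenize output so the snippet runs without nltk; the names error_message, pattern and found are illustrative only.

```python
import re

# Stand-in for sent_tokenize(specific_sent); hard-coded so nltk is not needed.
query = ["the sun shines.", "hello, this is a sentence."]

# Passing a list where a pattern string is expected raises a TypeError.
try:
    re.compile(query)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# Workaround sketch: escape each sentence and join them into one alternation.
pattern = re.compile("|".join(re.escape(sentence) for sentence in query))

text = "hello, this is a sentence. the sun shines. the the night is beautiful"
found = pattern.findall(text)
print(found)
```

The same joined pattern string could in principle be passed to df["text"].str.contains. Working on the token lists directly, as described above, avoids regexes altogether.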

import pandas as pd
from nltk.tokenize import sent_tokenize

df = pd.DataFrame({"text":["hello, this is a sentence. the sun shines. the the night is beautiful",
              "the sun shines.",
              "the night is beautiful. tomorrow i work"]})

specific_sent = "the sun shines. hello, this is a sentence."
# split the query string into a list of sentences
query = sent_tokenize(specific_sent)

# split each row's text into a list of sentences
df["token"] = df["text"].apply(sent_tokenize)

# check elements in common between each sentence list and query
df["word"] = df["token"].apply(lambda x: list(set(query).intersection(x)))

# True if they had elements in common, otherwise False (booleans, not strings)
df["query_match"] = df["word"].apply(bool)
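One thing to keep in mind: a set intersection does not keep the sentences in the order they appear in the text. If order matters, a small helper along these lines might do. This is only a sketch; match_sentences is a hypothetical name, and the token lists are hard-coded stand-ins for the sent_tokenize output so the snippet runs without nltk or pandas.

```python
def match_sentences(tokens, query):
    """Return the sentences in `tokens` that also occur in `query`, in text order."""
    wanted = set(query)  # set lookup keeps each membership test cheap
    return [sentence for sentence in tokens if sentence in wanted]


# Stand-ins for the tokenized query and the "token" column of the DataFrame.
query = ["the sun shines.", "hello, this is a sentence."]
rows = [
    ["hello, this is a sentence.", "the sun shines.", "the the night is beautiful"],
    ["the sun shines."],
    ["the night is beautiful.", "tomorrow i work"],
]

words = [match_sentences(tokens, query) for tokens in rows]
flags = [bool(matches) for matches in words]  # empty list means no match

print(words)
print(flags)
```

On the DataFrame this would be applied with df["token"].apply(lambda t: match_sentences(t, query)). Note that the comparison is exact, so "the sun shines" without the trailing period does not match "the sun shines."; you may want to normalize punctuation first.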


Tags: Python pandas pdf text-mining
