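I'm trying to iterate over some unstructured text data in Python. The end goal is to structure it in a DataFrame. For now I'm just trying to get the relevant data into an array and to understand how readline() works in Python.
This is what the text looks like: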
Title: title of an article
Full text: unfortunately the full text of each article,
is on numerous lines. Each article has a differing number
of lines. In this example, there are three..
Subject: Python
Title: title of another article
Full text: again unfortunately the full text of each article,
is on numerous lines.
Subject: Python
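The same format repeats for many articles in the same file. So far I've figured out how to extract lines that contain certain text. For example, I can loop through the file and put all the article titles in a list like this: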
a = "Title:"
titleList = []
sample = 'sample.txt'
with open(sample,encoding="utf8") as unstr:
    for line in unstr:
        if a in line:
            titleList.append(line)
Now I want to do the following:
a = "Title:"
b = "Full text:"
d = "Subject:"
list = []
sample = 'sample.txt'
with open(sample,encoding="utf8") as unstr:
    for line in unstr:
        if a in line:
            list.append(line)
        if b in line:
            # 1. Concatenate this line with each line after it, until I reach the line that includes "Subject:". Ignore the "Subject:" line, stop the "Full text:" subloop, and add the concatenated full text to the list.
            # 2. Continue the for loop within which all of this sits.
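As a Python beginner, I'm spinning my wheels searching Google on this topic. Any pointers would be much appreciated.
A uj5u.com user replied:
If you want to stick with your for loop, you probably need something like this: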
titles = []
texts = []
subjects = []
with open('sample.txt', encoding="utf8") as f:
    inside_fulltext = False
    for line in f:
        if line.startswith("Title:"):
            inside_fulltext = False
            titles.append(line)
        elif line.startswith("Full text:"):
            # start accumulating a new multi-line body
            inside_fulltext = True
            full_text = line
        elif line.startswith("Subject:"):
            # the body ends here; store it along with the subject
            inside_fulltext = False
            texts.append(full_text)
            subjects.append(line)
        elif inside_fulltext:
            # continuation line: append to the body rather than overwrite it
            full_text += line
        else:
            # Possibly throw a format error here?
            pass
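(A couple of things: Python is odd about names. When you write list = [], you are actually rebinding the name of the built-in list class, which may cause you problems later. You should really treat list, set, and so on as if they were keywords, even though technically Python does not, just to save yourself headaches. Also, given your description of the data, the startswith method is more precise here.)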
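To make both points in the note above concrete, here is a small sketch. The function name shadowing_demo and the example line are made up for illustration; they are not from the question.

```python
def shadowing_demo():
    # Rebinding the name `list` hides the built-in inside this scope.
    list = []
    try:
        list("abc")          # now calls the empty list object, not the built-in
    except TypeError as exc:
        return str(exc)
    return None

message = shadowing_demo()
print(message)               # 'list' object is not callable

# `in` matches the marker anywhere in the line; startswith only at the beginning.
body_line = "is on numerous lines. See Title: above."
matches_anywhere = "Title:" in body_line
matches_at_start = body_line.startswith("Title:")
print(matches_anywhere, matches_at_start)   # True False
```

With the `in` test, a body line that happens to mention "Title:" would be collected as a title; startswith avoids that as long as the markers always begin a line.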
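Alternatively, you could wrap the file object in an iterator (i = iter(f), then next(i)). That brings the hassle of catching StopIteration exceptions, but it would let you use a more classic while loop for the whole thing. Personally, I would stick with the state-machine approach above and make it robust enough to handle all reasonably expected edge cases.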
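For comparison, a rough sketch of the iterator variant might look like the following. It is only one possible shape: the function name parse_articles and the in-memory SAMPLE string (standing in for sample.txt) are assumptions for the example, and it passes a default to next() so that reaching the end of the file returns None instead of raising StopIteration.

```python
import io

# Hypothetical in-memory stand-in for sample.txt
SAMPLE = """Title: title of an article
Full text: unfortunately the full text of each article,
is on numerous lines. Each article has a differing number
of lines. In this example, there are three..
Subject: Python
Title: title of another article
Full text: again unfortunately the full text of each article,
is on numerous lines.
Subject: Python
"""

def parse_articles(f):
    """Return a list of (title, full_text, subject) tuples."""
    articles = []
    title = full_text = ""
    lines = iter(f)
    line = next(lines, None)          # None signals end of input
    while line is not None:
        if line.startswith("Title:"):
            title = line
        elif line.startswith("Full text:"):
            full_text = line
            line = next(lines, None)
            # Inner loop: keep concatenating until "Subject:" or end of input.
            while line is not None and not line.startswith("Subject:"):
                full_text += line
                line = next(lines, None)
            continue                  # re-examine the line that ended the body
        elif line.startswith("Subject:"):
            articles.append((title, full_text, line))
        line = next(lines, None)
    return articles

articles = parse_articles(io.StringIO(SAMPLE))
print(len(articles))
```

Passing an open file object instead of io.StringIO should work the same way, since both yield lines when iterated.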
A uj5u.com user replied:
Since your goal is to build a DataFrame, here is a solution using re, numpy, and pandas:
import re
import pandas as pd
import numpy as np
# read the whole file
with open('sample.txt', encoding="utf8") as f:
    text = f.read()
keys = ['Subject', 'Title', 'Full text']
regex = '(?:^|\n)(%s): ' % '|'.join(keys)
# split text on keys
chunks = re.split(regex, text)[1:]
# reshape flat list of records to group key/value and infos on the same article
df = pd.DataFrame([dict(e) for e in np.array(chunks).reshape(-1, len(keys), 2)])
Output:
Title Full text Subject
0 title of an article unfortunately the full text of each article,\nis on numerous lines. Each article has a differing number \nof lines. In this example, there are three.. Python
1 title of another article again unfortunately the full text of each article,\nis on numerous lines. Python
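The splitting step is easier to follow on a tiny input. The sketch below uses only the standard library and a made-up one-article string; it shows what re.split returns when the pattern contains a capturing group, and pairs keys with values by slicing instead of reshaping with numpy.

```python
import re

# Hypothetical one-article input, not the real sample.txt
text = "Title: first\nFull text: line one\nline two\nSubject: Python\n"
keys = ["Subject", "Title", "Full text"]
regex = "(?:^|\n)(%s): " % "|".join(keys)

# Because the key is in a capturing group, re.split keeps it in the result:
# [text before first key, key, value, key, value, ...]
parts = re.split(regex, text)
chunks = parts[1:]                      # drop the (empty) leading element

# Pair each key with the value that follows it.
record = dict(zip(chunks[0::2], chunks[1::2]))
print(parts)
print(record)
```

Two things to keep in mind: the last value keeps the file's trailing newline, so calling strip() on the values may be worthwhile, and the reshape in the answer assumes every article contains all three keys exactly once.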
