Python readline loop and sub-loop

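I'm trying to iterate over some unstructured text data in Python. The end goal is to structure it in a dataframe. For now I'm just trying to get the relevant data into an array and understand the readline() functionality in Python.

This is what the text looks like: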

Title: title of an article
Full text: unfortunately the full text of each article,
is on numerous lines. Each article has a differing number 
of lines. In this example, there are three..
Subject: Python
Title: title of another article
Full text: again unfortunately the full text of each article,
is on numerous lines.
Subject: Python

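The same format repeats for many articles in the same file. So far I have figured out how to pull out lines that contain certain text. For example, I can loop through the file and put all the article titles in a list like this:

a = "Title:"
titleList = []
sample = 'sample.txt'
with open(sample, encoding="utf8") as unstr:
    for line in unstr:
        if a in line:
            titleList.append(line)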

Now I want to do the following:

a = "Title:"
b = "Full text:"
d = "Subject:"
list = []
sample = 'sample.txt'
with open(sample, encoding="utf8") as unstr:
    for line in unstr:
        if a in line:
            list.append(line)
        if b in line:
            # 1. Concatenate this line with each line after it, until I reach the line that includes "Subject:". Ignore the "Subject:" line, stop the "Full text:" sub-loop, and add the concatenated full text to the list array.
            # 2. Continue the for loop within which all of this sits.
            pass

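As a Python beginner, I'm spinning my wheels searching Google on this topic. Any pointers would be much appreciated.

Answer from a uj5u.com user:

If you want to stick with your for loop, you probably need something like this: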

titles = []
texts = []
subjects = []

with open('sample.txt', encoding="utf8") as f:
    inside_fulltext = False
    for line in f:
        if line.startswith("Title:"):
            inside_fulltext = False
            titles.append(line)
        elif line.startswith("Full text:"):
            inside_fulltext = True
            full_text = line
        elif line.startswith("Subject:"):
            inside_fulltext = False
            texts.append(full_text)
            subjects.append(line)
        elif inside_fulltext:
            full_text += line
        else:
            # Possibly throw a format error here?
            pass

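(A couple of things: Python is odd about names. When you write list = [], you are actually rebinding the name of the list class, which can cause problems for you later. You should really treat list, set and so on as if they were keywords, even though technically Python does not, just to save yourself headaches. Also, given your description of the data, the startswith method is more precise here.)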
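To illustrate the two points in that note, here is a minimal sketch. The function name demo_shadowing and the sample line body_line are made up for illustration; only the built-in list and str.startswith come from the discussion.

```python
def demo_shadowing():
    """Show what happens once the name `list` is rebound to a list object."""
    list = []                  # hypothetical mistake: shadows the built-in type
    try:
        list("abc")            # the built-in would have returned ['a', 'b', 'c']
    except TypeError as exc:
        return str(exc)        # the name now refers to an object that is not callable
    return None


shadow_message = demo_shadowing()

# `in` matches the marker anywhere in the line, so a body line that merely
# mentions "Title:" would be picked up; startswith only accepts a real field line.
body_line = "See the section called Title: for details\n"   # hypothetical body line
matches_in = "Title:" in body_line
matches_startswith = body_line.startswith("Title:")
```

The shadowing is kept inside a function here so that the rest of the module still sees the real built-in.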

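Alternatively, you could wrap the file object in an iterator (i = iter(f), then next(i)), but that brings the hassle of catching the StopIteration exception. It would, however, let you use a more classic while loop for the whole thing. Personally, I would stick with the state machine approach above and make it robust enough to handle all reasonably expected edge cases.

Answer from a uj5u.com user:

Since your goal is to build a dataframe, here is a solution using re, numpy and pandas: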
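The iterator alternative might look roughly like the sketch below. It is not the answerer's code: the function name parse_articles and the tuple layout are assumptions, and it uses the two-argument form next(it, None) as a sentinel so that StopIteration never has to be caught. An io.StringIO object stands in for the opened file.

```python
import io


def parse_articles(lines):
    """Collect (title, full_text, subject) tuples with next() and while loops."""
    it = iter(lines)
    records = []
    title = None
    line = next(it, None)              # None signals that the input is exhausted
    while line is not None:
        if line.startswith("Title:"):
            title = line
        elif line.startswith("Full text:"):
            full_text = line
            line = next(it, None)
            # Sub-loop: absorb continuation lines until the Subject line (or EOF).
            while line is not None and not line.startswith("Subject:"):
                full_text += line
                line = next(it, None)
            # `line` is now the Subject line, or None if the file ended early.
            records.append((title, full_text, line))
        line = next(it, None)          # advance the outer loop
    return records


sample = io.StringIO(
    "Title: title of an article\n"
    "Full text: first line,\n"
    "second line.\n"
    "Subject: Python\n"
    "Title: title of another article\n"
    "Full text: only line.\n"
    "Subject: Python\n"
)
records = parse_articles(sample)
```

With a real file, the call would be parse_articles(f) inside the with open(...) block. The inner while loop is the "sub-loop" the question asks about; the trade-off is that the position in the file is now advanced in several places, which is easier to get wrong than the single for loop of the state machine.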

import re
import pandas as pd
import numpy as np

# read the whole file into one string
with open('sample.txt', encoding="utf8") as f:
    text = f.read()


keys = ['Subject', 'Title', 'Full text']

regex = '(?:^|\n)(%s): ' % '|'.join(keys)

# split text on keys
chunks = re.split(regex, text)[1:]
# reshape flat list of records to group key/value and infos on the same article
df = pd.DataFrame([dict(e) for e in np.array(chunks).reshape(-1, len(keys), 2)])

Output:

                      Title                                                                                                                                               Full text Subject
0       title of an article  unfortunately the full text of each article,\nis on numerous lines. Each article has a differing number \nof lines. In this example, there are three..  Python
1  title of another article                                                                               again unfortunately the full text of each article,\nis on numerous lines.  Python
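The reshape in that solution relies on every article carrying all three fields. As a hedged variation using only the standard library, the sketch below splits on the same kind of regex but groups the key/value pairs into one dict per article, assuming (this is an assumption, not something the thread states) that every article starts with a Title line. The names split_records, KEYS and PATTERN are made up for illustration.

```python
import re

KEYS = ["Title", "Full text", "Subject"]
# A key counts only at the very start of the text or right after a newline.
PATTERN = re.compile(r"(?:^|\n)(%s): " % "|".join(KEYS))


def split_records(text):
    """Return one dict per article; a missing field is simply absent from its dict."""
    parts = PATTERN.split(text)[1:]    # drop whatever precedes the first key
    records = []
    current = {}
    # parts alternates key, value, key, value, ...
    for key, value in zip(parts[0::2], parts[1::2]):
        if key == "Title" and current:
            records.append(current)    # a new Title closes the previous article
            current = {}
        current[key] = value.strip()
    if current:
        records.append(current)
    return records


text = (
    "Title: title of an article\n"
    "Full text: first line,\n"
    "second line.\n"
    "Subject: Python\n"
    "Title: title of another article\n"
    "Full text: only line.\n"      # this article has no Subject line
)
records = split_records(text)
```

The resulting list of dicts can be passed to pd.DataFrame(records); pandas fills keys that are missing from a record with NaN.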

