Cleaning up a malformed questionnaire with regular expressions

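I have a malformed questionnaire in which the answer (and an accompanying newline) often lands somewhere in the middle of the question. This is a problem for sentence segmentation (where a sentence means a question plus its answer), so a model has a hard time extracting information from each question-answer pair!

Example: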

\n01 Do you have preexisting      No\nconditions?\n02 Within the past 12 months I worried about          Never True\nmy health would get worse.\n03 Within the past 12 months I have had         Never True\nhigh blood pressure.\n04 What is your housing situation today?   I have housing\n05 How many times have you moved in the past 12        Zero (I did not move)\nmonths?\n06 Are you worried that in the next 2 months, you may not    No\nhave your own housing to live in?\n07 Do you have trouble paying your heating or electricity    No\nbill?\n08 Do you have trouble paying for medicines?                 No\n09 Are you currently unemployed and looking for work?        No\n10 Are you interested in more education?                     Yes\n\n

Printed version of the example:

01 Do you have preexisting      No
conditions?
02 Within the past 12 months I worried about          Never True
my health would get worse.
03 Within the past 12 months I have had         Never True
high blood pressure.
04 What is your housing situation today?   I have housing
05 How many times have you moved in the past 12        Zero (I did not move)
months?
06 Are you worried that in the next 2 months, you may not    No
have your own housing to live in?
07 Do you have trouble paying your heating or electricity    No
bill?
08 Do you have trouble paying for medicines?                 No
09 Are you currently unemployed and looking for work?        No
10 Are you interested in more education?                     Yes

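Expected output:

If the answer sits somewhere inside the question, move it to the end of the sentence;
Remove unnecessary spaces and newlines inside the question;
Replace the question mark or other punctuation at the end of the question with a colon (:), so that a sentence segmentation model keeps the answer with its question, before the next question begins.

Expected output for the example: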

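\n01 Do you have preexisting conditions: No\n02 Within the past 12 months I worried about my health would get worse: Never True\n03 Within the past 12 months I have had high blood pressure: Never True\n04 What is your housing situation today: I have housing\n05 How many times have you moved in the past 12 months: Zero (I did not move)\n06 Are you worried that in the next 2 months, you may not have your own housing to live in: No\n07 Do you have trouble paying your heating or electricity bill: No\n08 Do you have trouble paying for medicines: No\n09 Are you currently unemployed and looking for work: No\n10 Are you interested in more education: Yes\n\n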

I have been trying to match the text between consecutive \n(0[1-9]|1[0-3]) markers and to use re.sub with lambda m: m.group(), but no luck so far. Any suggestions are welcome!

Reply from a uj5u.com community member:

This gets close, I believe:

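import re

# A question starts wherever a newline is followed by two digits and a space.
question_break_re = re.compile(r"\n(?=\d{2} )")
# The answer is the text that follows a run of two or more whitespace characters, up to the next newline.
answer_re = re.compile(r"\s{2,}([^\n]+)")
# Any run of whitespace (including newlines) is collapsed to a single space.
whitespace_re = re.compile(r"\s+")
# An optional "?" or "." at the very end of the question.
end_of_question_mark_re = re.compile(r"(?:\?|\.)?$")

def tidy_up_question(question):
    answer = None
    match = answer_re.search(question)
    if match:
        answer = match.group(1)
        # Cut the answer (and the spaces before it) out of the question text.
        question = question[:match.start(0)] + question[match.end(0):]
    question = whitespace_re.sub(' ', question).strip()
    if answer is not None:
        # count=1 stops after the first match, so the answer is appended only once.
        question = end_of_question_mark_re.sub(f": {answer}", question, count=1)
    return question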


text = "\n01 Do you have preexisting      No\nconditions?\n02 Within the past 12 months I worried about          Never True\nmy health would get worse.\n03 Within the past 12 months I have had         Never True\nhigh blood pressure.\n04 What is your housing situation today?   I have housing\n05 How many times have you moved in the past 12        Zero (I did not move)\nmonths?\n06 Are you worried that in the next 2 months, you may not    No\nhave your own housing to live in?\n07 Do you have trouble paying your heating or electricity    No\nbill?\n08 Do you have trouble paying for medicines?                 No\n09 Are you currently unemployed and looking for work?        No\n10 Are you interested in more education?                     Yes\n\n"

q_and_a = [
    tidy_up_question(question)
    for question in question_break_re.split(text)
    if question.strip()
]

print('\n'.join(q_and_a))

Output:

01 Do you have preexisting conditions: No
02 Within the past 12 months I worried about my health would get worse: Never True
03 Within the past 12 months I have had high blood pressure: Never True
04 What is your housing situation today: I have housing
05 How many times have you moved in the past 12 months: Zero (I did not move)
06 Are you worried that in the next 2 months, you may not have your own housing to live in: No
07 Do you have trouble paying your heating or electricity bill: No
08 Do you have trouble paying for medicines: No
09 Are you currently unemployed and looking for work: No
10 Are you interested in more education: Yes

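This fails in some edge cases: for example, if the 12 in "in the past 12 months" had wrapped onto the start of the next line, it would be treated as the beginning of a new question. Also, any run of multiple consecutive spaces that does not immediately precede the answer will confuse things.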
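
To make these two edge cases concrete, the sketch below reproduces them with the same two patterns used in the code above. The input strings are made-up variants of questions 05 and 07, not lines from the original questionnaire. The split_sequential helper is a hypothetical mitigation and not part of the answer: it assumes the questions are numbered consecutively and glues a piece back onto the previous question when its number is not the one expected next.

```python
import re

question_break_re = re.compile(r"\n(?=\d{2} )")
answer_re = re.compile(r"\s{2,}([^\n]+)")

# Edge case 1 (made-up input): the line wraps right before "12 months?".
wrapped = (
    "\n05 How many times have you moved in the past    Zero"
    "\n12 months?"
    "\n06 Next question?   No"
)
naive_pieces = [p for p in question_break_re.split(wrapped) if p.strip()]
print(naive_pieces)  # "12 months?" is split off as if it were its own question

# Edge case 2 (made-up input): a stray double space before the real answer gap.
stray = "07 Do you have  trouble paying    No"
print(repr(answer_re.search(stray).group(1)))  # the captured "answer" starts too early


def split_sequential(text):
    """Hypothetical mitigation: start a new question only when its two-digit
    number is the one expected next; otherwise treat the piece as wrapped text.
    Text before the first numbered question is dropped."""
    questions = []
    expected = None
    for piece in question_break_re.split(text):
        if not piece.strip():
            continue
        number = re.match(r"(\d{2}) ", piece)
        if number and (expected is None or int(number.group(1)) == expected):
            questions.append(piece)
            expected = int(number.group(1)) + 1
        elif questions:
            # Put back the newline that the split consumed.
            questions[-1] += "\n" + piece
    return questions


print(split_sequential(wrapped))
```

This still breaks if the wrapped number happens to equal the next expected question number, and it does nothing for the stray-spaces case.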

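The approach I used: chop the string into questions, on the working theory that every question starts a line with two digits; identify the answer as the piece of text between a run of multiple spaces and a newline; and finally replace the closing punctuation with a colon and the answer.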
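
The three steps can be traced on a single question. This is only an illustrative sketch that inlines the same patterns as the code above; the variable names (pieces, remainder, and so on) are made up for the walk-through, and the input is just the first two questions of the example.

```python
import re

raw = (
    "\n01 Do you have preexisting      No\nconditions?"
    "\n02 Within the past 12 months I worried about          Never True"
    "\nmy health would get worse.\n"
)

# Step 1: split into questions at each newline followed by two digits and a space.
pieces = [p for p in re.split(r"\n(?=\d{2} )", raw) if p.strip()]
first = pieces[0]
print(repr(first))

# Step 2: the answer is the text after a run of 2+ whitespace characters,
# up to the next newline; cut it out of the question text.
match = re.search(r"\s{2,}([^\n]+)", first)
answer = match.group(1)
remainder = first[:match.start()] + first[match.end():]
print(repr(answer), repr(remainder))

# Step 3: collapse whitespace, then swap the closing "?" or "." for ": <answer>".
question = re.sub(r"\s+", " ", remainder).strip()
result = re.sub(r"(?:\?|\.)?$", f": {answer}", question, count=1)
print(result)
```

Keeping the steps separate like this makes it easier to see which one misbehaves when a questionnaire does not follow the assumed layout.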

Please credit the source when reposting. Link to this article: https://www.uj5u.com/qukuanlian/465823.html

Tags: python-3.x, regex
