I have a DataFrame as follows:
import pandas as pd
df = pd.DataFrame({'text':['she is a. good 15. year old girl. she goes to school on time.', 'she is not an A. level student. This needs to be discussed.']})
To split on the period (.) and explode the result into rows, I did the following:
df = df.assign(text=df['text'].str.split('.')).explode('text')
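But I don't want to split after every period. I want to split on a period unless it is surrounded by numbers (e.g. 22., 3.4) or by a single character (e.g. a., a.b, b.d).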
desired_output:
text
'she is a. good 15. year old girl'
'she goes to school on time'
'she is not an A. level student'
'This needs to be discussed.'
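So I also tried the following pattern, hoping it would ignore single characters and numbers, but it strips the last letter from the last word of each sentence.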
df.assign(text=df['text'].str.split(r'(?:(?<!\.|\s)[a-z]\.|(?<!\.|\s)[A-Z]\.) ')).explode('text')
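I edited the pattern, so now it matches every kind of period that follows a number or a single letter: r'(?:(?<=.|\s)[[a-zA-Z]].|(?<= .|\s)\d ) ' So I think I only need to figure out how to split on periods except the ones matched by this last pattern.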
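One way to express "split on a period unless it follows a number or a single letter" is a pair of negative lookbehinds. The following is only a sketch under that reading of the requirement: the pattern and the helper name `split_on_sentence_dots` are illustrative and do not come from the original post.

```python
import re

# Hypothetical pattern (not from the original post): match a period, plus any
# whitespace after it, unless the period directly follows a digit or a
# standalone single letter (a letter with a word boundary in front of it).
SPLIT_PATTERN = re.compile(r'(?<!\d)(?<!\b[A-Za-z])\.\s*')

def split_on_sentence_dots(text):
    """Split text on periods, skipping those after a digit or a lone letter."""
    parts = SPLIT_PATTERN.split(text)
    # A trailing period leaves an empty string at the end; drop it.
    return [p.strip() for p in parts if p.strip()]

sample = 'she is a. good 15. year old girl. she goes to school on time.'
print(split_on_sentence_dots(sample))
```

Both lookbehinds are fixed-width, which Python's `re` module requires. Note that this sketch drops the final period of every sentence, so it would not keep the trailing period shown on the last line of the desired output.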
Answer from a uj5u.com user:
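#!/usr/bin/python3
# -*- coding: utf-8 -*-
import re

input_text = 'she is a. good 15. year old girl. she goes to school on time. she is not an A. level student. This needs to be discussed.'

# Split on every period first, then glue pieces back together where needed.
sentences = re.split(r'\.', input_text)
output = []
text = ''
for v in sentences:
    text = text + v
    if re.search(r'\s([a-z]{1}|[0-9]+)$', v, re.IGNORECASE):
        # The piece ends in a single letter or a number, so the period was
        # not a sentence boundary: restore it and keep accumulating.
        text = text + "."
    else:
        # Real sentence boundary: store the accumulated sentence and reset.
        text = text.strip()
        if text != '':
            output.append(text)
        text = ''
print(output)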
Output:
['she is a. good 15. year old girl', 'she goes to school on time', 'she is not an A. level student', 'This needs to be discussed']
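The loop above works on a single string, while the question starts from a DataFrame column. As a rough sketch, the same logic could be wrapped in a function and applied per row before exploding. The function name `split_sentences` and the final flush step are assumptions added here for illustration and are not part of the original answer.

```python
import re

import pandas as pd

def split_sentences(text):
    """Split on periods, re-joining pieces that end in a lone letter or a number."""
    output = []
    current = ''
    for piece in text.split('.'):
        current += piece
        if re.search(r'\s([a-z]|[0-9]+)$', piece, re.IGNORECASE):
            # Not a sentence boundary: restore the period and keep going.
            current += '.'
        else:
            current = current.strip()
            if current:
                output.append(current)
            current = ''
    # Assumed extra step: flush text that ended in a lone letter or number
    # without a final period, removing the period restored above.
    if current:
        output.append(current[:-1].strip())
    return output

df = pd.DataFrame({'text': [
    'she is a. good 15. year old girl. she goes to school on time.',
    'she is not an A. level student. This needs to be discussed.',
]})

# Each row becomes a list of sentences, and explode() gives one row per sentence.
df = df.assign(text=df['text'].apply(split_sentences)).explode('text')
print(df['text'].tolist())
```

After `explode`, rows that came from the same original string share an index label; calling `reset_index(drop=True)` afterwards gives a clean 0..n-1 index if that matters.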
