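As the title says, I am trying to clean a large number of short texts by removing sentences that start with certain words, but only when that sentence is the last one in a text containing more than one sentence.
Say I want to remove the last sentence when it starts with "Jack is ...".
Here is an example that covers the various cases: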
test_strings <- c("Jack is the tallest person.",
"and Jack is the one who said, let there be fries.",
"There are mirrors. And Jack is there to be suave.",
"There are dogs. And jack is there to pat them. Very cool.",
"Jack is your lumberjack. Jack, is super awesome.",
"Whereas Jack is, for the whole summer, sound asleep. Zzzz",
"'Jack is so cool!' Jack is cool. Jack is also cold."
)
Here is the regex I have so far: "(?![A-Z'].+[\\.|'] )[Jj]ack,? is.+\\.$"
library(purrr)    # map_chr()
library(stringr)  # str_replace()
map_chr(test_strings, ~str_replace(.x, "(?![A-Z'].+[\\.|'] )[Jj]ack,? is.+\\.$", "[TRIM]"))
which produces these results:
[1] "[TRIM]"
[2] "and [TRIM]"
[3] "There are mirrors. And [TRIM]"
[4] "There are dogs. And [TRIM]"
[5] "Jack is your lumberjack. [TRIM]"
[6] "Whereas Jack is, for the whole summer, sound asleep. Zzzz"
[7] "'Jack is so cool!' Jack is cool. [TRIM]"
## Basically my current regex is still too greedy.
## No trimming should happen for the first 4 examples.
## 5 - 7th examples are correct.
## Explanations:
# (1) Wrong. Only one sentence; do not trim, but current regex trims it.
# (2) Wrong. It is a sentence but does not start with 'Jack is'.
# (3) Wrong. Same situation as (2) -- the sentence starts with 'And' instead of 'Jack is'
# (4) Wrong. Same as (2) (3), but this time test with lowercase `jack`
# (5) Correct. Trim the second sentence as it is the last. Optional ',' removal is tested here.
# (6) Correct.
# (7) Correct. Sometimes texts do not begin with alphabets.
Thanks for your help!
A uj5u.com user replied:
gsub("^(.*\\.)\\s*Jack,? is[^.]*\\.?$", "\\1 [TRIM]", test_strings, ignore.case = TRUE)
# [1] "Jack is the tallest person."
# [2] "and Jack is the one who said, let there be fries."
# [3] "There are mirrors. And Jack is there to be suave."
# [4] "There are dogs. And jack is there to pat them. Very cool."
# [5] "Jack is your lumberjack. [TRIM]"
# [6] "Whereas Jack is, for the whole summer, sound asleep. Zzzz"
# [7] "'Jack is so cool!' Jack is cool. [TRIM]"
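Breakdown:
^(.*\\.)\\s*: because at least one sentence must come before the part we cut, we need to find a preceding dot \\.
Jack,? is: taken from your regex
[^.]*\\.?$: zero or more non-dot characters, followed by an optional dot and the end of the string
If you want to allow whitespace after the final period, you can change the last part to [^.]*\\.?\\s*$; that did not seem necessary for your example.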
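The whitespace-tolerant variant mentioned in the breakdown can be sketched as follows. This is only an illustration: the two inputs in `x` are hypothetical strings (not from the question) that end in trailing spaces, and the pattern is the one above with `\\s*` added before `$`.

```r
# Hypothetical inputs with trailing spaces after the final period.
x <- c("Jack is your lumberjack. Jack, is super awesome.  ",
       "Jack is the tallest person.  ")

# Same pattern as the answer, plus \\s* before $ so trailing whitespace
# is consumed as part of the trimmed sentence.
trimmed <- gsub("^(.*\\.)\\s*Jack,? is[^.]*\\.?\\s*$", "\\1 [TRIM]", x,
                ignore.case = TRUE)
print(trimmed)
```

The first string has a preceding sentence, so its last sentence is replaced together with the trailing spaces; the second has only one sentence and is left unchanged.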
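A uj5u.com user replied:
You can match a dot (or use a character class such as [.!?] to match more characters), then match the last sentence, which starts with Jack and ends with a dot (or, again, a character class to match more characters):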
\.\K\h*[Jj]ack,? is[^.\n]*\.$
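The pattern matches:
\.\K: match a . and forget what has been matched so far
\h*[Jj]ack,? is: match optional horizontal whitespace, then Jack or jack, an optional comma, and is
[^.\n]*\.: match zero or more characters other than a . or a newline, then a literal .
$: end of string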
Example code:
test_strings <- c("Jack is the tallest person.",
"and Jack is the one who said, let there be fries.",
"There are mirrors. And Jack is there to be suave.",
"There are dogs. And jack is there to pat them. Very cool.",
"Jack is your lumberjack. Jack, is super awesome.",
"Whereas Jack is, for the whole summer, sound asleep. Zzzz",
"'Jack is so cool!' Jack is cool. Jack is also cold."
)
sub("\\.\\K\\h*[Jj]ack,? is[^.\\n]*\\.$", " [TRIM]", test_strings, perl=TRUE)
Output
[1] "Jack is the tallest person."
[2] "and Jack is the one who said, let there be fries."
[3] "There are mirrors. And Jack is there to be suave."
[4] "There are dogs. And jack is there to pat them. Very cool."
[5] "Jack is your lumberjack. [TRIM]"
[6] "Whereas Jack is, for the whole summer, sound asleep. Zzzz"
[7] "'Jack is so cool!' Jack is cool. [TRIM]"
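The character-class idea mentioned in this answer ([.!?] instead of a plain dot) might look like the sketch below. The names `more_strings`, `end_punct` and `pattern` are hypothetical, as are the input strings; the sketch assumes that a sentence ends in one of ., ! or ?.

```r
# Hypothetical inputs whose sentences end in "!" or "?" as well as ".".
more_strings <- c("It is late! Jack is still awake?",
                  "Jack is here. Jack is loud!",
                  "Jack is the only sentence!")

# Assumed set of sentence-ending characters.
end_punct <- "[.!?]"

# Same structure as the pattern above: a terminator, \K to keep it out of
# the match, then the final "Jack is ..." sentence up to its own terminator.
pattern <- paste0(end_punct, "\\K\\h*[Jj]ack,? is[^.!?\\n]*", end_punct, "$")

result <- sub(pattern, " [TRIM]", more_strings, perl = TRUE)
print(result)
```

As with the original pattern, a text made up of a single sentence is left alone because no terminator precedes the "Jack is" sentence.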
