Regex to find repeated phrases, longest first

I have a string like this:

$string = "hello this is a string and hello but this is a string hello but this is a string and ";

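It contains repeated words and repeated phrases, but I only want the phrases, so I expect

hello but this is a string to be captured.

I tried the regex (.{10,}).*?\1, but it gives me this is a string and

What I want is hello but this is a string, because it is the repeat with the most characters. I want to keep the minimum at 10 rather than raise it to {25,} just to force a longer match.

The regex is also very, very slow.

Cary Swoveland: my plan is to capture the longest repeated string and remove it from the string so that only one copy is left. For my example the result would be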

hello this is a string and hello but this is a string and

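A uj5u.com user replied:

Collect every substring that starts with a word char at a word boundary and occurs again later in the string, then pick the longest one in an extra step (a plain regex cannot do that on its own):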

$string = "hello this is a string and hello but this is a string hello but this is a string and ";
if (preg_match_all('~(?=(\b(\w.{9,})(?=.*?\b\2)))~u', $string, $m)) {
    echo array_reduce($m[1], function ($a, $b) { return strlen($a ?? '') > strlen($b ?? '') ? $a : $b; });
}
// => hello but this is a string (the captured value also keeps the trailing space)

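Note: if you plan to cap the match length at 25 characters, use '~(?=(\b(\w.{9,24})(?=.*?\b\2)))~u'.

Details:

(?= - start of a positive lookahead:
- ( - Group 1:
  - \b - a word boundary
  - (\w.{9,}) - Group 2: a word char followed by nine or more chars other than line break chars
  - (?=.*?\b\2) - a positive lookahead that requires any zero or more chars other than line break chars, as few as possible, followed by a word boundary and then the same text that Group 2 captured
- ) - end of Group 1
) - end of the lookahead.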
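The breakdown above can also be written into the pattern itself. This is only a sketch: it assumes PCRE's extended mode (the x flag), where whitespace in the pattern is ignored and # starts a comment. The variable names are made up for the illustration.

```php
<?php
$string = "hello this is a string and hello but this is a string hello but this is a string and ";

// The compact pattern from the answer.
$compact = '~(?=(\b(\w.{9,})(?=.*?\b\2)))~u';

// The same pattern in extended mode, one component per line.
$verbose = '~
    (?=                # lookahead, consumes nothing, so candidates may overlap
      (                # group 1, the candidate substring
        \b             # start at a word boundary
        (\w.{9,})      # group 2, one word char plus at least nine more chars
        (?=.*?\b\2)    # the same text must occur again further on
      )
    )
~xu';

preg_match_all($compact, $string, $a);
preg_match_all($verbose, $string, $b);

var_dump($a[1] === $b[1]); // both patterns should collect the same candidates
```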

We only use $m[1], and take the longest string from that array with array_reduce($m[1], function ($a, $b) { return strlen($a ?? '') > strlen($b ?? '') ? $a : $b; }).
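To get from the captured phrase to the result the asker described (the longest repeat kept only once), something like the following might work. It is a rough sketch, not part of the answer: collapse_longest_repeat is a hypothetical helper, and it assumes the first copy is the one to keep and that copies are separated by whitespace.

```php
<?php
// Hypothetical helper: keep the first copy of the longest repeated phrase
// and drop every later copy together with the whitespace in front of it.
function collapse_longest_repeat(string $text): string
{
    if (!preg_match_all('~(?=(\b(\w.{9,})(?=.*?\b\2)))~u', $text, $m)) {
        return $text; // nothing of 10+ chars repeats
    }

    // Longest candidate, trimmed because a candidate may end in whitespace.
    $longest = '';
    foreach ($m[1] as $candidate) {
        $candidate = trim($candidate);
        if (strlen($candidate) > strlen($longest)) {
            $longest = $candidate;
        }
    }

    // Split right after the first copy, then clean up only the remainder.
    $end  = strpos($text, $longest) + strlen($longest);
    $head = substr($text, 0, $end);
    $tail = substr($text, $end);
    $copy = '~\s*(?<!\w)' . preg_quote($longest, '~') . '(?!\w)~u';

    return $head . preg_replace($copy, '', $tail);
}

$string = "hello this is a string and hello but this is a string hello but this is a string and ";
echo rtrim(collapse_longest_repeat($string));
// => hello this is a string and hello but this is a string and
```

The lookarounds (?<!\w) and (?!\w) are there so that a later copy is only removed when it stands as whole words.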

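A uj5u.com user replied:

I would like to suggest an algorithm for computing the longest repeated substring of a given string, where the substring is made up of whole words and is neither preceded nor followed by a word character.

The approach is simple: condition on the word in the string that is the first word of the substring. I start by removing any non-word characters at the beginning of the string, then find the longest repeated string that starts at the beginning of that string.

Next, whether or not a repeated string was found, I form a new string by removing the first word of the modified string together with the non-word characters that follow it. The procedure is repeated, and at each step where a repeated string is found, its length is compared with the length of the longest repeated string known so far.

The algorithm can be implemented in Ruby as follows. I realize, of course, that a PHP solution is wanted, but I don't know PHP. Ruby code reads much like pseudocode, however. Perhaps an inspired reader will be interested in converting it to PHP.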

RGX = /\A(.*\w)\b(?=.*\b\1\b)/

def longest_repeating(str)
  longest = { str: '', len: 0 } # best solution known so far
  loop do                       # loop until breaking out by returning
    i = (str =~ /\w/)           # index of first word char if present
    return longest if i.nil?    # no more words to examine
    str = str[i..-1]            # remove first i characters  
    s = str[RGX]                # obtain string matched by RGX
    if s                        # match found
      n = s.length              # update longest if new longest found
      longest = { str: s, len: n } if n > longest[:len]
    end
    str = str[/ .*/]            # drop the first word (nil if no space is left)
  end
end

longest_repeating "hello this is a string and hello but this is a string hello but this is a string and "
  #=> {:str=>"hello but this is a string", :len=>26}

longest_repeating "aaa bbb ccc ddd bbb ccc ddd eee fff fff ggg hhh iii jjj kkk lll eee fff fff ggg hhh iii jjj"           
  #=> {:str=>"eee fff fff ggg hhh iii jjj", :len=>27}

The regular expression held by RGX can be broken down as follows.

\A        # match beginning of string
(.*\w)    # match zero or more characters followed by a word
          # character, save to capture group 1
\b        # match a word boundary
(?=       # begin a positive lookahead
  .*      # match zero or more characters
  \b\1\b  # match the content of capture group 1 with word boundaries
)

Note that the code ensures that the first character matched by (.*\w) is a word character.
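For readers who want to take up the invitation, here is one possible PHP port of the Ruby method. Treat it as a sketch rather than a faithful translation: the function name and array keys mirror the Ruby code, but the first word is dropped with \A\w+ (following the prose description) instead of cutting at the first space, and the pattern is used without the u flag, so lengths are counted in bytes.

```php
<?php
// Possible PHP port of longest_repeating (an assumption, not the author's code).
function longest_repeating(string $str): array
{
    // Anchored at the start: the longest prefix that ends in a word char
    // and occurs again later in the string as whole words.
    $rgx = '~\A(.*\w)\b(?=.*\b\1\b)~';
    $longest = ['str' => '', 'len' => 0]; // best solution known so far

    // Loop while the string still contains a word char.
    while (preg_match('~\w~', $str, $m, PREG_OFFSET_CAPTURE)) {
        $str = substr($str, $m[0][1]); // drop leading non-word chars

        if (preg_match($rgx, $str, $hit) && strlen($hit[1]) > $longest['len']) {
            $longest = ['str' => $hit[1], 'len' => strlen($hit[1])];
        }

        // Drop the first word; the next pass removes what follows it.
        $str = preg_replace('~\A\w+~', '', $str);
    }

    return $longest;
}

$result = longest_repeating("hello this is a string and hello but this is a string hello but this is a string and ");
echo $result['str'], ' (', $result['len'], ')';
// => hello but this is a string (26)
```

Because each pass removes at least one word character, the loop ends once no word characters remain.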
