從兩個字串中洗掉一個字符子集后檢查兩個字串是否相等-有解無憂

我最近遇到了這個問題：

您有兩個字串 s1 和 s2，它們完全由小寫字母“a”到“r”組成，需要處理一系列查詢。每個查詢提供從“a”到“r”的小寫英文字母子集。對于每個查詢，確定 s1 和 s2 在僅限于查詢中的字母時是否相等。s1 和 s2 最多可以包含 10^5 個字符，最多可以有 10^5 個查詢。

例如，如果 s1 是“aabcd”而 s2 是“caabd”，并且要求您處理帶有子集“ac”的查詢，那么 s1 變為“aac”，而 s2 變為“caa”。這些不匹配，因此查詢將回傳 false。

通過執行以下操作，我能夠在 O(N^2) 時間內解決此問題：對于每個查詢，我通過遍歷兩個字串（一次一個字符）檢查 s1 和 s2 是否相等，跳過不相等的字符位于允許字符的子集中，并檢查 s1 和 s2 中的“允許”字符是否匹配。如果在某些時候，字符不匹配，則字串不相等。否則，當僅限于查詢中的字母時，s1 和 s2 相等。每個查詢需要 O(N) 時間來處理，并且有 N 個查詢，總共需要 O(N^2) 時間。

然而，有人告訴我有一種方法可以比 O(N^2) 更快地解決這個問題，使用位運算子和一些聰明的數學。有誰知道如何做到這一點？

uj5u.com熱心網友回復：

第一個明顯的加速是確保你的集合成員測驗是 O(1)。為此，有幾個選項：

將每個字母表示為一個位——現在每個字符都是一個 18 位值，只設定了一個位。允許的字符集現在是這些位 OR 在一起的掩碼，您可以使用按位與來測驗字符的成員資格；
或者，您可以有一個 18 值陣列并按字符對其進行索引（c - 'a'將給出 0 到 17 之間的值）。成員資格測驗基本上是陣列查找的成本（并且您可以通過不進行減法來節省操作 - 而只是使陣列更大并直接按字符索引。

思想實驗

下一個潛在的加速是認識到在兩個字串中出現的次數不完全相同的任何字符將立即成為失敗的匹配。您可以使用可以在 O(N) 時間內完成的直方圖計算兩個字串中的所有字符頻率。這樣，如果這樣的字符出現在查詢中，您可以修剪搜索空間，并且可以在恒定時間內對此進行測驗。

當然，這對真正的壓力測驗沒有幫助，因為它可以保證所有可能的字母在兩個字串中都具有匹配的頻率。那么，你會怎么做呢？

好吧，您通過認識到對于字串 1中字符的任何位置x和字串 2 中該字符的某個位置將是有效匹配來擴展上述前提（即相同數量的字符x出現在兩個字串中，直到它們各自的位置），那么在這些位置之前的任何其他字符的總數也必須相等。對于任何不正確的角色，它不可能與 character 兼容x。

概念

讓我們從一種稱為記憶的技術開始考慮這一點，您可以利用預先計算或部分計算的資訊并從中獲得大量資訊。所以考慮兩個這樣的字串：

a  b  a  a  c  d  e   |   e  a  b  a  c  a  d

你可以在這里做些什么有用的事情？那么，為什么不存盤每個字母的部分計數總和：

          a  b  a  a  c  d  e  |  e  a  b  a  c  a  d
         ----------------------|----------------------
freq(a) = 1  1  2  3  3  3  3  |  0  1  1  2  2  3  3
freq(b) = 0  1  1  1  1  1  1  |  0  0  1  1  1  1  1
freq(c) = 0  0  0  0  1  1  1  |  0  0  0  0  1  1  1
freq(d) = 0  0  0  0  0  1  1  |  0  0  0  0  0  0  1
freq(e) = 0  0  0  0  0  0  1  |  1  1  1  1  1  1  1

這會占用大量記憶體，但別擔心——我們稍后會處理。相反，花時間吸收我們實際在做的事情。

查看上表，我們在這些字串的每個位置都有兩個字串的運行字符總數。

"ab"現在讓我們通過顯示匹配查詢和不匹配查詢的示例來看看我們的匹配規則是如何作業的"acd"：

對于"ab"：

           a  b  a  a  c  d  e  |  e  a  b  a  c  a  d
          ----------------------|----------------------
 freq(a) = 1  1  2  3  3  3  3  |  0  1  1  2  2  3  3
 freq(b) = 0  1  1  1  1  1  1  |  0  0  1  1  1  1  1
           ^  ^  ^  ^                 ^  ^  ^     ^

我們掃描頻率陣列，直到我們在查詢中找到我們的一個字母。我在^上面標記的位置。如果您洗掉所有未標記的列，您將看到剩余的列在兩邊都匹配。所以這是一場比賽。

對于"acd"：

           a  b  a  a  c  d  e  |  e  a  b  a  c  a  d
          ----------------------|----------------------
 freq(a) = 1  1  2  3  3  3  3  |  0  1  1  2  2  3  3
 freq(c) = 0  0  0  0  1  1  1  |  0  0  0  0  1  1  1
 freq(d) = 0  0  0  0  0  1  1  |  0  0  0  0  0  0  1
           ^     ^  #  #  ^           ^     ^  #  #  ^

Here, all columns are matching except those marked with #.

Putting it together

All right so you can see how this works, but you may wondering about the runtime, because those examples above seem to be doing even more scanning than you were doing before!

Here's where things get interesting, and where our character frequency counts really kick in.

First, observe what we're actually doing on those marked columns. For any one character that we care about (for example, the 'a'), we're looking for only the positions in both strings where its count matches, and then we're comparing these two columns to see what other values match. This gives us a set of all other characters that are valid when used with 'a'. And of course, 'a' itself is also valid.

And what was our very first optimization? A bitset -- an 18-bit value the represents valid characters. You can now use this. For the two columns in each string, you set a 1 for characters with matching counts and a zero for characters with non-matching counts. If you process every single pair of matching 'a' values in this manner, what you get is a bunch of sets that they work with. And you can keep a running "master" set that represents the intersection of these -- you just intersect it with each intermediate set you calculate, which is a single bitwise-AND.

By the time you reach the end of both strings, you have performed a O(N) search and you examined 18 rows any time you encountered 'a'. And the result was the set of all characters that work with 'a'.

Now repeat for the other characters, one at a time. Every time it's a O(N) scan as above and you wind up with the set of all other characters that work with that the one you're processing.

After processing all these rows you now have an array containing 18 values representing the set of all characters that work with any one character. The operation took O(18N) time and O(18N) storage.

Query

Since you now have an array where for each character you have all possible characters that work with it, you simply look up each character in the query, and you intersect their sets (again, with bitwise-AND). A further intersection is required by using the set of all characters present in the query. That prunes off all characters that are not relevant.

This leaves you with a value which, for all values in the query, represents the set of all values that can result in a matching string. So if this value is then equal to the query then you have a match. Otherwise, you don't.

This is the part that is now fast. It has essentially reduced your query tests to constant time. However, the original indexing did cost us a lot of memory. What can be done about that?

Dynamic Programming

Is it really necessary to allocate all that storage for our frequency arrays?

Well actually, no. It was useful as a visual tool by laying out our counts in tabular form, and to explain the method conceptually but actually a lot of those values are not very useful most of the time and it sure made the method seem a bit complicated.

The good news is we can compute our master sets at the same time as computing character counts, without needing to store any frequency arrays. Remember that when we're computing the counts we use a histogram which is as simple as having one small 18-value array array where you say count[c] = 1 (if c is a character or an index derived from that character). So we could save a ton of memory if we just do the following:

Processing the set (mask) of all compatible characters for character c:

Initialize the mask for character c to all 1s (mask[c] = (1 << 18) - 1) -- this represents all characters are currently compatible. Initialize a character histogram (count) to all zero.
Walk through string 1 until you reach character c. For every character x you encounter along the way, increase its count in the histogram (count[x] ).
Walk through string 2 until you reach character c. For every character x you encounter along the way, decrease its count in the histogram (count[x]--).
Construct a 'good' set where any character that currently has a zero-count has a 1-bit, otherwise 0-bit. Intersect this with the current mask for c (using bitwise-AND): mask[c] &= good
Continue from step 2 until you have reached the end of both strings. If you reach the end of one of the strings prematurely, then the character count does not match and so you set the mask for this character to zero: mask[c] = 0
對每個字符從 1 開始重復，直到完成所有字符。

上面，我們基本上具有相同的 O(18N) 時間復雜度，除了我們有絕對最小的額外存盤——每個字符只有一個計數陣列和一個掩碼陣列。

結合上述技術來真正快速地解決看似復雜的組合問題通常被稱為動態編程。我們將問題簡化為一個真值表，該表表示與任何單個字符一起使用的所有可能字符。時間復雜度相對于字串的長度保持線性，并且僅根據可能的字符數進行縮放。

這是上面用 C 破解的演算法：https ://godbolt.org/z/PxzYvGs8q

uj5u.com熱心網友回復：

令 RESTRICT(s,q) 為字串 s 對集合 q 中字母的限制。

如果 q 包含兩個以上的字母，則可以從所有字串 RESTRICT(s,q _ij ) 重構完整的字串 RESTRICT(s,q)，其中 q _ij是q 中的一對字母。

因此，RESTRICT(s1,q) = RESTRICT(s2,q) 當且僅當 RESTRICT(s1,q _ij ) = RESTRICT(s2,q _ij ) 對于 q 中的所有對 q _ij。

由于您被限制為 18 個字母，因此只有 153 個字母對，或者只有 171 個基本的單字母或雙字母查詢。

如果您預先計算這 171 個查詢的結果，那么您可以通過組合它們的結果來回答任何其他更復雜的查詢，而根本不需要檢查字串。

轉載請註明出處，本文鏈接：https://www.uj5u.com/qukuanlian/451712.html

標籤：算法数学

上一篇：熊貓分組和日志添加

下一篇：如何通過重復最后一位來增加整數？