查找字串陳述句組合-頻率表到目標頻率表的組合-有解無憂

該賞金過期7天。此問題的答案有資格獲得 100聲望獎勵。 BigChief正在尋找這個問題的更詳細的答案：

使用例如當前答案中陳述的可能性找到最佳解決方案，最佳解決方案將具有句子/頻率表的輸入，以及最接近目標頻率表的所選句子串列的輸出，Python 中的當前答案/CPP 應該改進以獲得最佳解決方案，它們很接近但尚不完美

我有一個句子串列，例如 1000 個句子的串列。

我想找到匹配/“匹配最接近”某個頻率表的句子組合：

[a:100, b:80, c:90, d:150, e:100, f:100, g:47, h:10 ..... z:900]

我想通過使用這里的組合從句子串列中找到所有可能的組合 (so comb(1000, 1); to comb(100, 1000); ) 然后將每個組合與頻率表進行比較，使距離最小. 因此，從一個可能的組合中對所有頻率表求和，并將該總和與目標進行比較，應記錄與目標差異最小的組合。可能有多個匹配最接近的組合。

問題是所有組合的計算需要很長時間才能完成，顯然需要幾天時間。有沒有一種已知的演算法可以有效地解決這個問題？理想情況下最多幾分鐘？

輸入陳述句：

在倉庫中看到的房車比在露營地多。

她盡力幫助他。曾經有過想要與身體分離的日子，但今天不是那些日子之一。

旋轉的棒棒糖與流行的搖滾糖果有問題。

兩人無視遠處的雷聲，順著狹縫峽谷而下。

數英畝的杏樹在州際公路兩旁排列，贊美了瘋狂的駕駛者。

他不是詹姆斯邦德；他叫羅杰·摩爾。

風滾草拒絕翻滾，但更??愿意騰躍。

她很反感他分不清檸檬水和酸橙水的區別。

他不想去看牙醫，但他還是去了。

找出最接近以下頻率表的句子組合：

[a:5, b:5, c:5, d:5, e:5, f:5, g:5, h:5 ..... z:5]

例子：

第六句頻數表

他不是詹姆斯邦德；他叫羅杰·摩爾。

是 [a:2, e:5, g:1, h:1, i:3, j:1, m:3, n:3, o:5, r:3, s:4]

頻數表取上下相等，不包括特殊字符。

uj5u.com熱心網友回復：

這可以簡化為與目標問題具有最小絕對差異的子序列和。

問題如下：你有一個A包含整數值的陣列，比如[1,5,3,2,6]，還有一個整數值T，目標。你想找到的子序列A'從元素A，這樣的abs(target - sum(A'))最小化。

在您的情況下，的各個整數值A是二維的，其中包含每個句子的字符頻率表，而目標也是二維的，因為它包含字符數。您想最小化絕對差的總和。

這顯然是一個動態規劃問題。如果沒有優化，時間復雜度將是指數級的，我們需要檢查2^n可能性（對于每個元素，我們有 2 種可能性：我們要么接受它，要么離開它）。我認為這就是您通過創建所有組合在問題中提到的內容。

但是通過優化，我們可以實作n * T其中n的元素數量A和T目標值。這當然是如果我們只想要最接近的數字本身，而不是與該數字相加的元素。

要獲得導致最佳解決方案的子序列本身的元素，您有兩個選擇：

回溯，具有前面解釋的指數時間復雜度。
具有路徑重建的 DP，其中時間復雜度保持可控，如上所述。

這些問題和演算法是眾所周知的，我認為它們不需要解釋。

你的具體問題如何映射到這個問題，據我所知，也很明顯。當然，您希望如何實作它有一些復雜性。但是，如果您的問題與上述子序列求和問題之間的關系不清楚，請告訴我，以便我進一步解釋。

以下是我發現的一些鏈接，可以幫助您解決此問題。請注意，它們不是一個直接的答案，因為這個問題相對復雜。

LeetCode 上的最近子序列求和問題。這處理您只尋找最接近的總和而不是導致該總和的路徑的情況。討論頁面充滿了不同的想法和詳細的解釋（按多數票排序）。
DP 和路徑重建：這是關于 DP 的系列的一部分。
DP 入門
重構最優解的路徑

uj5u.com熱心網友回復：

貪心演算法

您測驗所有可能的句子組合的第一個想法太慢了。如果您有n句子，則有2**n（2 的 n 次方）可能的句子組合。例如，當 n=1000 時，有2**1000 ≈ 10**300可能的組合。那是一個 1 后跟 300 個零：比宇宙中粒子的數量還要多，而且比國際象棋可能的不同游戲的數量還要多！

這是一個貪婪演算法的建議。它沒有特別優化，它的運行時間是O(k * n**2)，其中n是句子的數量，k是最長句子的長度。

想法如下：

給每個句子打分number of useful characters - number of superfluous characters。例如，如果一個句子包含 20 個'a'，而目標只需要 15 個'a'，我們將計算 15 個有用的'a'和 5 個多余的'a'，因此字符'a'對該句子的得分貢獻了 10。
將得分最高的句子添加到結果中；
更新目標以去除結果中已經存在的字符；
更新每個句子的分數以反映更新后的目標。
回圈直到沒有句子的得分為正。

我懶得用 C 實作它，所以這里是在 python 中，使用一個最大堆和一個計數器。在代碼之后我寫了一個快速解釋來幫助你把它翻譯成 C 。

from collections import Counter
import heapq

sentences = ['More RVs were seen in the storage lot than at the campground.', 'She did her best to help him.', 'There have been days when I wished to be separated from my body, but today wasn’t one of those days.', 'The swirled lollipop had issues with the pop rock candy.', 'The two walked down the slot canyon oblivious to the sound of thunder in the distance.', 'Acres of almond trees lined the interstate highway which complimented the crazy driving nuts.', 'He is no James Bond; his name is Roger Moore.', 'The tumbleweed refused to tumble but was more than willing to prance.', 'She was disgusted he couldn’t tell the difference between lemonade and limeade.', 'He didn’t want to go to the dentist, yet he went anyway.']

target = Counter('abcdefghijklmnopqrstuvwxyz' * 10)
Counter({'a': 10, 'b': 10, 'c': 10, 'd': 10, 'e': 10, 'f': 10, 'g': 10, 'h': 10, 'i': 10, 'j': 10, 'k': 10, 'l': 10, 'm': 10, 'n': 10, 'o': 10, 'p': 10, 'q': 10, 'r': 10, 's': 10, 't': 10, 'u': 10, 'v': 10, 'w': 10, 'x': 10, 'y': 10, 'z': 10})

print(target)

counts = [Counter(''.join(filter(str.isalpha, s)).lower()) for s in sentences]  # remove punctuation, spaces, uncapitalize, then count frequencies

def get_score(sentence_count, target):
    return sum((sentence_count & target).values()) - sum((sentence_count - target).values())

candidates = []
for sentence, count in zip(sentences, counts):
    score = get_score(count, target)
    candidates.append((-score, sentence, count))

heapq.heapify(candidates)    # order candidates by score
                             # python's heapq only handles min-heap
                             # but we need a max-heap
                             # so I added a minus sign in front of every score

selection = []
while candidates and candidates[0][0] < 0:  # while there is a candidate with positive score
    score, sentence, count = heapq.heappop(candidates)  # greedily selecting best candidate
    selection.append(sentence)
    target = target - count                             # update target by removing characters already accounted for
    candidates = [(-get_score(c,target), s, c) for _,s,c in candidates]  # update scores of remaining candidates
    heapq.heapify(candidates)                       # reorder candidates according to new scores

# HERE ARE THE SELECTED SENTENCES:
print(selection)
# ['Acres of almond trees lined the interstate highway which complimented the crazy driving nuts.', 'There have been days when I wished to be separated from my body, but today wasn’t one of those days.']

# HERE ARE THE TOTAL FREQUENCIES FOR THE SELECTED SENTENCES:
final_frequencies = Counter(filter(str.isalpha, ''.join(selection).lower()))
print(final_frequencies)
# Counter({'e': 22, 't': 15, 'a': 12, 'h': 11, 's': 10, 'o': 10, 'n': 10, 'd': 10, 'i': 9, 'r': 8, 'y': 7, 'm': 5, 'w': 5, 'c': 4, 'b': 4, 'f': 3, 'l': 3, 'g': 2, 'p': 2, 'v': 2, 'u': 2, 'z': 1})

# CHARACTERS IN EXCESS:
target = Counter('abcdefghijklmnopqrstuvwxyz' * 10)
print(final_frequencies - target)
# Counter({'e': 12, 't': 5, 'a': 2, 'h': 1})

# CHARACTERS IN DEFICIT:
print(target - final_frequencies)
# Counter({'j': 10, 'k': 10, 'q': 10, 'x': 10, 'z': 9, 'g': 8, 'p': 8, 'u': 8, 'v': 8, 'f': 7, 'l': 7, 'b': 6, 'c': 6, 'm': 5, 'w': 5, 'y': 3, 'r': 2, 'i': 1})

說明：

PythonCounter( )將句子轉換為映射character -> frequency；
對于兩個 Counters aand b，a & b是多集交集，a - b是多集差集；
對于 Counter a，sum(a.values())是總計數（所有頻率的總和）；
heapq.heapify將串列轉換為最小堆，這是一種允許以最低分數輕松訪問元素的資料結構。我們實際上想要分數最高的句子，而不是最低分，所以我用負數替換了所有分數。

貪心演算法的非最優性

我應該提到這個貪心演算法是一個近似演算法。在每次迭代中，它選擇得分最高的句子；但不能保證最佳解決方案實際上包含該句子。

很容易構建一個貪心演算法無法找到最優解的例子：

target = Counter('abcdefghijklmnopqrstuvwxyz')
print(target)
# Counter({'a': 1, 'b': 1, 'c': 1, 'd': 1, 'e': 1, 'f': 1, 'g': 1, 'h': 1, 'i': 1, 'j': 1, 'k': 1, 'l': 1, 'm': 1, 'n': 1, 'o': 1, 'p': 1, 'q': 1, 'r': 1, 's': 1, 't': 1, 'u': 1, 'v': 1, 'w': 1, 'x': 1, 'y': 1, 'z': 1})

sentences = [
    'The quick brown fox jumps over the lazy dog.',
    'abcdefghijklm',
    'nopqrstuvwxyz'
]

With this target, the scores are as follow:

[
    (17, 'The quick brown fox jumps over the lazy dog.'),
    (13, 'abcdefghijklm'),
    (13, 'nopqrstuvwxyz')
]

The two "half-alphabets" have a score of 13 each, because they contain 13 letters of the alphabet. The sentence "The quick brown fox..." has a score of 17 = 26 - 9, because it contains the 26 letters of the alphabets, plus 9 excess letters (for instance, there are 3 excess 'o' and 2 excess 'e').

The optimal solution, obviously, is to cover the target perfectly with the two halves of the alphabet. But our greedy algorithm will select the "quick brown fox" sentence first, because it has a higher score.

uj5u.com熱心網友回復：

typedef struct
{
    wstring text{ L"" };            
    vector<int> encoded_text;
    int counter[26] // frequency table
    {
        0,0,0,0,0,
        0,0,0,0,0,
        0,0,0,0,0,
        0,0,0,0,0,
        0,0,0,0,0,
        0,
    };

    int score = INT_MIN;

} Sentence;  

 
int m_target[26]
{
    10,10,10,10,10,
    10,10,10,10,10,
    10,10,10,10,10,
    10,10,10,10,10,
    10,10,10,10,10,
    10
};

bool orderByScore(const Sentence &a, const Sentence &b)
{
    return b.score < a.score;
}

int SentencesCounter::GetScore(Sentence sentence, int* target)
{
    int sum1 = 0;
    int sum2 = 0;

    for (size_t i = 0; i < 26; i  )
    {
        int sentenceFreq = sentence.counter[i];
        int targetFreq = target[i];

        sum1  = min(sentenceFreq, targetFreq);
        sum2  = max(0, sentenceFreq - targetFreq);
    }

    return sum1 - sum2;
}

vector<Sentence> SentencesCounter::SolveSO(vector<Sentence> &sentences)
{
    vector<Sentence> candidates{ sentences };

    for (size_t i = 0; i < candidates.size(); i  )
    {
        candidates[i].score = GetScore(candidates[i], m_target);
    }

    sort(candidates.begin(), candidates.end(), orderByScore);

    int target[26];
    memcpy(target, m_target, 26 * sizeof(int));

    vector<Sentence> selection;
    while (candidates.front().score > 0) // while there is a candidate with positive score
    {
        Sentence s = candidates.front();
        if(s.encoded_text.size() > 0) selection.push_back(s);
        candidates.front().score = INT_MIN;

        for (size_t i = 0; i < 26; i  ) { target[i] -= s.counter[i]; } // update target

        size_t i;
        for (i = 0; i < candidates.size(); i  )
        {
            if (candidates[i].score > INT_MIN) // int min means already added to selection
                candidates[i].score = GetScore(candidates[i], target);
            else if (i != 0) break; // int min found at other index than top
        }

        partial_sort(candidates.begin(), candidates.begin()   i, candidates.end(), orderByScore);
    }
    return selection
}

嘗試在偽 CPP 中從 Stef 復制 Python 代碼

轉載請註明出處，本文鏈接：https://www.uj5u.com/gongcheng/374292.html

標籤：C 细绳算法数据结构计算机科学

上一篇：2個串列的最大路徑和

下一篇：使用Python的匈牙利演算法約束