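I have two different text files and I have to find the 10 longest words that are in both of them. I have to print the list of those words and write the frequency: how many times they are repeated in each of the separate files. The problem with my current code is that it finds the words, but when it comes to frequency, it combines the counts from both files. How can I change the code so that it reports the frequency count for each file separately?
This is the code I use to find the words in both text files: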
public static Dictionary<string, int> PopularWords(string data1, string data2, char[] punctuation)
{
    string[] book1 = data1.Split(punctuation, StringSplitOptions.RemoveEmptyEntries);
    string[] book2 = data2.Split(punctuation, StringSplitOptions.RemoveEmptyEntries);
    // A single counter per word, so occurrences from both books are added together.
    Dictionary<string, int> matches = new Dictionary<string, int>();
    for (int i = 0; i < book1.Length; i++)
    {
        if (matches.ContainsKey(book1[i]))
        {
            matches[book1[i]]++;
            continue;
        }
        for (int j = 0; j < book2.Length; j++)
        {
            if (book1[i] == book2[j])
            {
                if (matches.ContainsKey(book1[i]))
                {
                    matches[book1[i]]++;
                }
                else
                {
                    matches.Add(book1[i], 2);
                }
            }
        }
    }
    return matches;
}
Here is my code for reading and printing:
public static void ProcessPopular(string data, string data1, string results)
{
    char[] punctuation = { ' ', '.', ',', '!', '?', ':', ';', '(', ')', '\n' };
    string lines = File.ReadAllText(data, Encoding.UTF8);
    string lines2 = File.ReadAllText(data1, Encoding.UTF8);
    var popular = PopularWords(lines, lines2, punctuation);
    KeyValuePair<string, int>[] popularWords = popular.ToArray();
    Array.Sort(popularWords, (x, y) => y.Key.Length.CompareTo(x.Key.Length));
    using (var writerF = File.CreateText(results))
    {
        int foundWords = 0;
        writerF.WriteLine("{0, -25} | {1, -35} | {2, -35}", "Longest words", "Frequency in 1 .txt file", "Frequency in 2 .txt file");
        writerF.WriteLine(new string('-', 101));
        // not finished
    }
}
uj5u.com user reply:
Here is my take:
public static Dictionary<string, Dictionary<string, int>> PopularWords(string data1, string data2, char[] punctuation)
{
    string[] book1 = data1.Split(punctuation, StringSplitOptions.RemoveEmptyEntries);
    string[] book2 = data2.Split(punctuation, StringSplitOptions.RemoveEmptyEntries);
    return
        Enumerable
            .Concat(
                book1.Select(x => (word: x, book: "book1")),
                book2.Select(x => (word: x, book: "book2")))
            .ToLookup(x => x.word, x => x.book)
            .OrderByDescending(x => x.Key.Length)
            .Take(10)
            .ToDictionary(x => x.Key, x => x.GroupBy(y => y).ToDictionary(y => y.Key, y => y.Count()));
}
If I start with this data:
char[] punctuation = new char[] { ' ', ',', '.', '?', '-', ':' };
string data1 = "I have two different text files and I have to find 10 longest words that are in both of them. I have to print the list of those words out and write the frequency - how many times they are repeated in those separate files. The problem I have with my current code is that it finds the words, but when it comes to frequency - it combines the frequency count. How can I change the code to know the frequency count for separate files?";
string data2 = "This solution is more general: it works whatever number of files you wish to process. This is an extremely raw query that could be separated in smaller queries, but it gives the logical basis. Other requirements, like only 10 words or minimum word length etc can be easily applied. Please do mind that this a bare-bone example, without any safety checks. It also omits reading data from files. The problem I have with my current code is that it finds the words, but when it comes to frequency - it combines the frequency count. How can I change the code to know the frequency count for separate files?";
I get this result:
"requirements": { "book2" = 1 }
"different": { "book1" = 1 }
"frequency": { "book1" = 4, "book2" = 3 }
"extremely": { "book2" = 1 }
"separated": { "book2" = 1 }
"repeated": { "book1" = 1 }
"separate": { "book1" = 2, "book2" = 1 }
"combines": { "book1" = 1, "book2" = 1 }
"solution": { "book2" = 1 }
"whatever": { "book2" = 1 }
uj5u.com user reply:
To keep things simple, and if performance is not critical here, I would do it like this:
public static void Method()
{
    var a = "A deep blue raffle, very deep and blue, raffle raffle. An old one was there";
    var b = "deep blue raffle, very very very long and blue, raffle RAFFLE. A new one was there";
    char[] punctuation = { '.', ',', '!', '?', ':', ';', '(', ')', '\n' };
    // Strip the punctuation characters, then split on single spaces.
    var fileOne = new string(a.Where(c => punctuation.Contains(c) is false).ToArray()).Split(" ");
    var fileTwo = new string(b.Where(c => punctuation.Contains(c) is false).ToArray()).Split(" ");
    // Words that occur in both inputs, compared case-insensitively.
    var duplicates = fileOne.Intersect(fileTwo, StringComparer.OrdinalIgnoreCase);
    var result = new List<(int, int, string)>(duplicates.Count());
    foreach (var duplicat in duplicates)
    {
        result.Add((fileOne.Count(x => x.Equals(duplicat, StringComparison.OrdinalIgnoreCase)), fileTwo.Count(x => x.Equals(duplicat, StringComparison.OrdinalIgnoreCase)), duplicat));
    }
    foreach (var val in result)
    {
        Console.WriteLine($"Word: {val.Item3} | In file one: {val.Item1} | In file two: {val.Item2}");
    }
}
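This gives you the result:
Word: A | In file one: 1 | In file two: 1
Word: deep | In file one: 2 | In file two: 1
Word: blue | In file one: 2 | In file two: 2
Word: raffle | In file one: 3 | In file two: 3
Word: very | In file one: 1 | In file two: 3
Word: and | In file one: 1 | In file two: 1
Word: one | In file one: 1 | In file two: 1
Word: was | In file one: 1 | In file two: 1
Word: there | In file one: 1 | In file two: 1
Other requirements, like only 10 words or a minimum word length, can be easily applied.
Please note that this is a bare-bones example without any safety checks. It also omits reading the data from files.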
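As one way of applying those extra requirements, the sketch below counts every word once per text with a dictionary, which avoids rescanning both arrays for each common word, and then keeps only the longest common words. The names `CommonWordCounter`, `LongestCommonWords`, `maxWords` and `minLength` are assumptions made for this illustration and are not part of the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CommonWordCounter
{
    // Hypothetical helper: returns the longest words that occur in BOTH texts,
    // with a separate occurrence count for each text.
    public static List<(string Word, int InFileOne, int InFileTwo)> LongestCommonWords(
        string textOne, string textTwo, char[] separators, int maxWords = 10, int minLength = 1)
    {
        // Count each word once per text, ignoring case.
        Dictionary<string, int> CountWords(string text) =>
            text.Split(separators, StringSplitOptions.RemoveEmptyEntries)
                .GroupBy(w => w, StringComparer.OrdinalIgnoreCase)
                .ToDictionary(g => g.Key, g => g.Count(), StringComparer.OrdinalIgnoreCase);

        Dictionary<string, int> countsOne = CountWords(textOne);
        Dictionary<string, int> countsTwo = CountWords(textTwo);

        return countsOne
            .Where(kv => kv.Key.Length >= minLength)        // minimum word length
            .Where(kv => countsTwo.ContainsKey(kv.Key))     // word must occur in both texts
            .OrderByDescending(kv => kv.Key.Length)         // longest words first
            .Take(maxWords)                                 // e.g. only 10 words
            .Select(kv => (Word: kv.Key, InFileOne: kv.Value, InFileTwo: countsTwo[kv.Key]))
            .ToList();
    }
}
```

Each returned tuple can be printed with the same `Console.WriteLine` format as above. Words of equal length come out in whatever order the dictionary enumerates them, so ties are not ordered in any meaningful way.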
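uj5u.com user reply:
Edit: I was not very happy with my original solution, so I reworked it. In doing so I gave up one thing I liked about the previous solution: it did not rely on an external list of punctuation, because the list was generated by the query itself. That, however, made the query more complex and verbose.
If you are curious about a different coding style, here is a solution using LINQ.
This solution is more general: it works for whatever number of files you wish to process.
This is a LINQPad query that you can run directly by copy/paste, but you of course need to supply the text files: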
// Choose here how many different words you want.
var resultCount = 10;
// Add as many files as needed.
var Files = new List<string>
{
    @"C:\Temp\FileA.txt",
    @"C:\Temp\FileB.txt",
    @"C:\Temp\FileC.txt",
};
char[] punctuation = { '.', ',', '!', '?', ':', ';', '(', ')', '\n', '"', ' ' };
// Perform the calculation.
var LongestCommonWords = Files
    .SelectMany(f => File.ReadAllText(f)
        .Split(punctuation, StringSplitOptions.TrimEntries)
        .ToLookup(w => (word: w.ToLower(), fileName: f))
    )
    .ToLookup(e => e.Key.word)
    .Where(g => g.Count() == Files.Count())
    .OrderByDescending(g => g.Key.Length)
    .Take(resultCount); // Take only the desired amount (10 for instance)
// Display the results.
foreach (var word in LongestCommonWords)
{
    var occurences = string.Join(" / ", word.Select(g => $"{Path.GetFileName(g.Key.fileName)} - {g.Count()}"));
    Console.WriteLine($"{word.Key} - {occurences}");
}
Here is the output obtained using the contents of three Wikipedia pages:
貢獻 - FileA.txt - 9 / FileB.txt - 1 / FileC.txt - 5
隨后 - FileA.txt - 2 / FileB.txt - 1 / FileC.txt - 1
介紹 - FileA.txt - 1 / FileB.txt - 4 / FileC.txt - 3
替代 - FileA.txt - 2 / FileB.txt - 1 / FileC.txt - 1
獨立 - FileA.txt - 5 / FileB.txt - 3 / FileC.txt - 3
重要 - FileA.txt - 2 / FileB.txt - 1 / FileC.txt - 3
建立 - FileA.txt - 1 / FileB.txt - 1 / FileC.txt - 1
未完成 - FileA.txt - 1 / FileB.txt - 3 / FileC.txt - 3
編程 - FileA.txt - 1 / FileB.txt - 2 / FileC.txt - 4
大學 - FileA.txt - 44 / FileB.txt - 17 / FileC.txt - 7
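The edit note at the top of this answer mentions an earlier version in which the list of separators was generated from the text itself. That earlier query is not shown, so the following is only a guess at how such an approach could look: it assumes that every character that is not a letter counts as a separator, which also splits on digits, apostrophes and hyphens. `SeparatorFinder`, `FindSeparators` and `SplitWords` are hypothetical names.

```csharp
using System;
using System.Linq;

public static class SeparatorFinder
{
    // Assumption: every character that is not a letter is a separator.
    public static char[] FindSeparators(string text) =>
        text.Where(c => !char.IsLetter(c)).Distinct().ToArray();

    // Splits the text on the separators discovered in that same text.
    public static string[] SplitWords(string text) =>
        text.Split(FindSeparators(text), StringSplitOptions.RemoveEmptyEntries);
}
```

The array returned by `FindSeparators` could stand in for the hard-coded `punctuation` array in the query above, at the cost of an extra pass over each file.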
