How do I accurately count the number of sentences in a file?
My file contains a passage of text. It has 7 sentences, but my code reports 9.
String path = "C:/CT_AQA - Copy/src/main/resources/file.txt";
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(path)));
String line;
int countWord = 0;
int sentenceCount = 0;
int characterCount = 0;
int paragraphCount = 0;
int countNotLetter = 0;
int letterCount = 0;
int wordInParagraph = 0;
List<Integer> wordsPerParagraph = new ArrayList<>();
while ((line = br.readLine()) != null) {
    if (line.equals("")) {
        // An empty line ends the current paragraph.
        paragraphCount++;
        wordsPerParagraph.add(wordInParagraph);
        System.out.printf("In %d paragraph there are %d words\n", paragraphCount, wordInParagraph);
        wordInParagraph = 0;
    } else {
        characterCount += line.length();
        String[] wordList = line.split("[\\s—]");
        countWord += wordList.length;
        wordInParagraph += wordList.length;
        String[] letterList = line.split("[^a-zA-Z]");
        countNotLetter += letterList.length;
        // Sentences are split on every period or colon.
        String[] sentenceList = line.split("[.:]");
        sentenceCount += sentenceList.length;
    }
    letterCount = characterCount - countNotLetter;
}
if (wordInParagraph != 0) {
    wordsPerParagraph.add(wordInParagraph);
}
br.close();
System.out.println("The amount of words are " + countWord);
System.out.println("The amount of sentences are " + sentenceCount);
System.out.println("The amount of paragraphs are " + paragraphCount);
System.out.println("The amount of letters are " + letterCount);
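A uj5u.com user replied:
Your code looks like it works, although it does not follow best practices anywhere.
I suspect the root cause of the wrong answer is that the regular expression used to find sentence endings is inaccurate. Your code counts sentences that end with a period or a colon. The problem is in this line: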
String[] sentenceList = line.split("[.:]");
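A colon does not end a sentence, however, and sentences can also end with other characters (exclamation marks, question marks, ellipses). In my estimation this pattern is more accurate:
"[!?.]+(?=$|\\s)"
Please also show the contents of the file that produces the wrong result, so that my guess can be confirmed or ruled out.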
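To make the difference between the two patterns concrete, here is a minimal sketch that applies both to one sample line. The class name `SplitPatternDemo`, the helper `countPieces`, and the sample text are made up for illustration; they are not from the question's file.

```java
import java.util.regex.Pattern;

class SplitPatternDemo {
    // Pattern from the question: splits on every period or colon.
    static final Pattern QUESTION_PATTERN = Pattern.compile("[.:]");
    // Pattern suggested in this answer: one or more of ! ? . followed by
    // whitespace or the end of the input.
    static final Pattern SUGGESTED_PATTERN = Pattern.compile("[!?.]+(?=$|\\s)");

    // Hypothetical helper: number of pieces the pattern splits the line into.
    // Pattern.split drops trailing empty strings, like String.split.
    static int countPieces(Pattern pattern, String line) {
        return pattern.split(line).length;
    }

    public static void main(String[] args) {
        // Illustrative sample line containing three sentences.
        String line = "Note: this costs 3.5 dollars. Really? Yes!";
        System.out.println(countPieces(QUESTION_PATTERN, line));  // prints 4
        System.out.println(countPieces(SUGGESTED_PATTERN, line)); // prints 3
    }
}
```

On this sample, the colon and the decimal point each add a spurious split under `[.:]`, while `?` and `!` are not treated as sentence endings at all. Whether that explains the count of 9 depends on the actual file contents.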
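Complete code that counts only the sentences in the file:
// br is the BufferedReader opened as in the question.
String line;
int sentenceCount = 0;
while ((line = br.readLine()) != null) {
    if (!"".equals(line)) {
        // Split on runs of ! ? . that are followed by whitespace or the end of the line.
        String[] sentencesArray = line.split("[!?.]+(?=$|\\s)");
        sentenceCount += sentencesArray.length;
    }
}
br.close();
System.out.println("The amount of sentences are " + sentenceCount);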
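For readers who want something they can compile as-is, the following is one possible self-contained version of the loop above. It is a sketch under assumptions: the class `SentenceCounter` and its methods are hypothetical names, the file is assumed to be UTF-8, and the path is taken from the command line rather than hard-coded. It uses try-with-resources so the reader is closed even if reading fails.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Pattern;

class SentenceCounter {
    // Same pattern as in the answer, compiled once.
    private static final Pattern SENTENCE_END = Pattern.compile("[!?.]+(?=$|\\s)");

    // Counts the sentence pieces in a single line; blank lines count as zero.
    static int countInLine(String line) {
        if (line.trim().isEmpty()) {
            return 0;
        }
        return SENTENCE_END.split(line).length;
    }

    // Sums the per-line counts for everything the reader provides.
    static int countSentences(BufferedReader reader) throws IOException {
        int count = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            count += countInLine(line);
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java SentenceCounter <path-to-text-file>");
            return;
        }
        // Assumption: the file is encoded as UTF-8.
        try (BufferedReader reader =
                 Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            System.out.println("The amount of sentences are " + countSentences(reader));
        }
    }
}
```

Like the original loop, this counts line by line, so a sentence that wraps across two lines would be counted twice, and abbreviations such as "e.g." followed by a space still look like sentence endings.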
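A uj5u.com user replied:
You may be picking up trailing whitespace in your sentences, which adds extra values to the array. You can remove the whitespace from line with replaceAll("\\s+", "") and then split it into sentences.
The changed code looks like this:
String[] sentenceList = line.replaceAll("\\s+","").split("[.:]");
I have not changed how you define a sentence, but ! and ? would obviously be good sentence delimiters as well.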
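A small sketch of the effect described in this answer, assuming made-up sample lines and hypothetical helper names (`countRaw`, `countStripped`, `countStrippedExtended`). The last helper adds `!` and `?` to the delimiter set, which is an extension for illustration and not the answer's exact code.

```java
class TrailingWhitespaceDemo {
    // Split exactly as in the question.
    static int countRaw(String line) {
        return line.split("[.:]").length;
    }

    // Remove all whitespace first, as this answer suggests.
    static int countStripped(String line) {
        return line.replaceAll("\\s+", "").split("[.:]").length;
    }

    // Hypothetical variant: also treat ! and ? as delimiters, and drop the colon.
    static int countStrippedExtended(String line) {
        return line.replaceAll("\\s+", "").split("[.!?]+").length;
    }

    public static void main(String[] args) {
        // Two sentences followed by a trailing space.
        String line = "First. Second. ";
        System.out.println(countRaw(line));      // prints 3: the trailing " " is an extra piece
        System.out.println(countStripped(line)); // prints 2

        // Three sentences with mixed terminators and a trailing space.
        String mixed = "Wait! Is it done? Yes. ";
        System.out.println(countStrippedExtended(mixed)); // prints 3
    }
}
```

Stripping every whitespace character also merges the words together, which is harmless when only the count is needed. It does mean a pattern that relies on whitespace after the terminator, such as a `(?=$|\\s)` lookahead, cannot be combined with this approach.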
