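I am using Swift 5 and want to split academic text into sentences.
I came across the NaturalLanguage framework, and it works for most text. However, I noticed that it does not handle certain patterns well, such as "et al.", page numbers ("p. "), and back-to-back parenthetical groups of the form "(...) (...)".
Here is reproducible code: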
import NaturalLanguage
var sentences: [String] = []
var str = "The information was reported by Brown et al. (2000). This should not have been the case (Brown et al., 2001, p. 10). Several other studies corroborate this (i.e., I don't know but something important, etc.) (Brown et al., 2002; Brown et al., 2003). But this is weird given the results of White et al. (2001)."
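// Enumerate the text sentence by sentence; the first closure argument is the sentence text.
str.enumerateSubstrings(in: str.startIndex..., options: [.localized, .bySentences]) { (substring, _, _, _) in
    sentences.append(substring ?? "")
}
sentences.forEach {
    print($0)
}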
print(sentences)
Expected result:
The information was reported by Brown et al. (2000).
This should not have been the case (Brown et al., 2001, p. 10).
Several other studies corroborate this (i.e., I don't know but something important, etc.) (Brown et al., 2002; Brown et al., 2003).
But this is weird given the results of White et al. (2001).
Actual result:
The information was reported by Brown et al.
(2000).
This should not have been the case (Brown et al., 2001, p.
10).
Several other studies corroborate this (i.e., I don't know but something important, etc.)
(Brown et al., 2002; Brown et al., 2003).
But this is weird given the results of White et al.
(2001).
["The information was reported by Brown et al. ", "(2000). ", "This should not have been the case (Brown et al., 2001, p. ", "10). ", "Several other studies corroborate this (i.e., I don\'t know but something important, etc.) ", "(Brown et al., 2002; Brown et al., 2003). ", "But this is weird given the results of White et al. ", "(2001)."]
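How should I deal with this? Is there a way to handle these cases manually? Or is there a similar but better package I could use?
Answer:
One possible solution is to escape these words. I have not performance-tested this, and there is probably room for improvement.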
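// Map each problematic abbreviation to a placeholder that contains no period.
let escapingElements = [" p.": " $p$", " et al.": " $et al$", " etc.": " $etc$"]
// Escape the abbreviations so their periods are not treated as sentence ends.
escapingElements.forEach { original, escaped in
    str = str.replacingOccurrences(of: original, with: escaped)
}
// Start from an empty array in case `sentences` was already filled.
sentences.removeAll()
str.enumerateSubstrings(in: str.startIndex..., options: [.localized, .bySentences]) { (substring, _, _, _) in
    sentences.append(substring ?? "")
}
// Restore the original abbreviations in every sentence.
escapingElements.forEach { original, escaped in
    sentences = sentences.map {
        $0.replacingOccurrences(of: escaped, with: original)
    }
}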
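The idea is to replace each problematic element with a unique value that does not occur in the text, then split the text into sentences as before, and finally go through each sentence and replace the placeholders with the original values. (Instead of your own escape sequences, you could use UUIDs.)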
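The UUID suggestion could be sketched roughly as follows. This is only an illustration, not the answerer's code: the helper names `mask`, `unmask`, and `splitSentences`, and the "ID" + UUID placeholder format, are made up for this example. It assumes that a placeholder made of letters and digits is not treated as a sentence boundary, which I have not verified on every input.

```swift
import Foundation

/// Replaces each protected token with a unique, period-free placeholder.
/// Returns the masked text plus the placeholder-to-token table needed to undo it.
/// (Hypothetical helper, not part of Foundation or NaturalLanguage.)
func mask(_ text: String, protecting tokens: [String]) -> (masked: String, table: [String: String]) {
    var masked = text
    var table: [String: String] = [:]
    // Handle longer tokens first so a short token never clobbers part of a longer one.
    for token in tokens.sorted(by: { $0.count > $1.count }) {
        // Keep the token's leading spaces so neighbouring words stay separated.
        let leading = String(token.prefix(while: { $0 == " " }))
        // Assumed placeholder format: "ID" plus a UUID with its hyphens removed.
        let id = UUID().uuidString.replacingOccurrences(of: "-", with: "")
        let placeholder = leading + "ID" + id
        table[placeholder] = token
        masked = masked.replacingOccurrences(of: token, with: placeholder)
    }
    return (masked, table)
}

/// Puts the original tokens back in place of their placeholders.
func unmask(_ text: String, using table: [String: String]) -> String {
    var restored = text
    for (placeholder, token) in table {
        restored = restored.replacingOccurrences(of: placeholder, with: token)
    }
    return restored
}

/// Masks the tokens, splits with the same enumeration call used in the question,
/// then restores the tokens inside each sentence.
func splitSentences(_ text: String, protecting tokens: [String]) -> [String] {
    let (masked, table) = mask(text, protecting: tokens)
    var sentences: [String] = []
    masked.enumerateSubstrings(in: masked.startIndex..., options: [.localized, .bySentences]) { substring, _, _, _ in
        if let substring = substring {
            sentences.append(unmask(substring, using: table))
        }
    }
    return sentences
}
```

With the sample text from the question, this would be called as `splitSentences(str, protecting: [" p.", " et al.", " etc."])`. The leading space in each token matters: without it, " p." would also match an ordinary word ending in "p" at the end of a sentence.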
