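Text analysis is an important research branch of data mining, and even of deep learning. The R code below uses the Jaro-Winkler Distance text similarity algorithm to:
find highly similar questions in a question bank and output them as automatically clustered groups, which helps distill the key points to practice and makes reviewing more efficient.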
## Finding the practice focus
library('xlsx')           # read.xlsx()
library('RecordLinkage')  # jarowinkler()
# Read the question file
file <- "D:/data/Q_1.xlsx"
Q <- read.xlsx(file, 1, encoding = "UTF-8")
# For each question, find the other questions whose similarity is 80% or higher
PickOutGroup <- function(Q, out_file = "D:/data/Q_2.csv", threshold = 0.8, max_len_diff = 10) {
  NO <- as.numeric(as.character(Q$題號))
  Q_Main <- as.character(Q$題干)
  Q_Branches <- as.character(Q$選項)
  Q_Main_len <- as.numeric(as.character(Q$題干長度))
  PickingList <- character(length(NO))
  for (i in seq_along(NO)) {
    # jarowinkler() is vectorised, so question i is compared with every question at once
    # Question stem
    Q_Main_scores <- jarowinkler(Q_Main[i], Q_Main)
    # Options
    Q_Branches_scores <- jarowinkler(Q_Branches[i], Q_Branches)
    # Stem length: the two stems may differ by at most max_len_diff characters
    Q_Main_length <- abs(Q_Main_len[i] - Q_Main_len) <= max_len_diff
    # Keep the questions that pass all three rules; NA counts as "no match"
    pick <- Q_Main_scores >= threshold & Q_Branches_scores >= threshold & Q_Main_length
    pick[is.na(pick)] <- FALSE
    pick[i] <- FALSE  # do not match a question with itself
    # Question i's own number first, then the numbers of its similar questions
    PickingList[i] <- paste(c(NO[i], NO[pick]), collapse = ";")
  }
  result <- data.frame(NO = NO, PickingList = PickingList)
  # Write once at the end: write.csv() ignores append = TRUE,
  # so writing inside the loop would overwrite the file on every iteration
  write.csv(result, out_file, row.names = FALSE)
  invisible(result)
}
# Measure the run time of the code
system.time(PickOutGroup(Q))
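The function above produces one pick list per question, so the same group of similar questions shows up once for each of its members. As a rough sketch of the clustering step, the pick lists could be merged into groups as shown below. This assumes that similarity is treated as transitive (if question 1 matches 2 and 2 matches 3, all three form one group), which the original code does not state. The `picks` data and the `merge_groups` helper are hypothetical and use base R only.

```r
# Hypothetical pick lists in the same shape as the CSV output:
# NO is the question number, PickingList holds ";"-separated question numbers.
picks <- data.frame(
  NO = c(1, 2, 3, 4, 5),
  PickingList = c("1;2", "2;1;3", "3;2", "4", "5"),
  stringsAsFactors = FALSE
)

# Merge overlapping pick lists into groups (a simple union-find).
merge_groups <- function(picks) {
  ids <- as.character(picks$NO)
  parent <- setNames(ids, ids)  # every question starts as its own group
  find_root <- function(x) {
    while (parent[[x]] != x) x <- parent[[x]]
    x
  }
  for (k in seq_along(ids)) {
    members <- strsplit(as.character(picks$PickingList[k]), ";", fixed = TRUE)[[1]]
    # Ignore anything that is not a known question number
    for (m in intersect(members, ids)) {
      ra <- find_root(ids[k])
      rb <- find_root(m)
      if (ra != rb) parent[[rb]] <- ra  # link the two groups
    }
  }
  roots <- vapply(ids, find_root, character(1))
  unname(split(picks$NO, roots))
}

groups <- merge_groups(picks)
print(groups)
```

With the sample data, questions 1, 2 and 3 end up in one group, while 4 and 5 each stay on their own. Transitive merging can chain loosely related questions together, so a stricter rule may suit some question banks better.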
Reference: an introduction to the Jaro-Winkler Distance text similarity algorithm
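Jaro-Winkler Distance is a string metric for measuring the edit distance between two character sequences. It is a variant, proposed by William E. Winkler in 1990, of the Jaro Distance metric. The Jaro similarity of two strings is computed from the number of characters they share (matched within a limited window of positions) and the number of transpositions among those matched characters. Jaro-Winkler adds a prefix factor, so that when two pairs have the same Jaro score, the pair with the longer common prefix is rated as more similar. The similarity score is normalized: 0 means the strings have nothing in common and 1 means an exact match. The Jaro-Winkler Distance is 1 minus the Jaro-Winkler similarity, so the smaller the distance, the more similar the two strings. The `jarowinkler()` function used in the code returns the similarity, which is why scores of 0.8 or higher are kept. The formulas are as follows:

$$sim_j = \begin{cases} 0 & \text{if } m = 0 \\ \dfrac{1}{3}\left(\dfrac{m}{|s_1|} + \dfrac{m}{|s_2|} + \dfrac{m - t}{m}\right) & \text{otherwise} \end{cases}$$

$$sim_w = sim_j + \ell \, p \, (1 - sim_j), \qquad d_w = 1 - sim_w$$

Here $m$ is the number of matching characters (equal characters no more than $\lfloor \max(|s_1|, |s_2|)/2 \rfloor - 1$ positions apart), $t$ is half the number of matching characters that appear in a different order, $\ell$ is the length of the common prefix (at most 4), and $p$ is the prefix scaling factor, commonly 0.1.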
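To make the definition concrete, here is a minimal base R sketch of the Jaro and Jaro-Winkler similarity for two single strings. It follows the standard textbook definition (matching window of half the longer length minus one, prefix capped at 4 characters, scaling factor 0.1). The function names `jaro_sim` and `jaro_winkler_sim` are made up for this illustration, and it is not the implementation used by the RecordLinkage package, whose results may differ slightly.

```r
# Jaro similarity of two single strings
jaro_sim <- function(s1, s2) {
  a <- strsplit(s1, "")[[1]]
  b <- strsplit(s2, "")[[1]]
  la <- length(a)
  lb <- length(b)
  if (la == 0 && lb == 0) return(1)
  if (la == 0 || lb == 0) return(0)
  # Two equal characters match only if they are at most `window` positions apart
  window <- max(0, floor(max(la, lb) / 2) - 1)
  a_matched <- rep(FALSE, la)
  b_matched <- rep(FALSE, lb)
  for (i in seq_len(la)) {
    lo <- max(1, i - window)
    hi <- min(lb, i + window)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!b_matched[j] && a[i] == b[j]) {
        a_matched[i] <- TRUE
        b_matched[j] <- TRUE
        break
      }
    }
  }
  m <- sum(a_matched)
  if (m == 0) return(0)
  # Transpositions: half the matched characters that are out of order
  t <- sum(a[a_matched] != b[b_matched]) / 2
  (m / la + m / lb + (m - t) / m) / 3
}

# Jaro-Winkler similarity: Jaro plus a bonus for a shared prefix
jaro_winkler_sim <- function(s1, s2, p = 0.1, max_prefix = 4) {
  sj <- jaro_sim(s1, s2)
  a <- strsplit(s1, "")[[1]]
  b <- strsplit(s2, "")[[1]]
  n <- min(length(a), length(b), max_prefix)
  same <- a[seq_len(n)] == b[seq_len(n)]
  # Common prefix length = number of leading equal characters
  l <- if (all(same)) n else which(!same)[1] - 1
  sj + l * p * (1 - sj)
}

print(round(jaro_winkler_sim("MARTHA", "MARHTA"), 4))   # 0.9611
print(round(jaro_winkler_sim("DIXON", "DICKSONX"), 4))  # 0.8133
```

For "MARTHA" and "MARHTA" all six characters match with one transposition, giving a Jaro score of about 0.9444; the shared prefix "MAR" then raises the score to about 0.9611. The corresponding distance is 1 minus that value.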

Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/245963.html
Tags: R
