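How can I extract symbols, numbers, words of at most 3 letters, and words of at least 4 letters from a string, and store each one in an array for its category?
The given string is: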
const string = 'There are usually 100 to 200 words in a paragraph';
The expected result is:
const numbers = ['200', '100'];
const wordsMoreThanThreeLetters = ['There', 'words', 'paragraph', 'usually'];
const symbols = [' '];
const words = ['are', 'to', 'in', 'a'];
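A uj5u.com user replied:
One workable approach is to split the string at any run of whitespace and then apply reduce to the array that split returns.
The reducer function collects each string item (token) into the array for its category, as defined by the OP, and relies on helper functions such as the digits-only test and the word test ...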
function collectWordsDigitsAndRest(collector, token) {
  // true if the value consists of digits only
  const isDigitsOnly = value => (/^\d+$/).test(value);
  // true if the value consists of word characters only
  const isWord = value => (/^\w+$/).test(value);

  const listName = isDigitsOnly(token)
    ? 'digits'
    : (
      isWord(token)
        ? (token.length <= 3) && 'shortWords' || 'longWords'
        : 'rest'
    );
  // create the category's array on first use, then append the token
  (collector[listName] ??= []).push(token);

  return collector;
}
const {
  longWords: wordsMoreThanThreeLetters = [],
  shortWords: words = [],
  digits: numbers = [],
  rest: symbols = [],
} = 'There are usually 100 to 200 words in a paragraph'
  .split(/\s+/)
  .reduce(collectWordsDigitsAndRest, {});

console.log({
  wordsMoreThanThreeLetters,
  words,
  numbers,
  symbols,
});
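Of course, the required tokens can also be obtained with matchAll and a single regular expression (RegExp) that uses named capture groups and Unicode property escapes for better internationalization (i18n) coverage.
The regular expression itself looks and works as follows ...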
(?:\b(?<digit>\p{N}+)|(?<longWord>\p{L}{4,})|(?<shortWord>\p{L}+)\b)|(?<rest>[^\p{Z}]+)
... which is derived from ...
(?:\b(?<digit>\p{N}+)|(?<word>\p{L}+)\b)|(?<rest>[^\p{Z}]+)
The reducer function from the first approach has to be adapted for the second one so that it handles each captured group accordingly ...
function collectWordsDigitsAndRest(collector, { groups }) {
  const { shortWord, longWord, digit, rest } = groups;

  const listName = (shortWord
    && 'shortWords') || (longWord
    && 'longWords') || (digit
    && 'digits') || (rest
    && 'rest');

  if (listName) {
    (collector[listName] ??= []).push(shortWord || longWord || digit || rest);
  }
  return collector;
}
// Unicode Categories ... [https://www.regularexpressions.info/unicode.html#category]
// regex101.com ... [https://regex101.com/r/nCga5u/2]
const regXWordDigitRestTokens =
  /(?:\b(?<digit>\p{N}+)|(?<longWord>\p{L}{4,})|(?<shortWord>\p{L}+)\b)|(?<rest>[^\p{Z}]+)/gmu;

const {
  longWords: wordsMoreThanThreeLetters = [],
  shortWords: words = [],
  digits: numbers = [],
  rest: symbols = [],
} = Array
  .from(
    'There are usually 100 to 200 words -- ** in a paragraph.'
      .matchAll(regXWordDigitRestTokens)
  )
  .reduce(collectWordsDigitsAndRest, {});

console.log({
  wordsMoreThanThreeLetters,
  words,
  numbers,
  symbols,
});
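A uj5u.com user replied:
What you are trying to do is called tokenization. It is usually done with regular expressions: you write one regular expression for each kind of token you want to recognize. Each token is surrounded by whitespace. The position between whitespace and a word is called a word boundary and is matched by \b. The following regular expressions use Unicode character classes. Symbols are not words, so they have no word boundaries.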
- Words with three letters or fewer: \b\p{Letter}{1,3}\b
- Words with more than three letters: \b\p{Letter}{4,}\b
- Numbers: \b\p{Number}+\b
- Symbols: \p{Symbol}+
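To tell the different tokens apart, it helps to put each regular expression into a named capture group: (?<anything>.*). This matches anything and stores the match in the capture group named anything.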
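To make the named-group idea concrete, here is a minimal sketch. The pattern and the names demoPattern, demoMatches, word and number are made up for this illustration and are not part of the answer's code.

```javascript
// Hypothetical pattern for illustration: one named group per token kind.
const demoPattern = /(?<word>\p{L}+)|(?<number>\p{N}+)/gu;

// Every match carries a `groups` object. Only the alternative that
// actually matched is defined; the other group is undefined.
const demoMatches = [...'abc 123'.matchAll(demoPattern)]
  .map(match => ({ ...match.groups }));

console.log(demoMatches);
// first match:  word === 'abc',     number === undefined
// second match: word === undefined, number === '123'
```

Checking which group is defined is what lets a single combined pattern sort tokens into categories.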
const input = 'There are usually 100 to 200 words in a paragraph';

// one named capture group per token category
let rx = new RegExp ([
  '(?<wle3>\\b\\p{L}{1,3}\\b)',
  '(?<wgt3>\\b\\p{L}{4,}\\b)',
  '(?<n>\\b\\p{N}+\\b)',
  '(?<s>\\p{S}+)'
].join ('|'),
'gmu');

let words_le_3 = [];
let words_gt_3 = [];
let numbers = [];
let symbols = [];

for (const match of input.matchAll(rx)) {
  let g = match.groups;
  // exactly one group is defined per match; push it to its array
  switch (true) {
    case (!!g.wle3): words_le_3.push (g.wle3); break;
    case (!!g.wgt3): words_gt_3.push (g.wgt3); break;
    case (!!g.n): numbers.push (g.n); break;
    case (!!g.s): symbols.push (g.s); break;
  }
}

console.log (`Words with up to three letters: ${words_le_3}`);
console.log (`Words with more than three letters: ${words_gt_3}`);
console.log (`Numbers: ${numbers}`);
console.log (`Symbols: ${symbols}`);
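The code would be simpler if the matches were stored in one object instead of four top-level arrays. In that case, the switch statement could be replaced by a loop over the groups and an assignment.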
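A minimal sketch of that object-based variant might look like this. The group names (shortWords, longWords, numbers, symbols) and the names tokenPattern and tokenize are assumptions chosen for this example so that each group name doubles as the key of the result object.

```javascript
// Sketch only: group names are hypothetical and double as result keys.
const tokenPattern = new RegExp([
  '(?<shortWords>\\b\\p{L}{1,3}\\b)',
  '(?<longWords>\\b\\p{L}{4,}\\b)',
  '(?<numbers>\\b\\p{N}+\\b)',
  '(?<symbols>\\p{S}+)',
].join('|'), 'gu');

function tokenize(text) {
  const result = { shortWords: [], longWords: [], numbers: [], symbols: [] };
  for (const match of text.matchAll(tokenPattern)) {
    // Loop over the groups; the one that is defined names its target list.
    for (const [name, value] of Object.entries(match.groups)) {
      if (value !== undefined) result[name].push(value);
    }
  }
  return result;
}

console.log(tokenize('There are usually 100 to 200 words in a paragraph'));
```

Note that \p{S} covers symbols such as + or $, while punctuation such as . or - belongs to \p{P}, so a pattern like this one would not collect punctuation.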
A uj5u.com user replied:
const string = 'There are usually 100 to 200 words in a paragraph';
const response = [];
// collect every character of the string, spaces included
for (let i = 0; i < string.length; i++) {
  response.push(string[i]);
  // console.log(response); // logs every step of the loop
}
console.log(response);
A uj5u.com user replied:
You can write a separate function for each of these cases:
const txt = 'There are usually 100 to 200 words in a paragraph';
console.log(txt);
console.log( ctrim(txt) )

// keeps every space-separated token of at most 3 characters (numbers included)
function ctrim(txt) {
  let w = txt.split(' ');
  let _w = []
  w.forEach((w) => {
    if(w.length <= 3) {
      _w.push( w )
    }
  })
  return _w
}
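As a rough sketch of how the same split-based idea could cover all four categories from the question, consider the following. The function name classifyTokens is hypothetical, and the sketch assumes that a "word" means ASCII letters only and that any other non-numeric token counts as a symbol.

```javascript
// Hypothetical helper, not part of the answer above: sorts each
// space-separated token into one of the question's four categories.
function classifyTokens(txt) {
  const result = {
    numbers: [],
    words: [],
    wordsMoreThanThreeLetters: [],
    symbols: [],
  };
  for (const token of txt.split(' ')) {
    if (token === '') continue; // empty strings appear between repeated spaces
    if (/^\d+$/.test(token)) {
      result.numbers.push(token);
    } else if (!/^[A-Za-z]+$/.test(token)) {
      result.symbols.push(token); // assumption: anything else is a symbol
    } else if (token.length <= 3) {
      result.words.push(token);
    } else {
      result.wordsMoreThanThreeLetters.push(token);
    }
  }
  return result;
}

console.log(classifyTokens('There are usually 100 to 200 words in a paragraph'));
```

Unlike the regex-based answers, this treats a token such as "paragraph." as a symbol because the trailing dot is never split off.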
