I'm looking for a way to normalize Unicode input text that contains typographic ligatures, for example:
# Things to replace, for instance:
U+FB00 (ff): ff
U+FB01 (fi): fi
U+FB02 (fl): fl
U+FB03 (ffi): ffi
U+FB04 (ffl): ffl
U+FB05 (ſt): st
U+FB06 (st): st
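I want to keep all diacritics, punctuation, and other marks that can be decomposed, but not the typographic ligatures.
For example, I want to keep the trademark sign and the ellipsis.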
# Things to keep, for instance:
U+2122 (™): TM
U+2026 (…): ...
U+2120 (℠): SM
U+2121 (℡): TEL
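To see why a blanket normalization does not solve this on its own, here is a minimal sketch comparing Ruby's built-in `String#unicode_normalize` in NFC and NFKC mode on a few of the characters above. The `normal_forms` helper is a hypothetical name used only for illustration, and the sketch assumes UTF-8 strings.

```ruby
# Illustrative helper (not from the original post): returns both normal
# forms of a string so they can be compared side by side.
def normal_forms(str)
  { nfc: str.unicode_normalize(:nfc), nfkc: str.unicode_normalize(:nfkc) }
end

# fi ligature, trademark sign, ellipsis, precomposed e-acute
["\uFB01", "\u2122", "\u2026", "\u00E9"].each do |ch|
  forms = normal_forms(ch)
  puts format("U+%04X  nfc=%p  nfkc=%p", ch.ord, forms[:nfc], forms[:nfkc])
end
```

NFC leaves both the ligature and the trademark sign untouched, while NFKC expands both, so neither form alone keeps one and replaces the other.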
I have searched for a solution and found some related answers:
- https://superuser.com/questions/669130/double-latin-letters-in-unicode-ligatures
- Separating Unicode ligature characters
Is there a Ruby-specific way to do this?
Answer from a uj5u.com user:
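My current hacky solution:
# Replace each listed ligature with its NFKC (compatibility) decomposition.
def self.remove_ligatures(input)
  # Build the character class once and cache it in a class variable.
  @@ligature_char_regex ||= /[#{ligature_chars.join}]/
  input.gsub(@@ligature_char_regex) { |c| c.unicode_normalize(:nfkc) }
end
This works, but it relies on a long, manually defined list of characters (see below), and it is probably not the fastest approach.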
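As a rough, self-contained sketch of the same idea, the version below limits itself to the Latin ligatures U+FB00 through U+FB06 so that it can be run on its own. The module name `LigatureExpander`, the constant `LIGATURES`, and the restriction to that one range are assumptions made for illustration; they are not part of the original code.

```ruby
# Sketch only: expands the Latin presentation-form ligatures and leaves
# every other character (accents, trademark sign, ellipsis) untouched.
module LigatureExpander
  # Assumption: only U+FB00..U+FB06 are treated as ligatures here.
  LIGATURES = /[\uFB00-\uFB06]/

  def self.remove_ligatures(input)
    # NFKC is applied to the matched ligature only, never to the whole string.
    input.gsub(LIGATURES) { |c| c.unicode_normalize(:nfkc) }
  end
end

# "efficient flow", then a trademark sign, "cafe" with e-acute, an ellipsis
text = "e\uFB03cient \uFB02ow\u2122 caf\u00E9\u2026"
puts LigatureExpander.remove_ligatures(text)
```

Because normalization only ever sees the matched character, the trademark sign and the ellipsis pass through unchanged even though NFKC would otherwise expand them.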
# Return the list of all characters which decompose
# into multiple ascii/accented characters
#
# Manually commented out those that are not typographic
# ligatures such as Trademark, Medical Doctor, CD
#
# List from: https://superuser.com/questions/669130/double-latin-letters-in-unicode-ligatures
def self.ligature_chars
return [
"\u0132", # (IJ): IJ
"\u0133", # (ij): ij
"\u01C7", # (LJ): LJ
"\u01C8", # (Lj): Lj
"\u01C9", # (lj): lj
"\u01CA", # (NJ): NJ
"\u01CB", # (Nj): Nj
"\u01CC", # (nj): nj
"\u01F1", # (DZ): DZ
"\u01F2", # (Dz): Dz
"\u01F3", # (dz): dz
"\u20A8", # (₨): Rs
"\u2116", # (№): No
# "\u2120", # (℠): SM
# "\u2121", # (℡): TEL
# "\u2122", # (™): TM
"\u213B", # (℻): FAX
"\u2161", # (Ⅱ): II
"\u2162", # (Ⅲ): III
"\u2163", # (Ⅳ): IV
"\u2165", # (Ⅵ): VI
"\u2166", # (Ⅶ): VII
"\u2167", # (Ⅷ): VIII
"\u2168", # (Ⅸ): IX
"\u216A", # (Ⅺ): XI
"\u216B", # (Ⅻ): XII
"\u2171", # (ⅱ): ii
"\u2172", # (ⅲ): iii
"\u2173", # (ⅳ): iv
"\u2175", # (ⅵ): vi
"\u2176", # (ⅶ): vii
"\u2177", # (ⅷ): viii
"\u2178", # (ⅸ): ix
"\u217A", # (ⅺ): xi
"\u217B", # (ⅻ): xii
"\u3250", # (㉐): PTE
"\u32CC", # (㋌): Hg
"\u32CD", # (㋍): erg
"\u32CE", # (㋎): eV
"\u32CF", # (㋏): LTD
"\u3371", # (㍱): hPa
"\u3372", # (㍲): da
"\u3373", # (㍳): AU
"\u3374", # (㍴): bar
"\u3375", # (㍵): oV
"\u3376", # (㍶): pc
"\u3377", # (㍷): dm
"\u337A", # (㍺): IU
"\u3380", # (㎀): pA
"\u3381", # (㎁): nA
"\u3383", # (㎃): mA
"\u3384", # (㎄): kA
"\u3385", # (㎅): KB
"\u3386", # (㎆): MB
"\u3387", # (㎇): GB
"\u3388", # (㎈): cal
"\u3389", # (㎉): kcal
"\u338A", # (㎊): pF
"\u338B", # (㎋): nF
"\u338E", # (㎎): mg
"\u338F", # (㎏): kg
"\u3390", # (㎐): Hz
"\u3391", # (㎑): kHz
"\u3392", # (㎒): MHz
"\u3393", # (㎓): GHz
"\u3394", # (㎔): THz
"\u3396", # (㎖): ml
"\u3397", # (㎗): dl
"\u3398", # (㎘): kl
"\u3399", # (㎙): fm
"\u339A", # (㎚): nm
"\u339C", # (㎜): mm
"\u339D", # (㎝): cm
"\u339E", # (㎞): km
"\u33A9", # (㎩): Pa
"\u33AA", # (㎪): kPa
"\u33AB", # (㎫): MPa
"\u33AC", # (㎬): GPa
"\u33AD", # (㎭): rad
"\u33B0", # (㎰): ps
"\u33B1", # (㎱): ns
"\u33B3", # (㎳): ms
"\u33B4", # (㎴): pV
"\u33B5", # (㎵): nV
"\u33B7", # (㎷): mV
"\u33B8", # (㎸): kV
"\u33B9", # (㎹): MV
"\u33BA", # (㎺): pW
"\u33BB", # (㎻): nW
"\u33BD", # (㎽): mW
"\u33BE", # (㎾): kW
"\u33BF", # (㎿): MW
"\u33C3", # (㏃): Bq
"\u33C4", # (㏄): cc
"\u33C5", # (㏅): cd
"\u33C8", # (㏈): dB
"\u33C9", # (㏉): Gy
"\u33CA", # (㏊): ha
"\u33CB", # (㏋): HP
"\u33CC", # (㏌): in
"\u33CD", # (㏍): KK
"\u33CE", # (㏎): KM
"\u33CF", # (㏏): kt
"\u33D0", # (㏐): lm
"\u33D1", # (㏑): ln
"\u33D2", # (㏒): log
"\u33D3", # (㏓): lx
"\u33D4", # (㏔): mb
"\u33D5", # (㏕): mil
"\u33D6", # (㏖): mol
"\u33D7", # (㏗): PH
"\u33D9", # (㏙): PPM
"\u33DA", # (㏚): PR
"\u33DB", # (㏛): sr
"\u33DC", # (㏜): Sv
"\u33DD", # (㏝): Wb
"\u33FF", # (㏿): gal
"\uFB00", # (ff): ff
"\uFB01", # (fi): fi
"\uFB02", # (fl): fl
"\uFB03", # (ffi): ffi
"\uFB04", # (ffl): ffl
"\uFB05", # (ſt): st
"\uFB06", # (st): st
# Code points above U+FFFF need the braced \u{...} escape form.
# "\u{1F12D}", # (🄭): CD
# "\u{1F12E}", # (🄮): WZ
# "\u{1F14A}", # (🅊): HV
# "\u{1F14B}", # (🅋): MV
# "\u{1F14C}", # (🅌): SD
# "\u{1F14D}", # (🅍): SS
# "\u{1F14E}", # (🅎): PPV
# "\u{1F14F}", # (🅏): WC
# "\u{1F16A}", # (🅪): MC
# "\u{1F16B}", # (🅫): MD
"\u{1F190}", # (🆐): DJ
"\u01C4", # (DŽ): DŽ
"\u01C5", # (Dž): Dž
"\u01C6", # (dž): dž
]
end
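One detail worth checking in a list like this: Ruby's `\uXXXX` escape takes exactly four hex digits, so code points above U+FFFF have to be written in the braced `\u{...}` form. The short sketch below shows the difference; the variable names are illustrative only.

```ruby
# "\u1F190" is read as U+1F19 followed by the literal digit "0".
four_digit = "\u1F190"
# "\u{1F190}" is the single code point U+1F190 (SQUARED DJ).
braced = "\u{1F190}"

p four_digit.codepoints.map { |cp| format("U+%04X", cp) }
p braced.codepoints.map { |cp| format("U+%04X", cp) }
```

With the four-digit form inside a character class, the regex would match U+1F19 and the digit "0" instead of the intended symbol.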
Answer from a uj5u.com user:
You can do it as follows.
h = { "\uFB00"=>"ff", "\uFB01"=>"fi", "\uFB02"=>"fl", "\uFB03"=>"ffi",
      "\uFB04"=>"ffl", "\uFB05"=>"st", "\uFB06"=>"st" }
  #=> {"ff"=>"ff", "fi"=>"fi", "fl"=>"fl", "ffi"=>"ffi",
  #    "ffl"=>"ffl", "ſt"=>"st", "st"=>"st"}
s = "A ff or fiwas fi seen? fl before ffi had ffl go ſt before st"
r = /\b(?:#{h.keys.join('|')})\b/
  #=> /\b(?:ff|fi|fl|ffi|ffl|ſt|st)\b/
s.gsub(r, h)
  #=> "A ff or fiwas fi seen? fl before ffi had ffl go st before st"
Note that because of the word boundaries (\b) in the regular expression, the "fi" in "fiwas" is not matched.
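If ligatures inside words should be expanded as well (which is where they usually occur, as in "efficient"), one possible variation is to drop the word boundaries. The sketch below is an adaptation, not the answer's code: `LIGATURE_MAP`, `LIGATURE_RE`, and `expand_ligatures` are hypothetical names, and it relies only on `Regexp.union` and the hash form of `String#gsub`.

```ruby
# Sketch only: same lookup table as above, but matched anywhere in the text.
LIGATURE_MAP = {
  "\uFB00" => "ff", "\uFB01" => "fi", "\uFB02" => "fl", "\uFB03" => "ffi",
  "\uFB04" => "ffl", "\uFB05" => "st", "\uFB06" => "st"
}.freeze

# Regexp.union builds an alternation of the (escaped) keys, with no \b anchors.
LIGATURE_RE = Regexp.union(LIGATURE_MAP.keys)

def expand_ligatures(str)
  # With a hash as the second argument, gsub looks up each match as a key.
  str.gsub(LIGATURE_RE, LIGATURE_MAP)
end

# fi ligature + "was", "efficient" with an ffi ligature, then a trademark sign
puts expand_ligatures("\uFB01was e\uFB03cient\u2122")
```

Characters that are not keys of the table, such as the trademark sign, are never matched and therefore stay as they are.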
