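I have this table dump from a MySQL system. Although it follows the RFC standard, it seems to add unwanted whitespace in the column that stores HTML text. For example: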
"2000","Something","Something,"Something","Something","Something","2017-11-15 15:12:51","115060","Something","Something","Something","Something","","Something","Something","Something","Tabuk","TKPR","999","Something","Something","103984","Something","Something","UTC 03:00","sameday","15","100","3","1443","1","Something","3","Something","<div style=""margin:1em;"">"
<div lang=""en"dir=""ltr"style=""font-family: Microgramma;"">"
This is one of roughly 30K rows, so I'm trying to find a clever way to remove the whitespace between " and <div (and possibly others). I tried:
awk '{$1=$1;printf $0}'
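This sort of works, but it merges everything into a single line, which is not what I want. I want to preserve the line breaks in the CSV dump. I'd love to hear your thoughts on how to solve this.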
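To see why the attempt above collapses the file, here is a small sketch with made-up sample data (not the real dump). Assigning `$1=$1` makes awk rebuild each record with single spaces between fields, and `printf $0` emits it with no trailing newline, so consecutive lines run together. Swapping in `print` keeps the line breaks, but it still squeezes every run of spaces and leaves the break before <div in place, so this only explains the symptom and is not a fix.

```shell
# Tiny stand-in for the dump: two lines with irregular spacing (made-up data).
sample='a   b
c    d'

# The attempted command: $1=$1 rebuilds the record with single spaces,
# and printf $0 prints it without a newline, so the lines are glued together.
# (Using $0 as the printf format is also fragile if the data contains %.)
joined=$(printf '%s\n' "$sample" | awk '{$1=$1; printf $0}')

# Same field rebuild, but print appends the output record separator (a newline).
kept=$(printf '%s\n' "$sample" | awk '{$1=$1; print}')

printf '%s\n' "$joined"
printf '%s\n' "$kept"
```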
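Answer from a uj5u.com user:
Even if your input file is huge, the following will work. It uses GNU awk for multi-char RS, RT, and gensub(), and it doesn't read the whole file into memory; it only reads one string at a time, each terminated by either "<spaces>< or a newline:
$ awk -v RS='"\\s+<|\n' '{printf "%s%s", $0, gensub(/"\s+</,"\"<",1,RT)}' file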
"2000","Something","Something,"Something","Something","Something","2017-11-15 15:12:51","115060","Something","Something","Something","Something","","Something","Something","Something","Tabuk","TKPR","999","Something","Something","103984","Something","Something","UTC 03:00","sameday","15","100","3","1443","1","Something","3","Something","<div style=""margin:1em;"">"<div lang=""en"dir=""ltr"style=""font-family: Microgramma;"">"
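I'm assuming that when you said "and possibly others" in your question you meant other cases like "<spaces><div>, where a " is followed by whitespace and then the < that starts a tag, but that's obviously just a guess.
Answer from a uj5u.com user:
With only your shown samples, please try the following awk code. Written and tested in GNU awk. Simple explanation: set RS (the record separator) to null, and in the main program globally substitute newlines followed by whitespace followed by <div with just <div, then print the record the awk-ish way with 1.
awk -v RS="" '{gsub(/\n+[[:space:]]+<div/,"<div")} 1' Input_file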
Answer from a uj5u.com user:
Assuming your requirement is to remove the whitespace before the start of the <div tag, you can try this GNU sed
$ sed -z 's/\(\"\)[[:space:]]\+\(<div .*\)/\1\n\2/' input_file
"2000","Something","Something,"Something","Something","Something","2017-11-15 15:12:51","115060","Something","Something","Something","Something","","Something","Something","Something","Tabuk","TKPR","999","Something","Something","103984","Something","Something","UTC 03:00","sameday","15","100","3","1443","1","Something","3","Something","<div style=""margin:1em;"">"
<div lang=""en"dir=""ltr"style=""font-family: Microgramma;"">"
Answer from a uj5u.com user:
You can do this with perl:
perl -0777 -i -pe 's/"\K\s+(?=<div)//g' file
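Details
-0777 slurps the file into a single string so that the pattern can match across line breaks
-i turns on in-place file replacement
"\K\s+(?=<div) matches a " that is then dropped from the match value by \K, consumes one or more whitespace characters (with \s+), and requires <div to follow immediately; the match is replaced with an empty string
g replaces all occurrences.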
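You can achieve the same with GNU sed:
sed -i -Ez 's/"\s+<div/"<div/g' file
where -i enables in-place file replacement, -E enables POSIX ERE regex syntax, and -z makes sed split the input on NUL bytes instead of newlines, so the whole file text is pulled into the pattern space, where line breaks are "visible" to the regex pattern.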
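A cautious way to try the in-place variant is to run it on a scratch copy first. The sketch below assumes GNU sed (both -z and \s are GNU extensions) and uses a temporary file with invented sample rows standing in for the dump.

```shell
# Scratch file with made-up rows; the real dump is not touched.
tmp=$(mktemp)
printf '%s\n' '"1","<div a>"' '   <div b>"' '"2","plain"' > "$tmp"

# Edit the scratch file in place: the quote, the whitespace run
# (including the newline) and <div are replaced by "<div.
sed -i -Ez 's/"\s+<div/"<div/g' "$tmp"

result=$(cat "$tmp")
rm -f "$tmp"

printf '%s\n' "$result"
```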
