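Want to split a line of content, retaining the heading.
I do a lot of text processing, and I like using unix one-liners because they are easy to organize over time (compared with lots of scripts), I can easily chain them together, and I like (re)learning how to use the classic Unix functions. Usually I'll use a short awk, perl, or ruby one-liner, depending on which is most elegant.
Here, I have lines with X comma-separated items. I want to split them up, keeping the heading.
Input: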
animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare, goose, horse, mouse, pig, dog, frog, bug, fish, duck, camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider, deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit, elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth, shark, salmon, shrimp, mosquito, horseshoe crab
Output:
animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
animals = shark, salmon, shrimp, mosquito, horseshoe crab
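Algorithm details:
- An input line consists of a head word, then an equals sign, then a comma-separated list of at least 1 item.
- In this example most items are single words, but an item can contain a space (e.g. "horseshoe crab" at the end).
- Split at 9 items, unless fewer than 3 are left over, in which case the final split may put as many as 12 on one line.
- There are multiple lines. For example, the next line might be planets.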
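The rules above can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions: `split_line`, `per_line`, and `min_tail` are hypothetical names, input lines are assumed to be well formed, and the "fewer than 3" rule is taken literally, so a merged final line holds at most 11 items.

```ruby
# Hypothetical helper: turn one "head = a, b, c" line into output lines.
# per_line and min_tail are illustrative names, not part of the question.
def split_line(line, per_line: 9, min_tail: 3)
  # Assumes the line contains exactly one " = " style separator.
  head, rest = line.chomp.split(/\s*=\s*/, 2)
  items = rest.split(/\s*,\s*/)
  groups = items.each_slice(per_line).to_a
  # Fold a short final group into the previous one so it does not dangle.
  if groups.size > 1 && groups.last.size < min_tail
    tail = groups.pop
    groups[-1] += tail
  end
  groups.map { |group| "#{head} = #{group.join(', ')}" }
end

puts split_line("planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto, vulcan, arrakis")
```

Items containing spaces survive because the split is on commas only, not on whitespace.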
I had the idea of escaping the spaces, then using unix fold, then using awk to pull down the first column. This produces exactly the output above:
echo "animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare, goose, horse, mouse, pig, dog, frog, bug, fish, duck, camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider, deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit, elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth, shark, salmon, shrimp, mosquito, horseshoe crab" \
| \tr ' ,' '_ ' \
| fold -s \
| perl -pe 's/=/\t/; s/^_/\t_/g;' \
| awk 'BEGIN{FS=OFS="\t"} $1==""{$1=p} {p=$1} 1' \
| tr '\t _' '=, '
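But it only considers character length (not the number of items), and it does not handle my special case, where I don't want fewer than 3 items dangling on the last line.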
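To make the difference concrete, here is a rough Ruby sketch of wrapping by character width, which is approximately what the fold approach does. The helper name `wrap_by_width`, the sample words, and the width of 24 are all assumptions chosen for illustration.

```ruby
# Hypothetical illustration: greedy wrapping by character width.
# Lines of similar width can end up holding very different item counts.
def wrap_by_width(items, width)
  lines = [[]]
  items.each do |item|
    candidate = (lines.last + [item]).join(', ')
    # Start a new line when adding the item would exceed the width.
    lines << [] if candidate.length > width && !lines.last.empty?
    lines.last << item
  end
  lines
end

words = %w[bee pig dog bug rat hippopotamus rhinoceros chimpanzee]
wrap_by_width(words, 24).each { |line| puts "#{line.size} items: #{line.join(', ')}" }
```

Short words pack many items onto a line while long words pack few, which is why a width-based tool cannot enforce "9 items per line".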
I think this is an elegant little puzzle. Any ideas?
uj5u.com community member replied:
With Perl, one way:
perl -wnlE'
($head, @items) = split /\s*[,=]\s*/;
while (@items) {
@elems = splice @items, 0, 9;
if (@elems < 3) { $lines[-1] .= ", " . join ", ", @elems }
else { push @lines, join ", ", @elems }
}
say "$head = $_" for @lines; @lines = ()
' file
Or
perl -wnlE'
($head, @items) = split /\s*[,=]\s*/;
push @lines, join ", ", splice @items, 0, 9 while @items;
$lines[-2] .= ", " . pop @lines if 2 > $lines[-1] =~ tr/,//;
say "$head = $_" for @lines; @lines = ()
' file
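Shown over multiple lines for readability; it can be copy-pasted into a bash terminal as is, but it can also be entered on a single line. Also tested with 11 (9+2) items.
Notes
- split the line on either , or =, extracting the head word first and then the items on the line.
- splice removes and returns the (first 9) elements, which are joined by , to form a line to print, until all elements are processed. If the last group has fewer than 3 elements, it is appended to the previous line to print.
- In the second version all elements are processed first, and then the last line to print is checked for whether it needs to be appended to the previous one, by counting the commas in it.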
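For readers more at home in Ruby, the comma-counting idea of the second version might look roughly like this. It is a sketch, not a translation endorsed by the answer: `split_by_comma_count` is a hypothetical name, `Array#shift(9)` stands in for Perl's splice, and the `lines.size > 1` guard is an addition so that a line with only one or two items is left alone.

```ruby
# Hypothetical Ruby rendering of the "build all lines, then fix up the last" idea.
def split_by_comma_count(line)
  head, *items = line.chomp.split(/\s*[,=]\s*/)
  lines = []
  # shift(9) removes and returns up to 9 leading elements, like splice @items, 0, 9.
  lines << items.shift(9).join(', ') until items.empty?
  # A line with fewer than 2 commas holds fewer than 3 items.
  if lines.size > 1 && lines.last.count(',') < 2
    tail = lines.pop
    lines[-1] += ", #{tail}"
  end
  lines.map { |l| "#{head} = #{l}" }
end

puts split_by_comma_count("n = " + (1..20).to_a.join(', '))
```

Counting commas only works because items themselves never contain a comma, which the input format guarantees.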
uj5u.com community member replied:
You may consider this awk:
awk 'BEGIN {FS=OFS=" = "} {
  s = $2
  while (match(s, /([^,]+, ){1,9}(([^,]+, ){2}[^,]+$)?/)) {
    v = substr(s, RSTART, RLENGTH)
    sub(/, $/, "", v)
    print $1, v
    s = substr(s, RLENGTH + 1)
  }
}' file
animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
animals = shark, salmon, shrimp, mosquito, horseshoe crab
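Pay particular attention to the regex used here: /([^,]+, ){1,9}(([^,]+, ){2}[^,]+$)?/
This matches 1 to 9 words separated by the ", " delimiter. The regex also has an optional part that matches 3 more words right before the end of the line.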
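uj5u.com community member replied:
With your shown samples only, please try the following awk program. Written and tested in GNU awk; it should work in any awk.
I have created an awk variable named numberOfFields, which holds the number of fields to print on each line before starting a new one (as per the shown samples).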
awk -v numberOfFields="9" '
BEGIN{
FS=", ";OFS=", "
}
{
line=$0
sub(/ = .*/,"",line)
sub(/^[^ ]* =[^ ]* /,"")
for(i=1;i<=NF;i++){
printf("%s",(i%numberOfFields==0?OFS $i ORS line" = ":\
(i==1?line " = " $i:(i%numberOfFields>1?OFS $i:$i))))
}
}
END{
print ""
}
' Input_file
The code above has the printf statement on two lines (for readability); if you want it on a single line, try the following:
awk -v numberOfFields="9" '
BEGIN{
FS=", ";OFS=", "
}
{
line=$0
sub(/ = .*/,"",line)
sub(/^[^ ]* =[^ ]* /,"")
for(i=1;i<=NF;i++){
printf("%s",(i%numberOfFields==0?OFS $i ORS line" = ":(i==1?line " = " $i:(i%numberOfFields>1?OFS $i:$i))))
}
}
END{
print ""
}
' Input_file
Explanation: adding a detailed explanation of the above.
awk -v numberOfFields="9" ' ##Starting awk program from here, creating variable named numberOfFields and setting its value to 9 here.
BEGIN{ ##Starting BEGIN section of awk here.
FS=", ";OFS=", " ##Setting FS and OFS to comma space here.
}
{
line=$0 ##Setting value of $0 to line here.
sub(/ = .*/,"",line) ##Substituting space = space everything till last of value in line with NULL.
sub(/^[^ ]* =[^ ]* /,"") ##Substituting from starting till first occurrence of space followed by = followed by again first occurrence of space with NULL in current line.
for(i=1;i<=NF;i++){ ##Running for loop here for all fields.
printf("%s",(i%numberOfFields==0?OFS $i ORS line" = ":\ ##Using printf and its conditions are explained below of code.
(i==1?line " = " $i:(i%numberOfFields>1?OFS $i:$i))))
}
}
END{ ##Starting END block of this program from here.
print "" ##Printing newline here.
}
' Input_file ##Mentioning Input_file name here.
Explanation of the printf conditions above:
(
i%numberOfFields==0 ##checking if the modulus value of i%numberOfFields is 0 here, if this is TRUE:
?OFS $i ORS line" = " ##Then printing OFS $i ORS line" = " (comma space, field value, new line, line variable and space = space)
:(i==1 ##If the very first condition is FALSE then checking again if i==1
?line " = " $i ##Then print the line variable followed by space = space followed by $i
:(i%numberOfFields>1?OFS $i:$i) ##Else if the modulus value of i%numberOfFields is greater than 1 then print OFS $i, else print $i.
)
)
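The nested ternary may be easier to follow when unrolled into an if/elsif chain. The Ruby sketch below mirrors the four cases; `piece` and its parameter names are hypothetical, and `i` is the 1-based field index as in the awk loop.

```ruby
# Hypothetical unrolling of the nested ternary; returns the text to print for field i.
def piece(i, field, head, per_line)
  if (i % per_line).zero?
    ", #{field}\n#{head} = "   # last field of a row: close it and open the next row
  elsif i == 1
    "#{head} = #{field}"        # very first field: emit the head word first
  elsif i % per_line > 1
    ", #{field}"                # middle of a row: separator, then the field
  else
    field                       # first field of a later row (i % per_line == 1)
  end
end

fields = %w[a b c d e]
puts fields.each_with_index.map { |f, idx| piece(idx + 1, f, 'x', 3) }.join
```

One consequence visible in this form: when the number of fields is an exact multiple of `per_line`, the first branch fires on the last field and leaves a trailing "head = " with nothing after it.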
uj5u.com community member replied:
An awk idea:
awk -F'[=,]' -v min=3 -v max=9 '
{ for (i=2; i<=NF; i++) {
if ( (i-1) % max == 1 && (NF-i+1 > min) ) {
if ( i > max ) print newline
newline=$1 "="
pfx=""
}
newline=newline pfx $i
pfx=","
}
print newline
}
' raw.dat
Sample data:
$ cat raw.dat
animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare, goose, horse, mouse, pig, dog, frog, bug, fish, duck, camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider, deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit, elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth, shark, salmon, shrimp, mosquito, horseshoe crab
planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto, vulcan, arrakis, hoth, naboo
numbers = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
numbers2 = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
With -v min=3 -v max=9 we get:
animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
animals = shark, salmon, shrimp, mosquito, horseshoe crab
planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto
planets = vulcan, arrakis, hoth, naboo
numbers = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
numbers2 = 1, 2, 3, 4, 5, 6, 7, 8, 9
numbers2 = 10, 11, 12, 13
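Addressing OP's comment about using one-liners...
While this awk script could certainly be jammed into a single line, I'm guessing OP would a) find it hard to edit/maintain and b) find it too easy to mess up if it had to be (re)typed over and over.
One (obvious?) idea is to wrap the awk code in a function, e.g.: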
splitme() {
awk -F'[=,]' -v min=$1 -v max=$2 '
{ for (i=2; i<=NF; i++) {
if ( (i-1) % max == 1 && (NF-i+1 > min) ) {
if ( i > max ) print newline
newline=$1 "="
pfx=""
}
newline=newline pfx $i
pfx=","
}
print newline
}' "${3:--}"
}
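Notes:
- the min and max values are parameterized so they are taken from the command line
- the file reference is parameterized so it is taken from the command line ($3) or stdin (-)
- OP can add more logic as needed to validate the input parameters
Called on a standalone file: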
$ splitme 3 9 raw.dat
Or called in a pipeline:
$ cat raw.dat | splitme 3 9
Both generate:
animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
animals = shark, salmon, shrimp, mosquito, horseshoe crab
planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto
planets = vulcan, arrakis, hoth, naboo
numbers = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
numbers2 = 1, 2, 3, 4, 5, 6, 7, 8, 9
numbers2 = 10, 11, 12, 13
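The rule the function applies (open a new row every max items, but only while more than min items remain) can also be sketched in Ruby. This is an approximation for illustration: the method reuses the name `splitme` and its min/max arguments, takes a hypothetical `io` argument in place of the file-or-stdin handling, and adds a `rows.empty?` check so the first item of every line always opens a row.

```ruby
require 'stringio'

# Hypothetical Ruby counterpart of splitme; io is any object with each_line.
def splitme(min, max, io)
  io.each_line.flat_map do |line|
    head, *items = line.chomp.split(/\s*[=,]\s*/)
    rows = []
    items.each_with_index do |item, idx|
      remaining = items.size - idx   # items left, including this one
      # Open a new row every max items, but only if more than min items remain.
      rows << [] if rows.empty? || ((idx % max).zero? && remaining > min)
      rows.last << item
    end
    rows.map { |row| "#{head} = #{row.join(', ')}" }
  end
end

puts splitme(3, 9, StringIO.new("numbers = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12\n"))
```

Passing `$stdin` or an opened file instead of the StringIO would correspond to the two shell invocations shown above.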
uj5u.com community member replied:
awk -F"[=,]" -v max="9" '{
  for(i=2; i<=NF; i+=max){
    row = ""
    for(j=i; j<=i+max-1; j++){
      row=row $(j) ","
    }
    gsub(/,+$/, "", row)
    printf "%s=%s \n", $1, row
  }
}' input_file
animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
animals = shark, salmon, shrimp, mosquito, horseshoe crab
planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto
planets = vulcan, arrakis, hoth, naboo
numbers = 1, 2, 3, 4, 5, 6, 7, 8, 9
numbers = 10, 11, 12, 13, 14, 15, 16
cars = mercedes benz, bmw, audi, vw, porsche, seat, skoda, opel, renault
cars = mazda, toyota, honda
uj5u.com community member replied:
Here are two Ruby solutions that process a single line. The variable str holds one line (the line beginning 'animals = ...' in the example).
#1 Use a regular expression
RGX = /\A\w+| *= *|(?:[^,]+, *){0,10}[^,]+\z|(?:[^,]+, *){9}/
def break_line(str)
  headword, _, *lines = str.scan(RGX)
  lines.each { |line| puts "#{headword} = #{line.sub(/, *\z/, '')}" }
end
break_line(str)
animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
animals = shark, salmon, shrimp, mosquito, horseshoe crab
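The boundary between the last two alternatives is easiest to see by counting items per match. The sketch below repeats the pattern under a hypothetical name, `ITEM_RGX`, and uses a made-up helper, `chunk_sizes`, that builds a test line of n numbered items.

```ruby
# Same pattern as in this answer, under a hypothetical name.
ITEM_RGX = /\A\w+| *= *|(?:[^,]+, *){0,10}[^,]+\z|(?:[^,]+, *){9}/

# Hypothetical helper: how many items land on each output line for n items?
def chunk_sizes(n)
  str = "n = #{(1..n).to_a.join(', ')}"
  _head, _eq, *lines = str.scan(ITEM_RGX)
  lines.map { |line| line.split(/, */).size }
end

[11, 12, 20].each { |n| puts "#{n} items -> #{chunk_sizes(n).inspect}" }
```

The third alternative only matches when at most 11 items remain up to the end of the string; otherwise the fourth alternative takes exactly 9 and the scan continues.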
The regular expression can be written in free-spacing mode to make it self-documenting.
RGX =
/
\A           # match beginning of string
\w+          # match one or more word chars (e.g., "animals")
|            # or
[ ]*=[ ]*    # "=" preceded and followed by zero or more spaces
|            # or
(?:          # begin a non-capture group
  [^,]+      # match one or more chars other than a comma
  ,[ ]*      # match a comma and zero or more spaces
){0,10}      # end non-capture group and execute 0-10 times
[^,]+        # match one or more chars other than a comma
\z           # match end of string
|            # or
(?:          # begin a non-capture group
  [^,]+      # match one or more chars other than a comma
  ,[ ]*      # match a comma and zero or more spaces
){9}         # end non-capture group and execute 9 times
/x           # invoke free-spacing regex definition mode
Demo
When executed for the example str we would find the following.
headword
#=> "animals"
_
#=> " = "
lines
#=> ["lizard, bird, bee, snake, whale, eagle, beetle, mule, hare, ",
"goose, horse, mouse, pig, dog, frog, bug, fish, duck, ",
"camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider, ",
"deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit, ",
"elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth, ",
"shark, salmon, shrimp, mosquito, horseshoe crab"]
Ruby has a convention of using the variable _ when its value is not subsequently used in calculations. This is mainly to inform the reader.
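A minimal illustration of that convention, using a hypothetical helper named `head_and_rest`: `String#partition` returns the text before the separator, the separator itself, and the text after it, and the separator is the part nobody needs.

```ruby
# Hypothetical helper showing the underscore convention in a destructuring assignment.
def head_and_rest(line)
  # The middle value (the " = " separator) is received but deliberately ignored.
  headword, _, rest = line.partition(' = ')
  [headword, rest]
end

p head_and_rest('animals = lizard, bird')
```

A side benefit is that Ruby's unused-variable warning is not raised for names that begin with an underscore.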
#2 Extract and group words
def break_line(str)
  headword, *words = str.split(/ *[,=] */)
  groups = words.each_slice(9).to_a
  # Fold a final group of fewer than 3 words into the previous group, if there is one.
  if groups.size > 1 && groups[-1].size < 3
    groups[-2] += groups[-1]
    groups.pop
  end
  groups.each { |group| puts "#{headword} = #{group.join(', ')}" }
end
break_line(str)
animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
animals = shark, salmon, shrimp, mosquito, horseshoe crab
By way of partial explanation, for the example we would obtain the following:
headword
#=> "animals"
words
#=> ["lizard", "bird", ..., "horseshoe crab"]
groups
#=> [["lizard", "bird", "bee", "snake", "whale", "eagle",
"beetle", "mule", "hare"],
["goose", "horse", "mouse", "pig", "dog", "frog",
"bug", "fish", "duck"],
["camel", "squirrel", "owl", "chicken", "pigeon", "lion",
"sheep", "bear", "spider"],
["deer", "tiger", "lobster", "dinosaur", "cat", "goat",
"rat", "cricket", "rabbit"],
["elephant", "crow", "fox", "donkey", "monkey", "butterfly",
"crab", "leopard", "moth"],
["shark", "salmon", "shrimp", "mosquito", "horseshoe crab"]]
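Since the last element of groups contains more than two elements (it contains five), groups is not subsequently modified. If the last line were permitted to have at most 14 (rather than 11) elements, the last element of groups would be changed to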
["elephant", "crow", "fox", "donkey", "monkey", "butterfly", "crab",
"leopard", "moth", "shark", "salmon", "shrimp", "mosquito", "horseshoe crab"]
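uj5u.com community member replied:
It took some time to modify my solution so that it works on both gawk and mawk, and performs the equivalent of $1 = $1 at the tail end of the regex chain;
$(NF!=NF=NF) expands internally to NF != (NF=NF), which is always false, so the whole thing just means $0, but with $1=$1 embedded in it: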
input ::
1 animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare, goose, horse, mouse, pig, dog, frog, bug, fish, duck, camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider, deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit, elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth, shark, salmon, shrimp, mosquito, horseshoe crab
2 planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto-cuz-it-shoudlve-been, planetX
command ::
[mg]awk '
BEGIN {
FS = (OFS = " = ") "*"
_=__ = (___="[^,]+")"[,]"
gsub(".",_,__)
__ = (__)_ "(("_")?("_")?"___"$)?"
_ = ORS } gsub(__,"&"_ $1 OFS) gsub("[,]"_,_) sub((_)" ([^,]*)$","", $(NF!=NF=NF))'
output (mawk 1.3.4) ::
1 animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
2 animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
3 animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
4 animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
5 animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
6 animals = shark, salmon, shrimp, mosquito, horseshoe crab
7 planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto-cuz-it-shoudlve-been, planetX
output (gawk 5.1.1) ::
1 animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
2 animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
3 animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
4 animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
5 animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
6 animals = shark, salmon, shrimp, mosquito, horseshoe crab
7 planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto-cuz-it-shoudlve-been, planetX
Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/470772.html
