Wrap delimited lines, keeping the first column, with a minimum final length

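I want to split lines of content while keeping the header.

I do a lot of text processing, and I like using unix one-liners because they are easy to organize over time (compared with lots of scripts), I can easily chain them together, and I enjoy (re)learning how to use the classic Unix tools. Usually I use a short awk, perl, or ruby one-liner, depending on which is most elegant.

Here I have lines with X comma-delimited items. I want to split them up, keeping the header.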

Input:

animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare, goose, horse, mouse, pig, dog, frog, bug, fish, duck, camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider, deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit, elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth, shark, salmon, shrimp, mosquito, horseshoe crab

Output:

animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
animals = shark, salmon, shrimp, mosquito, horseshoe crab

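Algorithm details:

An input line consists of a headword, then an equals sign, then a comma-separated list of at least 1 item.
In this example most items are a single word, but items can contain spaces (e.g. "horseshoe crab" at the end)
Split into groups of 9 items, unless there are <3 left over, in which case the final split may yield 12 on a line
There are multiple lines. For example, the next line might be planets.

I had the idea of escaping the spaces, then using unix fold, then using awk to pull down the first column. This produces exactly the output above: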

echo "animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare, goose, horse, mouse, pig, dog, frog, bug, fish, duck, camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider, deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit, elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth, shark, salmon, shrimp, mosquito, horseshoe crab" \
| \tr ' ,' '_ ' \
| fold -s \
| perl -pe 's/=/\t/; s/^_/\t_/g;' \
| awk 'BEGIN{FS=OFS="\t"} $1==""{$1=p} {p=$1} 1' \
| tr '\t _'  '=, '

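But it only considers character length (not item count), and it does not handle my special case, where I don't want <3 items dangling on the last line.

I think this is an elegant little puzzle. Any ideas?

A helpful uj5u.com user replied:

With Perl, one way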

perl -wlnE'
    ($head, @items) = split /\s*[,=]\s*/; 
    while (@items) { 
        @elems = splice @items, 0, 9;
        if (@elems < 3) { $lines[-1] .= ", " . join ", ", @elems }
        else            { push @lines, join ", ", @elems }
    }
    say "$head = $_" for @lines; @lines = ()
' file

Or

perl -wlnE'
    ($head, @items) = split /\s*[,=]\s*/; 
    push @lines, join ", ", splice @items, 0, 9  while @items; 
    $lines[-2] .= ", " . pop @lines  if 2 > $lines[-1] =~ tr/,//;
    say "$head = $_" for @lines; @lines = ()
' file

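Shown over multiple lines for readability; it can be copy-pasted into a bash terminal as is, or entered on a single line. Tested with 11 (9+2) items added.

Notes

split on either , or = to extract the headword first, and then the items on the line
splice removes and returns the (first 9) elements, which are joined with ", " to form a line to print, until all elements are processed. If the last group has fewer than 3 elements, it is appended to the previous line to be printed
In the second version all elements are processed first, and then the last line to print is checked for whether it needs to be appended to the previous one, by counting the commas in it

A helpful uj5u.com user replied:

You could consider this awk: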

awk 'BEGIN {FS=OFS=" = "} {
   s = $2
   while (match(s, /([^,]+, ){1,9}(([^,]+, ){2}[^,]+$)?/)) {
      v = substr(s, RSTART, RLENGTH)
      sub(/, $/, "", v)
      print $1, v
      s = substr(s, RLENGTH + 1)
   }
}' file

animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
animals = shark, salmon, shrimp, mosquito, horseshoe crab

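Pay special attention to the regular expression used here: /([^,]+, ){1,9}(([^,]+, ){2}[^,]+$)?/

This matches 1 to 9 words separated by the ", " delimiter. The regex also has an optional part that matches up to 3 more words before the end of the line.

A helpful uj5u.com user replied:

Using only your shown samples, please try the following awk program. Written and tested in GNU awk; it should work in any awk.

I created an awk variable named numberOfFields, which holds the number of fields to print per line (after which a new line starts, as in the shown samples).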

awk  -v numberOfFields="9" '
BEGIN{
  FS=", ";OFS=", "
}
{
  line=$0
  sub(/ = .*/,"",line)
  sub(/^[^ ]* =[^ ]* /,"")
  for(i=1;i<=NF;i++){
    printf("%s",(i%numberOfFields==0?OFS $i ORS line" = ":\
    (i==1?line " = " $i:(i%numberOfFields>1?OFS $i:$i))))
  }
}
END{
  print ""
}
'  Input_file

The code above has the printf statement split over two lines (for readability). If you want it on a single line, try the following instead:

awk  -v numberOfFields="9" '
BEGIN{
  FS=", ";OFS=", "
}
{
  line=$0
  sub(/ = .*/,"",line)
  sub(/^[^ ]* =[^ ]* /,"")
  for(i=1;i<=NF;i++){
    printf("%s",(i%numberOfFields==0?OFS $i ORS line" = ":(i==1?line " = " $i:(i%numberOfFields>1?OFS $i:$i))))
  }
}
END{
  print ""
}
'  Input_file

Explanation: adding a detailed explanation of the above.

awk  -v numberOfFields="9" '            ##Starting awk program from here, creating variable named numberOfFields and setting its value to 9 here.
BEGIN{                                  ##Starting BEGIN section of awk here.
  FS=", ";OFS=", "                      ##Setting FS and OFS to comma space here.
}
{
  line=$0                               ##Setting value of $0 to line here.
  sub(/ = .*/,"",line)                  ##Substituting space = space everything till last of value in line with NULL.
  sub(/^[^ ]* =[^ ]* /,"")              ##Substituting from starting till first occurrence of space followed by = followed by again first occurrence of space with NULL in current line.
  for(i=1;i<=NF;i++){                   ##Running for loop here for all fields.
    printf("%s",(i%numberOfFields==0?OFS $i ORS line" = ":\  ##Using printf and its conditions are explained below of code.
    (i==1?line " = " $i:(i%numberOfFields>1?OFS $i:$i))))
  }
}
END{                                    ##Starting END block of this program from here.
  print ""                              ##Printing newline here.
}
'  Input_file                           ##Mentioning Input_file name here.

Explanation of the printf condition above:

(
  i%numberOfFields==0                   ##checking if modulus value of i%numberOfFields is 0 here, if this is TRUE:
    ?OFS $i ORS line" = "               ##Then printing OFS $i ORS line" = "(comma space field value new line line variable and space = space)
    :(i==1                              ##If very first condition is FALSE then checking again if i==1
       ?line " = " $i                   ##Then print line variable followed by space = space followed by $i
       :(i%numberOfFields>1?OFS $i:$i)  ##Else if modulus value of i%numberOfFields is greater than 1 then print OFS $i else print $i.
     )
)
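
The nested ternary above may be easier to follow as an if/elsif chain. What follows is only a minimal Ruby sketch of the same decision logic, not the answer's own code; the method name piece and its parameters are made up for illustration.

```ruby
# Hypothetical helper mirroring the awk printf ternary for field number i (1-based).
# head is the headword, n plays the role of numberOfFields, sep the role of OFS.
def piece(i, field, head, n, sep = ", ")
  if (i % n).zero?
    # last field of a full group: emit it, end the line, start the next header
    "#{sep}#{field}\n#{head} = "
  elsif i == 1
    # very first field: the header comes first
    "#{head} = #{field}"
  elsif i % n > 1
    # middle of a group: separator, then the field
    "#{sep}#{field}"
  else
    # first field of a later group: its header was already emitted
    field.to_s
  end
end

fields = %w[a b c d e f g h i j]
out = fields.each_with_index.map { |f, idx| piece(idx + 1, f, "letters", 9) }.join
puts out
```

With ten fields this prints a line of nine items followed by a second line holding the tenth. As in the awk code, a field count that is an exact multiple of n leaves a trailing header with nothing after it, and the <3 rule from the question is not applied.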

A helpful uj5u.com user replied:

One awk idea:

awk -F'[=,]' -v min=3 -v max=9 '
{ for (i=2; i<=NF; i++) {
      if ( (i-1) % max == 1 && (NF-i+1 > min) ) {
         if ( i > max ) print newline
         newline=$1 "="
         pfx=""
      }
      newline=newline pfx $i
      pfx=","
  }
  print newline
}
' raw.dat

Sample data:

$ cat raw.dat
animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare, goose, horse, mouse, pig, dog, frog, bug, fish, duck, camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider, deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit, elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth, shark, salmon, shrimp, mosquito, horseshoe crab
planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto, vulcan, arrakis, hoth, naboo
numbers = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
numbers2 = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13

With -v min=3 -v max=9 we get:

animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
animals = shark, salmon, shrimp, mosquito, horseshoe crab
planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto
planets = vulcan, arrakis, hoth, naboo
numbers = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
numbers2 = 1, 2, 3, 4, 5, 6, 7, 8, 9
numbers2 = 10, 11, 12, 13

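Addressing OP's comment about using one-liners...

While this awk script could certainly be jammed into a single line, I'm guessing OP would a) find it hard to edit/maintain and b) find it too easy to mess up when having to (re)type it over and over.

One (obvious?) idea is to wrap the awk code in a function, e.g.: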

splitme() {
    awk -F'[=,]' -v min=$1 -v max=$2 '
    { for (i=2; i<=NF; i++) {
          if ( (i-1) % max == 1 && (NF-i+1 > min) ) {
             if ( i > max ) print newline
             newline=$1 "="
             pfx=""
          }
          newline=newline pfx $i
          pfx=","
      }
      print newline
    }' "${3:--}"
}

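Notes:

parameterized the min and max values so they are taken from the command line
parameterized the file reference so it is taken from the command line ($3) or from stdin (-)
OP can add more logic as needed to validate the input parameters

Whether called on a standalone file:

$ splitme 3 9 raw.dat

or called in a pipeline:

$ cat raw.dat | splitme 3 9

both generate: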

animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
animals = shark, salmon, shrimp, mosquito, horseshoe crab
planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto
planets = vulcan, arrakis, hoth, naboo
numbers = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
numbers2 = 1, 2, 3, 4, 5, 6, 7, 8, 9
numbers2 = 10, 11, 12, 13

A helpful uj5u.com user replied:

awk -F"[=,]" -v max="9" '{
        for(i=2; i<=NF; i+=max){
                row = ""
                for(j=i; j<=i+max-1; j++){
                        row=row $(j) ","
                }
                gsub(/,+$/, "", row)
                printf "%s=%s \n", $1, row
        }
    }' input_file

animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
animals = shark, salmon, shrimp, mosquito, horseshoe crab
planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto
planets = vulcan, arrakis, hoth, naboo
numbers = 1, 2, 3, 4, 5, 6, 7, 8, 9
numbers = 10, 11, 12, 13, 14, 15, 16
cars = mercedes benz, bmw, audi, vw, porsche, seat, skoda, opel, renault
cars = mazda, toyota, honda

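A helpful uj5u.com user replied:

Here are two Ruby solutions that process one line. The variable str holds a line (the line beginning 'animals = ...' in the example).

#1 Use a regular expression

RGX = /\A\w+| *= *|(?:[^,]+, *){0,10}[^,]+\z|(?:[^,]+, *){9}/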

def break_line(str)
  headword, _, *lines = str.scan(RGX)
  lines.each { |line| puts "#{headword} = #{line.sub(/, *\z/, '')}" }
end

break_line(str)
animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
animals = shark, salmon, shrimp, mosquito, horseshoe crab

The regular expression can be written in free-spacing mode to make it self-documenting.

RGX =
/
\A         # match beginning of string
\w+        # match one or more word chars (e.g., "animals")
|          # or
[ ]*=[ ]*  # "=" preceded and followed by zero or more spaces
|          # or
(?:        # begin a non-capture group
  [^,]+    # match one or more chars other than a comma
  ,[ ]*    # match a comma and zero or more spaces
){0,10}    # end non-capture group and execute 0-10 times
[^,]+      # match one or more chars other than a comma
\z         # match end of string
|          # or
(?:        # begin a non-capture group
  [^,]+    # match one or more chars other than a comma
  ,[ ]*    # match a comma and zero or more spaces
){9}       # end non-capture group and execute exactly 9 times
/x         # invoke free-spacing regex definition mode

Demo

When this is executed for the example str we find the following.

headword
  #=> "animals"
_
  #=> " = "
lines
  #=> ["lizard, bird, bee, snake, whale, eagle, beetle, mule, hare, ",
       "goose, horse, mouse, pig, dog, frog, bug, fish, duck, ",
       "camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider, ",
       "deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit, ",
       "elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth, ",
       "shark, salmon, shrimp, mosquito, horseshoe crab"]

Ruby has a convention of using the variable _ when its value is not used in subsequent calculations. This is mainly to inform the reader.
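
As a small illustration of that convention, here is a sketch with a hand-written array standing in for the result of scan; the array contents are made up for the example.

```ruby
# parts stands in for what str.scan(RGX) might return (illustrative data only).
parts = ["animals", " = ", "lizard, bird, ", "bee"]

# The separator in the middle is not needed later, so it is bound to _.
headword, _, *lines = parts

puts headword
puts lines.inspect
```

Here headword receives the first element, _ swallows the separator, and the splat collects everything that remains into lines.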

#2 Extract and group words

def break_line(str)
  headword, *words = str.split(/ *[,=] */)
  groups = words.each_slice(9).to_a
  if groups.size > 1 && groups[-1].size < 3
    groups[-2] += groups[-1]
    groups.pop
  end
  groups.each { |group| puts "#{headword} = #{group.join(', ')}" }
end

break_line(str)
animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
animals = shark, salmon, shrimp, mosquito, horseshoe crab

By way of partial explanation, for the example we obtain the following intermediate values:

headword
  #=> "animals"
words
  #=> ["lizard", "bird", ..., "horseshoe crab"]
groups
  #=> [["lizard", "bird", "bee", "snake", "whale", "eagle",
        "beetle", "mule", "hare"],
       ["goose", "horse", "mouse", "pig", "dog", "frog",
        "bug", "fish", "duck"],
       ["camel", "squirrel", "owl", "chicken", "pigeon", "lion",
        "sheep", "bear", "spider"],
       ["deer", "tiger", "lobster", "dinosaur", "cat", "goat",
        "rat", "cricket", "rabbit"],
       ["elephant", "crow", "fox", "donkey", "monkey", "butterfly",
        "crab", "leopard", "moth"],
       ["shark", "salmon", "shrimp", "mosquito", "horseshoe crab"]]

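Since the last element of groups contains more than two elements (it contains five), groups is not subsequently modified. If the last line were permitted to contain up to 14 (rather than 11) elements, the last element of groups would be changed to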

["elephant", "crow", "fox", "donkey", "monkey", "butterfly", "crab",
 "leopard", "moth", "shark", "salmon", "shrimp", "mosquito", "horseshoe crab"]
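
The question mentions that the input has several lines. The following is a minimal sketch, assuming the same approach as #2, that returns the wrapped lines instead of printing them and applies the rule to each input line. The names wrap_line, max_items and min_last are illustrative and not part of the answer.

```ruby
# Sketch: split "head = a, b, c, ..." into lines of at most max_items items,
# folding a final group smaller than min_last into the previous line.
def wrap_line(str, max_items: 9, min_last: 3)
  headword, *words = str.strip.split(/ *[,=] */)
  groups = words.each_slice(max_items).to_a
  if groups.size > 1 && groups[-1].size < min_last
    last = groups.pop        # remove the short final group
    groups[-1] += last       # and append its items to the previous group
  end
  groups.map { |group| "#{headword} = #{group.join(', ')}" }
end

# Made-up sample input: 11 items (9 + 2 left over) and a short 3-item line.
input = <<~DATA
  planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto, vulcan, arrakis
  numbers = 1, 2, 3
DATA

input.each_line { |line| puts wrap_line(line) }
```

In this sample the planets line stays on one line of 11 items, because the 2 leftover items are folded back. To read from files or standard input instead of the heredoc, one option would be ARGF.each_line.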

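A helpful uj5u.com user replied:

It took some time to revise my solution so that it works on both gawk and mawk, by performing the equivalent of $1 = $1 at the tail end of the regex chain.

$(NF!=NF=NF) expands internally to NF != (NF=NF), which is always false, so the whole thing simply means $0, but with $1=$1 embedded in it: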

 input ::

     1  animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare, goose, horse, mouse, pig, dog, frog, bug, fish, duck, camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider, deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit, elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth, shark, salmon, shrimp, mosquito, horseshoe crab
     2  planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto-cuz-it-shoudlve-been, planetX

 command ::

 [mg]awk '
 BEGIN {
     FS = (OFS = " = ") "*" 
   _=__ = (___="[^,]+")"[,]"
           gsub(".",_,__)
     __ = (__)_ "(("_")?("_")?"___"$)?"
 
      _ = ORS } gsub(__,"&"_ $1 OFS) gsub("[,]"_,_) sub((_)"+([^,]*)$","", $(NF!=NF=NF))' 

 output (mawk 1.3.4) ::

     1  animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
     2  animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
     3  animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
     4  animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
     5  animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
     6  animals = shark, salmon, shrimp, mosquito, horseshoe crab
     7  planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto-cuz-it-shoudlve-been, planetX

 output (gawk 5.1.1) ::

     1  animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
     2  animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
     3  animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
     4  animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
     5  animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
     6  animals = shark, salmon, shrimp, mosquito, horseshoe crab
     7  planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto-cuz-it-shoudlve-been, planetX

Please credit the source when reprinting. Link to this article: https://www.uj5u.com/houduan/470772.html

Tags: ruby perl awk command-line