The Three Musketeers of Linux Text Processing: awk Study Notes 12, Hands-On Practice

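The exercises in this post come from 駿馬金龍's awk course and from a collection of awk examples. Examples already covered in earlier awk study notes are not repeated here.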

Stripping code comments

# cat comment.txt 
/*AAAAAAAAAA*/    # the comment fills the whole line
1111
222

/*aaaaaaaaa*/
32323
12341234
12134 /*bbbbbbbbbb*/ 132412    # content on both sides of the comment must be kept

14534122
/*    # multi-line comment
    cccccccccc
*/
xxxxxx /*ddddddddddd    # multi-line comment; the content to its left must be kept
    cccccccccc
    eeeeeee
*/ yyyyyyyy    # multi-line comment; the content to its right must be kept
5642341

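Be clear about which parts must be deleted and which must be kept. The trailing "#" remarks in the listing above are annotations, not part of the file.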

# cat comment.awk
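index($0,"/*"){
    if(index($0,"*/")){    # the closing "*/" is on the same line
        # e.g. 12134 /*bbbbbbbbbb*/ 132412
        print gensub("^(.*)/\\*.*\\*/(.*)$","\\1\\2","g")
    }else{    # the closing "*/" is on a later line
        print gensub("^(.*)/\\*.*$","\\1","g")
        while((getline)>0){    # stop on EOF (0) as well as on a read error (-1)
            if(index($0,"*/")){
                print gensub("^.*\\*/(.*)$","\\1","g")
                next    # not break: break would fall through to the rule below and print this line again
            }
        }
    }
}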

!index($0,"/*"){
    print
}

# awk -f comment.awk comment.txt 

1111
222


32323
12341234
12134  132412

14534122


xxxxxx 
 yyyyyyyy
5642341

This script still has room for improvement, for example dropping the empty and whitespace-only lines it leaves behind.
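
One possible take on that cleanup is sketched below. Instead of calling getline it keeps an "inside a comment" flag across records, and it skips any line that ends up empty or whitespace-only. The function name strip_comments and the variable names are my own, and the sketch assumes comment markers never appear inside string literals.

```shell
# Hypothetical helper: strip /* ... */ comments and drop lines left blank.
strip_comments() {
    awk '
    {
        line = $0
        out = ""
        while (line != "") {
            if (in_comment) {
                stop_at = index(line, "*/")
                if (stop_at == 0) { line = ""; break }   # comment continues on the next line
                line = substr(line, stop_at + 2)
                in_comment = 0
            } else {
                open_at = index(line, "/*")
                if (open_at == 0) { out = out line; line = ""; break }
                out = out substr(line, 1, open_at - 1)   # keep the text before the comment
                line = substr(line, open_at + 2)
                in_comment = 1
            }
        }
        if (out ~ /[^ \t]/) print out                    # skip empty and whitespace-only results
    }' "$@"
}
```

Usage: strip_comments comment.txt. Because the loop keeps scanning the rest of the line, several comments on one line are handled as well.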

Matching on the previous and current block

Consider the following file.

# cat order.txt
2019-09-12 07:16:27 [-][
  'data' => [
    'http://192.168.100.20:2800/api/payment/i-order',
  ],
]
2019-09-12 07:16:27 [-][
  'data' => [
    false,
  ],
]
2019-09-21 07:16:27 [-][
  'data' => [
    'http://192.168.100.20:2800/api/payment/i-order',
  ],
]
2019-09-21 07:16:27 [-][
  'data' => [
    'http://192.168.100.20:2800/api/payment/i-user',
  ],
]
2019-09-17 18:34:37 [-][
  'data' => [
    false,
  ],
]

It consists of several blocks, each in roughly this format:

YYYY-MM-DD HH:mm:SS [-][
  'data' => [
    'URL',
  ],
]

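Requirement: find every block that contains "false" and whose preceding block contains "i-order", then print both blocks.

Approach:

The text is regular, so change RS so that each block becomes one record.
Keep the previous block in a variable.
The current block and the previous block must satisfy their conditions at the same time.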

# cat order.awk
BEGIN{
    ORS=RS="]\n"
}
{
    if($0~/false/&&prev~/i-order/){    # prev is still empty on the first record, so this test cannot pass there
        print prev
        print $0
    }
    prev=$0
}
# awk -f order.awk order.txt 
2019-09-12 07:16:27 [-][
  'data' => [
    'http://192.168.100.20:2800/api/payment/i-order',
  ],
]
2019-09-12 07:16:27 [-][
  'data' => [
    false,
  ],
]
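
A multi-character RS such as "]\n" is an extension: gawk and mawk accept it, but POSIX leaves it unspecified. As a hedged alternative, the sketch below collects lines into a block by hand and treats a line consisting of just "]" as the end of a block. The function name false_after_order and the variable block are my own. Unlike the script above, it only compares against complete blocks.

```shell
# Hypothetical helper: print each "false" block together with the
# "i-order" block right before it, without relying on a multi-character RS.
false_after_order() {
    awk '
    { block = block $0 "\n" }            # accumulate the current block
    $0 == "]" {                          # a bare "]" closes the block
        if (block ~ /false/ && prev ~ /i-order/)
            printf "%s%s", prev, block
        prev = block
        block = ""
    }' "$@"
}
```

Usage: false_after_order order.txt.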

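Row/column transposition

Example 1

I consider this a classic exercise; the advanced version in particular touches many aspects of awk.

Let's start with the basic version, which is the original author's.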

# cat RowColumnConvert.txt
ID  name   gender  age  email
1   Bob    male    28   qq.com
2   Alice  female  20   163.com
3   Tony   male    18   gmail.com
4   Kevin  female  30   xyz.com

The goal is to turn the rows into columns.

ID      1       2       3           4
name    Bob     Alice   Tony        Kevin
gender  male    female  male        female
age     28      20      18          30
email   qq.com  163.com gmail.com   xyz.com

The original author's answer:

# cat RowColumnConvert.awk
{
    for(i=1;i<=NF;i++){
        if(typeof(arr[i])=="unassigned"){
            arr[i]=$i
        }else{
            arr[i]=arr[i]"\t"$i
        }
    }
}

END{
    for(i=1;i<=NF;i++){
        print arr[i]
    }
}

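Because this approach builds each output line by joining values with a tab, the layout falls apart as soon as some values are much longer or shorter than the others.

In this example the culprit is "gmail.com" (row 4, column 5 of the original file), which is longer than the other values.

In other words, the script only lines up when the values are short and of similar length.

So let's look at the advanced version, which must keep the output aligned after the transposition.

The original data has to be stored first and printed afterwards. Each value is described by its row, its column, and the value itself, for example "row 3, column 3 is female". That is three pieces of information, so a two-dimensional array fits.
Variable i stands for the row of the original data and variable j for its column. Keep this firmly in mind, otherwise it is easy to get confused.
The original file has the same number of rows and columns, which is misleading, so it is better to change it so that they differ. The run below uses the file with one more record appended (5 Tom male 25 alibaba.com).
Alignment is achieved by working out how many space characters to pad with.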

# cat RowColumnConvert2.awk
{
    for(j=1;j<=NF;j++){
        arr[NR,j]=$j
        len[j]=length($j)
        maxLength[NR]=len[j]>maxLength[NR]?len[j]:maxLength[NR]
    }
}
function cat(count    ,str,x){    # declaring these "local variables" matters, especially if the function reused the names i or j
    for(x=1;x<=count;x++){
        str=str" "
    }
    return str
}
END{
    for(j=1;j<=NF;j++){
        for(i=1;i<=NR;i++){
            if(typeof(brr[j])=="unassigned"){
                brr[j]=arr[i,j]""cat(maxLength[i]-length(arr[i,j]))"  "
            }else{
                brr[j]=brr[j]""arr[i,j]""cat(maxLength[i]-length(arr[i,j]))"  "
            }
        }
        print brr[j]
    }
}
# awk -f RowColumnConvert2.awk RowColumnConvert.txt 
ID      1       2        3          4        5            
name    Bob     Alice    Tony       Kevin    Tom          
gender  male    female   male       female   male         
age     28      20       18         30       25           
email   qq.com  163.com  gmail.com  xyz.com  alibaba.com
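
The padding can also be left to printf. The sketch below is a variation on the same idea, not the author's method: it builds a "%-Ns" format for each output column from the widest value of the corresponding input row. The function name transpose_aligned and the array names are my own. It assumes fields contain no embedded spaces and that every character is one column wide.

```shell
# Hypothetical helper: transpose rows and columns, left-aligning each column.
transpose_aligned() {
    awk '
    {
        if (NF > ncols) ncols = NF
        for (j = 1; j <= NF; j++) {
            cell[NR, j] = $j
            if (length($j) > width[NR]) width[NR] = length($j)   # widest value of input row NR
        }
    }
    END {
        for (j = 1; j <= ncols; j++) {
            line = ""
            for (i = 1; i <= NR; i++)
                line = line sprintf("%-" (width[i] + 2) "s", cell[i, j])
            sub(/ +$/, "", line)                                 # trim the padding after the last column
            print line
        }
    }' "$@"
}
```

Usage: transpose_aligned RowColumnConvert.txt. Tracking ncols while reading avoids depending on the value of NF inside END.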

Example 2

name age
alice 21
ryan 30

Expected result:

name alice ryan
age 21 30

# cat RowColumnConvert3.awk
{
    for(i=1;i<=NF;i++){
        if(typeof(arr[i])=="unassigned"){
            arr[i]=$i
        }else{
            arr[i]=arr[i]" "$i
        }
    }
}
END{
    for(i=1;i<=NF;i++){
        print arr[i]
    }
}
# awk -f RowColumnConvert3.awk test.txt 
name alice ryan
age 21 30
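
typeof() is only available in gawk 4.2 and later, so the scripts in examples 1 and 2 will not run under mawk or other awks. A more portable sketch of the same transposition uses the "in" operator to test whether a column has been started. The function name transpose and the array name col are my own choices.

```shell
# Hypothetical helper: basic transposition without gawk-only features.
transpose() {
    awk '
    {
        if (NF > ncols) ncols = NF
        for (i = 1; i <= NF; i++) {
            if (i in col) col[i] = col[i] " " $i   # column already started: append
            else          col[i] = $i              # first value of this column
        }
    }
    END {
        for (i = 1; i <= ncols; i++) print col[i]
    }' "$@"
}
```

Usage: transpose test.txt.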

Example 3

# cat test.txt
74683 1001
74683 1002
74683 1011
74684 1000
74684 1001
74684 1002
74685 1001
74685 1011
74686 1000
100085 1000
100085 1001

Expected output:

74683 1001 1002 1011
74684 1000 1001 1002
74685 1001 1011
74686 1000
100085 1000 1001

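# cat RowColumnConvert4.awk
{
    if(!($1 in arr)){    # the parentheses are required: "!$1 in arr" parses as "(!$1) in arr"
        arr[$1]=$2
    }else{
        arr[$1]=arr[$1]" "$2
    }
}
END{
    for(i in arr){    # for-in order is unspecified, so the order shown below is not guaranteed
        print i,arr[i]
    }
}
# awk -f RowColumnConvert4.awk test.txt 
74683 1001 1002 1011
74684 1000 1001 1002
74685 1001 1011
74686 1000
100085 1000 1001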
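
A for-in loop visits array keys in an unspecified order, so the listing above comes out in input order only by luck. To guarantee that order, one option is to remember each key the first time it is seen. This is a minimal sketch; the function name group_by_first and the arrays order and vals are my own.

```shell
# Hypothetical helper: group column 2 by column 1, keeping first-seen order.
group_by_first() {
    awk '
    !($1 in vals) { order[++n] = $1; vals[$1] = $2; next }   # first time this key appears
    { vals[$1] = vals[$1] " " $2 }                           # later occurrences: append
    END { for (k = 1; k <= n; k++) print order[k], vals[order[k]] }
    ' "$@"
}
```

Usage: group_by_first test.txt.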

Normalizing whitespace

This relies on the fact that when awk modifies $N it rebuilds $0 using OFS, as described in "Rebuilding fields and records".

# cat chaos.txt
      aaa          bb cccc
dd ee        ff gg
  hhhhh    i                jjjj
# awk 'BEGIN{OFS="\t"}{$1=$1;print}' chaos.txt
aaa    bb    cccc
dd    ee    ff    gg
hhhhh    i    jjjj

In a Linux terminal the columns line up. The misalignment here may come from how the cnblogs "insert code" widget renders tabs.
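
To see the rebuild on its own, the small sketch below prints each record twice: once untouched and once after the $1=$1 assignment. The function name rebuild_demo is mine, and "-" is used as OFS only because it is easier to see than a tab.

```shell
# Hypothetical demo: $0 keeps its original spacing until a field is assigned.
rebuild_demo() {
    awk 'BEGIN { OFS = "-" } { print; $1 = $1; print }' "$@"
}
```

For the input "  a   b  c" the first print shows the line unchanged, while the second shows "a-b-c". Leading whitespace is dropped and every run of blanks becomes a single OFS.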

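Extracting IP addresses

The goal is to pick the IPv4 addresses out of the ifconfig output. We have solved this before; for the reasoning see "Data filtering example" in the note on reading files. Only the answers are given here.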

ifconfig | awk '/inet /&&!/127.0.0.1/{print $2}'
ifconfig | awk 'BEGIN{RS=""}!/^lo/{print $6}'
ifconfig | awk 'BEGIN{RS="";FS="\n"}!/^lo/{FS=" ";$0=$2;print $2;FS="\n"}'

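Reading one section of a configuration file

We use a yum repository configuration file as the example, with comments and empty lines filtered out.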

# grep -vE "^#|^$" /etc/yum.repos.d/CentOS-Base.repo
... ...
[extras]
name=CentOS-$releasever - Extras
mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=extras&infra=$infra
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7
... ...

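The goal is to extract a single section, for example [extras].

Approach 1:

The configuration file is regular, so use the opening square bracket as the record separator.
With that in place, a little patching up yields the data we want.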

# grep -vE "^#|^$" /etc/yum.repos.d/CentOS-Base.repo | awk 'BEGIN{RS="[";ORS=""}/^extras/{print "["$0}'
[extras]
name=CentOS-$releasever - Extras
mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=extras&infra=$infra
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7

Approach 2:

Find the [extras] line and print it.
Then loop with getline and print each line until the next section header "[.+]" appears.

# cat extract.awk
index($0,"[extras]"){
    print
    while((getline)>0){
        if($0~/\[.+\]/){
            break
        }
        print
    }
}
# grep -vE "^#|^$" /etc/yum.repos.d/CentOS-Base.repo | awk -f extract.awk 
[extras]
name=CentOS-$releasever - Extras
mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=extras&infra=$infra
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7
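
A third option, sketched here as a variation on the two approaches above, uses neither RS nor getline. It keeps a flag that is switched on or off whenever a section header is seen. The function name ini_section and the variables want and inside are my own. Comment and blank lines are filtered inside awk, so the grep step becomes optional.

```shell
# Hypothetical helper: print one [section] of an INI-style file.
# $1: section name without brackets; remaining arguments: input files (default: stdin)
ini_section() {
    section=$1
    shift
    awk -v want="[$section]" '
    /^\[.*\]$/ { inside = ($0 == want) }   # every header line turns the flag on or off
    inside && !/^#/ && !/^[ \t]*$/         # print non-comment, non-blank lines of the wanted section
    ' "$@"
}
```

Usage: ini_section extras /etc/yum.repos.d/CentOS-Base.repo.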

De-duplicating on part of $0

First, the sample file.

# cat partDuplicate.txt
2019-01-13_12:00_index?uid=123
2019-01-13_13:00_index?uid=123
2019-01-13_14:00_index?uid=333
2019-01-13_15:00_index?uid=9710
2019-01-14_12:00_index?uid=123
2019-01-14_13:00_index?uid=123
2019-01-15_14:00_index?uid=333
2019-01-16_15:00_index?uid=9710

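If the "uid=xxx" part after the question mark is the same, we treat the records as duplicates and remove the later ones.

The output must preserve the order in which the data first appeared, so we should not store it in an array and then walk the array with an unordered for-in loop.

We already met this idea in the hands-on part of the arrays note.

Approach 1:

Use the question mark as FS and $2 as the array index. For every record awk increments arr[$2]; the first time a value appears arr[$2] becomes 1, so we print only the records for which it is 1.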

# awk 'BEGIN{FS="?"}{arr[$2]++;if(arr[$2]==1){print}}' partDuplicate.txt
2019-01-13_12:00_index?uid=123
2019-01-13_14:00_index?uid=333
2019-01-13_15:00_index?uid=9710

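Approach 2:

We can use "!arr[$2]++" directly as the pattern: it evaluates to 1 the first time a value appears and to 0 every time after that.

The action only needs to print, and the following three forms are equivalent: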

PAT{print $0}
PAT{print}
PAT

The rules for omitting the pattern or the action were covered in an earlier note, so here the pattern alone is enough.

# awk 'BEGIN{FS="?"}!arr[$2]++' partDuplicate.txt
2019-01-13_12:00_index?uid=123
2019-01-13_14:00_index?uid=333
2019-01-13_15:00_index?uid=9710

Counting occurrences

Sample file:

# cat test.txt
portmapper
portmapper
portmapper
portmapper
portmapper
portmapper
status
status
mountd
mountd
mountd
mountd
mountd
mountd
nfs
nfs
nfs_acl
nfs
nfs
nfs_acl
nlockmgr
nlockmgr
nlockmgr
nlockmgr
nlockmgr

# awk '{arr[$0]++}END{for(i in arr){print i"-->"arr[i]}}' test.txt
nfs-->4
status-->2
nlockmgr-->5
portmapper-->6
nfs_acl-->2
mountd-->6

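Counting TCP connection states

See the hands-on part of the arrays note.

Counting IP occurrences in a log by HTTP status code

Requirement: in a web log, count how many times each client IP appears with an HTTP status code other than 200, and list the top 10 in descending order.

The log file was shared on Baidu Netdisk (extraction code: jtlg).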

111.202.100.141 - - [2019-11-07T03:11:02+08:00] "GET /robots.txt HTTP/1.1" 301 169 "-" "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)" "-"

# awk '$8!=200{arr[$1]++} END{PROCINFO["sorted_in"]="@val_num_desc";for(i in arr){if(cnt++==10){break}print arr[i]"-->"i}}' access.log 
896-->60.21.253.82
75-->216.83.59.82
21-->211.95.50.7
21-->61.241.50.63
20-->59.36.132.240
18-->182.254.52.17
16-->50.7.235.2
15-->101.89.19.140
15-->94.102.50.96
13-->198.108.67.80
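
PROCINFO["sorted_in"] is specific to gawk. Where only a POSIX awk is available, the ordering can be handed to sort and head instead. This is a sketch under that assumption; the function name top_non200 and the array name hits are my own.

```shell
# Hypothetical helper: top 10 client IPs with a non-200 status, using sort for ordering.
top_non200() {
    awk '$8 != 200 { hits[$1]++ }
         END { for (ip in hits) print hits[ip] "-->" ip }' "$@" |
        sort -rn | head -n 10
}
```

Usage: top_non200 access.log. Where two IPs have the same count, their relative order is decided by sort and may differ from the gawk listing above.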

Counting distinct IPs

# cat independence.txt
a.com.cn|202.109.134.23|2015-11-20 20:34:43|guest
b.com.cn|202.109.134.23|2015-11-20 20:34:48|guest
c.com.cn|202.109.134.24|2015-11-20 20:34:48|guest
a.com.cn|202.109.134.23|2015-11-20 20:34:43|guest
a.com.cn|202.109.134.24|2015-11-20 20:34:43|guest
b.com.cn|202.109.134.25|2015-11-20 20:34:48|guest

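From this file, count the number of distinct IPs for each domain.

For example, a.com.cn appears on 3 lines but has only 2 distinct IPs, so the information to record is: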

a.com.cn 2

Write each domain and its distinct-IP count to a file named "<domain>.txt".

# awk 'BEGIN{FS="|"} !arr[$1,$2]++{brr[$1]++} END{for(i in brr){print i,brr[i]>(i".txt")}}' independence.txt
# cat a.com.cn.txt 
a.com.cn 2
# cat b.com.cn.txt 
b.com.cn 2
# cat c.com.cn.txt 
c.com.cn 1

Working with two files

There are two files, file1.txt and file2.txt:

# cat file1.txt
50.481  64.634  40.573  1.00  0.00
51.877  65.004  40.226  1.00  0.00
52.258  64.681  39.113  1.00  0.00
52.418  65.846  40.925  1.00  0.00
49.515  65.641  40.554  1.00  0.00
49.802  66.666  40.358  1.00  0.00
48.176  65.344  40.766  1.00  0.00
47.428  66.127  40.732  1.00  0.00
51.087  62.165  40.940  1.00  0.00
52.289  62.334  40.897  1.00  0.00

# cat file2.txt
48.420  62.001  41.252  1.00  0.00
45.555  61.598  41.361  1.00  0.00
45.815  61.402  40.325  1.00  0.00
44.873  60.641  42.111  1.00  0.00
44.617  59.688  41.648  1.00  0.00
44.500  60.911  43.433  1.00  0.00
43.691  59.887  44.228  1.00  0.00
43.980  58.629  43.859  1.00  0.00
42.372  60.069  44.032  1.00  0.00
43.914  59.977  45.551  1.00  0.00

Requirement: replace column 5 of file2.txt with column 1 of file2.txt minus column 1 of file1.txt.

Method 1

# cat twoFile1.awk
{
    num1=$1
    if((getline < "file2.txt")>0){
        $5=$1-num1
        print $0
    }
}
# awk -f twoFile1.awk file1.txt 
48.420 62.001 41.252 1.00 -2.061
45.555 61.598 41.361 1.00 -6.322
45.815 61.402 40.325 1.00 -6.443
44.873 60.641 42.111 1.00 -7.545
44.617 59.688 41.648 1.00 -4.898
44.500 60.911 43.433 1.00 -5.302
43.691 59.887 44.228 1.00 -4.485
43.980 58.629 43.859 1.00 -3.448
42.372 60.069 44.032 1.00 -8.715
43.914 59.977 45.551 1.00 -8.375

Method 2

We would like to pass both file1.txt and file2.txt directly as command arguments, like this:

awk '...rule...' file1.txt file2.txt

# cat twoFile2.awk
NR==FNR{    # NR equals FNR only while awk is reading the first file
    arr[FNR]=$1
}
NR!=FNR{
    $5=$1-arr[FNR]
    print $0
}
# awk -f twoFile2.awk file1.txt file2.txt 
48.420 62.001 41.252 1.00 -2.061
45.555 61.598 41.361 1.00 -6.322
45.815 61.402 40.325 1.00 -6.443
44.873 60.641 42.111 1.00 -7.545
44.617 59.688 41.648 1.00 -4.898
44.500 60.911 43.433 1.00 -5.302
43.691 59.887 44.228 1.00 -4.485
43.980 58.629 43.859 1.00 -3.448
42.372 60.069 44.032 1.00 -8.715
43.914 59.977 45.551 1.00 -8.375
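
One caveat with NR==FNR: if the first file is empty, the test is also true for every line of the second file. A hedged variant compares FILENAME with ARGV[1] instead. The function name subtract_first_column and the array name base are my own.

```shell
# Hypothetical helper: replace column 5 of file $2 with (its column 1 - column 1 of file $1).
subtract_first_column() {
    awk '
    FILENAME == ARGV[1] { base[FNR] = $1; next }        # first file: remember column 1 per line
    (FNR in base)       { $5 = $1 - base[FNR]; print }  # second file: only lines that have a partner
    ' "$1" "$2"
}
```

Usage: subtract_first_column file1.txt file2.txt. Lines of the second file that have no counterpart in the first are skipped rather than computed against an empty value.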

Please credit the source when reposting. Link to this article: https://www.uj5u.com/caozuo/257301.html
