How do I group duplicates in the second column and assign them a new number in a third column?

I have a file that looks like this:

1, C10 C11 N3 O1
2, C19 C23 O2
3, C19 N2 O2
4, C10 C11 O1
5, C11 N3 O1
6, C13 C8 O3
7, C8 N5 O3

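The first column is the group number and the second column lists the items in that group. I want to search the second column and count how many strings in one row match strings in any other row. Then, for every group that has two or more matches, I need to put a new number in a third column.

Example: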

1, C10 C11 N3 O1
4, C10 C11 O1
5, C11 N3 O1

Groups 1, 4 and 5 all have two or more matching strings, so they would all be assigned to one new group, like this:

1, C10 C11 N3 O1, 1
4, C10 C11 O1, 1
5, C11 N3 O1, 1

I am new to this and struggling with it. Any help is appreciated. Thanks.

Edit: I tried to write my code like this, but I can't get it to work.

while read -r line; do 
     awk '$1 !=$1 && $2 == $2 {print $0}'
done

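If a row could belong to two groups, I would put it in the first group that fits. That detail is not too important, as long as the row ends up in a matching group. Preferably, though, the group should contain both of the matching strings.

A uj5u.com user replied:

Here is a python answer that might help. If you like, I can explain the logic so that it can be coded in awk: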

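#!/usr/bin/env python3
import sys
import collections

groups = {}                                      # group number -> list of items
items_in_groups = collections.defaultdict(list)  # item -> group numbers containing it

# First pass: read the file line by line and build both lookups.
with open(sys.argv[1]) as f:
  for line in f:
    if "," not in line:
      continue  # skip blank or malformed lines instead of failing on the unpack
    group_num, items = line.split(",", 1)
    group_num = group_num.strip()
    groups[group_num] = items.split()
    for item in groups[group_num]:
      items_in_groups[item].append(group_num)

# Second pass: label each group with the first group that contains
# one of its shared items, or N/A when no item is shared.
for group, items in groups.items():
  for item in items:
    if len(items_in_groups[item]) >= 2:
      print(f'{group},{" ".join(items)},{items_in_groups[item][0]}')
      break
  else:
    print(f'{group},{" ".join(items)},N/A')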

The code above (seems like a lot, doesn't it?) produces the following output:

1,C10 C11 N3 O1,1
2,C19 C23 O2,2
3,C19 N2 O2,2
4,C10 C11 O1,1
5,C11 N3 O1,1
6,C13 C8 O3,6
7,C8 N5 O3,6

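Essentially, I am making two passes over your data. In the first pass, we read the data into a very simple associative array (a dict in python) called groups. groups maps a group number to a list of items (e.g. 3 --> C19,N2,O2). Here are both dicts once they are populated:

groups: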

{'1': ['C10', 'C11', 'N3', 'O1'], '2': ['C19', 'C23', 'O2'], '3': ['C19', 'N2', 'O2'], '4': ['C10', 'C11', 'O1'], '5': ['C11', 'N3', 'O1'], '6': ['C13', 'C8', 'O3'], '7': ['C8', 'N5', 'O3']}

items_in_groups:

{'C10': ['1', '4'], 'C11': ['1', '4', '5'], 'N3': ['1', '5'], 'O1': ['1', '4', '5'], 'C19': ['2', '3'], 'C23': ['2'], 'O2': ['2', '3'], 'N2': ['3'], 'C13': ['6'], 'C8': ['6', '7'], 'O3': ['6', '7'], 'N5': ['7']}

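items_in_groups is also built during the first pass over the data: for every item we find, we add the group we found it in to that item's list. For example, we found "C10" in both group 1 and group 4.

Finally, by counting the entries in items_in_groups, we can look for items with 2 or more matches. We iterate over the groups (in the order they were read from the file; recent versions of python preserve dict insertion order). Then we iterate over each item in the group and check whether that item appears in more than one group. As soon as we find such an item, we stop and print the current group, its items, and the first group in the item's list of matching groups.

Edit: simplified the code. The input file is now read line by line, and the final print statements are easier to read. Lastly, added some code to handle the no-match case.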

A uj5u.com user replied:

Here is another answer, but this time written in awk. I basically translated the python into awk.

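# "$1" below is the input file passed to the enclosing shell script
awk -F, '
NF >= 2 { # first pass: index the data (blank lines are skipped)
  order[ ++n ] = $1   # remember input order; "for (x in arr)" order is unspecified
  groups[ $1 ] = $2
  count = split( $2, items, " " )
  for ( i = 1; i <= count; i++ ) {
    items_in_groups[ items[i] ] = items_in_groups[ items[i] ] " " $1
  }
}
END {  # second pass: label each group
  for ( g = 1; g <= n; g++ ) {
    group = order[g]
    count = split( groups[group], items, " " )
    for ( i = 1; i <= count; i++ ) {
      # split() returns how many groups contain this item
      if ( split( items_in_groups[ items[i] ], in_groups, " " ) >= 2 ) {
        print group "," groups[group] ", " in_groups[1]
        break
      }
    }
  }
}
' "$1"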

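A uj5u.com user replied:

python, awk, and now here comes ruby.

The algorithm first indexes the data in associative arrays: one array lists the items of each group, and another lists the groups each item belongs to.
Then, for each group, it searches for the top match among the groups that share at least one item with it (discarding groups that share only one item with it).
Finally, it assigns each group to the same group as its top match or, if the top match has not been assigned to any group yet, starts a new group.
This should fit your specification.

Edit: removed the renumbering of new groups from the code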

#!/usr/bin/env ruby

# my naming sense sucks but it makes it easy to read the code below
items_of = Hash.new { |h,k| h[k] = [] }   # group => its items
groups_of = Hash.new { |h,k| h[k] = [] }  # item => groups that contain it

ARGF.each_line do |line|
  groupID,itemsStr = line.split(',')
  next unless itemsStr                    # skip blank or malformed lines
  items_of[groupID] = itemsStr.split
  items_of[groupID].each {|itemID| groups_of[itemID] << groupID}
end

assigned_group_of = {}

items_of.each do |groupID,itemsArr|
  # best match: the other group sharing the most items, ignoring groups that share fewer than 2
  bestMatch = itemsArr.map { |itemID|
    groups_of[itemID]
  }.flatten.uniq.reject{ |gID|
    gID == groupID || (items_of[groupID] & items_of[gID]).count < 2
  }.max { |a,b|
    (items_of[groupID] & items_of[a]).count - (items_of[groupID] & items_of[b]).count
  }
  # join the best match's group if it already has one, otherwise start a new group
  assigned_group_of[groupID] = (bestMatch.nil? || assigned_group_of[bestMatch].nil?) ? groupID : assigned_group_of[bestMatch]

  print "#{groupID}, #{items_of[groupID].join(' ')}, #{assigned_group_of[groupID]}\n"
end

Note: the script accepts input from standard input, from files, or from both (exactly like the cat command); e.g. cat file1 file2 file3 | script.rb or script.rb file1 file2 file3 or cat file2 | script.rb file1 - file3
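For anyone adapting the python answer, a similar cat-like input behaviour is available from the standard library's fileinput module. This is only a minimal sketch; the helper name read_lines is hypothetical and does not appear in any of the answers.

```python
import fileinput

def read_lines(files=None):
    """Yield lines from the named files.

    With files=None, fileinput falls back to sys.argv[1:], and to
    standard input when no arguments were given; "-" also means stdin.
    """
    with fileinput.input(files=files) as stream:
        for line in stream:
            yield line.rstrip("\n")
```

Calling read_lines() with no argument inside a script would then let it be used as script.py file1 file2 or cat file1 | script.py.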

Sample output:
Edit: added an eighth entry to show how the behaviour differs from @Mark's algorithm.

1, C10 C11 N3 O1, 1
2, C19 C23 O2, 2
3, C19 N2 O2, 2
4, C10 C11 O1, 1
5, C11 N3 O1, 1
6, C13 C8 O3, 6
7, C8 N5 O3, 6
8, C10 C19 C23, 2
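To make the difference on entry 8 concrete, here is a rough sketch of the rule from the python answer (label a group with the first group that contains any of its shared items), wrapped in a function so it can be run on the same eight entries. The function name label_first_shared and the sample list are assumptions made for illustration.

```python
from collections import defaultdict

def label_first_shared(lines):
    """Label each group with the first group that contains one of its shared items."""
    groups = {}                          # group number -> list of items
    items_in_groups = defaultdict(list)  # item -> group numbers containing it
    for line in lines:
        group_num, sep, items = line.partition(",")
        if not sep:
            continue                     # skip blank or malformed lines
        group_num = group_num.strip()
        groups[group_num] = items.split()
        for item in groups[group_num]:
            items_in_groups[item].append(group_num)

    labels = {}
    for group, items in groups.items():
        # The first item that occurs in two or more groups decides the label.
        hit = next((i for i in items if len(items_in_groups[i]) >= 2), None)
        labels[group] = items_in_groups[hit][0] if hit is not None else "N/A"
    return labels

sample = [
    "1, C10 C11 N3 O1",
    "2, C19 C23 O2",
    "3, C19 N2 O2",
    "4, C10 C11 O1",
    "5, C11 N3 O1",
    "6, C13 C8 O3",
    "7, C8 N5 O3",
    "8, C10 C19 C23",
]
print(label_first_shared(sample)["8"])  # prints 1
```

Under this rule entry 8 lands in group 1, because C10 is its first item and C10 was first seen in group 1, even though entry 8 shares only that one item with group 1 and two items (C19 and C23) with group 2.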

Behaviour worth thinking about

Depending on the order of the entries, you can get different results:

1, A10 B10, 1
2, A10 B10 C20 D20, 1
3, C20 D20, 1

1, A10 B10, 1
2, C20 D20, 2
3, A10 B10 C20 D20, 1
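The order dependence can be reproduced with a python sketch of the best-match procedure described in this answer. It is an approximation rather than a line-by-line port: the name assign_groups is hypothetical, and ties between equally good matches are broken in favour of the candidate encountered first, which is an assumption of this sketch.

```python
from collections import defaultdict

def assign_groups(lines):
    """Return {group_id: assigned_group_id} using a best-match rule."""
    items_of = {}                   # group id -> list of items
    groups_of = defaultdict(list)   # item -> group ids containing it
    for line in lines:
        group_id, sep, items_str = line.partition(",")
        if not sep:
            continue                # skip blank or malformed lines
        group_id = group_id.strip()
        items_of[group_id] = items_str.split()
        for item in items_of[group_id]:
            groups_of[item].append(group_id)

    assigned = {}
    for group_id, items in items_of.items():
        # Other groups sharing at least one item, in first-seen order.
        others = dict.fromkeys(
            g for item in items for g in groups_of[item] if g != group_id
        )
        shared = {g: len(set(items) & set(items_of[g])) for g in others}
        # Keep only groups that share two or more items.
        candidates = [g for g, n in shared.items() if n >= 2]
        best = max(candidates, key=lambda g: shared[g], default=None)
        # Join the best match's group if it has one, else start a new group.
        assigned[group_id] = assigned.get(best, group_id)
    return assigned

first = ["1, A10 B10", "2, A10 B10 C20 D20", "3, C20 D20"]
second = ["1, A10 B10", "2, C20 D20", "3, A10 B10 C20 D20"]
print(assign_groups(first))   # {'1': '1', '2': '1', '3': '1'}
print(assign_groups(second))  # {'1': '1', '2': '2', '3': '1'}
```

In the second ordering, "C20 D20" is processed before its only match has been assigned anywhere, so it starts its own group, and the combined entry that arrives later joins group 1 instead.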

Update for @Mark

More precisely, the ruby version of your code would be:

#!/usr/bin/env ruby
groups = {}
items_in_groups = Hash.new { |h,k| h[k] = [] }

ARGF.each_line do |line|
  group_num,items = line.split(',')
  next unless items                       # skip blank or malformed lines
  groups[ group_num ] = items.split()
  groups[ group_num ].each {|item| items_in_groups[ item ] << group_num}
end

groups.each do |group,items|
  item = items.find {|i| items_in_groups[ i ].count >=2}
  print "#{group}, #{items.join(' ')}, #{item.nil? ? "N/A" : items_in_groups[ item ][0]}\n"
end
