Comparing two columns across CSV files

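I have two CSV files. The third column of the first file holds an arbitrary number of rows of data, and the first column of the second file holds similar data, also in an arbitrary number of rows. All of these values are MD5 hashes, for example: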

File_1

column_1	column_2	column_3
blah blah blah	blah blah blah	aa7744226c695c0b2e440419848cf700
blah blah blah	blah blah blah	9b34939b137e24f8c6603a54b2305f07
blah blah blah	blah blah blah	ad1172b28f277eab7ca91f96f13a242b
`etc`

File_2

column_1	column_2	column_3
49269f413284abfa58f41687b6f631e0	blah blah blah	blah blah blah
a0879ff97178e03eb18470277fbc7056	blah blah blah	blah blah blah
9e5b91c360d6be29d556db7e1241ce82	blah blah blah	blah blah blah
`etc`

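Can you tell me how to compare these two columns across the two files, i.e. find the duplicate values and, when a value is duplicated, show the matching rows from the first and the second CSV file?

I tried to get somewhere starting from this example: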

import csv
interesting_cols = [0, 2, 3, 4, 5]
with open("/root/file1.csv", 'r') as file1,\
     open("/root/file2.csv", 'r') as file2:
    reader1, reader2 = csv.reader(file1), csv.reader(file2)
    for line1, line2 in zip(reader1, reader2):
        equal = all(x == y for n, (x, y) in enumerate(zip(line1, line2)) if n in interesting_cols)
        print(equal)

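This example would work well if each of the two files had only one column. I have not managed to adapt it to my requirements in any way; my Python is very weak. Thank you very much!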

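A helpful uj5u.com user replied:

If you are allowed to, you can use pandas to do this. First install the package with pip: python -m pip install pandas

or with conda: conda install pandas

Then read the files and compare them with pandas: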

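import pandas as pd

# Load both files into DataFrames (the first row is used as the header).
file1 = pd.read_csv("/root/file1.csv")
file2 = pd.read_csv("/root/file2.csv")
# compare() diffs the frames cell by cell, so both files must have the
# same shape and the same column labels.
comp = file1.compare(file2)
# to_markdown() needs the optional "tabulate" package to be installed.
print(comp.to_markdown())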

Alternatively, if you want to keep the 'with' statement, you should create a class and define the __enter__ and __exit__ methods:

import pandas as pd

class DataCSV:
    """Context manager that loads a CSV file into a DataFrame."""
    def __init__(self, file) -> None:
        self.filename = file
    def __enter__(self):
        # read_csv() opens and closes the file by itself
        self.file = pd.read_csv(self.filename)
        return self.file
    def __exit__(self, exc_type, exc_value, traceback):
        # nothing to release: no file handle is kept open
        pass

with DataCSV("/root/file1.csv") as file1, DataCSV("/root/file2.csv") as file2:
    # cell-by-cell diff; both frames need the same shape and labels
    comp = file1.compare(file2)
    print(comp.to_markdown())

The output should look like this:

	('column_1', 'self')	('column_1', 'other')	('column_3', 'self')	('column_3', 'other')
0	blah blah blah	49269f413284abfa58f41687b6f631e0	aa7744226c695c0b2e440419848cf700	blah blah blah
1	blah blah blah	a0879ff97178e03eb18470277fbc7056	9b34939b137e24f8c6603a54b2305f07	blah blah blah
2	blah blah blah	9e5b91c360d6be29d556db7e1241ce82	ad1172b28f277eab7ca91f96f13a242b	blah blah blah

A helpful uj5u.com user replied:

You can reorder the columns of the rows and use a generator to check them quickly.

import csv

def parse_csv(filename, header=False, delim=',', quotechar='"'):
    # Lazily yield the rows of a CSV file; skip the first row if header is True.
    with open(filename, 'r', newline='') as f:
        csvfile = csv.reader(f, delimiter=delim, quotechar=quotechar)
        if header:
            next(csvfile, None)
        for row in csvfile:
            yield row

def diff(l1, l2, reorder=None):
    # Yield (index, row) for every row of l1 that does not appear in l2.
    if reorder:
        # Rearrange the columns of each l2 row so they line up with l1.
        l2 = [[line[x] for x in reorder] for line in l2]
    for i, line in enumerate(l1):
        if line not in l2:
            yield i, line

filename1 = ''  # path to the first CSV file
filename2 = ''  # path to the second CSV file
reorder = [2, 1, 0]  # new column order applied to the rows of the second file

missing = list(diff(parse_csv(filename1), list(parse_csv(filename2)), reorder=reorder))
print(missing)
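
The check above compares whole rows and reports the rows that are missing. For the question as asked (column 3 of the first file against column 1 of the second), a set lookup on just those two columns may be closer to what is wanted. The following is only a minimal sketch under assumptions: the helper names `column_values` and `rows_with_shared_value` and the sample data are made up for illustration, and both files are assumed to start with a header row.

```python
import csv
import io

def column_values(lines, col, header=True):
    """Return the set of values found in column `col` of a CSV source."""
    reader = csv.reader(lines)
    if header:
        next(reader, None)  # skip the header row
    return {row[col] for row in reader if len(row) > col}

def rows_with_shared_value(lines1, col1, lines2, col2, header=True):
    """Yield (row_number, row) for each row of the first source whose value
    in `col1` also occurs in column `col2` of the second source."""
    wanted = column_values(lines2, col2, header)
    reader = csv.reader(lines1)
    if header:
        next(reader, None)
    for number, row in enumerate(reader, start=1):
        if len(row) > col1 and row[col1] in wanted:
            yield number, row

# Made-up sample data standing in for the two files.
FILE1 = (
    "column_1,column_2,column_3\n"
    "blah,blah,aa7744226c695c0b2e440419848cf700\n"
    "blah,blah,9b34939b137e24f8c6603a54b2305f07\n"
)
FILE2 = (
    "column_1,column_2,column_3\n"
    "49269f413284abfa58f41687b6f631e0,blah,blah\n"
    "9b34939b137e24f8c6603a54b2305f07,blah,blah\n"
)

matches = list(rows_with_shared_value(io.StringIO(FILE1), 2, io.StringIO(FILE2), 0))
print(matches)
```

With real files, the `io.StringIO` objects would be replaced by handles from `open(path, newline='')`. Membership tests against a set take roughly constant time, whereas `line not in l2` scans the whole list for every row.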

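A helpful uj5u.com user replied:

The csv module documentation says that each row read from a CSV file is returned as a list of strings. You can read individual columns from those rows.

For example:

Using two simple CSV files,
addresses.csv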

Doe,John,120 jefferson st.,Riverside, NJ, 08075
McGinnis,Jack,220 hobo Av.,Phila, PA,09119
Repici,"John ""Da Man""",120 Jefferson St.,Riverside, NJ,08075
Tyler,Stephen,"7452 Terrace ""At the Plaza"" road",SomeTown,SD, 91234

and

phones.csv

John,Doe,19122
Jack,McGinnis,20220
"John ""Da Man""",Repici,1202134
Stephen,Tyler,72384


>>> import csv
>>> with open('addresses.csv') as file1, open('phones.csv') as file2:
...     r1, r2 = csv.reader(file1), csv.reader(file2)
...     for line1, line2 in zip(r1, r2):
...         if line1[1] == line2[0]:
...             print('found a duplicate', line1[1])
...
found a duplicate John
found a duplicate Jack
found a duplicate John "Da Man"
found a duplicate Stephen

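We get the rows that have the same value in the specified columns. In our case these are the second column of the first CSV file and the first column of the second CSV file. To get the row numbers as well, you can use enumerate(zip()), as in the example code you provided.

You can read up on Python list comprehensions to understand the syntax used in your example.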
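
The enumerate(zip()) idea can be sketched as follows. This is only an illustration: the function name `same_row_matches` and the shortened sample data are made up, and the rows of the second sample are deliberately shuffled to show that zip() pairs rows by position, so a value is reported only when it sits on the same row number in both files.

```python
import csv
import io

# Made-up, shortened sample data; the phone rows are in a different order.
ADDRESSES = (
    "Doe,John,120 jefferson st.\n"
    "McGinnis,Jack,220 hobo Av.\n"
    "Tyler,Stephen,7452 Terrace road\n"
)
PHONES = (
    "John,Doe,19122\n"
    "Stephen,Tyler,72384\n"
    "Jack,McGinnis,20220\n"
)

def same_row_matches(lines1, col1, lines2, col2):
    """Return (row_number, value) for rows where column `col1` of the first
    source equals column `col2` of the second source on the same row."""
    r1, r2 = csv.reader(lines1), csv.reader(lines2)
    return [
        (number, line1[col1])
        for number, (line1, line2) in enumerate(zip(r1, r2), start=1)
        if line1[col1] == line2[col2]
    ]

found = same_row_matches(io.StringIO(ADDRESSES), 1, io.StringIO(PHONES), 0)
print(found)
```

Only the first row is reported here, even though Jack and Stephen appear in both samples, because they are on different row numbers. When the two files are not sorted the same way, a lookup that ignores row position is needed instead.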

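A helpful uj5u.com user replied:

My answer works across all records in the files. It finds matches among all records of file1 and file2.

Reverse each row of the first file with reader1 = [i[::-1] for i in reader1], so that its last column comes first.
Combine the two lists: reader = reader1 + reader2
Build a dictionary that maps each key to the indices of all rows carrying that key.
Finally, print the results of the search.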

import csv

with open("file1.csv", 'r') as file1,\
     open("file2.csv", 'r') as file2:
    reader1, reader2 = csv.reader(file1), csv.reader(file2)

    # Reverse the rows of file1 so that its last column becomes item[0].
    reader1 = [i[::-1] for i in reader1]
    reader2 = [i for i in reader2]
    # One combined list: rows of file1 first, then rows of file2.
    reader = reader1 + reader2

    # Maps each key (first item of a row) to the indices of the rows holding it.
    dictionary_of_records = dict()

    for i, item in enumerate(reader):
        key = item[0]
        if key in dictionary_of_records:
            dictionary_of_records[key].append(i)
        else:
            dictionary_of_records[key] = list()
            dictionary_of_records[key].append(i)

    # A key that was seen on more than one row is a match.
    for key, value in dictionary_of_records.items():
        if len(value) > 1:
            print(f"Match for {key}")
            for index in value:
                print(' '.join(reader[index]))
        else:
            print(f"No match for {key}")
        print("-----------------------------")

PS: this is fairly hard-coded, I think. You could also look at the pandas library or itertools to find a neater approach.
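
One way the dictionary approach might be made less hard-coded is to pass the key column of each file explicitly and group rows with collections.defaultdict. The sketch below is an assumption-laden illustration, not the answer's own code: the names `group_by_key` and `cross_file_matches`, the labels and the sample data are invented. Unlike the code above, it counts a key as a match only when it occurs in more than one file, so a value repeated inside a single file is not reported.

```python
import csv
import io
from collections import defaultdict

def group_by_key(sources):
    """Group rows by key. `sources` is an iterable of
    (label, lines, key_column) tuples, one per CSV source."""
    groups = defaultdict(list)
    for label, lines, col in sources:
        for row in csv.reader(lines):
            if len(row) > col:  # ignore rows that are too short
                groups[row[col]].append((label, row))
    return groups

def cross_file_matches(groups):
    """Keep only the keys that occur in more than one source."""
    return {
        key: rows
        for key, rows in groups.items()
        if len({label for label, _ in rows}) > 1
    }

# Made-up sample data: the key is column 3 in file1 and column 1 in file2.
FILE1 = "x,y,k1\nx,y,k2\n"
FILE2 = "k2,p,q\nk3,p,q\n"

result = cross_file_matches(group_by_key([
    ("file1", io.StringIO(FILE1), 2),
    ("file2", io.StringIO(FILE2), 0),
]))
for key, rows in result.items():
    print(f"Match for {key}")
    for label, row in rows:
        print(f"  {label}: {' '.join(row)}")
```

Keeping the label next to each row means the output can say which file a matching row came from, which is what the question asks to display.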

Please credit the source when reposting. Link to this article: https://www.uj5u.com/qianduan/353548.html
