Removing duplicate entries from files: optimizing performance

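Every day I need to scan several hundred thousand files and remove duplicate entries from them. Each of these files in turn contains a few thousand records.

Sample input file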

2019-10-04,3.9,3.29,5.85,6.15
2019-10-05,3.8,7.02,5.69,6.83
2019-10-05,3.8,8.02,8.69,1.83
2019-10-07,1.8,1.02,4.69,7.83

This is the script I wrote for it. It takes about an hour or more to finish.

Script

#!/bin/bash

LOOKUP_DIR="/path/to/source_files"
CLEANEDUP_DIR="/path/to/cleaned_content"


remove_dup(){
    fname=${1}
    awk -F"," 'prev && ($1 != prev) {print seen[prev]} {seen[$1] = $0; prev = $1} END {print seen[$1]}' "${fname}" > "${CLEANEDUP_DIR}/${fname}"
}

cd ${LOOKUP_DIR}
for k in *.csv
do 
    remove_dup "${k}" &
done

wait

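Duplicates are detected by looking at the first field: if there are multiple entries for that field (the date, in this example), only the last line with that date should be kept and the others removed.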
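To make the rule concrete, here is a minimal sketch of "keep the last line for each first field", run against the sample input above. The function name `dedup_last` is hypothetical. Unlike the script above, this version does not assume that lines with the same date are adjacent.

```shell
#!/bin/sh
# Hypothetical helper: keep only the last line seen for each value of field 1,
# printing keys in the order they first appeared. Reads stdin, writes stdout.
dedup_last() {
    awk -F, '
        !($1 in last) { order[++n] = $1 }   # remember first-seen order of keys
        { last[$1] = $0 }                   # later lines overwrite earlier ones
        END { for (i = 1; i <= n; i++) print last[order[i]] }
    '
}

# Sample input from the question.
printf '%s\n' \
    '2019-10-04,3.9,3.29,5.85,6.15' \
    '2019-10-05,3.8,7.02,5.69,6.83' \
    '2019-10-05,3.8,8.02,8.69,1.83' \
    '2019-10-07,1.8,1.02,4.69,7.83' | dedup_last
```

For the sample this prints three lines, and the 2019-10-05 entry that survives is the one ending in 8.69,1.83.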

Is there a way to optimize the logic I have written?

Reply from a uj5u.com user:

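If I understand your question and you want to remove any duplicate records from each file, then use a pair of arrays in awk. The first uses a counter as its index, so the record order is preserved, and stores the 5 fields joined by SUBSEP as its value. The second array is indexed by the 5 fields joined by SUBSEP and stores the record itself as its value. This allows a simple check, using the index in array test, of whether the 5 fields have already been seen.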

There is no need to write the awk script inside remove_dup(). Simply write an executable awk script and call it from remove_dup(). The script can be:

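#!/usr/bin/awk -f

BEGIN { FS="," }

# Skip any record whose five fields have already been seen.
{ if ($1 SUBSEP $2 SUBSEP $3 SUBSEP $4 SUBSEP $5 in array)
    next
  # order[] preserves input order; array[] maps the joined fields to the record.
  order[++n] = $1 SUBSEP $2 SUBSEP $3 SUBSEP $4 SUBSEP $5
  array[$1,$2,$3,$4,$5] = $0
}

# Print the retained records in their original order.
END {
  for (i=1; i<=n; i++)
    print array[order[i]]
}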

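(Above, a record is stored only when its joined fields do not already exist as an index in array. This ensures all duplicates are removed, keeping the first occurrence in order and discarding all others.)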
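As a rough illustration of that behavior, the following sketch applies the same two-array idea to the question's sample plus one line repeated verbatim. Because the key is all five fields, the two 2019-10-05 records, which differ in their later fields, are both kept, and only the exact repeat is dropped. The function name `dedup_exact` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: drop records whose five fields were all seen before,
# keeping the first occurrence and the original order.
dedup_exact() {
    awk -F, '
        ($1,$2,$3,$4,$5) in seen { next }            # exact repeat: skip it
        { seen[$1,$2,$3,$4,$5] = $0                   # index by the five fields
          order[++n] = $1 SUBSEP $2 SUBSEP $3 SUBSEP $4 SUBSEP $5 }
        END { for (i = 1; i <= n; i++) print seen[order[i]] }
    '
}

# Sample input from the question, with the 7.02 record repeated once.
printf '%s\n' \
    '2019-10-04,3.9,3.29,5.85,6.15' \
    '2019-10-05,3.8,7.02,5.69,6.83' \
    '2019-10-05,3.8,8.02,8.69,1.83' \
    '2019-10-05,3.8,7.02,5.69,6.83' \
    '2019-10-07,1.8,1.02,4.69,7.83' | dedup_exact
```

This prints four lines. If only the last line per date is wanted, as the question describes, the index would need to be the first field alone.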

You can then change your script to:

#!/bin/bash

LOOKUP_DIR="/path/to/source_files"
CLEANEDUP_DIR="/path/to/cleaned_content"
AWKSCRIPT="/path/to/executable/awkscript"

remove_dup(){
    fname=${1}
    $AWKSCRIPT "${fname}" > "${CLEANEDUP_DIR}/${fname}"
}

cd ${LOOKUP_DIR}
for k in *.csv
do 
    remove_dup "${k}" &
done

wait

(Note the addition of the path to the executable awkscript, stored in the variable AWKSCRIPT.)

That should do what you are after.

Reply from a uj5u.com user:

Try:

tac thefile | sort -urst, -k1,1
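For readers unfamiliar with the flags, this sketch wraps the pipeline in a function and runs it on the question's sample. It assumes GNU coreutils: `tac`, and a `sort` whose `-u` keeps the first line of each run of equal keys. `tac` reverses the file so the last line for each date comes first. `-t, -k1,1` compares only the first comma-separated field, `-s` disables the last-resort whole-line comparison, `-u` keeps one line per key, and `-r` sorts in descending order. The function name `last_per_key` and the temporary file are hypothetical.

```shell
#!/bin/sh
# Assumes GNU coreutils for tac and for the behavior of sort -u.
# Hypothetical wrapper around the one-liner; $1 is the input file.
last_per_key() {
    tac "$1" | sort -urst, -k1,1
}

# Demonstration on the sample input from the question.
thefile=$(mktemp)
printf '%s\n' \
    '2019-10-04,3.9,3.29,5.85,6.15' \
    '2019-10-05,3.8,7.02,5.69,6.83' \
    '2019-10-05,3.8,8.02,8.69,1.83' \
    '2019-10-07,1.8,1.02,4.69,7.83' > "$thefile"
last_per_key "$thefile"
rm -f "$thefile"
```

Because of `-r`, the result comes out in descending date order, so a final `sort` or `tac` may be needed if the original ascending order matters.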

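Optimizing performance

Rewrite it in a programming language. Don't use processes; use a thread for each file. For scripting, use Python or Ruby. For a compiled language, use C or C++. This took less than an hour to write, and it is very likely much faster than fork()ing a new process for each file: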

#include <map>
#include <future>
#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <utility>
#include <algorithm>
#include <filesystem>
#include <ostream>
#include <vector>

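// Read one CSV file and keep only the last line seen for each first field.
// std::map keeps its keys sorted, so the output is ordered by the first field.
std::string algo(const std::filesystem::path& file) {
    std::map<std::string, std::string> lines;
    std::ifstream ffile(file);
    std::string line;
    std::string field;
    size_t pos;
    while (std::getline(ffile, line)) {
        // Find the length of the first comma-separated field.
        pos = 0;
        for (auto&& c : line) {
            if (c == ',') {
                break;
            }
            pos++;
        }
        field = line.substr(0, pos);
        // A later line with the same key overwrites the earlier one.
        lines.insert_or_assign(std::move(field), std::move(line));
    }
    std::ostringstream of;
    for (auto&& i : lines) {
        of << i.second << '\n';
    }
    return of.str();
}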

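int main(int argc, char *argv[]) {
    if (argc < 2) {
        std::cerr << "usage: " << argv[0] << " <file-or-directory>\n";
        return 1;
    }
    const std::filesystem::path p(argv[1]);
    // One (file name, pending result) pair per regular file in the directory.
    std::vector<
        std::pair<
            std::string, std::future<std::string>
            >
        > results;
    if (std::filesystem::is_regular_file(p)) {
        std::cout << algo(p) << '\n';
    } else {
        for (auto&& f : std::filesystem::directory_iterator(p)) {
            if (f.is_regular_file()) {
                // std::async may run algo on another thread or defer it until get().
                results.emplace_back(f.path().string(), std::async(algo, f.path()));
            }
        }
    }
    for (auto&& r : results) {
        std::cout
            << "=== " << r.first << " ===\n\n"
            << r.second.get() << '\n';
    }
}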
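The same idea of avoiding one process per file can also be approximated without leaving the shell, by letting a single awk process handle a whole batch of files. The following is only a sketch under that assumption, not the answer's method. The function name `dedup_batch`, its argument layout, and the temporary directory are hypothetical. It keeps the last line per first field, as the question asks.

```shell
#!/bin/sh
# Hypothetical helper: dedup_batch OUTDIR FILE...
# One awk process handles every FILE and writes OUTDIR/<basename> for each,
# keeping the last line per first field. Empty input files produce no output.
dedup_batch() {
    outdir=$1
    shift
    awk -F, -v outdir="$outdir" '
        # Write the finished file, then reset the per-file state.
        function emit(   i, out) {
            if (name == "") return
            out = outdir "/" name
            for (i = 1; i <= n; i++) print last[order[i]] > out
            close(out)
            split("", last); split("", order); n = 0
        }
        FNR == 1 { emit(); name = FILENAME; sub(/.*\//, "", name) }
        !($1 in last) { order[++n] = $1 }
        { last[$1] = $0 }
        END { emit() }
    ' "$@"
}

# Demonstration on two small files in a temporary directory.
work=$(mktemp -d)
mkdir "$work/in" "$work/out"
printf '%s\n' '2019-10-05,3.8,7.02,5.69,6.83' '2019-10-05,3.8,8.02,8.69,1.83' > "$work/in/a.csv"
printf '%s\n' '2019-10-04,3.9,3.29,5.85,6.15' '2019-10-07,1.8,1.02,4.69,7.83' > "$work/in/b.csv"
dedup_batch "$work/out" "$work/in"/*.csv
cat "$work/out/a.csv" "$work/out/b.csv"
rm -rf "$work"
```

With hundreds of thousands of files, a single `*.csv` expansion may exceed the system's argument-length limit, so the files would probably need to be passed in batches. Whether this beats the threaded program above would have to be measured.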

Please credit the source when reposting. Link to this article: https://www.uj5u.com/yidong/388580.html

Tags: bash shell
