Split a large CSV file into multiple files by column

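I'd like to know a fast, efficient way, in any language (awk/perl/python), to split a CSV file (say, 10k columns) into multiple small files, each containing 2 columns. I will be running this on a Unix machine.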

#contents of large_file.csv
1,2,3,4,5,6,7,8
a,b,c,d,e,f,g,h
q,w,e,r,t,y,u,i
a,s,d,f,g,h,j,k
z,x,c,v,b,n,m,z

I now want multiple files like this:

# contents of 1.csv
1,2
a,b
q,w
a,s
z,x

# contents of 2.csv
1,3
a,c
q,e
a,d
z,c

# contents of 3.csv
1,4
a,d
q,r
a,f
z,v

and so on...

I can currently do this with awk for small files (say, 30 columns), like this:

awk -F, 'BEGIN{OFS=",";} {for (i=1; i < NF; i++) print $1, $(i+1) > (i ".csv")}' large_file.csv

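This takes a very long time on large files, and I'd like to know whether there is a faster, more efficient way to do the same thing.

Thanks in advance.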

A uj5u.com user replied:

The main bottleneck here is writing so many files.

Here is one way:

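use warnings;
use strict;
use feature 'say';

my $file = shift // die "Usage: $0 csv-file\n";

# Slurp the whole file; each element of @lines is one CSV row.
my @lines = do { local @ARGV = $file; <> };
chomp @lines;

# Open one output filehandle per non-first column: f1.csv, f2.csv, ...
# (field count minus one, so no empty extra file is created).
my @fhs = map {
    open my $fh, '>', "f${_}.csv" or die $!;
    $fh
}
1 .. scalar( split /,/, $lines[0] ) - 1;

# Pair the first field with each remaining field and write it to that column's file.
for (@lines) {
    my ($first, @cols) = split /,/;
    say {$fhs[$_]} join(',', $first, $cols[$_])
        for 0..$#cols;
}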

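I haven't timed this against any other approach. Assembling the data for each file first and then dumping it to each file in a single operation may help, but first let us know how large the original CSV file is.

Keeping this many output files open at once (the filehandles in @fhs) may cause problems. If it does, the simplest fix is to assemble all the data first and then open and write one file at a time: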

use warnings;
use strict;
use feature 'say';

my $file = shift // die "Usage: $0 csv-file\n";

open my $fh, '<', $file or die "Can't open $file: $!";

# $data[$i] collects every "first,value" row destined for file number $i+1.
my @data;
while (<$fh>) {
    chomp;
    my ($first, @cols) = split /,/;
    push @{$data[$_]}, join(',', $first, $cols[$_])
        for 0..$#cols;
}
close $fh;

# Write the files one at a time; each handle is closed when $out goes out of scope.
for my $i (0..$#data) {
    open my $out, '>', ($i + 1) . '.csv' or die $!;
    say $out $_ for @{$data[$i]};
}

This depends on whether the entire original CSV file, plus a bit more, fits in memory.
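
For readers who prefer Python (which the question also allows), the same assemble-first idea might look roughly like the sketch below. The function names `group_by_column` and `write_groups` are made up for illustration, and the code assumes plain comma-separated fields with no quoting, just like the `split /,/` above.

```python
from pathlib import Path


def group_by_column(lines):
    """Return one list of "first,value" rows per non-first column."""
    groups = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        first, *rest = line.split(",")
        # Grow the list of groups if this row is wider than any seen so far.
        while len(groups) < len(rest):
            groups.append([])
        for i, value in enumerate(rest):
            groups[i].append(f"{first},{value}")
    return groups


def write_groups(groups, out_dir="."):
    """Write groups[0] to 1.csv, groups[1] to 2.csv, and so on."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for n, rows in enumerate(groups, start=1):
        # write_text opens, writes and closes, so one file is open at a time.
        (out / f"{n}.csv").write_text("\n".join(rows) + "\n")
```

Calling `write_groups(group_by_column(open("large_file.csv")))` would write the numbered files into the current directory. Like the Perl version, it keeps all of the data in memory before anything is written.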

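A uj5u.com user replied:

With your shown samples, please try the following awk code. Because you open all the output files together, it may fail with the infamous "too many open files" error. To avoid that, this code collects all the values in an array and, in the END block, prints them one file at a time, closing each output file as soon as its contents have been written.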

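awk '
BEGIN{ FS=OFS="," }
{
  # Append "first,field" to the buffer kept for each non-first column.
  for(i=1;i<NF;i++){
    value[i]=(value[i]?value[i] ORS:"") ($1 OFS $(i+1))
  }
}
END{
  # Write each buffer to its own file and close it before opening the next.
  for(i=1;i<NF;i++){
    outFile=i".csv"
    print value[i] > (outFile)
    close(outFile)
  }
}
' large_file.csv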
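
If the data does not fit in memory either, a middle ground is to reread the input once per batch of columns, keeping only a bounded number of output files open at a time. The Python sketch below is one way this could be done, not a tested drop-in. The names `default_batch_size`, `column_batches` and `split_in_batches` are hypothetical, the `reserve` and `fallback` numbers are arbitrary guesses, fields are assumed to be unquoted, and the `resource` module is Unix-only.

```python
import resource
from contextlib import ExitStack
from pathlib import Path


def default_batch_size(reserve=16, fallback=256):
    """Guess how many output files can be open at once (Unix only)."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return fallback
    # Leave some descriptors for stdin/stdout/stderr and the input file.
    return max(1, soft - reserve)


def column_batches(n_cols, batch_size):
    """Yield (start, stop) field-index ranges covering indices 1..n_cols-1."""
    for start in range(1, n_cols, batch_size):
        yield start, min(start + batch_size, n_cols)


def split_in_batches(src, out_dir=".", batch_size=None):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    if batch_size is None:
        batch_size = default_batch_size()
    # The width of the first row decides how many output files there are.
    with open(src) as fh:
        n_cols = len(fh.readline().rstrip("\r\n").split(","))
    for start, stop in column_batches(n_cols, batch_size):
        # One pass over the input per batch; ExitStack closes every output.
        with ExitStack() as stack, open(src) as fh:
            outs = [stack.enter_context(open(out / f"{i}.csv", "w"))
                    for i in range(start, stop)]
            for line in fh:
                fields = line.rstrip("\r\n").split(",")
                for i, f in zip(range(start, stop), outs):
                    if i < len(fields):  # tolerate short rows
                        f.write(f"{fields[0]},{fields[i]}\n")
```

The trade-off is extra reads of the input in exchange for bounded memory and a bounded number of open files. With 10k columns and a batch size near 1000, that would be about ten passes over the file.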

A uj5u.com user replied:

Here is a solution that uses the Text::CSV module.

#! /usr/bin/env perl

use warnings;
use strict;
use utf8;
use open qw<:std :encoding(utf-8)>;
use autodie;
use feature qw<say>;
use Text::CSV;

# Maps each output filename ("1.csv", "2.csv", ...) to its list of rows.
my %hsh = ();

my $csv = Text::CSV->new({ sep_char => ',' });

print "Enter filename: ";
chomp(my $filename = <STDIN>);

open (my $ifile, '<', $filename);

while (<$ifile>) {
    chomp;
    if ($csv->parse($_)) {
        my @fields = $csv->fields();
        my $first = shift @fields;
        # each on an array (Perl 5.12+) yields index/value pairs.
        while (my ($i, $v) = each @fields) {
            push @{$hsh{($i + 1).".csv"}}, "$first,$v";
        }
    } else {
        die "Line could not be parsed: $_\n";
    }
}

close($ifile);

# Write one output file at a time, so only one output handle is open at once.
while (my ($k, $v) = each %hsh) {
    open(my $ofile, '>', $k);
    say {$ofile} $_ for @$v;
    close($ofile);
}

exit(0);
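
A real CSV parser matters once fields can contain quoted commas, which a plain split on `,` would break apart. As a rough Python counterpart, the standard `csv` module can both parse and re-quote such fields. In the sketch below, `split_quoted` is a made-up name, and the function returns the file contents as strings rather than writing them, so that it is easy to try out.

```python
import csv
import io


def split_quoted(lines):
    """Map "N.csv" to CSV text pairing column 1 with column N+1."""
    buffers = {}
    writers = {}
    for row in csv.reader(lines):
        if not row:
            continue  # csv.reader yields [] for blank lines
        first, rest = row[0], row[1:]
        for i, value in enumerate(rest, start=1):
            name = f"{i}.csv"
            if name not in writers:
                buffers[name] = io.StringIO()
                writers[name] = csv.writer(buffers[name], lineterminator="\n")
            # The writer re-quotes any field that contains a comma or a quote.
            writers[name].writerow([first, value])
    return {name: buf.getvalue() for name, buf in buffers.items()}
```

For a real file, pass `open(path, newline="")` as `lines` and write each returned string to the file named by its key. Like the Perl script, this holds all of the output in memory.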

Please credit the source when reposting. Link to this article: https://www.uj5u.com/qiye/379639.html

Tags: perl awk data-processing
