我正在處理一些非常大的檔案,我的簡單 Go 程式需要 2 分鐘才能運行,而不是等效 C 程式需要 15 秒(https://gist.github.com/g2boojum/5729bf75a41f537b8251af25a816c2fc) . 顯然我錯過了一些重要的東西。(這也是我的第一個 Go 程式,所以我確信代碼也很糟糕。)
我正在處理的檔案是 csv 檔案,如下所示,唯一的問題是它們的大小為 GB。
BOARD;CHANNEL;TIMETAG;ENERGY;ENERGYSHORT;FLAGS
0;0;179096000;316;105;0x0
0;0;682168000;449;146;0x0
0;0;905440000;92;35;0x0
我本可以使用 csv 模塊,但 Scanf 為我將欄位轉換為正確的型別,這似乎更簡單。代碼所做的只是跳過標題行,然后逐行讀取檔案的其余部分以創建第 4 個欄位的直方圖,然后在末尾寫入直方圖。
import (
"bufio"
"fmt"
"os"
"log"
)
const num_channels int = 4096
func main () {
if len(os.Args) != 3 {
fmt.Printf("Usage: %s infile outfile\n", os.Args[0])
os.Exit(1)
}
fin, err := os.Open(os.Args[1])
if err != nil {
log.Fatal(err)
}
scanner := bufio.NewScanner(fin)
scanner.Scan() // skip the header line
fout, err := os.Create(os.Args[2])
if err != nil {
log.Fatal(err)
}
fmt.Fprintf(fout, "channel,total\n")
var total[num_channels] uint64
var tmax int64
var board, channel, energy, energyshort, flags int
var timetag int64
for scanner.Scan() {
fmt.Sscanf(scanner.Text(), "%d;%d;%d;%d;%d;%x", &board, &channel, &timetag, &energy, &energyshort, &flags)
total[energy] = 1
tmax = timetag
}
tmax_s := float64(tmax)/1e12
fmt.Println("tmax = ", tmax_s)
for i, val := range total {
fmt.Fprintf(fout, "%v,%v\n", i, float64(val)/tmax_s)
}
}
幫助?謝謝!
[Update, and a solution, and some weirdness]
I simplified things, so I could get a better feeling for what is going on with even simpler code. I removed the header line of the csv file I was using to test everything, and I also created an even shorter version of the csv file that I could share, in case anybody wanted an test case (https://grantgoodyear.org/files/sample60.csv).
Here's a streamlined C code:
#include <stdio.h>
int main(int argc, char* argv[]) {
FILE* fp = fopen(argv[1], "r");
int board, channel, energy, energyshort, flags;
long timetag;
double tmax = 0;
while ((fscanf(fp, "%d;%d;%ld;%d;%d;%x", &board, &channel, &timetag, &energy, &energyshort, &flags)) != EOF) {
tmax = timetag / 1.0e12;
}
printf("tmax: %f\n", tmax);
fclose(fp);
}
It processes the short and the 1.5GB files in 0.16s and 15s, respectively.
$ time cmaxt sample60.csv
tmax: 59.999983
real 0m0.160s
user 0m0.152s
sys 0m0.008s
$ time cmaxt long.csv
tmax: 7200.265511
real 0m14.730s
user 0m14.451s
sys 0m0.255s
In comparison, here's a nearly identical program in go:
import (
"io"
"fmt"
"os"
)
func main () {
fin, _ := os.Open(os.Args[1])
var tmax float64
var board, channel, energy, energyshort, flags int
var timetag int64
for {
_, err := fmt.Fscanf(fin,"%d;%d;%d;%d;%d;%x", &board, &channel, &timetag, &energy, &energyshort, &flags)
if err == io.EOF {
break
}
tmax = float64(timetag)/1e12
}
fmt.Println("tmax = ", tmax)
}
It takes an amazingly long time to run:
$ time gomaxt sample60.csv
tmax = 59.999983344
real 0m8.044s
user 0m4.677s
sys 0m3.555s
$ time gomaxt long.csv
tmax = 7200.265510652
real 18m37.472s
user 10m58.221s
sys 8m28.282s
I have no idea what's happening here, but it's taking 50-75 times longer than the C version. It's especially weird that the sys time is so long.
I changed the go version so that it more resembles my original post, using a bufio.NewScanner and fmt.Sscanf to do the splitting:
import (
"bufio"
"fmt"
"os"
)
func main () {
fin, _ := os.Open(os.Args[1])
scanner := bufio.NewScanner(fin)
var tmax float64
var timetag int64
var board, channel, energy, energyshort, flags int
for scanner.Scan() {
fmt.Sscanf(scanner.Text(), "%d;%d;%d;%d;%d;%x", &board, &channel, &timetag, &energy, &energyshort, &flags)
tmax = float64(timetag)/1e12
}
fmt.Println("tmax = ", tmax)
}
This version only(?!) takes 6 times longer than the C version:
$ time gomaxtorig sample60.csv
tmax = 59.999983344
real 0m0.895s
user 0m0.905s
sys 0m0.038s
$ time gomaxtorig long.csv
tmax = 7200.265510652
real 1m53.030s
user 2m1.039s
sys 0m3.021s
So now let's replace fmt.Sscanf with string splitting:
import (
"bufio"
"fmt"
"os"
"strconv"
"strings"
)
func main () {
fin, _ := os.Open(os.Args[1])
scanner := bufio.NewScanner(fin)
var tmax float64
var timetag int64
for scanner.Scan() {
ss := strings.Split(scanner.Text(), ";")
timetag, _ = strconv.ParseInt(ss[2], 10, 64)
tmax = float64(timetag)/1e12
}
fmt.Println("tmax = ", tmax)
}
As suggested, the majority of time actually was being spent in fmt.Sscanf. This version takes slightly twice as long as the C version:
$ time ../gammaprocess_go/maxtsplit sample60.csv
tmax = 59.999983344
real 0m0.226s
user 0m0.243s
sys 0m0.022s
$ time ../gammaprocess_go/maxtsplit long.csv
tmax = 7200.265510652
real 0m26.434s
user 0m28.834s
sys 0m1.683s
I did write a slightly hacky version that forced the string conversion of the other fields in each row of the csv file, just to see if that made any difference, and the times were essentially identical to the above version.
Because I'm using Scanner.Text(), there is a lot of strings created and then thrown out, and it was suggested that I use bytes instead of strings. It looks to me like that's what the csv package does, so I just used it:
import (
"encoding/csv"
"fmt"
"os"
"io"
"strconv"
)
func main () {
fin, _ := os.Open(os.Args[1])
r := csv.NewReader(fin)
r.Comma = ';'
var tmax float64
var timetag int64
for {
rec, err := r.Read()
if err == io.EOF {
break
}
timetag, _ = strconv.ParseInt(rec[2], 10, 64)
tmax = float64(timetag)/1e12
}
fmt.Println("tmax = ", tmax)
}
The timing is just slightly longer than the string-splitting version with Scanner.Text*():
$ time gomaxtcsv sample60.csv
tmax = 59.999983344
real 0m0.281s
user 0m0.300s
sys 0m0.019s
$ time gomaxtcsv long.csv
tmax = 7200.265510652
real 0m32.736s
user 0m35.619s
sys 0m1.702s
but presumably there's more overhead since the csv package is doing a lot more than my simple example, so I'd say that's inconclusive.
Regardless, I'm fine with a factor of 2 inefficiency, so I'm not going to try to keep optimizing this. Many thanks to the folks who answered this question.
[Another update]
Or just look at https://groups.google.com/g/golang-nuts/c/W08rFBcHKbc/m/oIrdQcBxKa4J, from 2014.
uj5u.com熱心網友回復:
sscanf 占用大部分時間。做:
ss := strings.Split(scanner.Text(), ";")
board, _ = strconv.Atoi(ss[0])
channel, _ = strconv.Atoi(ss[1])
timetag, _ = strconv.Atoi(ss[2])
energy, _ = strconv.Atoi(ss[3])
flags, _ = strconv.ParseUint(ss[4], 16, 64)
省略檢查錯誤。
轉載請註明出處,本文鏈接:https://www.uj5u.com/shujuku/423086.html
標籤:
