What is the fastest way to read a file line by line?

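I wrote a Python program that reads a file line by line and performs some averaging and summation.

I need suggestions for speeding it up.

The number of lines in pressurefile is currently 945,670 (and it will grow).

ORIGINAL CODE: This is the original version I posted. Following your suggestions, I have been optimizing the code and have posted the latest version at the end.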

    import math
    import numpy as np

    # mem, amu_to_kg, ang3_to_m3 and writelog are defined elsewhere in the program.
    def time_average():
        try:
            filename = mem.pressurefile
            navg = mem.NFRAMES
            dz = mem.dz
            zlo = mem.zlo
            NZ = mem.NZ
            mass = mem.mass

            dens_fact = amu_to_kg / (mem.slab_V * ang3_to_m3)

            array_pxx = np.zeros([NZ,1])
            array_pyy = np.zeros([NZ,1])
            array_pzz = np.zeros([NZ,1])
            array_ndens = np.zeros([NZ,1])

            array_density = np.zeros([NZ,1])
            array_enthalpy = np.zeros([NZ,1])
            array_surf_tens = np.zeros([NZ,1])

            counter = 0
            with open(filename) as f:
                for line in f:
                    line.strip("\n")
                    #content = [_ for _ in line.split()]
                    content = line.split()
                    # Only atom rows have exactly 7 fields; header lines are skipped
                    if len(content) == 7:
                        z = float(content[3]) - zlo
                        pxx = float(content[4])
                        pyy = float(content[5])
                        pzz = float(content[6])

                        # Slab index along z, wrapped periodically into [0, NZ)
                        loc = math.floor(z/dz)
                        if loc >= NZ:
                            loc = loc - NZ
                        elif loc < 0:
                            loc = loc + NZ
                        #print(z, loc, zlo)

                        array_pxx[loc] += pxx
                        array_pyy[loc] += pyy
                        array_pzz[loc] += pzz
                        array_ndens[loc] += 1
                    counter += 1
            # Average over the number of frames
            for col in range(NZ):
                array_pxx[col] /= navg
                array_pyy[col] /= navg
                array_pzz[col] /= navg
                array_ndens[col] /= navg
                array_density[col] = mass * dens_fact * array_ndens[col]

            return (array_density, array_enthalpy, array_surf_tens)
        except IndexError as err:
            writelog (err)
            writelog(float(content[3]) , loc, zlo)

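So far I have tried the following options:

Profiling:

Using cProfile, I profiled the main code and found that the helper function above takes about 10 seconds for a 74.4 MB file. For me, 10 seconds is too long.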
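For readers who want to reproduce this kind of measurement, here is a minimal sketch of profiling a single function with cProfile and pstats. The function `parse_lines` and the synthetic input are hypothetical stand-ins, not the code from the question.

```python
import cProfile
import io
import pstats


def parse_lines(lines):
    """Hypothetical stand-in for the real helper: sum field 3 of 7-field lines."""
    total = 0.0
    for line in lines:
        content = line.split()
        if len(content) == 7:
            total += float(content[3])
    return total


# Synthetic input, only for illustration (one header-like line is skipped)
lines = ["1 0.0 0.0 1.5 10.0 20.0 30.0\n"] * 1000 + ["ITEM: TIMESTEP\n"]

profiler = cProfile.Profile()
profiler.enable()
result = parse_lines(lines)
profiler.disable()

# Collect the report into a string, sorted by cumulative time, top 5 entries
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

The report lists per-function call counts and times, which is how costs such as `str.split` and `float` can be separated from the rest of the loop.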

Option 1: cython3

Compiled with Cython as follows.

    cython3 --embed -o ptythinfile.c ptythinfile.py

    gcc -Os -I /usr/include/python3.8 -o ptythinfile ptythinfile.c -lpython3.8 -lpthread -lm -lutil -ldl

This did not produce any performance improvement.

Option 2: C/C++

Convert the whole code to C/C++ and compile it.

In fact, my first code was in C; debugging was a nightmare, so I switched to Python. So, I don't want to follow this route.

Option 3: Pypy3

I tried with pypy3 and ran into compatibility issues. I have python3.8 and 3.9, but the pypy3 was looking for 3.6 and then I gave up.

Option 4: External C library

I read a tutorial on compiling the helper function as C code and calling it from Python. This would be my next attempt.

Searching on Google I found many options like shedskin etc. Could you point out the best way to optimize the above code snippet and possible alternative solutions to speed it up?

UPDATE 1 : OCT 21 - 2021 The code is updated based on the comments from experts below. Tested and working well. However, average code exec time reduced from ~10 s to ~9.4s

The content of the pressurefile is an output from LAMMPS software and first few lines of it looks like:

    ITEM: TIMESTEP
    50100
    ITEM: NUMBER OF ATOMS
    2744
    ITEM: BOX BOUNDS pp pp pp
    -2.5000000000000000e+01 2.5000000000000000e+01
    -2.5000000000000000e+01 2.5000000000000000e+01
    -7.5000000000000000e+01 7.5000000000000000e+01
    ITEM: ATOMS id x y z c_1[1] c_1[2] c_1[3]
    2354 18.8358 -21.02 -70.5731 -21041.8 -3738.18 -2520.84
    1708 5.54312 -8.1526 -62.6984 4362.84 -30610.2 -4065.84

The last two lines are what we need for processing.

LATEST CODE

    import numpy as np

    # mem, amu_to_kg, ang3_to_m3 and writelog are defined elsewhere in the program.
    def time_average():
        try:
            filename = mem.pressurefile
            navg = mem.NFRAMES
            dz = mem.dz
            zlo = mem.zlo
            NZ = mem.NZ
            mass = mem.mass

            dens_fact = amu_to_kg / (mem.slab_V * ang3_to_m3)

            array_pxx = np.zeros([NZ,1])
            array_pyy = np.zeros([NZ,1])
            array_pzz = np.zeros([NZ,1])
            array_ndens = np.zeros([NZ,1])

            #array_density = np.zeros([NZ,1])
            array_enthalpy = np.zeros([NZ,1])
            array_surf_tens = np.zeros([NZ,1])

            counter = 0
            locList = []
            pxxList = []
            pyyList = []
            pzzList = []
            with open(filename) as f:
                for line in f:
                    #line.strip("\n")
                    #content = [_ for _ in line.split()]
                    content = line.split()
                    if len(content) == 7:
                        z = float(content[3]) - zlo
                        pxx = float(content[4])
                        pyy = float(content[5])
                        pzz = float(content[6])

                        #loc = math.floor(z/dz)
                        loc = int(z // dz)

                        if loc >= NZ:
                            loc = loc - NZ
                        elif loc < 0:
                            loc = loc + NZ
                        #print(z, loc, zlo)

                        # Not great but much faster than using Numpy functions
                        locList.append(loc)
                        pxxList.append(pxx)
                        pyyList.append(pyy)
                        pzzList.append(pzz)
                    counter += 1

            # Very fast list-to-Numpy-array conversion
            locList = np.array(locList, dtype=np.int32)
            pxxList = np.array(pxxList, dtype=np.float64)
            pyyList = np.array(pyyList, dtype=np.float64)
            pzzList = np.array(pzzList, dtype=np.float64)

            # Fast accumulate
            np.add.at(array_pxx[:,0], locList, pxxList)
            np.add.at(array_pyy[:,0], locList, pyyList)
            np.add.at(array_pzz[:,0], locList, pzzList)
            np.add.at(array_ndens[:,0], locList, 1)

            array_pxx /= navg
            array_pyy /= navg
            array_pzz /= navg
            array_ndens /= navg
            array_density = mass * dens_fact * array_ndens

            return (array_density, array_enthalpy, array_surf_tens)
        except IndexError as err:
            writelog (err)
            print(loc)
            writelog(float(content[3]) , loc, zlo)

Testing computer specs:
Intel(R) Xeon(R) W-2255 CPU @ 3.70GHz × 20
RAM: 16 GB
NVIDIA Corporation GP107GL [Quadro P620]
64bit Ubuntu 20.04.3 LTS

Current average code exec time is ~2.6s (3x faster than original) credit to user @JeromeRichard

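Answer:

First of all, Python is clearly not the best tool for doing this kind of computation efficiently. The code is sequential, and most of the time is spent in CPython interpreter operations or NumPy internal functions.

"Option 1: cython3
This did not produce any performance improvement."

This is partly because optimizations were not enabled. You need to use the flag -O2 or even -O3. That said, Cython will probably not help much here, because in this particular code most of the time is spent in CPython-to-NumPy calls.

"Option 2: C/C++. Convert the whole code to C/C++ and compile it. In fact, my first code was in C; debugging was a nightmare, so I switched to Python. So, I don't want to follow this route."

You do not need to port all the code. You can rewrite only the performance-critical functions like this one and put them in a dedicated CPython module (i.e. write a C/C++ extension). However, this solution requires dealing with low-level CPython internals. Cython may help here: AFAIK, you can call C functions from Cython functions, and Cython makes it easy to interface CPython with C functions. A simple function interface should help keep the code easy to read and maintain. Still, I agree this is not great, but C code can be at least an order of magnitude faster than CPython for this computation...

"Searching on Google I found many options like shedskin etc."

ShedSkin is no longer actively developed. I doubt such a project can help you, since the code is quite complex and uses NumPy.

Numba could theoretically help a lot in this case. However, strings (i.e. parsing) are not well supported yet.

"Could you point out the best way to optimize the above code snippet and possible alternative solutions to speed it up?"

Lines like array_pxx[loc] += pxx are very slow because the interpreter has to call C NumPy functions internally, and these perform many unneeded operations: bound/type checking, type conversion, allocation/deallocation, reference counting, etc. Such operations are very slow (more than 1000 times slower than in C). One way to avoid this is simply to use pure-Python lists inside pure-Python loops (at least when the code cannot be vectorized efficiently). You can then convert the lists to NumPy arrays efficiently and use np.add.at. Here is an improved implementation: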

def time_average():
    try:
        filename = mem.pressurefile
        navg = mem.NFRAMES
        dz = mem.dz
        zlo = mem.zlo
        NZ = mem.NZ
        mass = mem.mass

        dens_fact = amu_to_kg / (mem.slab_V * ang3_to_m3)
        
        array_pxx = np.zeros([NZ,1])
        array_pyy = np.zeros([NZ,1])
        array_pzz = np.zeros([NZ,1])
        array_ndens = np.zeros([NZ,1])
        
        #array_density = np.zeros([NZ,1])
        array_enthalpy = np.zeros([NZ,1])
        array_surf_tens = np.zeros([NZ,1])
        
        counter = 0
        locList = []
        pxxList = []
        pyyList = []
        pzzList = []
        with open(filename) as f:
            for line in f:
                #line.strip("\n")
                #content = [_ for _ in line.split()]
                content = line.split()
                if len(content) == 7:
                    z = float(content[3]) - zlo
                    pxx = float(content[4])
                    pyy = float(content[5])
                    pzz = float(content[6])
                    
                    #loc = math.floor(z/dz)
                    loc = int(z // dz)
                    
                    if loc >= NZ:
                        loc = loc - NZ
                    elif loc < 0:
                        loc = loc + NZ
                    #print(z, loc, zlo)
                    
                    # Not great but much faster than using Numpy functions
                    locList.append(loc)
                    pxxList.append(pxx)
                    pyyList.append(pyy)
                    pzzList.append(pzz)
                counter += 1

        # Very fast list-to-Numpy-array conversion
        locList = np.array(locList, dtype=np.int32)
        pxxList = np.array(pxxList, dtype=np.float64)
        pyyList = np.array(pyyList, dtype=np.float64)
        pzzList = np.array(pzzList, dtype=np.float64)

        # Fast accumulate
        np.add.at(array_pxx[:,0], locList, pxxList)
        np.add.at(array_pyy[:,0], locList, pyyList)
        np.add.at(array_pzz[:,0], locList, pzzList)
        np.add.at(array_ndens[:,0], locList, 1)

        array_pxx /= navg
        array_pyy /= navg
        array_pzz /= navg
        array_ndens /= navg
        array_density = mass * dens_fact * array_ndens

        return (array_density, array_enthalpy, array_surf_tens)
    except IndexError as err:
        writelog (err)
        print(loc)
        writelog(float(content[3]) , loc, zlo)

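This code is about 3 times faster overall on my machine. Note, however, that it should use more memory (because of the lists).

Most of the remaining time is spent in string-to-float conversion (25%), string splitting (20-25%), list appends (17%), and the CPython interpreter itself, e.g. importing modules (20%). I/O takes only a small fraction of the total time (on an SSD, or when the file is cached by the operating system). Optimizing this further is challenging as long as the code stays pure Python (running on CPython).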
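To illustrate why np.add.at is used for the accumulation step rather than plain fancy-index addition, here is a small sketch with made-up bin indices and values (all numbers are hypothetical). It also shows np.bincount with weights, a swapped-in alternative that is not part of the code above but should give the same per-bin sums for 1-D data.

```python
import numpy as np

loc = np.array([0, 2, 0, 1, 0])            # bin index of each sample (hypothetical)
pxx = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # value of each sample (hypothetical)
NZ = 3

# Buffered fancy-index addition: a repeated index is applied only once,
# so samples falling in the same bin are lost
buffered = np.zeros(NZ)
buffered[loc] += pxx

# Unbuffered accumulation: every sample is added to its bin
accumulated = np.zeros(NZ)
np.add.at(accumulated, loc, pxx)

# Alternative: np.bincount with weights computes the same per-bin sums
binned = np.bincount(loc, weights=pxx, minlength=NZ)

print(buffered, accumulated, binned)
```

Here bin 0 receives three samples, so only the last two approaches return their full sum for that bin.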

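Answer:

The first step, reading the file, can easily be done with genfromtxt. It does read the file line by line, splits each line (as you do), collects the results in a list of lists, and builds an array at the end. pandas.read_csv is faster, at least when using its C engine, and may be worth a try for large files.

With dtype=None it makes a structured array, preserving the integer nature of the first column. The "columns" are accessed by field name (as specified in the dtype):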

In [30]: data = np.genfromtxt('stack69665939.py',skip_header=9, dtype=None)
In [31]: data
Out[31]: 
array([(2354, 18.8358 , -21.02  , -70.5731, -21041.8 ,  -3738.18, -2520.84),
       (1708,  5.54312,  -8.1526, -62.6984,   4362.84, -30610.2 , -4065.84)],
      dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<f8'), ('f4', '<f8'), ('f5', '<f8'), ('f6', '<f8')])

Or load all values as floats, making an (N,7) 2-D array:

In [32]: data = np.genfromtxt('stack69665939.py',skip_header=9)
In [33]: data
Out[33]: 
array([[ 2.35400e+03,  1.88358e+01, -2.10200e+01, -7.05731e+01,
        -2.10418e+04, -3.73818e+03, -2.52084e+03],
       [ 1.70800e+03,  5.54312e+00, -8.15260e+00, -6.26984e+01,
         4.36284e+03, -3.06102e+04, -4.06584e+03]])

Specifying usecols as just [3,4,5,6] might save some time. You seem to be interested only in these columns:

In [35]: z = data[:,3]
In [36]: pxyz = data[:,[4,5,6]]
In [37]: z
Out[37]: array([-70.5731, -62.6984])
In [38]: pxyz
Out[38]: 
array([[-21041.8 ,  -3738.18,  -2520.84],
       [  4362.84, -30610.2 ,  -4065.84]])
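The usecols suggestion can be sketched as follows. This is only an illustration on the single-frame sample from the question, fed through io.StringIO instead of a real file; a dump with several frames repeats the header block, so skip_header=9 alone would not be enough there.

```python
import io
import numpy as np

# Sample taken from the question (a single frame with two atoms)
sample = """ITEM: TIMESTEP
50100
ITEM: NUMBER OF ATOMS
2744
ITEM: BOX BOUNDS pp pp pp
-2.5000000000000000e+01 2.5000000000000000e+01
-2.5000000000000000e+01 2.5000000000000000e+01
-7.5000000000000000e+01 7.5000000000000000e+01
ITEM: ATOMS id x y z c_1[1] c_1[2] c_1[3]
2354 18.8358 -21.02 -70.5731 -21041.8 -3738.18 -2520.84
1708 5.54312 -8.1526 -62.6984 4362.84 -30610.2 -4065.84
"""

# Skip the 9 header lines and keep only z and the three pressure columns
data = np.genfromtxt(io.StringIO(sample), skip_header=9, usecols=(3, 4, 5, 6))

z = data[:, 0]      # column 3 of the file
pxyz = data[:, 1:]  # columns 4, 5, 6 of the file
print(z)
print(pxyz)
```

With usecols the resulting array has four columns, so z moves to index 0 and the pressure components to indices 1 to 3.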

It looks like you do something with z to derive a loc, and use that to combine "rows" of the pxyz array. I won't try to recreate that.
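One possible vectorized form of that binning step is sketched below, as a reading of the loop in the question rather than the author's method. The values of zlo, dz and NZ are hypothetical stand-ins for mem.zlo, mem.dz and mem.NZ, and the modulo wraps any out-of-range bin, whereas the question's code shifts by NZ only once.

```python
import numpy as np

# The two sample rows from the question: z, then the three pressure components
z = np.array([-70.5731, -62.6984])
pxyz = np.array([[-21041.8, -3738.18, -2520.84],
                 [4362.84, -30610.2, -4065.84]])

# Hypothetical slab parameters (stand-ins for mem.zlo, mem.dz, mem.NZ)
zlo, dz, NZ = -75.0, 5.0, 30

# Bin index of each row; the modulo wraps out-of-range bins periodically
loc = np.floor((z - zlo) / dz).astype(np.int64) % NZ

# Accumulate the pressure rows per bin and count the atoms per bin
sums = np.zeros((NZ, 3))
np.add.at(sums, loc, pxyz)
counts = np.bincount(loc, minlength=NZ)

print(loc)
```

Dividing sums and counts by the number of frames would then give the averages that the original loop computes.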

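In any case, when working with large csv files we usually read the file in one step and then process the resulting array or DataFrame. Processing while reading is possible, but it is usually not worth the effort.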


Tags: python numpy performance
