Improving the speed of filling a pandas DataFrame with data

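I am trying to find a more optimized way to add data to a pandas DataFrame. I have seen other related questions where people suggest building lists first and then adding the data to pandas (which I have now implemented).

In my current setup, I loop over several lists (in the example: librarynr, books and sections) and compute various variables (in the example these are not computed but preset: nrofletters, excitment and review). I append these to lists, and finally add the lists to the DataFrame.

Does anyone know of further optimizations that would improve the performance of this example code?

Important note: in my final code the variables are not the same for every row; they are computed from the loop iterators (see the example calculation of excitment).

Example code: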

import pandas as pd
import time


books = ['LordOfTheRings','HarryPotter','LoveStory','RandomBook']
sections = ['Introduction','MainPart','Plottwist','SurprisingEnd']

librarynr = list(range(30000))
nrofletters = 3000
excitment = True
review = 'positive'


start_time = time.time()
summarydf = pd.DataFrame()

indexlist = []
nrofletterlist = []
excitmentlist = []
reviewlist = []

for library in librarynr:
    for book in books:
        for section in sections:
            indexlist.append(str(library) + book + section)
            nrofletterlist.append(nrofletters)
            
            #example of variable calculation depending on iterators of loop:
            if (library % 2 == 0) or (book[1] == 'L'):
                excitment = False
            else:
                excitment = True
                
            excitmentlist.append(excitment)
            reviewlist.append(review)
            
summarydf['index'] = indexlist
summarydf['nrofletters'] = nrofletterlist
summarydf['excitment'] = excitmentlist
summarydf['review'] = reviewlist
listtime = time.time() - start_time
print(listtime)

Answer from a uj5u.com user:

Appending is very slow; you should build your DataFrame in one go.

IIUC, you want the product of all the possibilities. You can use itertools.product.

I am only using a small sample here, with librarynr = 5. With librarynr = 300000 you would get 4.8 million rows (the 30000 used in the question gives 480,000).

# books and sections are the lists defined in the question
from itertools import product
import pandas as pd

librarynr = 5

idx = map(''.join, product(map(str, range(librarynr)), books, sections))

df = pd.DataFrame([], index=idx)

df[['nrofletters', 'excitment', 'review']] = [3000, True, 'positive']

Output:

>>> print(df.reset_index())

                           index  nrofletters  excitment    review
0    0LordOfTheRingsIntroduction         3000       True  positive
1        0LordOfTheRingsMainPart         3000       True  positive
2       0LordOfTheRingsPlottwist         3000       True  positive
3   0LordOfTheRingsSurprisingEnd         3000       True  positive
4       0HarryPotterIntroduction         3000       True  positive
..                           ...          ...        ...       ...
75       4LoveStorySurprisingEnd         3000       True  positive
76       4RandomBookIntroduction         3000       True  positive
77           4RandomBookMainPart         3000       True  positive
78          4RandomBookPlottwist         3000       True  positive
79      4RandomBookSurprisingEnd         3000       True  positive

[80 rows x 4 columns]
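
The answer above fills every row with the same constants, while the question notes that values such as excitment depend on the loop iterators. A minimal sketch of one way to keep the product-based approach and still compute per-row values is shown here. The helper name `compute_excitment` is hypothetical (it only wraps the condition from the question), and the sample size of 5 libraries is an assumption for illustration.

```python
from itertools import product

import pandas as pd

books = ['LordOfTheRings', 'HarryPotter', 'LoveStory', 'RandomBook']
sections = ['Introduction', 'MainPart', 'Plottwist', 'SurprisingEnd']
librarynr = 5  # small sample (assumption); the question uses 30000


def compute_excitment(library, book):
    # Hypothetical helper: same condition as in the question's loop.
    return not ((library % 2 == 0) or (book[1] == 'L'))


# Build all rows first, then create the DataFrame in a single call.
rows = [
    (f"{library}{book}{section}", 3000, compute_excitment(library, book), 'positive')
    for library, book, section in product(range(librarynr), books, sections)
]

summarydf = pd.DataFrame(rows, columns=['index', 'nrofletters', 'excitment', 'review'])
print(summarydf.head())
```

Building the rows as tuples keeps each row's values together, which may be easier to maintain than four parallel lists; whether it is faster than the parallel lists would need to be measured.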

Answer from a uj5u.com user:

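Loops in CPython are very slow because of the interpreter (note that there are faster Python interpreters, such as PyPy, which uses a JIT compiler). You can speed the loop up significantly with a list comprehension. Using itertools can help further, as can converting library to a string in the outermost loop. Even so, the result is still not very fast for the amount of work being done.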
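
The suggestions above (list comprehensions, and converting the library number to a string only once per library) could be sketched roughly as follows. The value `n_libraries = 1000` is an arbitrary sample size, and pre-joining the book and section names into `suffixes` is an extra assumption of this sketch, not something the answer states.

```python
import pandas as pd

books = ['LordOfTheRings', 'HarryPotter', 'LoveStory', 'RandomBook']
sections = ['Introduction', 'MainPart', 'Plottwist', 'SurprisingEnd']
n_libraries = 1000  # sample size (assumption); the question uses 30000

# Join book and section once instead of in every library iteration.
suffixes = [book + section for book in books for section in sections]

# str() runs once per library (outermost loop), not once per row.
indexlist = [
    lib_str + suffix
    for lib_str in map(str, range(n_libraries))
    for suffix in suffixes
]

# Per-row value computed from the loop iterators, as in the question.
excitmentlist = [
    not ((library % 2 == 0) or (book[1] == 'L'))
    for library in range(n_libraries)
    for book in books
    for _ in sections
]

# Create the DataFrame in one call; scalars are broadcast to every row.
summarydf = pd.DataFrame({
    'index': indexlist,
    'nrofletters': 3000,
    'excitment': excitmentlist,
    'review': 'positive',
})
```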

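Another problem comes from converting the Python lists to Numpy arrays. Pandas uses Numpy internally, and Numpy converts each string reference to a statically bounded string (so it uses one large raw memory buffer rather than an array of reference-counted objects). This means that every string is parsed and copied by Numpy, which is very expensive. If possible, the best solution is to write the Numpy arrays directly using vectorized functions. If that is not possible, you can use Numba, but note that Numba so far has hardly any support for string arrays. Another possible solution is to use Cython. Direct assignment in Pandas can also set all the strings at once very quickly (because the strings are then parsed by Numpy only once internally).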
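
As a rough illustration of writing the arrays directly with vectorized functions, the sketch below builds the columns with `np.repeat`, `np.tile` and `np.char.add` instead of Python loops. It assumes the per-row condition can be expressed on whole arrays, as the question's example condition can; `n_libraries = 1000` is an arbitrary sample size. Whether this is actually faster for a given workload should be measured rather than assumed.

```python
import numpy as np
import pandas as pd

books = np.array(['LordOfTheRings', 'HarryPotter', 'LoveStory', 'RandomBook'])
sections = np.array(['Introduction', 'MainPart', 'Plottwist', 'SurprisingEnd'])
n_libraries = 1000  # sample size (assumption); the question uses 30000

n_books, n_sections = len(books), len(sections)
per_library = n_books * n_sections

# One entry per output row, in the same order as the nested loops.
library_col = np.repeat(np.arange(n_libraries), per_library)
book_col = np.tile(np.repeat(books, n_sections), n_libraries)
section_col = np.tile(sections, n_libraries * n_books)

# The book part of the condition only needs evaluating once per book.
book_flag_small = np.array([b[1] == 'L' for b in books])
book_flag = np.tile(np.repeat(book_flag_small, n_sections), n_libraries)

# Vectorized form of: not ((library % 2 == 0) or (book[1] == 'L'))
excitment = ~((library_col % 2 == 0) | book_flag)

# Element-wise string concatenation on fixed-width Numpy string arrays.
index = np.char.add(np.char.add(library_col.astype(str), book_col), section_col)

summarydf = pd.DataFrame({
    'index': index,
    'nrofletters': 3000,
    'excitment': excitment,
    'review': 'positive',
})
```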

On my machine, about 2/3 of the time is spent in the loops and 1/3 in the Numpy string conversion (a small fraction comes from some additional Pandas overhead).
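
To check how the time splits on your own machine, a simple measurement along these lines may help. It is only a sketch: `n_libraries = 2000` is an arbitrary sample size, only the string column is built, and the proportions will vary with hardware and with the Python and pandas versions.

```python
import time
from itertools import product

import pandas as pd

books = ['LordOfTheRings', 'HarryPotter', 'LoveStory', 'RandomBook']
sections = ['Introduction', 'MainPart', 'Plottwist', 'SurprisingEnd']
n_libraries = 2000  # sample size (assumption)

t0 = time.perf_counter()
# Phase 1: pure-Python loop that builds the strings.
indexlist = [
    str(library) + book + section
    for library, book, section in product(range(n_libraries), books, sections)
]
t1 = time.perf_counter()
# Phase 2: conversion of the Python list into a DataFrame column.
summarydf = pd.DataFrame({'index': indexlist})
t2 = time.perf_counter()

loop_time = t1 - t0
convert_time = t2 - t1
total_time = loop_time + convert_time
print(f"loop: {loop_time:.4f}s, DataFrame construction: {convert_time:.4f}s, "
      f"total: {total_time:.4f}s")
```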
