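I have a DataFrame with about 7,000,000 rows and many columns.
Each row is a tweet, and I have a column, text, that contains the tweet's content.
I created a new column for the hashtags found in the text: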
df['hashtags'] = df.Tweets.str.findall(r'(?:(?<=\s)|(?<=^))#.*?(?=\s|$)')
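So I have a column named hashtags, where each row contains a list such as ['#b747', '#test'].
I want to count the occurrences of each hashtag, but I have a lot of rows. What is the most efficient way to do it?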
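To illustrate what the pattern in the question extracts, here is a minimal sketch that applies the same regular expression with the standard-library `re.findall`, which is what `Series.str.findall` applies to each element. The sample tweets and the name `HASHTAG_RE` are made up for illustration.

```python
import re

# Same pattern as in the question: a '#' at the start of the text or after
# whitespace, extended lazily up to the next whitespace or the end.
HASHTAG_RE = re.compile(r'(?:(?<=\s)|(?<=^))#.*?(?=\s|$)')

# Hypothetical tweets, invented for this example
tweets = [
    "Boarding the #b747 now #test",
    "#start of a tweet",
    "no tag here: a#b",  # '#' in the middle of a word is not a hashtag
]

# One list of hashtags per tweet, like the hashtags column
hashtags = [HASHTAG_RE.findall(t) for t in tweets]
print(hashtags)
```

Tweets without any hashtag produce an empty list, so every counting approach has to cope with empty rows.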
Answer from a uj5u.com user:
Here are several approaches with their timings, ordered by speed (fastest first):
# setup
import numpy as np
import pandas as pd
from collections import Counter
n = 10_000
df = pd.DataFrame({
    'hashtags': np.random.randint(0, int(np.sqrt(n)), (n, 10)).astype(str).tolist(),
})
# 1. using itertools.chain to build an iterator on the elements of the lists
from itertools import chain
%timeit Counter(chain(*df.hashtags))
# 7.35 ms ± 58.1 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 2. as per @Psidom comment
%timeit df.hashtags.explode().value_counts()
# 8.06 ms ± 19.2 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 3. using Counter constructor, but specifying an iterator, not a list
%timeit Counter(h for hl in df.hashtags for h in hl)
# 10.6 ms ± 13.6 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 4. iterating explicitly and using Counter().update()
def count5(s):
    c = Counter()
    for hl in s:
        c.update(hl)
    return c
%timeit count5(df.hashtags)
# 12.4 ms ± 66.4 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 5. using functools.reduce on Counter().update()
from functools import reduce
%timeit reduce(lambda x,y: x.update(y) or x, df.hashtags, Counter())
# 13.7 ms ± 10.1 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 6. as per @EzerK
%timeit Counter(sum(df['hashtags'].values, []))
# 2.58 s ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
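Conclusion: the fastest is #1 (Counter(chain(*df.hashtags))), but the more intuitive and natural #2 (from @Psidom's comment) is almost as fast, and I would probably go with that one. #6 (@EzerK's approach) is very slow for a large df, because it builds a new (long) list before passing it as an argument to Counter().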
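As a rough illustration of why the sum-based approach falls behind, here is a minimal sketch that uses a plain list of lists as a stand-in for df.hashtags (the data, sizes, and function names are assumptions made up for this example). Each addition inside `sum(rows, [])` copies the list accumulated so far, so the work grows roughly quadratically with the number of rows, while `chain.from_iterable` walks the elements lazily without building an intermediate list.

```python
import random
from collections import Counter
from itertools import chain
from timeit import timeit

random.seed(0)
# Hypothetical stand-in for df.hashtags: 2,000 rows of 10 tags each
rows = [[f"#tag{random.randrange(40)}" for _ in range(10)] for _ in range(2_000)]

def count_chain(rows):
    # Iterates over every tag lazily; no intermediate list is created
    return Counter(chain.from_iterable(rows))

def count_sum(rows):
    # Every "+" performed by sum() copies the accumulated list
    return Counter(sum(rows, []))

# Both approaches produce the same counts
assert count_chain(rows) == count_sum(rows)

t_chain = timeit(lambda: count_chain(rows), number=5)
t_sum = timeit(lambda: count_sum(rows), number=5)
print(f"chain: {t_chain:.4f}s  sum: {t_sum:.4f}s")
```

`chain.from_iterable(rows)` yields the same elements as `chain(*rows)` but does not unpack all rows into function arguments first. Absolute timings depend on the machine and the data, so treat the printed numbers only as a relative comparison.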
Answer from a uj5u.com user:
You can combine all the lists into one big list and then use collections.Counter:
import pandas as pd
from collections import Counter
df = pd.DataFrame()
df['hashtags'] = [['#b747', '#test'], ['#b747', '#test']]
Counter(sum(df['hashtags'].values, []))
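For reference, a small sketch of what this returns and how to read off the most frequent hashtag. A plain list stands in for df['hashtags'].values, and the third row is a made-up addition so that the counts differ; both are assumptions for illustration only.

```python
from collections import Counter

# Stand-in for df['hashtags'].values; the third row is invented for this example
rows = [['#b747', '#test'], ['#b747', '#test'], ['#b747']]

# Flatten by list concatenation, then count each hashtag
counts = Counter(sum(rows, []))
print(counts)

# most_common(k) returns the k most frequent hashtags with their counts
print(counts.most_common(1))
```

This is convenient for small frames, but the concatenation step becomes slow as the number of rows grows.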
Please credit the source when reposting. Link to this article: https://www.uj5u.com/shujuku/435577.html
