Finding the sets of rows in which the most entities in a column have certain features

I have a table of the following form:

word	count	feature
my	0	pronoun
favorite	0	preferences
food	0	object
is	0	being
ice	0	dessert
cream	1	dessert

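Each table is thousands of rows long. My goal is to find the top 3 sets of 100 rows with the highest count of a given set of features in column 3. For example, I want to be able to say, "I want to know which 3 sets of 100 rows have a very high number of both 'dessert' and 'object' in column 3." The rows are not in preset blocks: a set could be rows 0-99 or 54-154. The output should be a range of row indices (e.g., 4-104).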

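I have no idea how to do this without writing some massive loop over every possible set of 100 rows and counting the values in each one, which seems inefficient. I suspect there is a built-in function that does this, but I don't know what it is. Any ideas?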

Reply from a uj5u.com user:

Have you tried the groupby() function? In this case:

In [1]: df.groupby(["feature", "word"]).size()
Out[1]:
feature      word
being        is          1
dessert      cream       1
             ice         1
object       food        1
preferences  favorite    1
pronoun      my          1
dtype: int64

Reply from a uj5u.com user:

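Use the pandas library. It contains vectorized implementations of many of the functions you will end up using, so it is much faster than looping over hundreds of rows yourself.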

First, read the CSV file into a pandas DataFrame:

import pandas as pd

df = pd.read_csv('csv_file.csv')

For the example you gave, this produces a DataFrame that looks like this:

       word  count      feature
0        my      0      pronoun
1  favorite      0  preferences
2      food      0       object
3        is      0        being
4       ice      0      dessert
5     cream      1      dessert
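To try the steps without a file on disk, the same sample table can be loaded from an in-memory string. This is only a minimal sketch: it assumes the data is tab-separated with the column names word, count, and feature shown above, and the variable name sample is purely illustrative.

```python
import io

import pandas as pd

# The six sample rows from the question, as tab-separated text (assumed layout).
sample = (
    "word\tcount\tfeature\n"
    "my\t0\tpronoun\n"
    "favorite\t0\tpreferences\n"
    "food\t0\tobject\n"
    "is\t0\tbeing\n"
    "ice\t0\tdessert\n"
    "cream\t1\tdessert\n"
)

# read_csv accepts any file-like object, so StringIO stands in for a real file.
df = pd.read_csv(io.StringIO(sample), sep="\t")
print(df)
```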

Now define a function that takes a row and counts how many times a keyword occurs in the next 100 rows:

def count_in_next_100(row, keyword):
    row_index = row.name  # Since the index is numeric, row.name will be the row number
    # Take the feature column for the 100 rows starting at this one.
    # .loc slicing includes both endpoints, so the last label is row_index + 99.
    # Comparing with keyword gives True/False values, and .sum() counts the True ones.
    total = (df.loc[row_index:row_index + 99, "feature"] == keyword).sum()
    return total

Next, apply this function to every row of the DataFrame, i.e. with axis=1:

count_dessert = df.apply(count_in_next_100, axis=1, args=("dessert",))

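Then count_dessert.idxmax() gives you the row number at which dessert occurs most often in the following 100 rows. I'll leave the "find the top 3" part to you as an exercise, but let me know if you need help.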
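One possible way to approach the "top 3" part, and to handle several features at once, is a rolling sum instead of a row-by-row apply. The sketch below is not from the reply above: the function name top_windows and its parameters are hypothetical, it assumes the default 0, 1, 2, ... integer index, and it interprets "high in both dessert and object" as the combined number of matching rows in each window.

```python
import pandas as pd


def top_windows(features, keywords, window=100, k=3):
    """Return (first_row, last_row, match_count) for the k best windows.

    Hypothetical helper: assumes `features` has a default 0..n-1 integer index.
    """
    # 1 where the row's feature is one of the keywords, 0 otherwise.
    hits = features.isin(keywords).astype(int)
    # Rolling sum over `window` consecutive rows; each result is labelled by
    # the window's LAST row, and the first window-1 entries are NaN.
    counts = hits.rolling(window).sum().dropna().astype(int)
    # Relabel each count by the window's FIRST row instead.
    counts.index = counts.index - (window - 1)
    # Keep the k largest counts (ties keep the earliest window).
    best = counts.nlargest(k)
    return [(int(start), int(start) + window - 1, int(n)) for start, n in best.items()]


# Tiny demonstration with windows of 3 rows instead of 100.
demo = pd.Series(
    ["pronoun", "dessert", "dessert", "object", "being", "dessert", "pronoun", "pronoun"]
)
result = top_windows(demo, ["dessert", "object"], window=3, k=3)
print(result)
```

On the real table this would be called as top_windows(df["feature"], ["dessert", "object"]). Note that the best windows will usually overlap heavily (for example rows 4-103 and 5-104), so if distinct regions are wanted, overlapping candidates would have to be filtered out afterwards.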
