Pandas DataFrame: how to calculate the difference between the first and last row and sum it across repeated groups?

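I have a data-processing pipeline that works as follows:

I have two lists containing the data I need.
I append those lists to a new list. [tableList]
I convert that list into a DataFrame and export it as a csv file. [tableDf]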

Below is a simplified view of the contents of tableList:

Category      CategoryName     time(s)      Power       Vapor
   1               A          1625448301   593233.36    3353.92
   1               A          1625449552   595156.24    3286.8
   1               A          1625450802   593833.36    3855.42
   2               B          1625452051   595233.37    3353.95
   2               B          1625453301   593535.86    3252.92
   2               B          1625454552   593473.36    3364.15
   3               C          1625455802   593754.32    3233.92
   3               C          1625457052   593153.46    3563.52
   3               C          1625458301   593854.56    3334.94
   4               D          1625459552   593345.75    3353.36
   4               D          1625460802   592313.24    3674.95
   4               D          1625460802   592313.24    3673.35
   1               A          1625463301   597313.23    3658.46
   1               A          1625464552   595913.68    3789.45
   ....

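The data is divided by category, and the pattern in which categories appear is not always the same.
Note: the time column holds date-times as Unix timestamps (seconds).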

This is the result I am trying to produce:

Category      CategoryName    TotalTime(s)           Power           Vapor
       1           A          (Total time diff 1) (Power SUM 1)    (Vapor SUM 1)
       2           B          (Total time diff 2) (Power SUM 2)    (Vapor SUM 2)
       3           C          (Total time diff 3) (Power SUM 3)    (Vapor SUM 3)
       4           D          (Total time diff 4) (Power SUM 4)    (Vapor SUM 4)

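The data is grouped by category, and the Power and Vapor totals are easy to get by applying SUM within each category group. What I am stuck on is calculating the total time.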

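For example, in the first appearance of Category 1, the difference between the last and first timestamps is 2501 (1625450802 - 1625448301).

In the next appearance of Category 1, the difference between last and first is 2600. All of these per-appearance differences are added together to produce Total time diff 1.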

I have tried pandas' diff() as well as the answer to another question:

tableDf['TotalTime(s)'] = tableDf.groupby('Category')['time(s)'].transform(lambda x: x.iat[-1] - x.iat[0])

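But all of these approaches only take the very last and very first rows of Category 1 as a whole, ignoring that it appears more than once. That gives the wrong total time.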
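
To illustrate the problem, here is a minimal sketch that uses only the Category 1 rows visible in the sample (the frame name `demo` is made up for this illustration). Grouping by `Category` alone merges both appearances into one group, so the result also covers the gap between them:

```python
import pandas as pd

# Only the Category 1 rows shown in the sample: two separate appearances.
demo = pd.DataFrame({
    "Category": [1, 1, 1, 1, 1],
    "time(s)": [1625448301, 1625449552, 1625450802, 1625463301, 1625464552],
})

# last - first over the whole category, not per appearance
whole_group = demo.groupby("Category")["time(s)"].transform(
    lambda x: x.iat[-1] - x.iat[0]
)

# Every row gets 16251 (1625464552 - 1625448301), whereas summing the two
# appearances separately for these rows would give 2501 + 1251 = 3752.
print(whole_group.unique())
```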

Any solution or suggestion for calculating the difference between the last and first row of each appearance of a category?

Answer:

Just to offer an alternative based on convtools:

from convtools import conversion as c
from convtools.contrib.tables import Table


# this is an ad hoc converter function; consider generating it once and reusing
# further
converter = (
    c.chunk_by(c.item("Category"))
    .aggregate(
        {
            "Category": c.ReduceFuncs.First(c.this).item("Category"),
            "CategoryName": c.ReduceFuncs.First(c.this).item("CategoryName"),
            "TotalTime(s)": (
                c.ReduceFuncs.Last(c.this).item("time(s)")
                - c.ReduceFuncs.First(c.this).item("time(s)")
            ),
            "Power": c.ReduceFuncs.Sum(c.item("Power")),
            "Vapor": c.ReduceFuncs.Sum(c.item("Vapor")),
        }
    )
    .gen_converter()
)

column_types = {
    "time(s)": int,
    "Power": float,
    "Vapor": float,
}

# this is an iterator, so it can be consumed only once
prepared_rows_iter = (
    Table.from_csv("tmp4.csv", header=True)
    # casting column types
    .update(
        **{
            column_name: c.col(column_name).as_type(column_type)
            for column_name, column_type in column_types.items()
        }
    ).into_iter_rows(dict)
)

# if list of dicts is needed
result = list(converter(prepared_rows_iter))
assert result == [
    { "Category": "1", "CategoryName": "A", "TotalTime(s)": 2501, "Power": 1782222.96, "Vapor": 10496.14, },
    { "Category": "2", "CategoryName": "B", "TotalTime(s)": 2501, "Power": 1782242.5899999999, "Vapor": 9971.02, },
    { "Category": "3", "CategoryName": "C", "TotalTime(s)": 2499, "Power": 1780762.3399999999, "Vapor": 10132.380000000001, },
    { "Category": "4", "CategoryName": "D", "TotalTime(s)": 1250, "Power": 1777972.23, "Vapor": 10701.66, },
    { "Category": "1", "CategoryName": "A", "TotalTime(s)": 1251, "Power": 1193226.9100000001, "Vapor": 7447.91, },
]

# if csv file is needed
# Table.from_rows(converter(prepared_rows_iter)).into_csv("out.csv")
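
Note that the converter above returns one row per consecutive chunk, so Category 1 appears twice in `result`. To reach the per-category totals asked for in the question, the chunk results still have to be added up by category. Below is a minimal standard-library sketch of both steps, assuming the rows are dicts shaped like the ones above. The function name `total_by_category` and the shortened sample rows are illustrative and not part of convtools:

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def total_by_category(rows):
    """Sum per-run (last - first) time spans, Power and Vapor per category."""
    totals = defaultdict(lambda: {"TotalTime(s)": 0, "Power": 0.0, "Vapor": 0.0})
    # itertools.groupby only groups *consecutive* equal keys, so each
    # iteration corresponds to one appearance (run) of a category.
    for key, run in groupby(rows, key=itemgetter("Category", "CategoryName")):
        run = list(run)
        acc = totals[key]
        acc["TotalTime(s)"] += run[-1]["time(s)"] - run[0]["time(s)"]
        acc["Power"] += sum(r["Power"] for r in run)
        acc["Vapor"] += sum(r["Vapor"] for r in run)
    return dict(totals)


# A shortened, illustrative subset of the sample rows.
rows = [
    {"Category": 1, "CategoryName": "A", "time(s)": 1625448301, "Power": 593233.36, "Vapor": 3353.92},
    {"Category": 1, "CategoryName": "A", "time(s)": 1625450802, "Power": 593833.36, "Vapor": 3855.42},
    {"Category": 2, "CategoryName": "B", "time(s)": 1625452051, "Power": 595233.37, "Vapor": 3353.95},
    {"Category": 1, "CategoryName": "A", "time(s)": 1625463301, "Power": 597313.23, "Vapor": 3658.46},
    {"Category": 1, "CategoryName": "A", "time(s)": 1625464552, "Power": 595913.68, "Vapor": 3789.45},
]
result = total_by_category(rows)
print(result[(1, "A")]["TotalTime(s)"])  # 2501 + 1251 = 3752
```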

Answer:

Here is a solution using datar, which reimagines the pandas API:

Building the data

>>> from datar.all import f, tribble, group_by, summarise, first, last, sum, relocate
[2022-03-23 10:11:46][datar][WARNING] Builtin name "sum" has been overriden by datar.
>>> 
>>> df = tribble(
...     f.Category,  f.CategoryName, f["time(s)"], f.Power,   f.Vapor,
...     1,           "A",            1625448301,   593233.36, 353.92,
...     1,           "A",            1625449552,   595156.24, 286.8,
...     1,           "A",            1625450802,   593833.36, 855.42,
...     2,           "B",            1625452051,   595233.37, 353.95,
...     2,           "B",            1625453301,   593535.86, 252.92,
...     2,           "B",            1625454552,   593473.36, 364.15,
...     3,           "C",            1625455802,   593754.32, 233.92,
...     3,           "C",            1625457052,   593153.46, 563.52,
...     3,           "C",            1625458301,   593854.56, 334.94,
...     4,           "D",            1625459552,   593345.75, 353.36,
...     4,           "D",            1625460802,   592313.24, 674.95,
...     4,           "D",            1625460802,   592313.24, 673.35,
... )

Manipulating the data

>>> (
...     df 
...     >> group_by(f.Category) 
...     >> summarise(
...         Power=sum(f.Power),
...         Vapor=sum(f.Vapor),
...         CategoryName=first(f.CategoryName),
...         **{
...             "TotalTime(s)": last(f["time(s)"]) - first(f["time(s)"]),
...         }
...     ) 
...     >> relocate(f.CategoryName, f["TotalTime(s)"], _after=f.Category)
... )
   Category CategoryName  TotalTime(s)       Power     Vapor
    <int64>     <object>       <int64>   <float64> <float64>
0         1            A          2501  1782222.96   1496.14
1         2            B          2501  1782242.59    971.02
2         3            C          2499  1780762.34   1132.38
3         4            D          1250  1777972.23   1701.66

Answer:

You can do this quite easily in pandas. You just need a small groupby trick to build groupings out of consecutive categories, and then apply your operations:

consec_groupings = (
    df['Category'].shift()
    .ne(df['Category'])
    .groupby(df['Category']).cumsum()
    .rename('Consec_Category')
)
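
To make the trick concrete, here is a small sketch on a toy Series (the name `cat` and its values are made up for illustration). Comparing each value with the previous one marks the start of every run, and a cumulative sum within each category then numbers that category's appearances:

```python
import pandas as pd

cat = pd.Series([1, 1, 2, 2, 1, 1, 3], name="Category")

# True wherever the value differs from the previous row, i.e. a new run starts.
run_start = cat.shift().ne(cat)

# Counting run starts within each category gives 1 for the first
# appearance of that category, 2 for the second, and so on.
appearance = run_start.groupby(cat).cumsum()

print(appearance.tolist())  # [1, 1, 1, 1, 2, 2, 1]
```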

intermediate = (
    df.groupby(['Category', 'CategoryName', consec_groupings])
    .agg({'time(s)': ['first', 'last'], 'Power': 'sum', 'Vapor': 'sum'})
)

intermediate[('time(s)', 'delta')] = (
    intermediate[('time(s)', 'last')] - intermediate[('time(s)', 'first')]
)

print(intermediate)
                                          time(s)                   Power     Vapor time(s)
                                            first        last         sum       sum   delta
Category CategoryName Consec_Category                                                      
1        A            1                1625448301  1625450802  1782222.96  10496.14    2501
                      2                1625463301  1625464552  1193226.91   7447.91    1251
2        B            1                1625452051  1625454552  1782242.59   9971.02    2501
3        C            1                1625455802  1625458301  1780762.34  10132.38    2499
4        D            1                1625459552  1625460802  1777972.23  10701.66    1250

From that intermediate result, you can then easily compute the final output:

final = (
    intermediate[[('time(s)', 'delta'), ('Power', 'sum'), ('Vapor', 'sum')]]
    .droplevel(level=1, axis=1)
    .groupby(['Category', 'CategoryName']).sum()
)

print(final)
                       time(s)       Power     Vapor
Category CategoryName                               
1        A                3752  2975449.87  17944.05
2        B                2501  1782242.59   9971.02
3        C                2499  1780762.34  10132.38
4        D                1250  1777972.23  10701.66
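
For reference, here is an end-to-end sketch of the same idea that can be pasted and run as is. It rebuilds only the sample rows shown in the question (the rest is elided there), labels runs with a single global run id instead of a per-category counter, and uses named aggregation to keep the columns flat. The names `run_id`, `per_run`, `t_first` and `t_last` are illustrative choices, not taken from the answer above:

```python
import pandas as pd

rows = [
    (1, "A", 1625448301, 593233.36, 3353.92),
    (1, "A", 1625449552, 595156.24, 3286.8),
    (1, "A", 1625450802, 593833.36, 3855.42),
    (2, "B", 1625452051, 595233.37, 3353.95),
    (2, "B", 1625453301, 593535.86, 3252.92),
    (2, "B", 1625454552, 593473.36, 3364.15),
    (3, "C", 1625455802, 593754.32, 3233.92),
    (3, "C", 1625457052, 593153.46, 3563.52),
    (3, "C", 1625458301, 593854.56, 3334.94),
    (4, "D", 1625459552, 593345.75, 3353.36),
    (4, "D", 1625460802, 592313.24, 3674.95),
    (4, "D", 1625460802, 592313.24, 3673.35),
    (1, "A", 1625463301, 597313.23, 3658.46),
    (1, "A", 1625464552, 595913.68, 3789.45),
]
df = pd.DataFrame(
    rows, columns=["Category", "CategoryName", "time(s)", "Power", "Vapor"]
)

# A new run starts wherever Category differs from the previous row;
# the cumulative sum turns those starts into a run number.
run_id = df["Category"].ne(df["Category"].shift()).cumsum().rename("run_id")

# One row per run: first/last timestamp plus Power and Vapor sums.
per_run = df.groupby(["Category", "CategoryName", run_id]).agg(
    t_first=("time(s)", "first"),
    t_last=("time(s)", "last"),
    Power=("Power", "sum"),
    Vapor=("Vapor", "sum"),
)
per_run["TotalTime(s)"] = per_run["t_last"] - per_run["t_first"]

# Collapse the runs back to one row per category.
final = (
    per_run[["TotalTime(s)", "Power", "Vapor"]]
    .groupby(level=["Category", "CategoryName"])
    .sum()
    .reset_index()
)
print(final)
```

The global run id is slightly simpler than the per-category counter because it does not need a second groupby, and either labelling yields the same runs once it is combined with `Category` in the grouping keys.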
