我有一長串日期(從 1942-1-1 00:00:00 到 2012-12-31 24:00:00)。這些分別與一些金額相關(見下文)。有沒有辦法先分別過濾一天的所有金額,然后將它們加在一起?
例如對于 1942-01-01,如何找到這一天(從時間 0 到 24)發生的所有值(數量),然后將它們相加?
time amount
DateTime Float64
1942-01-01T00:00:00 7.0
1942-01-02T00:00:00 0.2
1942-01-03T00:00:00 2.1
1942-01-04T00:00:00 3.0
:
2012-12-31T23:00:00 4.0
2012-12-31T24:00:00 0.0
df = CSV.read(path, DataFrame)
for i in 1:24
filter(r ->hour(r.time) == i, df)
end
uj5u.com熱心網友回復:
您可以使用許多方法(也許其他一些評論者會提出替代方案)。在這里,讓我向您展示如何在不進行任何過濾的情況下實作您想要的效果:
julia> df = DataFrame(time=[DateTime(2020, 1, rand(1:2), rand(0:23)) for _ in 1:100], amount=rand(100))
100×2 DataFrame
Row │ time amount
│ DateTime Float64
─────┼────────────────────────────────
1 │ 2020-01-02T16:00:00 0.29325
2 │ 2020-01-02T02:00:00 0.376917
3 │ 2020-01-02T09:00:00 0.11849
4 │ 2020-01-02T04:00:00 0.462997
? │ ? ?
97 │ 2020-01-02T18:00:00 0.750604
98 │ 2020-01-01T13:00:00 0.179414
99 │ 2020-01-01T15:00:00 0.552547
100 │ 2020-01-01T02:00:00 0.769066
92 rows omitted
julia> transform!(df, :time => ByRow(Date) => :date, :time => ByRow(hour) => :hour)
100×4 DataFrame
Row │ time amount date hour
│ DateTime Float64 Date Int64
─────┼───────────────────────────────────────────────────
1 │ 2020-01-02T16:00:00 0.29325 2020-01-02 16
2 │ 2020-01-02T02:00:00 0.376917 2020-01-02 2
3 │ 2020-01-02T09:00:00 0.11849 2020-01-02 9
4 │ 2020-01-02T04:00:00 0.462997 2020-01-02 4
? │ ? ? ? ?
97 │ 2020-01-02T18:00:00 0.750604 2020-01-02 18
98 │ 2020-01-01T13:00:00 0.179414 2020-01-01 13
99 │ 2020-01-01T15:00:00 0.552547 2020-01-01 15
100 │ 2020-01-01T02:00:00 0.769066 2020-01-01 2
92 rows omitted
julia> unstack(df, :hour, :date, :amount, combine=sum, fill=0)
24×3 DataFrame
Row │ hour 2020-01-02 2020-01-01
│ Int64 Float64 Float64
─────┼───────────────────────────────
1 │ 16 1.06636 0.949414
2 │ 2 0.990913 1.43032
3 │ 9 0.183206 3.16363
4 │ 4 1.24055 0.57196
? │ ? ? ?
21 │ 10 0.0 0.492397
22 │ 14 0.393438 0.0
23 │ 21 0.0 0.487992
24 │ 8 0.848852 0.0
16 rows omitted
最終結果是一個資料框,可為您提供所有天(列)的所有小時(行)的聚合。資料按出現順序顯示,因此您可能希望按以下方式對結果進行排序hour:
julia> res = sort!(unstack(df, :hour, :date, :amount, combine=sum, fill=0), :hour)
24×3 DataFrame
Row │ hour 2020-01-02 2020-01-01
│ Int64 Float64 Float64
─────┼───────────────────────────────
1 │ 0 1.99143 0.150979
2 │ 1 1.25939 0.860835
3 │ 2 0.990913 1.43032
4 │ 3 3.83337 2.33696
? │ ? ? ?
21 │ 20 1.73576 1.93323
22 │ 21 0.0 0.487992
23 │ 22 1.52546 0.651938
24 │ 23 1.03808 0.0
16 rows omitted
現在,您只需提取與其對應的列即可提取特定日期的資訊,例如:
julia> res."2020-01-02"
24-element Vector{Float64}:
1.991425180864845
1.2593855803084226
0.9909134301068651
3.833369559458414
1.2405519797178841
1.4494215475119732
?
2.4509665509554157
0.0
1.7357636571508785
0.0
1.525457178008634
1.0380772820126043
對于您擁有的資料量,一次獲取所有結果應該沒有問題(在此示例中,我在一天和一小時對源資料幀進行了預排序,以使最終表格按行和列很好地排序):
julia> @time big = DataFrame(time=[DateTime(rand(1942:2012), rand(1:12), rand(1:28), rand(0:23)) for _ in 1:10^7], amount=rand(10^7));
0.413495 seconds (99.39 k allocations: 310.149 MiB, 3.75% gc time, 5.54% compilation time)
julia> @time sort!(transform!(big, :time => ByRow(Date) => :date, :time => ByRow(hour) => :hour), [:date, :hour]);
5.049808 seconds (1.03 M allocations: 1.167 GiB, 0.81% gc time)
julia> @time unstack(big, :hour, :date, :amount, combine=sum, fill=0)
1.342251 seconds (21.58 M allocations: 673.052 MiB, 13.63% gc time)
24×23857 DataFrame
Row │ hour 1942-01-01 1942-01-02 1942-01-03 1942-01-04 1942-01-05 1942-01-06 1942-01-07 1942-01-08 1942-01-09 1942-01-10 1942-01-11 194 ?
│ Int64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Flo ?
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0 9.19054 8.00765 6.99379 9.63979 6.5088 11.6281 12.4928 6.86322 11.4453 12.6505 10.0583 1 ?
2 │ 1 8.78977 8.32879 6.29344 12.0815 9.83297 8.24592 10.349 10.1213 6.51192 6.1523 8.38962
3 │ 2 5.51566 9.97157 12.1064 8.28468 11.1929 8.274 8.25525 7.88186 4.65225 7.44625 6.62251 1
4 │ 3 7.25526 13.1635 4.75877 9.77418 11.5427 6.30625 6.2512 8.06394 8.77394 12.5935 9.09008
? │ ? ? ? ? ? ? ? ? ? ? ? ? ?
21 │ 20 8.46999 9.99227 11.1116 14.5478 11.8379 7.38414 11.0567 6.17652 10.6811 9.059 9.77321 ?
22 │ 21 7.02998 10.0908 5.5182 8.8145 9.81238 10.8413 8.65648 12.6846 12.1116 8.75566 11.2892 1
23 │ 22 9.17824 13.2115 10.589 9.87813 10.7258 7.97428 12.8137 10.3456 8.37605 9.54897 7.24197
24 │ 23 13.0214 10.2333 9.08972 11.8678 7.36996 9.80802 11.0031 6.0818 11.7789 4.3467 7.49586
23845 columns and 16 rows omitted
編輯
這是一個如何使用的示例filter。我假設我們處理big上面創建的資料框,并且只需要 1942-02-03 的資訊。我還使用 Chain.jl 來很好地鏈接執行的操作:
julia> @chain big begin
filter(:date => ==(Date("1942-02-03")), _)
groupby(:hour, sort=true)
combine(:amount => sum)
end
24×2 DataFrame
Row │ hour amount_sum
│ Int64 Float64
─────┼───────────────────
1 │ 0 6.22427
2 │ 1 8.33195
3 │ 2 9.26992
4 │ 3 13.7858
? │ ? ?
21 │ 20 6.59938
22 │ 21 6.07788
23 │ 22 6.68741
24 │ 23 7.59147
16 rows omitted
(如果有不清楚的地方請評論)
轉載請註明出處,本文鏈接:https://www.uj5u.com/houduan/524755.html
上一篇:如何查找列中具有最高值的行的名稱
