R: How to count the frequency of combinations across two string columns (using dplyr, tidyr, or base R)

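I have a large data frame of station-to-station bike trips, and my goal is to identify the most common combinations of start station and end station. My df looks like this: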

df <- data.frame(start_station = c('Apple', 'Bungalow', 'Carrot', 'Apple', 'Apple', 'Bungalow'),
                 end_station = c('Bungalow', 'Apple', 'Carrot', 'Bungalow', 'Bungalow', 'Apple'),
                 start_lat = c(12.3456, 23.4567, 34.5678, 12.3456, 12.3456, 23.4567),
                 start_lng = c(09.8765, 98.7654, 87.6543, 09.8765, 09.8765, 98.7654)
)

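Ideally, I would like output that lists the station combinations in descending order of frequency, with a new column "ride_count" giving the exact number of rides for each combination.

For the example above, I would like the output to be a new data frame that I can manipulate or visualize further: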

start_station   end_station   ride_count start_lat   start_lng
Apple            Bungalow      3          12.3456    09.8765
Bungalow         Apple         2          23.4567    98.7654
Carrot           Carrot        1          34.5678    87.6543

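Based on an earlier suggestion, "count()" does perform the correct calculation, but I lose the other data associated with each station, such as start_lat and start_lng.

Is there a way to keep these columns?

Sincere thanks to anyone who can help. I have been making good progress on this project, but I am really struggling with this final geographic element.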

A uj5u.com user replied:

If we assume that the other, non-grouping columns are always constant within each group, then we can do this:

library(dplyr)
df %>%
  group_by(start_station, end_station) %>%
  summarize(n = n(), across(everything(), first), .groups = "drop")
# # A tibble: 3 x 5
#   start_station end_station     n start_lat start_lng
#   <chr>         <chr>       <int>     <dbl>     <dbl>
# 1 Apple         Bungalow        3      12.3      9.88
# 2 Bungalow      Apple           2      23.5     98.8 
# 3 Carrot        Carrot          1      34.6     87.7

However, if those columns vary within a group, you need to decide how to aggregate each column individually.
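As a rough sketch of what per-column aggregation could look like, the example below uses a small made-up data frame (df_var, not from the question) in which the coordinates differ slightly between rides on the same route. Averaging the coordinates with mean() is only one possible choice and is an assumption here; median() or first() may suit other data better.

```r
library(dplyr)

# Hypothetical data: the two Apple -> Bungalow rides have slightly different coordinates
df_var <- data.frame(
  start_station = c("Apple", "Apple", "Bungalow"),
  end_station   = c("Bungalow", "Bungalow", "Apple"),
  start_lat     = c(12.0, 13.0, 23.5),
  start_lng     = c(9.0, 10.0, 98.5)
)

result <- df_var %>%
  group_by(start_station, end_station) %>%
  summarize(
    ride_count = n(),             # number of rides per station pair
    start_lat  = mean(start_lat), # assumed aggregation: average latitude
    start_lng  = mean(start_lng), # assumed aggregation: average longitude
    .groups = "drop"
  ) %>%
  arrange(desc(ride_count))       # most frequent combinations first

print(result)
```

Naming the count column ride_count and sorting with arrange(desc(ride_count)) matches the output layout requested in the question.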

A uj5u.com user replied:

df <- data.frame(start_station = c('Apple', 'Bungalow', 'Carrot', 'Apple', 'Apple', 'Bungalow'),
                 end_station = c('Bungalow', 'Apple', 'Carrot', 'Bungalow', 'Bungalow', 'Apple'))
df_counts <- tapply(df$start_station, 
                    paste(df$start_station, df$end_station),
                    length) |> 
  as.data.frame() |>
  `colnames<-`(c('count'))
idx <- order(df_counts$count, decreasing = TRUE)
df_counts <- df_counts[ idx, , drop = FALSE]
print(df_counts)

               count
Apple Bungalow     3
Bungalow Apple     2
Carrot Carrot      1
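The base R result above drops start_lat and start_lng. One possible way to keep them without any packages is sketched below with aggregate(); it assumes the coordinates are identical for every ride on a given station pair, because any differing coordinate value would produce its own row. The helper column ride_count is introduced here for illustration.

```r
df <- data.frame(
  start_station = c('Apple', 'Bungalow', 'Carrot', 'Apple', 'Apple', 'Bungalow'),
  end_station   = c('Bungalow', 'Apple', 'Carrot', 'Bungalow', 'Bungalow', 'Apple'),
  start_lat     = c(12.3456, 23.4567, 34.5678, 12.3456, 12.3456, 23.4567),
  start_lng     = c(9.8765, 98.7654, 87.6543, 9.8765, 9.8765, 98.7654)
)

# Give every ride a weight of 1 so that summing the weights yields a count
df$ride_count <- 1L

# Group by the station pair plus the (assumed constant) coordinates and sum the weights
counts <- aggregate(
  ride_count ~ start_station + end_station + start_lat + start_lng,
  data = df,
  FUN  = sum
)

# Sort by descending frequency and reset the row names
counts <- counts[order(counts$ride_count, decreasing = TRUE), ]
rownames(counts) <- NULL
print(counts)
```

The result is an ordinary data frame, so it can be passed on to further manipulation or plotting.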

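A uj5u.com user replied:

Edit: a dplyr approach. This adds a count to each station combination and then removes the duplicate rows. It may not "summarize" the other columns in the way you showed if their contents vary; it works in this example only because lat and lng are consistent for each station.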

library(tidyverse)

df <- data.frame(start_station = c('Apple', 'Bungalow', 'Carrot', 'Apple', 'Apple', 'Bungalow'),
                 end_station = c('Bungalow', 'Apple', 'Carrot', 'Bungalow', 'Bungalow', 'Apple'),
                 start_lat = c(12.3456, 23.4567, 34.5678, 12.3456, 12.3456, 23.4567),
                 start_lng = c(09.8765, 98.7654, 87.6543, 09.8765, 09.8765, 98.7654)
)

df %>%
  add_count(start_station, end_station, name = 'ride_count') %>%
  distinct()
#>   start_station end_station start_lat start_lng ride_count
#> 1         Apple    Bungalow   12.3456    9.8765          3
#> 2      Bungalow       Apple   23.4567   98.7654          2
#> 3        Carrot      Carrot   34.5678   87.6543          1
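If the extra columns might differ between rides on the same route, plain distinct() would keep more than one row per station pair. A hedged variant is sketched below: it keeps only the first row seen for each pair via distinct(..., .keep_all = TRUE) and then sorts by descending count. Keeping the first row is an assumption about what is acceptable for the data; the name ride_counts is chosen here for illustration.

```r
library(dplyr)

df <- data.frame(
  start_station = c('Apple', 'Bungalow', 'Carrot', 'Apple', 'Apple', 'Bungalow'),
  end_station   = c('Bungalow', 'Apple', 'Carrot', 'Bungalow', 'Bungalow', 'Apple'),
  start_lat     = c(12.3456, 23.4567, 34.5678, 12.3456, 12.3456, 23.4567),
  start_lng     = c(9.8765, 98.7654, 87.6543, 9.8765, 9.8765, 98.7654)
)

ride_counts <- df %>%
  # Attach the per-pair count to every row without collapsing the data
  add_count(start_station, end_station, name = "ride_count") %>%
  # Keep one row per station pair, retaining the other columns from its first row
  distinct(start_station, end_station, .keep_all = TRUE) %>%
  # Most frequent combinations first
  arrange(desc(ride_count))

print(ride_counts)
```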

If you repost this, please credit the source. Link to this article: https://www.uj5u.com/net/525508.html
