After calculating the average cost for each city for 1900-1910, I need to drop every city that has missing data between 1860 and 1863. Here is the data for that time span:
city cost1860 cost1861 cost1862 cost1863 cost1864
1 Boston NA NA NA NA NA
2 Los Angeles 1.77643659 3.516253 1.683492 3.573637296 4.4076780
3 Detroit NA NA NA NA NA
4 New York City NA NA NA NA NA
5 Chicago 32.87500913 39.785973 35.471498 24.683812800 19.5488509
6 Memphis NA NA NA NA NA
7 Seattle NA NA NA NA NA
8 St. Louis -0.01007441 4.659959 NA 0.005722915 NA
9 Boulder NA NA NA NA NA
10 Boise NA NA NA NA NA
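There is also data for the following years in separate columns, but I need a way to drop the cities that have any NA value between 1860 and 1863 without dropping all of the data for the following years. When I am done, I should be left with only the cities that have data for every year from 1860 to 1863 (along with their data for the following years, which may contain NA values).
I have been able to drop the cities with missing data between 1860 and 1863, but I can't figure out how to do it without also losing all of the data for the following years. Here is my code: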
na.exclude(mydata[, 2:5])
mydata_1860_1863 <- na.exclude(mydata[, 2:5])
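Does anyone know how I can drop the cities with missing data between 1860 and 1863 while keeping the data for the following years?
Answer from a uj5u.com user:
It is hard to delete rows based on only some columns while still keeping the other columns. Instead of deleting, why not flag the rows that are missing 1860-1863 data so that you can filter on them later?
For example: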
library(dplyr)
mydata <- mydata %>%
mutate(is_missing = ifelse(is.na(rowSums(.[, 2:5])), 1, 0))
Result:
city cost1860 cost1861 cost1862 cost1863 cost1864 is_missing
1 Boston NA NA NA NA NA 1
2 Los Angeles 1.77643659 3.516253 1.683492 3.573637296 4.407678 0
3 Detroit NA NA NA NA NA 1
4 New York City NA NA NA NA NA 1
5 Chicago 32.87500913 39.785973 35.471498 24.683812800 19.548851 0
6 Memphis NA NA NA NA NA 1
7 Seattle NA NA NA NA NA 1
8 St. Louis -0.01007441 4.659959 NA 0.005722915 NA 1
9 Boulder NA NA NA NA NA 1
10 Boise NA NA NA NA NA 1
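Once the flag exists, filtering on it is a one-liner. The following is a minimal sketch in base R, assuming the data frame is called `mydata` as in the question; the flag is recomputed here without dplyr only so the snippet runs on its own, and the name `kept` is just an illustrative choice.

```r
# Rebuild the example data from the question (values copied from the table above).
mydata <- data.frame(
  city = c("Boston", "Los Angeles", "Detroit", "New York City", "Chicago",
           "Memphis", "Seattle", "St. Louis", "Boulder", "Boise"),
  cost1860 = c(NA, 1.77643659, NA, NA, 32.87500913, NA, NA, -0.01007441, NA, NA),
  cost1861 = c(NA, 3.516253, NA, NA, 39.785973, NA, NA, 4.659959, NA, NA),
  cost1862 = c(NA, 1.683492, NA, NA, 35.471498, NA, NA, NA, NA, NA),
  cost1863 = c(NA, 3.573637296, NA, NA, 24.6838128, NA, NA, 0.005722915, NA, NA),
  cost1864 = c(NA, 4.407678, NA, NA, 19.5488509, NA, NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Base-R version of the flag: rowSums() returns NA when any of columns 2:5 is NA.
mydata$is_missing <- ifelse(is.na(rowSums(mydata[, 2:5])), 1, 0)

# Keep only the unflagged rows; all columns, including cost1864, are retained.
kept <- mydata[mydata$is_missing == 0, ]
print(kept)
```

Because the row filter is applied to the full data frame rather than to the `2:5` subset, the later-year columns come along untouched.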
Answer from a uj5u.com user:
Here is a solution based on data.table:
library(data.table)
dt <- data.table::data.table(city = c("Boston","Los Angeles", "Detroit","New York City","Chicago","Memphis","Seattle", "St. Louis","Boulder","Boise"), cost1860 = c(NA,1.77643659,NA,NA, 32.87500913,NA,NA,-0.01007441,NA,NA), cost1861 = c(NA,3.516253,NA,NA,39.785973, NA,NA,4.659959,NA,NA), cost1862 = c(NA, 1.683492, NA, NA, 35.471498, NA, NA, NA, NA, NA), cost1863 = c(NA,3.573637296,NA,NA, 24.6838128,NA,NA,0.005722915,NA,NA), cost1864 = c(NA, 4.407678, NA, NA, 19.5488509, NA, NA, NA, NA, NA) )
dt[dt[,!is.na(rowSums(.SD)),.SDcols=-c(1,6)]]
#> city cost1860 cost1861 cost1862 cost1863 cost1864
#> 1: Los Angeles 1.776437 3.516253 1.683492 3.573637 4.407678
#> 2: Chicago 32.875009 39.785973 35.471498 24.683813 19.548851
Now, a tidyverse approach:
library(tidyverse)
df <- data.frame(stringsAsFactors = FALSE, city = c("Boston", "Los Angeles","Detroit","New York City","Chicago", "Memphis","Seattle","St. Louis","Boulder","Boise"), cost1860 = c(NA,1.77643659,NA, NA,32.87500913,NA,NA,-0.01007441,NA,NA), cost1861 = c(NA,3.516253,NA, NA,39.785973,NA,NA,4.659959,NA,NA), cost1862 = c(NA,1.683492,NA, NA,35.471498,NA,NA,NA,NA,NA), cost1863 = c(NA,3.573637296,NA, NA,24.6838128,NA,NA,0.005722915,NA,NA), cost1864 = c(NA,4.407678,NA, NA,19.5488509,NA,NA,NA,NA,NA))
df %>%
filter(if_all(2:5, ~ !is.na(.x))) # if_all() replaces the deprecated use of across() inside filter() (dplyr >= 1.0.4)
#> city cost1860 cost1861 cost1862 cost1863 cost1864
#> 1 Los Angeles 1.776437 3.516253 1.683492 3.573637 4.407678
#> 2 Chicago 32.875009 39.785973 35.471498 24.683813 19.548851
Answer from a uj5u.com user:
Use is.na() in the i argument of data.table's dt[i, j, by]:
dt[!(is.na(cost1860) | is.na(cost1861) | is.na(cost1862) | is.na(cost1863))]
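Writing out one is.na() per column gets tedious as the year range grows. As a hedged alternative, not taken from any of the answers above, the same row filter can be sketched in base R with `complete.cases()`, building the column names programmatically. The names `year_cols`, `keep`, and `result` are illustrative, and the data frame is assumed to be `mydata` as in the question.

```r
# Rebuild the example data from the question (values copied from the table above).
mydata <- data.frame(
  city = c("Boston", "Los Angeles", "Detroit", "New York City", "Chicago",
           "Memphis", "Seattle", "St. Louis", "Boulder", "Boise"),
  cost1860 = c(NA, 1.77643659, NA, NA, 32.87500913, NA, NA, -0.01007441, NA, NA),
  cost1861 = c(NA, 3.516253, NA, NA, 39.785973, NA, NA, 4.659959, NA, NA),
  cost1862 = c(NA, 1.683492, NA, NA, 35.471498, NA, NA, NA, NA, NA),
  cost1863 = c(NA, 3.573637296, NA, NA, 24.6838128, NA, NA, 0.005722915, NA, NA),
  cost1864 = c(NA, 4.407678, NA, NA, 19.5488509, NA, NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Columns that must be complete: cost1860 through cost1863.
year_cols <- paste0("cost", 1860:1863)

# TRUE for rows with no NA in those columns; other columns are not inspected.
keep <- complete.cases(mydata[, year_cols])

# Index the full data frame by row, so cost1864 and any later columns are kept.
result <- mydata[keep, ]
print(result)
```

The key difference from `na.exclude(mydata[, 2:5])` in the question is that the column subset is used only to decide which rows to keep, while the row selection itself is applied to the whole data frame.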
