I have a text data file (17 columns) that I want to read into R. I am using the read.table() function.
read.table(file="data.txt", header = TRUE, sep = "\t", quote = "",comment.char="")
The problem is that some records span multiple lines (example below):
10 Macron serait-il plus pro-salafiste que Hamon?!
t.co/g29oOgqih1
#Presidentielle2017 FALSE 0 NA 2017-03-02 13:45:08 FALSE NA 837297724378726400 NA <a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a> Trader496 0 FALSE FALSE NA NA
Is there a way to read this kind of record as a single row, or do I have to use fill=TRUE?
Data file: https://pastebin.com/b90VHvSt
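As a quick way to see the problem described above, base R's count.fields() reports how many fields each physical line contains. The sketch below uses a made-up three-column file (the file contents and the names tmp, n_fields and bad_lines are illustrative assumptions, not the real data). Note that fill=TRUE only pads short lines with blank fields; it does not rejoin a record that was split across lines.

```r
# Hypothetical toy file: 3 tab-separated columns, with the last record
# broken across two physical lines.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "no\ttext\tfavorited",
  "1\tfirst tweet\tFALSE",
  "2\tsecond tweet, part one",
  "part two\tFALSE"
), tmp)

# Count tab-separated fields on each physical line, using the same
# sep/quote/comment.char settings as the read.table() call.
n_fields <- count.fields(tmp, sep = "\t", quote = "", comment.char = "")
n_fields

# Lines whose field count differs from the header's are the suspicious ones.
bad_lines <- which(n_fields != n_fields[1])
bad_lines
```

On the real file the header itself may be malformed, so comparing against the expected 17 rather than against n_fields[1] may be the safer choice.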
A uj5u.com user replied:
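The readr::melt_*() or meltr::melt_*() functions are useful for malformed data. Cleaning it can be a very tedious task, so I will demonstrate some of the functionality and a workflow without fully cleaning this data.
This looks tab-delimited, so we will start with melt_tsv():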
library(readr)
library(dplyr)
library(tidyr)
data_raw <- melt_tsv("https://pastebin.com/raw/b90VHvSt")
data_raw
#> # A tibble: 281 × 4
#> row col data_type value
#> <dbl> <dbl> <chr> <chr>
#> 1 1 1 character no text
#> 2 1 2 character favorited
#> 3 1 3 character favoriteCount
#> 4 1 4 character replyToSN
#> 5 1 5 character created
#> 6 1 6 character truncated
#> 7 1 7 character replyToSID
#> 8 1 8 character id
#> 9 1 9 character replyToUID
#> 10 1 10 character statusSource
#> # … with 271 more rows
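This reads the data in one token at a time, along with information about each token's position and data type. To begin with, the first two column names appear to be separated by a space rather than a tab, so they were read in as a single token. We can fix this and then join the corrected headers onto the rest of the data.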
headers_fixed <- data_raw %>%
filter(row == 1, col != 1) %>%
mutate(col = col + 1) %>%
select(col, col_name = value) %>%
add_row(col = c(1, 2), col_name = c("no", "text"), .before = 1)
data_raw <- data_raw %>%
filter(row != 1) %>%
left_join(headers_fixed) %>%
add_count(row, name = "row_cols")
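I also added a count variable showing the number of columns in each row. Every row should have 17 columns, so we can use it to filter out the "good" rows and pivot them into a wide table.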
data_ok <- data_raw %>%
filter(row_cols == 17) %>%
select(row, col_name, value) %>%
pivot_wider(names_from = col_name) %>%
type_convert()
data_ok
#> # A tibble: 12 × 18
#> row no text favor…¹ favor…² reply…³ created trunc…⁴ reply…⁵
#> <dbl> <dbl> <chr> <lgl> <dbl> <lgl> <dttm> <lgl> <lgl>
#> 1 2 1 "RT … FALSE 0 NA 2017-03-02 13:45:34 FALSE NA
#> 2 3 2 "Ne … FALSE 0 NA 2017-03-02 13:45:32 FALSE NA
#> 3 4 3 "Il … FALSE 0 NA 2017-03-02 13:45:29 FALSE NA
#> 4 5 4 "RT … FALSE 0 NA 2017-03-02 13:45:26 FALSE NA
#> 5 6 5 "RT … FALSE 0 NA 2017-03-02 13:45:26 FALSE NA
#> 6 7 6 "RT … FALSE 0 NA 2017-03-02 13:45:25 FALSE NA
#> 7 8 7 "RT … FALSE 0 NA 2017-03-02 13:45:13 FALSE NA
#> 8 9 8 "#Pr… FALSE 0 NA 2017-03-02 13:45:10 FALSE NA
#> 9 10 9 "#Pr… FALSE 0 NA 2017-03-02 13:45:10 FALSE NA
#> 10 16 11 "RT … FALSE 0 NA 2017-03-02 13:44:58 FALSE NA
#> 11 21 13 "RT … FALSE 0 NA 2017-03-02 13:44:46 FALSE NA
#> 12 26 15 "Dim… FALSE 0 NA 2017-03-02 13:44:41 FALSE NA
#> # … with 9 more variables: id <dbl>, replyToUID <lgl>, statusSource <chr>,
#> # screenName <chr>, retweetCount <dbl>, isRetweet <lgl>, retweeted <lgl>,
#> # longitude <lgl>, latitude <lgl>, and abbreviated variable names ¹favorited,
#> # ²favoriteCount, ³replyToSN, ⁴truncated, ⁵replyToSID
This leaves us with 61 values in 13 "bad" rows. Diagnosing and fixing these will take more work, which is left as an exercise for the reader.
data_bad <- data_raw %>%
filter(row_cols != 17)
data_bad
#> # A tibble: 61 × 6
#> row col data_type value col_n…¹ row_c…²
#> <dbl> <dbl> <chr> <chr> <chr> <int>
#> 1 11 1 integer 10 no 2
#> 2 11 2 character Macron serait-il plus pro-salafiste qu… text 2
#> 3 12 1 character <url shortener rmvd> no 1
#> 4 13 1 missing <NA> no 1
#> 5 14 1 missing <NA> no 1
#> 6 15 1 character #Presidentielle2017 no 16
#> 7 15 2 logical FALSE text 16
#> 8 15 3 integer 0 favori… 16
#> 9 15 4 missing <NA> favori… 16
#> 10 15 5 datetime 2017-03-02 13:45:08 replyT… 16
#> # … with 51 more rows, and abbreviated variable names ¹col_name, ²row_cols
Created on 2022-11-09 with reprex v2.0.2
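One possible way to approach the remaining broken records, sketched here in base R on the raw lines rather than on the melted tokens, is to count tabs and keep gluing physical lines together until a record has the expected number of separators. This is only a sketch under assumptions: the sample lines are made up, it uses 4 fields instead of 17, and it assumes the last field of a record is never the one that gets split. The names lines, n_tabs, record_id and stitched are illustrative.

```r
# Toy stand-in for the real file: 4 fields per record instead of 17.
lines <- c(
  "id\ttext\tflag\tn",
  "1\thello world\tFALSE\t0",
  "2\tsplit over",
  "",
  "two lines\tTRUE\t5"
)
n_fields <- 4L

# Number of tab characters on each physical line.
n_tabs <- lengths(regmatches(lines, gregexpr("\t", lines, fixed = TRUE)))

# Walk through the lines, accumulating tabs until a record is complete
# (a complete record has n_fields - 1 tabs).
record_id <- integer(length(lines))
current <- 1L
tabs_so_far <- 0L
for (i in seq_along(lines)) {
  record_id[i] <- current
  tabs_so_far <- tabs_so_far + n_tabs[i]
  if (tabs_so_far >= n_fields - 1L) {
    current <- current + 1L
    tabs_so_far <- 0L
  }
}

# Glue the pieces of each record together with a space, skipping empty lines.
keep <- nzchar(lines)
stitched <- unname(vapply(
  split(lines[keep], record_id[keep]),
  paste, character(1), collapse = " "
))

# The stitched lines can now be parsed as an ordinary tab-separated table.
df <- read.table(text = stitched, header = TRUE, sep = "\t",
                 quote = "", comment.char = "", stringsAsFactors = FALSE)
df
```

For the real data, n_fields would be 17, and the header line would still need the separate fix for the "no text" column names.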
A uj5u.com user replied:
This works for your data sample:
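# Read the raw physical lines and join them into one long string.
data <- readLines("b90VHvSt.txt")
data <- paste(data, collapse = " ")
# Where 15 tab-terminated fields are followed by a space-free run and a space,
# replace that space with a tab.
data <- gsub("(([^\\t]*\\t){15}[^ ]+) ", "\\1\t", data, perl = TRUE)
# Split into individual fields.
data <- unlist(strsplit(data, "\t"))
# Insert a "?" placeholder after the 9th field.
data <- append(data, "?", 9)
# Arrange as 17 rows, drop the first (header) column, and transpose.
data <- matrix(data, nrow = 17)
data <- as.data.frame(t(data[,-1]), row.names = data[,1])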
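To see how the regular-expression idea in this answer behaves, here is a small self-contained sketch on made-up data with 4 fields per record, so the repeat count is 2 (number of fields minus 2) where the real file would need 15. It relies on the same assumption as the answer: the last two fields of every record contain no spaces, so the only space that can follow them is the one introduced when the lines were joined. The sample lines and the names x, fields, m and df are illustrative.

```r
# Toy stand-in: 4 fields per record, one record split across two lines.
lines <- c(
  "id\ttext\tflag\tn",
  "1\thello world\tFALSE\t0",
  "2\tsplit over",
  "two lines\tTRUE\t5"
)

# Join all physical lines with spaces into one long string.
x <- paste(lines, collapse = " ")

# Two tab-terminated fields, then a space-free run (the last two fields and
# the tab between them), then the joining space, which becomes a tab.
x <- gsub("(([^\\t]*\\t){2}[^ ]+) ", "\\1\t", x, perl = TRUE)

# Every field is now tab-separated, so split and reshape: 4 fields per column.
fields <- strsplit(x, "\t", fixed = TRUE)[[1]]
m <- matrix(fields, nrow = 4)

# First column holds the header; the remaining columns are records.
df <- as.data.frame(t(m[, -1, drop = FALSE]), stringsAsFactors = FALSE)
names(df) <- m[, 1]
df
```

All columns come back as character here; utils::type.convert() could be applied afterwards if typed columns are needed.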