Dataset
Please suggest the best way to read data of this kind into a data frame in R.
Using read.table("Software.txt") only gives an error:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 1 did not have 6 elements.
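Also, this data (the Amazon dataset) is not in the conventional row-and-column format, so any help with that would be appreciated as well.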
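Answer from a uj5u.com user:
Your data appears to be in the same format as the "Debian control file" (DCF) format used to store package metadata. The appropriate import function for such data is
read.dcf("Software.txt")
See the ?read.dcf help page for more information. Note that read.dcf returns a character matrix, so pass the result to as.data.frame() if you need a data frame.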
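To make the read.dcf suggestion concrete, here is a minimal sketch on a tiny made-up sample rather than the real Software.txt. The record contents, the temporary file, and the clean-up steps (make.names, mapping "unknown" to NA, type.convert) are assumptions for illustration. The real file may contain lines that read.dcf rejects.

```r
## Two made-up records in "key: value" layout, separated by a blank line
sample_txt <- c(
  "product/productId: B000000001",
  "product/price: 8.88",
  "review/score: 2.0",
  "review/summary: Requires too much coordination",
  "",
  "product/productId: B000000002",
  "product/price: unknown",
  "review/score: 5.0",
  "review/summary: Great: would buy again"
)
tmp <- tempfile(fileext = ".txt")
writeLines(sample_txt, tmp)

m  <- read.dcf(tmp)                              ## character matrix, one row per record
df <- as.data.frame(m, stringsAsFactors = FALSE) ## coerce to data frame
names(df) <- make.names(names(df))               ## "product/price" -> "product.price"
df[df == "unknown"] <- NA                        ## treat "unknown" as missing
df <- type.convert(df, as.is = TRUE)             ## numeric-looking columns become numeric
str(df)
```

Every column comes back as character from read.dcf, which is why the type.convert step is included.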
Answer from a uj5u.com user:
Here is an approach based on readLines.
r1 <- readLines('~/Downloads/Software.txt') ## read raw text
r2 <- r1[r1 != ''] ## remove blank elements, realize repeats every 10th
r3 <- regmatches(r2, regexpr(': ', r2), invert=TRUE) ## split at the first `: ` only, so values containing `: ` stay intact
## remove part before `: ` and make matrix with 10 rows
r4 <- matrix(sapply(r3, `[`, 2), 10, dimnames=list(sapply(r3[1:10], `[`, 1), NULL))
r5 <- as.data.frame(t(r4)) ## transpose and coerce to df
r6 <- setNames(r5, make.names(names(r5))) ## names
r6[r6 == 'unknown'] <- NA ## generate NA's
r7 <- type.convert(r6, as.is=TRUE) ## convert proper classes
Of course, you can simplify this somewhat. I just wanted to show you the individual steps.
Result of the step-by-step code:
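One possible condensed version is sketched below. The helper name parse_records and the sample lines are hypothetical. Like the step-by-step code, it assumes every record lists the same fields in the same order, but it infers the number of fields instead of hard-coding 10.

```r
## Hypothetical helper: parse "key: value" records separated by blank lines
parse_records <- function(lines) {
  lines  <- lines[nzchar(lines)]          ## drop blank separator lines
  keys   <- sub(": .*$", "", lines)       ## text before the first ": "
  values <- sub("^[^:]*: ", "", lines)    ## text after the first ": "
  fields <- unique(keys)
  ## Assumption: every record has the same fields in the same order
  stopifnot(length(lines) %% length(fields) == 0,
            all(keys == rep(fields, length.out = length(keys))))
  m  <- matrix(values, ncol = length(fields), byrow = TRUE,
               dimnames = list(NULL, make.names(fields)))
  df <- as.data.frame(m, stringsAsFactors = FALSE)
  df[df == "unknown"] <- NA               ## generate NAs
  type.convert(df, as.is = TRUE)          ## convert to proper classes
}

## Made-up sample input; for the real file use readLines("Software.txt")
sample_txt <- c(
  "product/productId: B000000001",
  "product/price: 8.88",
  "review/score: 2.0",
  "review/summary: Note: requires too much coordination",
  "",
  "product/productId: B000000002",
  "product/price: unknown",
  "review/score: 5.0",
  "review/summary: Works fine"
)
res <- parse_records(sample_txt)
str(res)
```

The stopifnot check makes the helper fail early if a record is missing a field or a line has no `: ` separator, rather than silently misaligning columns.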
str(r7)
# 'data.frame': 95084 obs. of 10 variables:
# $ product.productId : chr "B000068VBQ" "B000068VBQ" "B000068VBQ" "B000068VBQ" ...
# $ product.title : chr "Fisher-Price Rescue Heroes" "Fisher-Price Rescue Heroes" ...
# $ product.price : num 8.88 8.88 8.88 8.88 8.88 8.88 8.88 NA NA NA ...
# $ review.userId : chr NA NA "A10P44U29RNOT6" NA ...
# $ review.profileName: chr NA NA "D. Jones" NA ...
# $ review.helpfulness: chr "11/11" "9/10" "6/6" "4/4" ...
# $ review.score : num 2 2 1 1 4 5 1 4 5 4 ...
# $ review.time : int 1042070400 1041552000 1126742400 1042416000 1045008000 ...
# $ review.summary : chr "Requires too much coordination" "You can't pick which ...
# $ review.text : chr "I bought this software for my 5 year old. He has a couple ...
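Two columns in the output above may need further processing before analysis. The sketch below rests on two assumptions that the thread does not confirm: that review.time holds seconds since 1970-01-01 UTC, and that review.helpfulness means "helpful votes / total votes". The new column names are made up for illustration.

```r
## A few values copied from the str() output above
reviews <- data.frame(
  review.helpfulness = c("11/11", "9/10", "6/6", "4/4"),
  review.time        = c(1042070400L, 1041552000L, 1126742400L, 1042416000L),
  stringsAsFactors   = FALSE
)

## Assumption: review.time is a Unix timestamp in seconds (UTC)
reviews$review.date <- as.Date(
  as.POSIXct(reviews$review.time, origin = "1970-01-01", tz = "UTC")
)

## Assumption: helpfulness is "helpful votes / total votes"
parts <- strsplit(reviews$review.helpfulness, "/", fixed = TRUE)
reviews$helpful_votes <- as.integer(sapply(parts, `[`, 1))
reviews$total_votes   <- as.integer(sapply(parts, `[`, 2))

str(reviews)
```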
