Reading a CSV with Pandas

I'm trying to use pandas to read data from https://download.bls.gov/pub/time.series/bp/bp.measure, like this:

import pandas as pd

url = 'https://download.bls.gov/pub/time.series/bp/bp.measure'
df = pd.read_csv(url, sep='\t')

However, I only need a dataset with two columns: measure_code and measure_text. Because the file starts with a title line, BP measure, I also tried skipping it:

url = 'https://download.bls.gov/pub/time.series/bp/bp.measure'
df = pd.read_csv(url, sep='\t', skiprows=1)

But in that case it returns a dataset with only one column, and I have not been able to split it:

>>> df.columns
Index([' measure_code         measure_text'], dtype='object')

Any suggestions or ideas for a better way to get this dataset?

Answer:

This is definitely possible, but the format has a few quirks.

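1. As you noted, the column headers start on line 2, so you need skiprows=1.
2. The file is space-delimited, not tab-delimited.
3. Column values continue across multiple lines.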

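Problems 1 and 2 can be fixed with skiprows and sep. Problem 3 is harder and requires some preprocessing of the file. For that reason, I fetched the file in a more flexible way, using the requests library. Once I have the file, I use a regular expression to fix problem 3 and then hand the result to pandas.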
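To illustrate the fix for problem 3 in isolation, here is a minimal sketch that runs on an invented in-memory sample rather than the real download. The sample text only mimics the layout described above (a title line, a header line, and a value that wraps onto an indented line); the actual file contents may differ.

```python
import re

# Invented sample mimicking the assumed layout: a title line, a header line,
# and one value that wraps onto an indented continuation line.
sample = (
    "BP measure\r\n"
    " measure_code         measure_text\r\n"
    "1                    First description that is long and\r\n"
    "                     continues on the next line\r\n"
    "2                    Second description\r\n"
)

# Normalize DOS line breaks, then fold any line that starts with
# 7 or more spaces into the line before it.
text = sample.replace("\r\n", "\n")
joined = re.sub(r"\n {7,}", " ", text)

lines = joined.splitlines()
for line in lines:
    print(repr(line))
```

The header line starts with a single space, so it is not treated as a continuation; only lines indented by at least 7 spaces are merged.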

Here is the code:

import requests
import re
import io
import pandas as pd

url = 'https://download.bls.gov/pub/time.series/bp/bp.measure'

# Get the URL, convert the document from DOS to Unix linebreaks
measure_codes = requests.get(url) \
    .text \
    .replace("\r\n", "\n")

# If there's a linebreak, followed by at least 7 spaces, combine it with
# previous line
measure_codes = re.sub("\n {7,}", " ", measure_codes)

# Convert the string to a file-like object
measure_codes = io.BytesIO(measure_codes.encode('utf-8'))

# Read in file, interpreting 4 spaces or more as a delimiter.
# Using a regex like this requires using the slower Python engine.
# Use skiprows=1 to skip the header
# Use dtype="str" to avoid converting measure code to integer.
df = pd.read_csv(measure_codes, engine="python", sep=" {4,}", skiprows=1, dtype="str")

print(df)
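If you want to try the parsing step without network access, the following sketch may help. It assumes the continuation lines have already been joined and uses an invented sample in place of the real file. It also uses io.StringIO instead of io.BytesIO, which is a substitution on my part, and strips whitespace from the column names in case the leading space in the header line survives parsing.

```python
import io
import pandas as pd

# Invented sample: the assumed layout after continuation lines
# have been joined (the real file contents may differ).
joined = (
    "BP measure\n"
    " measure_code         measure_text\n"
    "1                    First description\n"
    "2                    Second description\n"
)

df = pd.read_csv(
    io.StringIO(joined),
    engine="python",  # regex separators need the Python engine
    sep=r" {4,}",     # 4 or more spaces act as the delimiter
    skiprows=1,       # skip the "BP measure" title line
    dtype="str",      # keep measure codes as strings
)

# Defensive cleanup: remove any stray whitespace around column names,
# then keep only the two columns of interest.
df.columns = df.columns.str.strip()
df = df[["measure_code", "measure_text"]]
print(df)
```

Selecting the two columns by name at the end makes it obvious if the header was not parsed as expected, since the lookup raises a KeyError instead of silently returning a single wide column.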
