I am trying to read data from https://download.bls.gov/pub/time.series/bp/bp.measure with pandas, like this:
import pandas as pd
url = 'https://download.bls.gov/pub/time.series/bp/bp.measure'
df = pd.read_csv(url, sep='\t')
However, I only need a dataset with two columns: measure_code and measure_text. Because the file starts with a title line, BP measure, I also tried reading it like this:
url = 'https://download.bls.gov/pub/time.series/bp/bp.measure'
df = pd.read_csv(url, sep='\t', skiprows=1)
But in that case it returns a dataset with only one column, which I cannot split into the two columns I need:
>>> df.columns
Index([' measure_code measure_text'], dtype='object')
Any suggestions or ideas for a better way to get this dataset?
Answer:
This is definitely possible, but the format has a few quirks.
- As you noted, the column headers start on line 2, so you need skiprows=1.
- The file is delimited by spaces, not tabs.
- Column values continue across multiple lines.
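Problems 1 and 2 can be fixed with skiprows and sep. Problem 3 is harder and requires some preprocessing of the file. For that reason, I fetch the file in a more flexible way, using the requests library. Once I have the file, I use a regular expression to fix problem 3 and then hand the result to pandas.
Here is the code: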
import requests
import re
import io
import pandas as pd
url = 'https://download.bls.gov/pub/time.series/bp/bp.measure'
# Get the URL, convert the document from DOS to Unix linebreaks
measure_codes = requests.get(url) \
.text \
.replace("\r\n", "\n")
# If there's a linebreak, followed by at least 7 spaces, combine it with
# previous line
measure_codes = re.sub("\n {7,}", " ", measure_codes)
# Convert the string to a file-like object
measure_codes = io.BytesIO(measure_codes.encode('utf-8'))
# Read in file, interpreting 4 spaces or more as a delimiter.
# Using a regex like this requires using the slower Python engine.
# Use skiprows=1 to skip the title line above the column headers
# Use dtype="str" to avoid converting measure code to integer.
df = pd.read_csv(measure_codes, engine="python", sep=" {4,}", skiprows=1, dtype="str")
print(df)
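If you want to check the line-joining step without hitting the network, here is a minimal offline sketch. The sample text is made up: it only imitates the layout described above (a title line, a header line, and a value that wraps onto an indented continuation line), and the real file's exact spacing may differ. The helper name parse_measure_text is hypothetical too, and I use io.StringIO here instead of encoding to bytes.

```python
import io
import re

import pandas as pd

# Hypothetical sample, NOT the real BLS file: a title line, a header line,
# and one value that wraps onto an indented continuation line.
SAMPLE = (
    "                BP measure\r\n"
    "measure_code    measure_text\r\n"
    "1               First measure description that\r\n"
    "                continues on the next line\r\n"
    "2               Second measure description\r\n"
)


def parse_measure_text(raw: str) -> pd.DataFrame:
    """Parse space-aligned text whose values may wrap onto indented lines."""
    # Normalize DOS line breaks to Unix line breaks.
    text = raw.replace("\r\n", "\n")
    # A line break followed by 7 or more spaces marks a continuation line,
    # so fold it into the previous line with a single space.
    text = re.sub(r"\n {7,}", " ", text)
    # Treat runs of 4 or more spaces as the delimiter (needs the Python
    # engine), skip the title line, and keep every value as a string.
    return pd.read_csv(
        io.StringIO(text),
        engine="python",
        sep=r" {4,}",
        skiprows=1,
        dtype=str,
    )


df = parse_measure_text(SAMPLE)
print(df)
```

The thresholds (7 spaces for a continuation, 4 for a delimiter) are the same ones used in the code above; if the real file indents its rows differently, they would need adjusting.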
