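This is an optimization question: I have working code that reads data from a CSV file and builds a Python dictionary from it, converting strings such as True or false to their boolean equivalents: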
import csv

def _load_metadata(path):
    """Loads the metadata from the given file and converts boolean strings to booleans."""
    filedict = {}
    with open(path) as csvfile:
        reader = csv.DictReader(csvfile, delimiter=",")
        for row in reader:
            newrow = {}
            for key, value in dict(row).items():
                if value.lower() == "false":
                    newrow[key] = False
                elif value.lower() == "true":
                    newrow[key] = True
                else:
                    newrow[key] = value
            filedict[row["name"]] = newrow
    return filedict
But I am wondering whether there is a better / more pythonic way to handle this?
A uj5u.com user replied:
The solution from my comment:
import csv

def _load_metadata(path):
    """Loads the metadata from the given file and converts boolean strings to booleans."""
    filedict = {}
    with open(path) as csvfile:
        reader = csv.DictReader(csvfile, delimiter=",")
        for row in reader:
            newrow = {}
            for key, value in dict(row).items():
                # .get() returns the original value as the default when
                # value.lower() is neither "false" nor "true"
                newrow[key] = {"false": False, "true": True}.get(value.lower(), value)
            filedict[row["name"]] = newrow
    return filedict
See the Python documentation for the dict method .get().
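To illustrate why the one-liner above works, here is a minimal sketch of dict.get() falling back to its second argument when the key is missing. The names BOOL_MAP and to_bool_or_str are hypothetical and do not appear in the original answer.

```python
# Hypothetical names for illustration; not part of the original answer.
BOOL_MAP = {"false": False, "true": True}


def to_bool_or_str(value):
    """Return True/False for 'true'/'false' (any case), else the value unchanged."""
    # dict.get(key, default) returns `default` when `key` is not in the dict.
    return BOOL_MAP.get(value.lower(), value)


converted = [to_bool_or_str(v) for v in ["True", "FALSE", "value", ""]]
print(converted)  # [True, False, 'value', '']
```

Defining the mapping once at module level, rather than inside the loop, also avoids rebuilding the dictionary for every cell.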
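A uj5u.com user replied:
Your code is fine. In general, you don't need to worry too much about optimizing pure Python code. If you really need better performance, you should consider using a package built specifically for your task (or, for example, writing your own extension module in C). Some popular tools for working with large data are NumPy, pandas, and PyArrow.
With pandas, you would do something like this: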
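import pandas as pd

df = pd.read_csv(path, engine='pyarrow')
# Anchor the patterns so that only cells that are exactly "true"/"false"
# (case-insensitive) are replaced, matching the pure-Python version
df = df.replace(r'(?i)^true$', True, regex=True)
df = df.replace(r'(?i)^false$', False, regex=True)
filedict = {row['name']: row.to_dict() for row in df.iloc}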
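Unfortunately, this seems to be much slower than processing the file row by row in pure Python, but I am new to pandas, and I am sure there is a better way to do this.
As for pure Python, some of the most obvious ways to optimize the original code did not yield significant improvements: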
import csv
import random
import timeit

path = 'temp.csv'
N_ROWS = 100_000
N_TESTS = 100

# Column names 'A' through 'Z'
cols = [chr(k) for k in range(ord('A'), ord('Z') + 1)]
choices = ['value', 'True', 'False']

# Generate a test file filled with random values
with open(path, 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['name'] + cols)
    writer.writeheader()
    for k in range(N_ROWS):
        row = {key: random.choice(choices) for key in cols}
        row['name'] = f'name_{k}'
        writer.writerow(row)

def load_1(path):
    """Original function"""
    filedict = {}
    with open(path) as csvfile:
        reader = csv.DictReader(csvfile, delimiter=",")
        for row in reader:
            newrow = {}
            for key, value in dict(row).items():
                if value.lower() == "false":
                    newrow[key] = False
                elif value.lower() == "true":
                    newrow[key] = True
                else:
                    newrow[key] = value
            filedict[row["name"]] = newrow
    return filedict

def load_2(path):
    """With dict.get method"""
    bool_map = {'true': True, 'false': False}
    filedict = {}
    with open(path) as csvfile:
        reader = csv.DictReader(csvfile, delimiter=",")
        for row in reader:
            newrow = {}
            for key, value in dict(row).items():
                newrow[key] = bool_map.get(value.lower(), value)
            filedict[row["name"]] = newrow
    return filedict

def load_3(path):
    """With walrus operator"""
    filedict = {}
    with open(path) as csvfile:
        reader = csv.DictReader(csvfile, delimiter=",")
        for row in reader:
            newrow = {}
            for key, value in dict(row).items():
                if (value_lower := value.lower()) == "false":
                    newrow[key] = False
                elif value_lower == "true":
                    newrow[key] = True
                else:
                    newrow[key] = value
            filedict[row["name"]] = newrow
    return filedict

def load_4(path):
    """With walrus operator and dict comprehension"""
    filedict = {}
    with open(path) as csvfile:
        for row in csv.DictReader(csvfile, delimiter=","):
            filedict[row["name"]] = {
                key: True if (value_str := value.lower()) == 'true'
                else False if value_str == 'false'
                else value
                for key, value in row.items()
            }
    return filedict

def load_5(path):
    """With walrus operator and nested dict comprehension"""
    with open(path) as csvfile:
        return {
            row['name']: {
                key: True if (value_str := value.lower()) == 'true'
                else False if value_str == 'false'
                else value
                for key, value in row.items()
            }
            for row in csv.DictReader(csvfile, delimiter=',')
        }

# All variants must produce the same result
assert load_1(path) == load_2(path) == load_3(path) == \
    load_4(path) == load_5(path)

# Average time per call over N_TESTS runs
t1 = timeit.timeit('load_1(path)', globals=globals(), number=N_TESTS)/N_TESTS
t2 = timeit.timeit('load_2(path)', globals=globals(), number=N_TESTS)/N_TESTS
t3 = timeit.timeit('load_3(path)', globals=globals(), number=N_TESTS)/N_TESTS
t4 = timeit.timeit('load_4(path)', globals=globals(), number=N_TESTS)/N_TESTS
t5 = timeit.timeit('load_5(path)', globals=globals(), number=N_TESTS)/N_TESTS

print(f'{t1 = :.3f} s')
print(f'{t2 = :.3f} s')
print(f'{t3 = :.3f} s')
print(f'{t4 = :.3f} s')
print(f'{t5 = :.3f} s')
On my (very slow) computer, the results are:
t1 = 2.504 s
t2 = 2.376 s
t3 = 2.288 s
t4 = 2.284 s
t5 = 2.260 s
In my opinion, the original code is definitely more "pythonic", and it is only about 10% slower.
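The "about 10%" figure can be checked directly from the timings reported above. The sketch below only does arithmetic on those numbers; the names timings and slowdown are made up for illustration, and the results would differ on another machine.

```python
# Timings copied from the results reported above (seconds per call).
timings = {
    "load_1": 2.504,
    "load_2": 2.376,
    "load_3": 2.288,
    "load_4": 2.284,
    "load_5": 2.260,
}

fastest = min(timings.values())

# Percentage by which each variant is slower than the fastest one.
slowdown = {name: (t / fastest - 1) * 100 for name, t in timings.items()}

for name, pct in slowdown.items():
    print(f"{name}: {pct:.1f}% slower than the fastest variant")
```

With these numbers, load_1 comes out roughly 10.8% slower than load_5, which is consistent with the estimate above.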
