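I'm working with a very large dataset and am looping over chunks of data to add elements to a class. My data contains a lot of duplicate values, which means I create a class instance for the same data many times. From the testing I've done, creating the class instance appears to be the most expensive part of the operation, so I want to reduce it as much as possible.
My question is: what is the cheapest way (in time) to avoid creating duplicate class instances? Ideally I'd create each instance only once and have every duplicate refer to that same instance. I can't remove the duplicates from my data at the outset, but I want to make sure I minimize any expensive steps.
Here is a toy example that I hope illustrates my problem. The commented-out section shows my idea of how to save time.
In this example, Person has 2 methods that call sleep to simulate the time cost of creating an instance. As written, the code runs in 4.22 seconds ((SLEEP_1 * 6) + (SLEEP_2 * 6)). Since the person "James" appears 3 times, I'm looking for a way to add this person only once and then reference that instance for the 2 duplicates.
I would then expect the code to run in ~2.8s ((SLEEP_1 * 4) + (SLEEP_2 * 4)).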
import time
from collections import defaultdict
SLEEP_1 = 0.2
SLEEP_2 = 0.5
# A class `Person` has a load of methods,
# meaning that creating an instance has a non-negligible time-cost over millions of calls.
class Person:
    def __init__(self, info):
        self._id = info['_id']
        self.name = info['name']
        self.nationality = info['nationality']
        self.age = info['age']
        self.can_drink_in_USA = self.some_long_fun()
        self.can_fly_solo = self.another_costly_fun()

    def some_long_fun(self):
        time.sleep(SLEEP_1)
        if self.age >= 21:
            return True
        return False

    def another_costly_fun(self):
        time.sleep(SLEEP_2)
        if self.age >= 18:
            return True
        return False
# Some data to iterate over
# Note that "James" is present 3 times
teams = {
    "team1": [
        {"_id": "foo", "name": "James", "nationality": "French", "age": 32},
        {"_id": "bar", "name": "Frank", "nationality": "American", "age": 36},
        {"_id": "foo", "name": "James", "nationality": "French", "age": 32}
    ],
    "team2": [
        {"_id": "foo", "name": "James", "nationality": "French", "age": 32},
        {"_id": "baz", "name": "Oliver", "nationality": "British", "age": 26},
        {"_id": "qux", "name": "Josh", "nationality": "British", "age": 42}
    ]
}
seen = defaultdict(int)
team_directory = defaultdict(list)
start_time = time.time()
for team in teams:
    for i, person in enumerate(teams[team]):
        if person['_id'] in seen:
            print(f"{person['name']} [_id: {person['_id']}] already exists in Person class")
            # p = getattr(Person, '_id') == person['_id']
            # team_directory[team].append(p)
            # continue
        print(f"Person {i + 1} = {person['name']}")
        p = Person(info=person)
        team_directory[team].append(p)
        seen[person['_id']] = 1
finish_time = time.time() - start_time
expected_finish = round((SLEEP_1 * 6) + (SLEEP_2 * 6), 2)
print(f"Built a teams directory in {round(finish_time, 2)}s [expect: {expected_finish}s]")
# Loop over the results to check - I want each team to have 3 people
# (so I can't squash duplicates from the outset)
for t in team_directory:
    roster = " ".join([p.name for p in team_directory[t]])
    print(f"Team {t} contains these people: {roster}")
Answer:
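seen can be used as a cache that maps a person's _id to the Person object already created for it.
That would look like this (the code up to and including the main for loop; the rest of the code needs no changes):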
seen = {}
team_directory = defaultdict(list)
start_time = time.time()
for team in teams:
    for i, person in enumerate(teams[team]):
        if person['_id'] in seen:
            print(f"{person['name']} [_id: {person['_id']}] already exists in Person class")
            p = seen[person['_id']]
            team_directory[team].append(p)
            continue
        print(f"Person {i + 1} = {person['name']}")
        p = Person(info=person)
        team_directory[team].append(p)
        seen[person['_id']] = p
An assignment such as seen[person['_id']] = p copies only a reference to the object, not the object itself, so it needs very little extra memory.
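As a rough check of the point above, the sketch below wraps the cache in a hypothetical helper, get_or_create (not part of the original code), and counts how often the constructor actually runs. The Person class here is a stripped-down stand-in with no sleep calls, and the created counter is added purely for illustration.

```python
class Person:
    """Stripped-down stand-in for the Person class in the question."""
    created = 0  # counts how many times __init__ actually runs

    def __init__(self, info):
        Person.created += 1
        self._id = info["_id"]
        self.name = info["name"]


def get_or_create(cache, info):
    """Hypothetical helper: return the cached Person for this _id, building it once."""
    key = info["_id"]
    if key not in cache:
        cache[key] = Person(info)
    return cache[key]


records = [
    {"_id": "foo", "name": "James"},
    {"_id": "bar", "name": "Frank"},
    {"_id": "foo", "name": "James"},
]

cache = {}
people = [get_or_create(cache, r) for r in records]

# Both "James" entries are the very same object, and only two instances were built.
print(people[0] is people[2])  # True
print(Person.created)          # 2
```

The list still has three entries, so each team keeps its full roster; only the number of constructor calls goes down.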
Answer:
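"creating an instance has a non-negligible time-cost over millions of calls"
Then don't call them. Your two examples are derived functions: they only use other attributes, so they can stay as instance methods and their results don't need to be stored in instance fields. Also, you never use those values in the code outside the constructor, so the calls can be removed from there and deferred to whatever code actually needs them.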
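One possible way to defer the work, sketched here as an assumption rather than as what the answer prescribes, is functools.cached_property (Python 3.8+). The derived flags are computed on first access and then stored on the instance, so constructing a Person stays cheap. The class below is a cut-down stand-in for the one in the question.

```python
from functools import cached_property


class Person:
    """Sketch only: the derived flags are computed on first access, not in __init__."""

    def __init__(self, info):
        self._id = info["_id"]
        self.name = info["name"]
        self.age = info["age"]

    @cached_property
    def can_drink_in_USA(self):
        # Runs the first time the attribute is read; the result is then
        # stored in the instance's __dict__ and reused on later reads.
        return self.age >= 21

    @cached_property
    def can_fly_solo(self):
        return self.age >= 18


p = Person({"_id": "baz", "name": "Oliver", "age": 19})
print(p.can_fly_solo)      # True
print(p.can_drink_in_USA)  # False
```

If a value is never read, its function never runs, which is the saving the answer is pointing at.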
Also, for that example code, you only need one function, and you don't need the sleep:
def age_check(age):
    def f(over):
        return age >= over
    return f

age_check(self.age)(18)
age_check(self.age)(21)
Or, more simply:
def age_check(self, over):
    return self.age >= over
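"Need to reference the instance where Person._id == person['_id']. I don't know how to do this efficiently, or at all. Ultimately I need to append it with: team_directory[team].append(p)"
Don't use a list and append. Use a dictionary that maps Person._id to the person instance itself. Then you don't waste cycles iterating over a list to see whether a person already exists.
Obviously, this all assumes your dataset fits in memory.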
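One reading of this advice, sketched below under stated assumptions, is to keep a single dictionary of people keyed by _id and let each team store only the _id values. The names people_by_id and team_members are hypothetical, the data is trimmed from the question, and a plain dict copy stands in for the Person(info) call.

```python
from collections import defaultdict

teams = {
    "team1": [
        {"_id": "foo", "name": "James"},
        {"_id": "bar", "name": "Frank"},
        {"_id": "foo", "name": "James"},
    ],
    "team2": [
        {"_id": "foo", "name": "James"},
        {"_id": "baz", "name": "Oliver"},
    ],
}

people_by_id = {}                 # _id -> the single object built for that person
team_members = defaultdict(list)  # team -> list of _id values, duplicates kept

for team, members in teams.items():
    for info in members:
        # Membership test on a dict is a hash lookup, not a scan over a list.
        if info["_id"] not in people_by_id:
            people_by_id[info["_id"]] = dict(info)  # stand-in for Person(info)
        team_members[team].append(info["_id"])

roster = [people_by_id[_id]["name"] for _id in team_members["team1"]]
print(roster)             # ['James', 'Frank', 'James']
print(len(people_by_id))  # 3
```

Rosters keep their duplicates, while the expensive construction happens once per distinct _id.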