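I have a pandas DataFrame whose file names need to be searched for and matched in a directory tree.
I have been using the following, but it crashes on larger directory structures. I record whether or not each file is present in 2 lists.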
found = []
missed = []
for target_file in df_files['Filename']:
    for (dirpath, dirnames, filenames) in os.walk(DIRECTORY_TREE):
        if target_file in filenames:
            found.append(os.path.join(dirpath,target_file))
        else:
            missed.append(target_file)
print('Found: ',len(found),'Missed: ',len(missed))
print(missed)
I have read that scandir is faster and can handle larger directory trees. If that is true, how could this be rewritten?
My attempt:
found = []
missed = []
for target_file in df_files['Filename']:
    for item in os.scandir(DIRECTORY_TREE):
        if item.is_file() and item.name() == target_file:
            found.append(os.path.join(dirpath,target_file))
        else:
            missed.append(target_file)
print('Found: ',len(found),'Missed: ',len(missed))
print(missed)
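This runs (fast), but everything ends up in the "missed" list.
A uj5u.com user replied:
Scan the directory only once and convert it into a DataFrame.
Example on my venv directory: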
import pandas as pd
import pathlib
DIRECTORY_TREE = pathlib.Path('./venv').resolve()
data = [(str(pth.parent), pth.name) for pth in DIRECTORY_TREE.glob('**/*') if pth.is_file()]
df_path = pd.DataFrame(data, columns=['Directory', 'Filename'])
df_files = pd.DataFrame({'Filename': ['__init__.py']})
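For comparison with the scandir attempt in the question, the same "scan once" idea can be sketched with os.scandir directly. Two things seem worth noting about that attempt: os.scandir only lists a single directory and does not recurse, and entry.name is an attribute rather than a method. The helper names index_tree and split_found_missed below are hypothetical, and this is only a minimal sketch that assumes symlinked directories should not be followed:

```python
import os
from collections import defaultdict


def index_tree(root):
    """Walk root once with os.scandir and map each file name to its full paths."""
    index = defaultdict(list)
    stack = [root]  # directories still to be scanned
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    # entry.name is an attribute, not a method
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
                    elif entry.is_file():
                        index[entry.name].append(entry.path)
        except PermissionError:
            continue  # skip directories that cannot be read
    return index


def split_found_missed(targets, index):
    """Return (found_paths, missed_names) for the given target file names."""
    found, missed = [], []
    for name in targets:
        if name in index:
            found.extend(index[name])
        else:
            missed.append(name)
    return found, missed
```

With the names from the question, this could be called as index = index_tree(DIRECTORY_TREE) followed by found, missed = split_found_missed(df_files['Filename'], index). The tree is walked once and each lookup is a dictionary access, instead of one full walk per target file.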
Now you can use merge to look up the file names from df_files in df_path:
out = (df_files.merge(df_path, on='Filename', how='left')
        .value_counts('Filename').to_frame('Found'))
out['Missed'] = len(df_path) - out['Found']
print(out.reset_index())
# Output
      Filename  Found  Missed
0  __init__.py   5837  105418
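If the goal is the two lists from the question (full paths of the files that were found, and the names that were missed), one possible variation is to pass indicator=True to merge. As far as I can tell, a left merge keeps unmatched rows, so counting rows per Filename would report a missing file once; the _merge column makes the distinction explicit. The small frames below are hypothetical stand-ins for df_path and df_files:

```python
import pandas as pd

# Hypothetical stand-ins for the df_path and df_files frames built above.
df_path = pd.DataFrame({
    'Directory': ['/proj/a', '/proj/b', '/proj/a'],
    'Filename': ['__init__.py', '__init__.py', 'main.py'],
})
df_files = pd.DataFrame({'Filename': ['__init__.py', 'setup.py']})

# indicator=True adds a '_merge' column:
# 'both' for matched rows, 'left_only' for names with no match in df_path.
merged = df_files.merge(df_path, on='Filename', how='left', indicator=True)

hits = merged[merged['_merge'] == 'both']
found = (hits['Directory'] + '/' + hits['Filename']).tolist()
missed = merged.loc[merged['_merge'] == 'left_only', 'Filename'].tolist()

print('Found:', len(found), 'Missed:', len(missed))
print(missed)
```

Here found holds one path per match, so a name that occurs in several directories contributes several entries, while missed holds each unmatched name once.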
