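A recent project required going through a folder, collecting all of its XML files, and checking whether a specific node in each XML file contains a specific piece of information.
To make the file information easy to get at, a bat file runs automatically every day and generates a file list for reference: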
cd /d %~dp0
dir /s /b *.* > FileList.txt
So the next step is to pick out all of the XML files from FileList.txt, then open each one and check whether it contains the specified node information:
import os.path
from lxml import etree

TARGET = r'specific_info'

# Collect the XML paths from the list; strip the trailing newline so each path is clean
xml_files_list = []
with open(r'C:\test\FileList.txt', 'r') as f:
    for line in f:
        path = line.strip()
        if path.endswith('_2_.xml'):
            xml_files_list.append(path)

duplicated_id_list = []
result_xml_files = []
for xml_file in xml_files_list:
    xml_tree_root = etree.parse(xml_file).getroot()
    nodes = xml_tree_root.xpath('//SPECIFIC_NODE')
    if not nodes:
        continue  # skip files that do not have the node at all
    detect_install_str = nodes[0].xpath('string(.)')
    if TARGET in detect_install_str:
        # Check whether this file name has already been handled, because
        # files with the same name may exist in different folders
        if os.path.basename(xml_file) not in duplicated_id_list:
            duplicated_id_list.append(os.path.basename(xml_file))
            result_xml_files.append(xml_file)
print(result_xml_files)
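The filtering and same-name check can also be pulled into one small helper that works on the lines of FileList.txt. This is only a sketch: `select_unique_xml` is a hypothetical name, the case-insensitive comparison is an assumption based on how Windows treats file names, and `ntpath` is used so that the backslash paths written by `dir` are split correctly on any OS.

```python
import ntpath


def select_unique_xml(lines, suffix="_2_.xml"):
    """Return paths ending in `suffix`, keeping the first path seen per file name."""
    seen_names = set()
    selected = []
    for line in lines:
        path = line.strip()  # drop the trailing newline from the list file
        # Assumption: compare case-insensitively, as Windows file names are
        if not path.lower().endswith(suffix.lower()):
            continue
        # ntpath splits on Windows separators regardless of the host OS
        name = ntpath.basename(path).lower()
        if name not in seen_names:
            seen_names.add(name)
            selected.append(path)
    return selected
```

A set makes the "already seen" check constant time, which starts to matter with a list of nearly 20,000 entries; a list lookup as in the script above still gives the same result.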
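FileList.txt lists nearly 20,000 XML files, so the lookup ended up very slow, taking more than 1000 s.
I therefore considered using multithreading to speed up the lookup: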
import os
import time
from lxml import etree
from multiprocessing.dummy import Pool  # thread-based Pool with the multiprocessing API

TARGET = r'specific_info'

xml_files_list = []
file_name_list = []
with open(r'C:\test\FileList.txt', 'r') as f:
    for line in f:
        path = line.strip()
        # Deduplicate by file name first, so it need not be done while reading the XML
        if path.endswith('_2_.xml') and os.path.basename(path) not in file_name_list:
            file_name_list.append(os.path.basename(path))
            xml_files_list.append(path)

# Worker function run by the thread pool
def handle_xml(xml_file):
    xml_tree_root = etree.parse(xml_file).getroot()
    nodes = xml_tree_root.xpath('//SPECIFIC_NODE')
    if nodes and TARGET in nodes[0].xpath('string(.)'):
        return xml_file

p = Pool()
results = []
start = time.time()
for i in xml_files_list:
    results.append(p.apply_async(handle_xml, args=(i,)))
p.close()
p.join()
result_xml_files = [r for r in (x.get() for x in results) if r]
print(result_xml_files)
print('Elapsed: %.1f s' % (time.time() - start))
With multithreading the run takes a little over 100 s, roughly ten times faster than before.
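The same thread-pool idea can be sketched with the standard library's `concurrent.futures`. To keep the sketch runnable without extra packages, it swaps in `xml.etree.ElementTree` for lxml, which is not what the script above uses. The names `contains_target` and `find_matches` and the `max_workers=8` default are made up for illustration. Unlike the script, it checks every matching node rather than only the first, and treats unreadable or malformed files as non-matches.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor


def contains_target(xml_file, node_tag="SPECIFIC_NODE", target="specific_info"):
    """Return xml_file if any node_tag element's text contains target, else None."""
    try:
        root = ET.parse(xml_file).getroot()
    except (ET.ParseError, OSError):
        return None  # unreadable or malformed file: treat as no match
    for node in root.iter(node_tag):
        # itertext() yields the text of the node and all of its descendants
        if target in "".join(node.itertext()):
            return xml_file
    return None


def find_matches(xml_files, max_workers=8):
    """Check files concurrently and return the matching paths in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return [path for path in executor.map(contains_target, xml_files) if path]
```

`executor.map` returns results in the order of the input list, so the output does not depend on which thread finishes first. How much it speeds things up will depend on the disk and the parser, so it is worth timing on the real file list.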
