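I am a beginner with Python, and I am currently running this nested for-loop web-scraping program to download several Excel files and collect the thousands of observations in my dataset. However, my code runs so slowly that I need to speed it up so that I can process 5-20 observations at a time. People have suggested threading or asyncio, but I don't know how to use them or what code to write, because the online documentation is very obtuse and doesn't really explain what Python 3.9 (Spyder) is doing during my trial and error.
My code is long, but the gist is that I need to iterate over several elements i (in the first line of code) at once, and I don't know how to do that. I am looking for a simple fix. I realize this code is very clunky, but processing power/speed is not an issue. Please just help me with the concurrency problem!
Here is my code. The first line is the one where I need to iterate over multiple (10-20) elements of the array at once/simultaneously.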
for i in range(0,33000):
    #Say what iteration this is
    print('Beginning iteration')
    print(i)
    #Calling to use Chrome to webscrape
    driver = webdriver.Chrome(ChromeDriverManager().install())
    #Create WebDriverWait times of 5, 10, 15 and 30 seconds
    wait5 = WebDriverWait(driver, 5)
    wait10 = WebDriverWait(driver, 10)
    wait15 = WebDriverWait(driver, 15)
    wait30 = WebDriverWait(driver, 30)
    #Open FEC webpage
    driver.get("https://www.fec.gov/")
    #Find the searchbar and search the PCC ID
    searchbar = driver.find_element_by_xpath('/html/body/header[2]/div/ul/li[3]/form/div/span/input')
    searchbar.send_keys(commid[i])
    #searchbar.send_keys(comm5)
    searchbar.send_keys(Keys.RETURN)
    #Click on PCC Homepage
    pcc = wait5.until(
        EC.element_to_be_clickable((By.XPATH, '/html/body/main/main/div[2]/div[2]/section/ul/li/h3/a'))
    )
    pcc.click()
    try:
        #Get Two-Year election cycle Period drop down menu in PCC Homepage
        select = driver.find_element_by_xpath("//select[@id='summary-cycle']") #get the select element
        options = select.find_elements_by_tag_name("option") #get all the options into a list
    except:
        pass
    else:
        #Create array that will hold all election cycle options for PCC 'i'
        optionsList = []
        for option in options: #iterate over the options, place attribute value in the options array
            optionsList.append(option.get_attribute("value"))
        #Now, for each PCC in the dataset, loop over all available election cycles each PCC was registered for
        for oppy in optionsList:
            #Select the election cycle of interest
            dropdown = Select(driver.find_element_by_id('summary-cycle'))
            dropdown.select_by_value(oppy)
            sleep(randint(5,7))
            try:
                #Clicks on "Browse receipts" button on PCC i's homepage
                receipts = wait10.until(EC.presence_of_element_located((By.XPATH, '//*[@id="total-raised"]/div[1]/a')))
                driver.execute_script("arguments[0].click();",receipts)
                sleep(randint(10,15))
            except:
                if NoSuchElementException:
                    try:
                        driver.find_element(By.XPATH, '/html/body/main/div[2]/header/div/span[3]')
                        print('For PCC ID {},'.format(''.join(commid[i])))
                        #print('For PCC ID {},'.format(''.join(comm5[i])))
                        print('Receipts do not exist for election year {}.'.format(''.join(oppy)))
                        pass
                    except:
                        print('For PCC ID {},'.format(''.join(commid[i])))
                        print('Webpage does not exist for election year {}.'.format(''.join(oppy)))
                        driver.back()
            else:
                try:
                    #Clicks on "Export" button for receipts from succeeding webpage of receipt data
                    receiptsexport = wait15.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="main"]/section/div[2]/div[1]/div[1]/div/div[2]/button')))
                    receiptsexport.click()
                    sleep(randint(5,7))
                except:
                    print('For PCC ID {},'.format(''.join(commid[i])))
                    #print('For PCC ID {},'.format(''.join(comm5[i])))
                    print('There is no Receipt Data to export for election year {}.'.format(''.join(oppy)))
                    sleep(randint(5,7))
                    pass
                else:
                    try:
                        #Clicks on "Download" button under "Your downloads" to download receipts as .csv file
                        receiptsdownload = wait10.until(EC.element_to_be_clickable((By.XPATH, '/html/body/div[4]/div/ul/li/div/a')))
                        sleep(randint(5,7))
                        receiptsdownload.click()
                        sleep(randint(5,7))
                        driver.back()
                        sleep(randint(5,7))
                    except:
                        print('For PCC ID {},'.format(''.join(commid[i])))
                        print('I cannot download Receipt Data, since there is none to export for election year {}.'.format(''.join(oppy)))
                        driver.back() #Go back to PCC homepage
                        sleep(randint(5,7))
                        pass
            try:
                #Search for "Browse Disbursements" on PCC i's homepage and click link
                disburse = wait10.until(EC.presence_of_element_located((By.LINK_TEXT, "Browse disbursements")))
                driver.execute_script("arguments[0].click();",disburse)
                sleep(randint(10,15))
            except:
                print('For PCC ID {},'.format(''.join(commid[i])))
                #print('For PCC ID {},'.format(''.join(comm5)))
                print('Disbursements do not exist for election year {}.'.format(''.join(oppy)))
                sleep(randint(5,7))
                pass
            else:
                try:
                    #Clicks on "Export" button for disbursements from succeeding webpage of disbursement data
                    disbursexport = wait15.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="main"]/section/div[2]/div[1]/div[1]/div/div[2]/button')))
                    disbursexport.click()
                    sleep(randint(5,7))
                except:
                    print('For PCC ID {},'.format(''.join(commid[i])))
                    #print('For PCC ID {},'.format(''.join(comm5)))
                    print('There is no Disbursement Data to export for election year {}.'.format(''.join(oppy)))
                    sleep(randint(5,7))
                    pass
                else:
                    try:
                        #Clicks on "Download" button under "Your downloads" to download disbursements as .csv file
                        disbursedownload = wait15.until(EC.element_to_be_clickable((By.XPATH, '/html/body/div[4]/div/ul/li/div/a')))
                        sleep(randint(5,7))
                        disbursedownload.click()
                        driver.back()
                        sleep(randint(5,7))
                    except:
                        print('For PCC ID {},'.format(''.join(commid[i])))
                        print('I cannot download Disbursement Data, since there is none to export for election year {}.'.format(''.join(oppy)))
                        driver.back() #Go back to PCC homepage
                        sleep(randint(5,7))
                        pass
Answer:
You can try concurrent.futures. Put the code you want to run into a function that takes one argument, and pass it to the executor like this:
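import concurrent.futures

def my_func(i):
    # The body of your "for i in range(0,33000):" loop goes here.
    pass

my_list = list(range(0, 33000))

with concurrent.futures.ThreadPoolExecutor() as executor:
    # Calls my_func(i) for every i in my_list, several at a time.
    executor.map(my_func, my_list)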
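Each entry of my_list is passed to my_func. If you want to pass more arguments to my_func(), have a look at "How to use multiprocessing pool.map with multiple arguments", although it looks like you don't need to. You can set max_workers on ThreadPoolExecutor to control how many calls run at the same time.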
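To make the max_workers and extra-argument points concrete, here is a minimal sketch, not a drop-in replacement for the scraper. The names scrape_committee, scrape_all and delay are hypothetical, the committee IDs are made up, and time.sleep stands in for the Selenium work so the example runs without a browser. In the real code, the body of the for-loop would go inside scrape_committee, so each call creates its own Chrome driver as the loop already does. As far as I know a driver should not be shared between threads, and calling driver.quit() at the end of each call keeps browsers from piling up.

```python
import concurrent.futures
import functools
import time


def scrape_committee(committee_id, delay=0.01):
    """Hypothetical stand-in for the body of the original for-loop.

    In the real scraper this is where one Chrome driver would be created,
    used for a single committee ID, and closed again with driver.quit().
    """
    if not committee_id:
        raise ValueError("empty committee ID")
    time.sleep(delay)  # simulates time spent waiting on the browser
    return f"done {committee_id}"


def scrape_all(committee_ids, max_workers=10, delay=0.01):
    """Run scrape_committee over all IDs, at most max_workers at a time."""
    # functools.partial fixes the extra argument, so the executor only
    # has to supply the committee ID.
    work = functools.partial(scrape_committee, delay=delay)
    results, failures = {}, {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which ID each future belongs to.
        future_to_id = {executor.submit(work, cid): cid for cid in committee_ids}
        for future in concurrent.futures.as_completed(future_to_id):
            cid = future_to_id[future]
            try:
                results[cid] = future.result()
            except Exception as exc:
                # One failing ID is recorded instead of stopping the run.
                failures[cid] = exc
    return results, failures


if __name__ == "__main__":
    commid = ["C00000001", "C00000002", "", "C00000004"]  # made-up IDs
    results, failures = scrape_all(commid, max_workers=3)
    print(results)
    print(failures)
```

submit() together with as_completed() is used here instead of executor.map() so that an exception for one ID is stored in failures and the remaining IDs still run. With executor.map(), the exception is raised when you iterate over the results, which is easy to miss if you never look at them.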
