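I'm not familiar with multithreading or how to apply it to scrape data quickly. Scraping with BeautifulSoup is slow. Can someone show me how to apply multithreading to my code? The page is https://baroul-timis.ro/tabloul-avocatilor/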
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://baroul-timis.ro/get-av-data?param=toti-avocatii"
base_url = 'https://baroul-timis.ro'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}

# Collect the profile links from the JSON endpoint
productlink = []
data = requests.get(url).json()
for i, d in enumerate(data["data"], 1):
    link = BeautifulSoup(d["actions"], "html.parser").a["href"]
    comp = base_url + link
    productlink.append(comp)

# Fetch and parse each profile page, one at a time
test = []
for link in productlink:
    wev = {}
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    prod = soup.find_all('div', class_='user-info text-left mb-50')
    for pip in prod:
        title = pip.find('h4').text
        wev['title'] = title
        try:
            phone = pip.select('span', class_='font-weight-bolder')[2].text
        except:
            pass
        wev['phone'] = phone.split('\xa0')
        try:
            email = pip.select('span', class_='font-weight-bolder')[3].text
        except:
            pass
        wev['email'] = email.split('\xa0')
    test.append(wev)

df = pd.DataFrame(test)
print(df)
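Reply from a uj5u.com user:
Multithreading is ideal for this kind of task, because there is a lot of I/O waiting while the URLs are visited and their data is fetched. Here is how you could rework it: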
import requests
from bs4 import BeautifulSoup
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

url = "https://baroul-timis.ro/get-av-data?param=toti-avocatii"
base_url = 'https://baroul-timis.ro'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}

# Shared by all worker threads; list.append is thread-safe in CPython
test = []

def process(link):
    # Fetch one profile page and record its title, phone and email
    wev = {}
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    prod = soup.find_all('div', class_='user-info text-left mb-50')
    for pip in prod:
        title = pip.find('h4').text
        wev['title'] = title
        try:
            wev['phone'] = pip.select('span', class_='font-weight-bolder')[2].text.split('\xa0')
        except IndexError:
            pass  # no phone span on this page
        try:
            wev['email'] = pip.select('span', class_='font-weight-bolder')[3].text.split('\xa0')
        except IndexError:
            pass  # no email span on this page
    test.append(wev)

# Collect the profile links from the JSON endpoint
productlink = []
data = requests.get(url).json()
for d in data["data"]:
    link = BeautifulSoup(d["actions"], "lxml").a["href"]
    productlink.append(base_url + link)

# Leaving the with-block waits for every submitted task to finish
with ThreadPoolExecutor() as executor:
    executor.map(process, productlink)

df = pd.DataFrame(test)
print(df)
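On my system (24 threads) this produces a 941-row DataFrame in under 44 seconds, i.e. roughly 20 URLs per second.
Note: you will need lxml installed if you don't already have it. It is usually faster than html.parser.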
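To see why threads help with I/O waits without hitting the real site, here is a minimal sketch that replaces the network call with a short sleep. The names fake_fetch, run_sequential, run_threaded and timed, and the example.invalid URLs, are made up for this illustration and are not part of the answer above. It also returns each record from the worker and collects them through executor.map, instead of appending to a shared list.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(link):
    """Hypothetical stand-in for requests.get(): sleep to imitate network latency."""
    time.sleep(0.05)
    return {"title": f"record for {link}"}


def run_sequential(links):
    # One request after another: total time is roughly len(links) * latency.
    return [fake_fetch(link) for link in links]


def run_threaded(links, max_workers=8):
    # executor.map yields results in the same order as the input links.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fake_fetch, links))


def timed(func, *args):
    # Return the function's result together with the elapsed wall-clock time.
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


# Placeholder URLs; nothing is actually downloaded.
links = [f"https://example.invalid/page/{i}" for i in range(16)]
seq_rows, seq_seconds = timed(run_sequential, links)
thr_rows, thr_seconds = timed(run_threaded, links)
print(f"sequential: {seq_seconds:.2f}s, threaded: {thr_seconds:.2f}s")
```

Because the workers spend their time sleeping (as they would waiting on a socket), the threaded run should finish noticeably sooner, while both runs return the same rows in the same order.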
Edit:
Multiprocessing version
import requests
from bs4 import BeautifulSoup
import pandas as pd
from concurrent.futures import ProcessPoolExecutor

url = "https://baroul-timis.ro/get-av-data?param=toti-avocatii"
base_url = 'https://baroul-timis.ro'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}

def process(link):
    # Runs in a worker process, so results are returned rather than shared
    wev = {}
    test = []
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    prod = soup.find_all('div', class_='user-info text-left mb-50')
    for pip in prod:
        wev['title'] = pip.find('h4').text
        try:
            wev['phone'] = pip.select('span', class_='font-weight-bolder')[2].text.split('\xa0')
        except IndexError:
            pass  # no phone span on this page
        try:
            wev['email'] = pip.select('span', class_='font-weight-bolder')[3].text.split('\xa0')
        except IndexError:
            pass  # no email span on this page
    test.append(wev)
    return test

def main():
    # Collect the profile links from the JSON endpoint
    productlink = []
    for d in requests.get(url).json()["data"]:
        link = BeautifulSoup(d["actions"], "lxml").a["href"]
        productlink.append(base_url + link)
    # Merge the per-page lists returned by the worker processes
    test = []
    with ProcessPoolExecutor() as executor:
        for r in executor.map(process, productlink):
            test.extend(r)
    df = pd.DataFrame(test)
    print(df)

if __name__ == '__main__':
    main()
Reply from a uj5u.com user:
If you want to use threads, you can use ThreadPoolExecutor.
from concurrent.futures import ThreadPoolExecutor

links = [...]  # All your product URLs go here.

def do_work(link):
    ...
    # Write code to process 1 URL here.

# Run 8 threads at a time.
with ThreadPoolExecutor(8) as executor:
    results = executor.map(do_work, links)
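One thing the skeleton above leaves open is what happens when a single URL fails. A possible approach, sketched here under assumptions, is to submit each link separately and inspect every future with as_completed, so that one bad page does not hide the rest. The do_work body, the run_all helper and the example.invalid links are hypothetical placeholders; a real worker would fetch and parse the page instead.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def do_work(link):
    """Hypothetical worker: fails for one link to demonstrate error handling."""
    if link.endswith("/3"):
        raise ValueError(f"could not parse {link}")
    return {"link": link, "length": len(link)}


def run_all(links, max_workers=8):
    results, failures = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which link each future belongs to.
        future_to_link = {executor.submit(do_work, link): link for link in links}
        # as_completed yields futures as they finish, not in submission order.
        for future in as_completed(future_to_link):
            link = future_to_link[future]
            try:
                results.append(future.result())
            except Exception as exc:
                # future.result() re-raises whatever the worker raised.
                failures.append((link, repr(exc)))
    return results, failures


# Placeholder URLs; nothing is actually downloaded.
links = [f"https://example.invalid/item/{i}" for i in range(6)]
results, failures = run_all(links)
print(len(results), "succeeded;", len(failures), "failed")
```

The failed links end up in failures together with the error message, so they can be retried or logged afterwards.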
I would suggest using ProcessPoolExecutor. It works not only for I/O but also for CPU-bound tasks, and it will use all of your CPUs.
from concurrent.futures import ProcessPoolExecutor

links = [...]  # All your product URLs go here.

def do_work(link):
    ...
    # Write code to process 1 URL here.

# The guard is required on platforms that start workers by spawning
if __name__ == '__main__':
    # Run 8 processes in parallel.
    with ProcessPoolExecutor(8) as executor:
        results = executor.map(do_work, links, chunksize=40)
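Here, results is an iterator over the return values of do_work (wrap it in list() if you need a list). It is best not to return large amounts of data, because serializing that data between processes will make the program very slow. Save it to a database or a file instead.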
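As a rough sketch of that advice, each worker below writes its rows to its own JSON file and returns only the file path and a row count, so very little data has to be pickled back to the parent process. The task tuple, the part-NNNNN.json naming, the generated rows and the example.invalid links are all assumptions made for this illustration.

```python
import json
import tempfile
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def do_work(task):
    """Hypothetical worker: write the rows to disk and return a small summary."""
    index, link, out_dir = task
    # Stand-in for the data a real worker would scrape from the link.
    rows = [{"link": link, "row": i} for i in range(1000)]
    out_path = Path(out_dir) / f"part-{index:05d}.json"
    out_path.write_text(json.dumps(rows), encoding="utf-8")
    # Only the path and the count cross the process boundary.
    return str(out_path), len(rows)


def main():
    # Placeholder URLs; nothing is actually downloaded.
    links = [f"https://example.invalid/item/{i}" for i in range(20)]
    with tempfile.TemporaryDirectory() as out_dir:
        tasks = [(i, link, out_dir) for i, link in enumerate(links)]
        with ProcessPoolExecutor(4) as executor:
            summaries = list(executor.map(do_work, tasks, chunksize=5))
        total = sum(count for _, count in summaries)
        print(f"wrote {len(summaries)} files, {total} rows")


if __name__ == "__main__":
    main()
```

Giving each task its own output file avoids several processes writing to the same file at once. The files can be merged afterwards, for example by loading them one by one into a DataFrame.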
Read more about concurrent.futures.
