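Here is the site: https://www.dytt8.net/html/gndy/dyzz/list_23_1.html
Today I wrote some code that scrapes movie titles and cover image URLs from dytt8 (Movie Heaven).
The crawl was painfully slow, so I tried threads. That made it much faster, but the results are incomplete.
For example, I only crawl the first page (25 movies in total), yet it actually returns fewer than 25.
It just hangs there, and the main process never prints its final message either.
At first I assumed my code was not robust enough and that some movie's info simply could not be parsed, so I compared runs:
a movie that was missing in my first run showed up in my second run.
I'm completely baffled. Please help, thanks!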
import threading
from urllib.parse import urljoin

import requests
from lxml import etree

BASE_DOMAIN = 'https://www.dytt8.net/'
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'
}
# Seconds to wait per request; without a timeout a stalled connection
# can block a thread (and therefore join()) forever.
TIMEOUT = 10


def get_detail_urls(url):
    """Return the absolute detail-page URLs found on one list page."""
    response = requests.get(url, headers=HEADERS, timeout=TIMEOUT).text
    html = etree.HTML(response)
    detail_urls = html.xpath("//table[@class='tbspan']//a/@href")
    # urljoin avoids a doubled slash when the href already starts with '/'.
    return [urljoin(BASE_DOMAIN, href) for href in detail_urls]


def parse_detail_page(url):
    movie = {}  # holds the info scraped for one movie
    response = requests.get(url, headers=HEADERS, timeout=TIMEOUT)
    text = response.content.decode('gbk', 'ignore')
    html = etree.HTML(text)
    # Each [0] raises IndexError if the page did not contain the expected element.
    title = html.xpath("//div[@class='title_all']//font[@color='#07519a']/text()")[0]
    movie['title'] = title
    zoomE = html.xpath("//div[@id='Zoom']")[0]
    imgs = zoomE.xpath(".//img/@src")
    cover = imgs[0]
    movie['cover'] = cover
    infos = zoomE.xpath(".//text()")  # not used yet in this excerpt

    def parse_info(info, rule):
        return info.replace(rule, "").strip()

    movie['url'] = url
    print(movie)


def spider():
    threads = []
    base_url = "https://www.dytt8.net/html/gndy/dyzz/list_23_{}.html"
    for i in range(1, 2):  # only crawl the first page for now
        url = base_url.format(i)
        for detail_url in get_detail_urls(url):
            thread = threading.Thread(target=parse_detail_page, args=(detail_url,))
            threads.append(thread)
            thread.start()
    for t in threads:
        t.join()


if __name__ == '__main__':
    spider()
    print('Crawl finished')
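One way to find out why some of the 25 pages go missing is to run the workers through a pool and record any exception per URL, rather than letting it die inside a bare thread. The sketch below is not a confirmed diagnosis, only a way to make failures visible. `crawl_all`, `worker`, and `max_workers` are hypothetical names, and `worker` stands in for a version of `parse_detail_page` that returns the `movie` dict instead of printing it.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def crawl_all(urls, worker, max_workers=5):
    """Run worker(url) for every URL and return (results, failures).

    `worker` is a hypothetical stand-in for a parse_detail_page that
    returns its movie dict. Exceptions are recorded, not swallowed.
    """
    results = {}
    failures = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the URL it was submitted for.
        future_to_url = {pool.submit(worker, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # e.g. IndexError from an empty xpath result
                failures[url] = exc
    return results, failures


if __name__ == '__main__':
    # Offline demo with a fake worker, so no request is sent.
    def fake_worker(url):
        if url.endswith('3'):
            raise IndexError('list index out of range')
        return {'url': url}

    ok, failed = crawl_all(['page1', 'page2', 'page3'], fake_worker)
    print(len(ok), 'ok,', len(failed), 'failed')
    for url, exc in failed.items():
        print(url, '->', repr(exc))
```

If `failures` comes back non-empty, the exception types point at the likely cause. Lowering `max_workers` may also help if the server drops connections when it receives many requests at once, though that is an assumption about the site, not something verified here.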
Reply from a helpful uj5u.com user:
By the way, without threads the crawl is slow and still incomplete.
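Since pages also go missing without threads, the requests themselves may be failing intermittently. Here is a minimal sketch of a bounded retry, assuming the failure is transient. `with_retries`, `attempts`, `delay`, and `exceptions` are hypothetical names, and `func` would be something like `lambda: parse_detail_page(url)`.

```python
import time


def with_retries(func, attempts=3, delay=1.0, exceptions=(Exception,)):
    """Call func() up to `attempts` times, sleeping between failed tries.

    Returns func()'s value on the first success and re-raises the last
    exception if every attempt fails. All names here are illustrative.
    """
    if attempts < 1:
        raise ValueError('attempts must be at least 1')
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # out of attempts: surface the real error
            time.sleep(delay * attempt)  # wait a little longer each time


if __name__ == '__main__':
    calls = {'n': 0}

    def flaky():
        # Fails twice, then succeeds, to imitate a transient error.
        calls['n'] += 1
        if calls['n'] < 3:
            raise ConnectionError('temporary failure')
        return 'ok'

    print(with_retries(flaky, attempts=3, delay=0.01))
```

If a page still fails after every attempt, the re-raised exception shows whether it was a network error or a parsing error such as `IndexError`, which would suggest the server returned a different page than expected.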
Please credit the source when reposting. Link to this article: https://www.uj5u.com/qita/48369.html
