Hi everyone, I'm 漆柒七7!
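Today I'd like to show how to scrape the listings of films opening soon in Shenzhen from Douban, using a plain single-threaded crawler, a multithreaded crawler, and a coroutine-based crawler in turn, so that we can compare how the three perform for web scraping.
The page to scrape is: https://movie.douban.com/cinema/later/shenzhen/
Besides the entry page, we also need to scrape each film's detail page. The structured information to collect is as follows: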


Scraping test
Below I demonstrate parsing the data with XPath.
Reading the entry page:
import requests
from lxml import etree
import pandas as pd
import re
main_url = "https://movie.douban.com/cinema/later/shenzhen/"
headers = {
    "Accept-Encoding": "Gzip",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"
}
r = requests.get(main_url, headers=headers)
r
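Result:
<Response [200]>
Inspect the XPath of the data we need:

Each film's information sits in a div under the element whose id is showing-soon. After working out where the film title, URL, and "want to see" count sit inside each of those divs, we can write the following code: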
html = etree.HTML(r.text)
all_movies = html.xpath("//div[@id='showing-soon']/div")
result = []
for e in all_movies:
    # imgurl, = e.xpath(".//img/@src")
    # The trailing comma unpacks a one-element list and raises if the match count differs.
    name, = e.xpath(".//div[@class='intro']/h3/a/text()")
    url, = e.xpath(".//div[@class='intro']/h3/a/@href")
    # date, movie_type, pos = e.xpath(".//div[@class='intro']/ul/li[@class='dt']/text()")
    like_num, = e.xpath(
        ".//div[@class='intro']/ul/li[@class='dt last']/span/text()")
    # The span text holds the count followed by "人"; keep only the part before it.
    result.append((name, int(like_num[:like_num.find("人")]), url))
# Columns: 影名 = film title, 想看人數 = "want to see" count.
main_df = pd.DataFrame(result, columns=["影名", "想看人數", "url"])
main_df
Result:

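One caveat about the slice `like_num[:like_num.find("人")]` used above: it relies on the character 人 being present. If it is ever missing, `find` returns -1 and the slice silently drops the last character instead of failing. Here is a minimal sketch of a stricter alternative. `parse_like_count` is a hypothetical helper name, and the sample string is made up for illustration rather than copied from the page.

```python
import re


def parse_like_count(text: str) -> int:
    """Return the first run of digits in a string such as '1234人想看'."""
    match = re.search(r"\d+", text)
    if match is None:
        # Fail loudly instead of returning a silently wrong number.
        raise ValueError(f"no digits found in {text!r}")
    return int(match.group())


# Made-up sample in the assumed "<count>人..." shape.
print(parse_like_count("1234人想看"))  # prints 1234
```

Inside the loop, `int(like_num[:like_num.find("人")])` could then be swapped for `parse_like_count(like_num)`.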
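Next, pick one detail-page URL to test. I chose 熊出沒·狂野大陸 (Boonie Bears: The Wild Life) because its text data is the most complex of the lot and therefore the most representative: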
url = main_df.at[17, "url"]
url
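Result:
'https://movie.douban.com/subject/34825886/'
Analyze the structure of the detail page:

All of the text information sits in a single div (the one with id info), so we simply extract every text node under that div: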
r = requests.get(url, headers=headers)
html = etree.HTML(r.text)
movie_infos = html.xpath("//div[@id='info']//text()")
# movie_infos is a list of text nodes; join them to print the block as text.
print("".join(movie_infos).strip())
Result:
導演: 丁亮
編劇: 徐蕓 / 崔鐵志 / 張宇
主演: 張偉 / 張秉君 / 譚笑
型別: 喜劇 / 科幻 / 影片
制片國家/地區: 中國大陸
語言: 漢語普通話
上映日期: 2021-02-12(中國大陸) / 2020-08-01(上海電影節)
片長: 100分鐘
又名: 熊出沒大電影7 / 熊出沒科幻大電影 / Boonie Bears: The Wild Life
IMDb鏈接: tt11654032
The rest is straightforward:
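# Join the text nodes from the XPath query above into one string.
movie_info_txt = "".join(movie_infos).strip()
row = {}
# Split on newlines, swallowing any spaces and blank lines around each one.
for line in re.split("[\n ]*\n[\n ]*", movie_info_txt):
    line = line.strip()
    # Split only on the first ": " so values that contain a colon stay intact.
    arr = line.split(": ", maxsplit=1)
    if len(arr) != 2:
        continue
    k, v = arr
    row[k] = v
row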
Result:
{'導演': '丁亮',
'編劇': '徐蕓 / 崔鐵志 / 張宇',
'主演': '張偉 / 張秉君 / 譚笑',
'型別': '喜劇 / 科幻 / 影片',
'制片國家/地區': '中國大陸',
'語言': '漢語普通話',
'上映日期': '2021-02-12(中國大陸) / 2020-08-01(上海電影節)',
'片長': '100分鐘',
'又名': '熊出沒大電影7 / 熊出沒科幻大電影 / Boonie Bears: The Wild Life',
'IMDb鏈接': 'tt11654032'}
As you can see, every field was split out successfully.
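Since the same splitting logic is repeated in every crawler below, it could be pulled into one function. The following is only a sketch of that idea. `parse_info_block` is a hypothetical name, and the sample text is a shortened version of the output shown above, with extra indentation and a blank line added to mimic raw page text.

```python
import re


def parse_info_block(text: str) -> dict:
    """Turn 'key: value' lines into a dict, skipping lines without ': '."""
    row = {}
    for line in re.split(r"[\n ]*\n[\n ]*", text.strip()):
        # partition splits on the first ": " only, so colons in values survive.
        key, sep, value = line.strip().partition(": ")
        if sep:
            row[key] = value
    return row


sample = """
        導演: 丁亮
        片長: 100分鐘

        又名: 熊出沒大電影7 / Boonie Bears: The Wild Life
"""
print(parse_info_block(sample))
```

Each crawler could then call `parse_info_block("".join(html.xpath("//div[@id='info']//text()")))` instead of repeating the loop.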
Building on the tests above, let's flesh out the complete crawler:
Single-threaded crawler
import requests
from lxml import etree
import pandas as pd
import re
main_url = "https://movie.douban.com/cinema/later/shenzhen/"
headers = {
    "Accept-Encoding": "Gzip",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"
}
r = requests.get(main_url, headers=headers)
html = etree.HTML(r.text)
all_movies = html.xpath("//div[@id='showing-soon']/div")
result = []
for e in all_movies:
    imgurl, = e.xpath(".//img/@src")
    name, = e.xpath(".//div[@class='intro']/h3/a/text()")
    url, = e.xpath(".//div[@class='intro']/h3/a/@href")
    print(url)
    # date, movie_type, pos = e.xpath(".//div[@class='intro']/ul/li[@class='dt']/text()")
    like_num, = e.xpath(
        ".//div[@class='intro']/ul/li[@class='dt last']/span/text()")
    # Fetch the detail page; the loop blocks here until the response arrives.
    r = requests.get(url, headers=headers)
    html = etree.HTML(r.text)
    row = {}
    row["電影名稱"] = name  # 電影名稱 = film title
    for line in re.split("[\n ]*\n[\n ]*", "".join(html.xpath("//div[@id='info']//text()")).strip()):
        line = line.strip()
        arr = line.split(": ", maxsplit=1)
        if len(arr) != 2:
            continue
        k, v = arr
        row[k] = v
    row["想看人數"] = int(like_num[:like_num.find("人")])  # 想看人數 = "want to see" count
    # row["url"] = url
    # row["圖片地址"] = imgurl
    # print(row)
    result.append(row)
df = pd.DataFrame(result)
df.sort_values("想看人數", ascending=False, inplace=True)
df.to_csv("shenzhen_movie.csv", index=False)
Result:
https://movie.douban.com/subject/26752564/
https://movie.douban.com/subject/35172699/
https://movie.douban.com/subject/34992142/
https://movie.douban.com/subject/30349667/
https://movie.douban.com/subject/30283209/
https://movie.douban.com/subject/33457717/
https://movie.douban.com/subject/30487738/
https://movie.douban.com/subject/35068230/
https://movie.douban.com/subject/27039358/
https://movie.douban.com/subject/30205667/
https://movie.douban.com/subject/30476403/
https://movie.douban.com/subject/30154423/
https://movie.douban.com/subject/27619748/
https://movie.douban.com/subject/26826330/
https://movie.douban.com/subject/26935283/
https://movie.douban.com/subject/34841067/
https://movie.douban.com/subject/34880302/
https://movie.douban.com/subject/34825886/
https://movie.douban.com/subject/34779692/
https://movie.douban.com/subject/35154209/
The scraped file:

Total elapsed time: 42.5 seconds.
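The way the timing was taken is not shown here. One possible way to measure it in a plain script is a small context manager around `time.perf_counter`. This is a sketch only: `timed` and `timings` are hypothetical names, and the `time.sleep` call stands in for the crawl itself.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str, results: dict):
    """Store the wall-clock seconds spent inside the with-block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Record the duration even if the block raised.
        results[label] = time.perf_counter() - start


timings = {}
with timed("single-threaded", timings):
    time.sleep(0.05)  # stand-in for the crawl; put the real code here
print(f"single-threaded: {timings['single-threaded']:.2f}s")
```

Reusing the same helper for all three crawlers keeps the comparison consistent.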
Multithreaded crawler
The single-threaded crawl takes quite a while. Let's see how much faster multithreading is:
import requests
from lxml import etree
import pandas as pd
import re
from concurrent.futures import ThreadPoolExecutor, wait, ALL_COMPLETED
def fetch_content(url):
    print(url)
    headers = {
        "Accept-Encoding": "Gzip",  # request gzip-compressed transfer to speed up access
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"
    }
    r = requests.get(url, headers=headers)
    return r.text
url = "https://movie.douban.com/cinema/later/shenzhen/"
init_page = fetch_content(url)
html = etree.HTML(init_page)
all_movies = html.xpath("//div[@id='showing-soon']/div")
result = []
for e in all_movies:
    # imgurl, = e.xpath(".//img/@src")
    name, = e.xpath(".//div[@class='intro']/h3/a/text()")
    url, = e.xpath(".//div[@class='intro']/h3/a/@href")
    # date, movie_type, pos = e.xpath(".//div[@class='intro']/ul/li[@class='dt']/text()")
    like_num, = e.xpath(
        ".//div[@class='intro']/ul/li[@class='dt last']/span/text()")
    result.append((name, int(like_num[:like_num.find("人")]), url))
main_df = pd.DataFrame(result, columns=["影名", "想看人數", "url"])
# One worker per detail page, so all pages are requested at the same time.
max_workers = main_df.shape[0]
with ThreadPoolExecutor(max_workers=max_workers) as executor:
    future_tasks = [executor.submit(fetch_content, url) for url in main_df.url]
    wait(future_tasks, return_when=ALL_COMPLETED)
# The futures are in submission order, so pages lines up with main_df.url.
pages = [future.result() for future in future_tasks]
result = []
for url, html_text in zip(main_df.url, pages):
    html = etree.HTML(html_text)
    row = {}
    for line in re.split("[\n ]*\n[\n ]*", "".join(html.xpath("//div[@id='info']//text()")).strip()):
        line = line.strip()
        arr = line.split(": ", maxsplit=1)
        if len(arr) != 2:
            continue
        k, v = arr
        row[k] = v
    row["url"] = url
    result.append(row)
detail_df = pd.DataFrame(result)
# Join the entry-page data with the detail-page data on the shared url column.
df = main_df.merge(detail_df, on="url")
df.drop(columns=["url"], inplace=True)
df.sort_values("想看人數", ascending=False, inplace=True)
df.to_csv("shenzhen_movie2.csv", index=False)
df
Result:


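It took 8 seconds.
Because each detail page is fetched by its own thread and the threads all work at nearly the same time, the total time is determined almost entirely by the slowest detail page.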
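The "slowest page decides the total" effect can be reproduced without touching the network. In this sketch `fake_fetch` is a hypothetical stand-in that sleeps instead of downloading, and the three delays are made-up numbers.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(delay: float) -> float:
    """Stand-in for a page download: sleep for `delay` seconds."""
    time.sleep(delay)
    return delay


delays = [0.1, 0.2, 0.3]  # made-up per-page latencies in seconds
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(delays)) as executor:
    # map() returns results in input order, whatever order the threads finish in.
    results = list(executor.map(fake_fetch, delays))
elapsed = time.perf_counter() - start
print(f"sum of delays: {sum(delays):.1f}s, elapsed: {elapsed:.2f}s")
```

The elapsed time should land near the largest delay rather than the sum, which is the same pattern as the drop from 42.5 seconds to 8 seconds above.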
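Coroutine (asynchronous) crawler
Because I'm running this in Jupyter, I added the two lines below so that the coroutines can run directly in the notebook. In a regular editor you can leave them out: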
import nest_asyncio
nest_asyncio.apply()
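The underlying cause is that the Jupyter kernel relies on a newer Tornado (5.0 and later), which already runs an asyncio event loop, so calling asyncio.run() inside a notebook raises a RuntimeError. Rolling Tornado back to an older version also avoids the problem.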
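As far as I understand it, the error that nest_asyncio works around comes from calling `asyncio.run()` while an event loop is already running in the same thread, which is the situation inside a notebook kernel. It can be reproduced in a plain script. The function names below are hypothetical and exist only for this demonstration.

```python
import asyncio


async def answer() -> int:
    return 42


async def call_run_inside_loop() -> str:
    """Try to start a second event loop from inside a running one."""
    coro = answer()
    try:
        asyncio.run(coro)  # refused: a loop is already running in this thread
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return str(exc)
    return "no error"


message = asyncio.run(call_run_inside_loop())
print(message)
```

In a notebook, using `await main()` directly in a cell is another way around this, since the loop is already running there.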
Now let's use coroutines to do the same crawl:
import aiohttp
from lxml import etree
import pandas as pd
import re
import asyncio
import nest_asyncio
nest_asyncio.apply()
async def fetch_content(url):
    print(url)
    header = {
        "Accept-Encoding": "Gzip",  # request gzip-compressed transfer to speed up access
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"
    }
    async with aiohttp.ClientSession(
        headers=header, connector=aiohttp.TCPConnector(ssl=False)
    ) as session:
        async with session.get(url) as response:
            return await response.text()
async def main():
    url = "https://movie.douban.com/cinema/later/shenzhen/"
    init_page = await fetch_content(url)
    html = etree.HTML(init_page)
    all_movies = html.xpath("//div[@id='showing-soon']/div")
    result = []
    for e in all_movies:
        # imgurl, = e.xpath(".//img/@src")
        name, = e.xpath(".//div[@class='intro']/h3/a/text()")
        url, = e.xpath(".//div[@class='intro']/h3/a/@href")
        # date, movie_type, pos = e.xpath(".//div[@class='intro']/ul/li[@class='dt']/text()")
        like_num, = e.xpath(
            ".//div[@class='intro']/ul/li[@class='dt last']/span/text()")
        result.append((name, int(like_num[:like_num.find("人")]), url))
    main_df = pd.DataFrame(result, columns=["影名", "想看人數", "url"])
    # Schedule every detail page at once; gather returns results in input order.
    tasks = [fetch_content(url) for url in main_df.url]
    pages = await asyncio.gather(*tasks)
    result = []
    for url, html_text in zip(main_df.url, pages):
        html = etree.HTML(html_text)
        row = {}
        for line in re.split("[\n ]*\n[\n ]*", "".join(html.xpath("//div[@id='info']//text()")).strip()):
            line = line.strip()
            arr = line.split(": ", maxsplit=1)
            if len(arr) != 2:
                continue
            k, v = arr
            row[k] = v
        row["url"] = url
        result.append(row)
    detail_df = pd.DataFrame(result)
    df = main_df.merge(detail_df, on="url")
    df.drop(columns=["url"], inplace=True)
    df.sort_values("想看人數", ascending=False, inplace=True)
    return df
df = asyncio.run(main())
df.to_csv("shenzhen_movie3.csv", index=False)
df
Result:

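It took only 7 seconds, slightly faster than the multithreaded version.
Because the requests library does not support coroutines, I used aiohttp, which does, to fetch the pages. The actual crawl time also depends on network conditions at the moment, of course, but overall the coroutine crawler tends to be a little faster than the multithreaded one.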
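If you would rather keep a blocking client such as requests, one option (not what the code above does) is to push each blocking call onto a worker thread with `asyncio.to_thread`, available from Python 3.9. Note that this is thread-based concurrency behind an async interface rather than true non-blocking I/O. In this sketch `blocking_fetch` is a hypothetical stand-in that sleeps instead of calling `requests.get`, and the URLs are placeholders.

```python
import asyncio
import time


def blocking_fetch(url: str) -> str:
    """Stand-in for a blocking call such as requests.get(url).text."""
    time.sleep(0.1)
    return f"<html>{url}</html>"


async def fetch_all(urls):
    # Each blocking call runs in a worker thread, so the event loop stays free.
    tasks = [asyncio.to_thread(blocking_fetch, url) for url in urls]
    return await asyncio.gather(*tasks)


urls = [f"https://example.com/page/{i}" for i in range(3)]  # placeholder URLs
pages = asyncio.run(fetch_all(urls))
print(len(pages))  # prints 3
```

For a crawl like this one, the result is roughly equivalent to the ThreadPoolExecutor version, just driven from async code.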
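Recap
Today I demonstrated a single-threaded crawler, a multithreaded crawler, and a coroutine crawler. In these runs the coroutine crawler was the fastest and the multithreaded crawler was slightly slower, while the single-threaded crawler has to finish one page before it can start the next.
That said, coroutine crawlers are comparatively harder to write, and page fetching cannot use the requests library, only aiohttp. So when writing crawlers in practice, we usually reach for multithreading to speed things up. Keep in mind that websites limit how often a single IP may access them, and crawling too fast can get your IP banned, so multithreaded crawls are usually combined with proxy IPs to fetch data concurrently.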
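Alongside proxies, capping the number of requests in flight is another way to stay under a site's rate limit. Here is a minimal sketch using `asyncio.Semaphore`. The limit of 3, the names `polite_fetch` and `crawl`, and the placeholder URLs are all assumptions for illustration, and `asyncio.sleep` stands in for the real network request.

```python
import asyncio

MAX_CONCURRENT = 3  # assumed limit; tune it to what the target site tolerates


async def polite_fetch(url, semaphore, stats):
    """Simulated fetch that waits for a free slot before it starts."""
    async with semaphore:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.05)  # stand-in for the real network request
        stats["active"] -= 1
        return url


async def crawl(urls):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    stats = {"active": 0, "peak": 0}  # tracks how many fetches overlap
    results = await asyncio.gather(
        *(polite_fetch(url, semaphore, stats) for url in urls))
    return results, stats["peak"]


urls = [f"https://example.com/item/{i}" for i in range(10)]  # placeholder URLs
results, peak = asyncio.run(crawl(urls))
print(f"fetched {len(results)} pages, peak concurrency {peak}")
```

With a thread pool the equivalent knob is simply a smaller `max_workers` than the number of pages.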
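Bonus: parsing a table with XPath + pandas and extracting the URLs
At the bottom of the Shenzhen listings page there is a button labelled 查看全部即將上映的影片 ("View all upcoming films", https://movie.douban.com/coming). Clicking it opens a complete list of films opening soon, and that list turns out to be data inside a table tag:

That makes things easy. To parse a table we may not need XPath at all, since pandas can do it directly. The URLs embedded in the film titles still have to be extracted, though, so I parse this page with XPath + pandas together. Here is my code: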
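import io
import pandas as pd
import requests
from lxml import etree
headers = {
    "Accept-Encoding": "Gzip",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"
}
r = requests.get("https://movie.douban.com/coming", headers=headers)
html = etree.HTML(r.text)
table_tag = html.xpath("//table")[0]
# Serialize the table back to HTML text and wrap it in StringIO, because
# recent pandas versions deprecate passing literal HTML to read_html.
df, = pd.read_html(io.StringIO(etree.tostring(table_tag, encoding="unicode")))
# The film title link sits in the second cell of each row.
urls = table_tag.xpath(".//td[2]/a/@href")
df["url"] = urls
df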
Result:

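That gives us the complete data from the main page; a little further cleanup and it's ready to use.
Closing remarks
Thanks for reading! If you have any thoughts or takeaways, feel free to leave a comment.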
Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/262045.html
