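Python Web Scraper

Scraping news content from the Tencent News homepage

I have recently been learning web scraping and scraped some content along the way. I am sharing the script here in case it is useful to others.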

#Import modules
import urllib.request
import urllib.error
import re
import ssl
#HTTPS needs separate handling: this turns off certificate verification globally
#(it makes the connection insecure, so use it only for experiments like this one)
ssl._create_default_https_context = ssl._create_unverified_context
#Tencent News homepage URL
url="https://xw.qq.com/"
#This part uses a User-Agent header to imitate a browser request; Tencent News can be fetched directly, so it is optional
#headers=("User-Agent","Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Mobile Safari/537.36")
#opener=urllib.request.build_opener()
#opener.addheaders=[headers]
#date=opener.open(url).read().decode("utf-8","ignore")
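#Fetch the full HTML of the homepage
#urllib.request.urlopen() can raise an exception, so the call is wrapped in exception handling
#Start from an empty string so the regex step below still runs if the request fails
date=""
try:
    date=urllib.request.urlopen(url).read().decode("utf-8","ignore")
    print(len(date))
#URLError also covers its subclass HTTPError; HTTPError carries both a status code and a reason
except urllib.error.URLError as e:
    #Print the status code (present only on HTTPError)
    if hasattr(e,"code"):
        print(e.code)
    #Print the reason for the failure
    if hasattr(e,"reason"):
        print(e.reason)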
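#Regular expression that extracts the information we need (the news links)
pat=''',"url":"(.*?)",'''
#findall() matches globally and returns every captured link as a list
thislink=re.compile(pat).findall(date)
#Clean up temporary files that earlier urlretrieve() calls may have left behind
urllib.request.urlcleanup()
#Print the list (optional)
#print(thislink)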
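#Loop over the links and save each page to a local file
#Note: the target directory D:/write1/date/ must already exist
for i in range(len(thislink)):
    #urllib.request.urlretrieve() can raise an exception, so it is wrapped in exception handling too
    try:
        #If the extracted link already contains "https", it is a full URL and can be saved as is
        if ("https" in thislink[i]):
            #urlretrieve() downloads the page straight to a local file
            newurl=thislink[i]
            urllib.request.urlretrieve(newurl,"D:/write1/date/"+str(i)+".html")
            #Print the URL that was saved
            print(newurl)
        else:
            #Without "https" the link is a relative path, so prepend the site prefix (including the https part)
            newurl="https://xw.qq.com"+thislink[i]
            urllib.request.urlretrieve(newurl,"D:/write1/date/"+str(i)+".html")
            #Print the URL that was saved
            print(newurl)
    #URLError also covers its subclass HTTPError; HTTPError carries both a status code and a reason
    except urllib.error.URLError as e:
        #Print the status code (present only on HTTPError)
        if hasattr(e,"code"):
            print(e.code)
        #Print the reason for the failure
        if hasattr(e,"reason"):
            print(e.reason)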
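
The link handling in the script can also be tried out without touching the network. The sketch below is one possible variant, not the script's own method: it reuses the same regular expression, but replaces the `"https" in link` check with `urllib.parse.urljoin`, and it drops duplicate links. The names `extract_links`, `build_request`, `BASE_URL`, `USER_AGENT` and the sample string are assumptions made for illustration; the real page content may look different.

```python
import re
from urllib.parse import urljoin
from urllib.request import Request

# Assumed base URL, taken from the script above
BASE_URL = "https://xw.qq.com/"
# Hypothetical browser-like User-Agent string (same one the script comments out)
USER_AGENT = ("Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/80.0.3987.87 Mobile Safari/537.36")
# Same pattern as the script: captures the value of every "url" field
LINK_PATTERN = re.compile(r',"url":"(.*?)",')


def extract_links(html, base_url=BASE_URL):
    """Return absolute, de-duplicated links from html in first-seen order."""
    links = []
    seen = set()
    for raw in LINK_PATTERN.findall(html):
        # urljoin keeps absolute URLs as they are and resolves
        # relative paths against base_url
        absolute = urljoin(base_url, raw)
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


def build_request(url, user_agent=USER_AGENT):
    """Build a Request object that carries a User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


# Made-up sample data that only imitates the shape the pattern expects
sample = ('{"id":1,"url":"/cmsid/A0001","title":"a"},'
          '{"id":2,"url":"https://xw.qq.com/cmsid/A0002","title":"b"},'
          '{"id":3,"url":"/cmsid/A0001","title":"c"}')

for link in extract_links(sample):
    print(link)
```

To use it against the live site, the object returned by `build_request(BASE_URL)` can be passed to `urllib.request.urlopen()` in place of the plain URL string, and the decoded response can be passed to `extract_links()`.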

Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/274892.html
