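Because I am doing research on patents, I chose the Key Industry Patent Information site (chinaip.sipo.gov.cn) as my data source. The site offers a data download function, but it responds slowly and I needed a large number of records, so I wrote a crawler to collect them.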

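1. Data acquisition
Analysis of the site shows that it requires a simulated login before any data can be retrieved. The data in the POST request shows that the login username is 'cnipr' and the password is '123456'.
The URL requested at login is 'http://chinaip.sipo.gov.cn/login'. The site needs a persistent session, so we create one with session = requests.session() and submit every request through it. Reaching the final data involves several redirects, so multiple requests are needed. The code is as follows: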
import re
import csv
import time
import random
import requests
import pandas as pd
import eventlet  # import the eventlet module
# Fetch the JSP content for one result page
def get_jsp(page, searchword):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko',
        'Accept': 'text/html, application/xhtml+xml, image/jxr, */*',
        'Accept-Encoding': 'gzip, deflate',
        'Cache-Control': 'no-cache',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Host': 'chinaip.sipo.gov.cn',
'Cookie': 'cizi=2; DisplayCookies=%u7533%u8BF7%u53F7%7C120%23%u7533%u8BF7%uFF08%u4E13%u5229%u6743%uFF09%u4EBA%7C160%23%u56FD%u7701%u4EE3%u7801%7C100%23%u5206%u7C7B%u53F7%7C160%23%u516C%u5F00%uFF08%u516C%u544A%uFF09%u53F7%7C160%23%u4E3B%u5206%u7C7B%u53F7%7C120%23%u540D%u79F0%7C500%23%u53D1%u660E%uFF08%u8BBE%u8BA1%uFF09%u4EBA%7C160%23%u516C%u5F00%uFF08%u516C%u544A%uFF09%u65E5%7C160%23%u7533%u8BF7%u65E5%7C100%23%u672C%u56FD%u4E3B%u4E13%u5229%u4EE3%u7406%u673A%u6784%7C160%23%u4EE3%u7406%u4EBA%7C100%23%u5730%u5740%7C160; __tins__20911579=%7B%22sid%22%3A%201608962776970%2C%20%22vd%22%3A%201%2C%20%22expires%22%3A%201608964576970%7D; __51cke__=; __51laig__=1',
        'Referer': 'http://chinaip.sipo.gov.cn/'
    }  # request headers for the hidden login POST
    form_data = {
        'errorurl': 'error.jsp',
        'url': 'zljs/index.jsp?navRootID=1506&t=2',
        'channelid': '14,15,16',
        'name': 'cnipr',
        'password': '123456',
        'chanye': '1506',
    }  # form data for the hidden login POST
    # First request: simulate the hidden login
    session = requests.session()
    response = session.post(url='http://chinaip.sipo.gov.cn/login', headers=headers, data=form_data,
                            allow_redirects=False)
    # print(response)
    if response.status_code == 302:
        # print('First request succeeded!')
        # Get the cookies set by the login response
        cookie = session.cookies
        a = cookie.get_dict()
        # Keep the login session alive
        re1 = session.get('http://chinaip.sipo.gov.cn/zljs/hyjs-jieguo-mixed.jsp?firstsearch=1&searchword=%67901&searchChannel=&searchFrom=0&searchType=0&FTS=0&t=IPC&channelid=14,15,16,17')
        # Form parameters for the second request
        form_data2 = {
            'searchword': searchword,
            'channelid': '14',
            'sortfield': 'RELEVANCE',
            'currentChannelID': '14',
            'extension': '',
            'searchType': '0',
            'sortcolumn': 'RELEVANCE',
            'sRecordNumber': '1',
            'strdb': '14',
            'savesearchword': 'ON',
            'searchFrom': '0',
            'issearch': 'on',
            'page': page,
            'cizi': '2', }
        # Headers for the second and third requests
        headers3 = {'Accept': 'text/html, application/xhtml+xml, image/jxr, */*',
                    'Accept-Encoding': 'gzip, deflate',
                    'Accept-Language': 'zh-CN,zh;q=0.9',
                    'Cache-Control': 'max-age=0',
                    'Connection': 'keep-alive',
                    'Content-Length': '21353',
                    'Content-Type': 'application/x-www-form-urlencoded',
'Cookie': 'DisplayCookies=%u7533%u8BF7%u53F7%7C120%23%u7533%u8BF7%uFF08%u4E13%u5229%u6743%uFF09%u4EBA%7C160%23%u56FD%u7701%u4EE3%u7801%7C100%23%u5206%u7C7B%u53F7%7C160%23%u516C%u5F00%uFF08%u516C%u544A%uFF09%u53F7%7C160%23%u4E3B%u5206%u7C7B%u53F7%7C120%23%u540D%u79F0%7C500%23%u53D1%u660E%uFF08%u8BBE%u8BA1%uFF09%u4EBA%7C160%23%u516C%u5F00%uFF08%u516C%u544A%uFF09%u65E5%7C160%23%u7533%u8BF7%u65E5%7C100%23%u672C%u56FD%u4E3B%u4E13%u5229%u4EE3%u7406%u673A%u6784%7C160%23%u4EE3%u7406%u4EBA%7C100%23%u5730%u5740%7C160; JSESSIONID={}; __tins__20911579=%7B%22sid%22%3A%201609061045897%2C%20%22vd%22%3A%202%2C%20%22expires%22%3A%201609064216885%7D; __51cke__=; __51laig__=16; cizi=2; DisplayCookies=%u7533%u8BF7%u53F7%7C120%23%u7533%u8BF7%uFF08%u4E13%u5229%u6743%uFF09%u4EBA%7C160%23%u56FD%u7701%u4EE3%u7801%7C100%23%u5206%u7C7B%u53F7%7C160%23%u516C%u5F00%uFF08%u516C%u544A%uFF09%u53F7%7C160%23%u4E3B%u5206%u7C7B%u53F7%7C120%23%u540D%u79F0%7C500%23%u53D1%u660E%uFF08%u8BBE%u8BA1%uFF09%u4EBA%7C160%23%u516C%u5F00%uFF08%u516C%u544A%uFF09%u65E5%7C160%23%u7533%u8BF7%u65E5%7C100%23%u672C%u56FD%u4E3B%u4E13%u5229%u4EE3%u7406%u673A%u6784%7C160%23%u4EE3%u7406%u4EBA%7C100%23%u5730%u5740%7C160'.format(list(a.values())[0]),
                    'Host': 'chinaip.sipo.gov.cn',
                    'Origin': 'http://chinaip.sipo.gov.cn',
                    'Referer': 'http://chinaip.sipo.gov.cn/zljs/hyjs-jieguo-mixed.jsp?firstsearch=1&searchword=%231507&searchChannel=&searchFrom=0&searchType=0&FTS=0&t=IPC&channelid=14,15,16,17',
                    'Upgrade-Insecure-Requests': '1',
                    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko', }
        re2 = session.post(url='http://chinaip.sipo.gov.cn/zljs/hyjs-jieguo-mixed.jsp?', headers=headers3,
                           data=form_data2)
        if re2.status_code == 200:
            page_re = r"\"<iframe id='sOutline' src=\'(.*)\' width='770'"  # captures the next-page parameters
            page_mach = re.search(page_re, str(re2.text))
            if page_mach is None:
                # The iframe is missing, e.g. when the login did not take effect
                print('Page {}: iframe link not found in the response'.format(page))
                return None
            page_random = page_mach.group(1)
            # print(page_random)
            # Build the URL of the next page
            Next_url = 'http://chinaip.sipo.gov.cn/zljs/RecordFrame.jsp?' + page_random
            # print('Link for page {}:'.format(page), Next_url)
            # Fetch the JSP content
            content = session.get(url=Next_url, headers=headers3, allow_redirects=False)
            if content.status_code == 200:
                print('Page {}: JSP page fetched successfully!'.format(page))
                return content.text
            else:
                print('Page {}: error fetching the JSP page!'.format(page))
                return None
        else:
            print('Page {}: error getting the page link!'.format(page))
            return None
    else:
        print('First request failed')
        return None
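The link-extraction step in the function above can be tried on its own without touching the site. The following is a minimal sketch, not the author's code: the helper name extract_record_url and the sample HTML string are made up for illustration, and the real page markup may differ. It uses a non-greedy (.*?) group, which stops at the first closing quote, whereas the original pattern uses a greedy (.*).

```python
import re

# Same idea as page_re above, but non-greedy so the match ends at the first
# closing quote after src=. The leading double quote is left out.
IFRAME_RE = re.compile(r"<iframe id='sOutline' src='(.*?)' width='770'")

BASE_URL = 'http://chinaip.sipo.gov.cn/zljs/RecordFrame.jsp?'


def extract_record_url(html, base=BASE_URL):
    """Return the RecordFrame URL built from the iframe src, or None if absent."""
    match = IFRAME_RE.search(html)
    if match is None:
        return None
    return base + match.group(1)


# Made-up sample markup, only to exercise the helper.
SAMPLE_HTML = (
    "document.write(\"<iframe id='sOutline' src='a=1&b=2' "
    "width='770' height='600'></iframe>\");"
)

if __name__ == '__main__':
    print(extract_record_url(SAMPLE_HTML))
    print(extract_record_url('<html>no iframe here</html>'))
```

Returning None when the iframe is missing lets the caller skip or retry a page instead of failing with an AttributeError on a missing match.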
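2. Notes and improvements
Because the site responds slowly, crawling can be done at night, which avoids disturbing other users and also improves efficiency. In practice, given the large amount of data needed and the site's slow responses, I used multiple coroutines for crawling, which greatly improved throughput. To be clear, the code above is for learning and discussion only. When crawling, set a sleep interval between requests so that you do not crash the server.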
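The post does not show the coroutine code, so the following is only a sketch of the idea under stated assumptions. It uses asyncio from the standard library rather than eventlet, which the original imports. The names crawl_pages, max_concurrent and delay_range are hypothetical, and fetch stands for any blocking function with the signature fetch(page, searchword), such as get_jsp. It requires Python 3.9 or later for asyncio.to_thread.

```python
import asyncio
import random


async def crawl_pages(fetch, pages, searchword,
                      max_concurrent=3, delay_range=(1.0, 3.0)):
    """Fetch several result pages concurrently while throttling requests.

    Returns a dict mapping page number to page text for pages that succeeded.
    """
    # Limit how many requests are in flight at the same time.
    semaphore = asyncio.Semaphore(max_concurrent)

    async def worker(page):
        async with semaphore:
            # Sleep a random interval so the server is not hit in bursts.
            await asyncio.sleep(random.uniform(*delay_range))
            # Run the blocking, requests-based call in a worker thread.
            text = await asyncio.to_thread(fetch, page, searchword)
            return page, text

    results = await asyncio.gather(*(worker(p) for p in pages))
    # Drop pages where fetch returned None (its failure signal).
    return {page: text for page, text in results if text is not None}


if __name__ == '__main__':
    # Stand-in fetch function so the sketch runs without network access.
    def fake_fetch(page, searchword):
        return 'page {} for {}'.format(page, searchword)

    pages = asyncio.run(crawl_pages(fake_fetch, range(1, 6), 'demo',
                                    delay_range=(0.0, 0.1)))
    print(sorted(pages))
```

Keeping max_concurrent small and the delay range generous trades speed for a lighter load on the server, in line with the advice above.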
