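As a Java newbie, I wanted to get a feel for web scraping in Python. I'd heard that China Judgements Online (文書網, wenshu.court.gov.cn) has serious anti-scraping defenses, so I went to give it a try.
Well then.
Whoa, what is this? Not playing fair at all.
First off, the site lives up to the reputation of slow government websites: 70 billion visits, plus bots hammering it around the clock, so even ordinary browsing lags badly.
When in doubt, ask Baidu. Of the latest approaches I found online, the most common is still reverse-engineering the POST parameters,
namely these three: pageId, ciphertext, and __RequestVerificationToken.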

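I tried that too. Nobody explains how to handle the cookie parameters; everyone just says to log in and then hard-code them. It never worked for me: I kept getting "無權限訪問介面" ("no permission to access the interface").
So I moved on and tried Web Scraper. Whoa: the site times out badly, sometimes going a full minute without a response, and the scraper kept breaking on top of that. The biggest problem was that it could only grab a single page, so it was of little use and I dropped it.
Now for the main event!!! Pay attention, this is where I change tactics.
Selenium: simulate a real user's visit and pull the data out with XPath. So far this has been going pretty smoothly.
The site has a 600-result limit: a query shows at most 600 results, and to see anything beyond that you have to narrow it down with advanced search conditions.
The approach!!! Pay attention.
1. See the court map (法院地圖) on the home page?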

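Get the full list of courts. (What, you don't know how? Neither do I...) There should be some government site that lists the court names. I'm only offering the idea here, because I ran the advanced search against one specific court and then narrowed it down to a single month, which keeps each query under 600. (What if it goes over 600? Come on, what court uploads more than 600 judgments in a month? The clerks would drop dead. Well, that's what I figure, anyway. Chin-in-hand emoji.)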
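If a court ever does go over 600 results in a month, one way out is to keep halving the date range until every piece fits under the limit. Below is a rough sketch of that idea under stated assumptions: `count_results` is a made-up callback that would return the result count the site reports for a judgment-date range, and `split_until_under_limit` is just an illustrative name. Nothing here talks to the site.

```python
from datetime import date, timedelta

RESULT_LIMIT = 600  # the cap on viewable results mentioned above


def split_until_under_limit(start, end, count_results, limit=RESULT_LIMIT):
    """Return a list of (start, end) date ranges with at most `limit` results each.

    count_results(start, end) is a hypothetical callback that returns how
    many documents the site reports for that judgment-date range.
    """
    if start == end or count_results(start, end) <= limit:
        # Small enough, or a single day that cannot be split any further.
        return [(start, end)]
    # Cut the range in half and handle each half the same way.
    middle = start + (end - start) // 2
    left = split_until_under_limit(start, middle, count_results, limit)
    right = split_until_under_limit(middle + timedelta(days=1), end,
                                    count_results, limit)
    return left + right


if __name__ == "__main__":
    # Fake counter for demonstration: pretend every day has 100 documents.
    def fake_count(s, e):
        return ((e - s).days + 1) * 100

    for s, e in split_until_under_limit(date(2020, 1, 1), date(2020, 1, 31),
                                        fake_count):
        print(s, e, fake_count(s, e))
```

In a real run the callback would have to perform a search and read the count off the page, which costs a slow round trip each time, so this only makes sense as a fallback for the rare month that is too big.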
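2. Next, have the program drive the browser: open the URL automatically and log in. (After a successful login the site sometimes asks for a CAPTCHA; just type it in by hand.)
Before that, I did a rough manual check: nothing before 2013 has any data. (What, some courts do? Thumbs up, feel free to go back a few more years.) (What, why does this matter?) Because the advanced search gets filled in one whole month at a time, using the judgment date (裁判日期) fields, with the court you want in the court name field. (For anything more specific, fill in the other fields yourself.)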

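After a successful login you land on the home page.
Then loop away: open advanced search, fill in the fields, click search (wait a few dozen seconds; it varies, a minute at most), select all documents, click batch download, click next page (wait again, up to a minute), select all, batch download, next page, and so on until the last page is downloaded!!! Then open advanced search again for the next month and repeat the whole routine... (thirsty)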
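Since the wait "varies, a minute at most", a fixed sleep either wastes time or is too short. A small polling helper is one alternative. This is a minimal sketch: `wait_until` and its parameters are names made up for illustration, and Selenium's own `WebDriverWait(driver, timeout).until(...)` covers the same ground when you work with a real driver.

```python
import time


def wait_until(condition, timeout=60.0, interval=1.0):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds have
    passed. Exceptions raised by condition() are treated as "not ready yet".
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except Exception:
            pass  # e.g. the element is not on the page yet
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


if __name__ == "__main__":
    # Demonstration with a fake condition that becomes true after 2 seconds.
    started = time.monotonic()
    print(wait_until(lambda: time.monotonic() - started > 2, timeout=10))
```

In the scraper, the condition could be a lambda that looks up the select-all checkbox on the results page, so the loop moves on as soon as the page has actually loaded.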
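3. On to the code
Chrome browser, plus its driver (chromedriver). The calls below use the Selenium 4 API.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
bro = webdriver.Chrome(service=Service('chromedriver.exe'))
# Open the page
bro.get('https://wenshu.court.gov.cn/')
Maximize the window. Why refresh as well? Sigh, this thing doesn't finish loading otherwise! There are more refreshes further down; try it and you'll see.
# Maximize the window
bro.maximize_window()
time.sleep(2)
bro.refresh()
# Locate the login button
login_tag = bro.find_element(By.XPATH, '//*[@id="loginLi"]/a')
# Click it
time.sleep(2)
login_tag.click()
time.sleep(2)
bro.refresh()
# Switch into the login iframe
bro.switch_to.frame("contentIframe")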
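...
I'll stop here; see the full code below!!!
4. Heads up, pay attention: the complete code is at the link below, hehe, only 5 C coins, go grab it!!!
What, you don't see a link? Sigh, the company installed some security software and file uploads are blocked!! Tens of thousands lost in an instant!!!
So here it is!!!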
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import calendar
import time

bro = webdriver.Chrome(service=Service('chromedriver.exe'))
# Open the page
bro.get('https://wenshu.court.gov.cn/')
# Maximize the window
bro.maximize_window()
time.sleep(2)
bro.refresh()
# Locate the login button, then click it
login_tag = bro.find_element(By.XPATH, '//*[@id="loginLi"]/a')
time.sleep(2)
login_tag.click()
time.sleep(2)
bro.refresh()
# Switch into the login iframe
bro.switch_to.frame("contentIframe")
# Locate the phone number field, the password field, and the login button
username_path = bro.find_element(By.XPATH, '//*[@class="phone-number-input"]')
password_path = bro.find_element(By.XPATH, '//*[@class="password"]')
login_in = bro.find_element(By.XPATH, '//*[@id="root"]/div/form/div/div[3]/span')
time.sleep(1)
username_path.send_keys("")  # your phone number
time.sleep(1)
password_path.send_keys("")  # your password
time.sleep(1)
# Submit the login form; if a CAPTCHA shows up, type it by hand in the browser
login_in.click()
# Leave the iframe and go back to the top-level page
bro.switch_to.default_content()

# One start/end date pair per month, 2014-03 through 2021-10.
# calendar.monthrange gives the real last day of each month
# (dates such as 2014-02-31 or 2014-04-31 do not exist).
# Widen the bounds to cover other years.
months = [(y, m) for y in range(2014, 2022) for m in range(1, 13)
          if (2014, 3) <= (y, m) <= (2021, 10)]
start_time = ["%d-%02d-01" % (y, m) for y, m in months]
end_time = ["%d-%02d-%02d" % (y, m, calendar.monthrange(y, m)[1]) for y, m in months]

for index, item in enumerate(start_time):
    print(index, item)
    time.sleep(10)
    # Open advanced search
    gaojisousuo = bro.find_element(By.XPATH, '//*[@class="advenced-search"]')
    gaojisousuo.click()
    # Court name: enter it exactly as the site displays it
    fayuanVal = bro.find_element(By.XPATH, '//*[@id="s2"]')
    fayuanVal.clear()
    fayuanVal.send_keys("晉州市人民法院")
    # Judgment date: first and last day of the month
    startTime = bro.find_element(By.XPATH, '//*[@id="cprqStart"]')
    startTime.clear()
    startTime.send_keys(item)
    endTime = bro.find_element(By.XPATH, '//*[@id="cprqEnd"]')
    endTime.clear()
    endTime.send_keys(end_time[index])
    sousuo = bro.find_element(By.XPATH, '//*[@id="searchBtn"]')
    time.sleep(5)
    sousuo.click()
    # The results can take up to a minute to come back
    time.sleep(60)
    # Check whether there are any results first
    page_num_all = bro.find_element(By.XPATH, '//*[@id="_view_1545184311000"]/div[1]/div[2]/span')
    if page_num_all.text != '0':
        page_num = 1
        while True:
            # Select everything on the page, then click batch download
            all_select = bro.find_element(By.XPATH, '//*[@id="AllSelect"]')
            all_select.click()
            time.sleep(5)
            all_download = bro.find_element(By.XPATH, '//*[@id="_view_1545184311000"]/div[2]/div[4]/a[3]')
            all_download.click()
            time.sleep(5)
            # The last link in the pager is "next page"; it is disabled on the last page
            next_click = bro.find_element(By.XPATH, '//*[@id="_view_1545184311000"]/div[last()]/a[last()]')
            class_name = next_click.get_attribute('class')
            if class_name == 'disabled pageButton':
                break
            next_click.click()
            page_num += 1
            print(page_num)
            time.sleep(70)
The comments are a bit sparse; I wrote this for fun! The approach is still the one described above.
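With a minute or more per page and a site that times out a lot, a long run will probably get interrupted somewhere. One possible add-on is to record which months are already downloaded so a restart can skip them. This is only a sketch: the file name `progress.json` and the helpers `load_done` and `mark_done` are hypothetical and not part of the script above.

```python
import json
from pathlib import Path


def load_done(path):
    """Return the set of month keys already recorded in the progress file."""
    file = Path(path)
    if not file.exists():
        return set()
    return set(json.loads(file.read_text(encoding="utf-8")))


def mark_done(path, key):
    """Add one month key (e.g. its start date) to the progress file."""
    done = load_done(path)
    done.add(key)
    Path(path).write_text(json.dumps(sorted(done)), encoding="utf-8")


if __name__ == "__main__":
    progress_file = "progress.json"  # hypothetical file name
    for month_start in ["2014-03-01", "2014-04-01"]:
        if month_start in load_done(progress_file):
            continue  # already downloaded in an earlier run
        # ... search and download this month here ...
        mark_done(progress_file, month_start)
```

Calling `mark_done` only after the last page of a month has been downloaded means a crash in the middle of a month simply redoes that month on the next run.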
Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/341888.html
