How do I navigate to the next page using the session ID and/or referer when the URL stays static?

I am trying to scrape a website and succeeded with the first page. However, I have not managed to scrape the next page.

So far I am using requests and BeautifulSoup, with the following code to get the content of the first page:

import requests
from bs4 import BeautifulSoup as soup  # the call below implies this alias
r = requests.get(url)
data = soup(r.content, 'html.parser')

This returns some lovely HTML. The part containing the paging and referrer information looks like this:

<div class="content-header-navigation">
<form action="/arcinsys/recherchePagingSelect.action" id="headerPagingForm" method="get" name="headerPagingForm">
<input name="_csrf" type="hidden" value="45cc8dd5-2869-4327-957e-83ffcbe08fba"/>
<span class="pagingLinks" data-href="/arcinsys/recherchePaging.action?pagingvalues=1" id="pId1">
<img id="pagingtestid1" src="/arcinsys/images/aktion_first_w.png"/>
</span>
<span class="pagingLinks" data-href="/arcinsys/recherchePaging.action?pagingvalues=1" id="pId2">
<img id="pagingtestid2" src="/arcinsys/images/aktion_prev_w.png"/>
</span>
<span class="formfieldset"><input id="pageposition" maxlength="6" name="pageposition" size="6" style="width: 50px" type="text" value="1"/>
<button id="formSubmitButton2" title="Seite 1 von 2">  / 2 </button></span>
<span class="pagingLinks" data-href="/arcinsys/recherchePaging.action?pagingvalues=2" id="pId3">
<img id="pagingtestid3" src="/arcinsys/images/aktion_next_w.png"/>
</span>
<span class="pagingLinks" data-href="/arcinsys/recherchePaging.action?pagingvalues=2" id="pId4">
<img id="pagingtestid4" src="/arcinsys/images/aktion_last_w.png"/>
</span>
</form>
</div>

I can tell that I am on page 1 of 2, but how do I get to page 2? I managed to get hold of the session and the cookies:

session = requests.session()
cookies = r.cookies

print(session)
for cookie in r.cookies:
    print(cookie)

This gives the following two results:

<requests.sessions.Session object at 0x0000017E20A7DAF0>
<Cookie JSESSIONID=A4FF49C5577C2A8EFCB0FCD6F2C2D181 for arcinsys.hessen.de/arcinsys>

The HTML above also contains the referrer URL:

data-href="/arcinsys/recherchePaging.action?pagingvalues=2"

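I have now tried various ways of passing the session ID, cookie, or referrer URL, but nothing has worked so far. I may be doing it wrong, and I am also not sure which approach is best. Any help is much appreciated!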

Answer from a uj5u.com user:

The content is loaded dynamically via an additional request, which you can inspect in the XHR tab of your browser's DevTools.
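The paging link carries no search terms, so the server presumably looks up the result set through the JSESSIONID cookie (an inference from the HTML in the question, not something the site documents). In the question the page was fetched with requests.get, so the separately created session never saw that cookie. A minimal sketch, built offline with prepare_request and using the cookie value quoted in the question, of how a requests.Session attaches a stored cookie to later requests:

```python
import requests

# Cookie values copied from the question; in a real run the server sets them
# on the first s.get(...) and the session stores them automatically.
session = requests.Session()
session.cookies.set(
    "JSESSIONID",
    "A4FF49C5577C2A8EFCB0FCD6F2C2D181",
    domain="arcinsys.hessen.de",
    path="/arcinsys",
)

# Build (but do not send) the request for page 2 so its headers can be inspected.
request = requests.Request(
    "GET",
    "https://arcinsys.hessen.de/arcinsys/recherchePaging.action?pagingvalues=2",
)
prepared = session.prepare_request(request)

# The session adds the Cookie header without any manual handling.
print(prepared.headers["Cookie"])
```

In other words, making every call through the same session object should be enough; there is no need to copy the session ID or cookie by hand.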

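You could use a while loop: select the element with id="pId3" (the next-page button) and compare its data-href with that of the element with id="pId4" (the last-page button). Once the two are equal, the page you request next is the last one, so fetch it and then break out of the loop:

last_page = False
while True:
    soup = BeautifulSoup(s.get(url).text, 'html.parser')

    ### extract what you need

    if last_page:
        break
    next_href = soup.select_one('#pId3').get('data-href')
    # equal hrefs mean the next request fetches the last page
    last_page = next_href == soup.select_one('#pId4').get('data-href')
    url = baseUrl + next_href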
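The paging form also exposes the current page number (the pageposition field) and, through the last-page link, the total number of pages, which can be handy for logging progress or for an alternative stop condition. A rough sketch, run against a trimmed copy of the HTML from the question; the helper name paging_state is made up for illustration:

```python
from urllib.parse import parse_qs, urlsplit

from bs4 import BeautifulSoup

# Trimmed copy of the paging form shown in the question.
SAMPLE_HTML = """
<form id="headerPagingForm">
<input id="pageposition" name="pageposition" type="text" value="1"/>
<span class="pagingLinks" data-href="/arcinsys/recherchePaging.action?pagingvalues=2" id="pId3"></span>
<span class="pagingLinks" data-href="/arcinsys/recherchePaging.action?pagingvalues=2" id="pId4"></span>
</form>
"""


def paging_state(html):
    """Return (current_page, last_page, next_href) read from the paging form."""
    soup = BeautifulSoup(html, "html.parser")
    current_page = int(soup.select_one("#pageposition")["value"])
    next_href = soup.select_one("#pId3")["data-href"]
    last_href = soup.select_one("#pId4")["data-href"]
    # The page number is the value of the `pagingvalues` query parameter.
    last_page = int(parse_qs(urlsplit(last_href).query)["pagingvalues"][0])
    return current_page, last_page, next_href


print(paging_state(SAMPLE_HTML))
```

This assumes the form keeps the same ids on every result page, which the source only shows for page 1.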

Example

import requests
from bs4 import BeautifulSoup

s = requests.session()
s.headers.update({'User-Agent': 'Mozilla/5.0'})
baseUrl = 'https://arcinsys.hessen.de'
url = 'https://arcinsys.hessen.de/arcinsys/einfachsuchen.action?pageName=einfachesuche&methodName=einfach&rechercheBean.defaultfield=&rechercheBean.defaultfield_widget=wort&rechercheBean.von=&rechercheBean.bis=&rechercheBean.einfacheSucheRadioName=alle&__checkbox_rechercheBean.hasdigi=true'

last_page = False
while True:
    soup = BeautifulSoup(s.get(url).text, 'html.parser')

    print([cell.text for cell in soup.select('td.cell-signature')])

    # stop once the page flagged as the last one has been printed
    if last_page:
        break
    next_href = soup.select_one('#pId3').get('data-href')
    # when "next" and "last" point to the same page, the next request is the final one
    last_page = next_href == soup.select_one('#pId4').get('data-href')
    url = baseUrl + next_href

Output

['HStAM, 17 d', 'HStAM, 340 Stölzel', 'AdJb, A 216, ...', 'ISG FFM, W2-7, 3200', 'UBA Ffm, Na 49, 116', 'UBA Ffm, Na 62, 335', 'ISG FFM, W2-7, 4150', 'ISG FFM, S3, 30135', 'HStAD, G 37, 4776', 'HStAM, 340 Grimm, Ms 272', 'HStAM, 340 von Schwertzell, 859 d', 'HStAD, G 15 Schotten, B 76', 'HStAM, 311/1, B 59', 'StadtA KS, P 1, 914', 'UBA Ffm, Na 67 , 190', 'LWV-Archiv, B 100-10, 531', 'ISG FFM, W2-7, 3201', 'ISG FFM, W2-7, 3202', 'ISG FFM, W2-7, 2340', 'ISG FFM, W2-7, 1121']
...
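The example builds the next URL by joining baseUrl and the data-href value, which works here because data-href is an absolute path. As a small sketch of an alternative, urljoin from the standard library resolves the link against the page that was just fetched and copes with absolute and relative links alike. The shortened current_url and the relative link are made-up inputs for illustration:

```python
from urllib.parse import urljoin

# Shortened form of the search URL from the example (illustrative only).
current_url = "https://arcinsys.hessen.de/arcinsys/einfachsuchen.action?pageName=einfachesuche"

# data-href exactly as it appears in the question's HTML (absolute path).
absolute_href = "/arcinsys/recherchePaging.action?pagingvalues=2"
# Hypothetical relative variant, to show that it resolves to the same URL.
relative_href = "recherchePaging.action?pagingvalues=2"

next_url = urljoin(current_url, absolute_href)
same_url = urljoin(current_url, relative_href)

print(next_url)
print(same_url)
```

Both calls should give https://arcinsys.hessen.de/arcinsys/recherchePaging.action?pagingvalues=2, so the loop would not need a separate baseUrl variable.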
