Is there a more efficient way to access a JavaScript-rendered table without Selenium?

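I am currently working on a side project to scrape the results of a web form that returns a table rendered with JavaScript.

I have managed to do this easily with Selenium. However, I query the form about 5,000 times based on a CSV file, which leads to a long processing time (about 9 hours).

I would like to know whether there is a way to access the response data directly from Python using the generated request URL, rather than rendering the JavaScript.

The web form in question: https://probatesearch.service.gov.uk/

An example of the network request URL captured after both parts of the form are completed (entering a year before 1996 produces a different response, which can be ignored):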

https://probatesearch.service.gov.uk/api/nuxeo/api/v1/search/pp/pp_mainstream_default_search/execute?pageProvider=pp_mainstream_default_search&currentPageIndex=0&hmcts_grant_schema_grantdocTypeOf=1&hmcts_grant_schema_surname=SMITH&hmcts_grant_schema_dateofdeath_min=2019-03-23T00:00:00.000Z&hmcts_grant_schema_dateofdeath_max=2019-03-23T00:00:00.000Z&hmcts_grant_schema_dateofprobate_min=2022-02-01T00:00:00.000Z&hmcts_grant_schema_dateofprobate_max=2022-03-02T00:00:00.000Z&hmcts_grant_schema_firstnames=TREVOR&sortBy=&sortOrder=DESC

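I tried handling this request with BeautifulSoup, urllib, and requests, but had no luck extracting meaningful data; I am fairly new to web scraping.

The output I get with urllib or requests is as follows: JSON Response

Unfortunately, this does not include any of the actual data from the requested table (e.g. name, date of death, etc.).

I would like to capture the table response (if any) as JSON or a DataFrame for further processing. Any help is appreciated.

Edit: here is a screenshot of the table data I am trying to access once the form is completed and submitted: Required table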

A uj5u.com user replied:

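The general answer is that the UK government (or possibly just the court system) appears to be implementing an API for accessing the kind of data you are looking for. You should definitely read up on it, and on APIs in general.

More specifically, in your case the data is available through an API call, which you can inspect in the developer tools tab of your browser. See more here, as one of many examples.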
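As a small illustration, the request URL captured in the question can be split into its endpoint and query parameters using only the standard library. This is a sketch for inspecting the call, not part of the site's documented API; the variable names are my own.

```python
from urllib.parse import urlsplit, parse_qsl

# Request URL copied from the question (captured in the browser's network tab).
captured = (
    "https://probatesearch.service.gov.uk/api/nuxeo/api/v1/search/pp/"
    "pp_mainstream_default_search/execute"
    "?pageProvider=pp_mainstream_default_search&currentPageIndex=0"
    "&hmcts_grant_schema_grantdocTypeOf=1&hmcts_grant_schema_surname=SMITH"
    "&hmcts_grant_schema_dateofdeath_min=2019-03-23T00:00:00.000Z"
    "&hmcts_grant_schema_dateofdeath_max=2019-03-23T00:00:00.000Z"
    "&hmcts_grant_schema_dateofprobate_min=2022-02-01T00:00:00.000Z"
    "&hmcts_grant_schema_dateofprobate_max=2022-03-02T00:00:00.000Z"
    "&hmcts_grant_schema_firstnames=TREVOR&sortBy=&sortOrder=DESC"
)

parts = urlsplit(captured)
endpoint = f"{parts.scheme}://{parts.netloc}{parts.path}"
# keep_blank_values=True preserves empty parameters such as sortBy=
params = dict(parse_qsl(parts.query, keep_blank_values=True))

print(endpoint)
for key, value in params.items():
    print(f"{key} = {value!r}")
```

The resulting `endpoint` and `params` can be passed straight to `requests.get(endpoint, params=params, ...)`, which makes it easy to see which values change from one search to the next.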

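So in this case I assume you know some (but not all) of the information about the case (in the example below, the surname, the year of death, and the year of probate) and send an API request containing that information. The call retrieves 7 entries.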

import requests

url = 'https://probatesearch.service.gov.uk/api/nuxeo/api/v1/search/pp/pp_mainstream_default_search/execute'

# Known details of the case: surname, year of death, year of probate
last_name, death, probate = 'SMITH', 2019, 2022

# Properties to print for each matching entry
targets = [
    'hmctsgrant:surname', 'hmctsgrant:firstnames', 'hmctsgrant:dateofdeath',
    'hmctsgrant:dateofprobate', 'hmctsgrant:probatenumber',
    'hmctsgrant:grantdocTypeoOfName', 'hmctsgrant:registryofficename',
]

# Query parameters, mirroring the request sent by the site's own search page
params = (
    ('pageProvider', 'pp_mainstream_default_search'),
    ('currentPageIndex', '0'),
    ('hmcts_grant_schema_grantdocTypeOf', '1'),
    ('hmcts_grant_schema_surname', last_name),
    ('hmcts_grant_schema_dateofdeath_min', f'{death}-01-01T00:00:00.000Z'),
    ('hmcts_grant_schema_dateofdeath_max', f'{death}-12-31T00:00:00.000Z'),
    ('hmcts_grant_schema_dateofprobate_min', f'{probate}-01-01T00:00:00.000Z'),
    ('hmcts_grant_schema_dateofprobate_max', f'{probate}-12-31T00:00:00.000Z'),
    ('hmcts_grant_schema_firstnames', ''),
    ('sortBy', ''),
    ('sortOrder', 'DESC'),
)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:97.0) Gecko/20100101 Firefox/97.0',
    'Accept': 'application/json',
    'Referer': 'https://probatesearch.service.gov.uk/search-results',
    # Asks the API to include the hmcts_grant_schema properties in each entry
    'properties': 'hmcts_grant_schema',
}
response = requests.get(url, headers=headers, params=params)
response.raise_for_status()  # stop early on an HTTP error

data = response.json()
for entry in data['entries']:
    info = entry['properties']
    for target in targets:
        print(info[target])
    print('------------')

The output in this case is:

Smith
Trevor Floyd
2019-03-23T00:00:00.000Z
2022-02-03T00:00:00.000Z
1641476859693801
ADMINISTRATION
Newcastle
------------
Smith
David William
2019-02-06T00:00:00.000Z
2022-02-04T00:00:00.000Z
1643363130442596
ADMINISTRATION
Newcastle
------------

and so on.
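If a search matches more entries than fit on one page, the `currentPageIndex` parameter suggests the results are paginated. The following is a minimal sketch of walking through the pages. It assumes, without having confirmed it against this service, that each response carries an `entries` list and a Nuxeo-style `isNextPageAvailable` flag; `fetch_all_entries`, `fetch_page`, and the `uid` values in the stand-in data are hypothetical names used for illustration.

```python
def fetch_all_entries(fetch_page, max_pages=50):
    """Collect entries from consecutive result pages.

    fetch_page(index) must return the decoded JSON dict for that page.
    ASSUMPTION: each page has an 'entries' list and a Nuxeo-style
    'isNextPageAvailable' flag; check a real response before relying on this.
    """
    entries = []
    for index in range(max_pages):
        page = fetch_page(index)
        entries.extend(page.get('entries', []))
        if not page.get('isNextPageAvailable', False):
            break
    return entries


# Stand-in for the real request so the sketch runs offline.
fake_pages = [
    {'entries': [{'uid': 'a'}, {'uid': 'b'}], 'isNextPageAvailable': True},
    {'entries': [{'uid': 'c'}], 'isNextPageAvailable': False},
]
all_entries = fetch_all_entries(lambda index: fake_pages[index])
print(len(all_entries))  # 3
```

Against the real endpoint, `fetch_page` would wrap the `requests.get` call with `currentPageIndex` set to `str(index)` and return `response.json()`. The `max_pages` cap is only a safety net against an endless loop.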

You can obviously load the output into a pandas DataFrame or whatever else you need to work with.
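As a rough sketch of that last step, the entries can be flattened into plain dictionaries, one per result, and gathered across all the queries read from a CSV file. The helper names (`entries_to_rows`, `run_batch`), the CSV column names (`surname`, `death_year`, `probate_year`), and the canned `sample` response are assumptions made for illustration; the property keys are the ones printed above.

```python
import csv
import io

FIELDS = ['surname', 'firstnames', 'dateofdeath', 'dateofprobate',
          'probatenumber', 'registryofficename']


def entries_to_rows(data, fields=FIELDS):
    """Flatten the 'entries' of one API response into a list of plain dicts."""
    rows = []
    for entry in data.get('entries', []):
        props = entry.get('properties', {})
        # .get() yields None instead of raising if a property is missing
        rows.append({f: props.get(f'hmctsgrant:{f}') for f in fields})
    return rows


def run_batch(csv_file, fetch):
    """Run one query per CSV row and gather every result row.

    fetch(surname, death_year, probate_year) must return the decoded JSON
    response, for example by wrapping a requests.get call.
    """
    rows = []
    for query in csv.DictReader(csv_file):
        data = fetch(query['surname'], query['death_year'], query['probate_year'])
        rows.extend(entries_to_rows(data))
    return rows


# Offline demonstration with a canned response based on the output above.
sample = {'entries': [{'properties': {
    'hmctsgrant:surname': 'Smith',
    'hmctsgrant:firstnames': 'Trevor Floyd',
    'hmctsgrant:dateofdeath': '2019-03-23T00:00:00.000Z',
    'hmctsgrant:dateofprobate': '2022-02-03T00:00:00.000Z',
    'hmctsgrant:probatenumber': '1641476859693801',
    'hmctsgrant:registryofficename': 'Newcastle',
}}]}
queries = io.StringIO('surname,death_year,probate_year\nSMITH,2019,2022\n')
rows = run_batch(queries, lambda surname, death, probate: sample)
print(rows[0]['firstnames'])  # Trevor Floyd
```

With pandas installed, `pandas.DataFrame(rows)` turns the result into a DataFrame. For the real run, reusing a single `requests.Session` inside `fetch` avoids reopening a connection for each of the roughly 5,000 queries.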
