Hi, I am trying to get the PDF links from this website.
My code:
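import requests
from bs4 import BeautifulSoup
from lxml import html
# Fetch the search page and list every <a> element found in the static HTML
r = requests.get('http://www.italgiure.giustizia.it/sncass/')
soup = BeautifulSoup(r.text, 'html.parser')
pdf_list = soup.find_all('a')
print(pdf_list)
# Try to reach the result entry through an XPath expression
search_html = html.fromstring(r.text)
page_link = search_html.xpath('//*[@id="contentData"]/div[2]/div[1]/div/h3/a/span[1]/span')
print(page_link)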
Result:
[<a href="accessibilita.html" style="text-decoration:none;font-size:80%;color:white" tabindex="0">Accessibilità</a>, <a accesskey="r" name="results" onclick="$(this).next().focus();" tabindex="-2" title="contenuto"></a>, <a accesskey="1" name="card" onclick="$(this).next().focus();" tabindex="-2" title="documento"></a>, <a class="text2pdf" href="javascript:void(0)" onclick="toTargetDoc($('.toDocument.pdf',$(this)).attr('data-arg'), this)" style="text-decoration:none;color:#440;" tabindex="0"> <span data-arg="filename" data-role="content" title="pdf"></span> <span class="chkcontent"><span class="label">Sez.</span> <span class="risultato" data-arg="szdec" data-role="content"></span> <span class="risultato" data-arg="kind" data-role="content"></span><span class="chkcontent"> - <span class="risultato" data-arg="ssz" data-role="content"></span></span><span class="label">,</span> </span> <span data-arg="tipoprov" data-role="content"></span> <span class="chkcontent"><span class="chkcontent"><span class="label">n.</span><span data-arg="numcard" data-role="content"></span></span><span data-arg="numdec" data-role="content" style="display:none"></span><span data-arg="numdep" data-role="content" style="display:none"></span> <span class="chkcontent"><span class="label"> del </span><span data-arg="datdep" data-role="content"></span><span data-arg="ecli" data-role="content" style="font-weight:normal"></span><span data-arg="anno" data-role="content" style="display:none"></span><span class="label">,</span></span> </span> <span class="chkcontent"><span class="label">udienza del</span> <span data-arg="datdec" data-role="content"></span><span class="label">,</span></span> <span class="chkcontent"><span class="label">Presidente </span><span data-arg="presidente" data-role="content"></span> </span> <span class="chkcontent"><span class="label">Relatore </span><span data-arg="relatore" data-role="content"></span> </span> </a>, <a class="text2ocr" href="javascript:void(0)" 
onclick="toTargetText($('.toDocument.txt',$(this)).attr('data-arg'))" style="text-decoration:none;color:#440;" tabindex="0"> <span data-arg="testoocr" data-role="content" title="testo ocr"></span> <span data-arg="ocr" data-role="datasubset"> <span data-arg="ocr" data-role="multivaluedcontent">snippet</span> </span> </a>, <a href="http://www.italgiure.giustizia.it" style="color:white;" tabindex="0">ItalgiureWeb</a>] []
In the result above I cannot retrieve the web links, which are in <span data-arg="/xw...
I also tried selecting the span by its class, i.e.:
pdf_list = soup.find('span', {'class': 'toDocument pdf'})
The HTML is:
<a href="javascript:void(0)" tabindex="0" onclick="toTargetDoc($('.toDocument.pdf',$(this)).attr('data-arg'), this)" style="text-decoration:none;color:#440;" class="text2pdf"> <span data-role="content" data-arg="filename" title="pdf"><span class="toDocument pdf" data-arg="/xway/application/nif/clean/hc.dll?verbo=attach&db=snciv&id=./20221107/snciv@s50@a2022@[email protected]"><img class="rowIcon" alt="formato pdf" src="pix/pdf.png"></span></span> <span class="chkcontent"><span class="label">Sez.</span> <span class="risultato" data-role="content" data-arg="szdec">QUINTA</span> <span class="risultato" data-role="content" data-arg="kind">CIVILE</span><span class="label">,</span> </span> <span data-role="content" data-arg="tipoprov">Ordinanza</span> <span class="chkcontent"><span class="chkcontent"><span class="label">n.</span><span data-role="content" data-arg="numcard">32765</span></span><span style="display:none" data-role="content" data-arg="numdec">32765</span><span style="display:none" data-role="content" data-arg="numdep"></span> <span class="chkcontent"><span class="label"> del </span><span data-role="content" data-arg="datdep">07/11/2022</span><span style="font-weight:normal" data-role="content" data-arg="ecli"> (ECLI:IT:CASS:2022:32765CIV)</span><span style="display:none" data-role="content" data-arg="anno">2022</span><span class="label">,</span></span> </span> <span class="chkcontent"><span class="label">udienza del</span> <span data-role="content" data-arg="datdec"><span style="font-weight:normal">19/10/2022</span></span><span class="label">,</span></span> <span class="chkcontent"><span class="label">Presidente </span><span data-role="content" data-arg="presidente">PAOLITTO LIBERATO</span> </span> <span class="chkcontent"><span class="label">Relatore </span><span data-role="content" data-arg="relatore">DELL'ORFANO ANTONELLA</span> </span> </a>
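Please let me know how to handle this. Thanks in advance.
Answer:
The files come from a POST request, which you need to mimic in order to get them. The page you downloaded only contains empty templates (note the empty data-arg="filename" spans in your output), so the links are not in the static HTML.
For example: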
import requests
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
}
query_url = "http://www.italgiure.giustizia.it/sncass/isapi/hc.dll/sn.solr/sn-collection/select?app.query="
payload = {
    "start": "0",
    "rows": "10",
    "q": "((kind:\"snciv\" OR kind:\"snpen\")) AND szdec:\"F\" AND anno:\"2022\"",
    "wt": "json",
    "indent": "off",
    "sort": "pd desc,numdec desc",
    "fl": "id,filename,szdec,kind,ssz,tipoprov,numcard,numdec,numdep,datdep,ecli,anno,datdec,presidente,relatore,testoocr,ocr",
    "hl": "true",
    "hl.snippets": "4",
    "hl.fragsize": "100",
    "hl.fl": "ocr",
    "hl.q": "nomatch AND szdec:\"F\" AND anno:\"2022\"",
    "hl.maxAnalyzedChars": "1000000",
    "hl.simple.pre": "<em class=\"hit\">",
    "hl.simple.post": "</em>",
}
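# Send the same POST request the page makes and keep the list of result documents
docs = requests.post(query_url, headers=headers, data=payload).json()["response"]["docs"]
base_url = "http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id="
for doc in docs:
    # "filename" is a list; the downloadable file uses the ".clean.pdf" suffix
    print(f'{base_url}{doc["filename"][0].replace(".pdf", ".clean.pdf")}')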
This will get you the first 10 .pdfs for FERIALE -> 2022:
http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20221103/snpen@sF0@a2022@n41566@[email protected]
http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20221021/snpen@sF0@a2022@[email protected]
http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20221019/snpen@sF0@a2022@[email protected]
http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20221019/snpen@sF0@a2022@[email protected]
http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20221019/snpen@sF0@a2022@[email protected]
http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20221012/snpen@sF0@a2022@n38545@[email protected]
http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20220928/snpen@sF0@a2022@[email protected]
http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20220928/snpen@sF0@a2022@[email protected]
http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20220926/snpen@sF0@a2022@[email protected]
http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll?verbo=attach&db=snpen&id=./20220926/snpen@sF0@a2022@[email protected]
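The example above hardcodes db=snpen in base_url, while the query also matches kind:"snciv". In the URLs shown in this thread, the db value matches the part of the file's base name before the first "@" (snciv in the question's HTML, snpen in the list above). A minimal sketch that derives the db from the filename, assuming that pattern holds for every result; pdf_url and ATTACH_URL are hypothetical names, and the sample filename in the usage line is made up for illustration:

```python
# Assumption: the attachment URL always has this shape, as in the URLs shown above.
ATTACH_URL = (
    "http://www.italgiure.giustizia.it/xway/application/nif/clean/hc.dll"
    "?verbo=attach&db={db}&id={doc_id}"
)


def pdf_url(filename: str) -> str:
    """Build an attachment URL from one "filename" value of the JSON response."""
    # Assumption: the base name starts with the database name, e.g. "snpen@..."
    base_name = filename.rsplit("/", 1)[-1]
    db = base_name.split("@", 1)[0]
    # Same suffix substitution as in the answer's loop
    doc_id = filename.replace(".pdf", ".clean.pdf")
    return ATTACH_URL.format(db=db, doc_id=doc_id)


# Made-up filename, only to show the shape of the result
print(pdf_url("./20220101/snciv@s50@a2022@n00001@tO.pdf"))
```

In the loop above this would be called as pdf_url(doc["filename"][0]).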
To "select" the menu items, edit this field.
For example, this will give you the first 10 files for UNITE -> 2017:
"hl.q": "nomatch AND szdec:\"U\" AND anno:\"2017\""
If you want to paginate the response, change the value of "start" to, for example, 10 to get the next 10 files:
"start": "10"
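The two tweaks above (choosing the section and year, and paging with "start") can be combined in one sketch. It assumes the payload keys from the example request and updates "q" together with "hl.q": in a Solr query "q" is the parameter that selects the documents, while "hl.q" only affects highlighting, so changing "hl.q" alone may not change the result list. build_payload, iter_docs and fetch_page are hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, Iterator, List


def build_payload(section: str, year: str, start: int = 0, rows: int = 10) -> Dict[str, str]:
    """Return the POST payload for one page of results of a section/year."""
    filters = f'szdec:"{section}" AND anno:"{year}"'
    return {
        "start": str(start),
        "rows": str(rows),
        # Assumption: "q" needs the same filter as "hl.q", as in the example request
        "q": f'((kind:"snciv" OR kind:"snpen")) AND {filters}',
        "wt": "json",
        "indent": "off",
        "sort": "pd desc,numdec desc",
        "fl": "id,filename,szdec,kind,ssz,tipoprov,numcard,numdec,numdep,"
              "datdep,ecli,anno,datdec,presidente,relatore,testoocr,ocr",
        "hl": "true",
        "hl.snippets": "4",
        "hl.fragsize": "100",
        "hl.fl": "ocr",
        "hl.q": f"nomatch AND {filters}",
        "hl.maxAnalyzedChars": "1000000",
        "hl.simple.pre": '<em class="hit">',
        "hl.simple.post": "</em>",
    }


def iter_docs(
    fetch_page: Callable[[Dict[str, str]], List[dict]],
    section: str,
    year: str,
    rows: int = 10,
    max_pages: int = 5,
) -> Iterator[dict]:
    """Yield result documents page by page, stopping at the first short page."""
    for page in range(max_pages):
        payload = build_payload(section, year, start=page * rows, rows=rows)
        docs = fetch_page(payload)
        yield from docs
        if len(docs) < rows:
            # Fewer documents than requested: this was the last page
            break
```

With the names from the example request, fetch_page could be something like lambda p: requests.post(query_url, headers=headers, data=p).json()["response"]["docs"], and iter_docs(fetch_page, "U", "2017") would then walk through the UNITE -> 2017 results ten at a time, up to max_pages pages.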
