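I am trying to get a list of the different URLs that appear in (part of) the HTML of this web page:
https://www.renfe.com/es/es/cercanias/cercanias-valencia/lineas
I have tried a few different approaches, but none of them really worked.
First attempt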
from bs4 import BeautifulSoup
import requests
import html
import urllib
import json
import re
url = 'https://www.renfe.com/es/es/cercanias/cercanias-valencia/lineas'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
links = soup.find_all('div', class_ = "rftabdetailline accordion aem-GridColumn aem-GridColumn--default--12")
links then contains the following:
[<div class="rftabdetailline accordion aem-GridColumn aem-GridColumn--default--12">
<!-- rf-tab-detail-line en resto de modos -->
<rf-tab-detail-line content='[{"color":"120,180,225","name":"C1","active":"true","stations":"València Nord \u2013 Gandía","url":"/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1.html"},{"color":"245,150,40","name":"C2","active":"false","stations":"València Nord \u2013 Xàtiva \u2013 Moixent","url":"/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014181985.html"},{"color":"125,37,130","name":"C3","active":"false","stations":"València Sant Isidre \u2013 Buñol \u2013 Utiel","url":"/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014184209.html"},{"color":"215,0,30","name":"C4","active":"false","stations":"València Sant Isidre \u2013 Xirivella L\u2019Alter","url":"/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014185974.html"},{"color":"0,139,41","name":"C5","active":"false","stations":"València Nord \u2013 Caudiel","url":"/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014187588.html"},{"color":"15,50,135","name":"C6","active":"false","stations":"València Nord \u2013 Castelló","url":"/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014189921.html"},{"color":"150,100,40","name":"ER02","active":"false","stations":"Castelló - Vinaròs","url":"/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1598612779629.html"}]' title-text="Seleccione una línea:">
</rf-tab-detail-line>
</div>]
In it you can see the part I want, for example "url":"/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1.html". I want to collect every distinct /content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_WHATEVER.html path in a list. I tried extracting them with regular expressions, but without success.
Second attempt
Following the steps shown in the answer to the question "Extractinf info form HTML that has no tags", I ended up with the following code:
import requests
import html
import json
url = 'https://www.renfe.com/content/renfe/es/es/cercanias/cercanias-valencia/lineas'
response = requests.get(url)
data = response.text # get data from site
raw_list = data.split("'")[8] # extract attributes
json_list = html.unescape(raw_list) # decode html symbols
parsed_list = json.loads(json_list) # parse json
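I thought it would work because the output it produces looks similar, but the line that defines parsed_list raises the following error:
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Does anyone have an idea? Thanks, everyone!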
A uj5u.com user replied:
Like this:
import html
import json
import re
import requests
url = 'https://www.renfe.com/content/renfe/es/es/cercanias/cercanias-valencia/lineas'
response = requests.get(url)
page_text = response.text # get data from site
# Capture everything inside the double-quoted content attribute
regex = r"<rf-tab-detail-line title-text=\"Seleccione una línea:\" content=\"([^\"]+)"
encoded_content = re.findall(regex, page_text)
if len(encoded_content) == 0:
    print("Nothing found, possibly page structure changed.")
    exit()
encoded_content = html.unescape(encoded_content[0])  # decode HTML entities
json_content = json.loads(encoded_content)  # parse the JSON list
for item in json_content:
    print(item["url"])
Output:
/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1.html
/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014181985.html
/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014184209.html
/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014185974.html
/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014187588.html
/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014189921.html
/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1598612779629.html
Hope this is what you need.
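As for the JSONDecodeError in the second attempt: "Expecting value: line 1 column 1 (char 0)" means the string handed to json.loads did not begin with a JSON value, so data.split("'")[8] presumably picked up some other piece of the page. The sketch below is one way to avoid counting quote characters. It runs offline on a shortened, made-up sample; SAMPLE and extract_urls are hypothetical names, and it assumes the raw response carries the JSON with &quot; entities inside a quoted content attribute, which the real page may not do in exactly this form.

```python
import html
import json
import re

# Hypothetical, shortened stand-in for the raw response text (assumption:
# the attribute is double-quoted and the JSON quotes arrive as &quot;).
SAMPLE = (
    '<div class="rftabdetailline">'
    '<rf-tab-detail-line title-text="Seleccione una línea:" '
    'content="[{&quot;name&quot;:&quot;C1&quot;,'
    '&quot;url&quot;:&quot;/lineas/item_1.html&quot;},'
    '{&quot;name&quot;:&quot;C2&quot;,'
    '&quot;url&quot;:&quot;/lineas/item_2.html&quot;}]">'
    '</rf-tab-detail-line></div>'
)


def extract_urls(page_text):
    """Return the "url" values from the first rf-tab-detail-line tag."""
    # Find the opening tag first, so the order of its attributes is irrelevant.
    # Limitation: this assumes no literal ">" occurs inside an attribute value.
    tag = re.search(r"<rf-tab-detail-line\b[^>]*>", page_text)
    if tag is None:
        return []
    # Accept either quote style around the attribute value.
    attr = re.search(r'\scontent=(["\'])(.*?)\1', tag.group(0), re.DOTALL)
    if attr is None:
        return []
    # Decode HTML entities, then parse the JSON list.
    items = json.loads(html.unescape(attr.group(2)))
    return [item["url"] for item in items]


print(extract_urls(SAMPLE))  # ['/lineas/item_1.html', '/lineas/item_2.html']

# The error from the question, reproduced on a fragment that is not JSON:
try:
    json.loads("<div class=")
except json.JSONDecodeError as exc:
    print(exc)  # Expecting value: line 1 column 1 (char 0)
```

Returning an empty list when nothing matches leaves it to the caller to decide how to react if the page structure changes.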
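A uj5u.com user replied:
I would instead use a CSS attribute = value selector to target the single element that holds the data, because it is more intuitive to read. Then you only need to extract the content attribute, parse it with the json library, and filter the key-value pairs for url.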
import json
import requests
from bs4 import BeautifulSoup
url = 'https://www.renfe.com/es/es/cercanias/cercanias-valencia/lineas'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
data = json.loads(soup.select_one('[title-text="Seleccione una línea:"]')['content'])
links = [i['url'] for i in data]
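The values in links are site-relative paths. If absolute URLs are needed, a minimal sketch along these lines could follow, using urljoin from the standard library. BASE_URL, to_absolute and the hard-coded sample list are illustrative assumptions, and the thread does not confirm that the resulting addresses can be fetched directly.

```python
from urllib.parse import urljoin

# Page the relative paths were scraped from (taken from the question).
BASE_URL = "https://www.renfe.com/es/es/cercanias/cercanias-valencia/lineas"

# Two of the relative paths from the output shown in the thread, hard-coded
# here so the sketch runs without a network request.
links = [
    "/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1.html",
    "/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014181985.html",
]


def to_absolute(base, paths):
    """Resolve each site-relative path against the base URL."""
    return [urljoin(base, path) for path in paths]


absolute_links = to_absolute(BASE_URL, links)
for link in absolute_links:
    print(link)
```

Because each path starts with "/", urljoin keeps only the scheme and host of the base URL and replaces its path.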
