Extracting information from HTML without tags

I am using both selenium and BeautifulSoup to do some web scraping. I managed to get as far as the following code on my own:

from selenium.webdriver import Chrome
from bs4 import BeautifulSoup

url = 'https://www.renfe.com/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014181985.html'
driver = Chrome()  # start a Chrome session before loading the page
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'lxml')

The resulting soup has the following structure:

<html>
<head>
</head>
<body>
<rf-list-detail line-color="245,150,40" line-number="C2" line-text="Línea C2" 
list="[{... ;direction&quot;:&quot;Place1&quot;}
,... , 
;direction&quot;:&quot;Place2&quot;}...

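Note that, for readability, both the text and the layout of the output have been modified. I attach an image of the actual output in case that is more convenient.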

Does anyone know how I can get every PlaceN in the list (in the image, Moixent would be Place1)? Something like

places = [Place1,...,PlaceN]

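I tried parsing it, but since it has no tags (or at least that is what my very limited HTML knowledge tells me), I got nothing. I also tried regular expressions, which I have only just discovered, but I don't know how to use them properly here.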

Any ideas?

Thanks in advance!

[Image: output of the soup]

Answer:

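This site responds with a non-HTML structure: the data sits in the list attribute of a single element, as JSON with HTML-escaped quotes. So for this task you don't need an HTML parser such as BeautifulSoup or lxml.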
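To make that concrete, here is a minimal sketch of what the attribute value looks like and how it decodes. The sample string is made up for illustration (only the direction key and the Place1/Place2 names come from the snippet in the question), and the real response likely has more keys per entry.

```python
import html
import json

# Hypothetical sample imitating the escaped value of the `list` attribute;
# it is not the real response from the site.
sample = (
    "[{&quot;direction&quot;:&quot;Place1&quot;},"
    "{&quot;direction&quot;:&quot;Place2&quot;}]"
)

decoded = html.unescape(sample)  # turns &quot; back into "
items = json.loads(decoded)      # the decoded text is plain JSON: a list of dicts
places = [item["direction"] for item in items]
print(places)
```

Once the entities are decoded, the value is ordinary JSON, which is why the standard library is enough here.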

Here is an example using the requests library. You can install it like this:

pip install requests

import requests
import html
import json

url = 'https://www.renfe.com/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014181985.html'
response = requests.get(url)
data = response.text  # get data from site

raw_list = data.split("'")[1]  # extract rf-list-detail.list attribute
json_list = html.unescape(raw_list)  # decode html symbols
parsed_list = json.loads(json_list)  # parse json 

print(parsed_list)  # printing result

# collect the "direction" value of every entry
directions = [item["direction"] for item in parsed_list]
print(directions)

# ['Moixent', 'Vallada', 'Montesa', "L'Alcudia de Crespins", 'Xàtiva', "L'Enova-Manuel", 'La Pobla Llarga', 'Carcaixent', 'Alzira', 'Algemesí', 'Benifaió-Almussafes', 'Silla', 'Catarroja', 'Massanassa', 'Alfafar-Benetússer', 'València Nord']
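If splitting on quote characters feels fragile, one possible alternative (not part of the answer above, just a sketch) is to read the attribute with the standard library's html.parser, which unescapes attribute values for you. The ListAttrParser class, the extract_directions helper and the sample page below are hypothetical names and data chosen for illustration; the sketch assumes the response contains an rf-list-detail element with a list attribute, as shown in the question.

```python
import json
from html.parser import HTMLParser


class ListAttrParser(HTMLParser):
    """Stores the `list` attribute of the first <rf-list-detail> tag seen."""

    def __init__(self):
        super().__init__()
        self.raw_list = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names and unescapes attribute values.
        if tag == "rf-list-detail" and self.raw_list is None:
            self.raw_list = dict(attrs).get("list")


def extract_directions(page_text):
    """Return the `direction` of every entry, or [] if nothing is found."""
    parser = ListAttrParser()
    parser.feed(page_text)
    if parser.raw_list is None:
        return []
    return [item["direction"] for item in json.loads(parser.raw_list)]


# Hypothetical sample imitating the structure shown in the question.
sample_page = (
    '<rf-list-detail line-number="C2" '
    "list='[{&quot;direction&quot;:&quot;Place1&quot;},"
    "{&quot;direction&quot;:&quot;Place2&quot;}]'>"
    "</rf-list-detail>"
)
print(extract_directions(sample_page))
```

With the real site you would pass response.text from the requests example to extract_directions instead of the sample string.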
