| 4G bands (A) | 4G bands (B) | 5G bands (A) | 5G bands (B) | 5G bands (C) |
|---|---|---|---|---|
| 1, 2, 3 - A2643 | 1, 2, 3 - A2484 | 1, 2, 3 - A2643 | 1, 2, 3 - A2484 | 1, 2, 3 - A2641 |
How can I get the output above from the following HTML table?
<table cellspacing="0">
<tr class="tr-hover">
<th rowspan="15" scope="row">Network</th>
<td class="ttl"><a href="network-bands.php3">Technology</a></td>
<td class="nfo"><a href="#" class="link-network-detail" data-spec="nettech">GSM / CDMA / HSPA / EVDO / LTE / 5G</a></td>
</tr>
<tr class="tr-toggle">
<td class="ttl"><a href="network-bands.php3">2G bands</a></td>
<td class="nfo" data-spec="net2g">GSM 850 / 900 / 1800 / 1900 - SIM 1 & SIM 2 (dual-SIM)</td>
</tr>
<tr class="tr-toggle" data-spec-optional>
<td class="ttl"> </td>
<td class="nfo">CDMA 800 / 1900 </td>
</tr>
<tr class="tr-toggle">
<td class="ttl"><a href="network-bands.php3">3G bands</a></td>
<td class="nfo" data-spec="net3g">HSDPA 850 / 900 / 1700(AWS) / 1900 / 2100 </td>
</tr>
<tr class="tr-toggle" data-spec-optional>
<td class="ttl"> </td>
<td class="nfo">CDMA2000 1xEV-DO </td>
</tr>
<tr class="tr-toggle">
<td class="ttl"><a href="network-bands.php3">4G bands</a></td>
<td class="nfo" data-spec="net4g">1, 2, 3, 4, 5, 7, 8, 12, 13, 17, 18, 19, 20, 25, 26, 28, 30, 32, 34, 38, 39, 40, 41, 42, 46, 48, 66 - A2643, A2644, A2645</td>
</tr>
<tr class="tr-toggle" data-spec-optional>
<td class="ttl"> </td>
<td class="nfo">1, 2, 3, 4, 5, 7, 8, 11, 12, 13, 14, 17, 18, 19, 20, 21, 25, 26, 28, 29, 30, 32, 34, 38, 39, 40, 41, 42, 46, 48, 66, 71 - A2484, A2641</td>
</tr>
<tr class="tr-toggle">
<td class="ttl"><a href="network-bands.php3">5G bands</a></td>
<td class="nfo" data-spec="net5g">1, 2, 3, 5, 7, 8, 12, 20, 25, 28, 30, 38, 40, 41, 48, 66, 77, 78, 79 SA/NSA/Sub6 - A2643, A2644</td>
</tr>
<tr class="tr-toggle" data-spec-optional>
<td class="ttl"> </td>
<td class="nfo">1, 2, 3, 5, 7, 8, 12, 20, 25, 28, 29, 30, 38, 40, 41, 48, 66, 71, 78, 79, 258, 260, 261 SA/NSA/Sub6/mmWave - A2484</td>
</tr>
<tr class="tr-toggle" data-spec-optional>
<td class="ttl"> </td>
<td class="nfo">1, 2, 3, 5, 7, 8, 12, 20, 25, 28, 29, 30, 38, 40, 41, 48, 66, 71, 77, 78, 79 SA/NSA/Sub6 - A2641</td>
</tr>
<tr class="tr-toggle">
<td class="ttl"><a href="glossary.php3?term=3g">Speed</a></td>
<td class="nfo" data-spec="speed">HSPA 42.2/5.76 Mbps, LTE-A, 5G, EV-DO Rev.A 3.1 Mbps</td>
</tr>
</table>
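My problem is rows like this one: <tr data-spec-optional>. They are not tied to any hierarchy.
I wrote a spider that successfully crawls the whole structure of the online portal and collects the information I need. Only these optional rows in the table cause me trouble.
This approach gave me hope, but I could not get it to work: How to select and extract text between two elements?
Any help would be much appreciated!
EDIT
I have to use Items. When I use @SuperUser's solution, I get the following error:
TypeError: 'str' object does not support item assignment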
A uj5u.com user replied:
To be honest, I'm sure this is not the best way to do it, but it works. If I think of another way, I will edit the answer.
In [1]: html = """<html>
   ...: <body>
   ...: <table cellspacing="0">
   ...:
   ...: <tr class="tr-toggle">
   ...: <td class="ttl"><a href="network-bands.php3">4G bands</a></td>
   ...: <td class="nfo" data-spec="net4g">1, 2, 3 - A2643</td>
   ...: </tr>
   ...:
   ...: <tr class="tr-toggle" data-spec-optional>
   ...: <td class="ttl"> </td>
   ...: <td class="nfo">1, 2, 3 - A2484</td>
   ...: </tr>
   ...:
   ...: <tr class="tr-toggle">
   ...: <td class="ttl"><a href="network-bands.php3">5G bands</a></td>
   ...: <td class="nfo" data-spec="net5g">1, 2, 3 - A2643</td>
   ...: </tr>
   ...:
   ...: <tr class="tr-toggle" data-spec-optional>
   ...: <td class="ttl"> </td>
   ...: <td class="nfo">1, 2, 3 - A2484</td>
   ...: </tr>
   ...:
   ...: <tr class="tr-toggle" data-spec-optional>
   ...: <td class="ttl"> </td>
   ...: <td class="nfo">1, 2, 3 - A2641</td>
   ...: </tr>
   ...:
   ...: </table>
   ...: </body>
   ...: </html>"""
In [2]: from scrapy import Selector
In [3]: response = Selector(text=html)
In [4]: for row in response.xpath('//tr[@class="tr-toggle"]'):
   ...:     if row.xpath('.//a'):
   ...:         # A row with a link starts a new group, labelled (A).
   ...:         ch = 'A'
   ...:         title1 = row.xpath('.//a/text()').get()
   ...:     else:
   ...:         # An optional row continues the group: move to the next letter.
   ...:         ch = chr(ord(ch) + 1)
   ...:     title = title1 + f' ({ch})'
   ...:     data = row.xpath('.//td[@class="nfo"]/text()').get()
   ...:     print(f'"{title}" : "{data}"')
   ...:
"4G bands (A)" : "1, 2, 3 - A2643"
"4G bands (B)" : "1, 2, 3 - A2484"
"5G bands (A)" : "1, 2, 3 - A2643"
"5G bands (B)" : "1, 2, 3 - A2484"
"5G bands (C)" : "1, 2, 3 - A2641"
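A uj5u.com user replied:
This answer is inspired by @SuperUser's answer, which uses ordinal values to advance the letter. You can use XPath to get the titles and the values separately, then combine them into the final dictionary.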
import re

# `response` is the Scrapy response (or a Selector built from the HTML).
# Linked title cells yield the link text; optional rows yield a blank string.
titles = response.xpath("//tr[@class='tr-toggle']/td[@class='ttl']/descendant-or-self::*/text()").getall()
for i, title in enumerate(titles):
    if title.strip() == '':
        # Blank title: reuse the previous title with the next letter.
        prev_letter = re.search(r"\((.)\)$", titles[i-1]).group(1)
        new_title = f"{titles[i-1][:-4]} ({chr(ord(prev_letter) + 1)})"
        titles[i] = new_title
    else:
        titles[i] = f"{title} (A)"

values = response.xpath("//tr[@class='tr-toggle']/td[@class='nfo']/text()").getall()
results = {}
for title, value in zip(titles, values):
    results[title] = value
Printing the results dictionary gives the following:
{'4G bands (A)': '1, 2, 3 - A2643',
'4G bands (B)': '1, 2, 3 - A2484',
'5G bands (A)': '1, 2, 3 - A2643',
'5G bands (B)': '1, 2, 3 - A2484',
'5G bands (C)': '1, 2, 3 - A2641'}
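The carry-forward labelling can also be separated from the XPath calls and applied row by row, so that it does not depend on two separately extracted lists staying aligned. The following is a minimal sketch, assuming each table row has already been reduced to a (title, value) tuple; `label_rows` is a hypothetical helper name, not part of Scrapy, and the sample rows simply mirror the reduced table used above.

```python
from string import ascii_uppercase


def label_rows(pairs):
    """Label rows "<title> (A)", "<title> (B)", ... in document order.

    `pairs` is an iterable of (title, value) tuples, one per table row.
    A blank title (None, "" or whitespace, including a non-breaking space)
    marks an optional row that belongs to the most recent titled row.
    Hypothetical helper; supports at most 26 rows per title.
    """
    labelled = {}
    current_title = None
    index = 0
    for title, value in pairs:
        title = (title or "").strip()
        if title:
            # A titled row starts a new group at letter A.
            current_title = title
            index = 0
        elif current_title is None:
            # Optional row before any titled row: nothing to attach it to.
            continue
        else:
            # Optional row: same title, next letter.
            index += 1
        key = f"{current_title} ({ascii_uppercase[index]})"
        labelled[key] = (value or "").strip()
    return labelled


rows = [
    ("4G bands", "1, 2, 3 - A2643"),
    ("\xa0", "1, 2, 3 - A2484"),
    ("5G bands", "1, 2, 3 - A2643"),
    (" ", "1, 2, 3 - A2484"),
    (None, "1, 2, 3 - A2641"),
]
result = label_rows(rows)
print(result)
```

In a spider, the pairs could be built per row, for example from `row.xpath('./td[@class="ttl"]//text()').get()` and `row.xpath('./td[@class="nfo"]/text()').get()` for each `row` in `response.xpath('//tr[@class="tr-toggle"]')`. Because the result is a plain dict, assigning keys to it works, whereas assigning a key on a string raises the `TypeError` quoted in the question's edit. Whether that is the actual cause there cannot be determined from the question.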
