Iterating over a dictionary of HTML pages to scrape content (a td and its adjacent elements) from each page

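I need to iterate over every HTML page in the given data dictionary and scrape the td element that contains "Ένδικα Μέσα" together with the content of its adjacent cell. Thank you.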

This is the code I am working with:

from bs4 import BeautifulSoup
import requests

URL = 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html'

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36", 
    "X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae", 
    "X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB" }
page = requests.get(URL, headers = headers)
soup = BeautifulSoup(page.content,'html.parser')

baseUrl = 'https://www.epant.gr'

data = {}

for href in [x['href'] for x in soup.select('a[href*=category]:has(span)')]:
    page = requests.get(f'{baseUrl}{href}', headers = headers)
    soup = BeautifulSoup(page.content,'html.parser')
    data[href.split('-')[-1].split('.')[0]] = {
        'url': f'{baseUrl}{href}'
    }
    data[href.split('-')[-1].split('.')[0]]['cases'] = [f'{baseUrl}{x["href"]}' for x in soup.select('h3 a')]
    
# Search every case HTML page for "Ένδικα Μέσα" content

from bs4 import BeautifulSoup
import requests
import re

for url2 in data :
    headers1 = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36", 
        "X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae", 
        "X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB" }
    page = requests.get(url2, headers = headers1)
    soup = BeautifulSoup(page.content,"html.parser")
    if soup.find('td', text = "Ένδικα Μέσα").parent.get_text(strip=True) is TRUE :
        reqs = requests.get(url2)
        soup2 = BeautifulSoup(reqs.text, 'html.parser')
        print(url2.get('href'))
        row = soup.find('td', text = "Ένδικα Μέσα").parent.get_text(strip=True)
        print(row)

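PS: If my post needs editing or formatting, please let me know. Thank you.

Edit: When I ran the code you (HedgeHog) provided, I got an SSL exception.

I searched for a solution and came across this question.

 proxy = 'http://78.130.136.2:8080'

With it, my code works perfectly. Thank you!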
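
The edit above does not show how the proxy was passed to `requests`. One common way, sketched here as an assumption rather than the asker's actual code, is to set a session's `proxies` mapping (or pass `proxies=` to `requests.get`). The helper name `make_session` is hypothetical, and the proxy address is simply the one quoted above, which may no longer be reachable.

```python
import requests

# Proxy address quoted in the post; it may no longer be reachable.
proxy = 'http://78.130.136.2:8080'


def make_session(proxy_url):
    """Return a requests.Session that routes HTTP and HTTPS traffic via proxy_url."""
    session = requests.Session()
    session.proxies.update({'http': proxy_url, 'https': proxy_url})
    return session


session = make_session(proxy)
# Later requests made with this session would go through the proxy, for example:
# page = session.get('https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html')
```

Using a session keeps the proxy setting in one place, so every later `session.get(...)` call picks it up without repeating the argument.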

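Answer from a uj5u.com user:

OK, now I have a rough idea of what you are trying to do. If you only want to get some information from the cases, you don't need a dict; you can generate all the information on the fly.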
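
If the `data` dict from the question is kept anyway, note that iterating over it directly yields its keys (the year strings), not URLs; the case links sit in each entry's `'cases'` list. A minimal sketch, using a small hypothetical stand-in for `data` (the helper name `iter_case_urls` is made up for illustration, and the two case URLs are taken from the output in this thread):

```python
# Hypothetical stand-in for the dict built in the question:
# keys are year strings, each value holds the category URL and the case URLs.
data = {
    '2021': {
        'url': 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html',
        'cases': [
            'https://www.epant.gr/apofaseis-gnomodotiseis/item/1578-apofasi-749-2021.html',
            'https://www.epant.gr/apofaseis-gnomodotiseis/item/1633-apofasi-743-2021.html',
        ],
    },
}


def iter_case_urls(data):
    """Yield (year, case_url) pairs from the nested dict."""
    for year, info in data.items():
        for case_url in info['cases']:
            yield year, case_url


pairs = list(iter_case_urls(data))
for year, case_url in pairs:
    print(year, case_url)
```

Each `case_url` produced this way is a full URL string that can be passed to `requests.get`, which is what the second loop in the question needs.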

Example

from bs4 import BeautifulSoup
import requests

URL = 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html'

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36", 
    "X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae", 
    "X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB" }
page = requests.get(URL, headers = headers)
soup = BeautifulSoup(page.content,'html.parser')

baseUrl = 'https://www.epant.gr'

for href in [x['href'] for x in soup.select('a[href*=category]:has(span)')]:
    page = requests.get(f'{baseUrl}{href}', headers = headers)
    soup = BeautifulSoup(page.content,'html.parser')

    urls = [f'{baseUrl}{x["href"]}' for x in soup.select('h3 a')]

    for url in urls :
        page = requests.get(url, headers = headers)
        soup = BeautifulSoup(page.content,'html.parser')
        row = soup.find('td', text = "Ένδικα Μέσα").parent.get_text(strip=True) if soup.find('td', text = "Ένδικα Μέσα") else None
        case = soup.find('h2').text.strip()
        year = case.split('/')[-1]
        print(f'{year},{case},{row},{url}')

Output

2021,Απόφαση 749/2021,Ένδικα Μέσα-,https://www.epant.gr/apofaseis-gnomodotiseis/item/1578-apofasi-749-2021.html
2021,Απόφαση 743/2021,Ένδικα Μέσα-,https://www.epant.gr/apofaseis-gnomodotiseis/item/1633-apofasi-743-2021.html
2021,Απόφαση 738/2021,Ένδικα Μέσα-,https://www.epant.gr/apofaseis-gnomodotiseis/item/1575-apofasi-738-2021.html
2021,Απόφαση 737/2021,Ένδικα Μέσα-,https://www.epant.gr/apofaseis-gnomodotiseis/item/1624-apofasi-737-2021.html
2021,Απόφαση 735/2021,Ένδικα Μέσα-,https://www.epant.gr/apofaseis-gnomodotiseis/item/1510-apofasi-735-2021.html
2021,Απόφαση 733/2021,Ένδικα Μέσα-,https://www.epant.gr/apofaseis-gnomodotiseis/item/1595-apofasi-733-2021.html
2021,Απόφαση 732/2021,Ένδικα ΜέσαΟριστική απόφαση. Δεν έχουν ασκηθεί ένδικα μέσα.,https://www.epant.gr/apofaseis-gnomodotiseis/item/1600-apofasi-732-2021.html
...
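
The `row` value above joins the label and its neighbouring cell into one string. To get the adjacent cell on its own, `find_next_sibling('td')` can be used. The sketch below runs on a small inline HTML snippet and assumes the case pages put the label and its value in two cells of the same table row; the real markup may differ, and the helper name `adjacent_cell_text` is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for a case page.
html = """
<table>
  <tr><td>Ένδικα Μέσα</td><td>Οριστική απόφαση. Δεν έχουν ασκηθεί ένδικα μέσα.</td></tr>
</table>
"""

label = 'Ένδικα Μέσα'


def adjacent_cell_text(soup, label):
    """Return the text of the td following the td whose text equals label, or None."""
    cell = soup.find('td', string=label)
    if cell is None:
        return None
    neighbour = cell.find_next_sibling('td')
    return neighbour.get_text(strip=True) if neighbour else None


soup = BeautifulSoup(html, 'html.parser')
result = adjacent_cell_text(soup, label)
print(result)
```

Returning `None` when the label or its neighbour is missing mirrors the conditional in the example above, so pages without the row do not raise an `AttributeError`.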


Tags: Python, html, dictionary, web-scraping, beautifulsoup
