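I need to loop over every HTML page in the given data dictionary and scrape the td element that contains "Ένδικα Μέσα", together with the content of its adjacent cell. Thank you.
This is the code I am working on: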
from bs4 import BeautifulSoup
import requests
URL = 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae",
"X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB" }
page = requests.get(URL, headers = headers)
soup = BeautifulSoup(page.content,'html.parser')
baseUrl = 'https://www.epant.gr'
data = {}
for href in [x['href'] for x in soup.select('a[href*=category]:has(span)')]:
    page = requests.get(f'{baseUrl}{href}', headers = headers)
    soup = BeautifulSoup(page.content,'html.parser')
    data[href.split('-')[-1].split('.')[0]] = {
        'url': f'{baseUrl}{href}'
    }
    data[href.split('-')[-1].split('.')[0]]['cases'] = [f'{baseUrl}{x["href"]}' for x in soup.select('h3 a')]
# Search every case HTML page for the "Ένδικα Μέσα" row
for year in data:
    # data maps each year to {'url': ..., 'cases': [...]}, so iterate the case URLs, not the dict keys
    for url2 in data[year]['cases']:
        page = requests.get(url2, headers = headers)
        soup = BeautifulSoup(page.content, "html.parser")
        cell = soup.find('td', text = "Ένδικα Μέσα")
        # find() returns None when the page has no such cell
        if cell is not None:
            print(url2)
            row = cell.parent.get_text(strip=True)
            print(row)
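PS: If my post needs editing or formatting, please let me know. Thank you.
Edit: When I ran the code you (HedgeHog) provided, I got an SSL exception error.
I searched for a solution and came across this question.
proxy = 'http://78.130.136.2:8080'
With it, my code runs perfectly. Thanks!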
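The edit above does not show how the proxy value was applied. A minimal sketch, assuming it was handed to requests through the proxies mapping of a Session; make_session is a hypothetical helper name, and the proxy address is simply the one quoted in the post, which may no longer be reachable:

```python
import requests

# Proxy address quoted in the post; it may no longer be reachable.
PROXY = 'http://78.130.136.2:8080'

def make_session(proxy, headers=None):
    """Build a requests.Session that routes HTTP and HTTPS traffic through proxy."""
    session = requests.Session()
    session.proxies.update({'http': proxy, 'https': proxy})
    if headers:
        session.headers.update(headers)
    return session

session = make_session(PROXY, headers={'User-Agent': 'Mozilla/5.0'})
# Every request made through the session now uses the proxy, for example:
# page = session.get(URL, timeout=30)  # URL as defined in the code above
```

Using a Session keeps the proxy and headers in one place, so each requests.get call in the loops can become session.get without repeating the arguments.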
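Reply from a uj5u.com user:
OK, now I have a clearer idea of what you are trying to do. If you only want to extract some information from the cases, you do not need a dict. You can generate all the information on the fly.
Example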
from bs4 import BeautifulSoup
import requests
URL = 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae",
"X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB" }
page = requests.get(URL, headers = headers)
soup = BeautifulSoup(page.content,'html.parser')
baseUrl = 'https://www.epant.gr'
for href in [x['href'] for x in soup.select('a[href*=category]:has(span)')]:
    page = requests.get(f'{baseUrl}{href}', headers = headers)
    soup = BeautifulSoup(page.content,'html.parser')
    urls = [f'{baseUrl}{x["href"]}' for x in soup.select('h3 a')]
    for url in urls :
        page = requests.get(url, headers = headers)
        soup = BeautifulSoup(page.content,'html.parser')
        row = soup.find('td', text = "Ένδικα Μέσα").parent.get_text(strip=True) if soup.find('td', text = "Ένδικα Μέσα") else None
        case = soup.find('h2').text.strip()
        year = case.split('/')[-1]
        print(f'{year},{case},{row},{url}')
Output
2021,Απόφαση 749/2021,Ένδικα Μέσα-,https://www.epant.gr/apofaseis-gnomodotiseis/item/1578-apofasi-749-2021.html
2021,Απόφαση 743/2021,Ένδικα Μέσα-,https://www.epant.gr/apofaseis-gnomodotiseis/item/1633-apofasi-743-2021.html
2021,Απόφαση 738/2021,Ένδικα Μέσα-,https://www.epant.gr/apofaseis-gnomodotiseis/item/1575-apofasi-738-2021.html
2021,Απόφαση 737/2021,Ένδικα Μέσα-,https://www.epant.gr/apofaseis-gnomodotiseis/item/1624-apofasi-737-2021.html
2021,Απόφαση 735/2021,Ένδικα Μέσα-,https://www.epant.gr/apofaseis-gnomodotiseis/item/1510-apofasi-735-2021.html
2021,Απόφαση 733/2021,Ένδικα Μέσα-,https://www.epant.gr/apofaseis-gnomodotiseis/item/1595-apofasi-733-2021.html
2021,Απόφαση 732/2021,Ένδικα ΜέσαΟριστική απόφαση. Δεν έχουν ασκηθεί ένδικα μέσα.,https://www.epant.gr/apofaseis-gnomodotiseis/item/1600-apofasi-732-2021.html
...
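In the rows above, the label and the neighbouring cell run together, because parent.get_text(strip=True) joins every cell of the row without a separator. A minimal sketch of reading only the adjacent cell, assuming the label and its value sit in sibling td elements of the same row. SAMPLE_HTML and adjacent_cell_text are hypothetical names used for illustration, and the real page markup may differ:

```python
from bs4 import BeautifulSoup

LABEL = "Ένδικα Μέσα"

# Hypothetical sample markup; the real case pages may be structured differently.
SAMPLE_HTML = """
<table>
  <tr><td>Ένδικα Μέσα</td><td>Οριστική απόφαση.</td></tr>
</table>
"""

def adjacent_cell_text(soup, label):
    """Return the text of the td right after the td whose text equals label, or None."""
    # Compare on stripped text so surrounding whitespace in the cell does not break the match.
    cell = soup.find(lambda tag: tag.name == 'td' and tag.get_text(strip=True) == label)
    if cell is None:
        return None
    neighbour = cell.find_next_sibling('td')
    return neighbour.get_text(strip=True) if neighbour else None

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
print(adjacent_cell_text(soup, LABEL))
```

In the loop from the example, row could then be assigned from adjacent_cell_text(soup, "Ένδικα Μέσα") instead of the whole row text, which also avoids calling find twice.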
