我正在嘗試創建一個包含來自網站的所有唯一年份鏈接的串列(見下文)。
當我執行 append 函式時,它給了我一個包含重復條目的巨大串列。
我需要獲得一個僅包含唯一年份鏈接的串列。
網站:https : //www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html
到目前為止撰寫的代碼:
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd
import re
URL = 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html'
headers1 = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae",
"X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB" }
page = requests.get(URL, headers = headers1)
soup = BeautifulSoup(page.content,"html.parser")
year = []
for link in soup.find_all('a', href=lambda href: href and "category" in href):
print(link.get('href'))
#year.append(link.get('href'))
#print(year)
所需的結果如下所示(但我需要以串列格式顯示):
https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html
/apofaseis-gnomodotiseis/itemlist/category/83-2022.html
/apofaseis-gnomodotiseis/itemlist/category/78-2021.html
/apofaseis-gnomodotiseis/itemlist/category/71-2020.html
/apofaseis-gnomodotiseis/itemlist/category/4-2019.html
/apofaseis-gnomodotiseis/itemlist/category/5-2018.html
/apofaseis-gnomodotiseis/itemlist/category/6-2017.html
/apofaseis-gnomodotiseis/itemlist/category/7-2016.html
/apofaseis-gnomodotiseis/itemlist/category/8-2015.html
/apofaseis-gnomodotiseis/itemlist/category/9-2014.html
/apofaseis-gnomodotiseis/itemlist/category/10-2013.html
/apofaseis-gnomodotiseis/itemlist/category/11-2012.html
/apofaseis-gnomodotiseis/itemlist/category/12-2011.html
/apofaseis-gnomodotiseis/itemlist/category/13-2010.html
/apofaseis-gnomodotiseis/itemlist/category/18-2009.html
/apofaseis-gnomodotiseis/itemlist/category/19-2008.html
/apofaseis-gnomodotiseis/itemlist/category/20-2007.html
/apofaseis-gnomodotiseis/itemlist/category/21-2006.html
/apofaseis-gnomodotiseis/itemlist/category/22-2005.html
/apofaseis-gnomodotiseis/itemlist/category/23-2004.html
/apofaseis-gnomodotiseis/itemlist/category/24-2003.html
/apofaseis-gnomodotiseis/itemlist/category/25-2002.html
/apofaseis-gnomodotiseis/itemlist/category/26-2001.html
/apofaseis-gnomodotiseis/itemlist/category/27-2000.html
/apofaseis-gnomodotiseis/itemlist/category/44-1999.html
/apofaseis-gnomodotiseis/itemlist/category/45-1998.html
/apofaseis-gnomodotiseis/itemlist/category/48-1997.html
/apofaseis-gnomodotiseis/itemlist/category/47-1996.html
/apofaseis-gnomodotiseis/itemlist/category/46-1995.html
/apofaseis-gnomodotiseis/itemlist/category/49-1994.html
編輯:我正在嘗試為年度串列中的每一年創建一個案例串列:代碼:
# 1) Created an year list (year = [])
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd
import re
total_cases = []
#Url to scrape
URL = 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html'
headers1 = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae",
"X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB" }
page = requests.get(URL, headers = headers1)
soup = BeautifulSoup(page.content,"html.parser")
year = []
for link in soup.find_all('a', href=lambda href: href and "category" in href):
if link.get('href') not in year:
year.append(link.get('href'))
print(year)
# 2) Created a case list
case = []
for link in soup.find_all('a', href=lambda href: href and "apofasi" in href):
if link.get('href') not in case :
case.append(link.get('href'))
print(case)
#Trying to create a case list for every year in year list
# A)Get every year link in year list
for year_link in year :
headers1 = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae",
"X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB" }
page = requests.get(year_link, headers = headers1)
soup2 = BeautifulSoup(page.content,"html.parser")
print(year)
# B)Get every case link for every case in a fixed year
for case_link in case :
total_cases.append(case_link)
#Get case link for every case for every year_link (element of year[])
???
編輯 2:
當我嘗試運行您 (HedgeHog) 所以 kinldy 發布的代碼時,它給了我這個錯誤:
--------------------------------------------------------------------------
FeatureNotFound Traceback (most recent call last)
C:\Users\ARISTE~1\AppData\Local\Temp/ipykernel_13944/1621925083.py in <module>
8 "X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB" }
9 page = requests.get(URL, headers = headers)
---> 10 soup = BeautifulSoup(page.content,'lxml')
11
12 baseUrl = 'https://www.epant.gr'
~\Documents\conda\envs\conda\lib\site-packages\bs4\__init__.py in __init__(self, markup, features, builder, parse_only, from_encoding, exclude_encodings, element_classes, **kwargs)
243 builder_class = builder_registry.lookup(*features)
244 if builder_class is None:
--> 245 raise FeatureNotFound(
246 "Couldn't find a tree builder with the features you "
247 "requested: %s. Do you need to install a parser library?"
FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
有任何想法嗎?謝謝!
uj5u.com熱心網友回復:
編輯
根據您的問題編輯,我建議使用 adict而不是所有這些串列 - 以下示例將創建一個以年份為鍵的資料字典,它有自己的 url 和案例 url 串列。
例子
from bs4 import BeautifulSoup
import requests
URL = 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae",
"X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB" }
page = requests.get(URL, headers = headers)
soup = BeautifulSoup(page.content,'html.parser')
baseUrl = 'https://www.epant.gr'
data = {}
for href in [x['href'] for x in soup.select('a[href*=category]:has(span)')]:
page = requests.get(f'{baseUrl}{href}', headers = headers)
soup = BeautifulSoup(page.content,'html.parser')
data[href.split('-')[-1].split('.')[0]] = {
'url': f'{baseUrl}{href}'
}
data[href.split('-')[-1].split('.')[0]]['cases'] = [f'{baseUrl}{x["href"]}' for x in soup.select('h3 a')]
data
輸出
{'2022': {'url': 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/83-2022.html',
'cases': []},
'2021': {'url': 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html',
'cases': ['https://www.epant.gr/apofaseis-gnomodotiseis/item/1578-apofasi-749-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1633-apofasi-743-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1575-apofasi-738-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1624-apofasi-737-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1510-apofasi-735-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1595-apofasi-733-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1600-apofasi-732-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1451-apofasi-730-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1508-apofasi-728-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1584-apofasi-727-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1586-apofasi-726-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1583-apofasi-725-2021.html']},...}
怎么修?
只需檢查鏈接是否不在您的鏈接串列中 - 因此True將其附加到您的串列中:
if link.get('href') not in year:
year.append(link.get('href'))
筆記
所需的結果如下所示(但我需要以串列格式顯示)
這不是資料結構意義上的串列,它是串列中每個元素的列印版本。
另類
例子
from bs4 import BeautifulSoup
import requests
URL = 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html'
headers1 = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae",
"X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB" }
page = requests.get(URL, headers = headers1)
soup = BeautifulSoup(page.content,"html.parser")
year = []
for link in soup.find_all('a', href=lambda href: href and "category" in href):
if link.get('href') not in year:
year.append(link.get('href'))
print(year)
輸出
['https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html', '/apofaseis-gnomodotiseis/itemlist/category/83-2022.html', '/apofaseis-gnomodotiseis/itemlist/category/78-2021.html', '/apofaseis-gnomodotiseis/itemlist/category/71-2020.html', '/apofaseis-gnomodotiseis/itemlist/category/4-2019.html', '/apofaseis-gnomodotiseis/itemlist/category/5-2018.html', '/apofaseis-gnomodotiseis/itemlist/category/6-2017.html', '/apofaseis-gnomodotiseis/itemlist/category/7-2016.html', '/apofaseis-gnomodotiseis/itemlist/category/8-2015.html', '/apofaseis-gnomodotiseis/itemlist/category/9-2014.html', '/apofaseis-gnomodotiseis/itemlist/category/10-2013.html', '/apofaseis-gnomodotiseis/itemlist/category/11-2012.html', '/apofaseis-gnomodotiseis/itemlist/category/12-2011.html', '/apofaseis-gnomodotiseis/itemlist/category/13-2010.html', '/apofaseis-gnomodotiseis/itemlist/category/18-2009.html', '/apofaseis-gnomodotiseis/itemlist/category/19-2008.html', '/apofaseis-gnomodotiseis/itemlist/category/20-2007.html', '/apofaseis-gnomodotiseis/itemlist/category/21-2006.html', '/apofaseis-gnomodotiseis/itemlist/category/22-2005.html', '/apofaseis-gnomodotiseis/itemlist/category/23-2004.html', '/apofaseis-gnomodotiseis/itemlist/category/24-2003.html', '/apofaseis-gnomodotiseis/itemlist/category/25-2002.html', '/apofaseis-gnomodotiseis/itemlist/category/26-2001.html', '/apofaseis-gnomodotiseis/itemlist/category/27-2000.html', '/apofaseis-gnomodotiseis/itemlist/category/44-1999.html', '/apofaseis-gnomodotiseis/itemlist/category/45-1998.html', '/apofaseis-gnomodotiseis/itemlist/category/48-1997.html', '/apofaseis-gnomodotiseis/itemlist/category/47-1996.html', '/apofaseis-gnomodotiseis/itemlist/category/46-1995.html', '/apofaseis-gnomodotiseis/itemlist/category/49-1994.html']
uj5u.com熱心網友回復:
使用集合作為 HREF 的中間存盤,然后稍后轉換為串列。
from bs4 import BeautifulSoup
import requests
URL = 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html'
headers1 = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae",
"X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB"}
page = requests.get(URL, headers=headers1)
soup = BeautifulSoup(page.content, "lxml")
year = set()
for link in soup.find_all('a', href=lambda href: href and "category" in href):
year.add(link.get('href'))
print(list(year))
轉載請註明出處,本文鏈接:https://www.uj5u.com/gongcheng/375927.html
