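I am scraping a Prestashop site and want to get a list of the URLs of all images for a product. However, I get duplicate values (every link appears twice). I tried creating a dictionary to remove the duplicates, but it doesn't seem to work. I also can't strip the span tag from the reference number (unwrap doesn't work): it keeps failing with a "None" attribute error, which is confusing because every product has a reference number. I tried converting the result to a string, but that isn't allowed either.
Here is the code: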
import requests
from bs4 import BeautifulSoup

testlink = 'https://trgovina.audiopro.si/si/bas-glave/36037-81020104.html'
r = requests.get(testlink)
soup = BeautifulSoup(r.content, 'html.parser')
imagelinks = []
name = soup.find('h1', class_='product_name').text.strip()
reference = soup.find('div', class_='product-reference_top product-reference')
reference_number = reference.find('span')  # a Tag object, still wrapped in <span>
images = soup.find_all('li', class_='thumb-container')
for item in images:
    image = item.find('img').attrs['src']
    imagelinks.append(image)  # every URL ends up in the list twice
print(imagelinks)
Answer:
Use .text to get the number without the <span> tag:
reference_number = reference.find('span').text
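If find() does not match anything it returns None, and calling .text on that raises an AttributeError, which may be what the question describes. A minimal sketch of a guard follows; the helper name text_or_none and the inline HTML are hypothetical, made up for illustration rather than taken from the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the class names used in the question.
html = (
    '<div class="product-reference_top product-reference">'
    '<label>Reference:</label> <span>81020104</span>'
    '</div>'
)
soup = BeautifulSoup(html, 'html.parser')


def text_or_none(parent, tag_name):
    """Return the stripped text of the first matching child tag, or None."""
    if parent is None:
        return None
    tag = parent.find(tag_name)
    return tag.get_text(strip=True) if tag is not None else None


reference = soup.find('div', class_='product-reference_top product-reference')
reference_number = text_or_none(reference, 'span')
print(reference_number)
```

With a guard like this, a product page that lacks the reference block yields None instead of stopping the scraper with an exception.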
Use set() instead of a list to skip duplicates:
imagelinks = set()
# ...
imagelinks.add(image)
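One thing to keep in mind is that a set does not preserve the order in which the images appear on the page. If the order matters, a dictionary can do the deduplication instead, since dict keys are unique and keep insertion order in Python 3.7+. A small sketch, where the function name dedupe_keep_order and the example.com URLs are placeholders and not part of the original answer:

```python
def dedupe_keep_order(links):
    """Return links with duplicates removed, keeping first-seen order."""
    # dict keys are unique and remember insertion order (Python 3.7+).
    return list(dict.fromkeys(links))


# Hypothetical sample data standing in for the scraped 'src' values.
sample_links = [
    'https://example.com/img/1.jpg',
    'https://example.com/img/2.jpg',
    'https://example.com/img/1.jpg',
    'https://example.com/img/3.jpg',
    'https://example.com/img/2.jpg',
]

unique_links = dedupe_keep_order(sample_links)
print(unique_links)
print('len:', len(unique_links))
```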
Full working code:
import requests
from bs4 import BeautifulSoup

testlink = 'https://trgovina.audiopro.si/si/bas-glave/36037-81020104.html'
r = requests.get(testlink)
soup = BeautifulSoup(r.content, 'html.parser')
imagelinks = set()  # a set silently ignores repeated URLs
name = soup.find('h1', class_='product_name').text.strip()
reference = soup.find('div', class_='product-reference_top product-reference')
reference_number = reference.find('span').text  # text only, without the <span> tag
print(reference_number)
images = soup.find_all('li', class_='thumb-container')
for item in images:
    image = item.find('img').attrs['src']
    imagelinks.add(image)
print(imagelinks)
print('len:', len(imagelinks))
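EDIT:
Alternatively, take the thumbnails only from <div id="thumb_box">, so that matching elements elsewhere on the page are not picked up,
using find().find_all():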
images = soup.find('div', {'id':'thumb_box'}).find_all('li', class_='thumb-container')
or using a CSS selector:
images = soup.select('div#thumb_box li.thumb-container')
import requests
from bs4 import BeautifulSoup

testlink = 'https://trgovina.audiopro.si/si/bas-glave/36037-81020104.html'
r = requests.get(testlink)
soup = BeautifulSoup(r.content, 'html.parser')
imagelinks = []
name = soup.find('h1', class_='product_name').text.strip()
reference = soup.find('div', class_='product-reference_top product-reference')
reference_number = reference.find('span').text
print(reference_number)
images = soup.find('div', {'id':'thumb_box'}).find_all('li', class_='thumb-container')
#images = soup.select('div#thumb_box li.thumb-container')
for item in images:
    image = item.find('img').attrs['src']
    imagelinks.append(image)
print(imagelinks)
print('len:', len(imagelinks))
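To see why scoping the search can remove the duplicates, here is a small offline sketch. The HTML is invented for illustration: it assumes the same li.thumb-container items are rendered in two places on the page, which may or may not match the real site. The id other_gallery and the image paths are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment: the same thumbnails appear in two containers.
html = """
<div id="thumb_box">
  <ul>
    <li class="thumb-container"><img src="/img/a.jpg"></li>
    <li class="thumb-container"><img src="/img/b.jpg"></li>
  </ul>
</div>
<div id="other_gallery">
  <ul>
    <li class="thumb-container"><img src="/img/a.jpg"></li>
    <li class="thumb-container"><img src="/img/b.jpg"></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Searching the whole document matches the items in both containers.
all_items = soup.find_all('li', class_='thumb-container')
all_links = [li.find('img')['src'] for li in all_items]

# Restricting the search to #thumb_box matches each item only once.
scoped_items = soup.select('div#thumb_box li.thumb-container')
scoped_links = [li.find('img')['src'] for li in scoped_items]

print('whole page:', all_links)
print('thumb_box only:', scoped_links)
```

Scoping the selector keeps a plain list usable and preserves page order, whereas set() hides the duplication without explaining where it comes from.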
