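I need to scrape a website that presents a paragraph-like "table", and I want to load it into a pandas DataFrame in Python.
I need the name, price, and description from each page, all in one DataFrame. I can scrape each of these separately, but I can't combine them into a properly shaped DataFrame.
Here is what I have so far:
I get the product links first because I need to scrape multiple pages: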
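import requests
from bs4 import BeautifulSoup as soup

baseURL = 'https://www.civivi.com'
HEADER = {}  # request headers; not shown in the post, so an empty placeholder here
product_links = []
for x in range(1, 3):
    # pass headers by keyword: the second positional argument of requests.get is params
    HTML = requests.get(f'https://www.civivi.com/collections/all-products/price-range_-70?page={x}', headers=HEADER)
    # HTML.status_code
    Booti = soup(HTML.content, "lxml")
    knife_items = Booti.find_all('div', class_="product-list product-list--collection product-list--with-sidebar")
    for items in knife_items:
        for links in items.findAll('a', attrs={'class': 'product-item__image-wrapper product-item__image-wrapper--with-secondary'}, href=True):
            # join the site root and the relative product link
            product_links.append(baseURL + links['href'])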
Then I scrape the individual product pages here:
Name = []
Price = []
Specific = []
for links in product_links:
    # testlinks = "https://www.civivi.com/collections/all-products/products/civivi-rustic-gent-lockback-knife-c914-g10-d2"
    HTML2 = requests.get(links, headers=HEADER)
    Booti2 = soup(HTML2.content, "html.parser")
    try:
        for N in Booti2.findAll('h1', {'class': "product-meta__title heading h1"}):
            Name.append(N.text.replace('\n', '').strip())
        for P in Booti2.findAll('span', {'class': "price"}):
            Price.append(P.text.replace('\n', '').strip())
        Contents = Booti2.find('div', class_="rte text--pull")
        for S in Contents.find_all('span'):
            Specific.append(S.text)
    except AttributeError:
        # Contents is None when a page has no matching description block
        continue
So I need all the information in this format:
| Name           | Price | Model Number | Model Name | Overall Length |
|----------------|-------|--------------|------------|----------------|
| Product Name 1 | $$    | XXXX         | ABC        | XX"/XXcm       |
| Product Name 2 | $$    | XXXX         | ABC        | XX"/XXcm       |
| Product Name 3 | $$    | XXXX         | ABC        | XX"/XXcm       |
| Product Name 4 | $$    | XXXX         | ABC        | XX"/XXcm       |
...and so on, with the rest of the columns from the web page. Any help would be appreciated. Thank you very much!
A uj5u.com user replied:
I'll update this in a minute, but try something like this:
from bs4 import BeautifulSoup as soup
import requests
import pandas as pd
import re

baseURL = 'https://www.civivi.com'
product_links = []
header = {}
for x in range(1, 2):
    # pass headers by keyword: the second positional argument of requests.get is params
    HTML = requests.get(f'https://www.civivi.com/collections/all-products/price-range_-70?page={x}', headers=header)
    # HTML.status_code
    Booti = soup(HTML.content, "lxml")
    knife_items = Booti.find_all('div', class_="product-list product-list--collection product-list--with-sidebar")
    for items in knife_items:
        for links in items.findAll('a', attrs={
                'class': 'product-item__image-wrapper product-item__image-wrapper--with-secondary'}, href=True):
            # join the site root and the relative product link
            product_links.append(baseURL + links['href'])
# dataframe that will hold the final resulting data
final = pd.DataFrame()
for links in product_links:
    HTML2 = requests.get(links, headers=header)
    Booti2 = soup(HTML2.content, "lxml")
    try:
        buffer = pd.DataFrame(
            [[
                # name
                Booti2.find('h1', class_='product-meta__title heading h1').text.strip(),
                # price
                Booti2.find('div', class_='price-list').find('span').text,
                # if you don't want $ do this: Booti2.find('div', class_='price-list').find('span').text[1:]
                # Model Number
                str(Booti2(text=re.compile(r'(?:Model Number: )'))[4]).replace('Model Number: ', ''),
                # Model Name - using [4] is not the best way. I think the regex could be better or something.
                str(Booti2(text=re.compile(r'(?:Model Name: )'))[4]).replace('Model Name: ', ''),
                # Overall Length
                str(Booti2(text=re.compile(r'(?:Overall Length: )'))[4]).replace('Overall Length: ', '')
            ]],
            columns=['Name', 'Price', 'Model Number', 'Model Name', 'Overall Length']
        )
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        final = pd.concat([final, buffer], ignore_index=True)
    except (AttributeError, IndexError):
        # a tag was missing (find returned None) or there were fewer than 5 text matches
        continue
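As a variation on the loop above, here is a minimal sketch that collects one dictionary per product and builds the DataFrame once at the end, instead of growing it row by row. The `scraped_pages` list is a hypothetical stand-in for values parsed from real pages, filled with the placeholder values from the question; the column names are the ones used above.

```python
import pandas as pd

# Hypothetical stand-in for per-page results; in the real loop each dict
# would be filled from one parsed product page.
scraped_pages = [
    {'Name': 'Product Name 1', 'Price': '$$', 'Model Number': 'XXXX'},
    {'Name': 'Product Name 2', 'Price': '$$'},  # a page missing one field
]

rows = []
for page_data in scraped_pages:
    rows.append(page_data)  # one dict per product, keyed by column name

# Passing columns= fixes the column order; fields missing from a dict become NaN.
columns = ['Name', 'Price', 'Model Number']
final = pd.DataFrame(rows, columns=columns)
print(final)
```

Because each row is keyed by column name, a page that lacks a field leaves an empty cell rather than shifting the other values into the wrong column.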
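Edit: to answer your question about [4]:
I was trying to find a way to search for the tags that contain the relevant text, i.e. "Model Number", "Model Name", and "Overall Length". I tried to do this with regular expressions (the re library), which is the text=re.compile part. So initially I tried something like:
Booti2.find_all('span', text=re.compile(r'Model Number')) # these attributes are in <span> tags
For some reason that didn't work properly, so I changed it to find every occurrence of those words instead: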
Booti2(text=re.compile(r'(?:Overall Length: )'))
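The line above returns 5 matches. You can check this yourself by setting a breakpoint on that line. The index [4] simply picks the last match, which happens to be the correct text. I don't think this is the best solution, because it can easily break or stop working as expected.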
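One possible way to avoid the fixed index is to read every "Label: value" span inside the description block into a dictionary and look values up by label. This is only a sketch: `SAMPLE_HTML` and `parse_specs` are hypothetical, the sample assumes each label and value sit together in a single non-nested <span> inside the 'rte text--pull' div (the class used in the question), and the real page markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one product page, using the placeholder values
# from the question; the real markup may be structured differently.
SAMPLE_HTML = """
<div class="rte text--pull">
  <p><span>Model Number: XXXX</span></p>
  <p><span>Model Name: ABC</span></p>
  <p><span>Overall Length: XX"/XXcm</span></p>
  <p><span>A sentence without a label.</span></p>
</div>
"""

def parse_specs(page):
    """Return {label: value} for every 'Label: value' span in the spec block."""
    specs = {}
    block = page.find('div', class_='rte text--pull')
    if block is None:  # the page has no spec block
        return specs
    for span in block.find_all('span'):
        text = span.get_text(strip=True)
        label, sep, value = text.partition(':')
        if sep and value.strip():  # skip spans that have no 'label: value' shape
            specs[label.strip()] = value.strip()
    return specs

specs = parse_specs(BeautifulSoup(SAMPLE_HTML, 'html.parser'))
print(specs['Model Number'])
```

Looking values up by label means a page with more or fewer matches no longer shifts which entry is read.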
If you want to add other attributes, just copy and paste one of the existing ones and change the text, for example:
str(Booti2(text=re.compile(r'(?:Blade Length: )'))[4]).replace('Blade Length: ', '') # order of attributes here must match column names order
Then update the column names:
columns=['Name', 'Price', 'Model Number', 'Model Name', 'Overall Length', 'new column name here']
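Keeping the values and the column names in matching order by hand gets error-prone as the list grows. A rough sketch of one alternative is to drive both from a single list of labels. Here `WANTED`, `build_row`, and the `specs` dictionary (label to value for one page) are all hypothetical names introduced for illustration, and the values are the placeholders from the question.

```python
# Labels to keep, in column order; adding an attribute means editing only this list.
WANTED = ['Model Number', 'Model Name', 'Overall Length', 'Blade Length']

def build_row(name, price, specs):
    """Build one row dict; labels missing from specs are stored as None."""
    row = {'Name': name, 'Price': price}
    for label in WANTED:
        row[label] = specs.get(label)
    return row

# The column list is derived from the same labels, so the two cannot drift apart.
columns = ['Name', 'Price'] + WANTED

row = build_row('Product Name 1', '$$',
                {'Model Number': 'XXXX', 'Blade Length': 'XX"/XXcm'})
print(row)
```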
