Getting HTML <p> tags as a Pandas table in Python

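I need to scrape a website that has a paragraph-like "table", and I want to get it as a DataFrame in Python.

I need to get the name, price, and description from each page and put them all into a DataFrame. The problem is that I can scrape each of these individually, but I can't get them into a proper DataFrame.

Here is what I have done so far: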

I get the product links first because I need to scrape multiple pages:
baseURL = 'https://www.civivi.com'
product_links = []
for x in range (1,3):
    HTML = requests.get(f'https://www.civivi.com/collections/all-products/price-range_-70?page={x}', headers=HEADER)
    #HTML.status_code
    Booti= soup(HTML.content, "lxml")
    knife_items = Booti.find_all('div',class_= "product-list product-list--collection product-list--with-sidebar")
    
    for items in knife_items:
        for links in items.findAll('a', attrs = {'class' : 'product-item__image-wrapper product-item__image-wrapper--with-secondary'}, href = True):
            product_links.append(baseURL + links['href'])

Then I scrape the individual product pages here:

Name = []
Price = []
Specific = []
for links in product_links:
#testlinks = "https://www.civivi.com/collections/all-products/products/civivi-rustic-gent-lockback-knife-c914-g10-d2"
    HTML2 = requests.get(links, headers=HEADER)
    Booti2 = soup(HTML2.content,"html.parser") 
    try:
        for N in Booti2.findAll('h1',{'class': "product-meta__title heading h1" }):
            Name.append(N.text.replace('\n', '').strip())
        for P in Booti2.findAll('span',{'class': "price" }):
            Price.append(P.text.replace('\n', '').strip())
        Contents = Booti2.find('div',class_= "rte text--pull")
        for S in Contents.find_all('span'):
            Specific.append(S.text)

    except:
        continue

So I need to get all the information in this format:

| Name           | Price | Model Number | Model Name | Overall Length |
|----------------|-------|--------------|------------|----------------|
| Product Name 1 | $$    | XXXX         | ABC        | XX"/XXcm       |
| Product Name 2 | $$    | XXXX         | ABC        | XX"/XXcm       |
| Product Name 3 | $$    | XXXX         | ABC        | XX"/XXcm       |
| Product Name 4 | $$    | XXXX         | ABC        | XX"/XXcm       |

...and so on, along with the rest of the columns from the web page. Any help would be greatly appreciated! Thank you so much!

Reply from a uj5u.com user:

I'll update this in a minute, but try something like this:

from bs4 import BeautifulSoup as soup
import requests
import pandas as pd
import re

baseURL = 'https://www.civivi.com'
product_links = []
header = {}
for x in range(1, 2):
    HTML = requests.get(f'https://www.civivi.com/collections/all-products/price-range_-70?page={x}', headers=header)
    # HTML.status_code
    Booti = soup(HTML.content, "lxml")
    knife_items = Booti.find_all('div', class_="product-list product-list--collection product-list--with-sidebar")

    for items in knife_items:
        for links in items.findAll('a', attrs={
            'class': 'product-item__image-wrapper product-item__image-wrapper--with-secondary'}, href=True):
            product_links.append(baseURL + links['href'])

# collect one single-row DataFrame per product, then combine them at the end
rows = []
for links in product_links:
    HTML2 = requests.get(links, headers=header)
    Booti2 = soup(HTML2.content, "lxml")
    try:
        buffer = pd.DataFrame(
            [[
                # name
                Booti2.find('h1', class_='product-meta__title heading h1').text.strip(),
                # price
                Booti2.find('div', class_='price-list').find('span').text,
                # if you don't want $ do this: Booti2.find('div', class_='price-list').find('span').text[1:]
                # Model Number
                str(Booti2(text=re.compile(r'(?:Model Number: )'))[4]).replace('Model Number: ', ''),
                # Model Name - using [4] is not the best way. I think the regex could be better or something.
                str(Booti2(text=re.compile(r'(?:Model Name: )'))[4]).replace('Model Name: ', ''),
                # Overall Length
                str(Booti2(text=re.compile(r'(?:Overall Length: )'))[4]).replace('Overall Length: ', '')
            ]],
            columns=['Name', 'Price', 'Model Number', 'Model Name', 'Overall Length']
        )
        rows.append(buffer)
    except (AttributeError, IndexError):
        # find() returned None or fewer than 5 matches were found; skip this product
        continue

# DataFrame.append was removed in pandas 2.0, so combine the rows with pd.concat
final = pd.concat(rows, ignore_index=True) if rows else pd.DataFrame()

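Edit: answering your question about [4]:

I was trying to find a way to search for the tags containing the relevant text, i.e. "Model Number", "Model Name" and "Overall Length". I tried to do this with regular expressions (the re library), which is the text=re.compile part. So initially I tried something like: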

Booti2.find_all('span', text=re.compile(r'Model Number'))  # these attributes are in <span> tags

For some reason that didn't work, so I changed it to find every instance of those words instead:

Booti2(text=re.compile(r'(?:Overall Length: )'))

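The line above returns 5 instances. You can check this yourself by setting a breakpoint on that line. The index [4] simply picks the last instance, which happens to be the one with the correct text. I don't think this is the best solution, because it could easily break or not work as expected.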
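One possible reason the find_all('span', text=...) attempt found nothing is that BeautifulSoup matches such a filter against a tag's .string, which is None when the tag has more than one child. A less index-dependent alternative to [4] might be to split each "Label: value" line into a dict. The following is only a sketch: parse_specs is a hypothetical helper, and the sample lines are made up rather than taken from the real page.

```python
def parse_specs(lines):
    """Turn 'Label: value' strings into a dict, skipping lines that don't fit."""
    specs = {}
    for line in lines:
        # split on the first colon only, so values may contain colons
        label, sep, value = line.partition(":")
        if not sep:
            continue  # no colon, so not a "Label: value" line
        label, value = label.strip(), value.strip()
        if label and value:
            specs[label] = value  # a repeated label keeps its last value
    return specs


# Made-up sample lines standing in for the text of the spec <span> tags.
sample = [
    "Model Number: C000",
    "Model Name: Example",
    'Overall Length: 7.00" / 177.8mm',
    "Free shipping on some orders",
]
specs = parse_specs(sample)
print(specs["Model Number"])
```

In the scraping loop, the lines could come from something like (S.get_text() for S in Contents.find_all('span')), where Contents is the "rte text--pull" div from the question. Whether every specification sits in its own span on the real page is an assumption that would need checking.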

If you want to add other attributes, just copy and paste one of the existing ones and change the text, for example:

str(Booti2(text=re.compile(r'(?:Blade Length: )'))[4]).replace('Blade Length: ', '')  # order of attributes here must match column names order

Then update the column names:

columns=['Name', 'Price', 'Model Number', 'Model Name', 'Overall Length', 'new column name here']
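If the aim is to capture every specification on the page rather than a fixed list of columns, one option is to build one dict per product and let pandas work out the columns. A minimal sketch, with made-up records standing in for the scraped values:

```python
import pandas as pd

# Made-up records; in the real loop each dict would hold one product's scraped values.
records = [
    {"Name": "Product Name 1", "Price": "$10", "Model Number": "C000"},
    {"Name": "Product Name 2", "Price": "$20", "Overall Length": '7.00"'},
]

# Dict keys become columns; a key missing from a record is filled with NaN.
df = pd.DataFrame(records)
print(df)
```

With this approach there is no column list to keep in sync by hand, at the cost of NaN cells wherever a product lacks a given attribute.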
