Unable to scrape paginated web pages with Python

I am trying to scrape the product links in the following category:

https://www.acihellas.gr/gaming-pontikia#/

It has 4 pages of products, but for some reason I only get the first one. This is the code I am using:

import time

import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
filterprods = '/#/pageSize=21&viewMode=grid&orderBy=10&pageNumber='


# The function header was not shown in the original post; this name is illustrative.
def get_product_links(url2get, page_number):
    urllist = []
    # Note: range(1, page_number) stops before page_number itself.
    for itm in range(1, page_number):
        print("Page", itm)
        # Build the page URL by appending the filter string and the page number.
        urlget = url2get + filterprods + str(itm)
        time.sleep(2)
        ses = requests.Session()
        r = ses.get(urlget, headers=headers)

        if r.status_code == 200:
            Myhtml = r.text
            soup = BeautifulSoup(Myhtml, 'lxml')
            productlist = soup.find_all('div', attrs={'class': 'item-box'})

            for p_item in productlist:
                a = p_item.find('a')
                if a:
                    producttitle = a['title']
                    productlink = a['href']
                    url_item = 'https://acihellas.gr' + productlink
                    print(url_item)
                    urllist.append(url_item)
                    time.sleep(2)
            ses.close()
        else:
            print(r.status_code)

    return urllist

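The URLs are concatenated correctly, but ses.get(url) does not return the other pages, so I wondered whether closing and reopening the session each time would help.

When I inspect the page for a link to the next page, there is none, so I build the URL myself with the filterprods variable.

How can I solve this?

Thank you.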

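A uj5u.com user replied:

You did not include the site URL in your code. You can use the website's API to collect the products. Here is some starter code; I will leave the parsing to you :)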

import requests

url = 'https://www.acihellas.gr/getFilteredProducts'

for pagenum in range(1, 5):

    payload = {
        "categoryId": "828",
        "manufacturerId": "0",
        "vendorId": "0",
        "priceRangeFilterModel7Spikes": "null",
        "specificationFiltersModel7Spikes": {
            "CategoryId": "828",
            "ManufacturerId": "0",
            "VendorId": "0",
            "SpecificationFilterGroups": [{
                "Id": 998,
                "FilterItems": [{
                    "Id": "25188",
                    "FilterItemState": "Unchecked"
                }, {
                    "Id": "18572",
                    "FilterItemState": "Unchecked"
                }, {
                    "Id": "7361",
                    "FilterItemState": "Unchecked"
                }, {
                    "Id": "7362",
                    "FilterItemState": "Unchecked"
                }, {
                    "Id": "7368",
                    "FilterItemState": "Unchecked"
                }, {
                    "Id": "18060",
                    "FilterItemState": "Unchecked"
                }, {
                    "Id": "19024",
                    "FilterItemState": "Unchecked"
                }, {
                    "Id": "24876",
                    "FilterItemState": "Unchecked"
                }, {
                    "Id": "28037",
                    "FilterItemState": "Unchecked"
                }, {
                    "Id": "23321",
                    "FilterItemState": "Unchecked"
                }]
            }, {
                "Id": 990,
                "FilterItems": [{
                    "Id": "7336",
                    "FilterItemState": "Unchecked"
                }]
            }, {
                "Id": 995,
                "FilterItems": [{
                    "Id": "7350",
                    "FilterItemState": "Unchecked"
                }, {
                    "Id": "7348",
                    "FilterItemState": "Unchecked"
                }, {
                    "Id": "7349",
                    "FilterItemState": "Unchecked"
                }]
            }]
        },
        "pageNumber": str(pagenum),
        "orderby": "10",
        "viewmode": "grid",
        "pagesize": "21",
        "queryString": "#/pageSize=21&viewMode=grid&orderBy=10&pageNumber=" + str(pagenum),
        "shouldNotStartFromFirstPage": "true",
        "keyword": "",
        "searchCategoryId": "0",
        "searchManufacturerId": "0",
        "searchVendorId": "0",
        "priceFrom": "",
        "priceTo": "",
        "includeSubcategories": "False",
        "searchInProductDescriptions": "False",
        "advancedSearch": "False",
        "isOnSearchPage": "False",
        "inStockFilterModel": {
            "CategoryId": "828",
            "ManufacturerId": "0",
            "VendorId": "0",
            "Id": "1",
            "FilterItemState": "Unchecked"
        }
    }

    res = requests.post(url, json=payload)

    print(res.text)

Note: the links you get back are relative, so you need to prefix them with the site URL: https://www.acihellas.gr/
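A minimal sketch of that step, assuming the links come back as site-relative paths (the sample hrefs below are made up for illustration, and `to_absolute` is a hypothetical helper name). `urljoin` from the standard library copes with or without a leading slash, so no manual string concatenation is needed.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.acihellas.gr/"


def to_absolute(links, base=BASE_URL):
    """Turn site-relative hrefs into absolute URLs, keeping their order."""
    return [urljoin(base, link) for link in links]


# Hypothetical hrefs, only to show the behaviour.
sample_links = ["/example-mouse-1", "example-mouse-2"]
absolute_links = to_absolute(sample_links)
print(absolute_links)
```

Links that are already absolute pass through `urljoin` unchanged, so the helper is safe to apply to a mixed list.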

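Edit:

To answer the follow-up question of whether it is possible to change only the category in the payload: it looks like it is. I removed the whole part of the payload that holds the product filter variables, and it still works: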

payload = {
        "categoryId": "828",
        "manufacturerId": "0",
        "vendorId": "0",
        "priceRangeFilterModel7Spikes": "null",
        "pageNumber": str(pagenum),
        "orderby": "10",
        "viewmode": "grid",
        "pagesize": "21",
        "queryString": "#/pageSize=21&viewMode=grid&orderBy=10&pageNumber=" + str(pagenum),
        "shouldNotStartFromFirstPage": "true",
        "keyword": "",
        "searchCategoryId": "0",
        "searchManufacturerId": "0",
        "searchVendorId": "0",
        "priceFrom": "",
        "priceTo": "",
        "includeSubcategories": "False",
        "searchInProductDescriptions": "False",
        "advancedSearch": "False",
        "isOnSearchPage": "False",
        "inStockFilterModel": {
            "CategoryId": "828",
            "ManufacturerId": "0",
            "VendorId": "0",
            "Id": "1",
            "FilterItemState": "Unchecked"
        }
    }
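Since only the category and page number vary, the reduced payload can be wrapped in a small function. This is a sketch, not a tested client: the field names are copied from the payload above, while `build_payload` and its parameters are hypothetical names, and whether other category IDs or page sizes are accepted by the endpoint is an assumption.

```python
def build_payload(category_id, page_number, page_size=21, order_by=10):
    """Build the reduced payload for one page of one category."""
    category = str(category_id)
    query = (
        f"#/pageSize={page_size}&viewMode=grid"
        f"&orderBy={order_by}&pageNumber={page_number}"
    )
    return {
        "categoryId": category,
        "manufacturerId": "0",
        "vendorId": "0",
        "priceRangeFilterModel7Spikes": "null",
        "pageNumber": str(page_number),
        "orderby": str(order_by),
        "viewmode": "grid",
        "pagesize": str(page_size),
        "queryString": query,
        "shouldNotStartFromFirstPage": "true",
        "keyword": "",
        "searchCategoryId": "0",
        "searchManufacturerId": "0",
        "searchVendorId": "0",
        "priceFrom": "",
        "priceTo": "",
        "includeSubcategories": "False",
        "searchInProductDescriptions": "False",
        "advancedSearch": "False",
        "isOnSearchPage": "False",
        "inStockFilterModel": {
            "CategoryId": category,
            "ManufacturerId": "0",
            "VendorId": "0",
            "Id": "1",
            "FilterItemState": "Unchecked",
        },
    }


# Payload for page 2 of the category used in this thread.
example_payload = build_payload(828, 2)
print(example_payload["queryString"])
```

Inside the loop this would be used as `requests.post(url, json=build_payload(828, pagenum))`.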

A uj5u.com user replied:

I cannot test this properly because the domain is blocked in my country, but you could try the following:

import requests
import bs4 as bs

url_base = 'http://www.acihellas.gr/gaming-pontikia#/pageSize=21&viewMode=grid&orderBy=10&pageNumber={page}'
total_pages = 4
products = {}

for page in range(1, total_pages + 1):
    url = url_base.format(page=page)
    print(f'Scraping page {page}...')
    res = requests.get(url)

    soup = bs.BeautifulSoup(res.text, 'lxml')
    items = soup.find_all('div', {'class': 'details'})
    # Add items to the dictionary, keyed by product name
    for item in items:
        name = item.find("h2").text
        print(name)
        link = item.find("a")['href']
        products[name] = {'name': name, 'url': link}

products

The products dictionary should then contain the name and URL of each item.
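One caveat worth checking before relying on URLs of this shape: everything after `#` is a URL fragment, and HTTP clients keep the fragment on the client side instead of sending it to the server. If that is what happens here, every request fetches the same first page, which would match the symptom in the question; the site's JavaScript presumably reads the fragment and loads the other pages itself. A small sketch using only the standard library shows which part of each page URL the server would actually see (`request_target` is a hypothetical helper name):

```python
from urllib.parse import urlsplit

url_base = (
    "http://www.acihellas.gr/gaming-pontikia"
    "#/pageSize=21&viewMode=grid&orderBy=10&pageNumber={page}"
)


def request_target(url):
    """Return the path (plus query, if any) that is sent in the HTTP request."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


# Collect the distinct request targets for pages 1 to 4.
targets = {request_target(url_base.format(page=page)) for page in range(1, 5)}
print(targets)  # {'/gaming-pontikia'}
```

All four page URLs collapse to one request target, so the page number never reaches the server through this URL.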

