How do I use Bs4 to scrape the pages inside result cards?

<img data-src="https://im1.dineout.co.in/images/uploads/restaurant/sharpen/4/h/u/p4059-15500352575c63a9394c209.jpg?tr=tr:n-medium" alt="How do I use Bs4 to scrape the pages inside result cards?" data-gatype="RestaurantImageClick" data-url="/delhi/biryani-by-kilo-connaught-place-central-delhi-40178" data-w-onclick="cardClickHandler" src="https://im1.dineout.co.in/images/uploads/restaurant/sharpen/4/h/u/p4059-15500352575c63a9394c209.jpg?tr=tr:n-medium">

Page URL - https://www.dineout.co.in/delhi-restaurants?search_str=biryani&p=1

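This page contains a number of restaurant cards. While scraping the listing pages in a loop, I also want to go into each restaurant card's URL (the data-url attribute in the HTML above) and scrape the number of reviews from that page. I don't know how to do that. My current code for scraping the listing pages is: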

import re

import requests
from bs4 import BeautifulSoup

def extract(page):
    url = f"https://www.dineout.co.in/delhi-restaurants?search_str=biryani&p={page}"  # URL of the website 
    header = {'User-Agent':'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36'} # Temporary user agent
    r = requests.get(url, headers=header)
    soup = BeautifulSoup(r.content, 'html.parser')
    return soup

def transform(soup): # function to scrape the page
    divs = soup.find_all('div', class_ = 'restnt-card restaurant')
    for item in divs:
        title = item.find('a').text.strip() # restaurant name
        loc = item.find('div', class_ = 'restnt-loc ellipsis').text.strip() # restaurant location
        try:  # some restaurants are unrated; find() then returns None and .text raises AttributeError
            rating = item.find('div', class_="img-wrap").text
            rating = re.sub("[^0-9,.]", "", rating)
        except AttributeError:
            rating = None
        price_text = item.find('span', class_="double-line-ellipsis").text.strip() # price for biryani
        price = re.sub("[^0-9]", "", price_text)[:-1]  # keep digits only, then drop the final digit

        biry_del = {
            'name': title,
            'location': loc,
            'rating': rating,
            'price': price
        }
        rest_list.append(biry_del)

        
rest_list = []

for i in range(1,18):
    print(f'getting page, {i}')
    c = extract(i)
    transform(c)

I hope this is clear. If anything is confusing, please ask in the comments.

Answer from a uj5u.com user:

It's not very fast, but if you hit this backend API endpoint, it looks like you can get all the details you need, including the review count (not 232!): https://www.dineout.co.in/get_rdp_data_main/delhi/69676/restaurant_detail_main

import requests
from bs4 import BeautifulSoup
import pandas as pd

rest_list = []
s = requests.Session()  # one session reused for every request instead of a new one per page

for page in range(1,3):
    print(f'getting page, {page}')

    url = f"https://www.dineout.co.in/delhi-restaurants?search_str=biryani&p={page}"  # URL of the website
    header = {'User-Agent':'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36'} # Temporary user agent
    r = s.get(url, headers=header)
    soup = BeautifulSoup(r.content, 'html.parser')

    divs = soup.find_all('div', class_ = 'restnt-card restaurant')

    for item in divs:
        code = item.find('a')['href'].split('-')[-1] # restaurant code
        print(f'Getting details for {code}')
        data = s.get(f'https://www.dineout.co.in/get_rdp_data_main/delhi/{code}/restaurant_detail_main').json()

        info = data['header']
        info.pop('share', None)  # clean up csv; the default avoids a KeyError if the key is absent
        info.pop('options', None)
        rest_list.append(info)

df = pd.DataFrame(rest_list)
df.to_csv('dehli_rest.csv',index=False)
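
Since the question asks about the data-url attribute specifically, here is a minimal sketch of turning such a path into the detail endpoint used above. It assumes every card path has the shape /city/name-code, as in the single sample from the question, and that the city segment can be substituted into the endpoint the same way "delhi" is; neither is confirmed for other listings. The names parse_card_path and detail_endpoint are made up for illustration.

```python
import re

BASE = "https://www.dineout.co.in"

def parse_card_path(path):
    """Split a card path like '/delhi/some-name-40178' into (city, code).

    Assumption: the path starts with the city and ends with '-<digits>'.
    """
    match = re.fullmatch(r"/([^/]+)/.*-(\d+)", path.strip())
    if match is None:
        raise ValueError(f"unexpected card path: {path!r}")
    return match.group(1), match.group(2)

def detail_endpoint(path):
    """Build the backend detail URL for one restaurant card path."""
    city, code = parse_card_path(path)
    return f"{BASE}/get_rdp_data_main/{city}/{code}/restaurant_detail_main"

if __name__ == "__main__":
    # data-url value copied from the <img> tag in the question
    sample = "/delhi/biryani-by-kilo-connaught-place-central-delhi-40178"
    print(detail_endpoint(sample))
```

The returned URL can be passed to s.get(...) in place of the hard-coded f-string in the loop above. Raising ValueError on an unexpected path makes a changed URL format visible right away instead of silently requesting a wrong endpoint.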
