Creating a web scraper with a loop to search links

I am trying to create a web scraper that returns restaurant names and addresses from a website. The current version returns only the names (as a test), but they are saved as a string ( [{'name': 'Copernicus Restaurant | Copernicus Hotel'}] [{'name': 'Copernicus Restaurant | Copernicus Hotel'}, {'name': 'Farina Restaurant'}] [{'name': 'Copernicus Restaurant | Copernicus Hotel'}, {'name': 'Farina Restaurant'}, {'name': 'Cyrano de Bergerac'}]).

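Can someone help me correct this code so that it gets the link for each restaurant and then extracts the restaurant name and address from those links?

I would appreciate any help.

My code: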

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

driver_service = Service(executable_path="C:/webdrivers/chromedriver.exe")
driver = webdriver.Chrome(service=driver_service)
baseurl = 'https://restaurantguru.com/'
driver.get('https://restaurantguru.com/restaurant-Poland-t1')

soup = BeautifulSoup(driver.page_source, 'lxml')
productlist = soup.find_all('div', class_='wrapper_info')

#print(productlist)

productlinks = []
for item in productlist:
    for link in item.find_all('a', href=True):
        productlinks.append(link['href'])

#print(productlinks)

restlist = []

for link in productlinks:
    r = driver.get(link)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    name = soup.find('h1', class_='notranslate').text.strip()
    #  address = soup.find('div', class_='address')
    #  try:
    #     website = soup.find('a', href=True)
    #  except:
    #     website = 'NULL'

    rest ={
        'name': name,
        #   'website': website,
        #   'address': address
    }

    restlist.append(rest)
    print(restlist)

driver.quit()

Edited code that produces an error:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd
import time
import csv


#driver_service = Service(executable_path="C:/webdrivers/chromedriver.exe")
#driver = webdriver.Chrome(service=driver_service)


driver = webdriver.Chrome()
baseurl = 'https://restaurantguru.com/'
driver.get('https://restaurantguru.com/restaurant-Gostynin-t1')


soup = BeautifulSoup(driver.page_source, 'lxml')
productlist = soup.find_all('div', class_='wrapper_info')

#print(productlist)

productlinks = []
for item in productlist:
    for link in item.find_all('a', href=True):
        productlinks.append(link['href'])

#print(len(productlinks))


restlist = []

for link in productlinks:
    print('[DEBUG] link:', link)
    driver.get(link)
    print('[DEBUG] soup ...')
    soup = BeautifulSoup(driver.page_source, 'lxml')
    print('[DEBUG] name ...')
    name = soup.find('h1', class_='notranslate').text.strip()
    print(name)
    name = driver.find_element(By.XPATH, '//h1[@class="notranslate"]').text.strip()
    print(name)
    print('[DEBUG] address ...')
    address = soup.find('div', class_='address').find('div', class_=False).text.strip()
    print(address)
    address = driver.find_element(By.XPATH, '//div[@class="address"]/div[2]').text.strip()
    print(address)
    print('[DEBUG] website ...')
    try:
        website = soup.find('div', class_='website').find('a').text #get('href')
        print(website)
        website = driver.find_element(By.XPATH, '//div[@class="website"]//a').text #get('href')
        print(website)
    except Exception as ex:
        website = ''
rest = {
    'name': name,
    'website': website,
    'address': address,
}

restlist.append(rest)


print(restlist)
#df = pd.DataFrame(restlist)
#df.to_csv('C:/webdrivers/restauracje.csv')
#print(df.head(10))

driver.quit()

Answer:

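There are many a tags with an href on the page, so you need a more specific approach to get website.

website is inside <div class="website">, so you can do

website = soup.find('div', class_='website').find('a').get('href')

but the real link to the restaurant is the link text, not the href

website = soup.find('div', class_='website').find('a').text

As for the address, I also had to add an extra .find('div', class_=False) (and .text.strip()) to get it

address = soup.find('div', class_='address').find('div', class_=False).text.strip()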
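The chained lookups above raise AttributeError as soon as one block is missing from a page, because find() returns None. A minimal sketch of the same lookups with explicit None checks follows. SAMPLE_HTML and extract_details are hypothetical names, and the markup is only an assumption about the page structure described here; the real site may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like the structure described above (an assumption).
SAMPLE_HTML = """
<h1 class="notranslate"> Farina Restaurant </h1>
<div class="address">
  <div class="icon"></div>
  <div> Marka 16, Krakow </div>
</div>
<div class="website"><a href="/redirect?id=1">https://www.farina.com.pl/</a></div>
"""


def extract_details(html):
    """Return name, website and address, using '' for anything not found."""
    soup = BeautifulSoup(html, "html.parser")

    name_tag = soup.find("h1", class_="notranslate")

    # class_=False matches the first nested <div> that has no class attribute.
    address_box = soup.find("div", class_="address")
    address_tag = address_box.find("div", class_=False) if address_box else None

    # The visible link text is used, not the href.
    website_box = soup.find("div", class_="website")
    link_tag = website_box.find("a") if website_box else None

    return {
        "name": name_tag.get_text(strip=True) if name_tag else "",
        "website": link_tag.get_text(strip=True) if link_tag else "",
        "address": address_tag.get_text(strip=True) if address_tag else "",
    }


if __name__ == "__main__":
    print(extract_details(SAMPLE_HTML))
```

Inside the scraping loop this could be called as extract_details(driver.page_source), which would avoid a broad try/except around the website lookup.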

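Selenium has its own methods for searching for elements in the HTML (this needs from selenium.webdriver.common.by import By), and they may run faster.

name = driver.find_element(By.XPATH, '//h1[@class="notranslate"]').text.strip()

address = driver.find_element(By.XPATH, '//div[@class="address"]/div[2]').text.strip()

website = driver.find_element(By.XPATH, '//div[@class="website"]//a').text #get('href')

Tested with Firefox on Linux.

In the code I kept both methods: soup.find and Selenium's own XPath lookup.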

import csv

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service  # only needed for the Chrome setup below
from selenium.webdriver.common.by import By

#driver_service = Service(executable_path="C:/webdrivers/chromedriver.exe")
#driver = webdriver.Chrome(service=driver_service)

# created before `try` so that `finally` can always use both names
driver = webdriver.Firefox()
restlist = []

try:
    driver.get('https://restaurantguru.com/restaurant-Poland-t1')

    soup = BeautifulSoup(driver.page_source, 'lxml')
    productlist = soup.find_all('div', class_='wrapper_info')

    #print(productlist)

    print('[DEBUG] productlist ...')
    productlinks = []
    for item in productlist:
        for link in item.find_all('a', href=True):
            productlinks.append(link['href'])

    print('len(productlinks):', len(productlinks))

    for link in productlinks:
        print('[DEBUG] link:', link)

        driver.get(link)

        print('[DEBUG] soup ...')
        soup = BeautifulSoup(driver.page_source, 'lxml')

        # each value is read twice on purpose: first with BeautifulSoup, then with Selenium
        print('[DEBUG] name ...')
        name = soup.find('h1', class_='notranslate').text.strip()
        print(name)
        name = driver.find_element(By.XPATH, '//h1[@class="notranslate"]').text.strip()
        print(name)

        print('[DEBUG] address ...')
        address = soup.find('div', class_='address').find('div', class_=False).text.strip()
        print(address)
        address = driver.find_element(By.XPATH, '//div[@class="address"]/div[2]').text.strip()
        print(address)

        print('[DEBUG] website ...')
        try:
            website = soup.find('div', class_='website').find('a').text #get('href')
            print(website)
            website = driver.find_element(By.XPATH, '//div[@class="website"]//a').text #get('href')
            print(website)
        except Exception as ex:
            # some restaurants have no website block
            print('[DEBUG] Exception:', ex)
            website = ''
            print(website)

        rest = {
            'name': name,
            'website': website,
            'address': address,
        }

        print('[DEBUG] rest ...')
        print(rest)

        print('-----')

        # must stay inside the `for`-loop so every restaurant is added
        restlist.append(rest)

    # --- after `for`-loop ---

    print(restlist)
except KeyboardInterrupt:
    print("KeyboardInterrupt")
finally:
    driver.quit()

    # open only once; whatever was collected so far gets written
    with open('output.csv', 'w', newline='', encoding='utf-8') as f:
        csv_writer = csv.DictWriter(f, fieldnames=['name', 'website', 'address'])
        csv_writer.writeheader()
        csv_writer.writerows(restlist)
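The CSV step at the end of the script can also be tried on its own, without a browser. The sketch below assumes the rows are dicts with the same three keys. save_restaurants and load_restaurants are hypothetical helper names, and the explicit UTF-8 encoding is an assumption about what the output file should preserve.

```python
import csv

FIELDNAMES = ["name", "website", "address"]


def save_restaurants(rows, path):
    """Write a list of dicts to a CSV file with a header row."""
    # newline="" leaves line endings to the csv module;
    # encoding="utf-8" keeps characters such as "ó" or "ń" intact.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)


def load_restaurants(path):
    """Read the file back into a list of dicts, e.g. to check the result."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


if __name__ == "__main__":
    rows = [
        {"name": "Pili Pili Gdańsk", "website": "", "address": "Szafarnia 11/U14, Gdańsk"},
    ]
    save_restaurants(rows, "output.csv")
    print(load_restaurants("output.csv"))
```

Opening the file once and writing all rows with writerows keeps the header from being repeated, which is what the "open only once" comment in the script refers to.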

Result (from print(restlist)):

[
{'name': 'Copernicus Restaurant | Copernicus Hotel', 'website': 'https://www.likusrestauracje.pl/', 'address': 'Kanonicza 16, Kraków, Lesser Poland Voivodeship, Poland'}, 
{'name': 'Farina Restaurant', 'website': 'https://www.farina.com.pl/', 'address': 'Świętego Marka 16, Kraków, Lesser Poland Voivodeship, Poland'}, 
{'name': 'Cyrano de Bergerac', 'website': 'http://cyranodebergerac.com.pl', 'address': 'Sławkowska 26, Kraków, Lesser Poland Voivodeship, Poland'}, 
{'name': 'Amarylis Restaurant', 'website': 'https://www.queenhotel.pl/', 'address': 'Józefa Dietla 60, Kraków, Lesser Poland Voivodeship, Poland'}, 
{'name': 'Projekt Nano', 'website': '', 'address': 'Podmurna 17 A, Torun, Kuyavian-Pomeranian Voivodeship, Poland'}, 
{'name': 'Raffles Europejski Warsaw', 'website': 'https://www.raffles.com/warsaw', 'address': 'Nowy Świat-Uniwersytet'}, 
{'name': 'Caffe Horst', 'website': 'http://www.caffehorst.pl/', 'address': 'Świętochłowicka 6, Bytom, Silesian Voivodeship, Poland'}, 
{'name': 'Proza', 'website': '', 'address': 'Jana Karola Chodkiewicza 7, Rzeszow, Podkarpackie Voivodeship, Poland'}, 
{'name': 'Il Posto di Luca Santarossa', 'website': 'http://www.ilposto.pl', 'address': 'Jana Sawy 5/lokal 10, Lublin, Lublin Voivodeship, Poland'}, 
{'name': 'Balkan Bistro Prespa', 'website': '', 'address': 'Władysława Syrokomli 8, Warsaw, Masovian Voivodeship, Poland'}, 
{'name': 'Mr Coffee', 'website': '', 'address': 'Tumska 4, Klodzko, Lower Silesian Voivodeship, Poland'}, 
{'name': 'Bottiglieria 1881 Restaurant', 'website': 'https://www.1881.com.pl/', 'address': 'Bocheńska 5, Kraków, Lesser Poland Voivodeship, Poland'}, 
{'name': 'Albertina Restaurant & Wine', 'website': 'https://www.albertinarestaurant.pl/', 'address': 'Dominikańska 3, Kraków, Lesser Poland Voivodeship, Poland'}, 
{'name': 'Pub & Restauracja „W Sercu Łodzi”', 'website': '', 'address': 'al. Marszałka Józefa Piłsudskiego 138, Łódź, Łódź Voivodeship, Poland'}, 
{'name': '#Alternatywnie', 'website': 'http://www.altcoffee.pl/', 'address': 'aleja Wojska Polskiego 35/u3, Szczecin, West Pomeranian Voivodeship, Poland'}, 
{'name': 'Aqua e Vino', 'website': 'http://www.aquaevino.pl', 'address': 'Wiślna 5/10, Kraków, Lesser Poland Voivodeship, Poland'}, 
{'name': 'Pili Pili Gdańsk', 'website': 'http://www.pilipilicafe.com/', 'address': 'Szafarnia 11/U14, Gdańsk, Pomeranian Voivodeship, Poland'}, 
{'name': 'Kawiarnia Coffeinna', 'website': '', 'address': '1 Maja 26, Jastrzębie-Zdrój, Silesian Voivodeship, Poland'}, 
{'name': 'Mleczarnia', 'website': 'http://www.mle.pl', 'address': 'Rabina, Beera Meiselsa 20, Kraków, Lesser Poland Voivodeship, Poland'}, 
{'name': 'La Squadra Ristorante', 'website': 'http://lasquadra.pl/restauracja/', 'address': 'Bocheńskiego 109, Katowice, Silesian Voivodeship, Poland'}
]
