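I am trying to create a web scraper that returns restaurant names and addresses from a website. The current version returns only the names (as a test), but they come out as a string like this: ( [{'name': 'Copernicus Restaurant | Copernicus Hotel'}] [{'name': 'Copernicus Restaurant | Copernicus Hotel'}, {'name': 'Farina Restaurant'}] [{'name': 'Copernicus Restaurant | Copernicus Hotel'}, {'name': 'Farina Restaurant'}, {'name': 'Cyrano de Bergerac'}]).
Could someone help me correct this code so that it collects the link to each restaurant and then extracts the restaurant name and address from those links?
I would appreciate any help.
My code: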
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

driver_service = Service(executable_path="C:/webdrivers/chromedriver.exe")
driver = webdriver.Chrome(service=driver_service)
baseurl = 'https://restaurantguru.com/'
driver.get('https://restaurantguru.com/restaurant-Poland-t1')
soup = BeautifulSoup(driver.page_source, 'lxml')
productlist = soup.find_all('div', class_='wrapper_info')
#print(productlist)
productlinks = []
for item in productlist:
    for link in item.find_all('a', href=True):
        productlinks.append(link['href'])
#print(productlinks)
restlist = []
for link in productlinks:
    r = driver.get(link)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    name = soup.find('h1', class_='notranslate').text.strip()
    # address = soup.find('div', class_='address')
    # try:
    #     website = soup.find('a', href=True)
    # except:
    #     website = 'NULL'
    rest = {
        'name': name,
        # 'website': website,
        # 'address': address
    }
    restlist.append(rest)
    print(restlist)
driver.quit()
Edited code that produces an incorrect result:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd
import time
import csv

#driver_service = Service(executable_path="C:/webdrivers/chromedriver.exe")
#driver = webdriver.Chrome(service=driver_service)
driver = webdriver.Chrome()
baseurl = 'https://restaurantguru.com/'
driver.get('https://restaurantguru.com/restaurant-Gostynin-t1')
soup = BeautifulSoup(driver.page_source, 'lxml')
productlist = soup.find_all('div', class_='wrapper_info')
#print(productlist)
productlinks = []
for item in productlist:
    for link in item.find_all('a', href=True):
        productlinks.append(link['href'])
#print(len(productlinks))
restlist = []
for link in productlinks:
    print('[DEBUG] link:', link)
    driver.get(link)
    print('[DEBUG] soup ...')
    soup = BeautifulSoup(driver.page_source, 'lxml')
    print('[DEBUG] name ...')
    name = soup.find('h1', class_='notranslate').text.strip()
    print(name)
    name = driver.find_element(By.XPATH, '//h1[@class="notranslate"]').text.strip()
    print(name)
    print('[DEBUG] address ...')
    address = soup.find('div', class_='address').find('div', class_=False).text.strip()
    print(address)
    address = driver.find_element(By.XPATH, '//div[@class="address"]/div[2]').text.strip()
    print(address)
    print('[DEBUG] website ...')
    try:
        website = soup.find('div', class_='website').find('a').text #get('href')
        print(website)
        website = driver.find_element(By.XPATH, '//div[@class="website"]//a').text #get('href')
        print(website)
    except Exception as ex:
        website = ''
    rest = {
        'name': name,
        'website': website,
        'address': address,
    }
    restlist.append(rest)
print(restlist)
#df = pd.DataFrame(restlist)
#df.to_csv('C:/webdrivers/restauracje.csv')
#print(df.head(10))
driver.quit()
Answer from a uj5u.com user:
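The page contains many <a> elements with an href, so you need a more specific search to get the website.
The website is inside <div class="website">, so you can do
website = soup.find('div', class_='website').find('a').get('href')
but the real link to the restaurant is in the link text, not in the href
website = soup.find('div', class_='website').find('a').text
As for the address, I also had to add an extra .find('div', class_=False) (and .text.strip()) to get it
address = soup.find('div', class_='address').find('div', class_=False).text.strip()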
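The difference between the link text and the href, and the effect of class_=False, can be tried without a browser. The sketch below runs the same soup.find calls on a small hypothetical HTML fragment; the fragment, the address_title class, the redirect URL and the extract_details helper are assumptions made for illustration, and the real page markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that only imitates the structure described above.
SAMPLE_HTML = """
<div class="website">
  <a href="/redirect?id=123">https://www.example-restaurant.pl/</a>
</div>
<div class="address">
  <div class="address_title">Address</div>
  <div>Kanonicza 16, Kraków, Poland</div>
</div>
"""


def extract_details(html):
    """Return the href, the visible link text and the address from a fragment."""
    soup = BeautifulSoup(html, "html.parser")

    website_div = soup.find("div", class_="website")
    link = website_div.find("a") if website_div else None

    address_div = soup.find("div", class_="address")
    # class_=False matches only tags that have no class attribute at all,
    # so the labelled "address_title" div is skipped.
    inner = address_div.find("div", class_=False) if address_div else None

    return {
        "href": link.get("href") if link else "",
        "website": link.text.strip() if link else "",
        "address": inner.text.strip() if inner else "",
    }


details = extract_details(SAMPLE_HTML)
print(details)
```

Checking each find result for None before chaining avoids an AttributeError on pages where the website block is missing.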
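Selenium has its own methods for searching for elements in HTML, and they may run faster. (Selenium 4 removed find_element_by_xpath, so these use find_element with By.XPATH, which needs from selenium.webdriver.common.by import By.)
name = driver.find_element(By.XPATH, '//h1[@class="notranslate"]').text.strip()
address = driver.find_element(By.XPATH, '//div[@class="address"]/div[2]').text.strip()
website = driver.find_element(By.XPATH, '//div[@class="website"]//a').text #get('href')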
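To see what these three XPath expressions select without starting a browser, here is a minimal sketch that evaluates equivalent paths with the standard library's xml.etree.ElementTree instead of Selenium. ElementTree supports only a subset of XPath and needs well-formed markup, and the fragment below is hypothetical, so treat this as an illustration of the path logic rather than of the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment; the real page is far larger.
SAMPLE = """
<body>
  <h1 class="notranslate"> Farina Restaurant </h1>
  <div class="address">
    <div class="address_title">Address</div>
    <div>Kanonicza 16, Kraków</div>
  </div>
  <div class="website">
    <span><a href="/redirect">https://www.farina.com.pl/</a></span>
  </div>
</body>
"""

root = ET.fromstring(SAMPLE.strip())

# ElementTree paths are relative, hence the leading "." before "//".
# h1 with class="notranslate", anywhere in the tree
name = root.find(".//h1[@class='notranslate']").text.strip()
# second <div> child of the div with class="address"
address = root.find(".//div[@class='address']/div[2]").text.strip()
# any <a> nested at any depth under the div with class="website"
website = root.find(".//div[@class='website']//a").text.strip()

print(name, address, website, sep="\n")
```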
Tested with Firefox on Linux.
In the code I kept both approaches: soup.find and Selenium's XPath search.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
import csv

#driver_service = Service(executable_path="C:/webdrivers/chromedriver.exe")
#driver = webdriver.Chrome(service=driver_service)

try:
    driver = webdriver.Firefox()
    driver.get('https://restaurantguru.com/restaurant-Poland-t1')
    soup = BeautifulSoup(driver.page_source, 'lxml')
    productlist = soup.find_all('div', class_='wrapper_info')
    #print(productlist)
    print('[DEBUG] productlist ...')
    productlinks = []
    for item in productlist:
        for link in item.find_all('a', href=True):
            productlinks.append(link['href'])
    print('len(productlinks):', len(productlinks))
    restlist = []
    for link in productlinks:
        print('[DEBUG] link:', link)
        driver.get(link)
        print('[DEBUG] soup ...')
        soup = BeautifulSoup(driver.page_source, 'lxml')
        print('[DEBUG] name ...')
        name = soup.find('h1', class_='notranslate').text.strip()
        print(name)
        # Selenium 4 removed find_element_by_xpath; use find_element(By.XPATH, ...)
        name = driver.find_element(By.XPATH, '//h1[@class="notranslate"]').text.strip()
        print(name)
        print('[DEBUG] address ...')
        address = soup.find('div', class_='address').find('div', class_=False).text.strip()
        print(address)
        address = driver.find_element(By.XPATH, '//div[@class="address"]/div[2]').text.strip()
        print(address)
        print('[DEBUG] website ...')
        try:
            website = soup.find('div', class_='website').find('a').text #get('href')
            print(website)
            website = driver.find_element(By.XPATH, '//div[@class="website"]//a').text #get('href')
            print(website)
        except Exception as ex:
            # some restaurants have no website block
            print('[DEBUG] Exception:', ex)
            website = ''
            print(website)
        rest = {
            'name': name,
            'website': website,
            'address': address,
        }
        print('[DEBUG] rest ...')
        print(rest)
        print('-----')
        restlist.append(rest)
    # --- after `for`-loop ---
    print(restlist)
except KeyboardInterrupt:
    print("KeyboardInterrupt")
finally:
    driver.quit()

# open only once; newline='' and utf-8 are recommended for csv files with non-ASCII text
with open('output.csv', 'w', newline='', encoding='utf-8') as f:
    csv_writer = csv.DictWriter(f, fieldnames=['name', 'website', 'address'])
    csv_writer.writeheader()
    csv_writer.writerows(restlist)
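The CSV step can be checked on its own, without scraping anything. The sketch below writes two rows taken from the result list to a temporary file and reads them back; the write_restaurants and read_restaurants helpers and the temporary path are assumptions made for illustration. It opens the file with an explicit encoding='utf-8' and newline='', which is the usual way to keep Polish characters intact and avoid blank lines between rows on Windows.

```python
import csv
import os
import tempfile

FIELDNAMES = ['name', 'website', 'address']


def write_restaurants(path, rows):
    """Write a list of dicts to a CSV file with a header row."""
    # newline='' lets the csv module control line endings;
    # encoding='utf-8' keeps characters such as "ń" intact.
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)


def read_restaurants(path):
    """Read the CSV file back into a list of dicts."""
    with open(path, newline='', encoding='utf-8') as f:
        return list(csv.DictReader(f))


sample_rows = [
    {'name': 'Pili Pili Gdańsk', 'website': 'http://www.pilipilicafe.com/',
     'address': 'Szafarnia 11/U14, Gdańsk, Pomeranian Voivodeship, Poland'},
    {'name': 'Proza', 'website': '',
     'address': 'Jana Karola Chodkiewicza 7, Rzeszow, Podkarpackie Voivodeship, Poland'},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'output.csv')
    write_restaurants(csv_path, sample_rows)
    round_trip = read_restaurants(csv_path)

print(round_trip == sample_rows)
```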
Result (from print(restlist)):
[
{'name': 'Copernicus Restaurant | Copernicus Hotel', 'website': 'https://www.likusrestauracje.pl/', 'address': 'Kanonicza 16, Kraków, Lesser Poland Voivodeship, Poland'},
{'name': 'Farina Restaurant', 'website': 'https://www.farina.com.pl/', 'address': 'Świętego Marka 16, Kraków, Lesser Poland Voivodeship, Poland'},
{'name': 'Cyrano de Bergerac', 'website': 'http://cyranodebergerac.com.pl', 'address': 'Sławkowska 26, Kraków, Lesser Poland Voivodeship, Poland'},
{'name': 'Amarylis Restaurant', 'website': 'https://www.queenhotel.pl/', 'address': 'Józefa Dietla 60, Kraków, Lesser Poland Voivodeship, Poland'},
{'name': 'Projekt Nano', 'website': '', 'address': 'Podmurna 17 A, Torun, Kuyavian-Pomeranian Voivodeship, Poland'},
{'name': 'Raffles Europejski Warsaw', 'website': 'https://www.raffles.com/warsaw', 'address': 'Nowy Świat-Uniwersytet'},
{'name': 'Caffe Horst', 'website': 'http://www.caffehorst.pl/', 'address': 'Świętochłowicka 6, Bytom, Silesian Voivodeship, Poland'},
{'name': 'Proza', 'website': '', 'address': 'Jana Karola Chodkiewicza 7, Rzeszow, Podkarpackie Voivodeship, Poland'},
{'name': 'Il Posto di Luca Santarossa', 'website': 'http://www.ilposto.pl', 'address': 'Jana Sawy 5/lokal 10, Lublin, Lublin Voivodeship, Poland'},
{'name': 'Balkan Bistro Prespa', 'website': '', 'address': 'Władysława Syrokomli 8, Warsaw, Masovian Voivodeship, Poland'},
{'name': 'Mr Coffee', 'website': '', 'address': 'Tumska 4, Klodzko, Lower Silesian Voivodeship, Poland'},
{'name': 'Bottiglieria 1881 Restaurant', 'website': 'https://www.1881.com.pl/', 'address': 'Bocheńska 5, Kraków, Lesser Poland Voivodeship, Poland'},
{'name': 'Albertina Restaurant & Wine', 'website': 'https://www.albertinarestaurant.pl/', 'address': 'Dominikańska 3, Kraków, Lesser Poland Voivodeship, Poland'},
{'name': 'Pub & Restauracja „W Sercu Łodzi”', 'website': '', 'address': 'al. Marszałka Józefa Piłsudskiego 138, Łódź, Łódź Voivodeship, Poland'},
{'name': '#Alternatywnie', 'website': 'http://www.altcoffee.pl/', 'address': 'aleja Wojska Polskiego 35/u3, Szczecin, West Pomeranian Voivodeship, Poland'},
{'name': 'Aqua e Vino', 'website': 'http://www.aquaevino.pl', 'address': 'Wiślna 5/10, Kraków, Lesser Poland Voivodeship, Poland'},
{'name': 'Pili Pili Gdańsk', 'website': 'http://www.pilipilicafe.com/', 'address': 'Szafarnia 11/U14, Gdańsk, Pomeranian Voivodeship, Poland'},
{'name': 'Kawiarnia Coffeinna', 'website': '', 'address': '1 Maja 26, Jastrzębie-Zdrój, Silesian Voivodeship, Poland'},
{'name': 'Mleczarnia', 'website': 'http://www.mle.pl', 'address': 'Rabina, Beera Meiselsa 20, Kraków, Lesser Poland Voivodeship, Poland'},
{'name': 'La Squadra Ristorante', 'website': 'http://lasquadra.pl/restauracja/', 'address': 'Bocheńskiego 109, Katowice, Silesian Voivodeship, Poland'}
]
