This is my first time using Selenium and doing web scraping.
Reply from a uj5u.com user:
* The website is protected by Cloudflare:
https://www.fastfoodmenuprices.com/baskin-robbins-prices/ is using Cloudflare CDN/Proxy!
https://www.fastfoodmenuprices.com/baskin-robbins-prices/ is using Cloudflare SSL!
** So I had to use the following options to avoid detection:
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
# Pass both switches in one call; a second call with the same key replaces the first.
options.add_experimental_option("excludeSwitches", ["enable-automation", "enable-logging"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument('--disable-blink-features=AutomationControlled')
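*** To select the table's tr and td elements, I use CSS selectors, which are more robust and flexible.
**** I had to wrap the two lists in list(zip(...)) when building the pandas DataFrame, because otherwise pandas complained that they did not have the same shape.
***** I had to use try/except because some rows are missing the menu or price cell.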
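A minimal sketch of what zip does when the two lists differ in length, and of a per-row alternative that keeps each price next to its own menu label. The sample values and the `rows` structure are hypothetical and are not taken from the site.

```python
from itertools import zip_longest

# Hypothetical scraped columns: the last price has no menu label.
prices = ["$2.80", "$4.84", "$7.03"]
menus = ["Mini", "Small"]

# zip() stops at the shorter list, so the last price is silently dropped.
truncated = list(zip(prices, menus))

# zip_longest() keeps every entry and pads the shorter list at the end.
padded = list(zip_longest(prices, menus, fillvalue=""))

# Padding only helps when the gap is at the end. Collecting both cells
# per row keeps the pairs aligned even when a middle row lacks a cell.
rows = [
    {"menu": "Mini", "price": "$2.80"},
    {"price": "$7.03"},  # hypothetical row with no menu cell
    {"menu": "Small", "price": "$4.84"},
]
aligned = [(row.get("price", ""), row.get("menu", "")) for row in rows]

print(truncated)
print(padded)
print(aligned)
```

With two independent lists, a missing cell in the middle of the table would shift every later label by one position, so the per-row form is usually the safer one.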
Script:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import Select
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import pandas as pd
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
# Pass both switches in one call; a second call with the same key replaces the first.
options.add_experimental_option("excludeSwitches", ["enable-automation", "enable-logging"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument('--disable-blink-features=AutomationControlled')

# Selenium 4 takes the driver path through a Service object.
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
url = "https://www.fastfoodmenuprices.com/baskin-robbins-prices/"
driver.get(url)
# Wait for the dropdown to become visible, then pick an option by its value attribute.
Select(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//select[@class='tp-variation']")))).select_by_value("MS4yOA==")
price = []
menu = []
soup = BeautifulSoup(driver.page_source, "lxml")
driver.quit()

for element in soup.select('#tablepress-34 tbody tr'):
    # select_one() returns None when a cell is absent, so .text raises AttributeError.
    try:
        menus = element.select_one('td:nth-child(2)').text
        menu.append(menus)
    except AttributeError:
        pass
    try:
        prices = element.select_one('td:nth-child(3) span').text
        price.append(prices)
    except AttributeError:
        pass

df = pd.DataFrame(data=list(zip(price, menu)), columns=['price', 'menu'])
print(df)
webdriver-manager
Output:
price menu
0 $2.80 Mini
1 $4.84 Small
2 $5.61 Medium
3 $7.65 Large
4 $2.02 Kids
5 $2.53 Regular
6 $3.81 Large
7 $2.80 Mini
8 $6.39 Regular
9 $7.03
10 $7.03
11 $8.56
12 $7.67
13 $7.67
14 $7.67
15 $7.67
16 $4.47
17 $5.75
18 $6.64
19 $1.01
20 $1.27
21 $2.80
22 $3.57
23 $5.11
24 $1.27
25 $1.91
26 $1.91
27 $4.72 Mini
28 $6.00 Small
29 $7.28 Medium
30 $8.56 Large
31 $4.72 Mini
32 $6.00 Small
33 $7.28 Medium
34 $8.56 Large
35 $0.64
36 $4.72 Mini
37 $6.00 Small
38 $7.28 Medium
39 $8.56 Large
40 $4.72 Mini
41 $6.00 Small
42 $7.28 Medium
43 $8.56 Large
44 $7.67 Quart
45 $6.39 Pint
46 $10.23 Quart
47 $3.70
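A side note on the `select_by_value("MS4yOA==")` call in the script: the value looks like Base64, and decoding it with the standard library gives a plain number. The sketch below only decodes the string; what the site does with that number is not stated here, so any interpretation of it is an assumption.

```python
import base64

# Option value copied from the script's select_by_value() call.
option_value = "MS4yOA=="

# Decode the Base64 text back into the string it encodes.
decoded = base64.b64decode(option_value).decode("ascii")
print(decoded)  # 1.28

# Encoding the decoded text again reproduces the original option value.
round_trip = base64.b64encode(decoded.encode("ascii")).decode("ascii")
print(round_trip == option_value)  # True
```

If the encoded values on the page ever change, selecting by the displayed option text with `Select.select_by_visible_text()` may be an alternative; the exact option text is not shown in this thread.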
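Reply from a uj5u.com user:
To extract the table contents after selecting California, you need to induce WebDriverWait for visibility_of_element_located() and load the table into a Pandas DataFrame. You can use the following Locator Strategies:
Code block: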
from io import StringIO
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

options = Options()
options.add_argument("start-maximized")
# Pass both switches in one call; a second call with the same key replaces the first.
options.add_experimental_option("excludeSwitches", ["enable-automation", "enable-logging"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument('--disable-blink-features=AutomationControlled')
s = Service('C:\\BrowserDrivers\\chromedriver.exe')
driver = webdriver.Chrome(service=s, options=options)
driver.get("https://www.fastfoodmenuprices.com/baskin-robbins-prices")
# Wait for the dropdown to become visible, then pick an option by its value attribute.
Select(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//select[@class='tp-variation']")))).select_by_value("MS4yOA==")
# Grab the table's HTML once it is visible.
tabledata = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@id='tablepress-34']"))).get_attribute("outerHTML")
# Wrap the HTML in StringIO; recent pandas versions deprecate passing a raw HTML string.
tabledf = pd.read_html(StringIO(tabledata))
print(tabledf)
Console output:
[ Food ... Price
0 Soft Serve Flavors: Reese’s, Heath, Snickers, ... ... Soft Serve Flavors: Reese’s, Heath, Snickers, ...
1 Soft Serve Below ... $2.80
2 Soft Serve Below ... $4.84
3 Soft Serve Below ... $5.61
4 Soft Serve Below ... $7.65
5 Cups & Cones ... $2.02
6 Cups & Cones ... $2.53
7 Cups & Cones ... $3.81
8 Parfaits ... $2.80
9 Parfaits ... $6.39
10 Sundaes ... Sundaes
11 Banana Royale ... $7.03
12 Brownie ... $7.03
13 Banana Split ... $8.56
14 Reese’s Peanut Butter Cup Sundae ... $7.67
15 Chocolate Chip Cookie Dough Sundae ... $7.67
16 Oreo? Layered Sundae ... $7.67
17 Made with Snickers Sundae ... $7.67
18 One Scoop Sundae ... $4.47
19 Two Scoops Sundae ... $5.75
20 Three Scoops Sundae ... $6.64
21 Candy Topping ... $1.01
22 Waffle Bowl ... $1.27
23 Ice Cream ... Ice Cream
24 Kid’s Scoop ... $2.80
25 Single Scoop ... $3.57
26 Double Scoop ... $5.11
27 Regular Waffle Cone ... $1.27
28 Chocolate Waffle Cone ... $1.91
29 Fancy Waffle Cone ... $1.91
30 Beverages ... Beverages
31 Cappuccino Blast ... $4.72
32 Cappuccino Blast ... $6.00
33 Cappuccino Blast ... $7.28
34 Cappuccino Blast ... $8.56
35 Iced Cappy Blast ... $4.72
36 Iced Cappy Blast ... $6.00
37 Iced Cappy Blast ... $7.28
38 Iced Cappy Blast ... $8.56
39 Add a Boost (Cappuccino or Iced Cappy Blast) ... $0.64
40 Smoothie ... $4.72
41 Smoothie ... $6.00
42 Smoothie ... $7.28
43 Smoothie ... $8.56
44 Shake ... $4.72
45 Shake ... $6.00
46 Shake ... $7.28
47 Shake ... $8.56
48 Ice Cream To Go ... Ice Cream To Go
49 Pre-Packed ... $7.67
50 Hand-Packed ... $6.39
51 Hand-Packed ... $10.23
52 Clown Cones ... $3.70
[53 rows x 3 columns]]
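In the console output, the section-header rows (Sundaes, Ice Cream, Beverages and so on) repeat the section name in the Price column, and the real prices are still strings such as $2.80. A minimal sketch of one way to drop those rows and convert the prices to numbers, using only the standard library; the `rows` list holds a few pairs copied from the output, and `parse_price` is a hypothetical helper, not part of either answer.

```python
import re
from decimal import Decimal

# A few (food, price) pairs copied from the console output.
rows = [
    ("Soft Serve Below", "$2.80"),
    ("Sundaes", "Sundaes"),  # section-header row
    ("Banana Royale", "$7.03"),
    ("Hand-Packed", "$10.23"),
]

# A price is a dollar sign followed by digits, a dot and two decimals.
PRICE_RE = re.compile(r"^\$(\d+\.\d{2})$")


def parse_price(text):
    """Return the price as a Decimal, or None if text is not a price."""
    match = PRICE_RE.match(text.strip())
    return Decimal(match.group(1)) if match else None


# Parse every row, then keep only the rows whose price cell parsed.
parsed = [(food, parse_price(price)) for food, price in rows]
items = [(food, price) for food, price in parsed if price is not None]
print(items)
```

Decimal is used instead of float so that the amounts stay exact; the same filtering idea should carry over to the DataFrame returned by read_html, though that is not shown here.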
Please credit the source when reposting. Article link: https://www.uj5u.com/qianduan/460072.html
