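I am trying to scrape the following website (https://iltacon2022.expofp.com/) and I keep getting the following message (full output printed below). I'm not sure what the problem is, and I was wondering whether anyone could help.
if (window.navigator.userAgent.indexOf("Trident/") !== -1) {
alert("Your are using old unsupported Internet Explorer browser.\nPlease upgrade to view this page properly.");
}
I have tried both the selenium and requests modules, but I seem to run into the same problem with each.
Code trial: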
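from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
import random
import requests
import time  # needed for time.sleep() below; missing from the original imports
options = Options()
options.headless = False  # run with a visible browser window
driver = webdriver.Firefox(options=options)
url = "https://iltacon2022.expofp.com/"
driver.get(url)
# fixed pause to give the page's scripts time to run
time.sleep(6)
# parse whatever HTML the browser currently holds
soup = bs(driver.page_source, 'lxml')
driver.quit()
print(soup)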
Output:
<html lang="en"><head>
<meta charset="utf-8"/>
<link href="https://iltacon2022.expofp.com/packages/master/favicon.png" rel="shortcut icon"/>
<meta content="user-scalable=no, initial-scale=1.0, maximum-scale=1.0, width=device-width" name="viewport"/>
<!-- <meta name="theme-color" content="#000000" /> -->
<title>ILTACON2022 – Gaylord National Resort and Convention Center | August 22–25, 2022 | Monday – Thursday – Expo Floor Plan by ExpoFP</title>
<script>
if (window.navigator.userAgent.indexOf("Trident/") !== -1) {
alert("Your are using old unsupported Internet Explorer browser.\nPlease upgrade to view this page properly.");
}
</script>
<style>
html,
body {
touch-action: none;
margin: 0;
padding: 0;
height: 100%;
width: 100%;
background: #ebebeb;
position: fixed;
overflow: hidden;
}
@media (max-width: 820px) and (min-width: 500px) {
html {
font-size: 13px;
}
}
</style>
<style>
.lds-grid {
top: 42vh;
margin: 0 auto;
display: block;
position: relative;
width: 64px;
height: 64px;
}
.lds-grid div {
position: absolute;
width: 13px;
height: 13px;
background: #aaa;
border-radius: 50%;
/* border: solid 1px #fff; */
animation: lds-grid 1.2s linear infinite;
}
.lds-grid div:nth-child(1) {
top: 6px;
left: 6px;
animation-delay: 0s;
}
.lds-grid div:nth-child(2) {
top: 6px;
left: 26px;
animation-delay: -0.4s;
}
.lds-grid div:nth-child(3) {
top: 6px;
left: 45px;
animation-delay: -0.8s;
}
.lds-grid div:nth-child(4) {
top: 26px;
left: 6px;
animation-delay: -0.4s;
}
.lds-grid div:nth-child(5) {
top: 26px;
left: 26px;
animation-delay: -0.8s;
}
.lds-grid div:nth-child(6) {
top: 26px;
left: 45px;
animation-delay: -1.2s;
}
.lds-grid div:nth-child(7) {
top: 45px;
left: 6px;
animation-delay: -0.8s;
}
.lds-grid div:nth-child(8) {
top: 45px;
left: 26px;
animation-delay: -1.2s;
}
.lds-grid div:nth-child(9) {
top: 45px;
left: 45px;
animation-delay: -1.6s;
}
@keyframes lds-grid {
0%,
100% {
opacity: 1;
}
50% {
opacity: 0.5;
}
}
</style>
<link as="script" href="https://iltacon2022.expofp.com/data/data.js" rel="preload"/>
<link as="script" href="https://iltacon2022.expofp.com/data/fp.svg.js" rel="preload"/>
<link as="script" href="https://iltacon2022.expofp.com/packages/master/floorplan.js" rel="preload"/>
<link as="script" href="https://iltacon2022.expofp.com/packages/master/vendors~floorplan.js" rel="preload"/>
<link as="style" href="https://iltacon2022.expofp.com/packages/master/vendor/fa/css/fontawesome-all.min.css" rel="preload"/>
<link as="style" href="https://iltacon2022.expofp.com/packages/master/vendor/sanitize-css/sanitize.css" rel="preload"/>
<link as="style" href="https://iltacon2022.expofp.com/packages/master/vendor/perfect-scrollbar/css/perfect-scrollbar.css" rel="preload"/>
<!-- Fonts are anonymous because those will be loaded with FontFace -->
<link as="font" crossorigin="anonymous" href="https://iltacon2022.expofp.com/packages/master/vendor/fa/webfonts/fa-regular-400.woff2" rel="preload"/>
<link as="font" crossorigin="anonymous" href="https://iltacon2022.expofp.com/packages/master/vendor/fa/webfonts/fa-solid-900.woff2" rel="preload"/>
<link as="font" crossorigin="anonymous" href="https://iltacon2022.expofp.com/packages/master/vendor/fa/webfonts/fa-light-300.woff2" rel="preload"/>
<link as="font" crossorigin="anonymous" href="https://iltacon2022.expofp.com/packages/master/fonts/oswald-v17-cyrillic_latin-500.woff2" rel="preload"/>
<link as="font" crossorigin="anonymous" href="https://iltacon2022.expofp.com/packages/master/fonts/oswald-v17-cyrillic_latin-300.woff2" rel="preload"/>
<script src="https://iltacon2022.expofp.com/data/data.js"></script><script src="https://iltacon2022.expofp.com/data/wf.data.js"></script><script src="https://iltacon2022.expofp.com/data/fp.svg.js"></script><script charset="utf-8" src="https://iltacon2022.expofp.com/packages/master/vendors~floorplan.js"></script><script charset="utf-8" src="https://iltacon2022.expofp.com/packages/master/floorplan.js"></script></head>
<body>
<noscript>You need to enable JavaScript to run this app.</noscript>
<div class="expofp-floorplan" data-event-id="iltacon2022"><div></div></div>
<script src="https://iltacon2022.expofp.com/packages/master/expofp.js"></script>
</body></html>
Answer from a uj5u.com user:
Your task is not trivial. Here is one possible solution:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
import time as t
import pandas as pd

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-notifications")
chrome_options.add_argument("--window-size=1280,720")
webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
actions = ActionChains(browser)

url = 'https://iltacon2022.expofp.com/'
browser.get(url)
c_list = []

# wait for the inner div that hosts the shadow root, then step into it
parent_el = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.XPATH, '//div[@data-event-id="iltacon2022"]/div')))
parent_el_shadow_root = parent_el.shadow_root
t.sleep(5)
# NOTE: the attribute selector on this line is empty in the source ('div[]'), which is
# not valid CSS. companies_div is never used below, so the line is left commented out.
# companies_div = parent_el_shadow_root.find_element(By.CSS_SELECTOR, 'div[]')

while True:
    try:
        # collect the exhibitor rows currently present inside the shadow root
        companies = parent_el_shadow_root.find_elements(By.CSS_SELECTOR, "a[class = 'exhibitor-row list-row ']")
        for c in companies:
            if len(c.text) > 3:
                c_list.append((c.text.replace('\n', ': '), c.get_attribute('href')))
        print(f'we found {len(c_list)} companies')
        # move to the next row and page down to bring more rows into view
        actions.move_to_element(companies[len(c_list)]).perform()
        print("moving to element", companies[len(c_list)].text.replace('\n', ': '))
        t.sleep(1)
        companies[len(c_list)].send_keys(Keys.PAGE_DOWN)
        print('scrolled page down')
        t.sleep(2)
    except Exception as e:
        # any error (for example indexing past the end of the list) ends the loop
        print('all done')
        break

# set() drops the duplicates gathered across loop passes
df = pd.DataFrame(list(set(c_list)), columns = ['Company', 'Url'])
df.to_csv('surveillance_capitalists.csv')
print(df)
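Because of the way the shadow root is located in the code above, it is important to use Chrome/chromedriver. The setup above is for Linux, but you can build a working selenium/chromedriver setup on your own machine; after that you only need to mind the imports and the code that follows the browser/driver definition. The printout in the terminal is quite verbose and tells you what is happening; at the end it prints a dataframe containing the companies and their respective URLs (also saved to disk as a CSV file). You can then scrape those URLs; just make sure to inspect each page properly and to locate the shadow root and the elements inside it. The Selenium documentation can be found at https://www.selenium.dev/documentation/
If you have any questions, leave a comment here or ask in the Selenium chat room, which I find very helpful.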
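The clean-up step at the end of the answer (filter short texts, replace newlines, drop duplicates, write a CSV) can be tried without a browser. The following is only a minimal sketch using the standard library instead of pandas; the helper names clean_rows and rows_to_csv and the sample data are hypothetical and do not come from the site. Unlike list(set(...)), it keeps the rows in first-seen order.

```python
import csv
import io


def clean_rows(raw_rows, min_len=3):
    """Normalize (text, href) pairs and drop duplicates, keeping first-seen order."""
    cleaned = []
    for text, href in raw_rows:
        if len(text) > min_len:  # same length filter as the answer's loop
            cleaned.append((text.replace("\n", ": "), href))
    # dict.fromkeys keeps the first occurrence of each pair, in order
    return list(dict.fromkeys(cleaned))


def rows_to_csv(rows):
    """Return the rows as CSV text with the same column names as the answer."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Company", "Url"])
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample data, not taken from the real site
sample = [
    ("Acme Corp\nBooth 101", "https://example.com/acme"),
    ("", None),
    ("Acme Corp\nBooth 101", "https://example.com/acme"),
    ("Beta LLC\nBooth 202", "https://example.com/beta"),
]
rows = clean_rows(sample)
print(rows_to_csv(rows))
```

Feeding the (text, href) tuples collected by the Selenium loop into clean_rows should give the same set of companies as the dataframe, just in a stable order.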
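Answer from a uj5u.com user:
At times the AUT (Application Under Test) tries to detect, using jQuery, whether Internet Explorer is the browser being used to access the application.
According to the discussion on jQuery failing to detect IE 11, Internet Explorer 10 was detected correctly but Internet Explorer 11 was not, because it uses a different userAgent:
Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv 11.0) like Gecko
The proposed beta solution was:
if (!!navigator.userAgent.match(/Trident\/7\./))
    return "ie";
This does not seem to have been adopted. However, a modified solution was implemented: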
<script>
if (window.navigator.userAgent.indexOf("Trident/") !== -1) {
alert("Your are using old unsupported Internet Explorer browser.\nPlease upgrade to view this page properly.");
}
</script>
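This is what you observe inside the <script> tag. It means that if the user agent contains the string Trident/, the browser is Internet Explorer and the page shows an alert asking you to upgrade to a supported browser.
Conclusion
If you are using Internet Explorer, you may observe the effect of this check; otherwise you can safely ignore it, as it won't affect your tests.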
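The check in the page's script can be mirrored in Python to see which user agents would trigger the alert. This is only an illustrative sketch: the function name triggers_ie_alert is hypothetical, and the Firefox user agent string is an assumed example rather than one taken from the question.

```python
def triggers_ie_alert(user_agent: str) -> bool:
    """Mirror of the page's check: indexOf("Trident/") !== -1."""
    return "Trident/" in user_agent


# IE 11 user agent quoted in this answer
ie11_ua = "Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv 11.0) like Gecko"
# Assumed example of a Firefox user agent, for illustration only
firefox_ua = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"

print(triggers_ie_alert(ie11_ua))
print(triggers_ie_alert(firefox_ua))
```

A Firefox or Chrome session driven by Selenium does not send a user agent containing Trident/, so the script block is present in the page source but its alert never runs.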
