I want to remove the header and footer sections from the scraped data, if they are present.
Code:
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.add_argument("--headless")
service = Service("/home/ubuntu/selenium_drivers/chromedriver")
URL = "https://www.uh.edu/kgmca/music/events/calendar/?view=e&id=30723#event"

driver = None
html_content = ""
try:
    driver = webdriver.Chrome(service=service, options=options)
    driver.get(URL)
    driver.implicitly_wait(2)
    html_content = driver.page_source
except WebDriverException as exc:
    print(f"WebDriver error: {exc}")
finally:
    # Quit only if the driver was actually created
    if driver is not None:
        driver.quit()

soup = BeautifulSoup(html_content, "html.parser")
text = soup.get_text(separator=" ")
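I tried removing the tags, but it did not work. How can I do this?
Note: please upvote this question so that I can gain more privileges on Stack Overflow.
Thanks in advance.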
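A uj5u.com user replied:
Option 1:
Just find each element and call .extract() on it.
Option 2:
The <main> tag sits right between the <header> and <footer> tags. If you only want that part, you can write: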
main = soup.find('main')
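As a minimal sketch of Option 2, assuming the page wraps its body content in a single <main> element: the helper name main_text and the sample HTML below are made up for illustration, and the function falls back to the whole document when no <main> is found.

```python
from bs4 import BeautifulSoup


def main_text(html: str) -> str:
    """Return the text inside <main>, or of the whole document if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    main = soup.find("main")
    # Hypothetical fallback: use the full document when <main> is missing
    target = main if main is not None else soup
    return target.get_text(separator=" ", strip=True)


# Made-up sample page, not the real calendar HTML
sample = (
    "<html><body><header>Site menu</header>"
    "<main><h1>Concert</h1><p>Friday at 7 pm</p></main>"
    "<footer>Contact us</footer></body></html>"
)
print(main_text(sample))  # Concert Friday at 7 pm
```

With this approach nothing has to be removed: the header and footer are simply never read.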
Also, is there a reason you are using Selenium? Wouldn't simply using requests solve the problem?
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.add_argument("--headless")
service = Service("/home/ubuntu/selenium_drivers/chromedriver")
URL = "https://www.uh.edu/kgmca/music/events/calendar/?view=e&id=30723#event"

driver = None
html_content = ""
try:
    driver = webdriver.Chrome(service=service, options=options)
    driver.get(URL)
    driver.implicitly_wait(2)
    html_content = driver.page_source
except WebDriverException as exc:
    print(f"WebDriver error: {exc}")
finally:
    # Quit only if the driver was actually created
    if driver is not None:
        driver.quit()

soup = BeautifulSoup(html_content, "html.parser")

# Remove the header and footer first, then collect the remaining text
for each in ['header', 'footer']:
    s = soup.find(each)
    if s is not None:
        s.extract()

text = soup.get_text(separator=" ")
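On the question of whether Selenium is needed at all: if the event details are already in the HTML the server returns (an assumption not verified here; pages that build their content with JavaScript still need a browser), a requests-based version might look like the sketch below. The function names fetch_html and strip_header_footer, the timeout value, and the sample HTML are hypothetical.

```python
import requests
from bs4 import BeautifulSoup

URL = "https://www.uh.edu/kgmca/music/events/calendar/?view=e&id=30723#event"


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page with requests; raises on HTTP error status codes."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def strip_header_footer(html: str) -> str:
    """Remove every <header> and <footer> element and return the remaining text."""
    soup = BeautifulSoup(html, "html.parser")
    for element in soup.find_all(["header", "footer"]):
        element.extract()
    return soup.get_text(separator=" ", strip=True)


# Made-up sample so the sketch runs without network access
sample = "<header>Menu</header><main><p>Event details</p></main><footer>Contact</footer>"
print(strip_header_footer(sample))  # Event details
```

To try it against the live page, call strip_header_footer(fetch_html(URL)). If the text comes back without the event details, the page is probably rendered client-side and Selenium is still the right tool.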
