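I am trying to remove the text inside the nav tag from the data I scraped. I tried several approaches, and the extraction itself succeeds. But when I clean up the rest of the data, the text from the nav tag still shows up in the output. I tried both extract and decompose, and they give the same result.
Code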
from bs4 import BeautifulSoup
from selenium import webdriver
import urllib.parse
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.chrome.service import Service

service = Service("/home/ubuntu/selenium_drivers/chromedriver")
options = webdriver.ChromeOptions()
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.3")
options.add_argument("--headless")
options.add_argument('--ignore-certificate-errors')
options.add_argument("--enable-javascript")
options.add_argument('--incognito')
URL = "https://michiganopera.org/season-schedule/frida/"

try:
    driver = webdriver.Chrome(service = service, options = options)
    driver.get(URL)
    driver.implicitly_wait(2)
    html_content = driver.page_source
    driver.quit()
except WebDriverException:
    driver.quit()

soup = BeautifulSoup(html_content, 'html.parser')
z = soup.find("nav",{"class":"nav-main"})
z.extract()
for h in soup.find_all('header'):
    try:
        h.extract()
    except:
        pass
for f in soup.find_all('footer'):
    try:
        f.extract()
    except:
        pass
try:
    cols = soup.find("div",{"class":"modal fade"})
    cols.extract()
except:
    pass
text = soup.getText(separator=u' ')
print(text)
When I run this code, I get the cleaned text, but the last part of it still has to be removed, as shown below.
Part to remove
Sponsors
Email Sign Up View Calendar
Season & Tickets Season at a Glance MOT at Home Upcoming Dance Theatre of Harlem Calendar Ways to save Subscriptions Groups Gift Certificates Box Office How to Avoid Scalper Tickets Plan Your Visit Parking & Directions Sunday Shuttles Dining Cadillac Café Hotels Opera & Dance Talks FAQ Online Boutique PLAN YOUR EVENT Catering & Events Weddings Corporate & Social Event Sky Deck COVID-19 Safety Plan Get Involved Community Events Young Patrons Circle Opera Teens Opera Clubs Ambassadors Volunteers Dance Film Series Learn Summer Programs Operetta Remix Dance Classes Children’s Choruses For Schools Field Trips In-School Performances Classroom Guides Tours Allesee Resource Library Dance Dialogues MOT Learns at Home Support Annual Fund & DiChiera Society Other Ways to Give Planned Giving David DiChiera Artistic Fund Sponsorship Opportunities Why I give to MOT About Us Our History MOT History DOH History Past Seasons David DiChiera Leadership Board of Directors Wayne S. Brown Yuval Sharon Christine Goerke Admin & Staff Our mission Antiracism Statement of Commitment Opera America Member Musicians Orchestra Roster Chorus Roster Children’s Choruses Non-Profit Status Press
I am facing the same problem on several websites. I think I am missing something here.
Thanks in advance.
Reply from a uj5u.com user:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.chrome.service import Service

service = Service("/home/ubuntu/selenium_drivers/chromedriver")
options = webdriver.ChromeOptions()
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.3")
options.add_argument("--headless")
options.add_argument('--ignore-certificate-errors')
options.add_argument("--enable-javascript")
options.add_argument('--incognito')
URL = "https://michiganopera.org/season-schedule/frida/"

# Start with safe defaults so the cleanup below works even if the browser fails to start.
driver = None
html_content = ""
try:
    driver = webdriver.Chrome(service=service, options=options)
    driver.get(URL)
    driver.implicitly_wait(2)
    html_content = driver.page_source
except WebDriverException as exc:
    print(f"Could not load {URL}: {exc}")
finally:
    # Close the browser only if it was actually created.
    if driver is not None:
        driver.quit()

soup = BeautifulSoup(html_content, 'html.parser')

# Remove the main navigation, if the page has one.
z = soup.find("nav", {"class": "nav-main"})
if z is not None:
    z.extract()

# Remove every header and footer element.
for h in soup.find_all('header'):
    h.extract()
for f in soup.find_all('footer'):
    f.extract()

# Remove the modal dialog, if present.
cols = soup.find("div", {"class": "modal fade"})
if cols is not None:
    cols.extract()

text = soup.getText(separator=u' ')

# Keep only the text that comes before the first occurrence of 'Sponsors'.
sep = 'Sponsors'
stripped = text.split(sep, 1)[0]
print(stripped)
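One thing to watch with the split above: text.split(sep, 1)[0] cuts at the first occurrence of 'Sponsors', and if the marker is missing it silently returns the whole text. Below is a minimal sketch of a slightly more defensive variant, assuming the unwanted block always starts at the last occurrence of the marker. The helper name truncate_at_marker and the sample string are made up for illustration and are not part of the reply or of the real page.

```python
def truncate_at_marker(text, marker, use_last=True):
    """Cut `text` just before `marker` and report whether the marker was found.

    Hypothetical helper for illustration only. With use_last=True the cut
    happens at the last occurrence of the marker, otherwise at the first.
    """
    idx = text.rfind(marker) if use_last else text.find(marker)
    if idx == -1:
        # Marker not present: return the text unchanged and flag it.
        return text, False
    return text[:idx].rstrip(), True


# Made-up sample: the marker word also appears inside the real content.
sample = ("Frida Sponsors of the show are listed here. "
          "Sponsors Email Sign Up View Calendar")

kept, found = truncate_at_marker(sample, "Sponsors")
print(kept)
print(found)
```

Returning the found flag makes it easy to log pages where the marker never appears, which may help when the same cleanup is run against several different sites.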
