HTML tag and class problems when doing simple, small-scale scraping with BeautifulSoup

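I'm a beginner trying to get BeautifulSoup to work. I'm having trouble picking out the right HTML classes and tags. I'm close, but something is off: I'm passing the wrong tags and classes when scraping the title, time, link, and text of the news items.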

I want to scrape every headline in the vertical list, and then scrape the date, title, link, and content of each one.

Can you help me find the correct HTML classes and tags?

I don't get any errors, but the Python console stays empty:

>>>

Code

import requests
from bs4 import BeautifulSoup
    
site = requests.get('url')
beautify = BeautifulSoup(site.content,'html5lib')
    
news = beautify.find_all('div', {'class','$00'})
arti = []
    
for each in news:
  time = each.find('span', {'class','hh serif'}).text
  title = each.find('span', {'class','title'}).text
  link = each.a.get('href')
  r = requests.get(url)
  soup = BeautifulSoup(r.text,'html5lib')
  content = soup.find('div', class_ = "read__content").text.strip()
    
  print(" ")   
  print(time)
  print(title)
  print(link)
  print(" ") 
  print(content)
  print(" ")

Answer from a uj5u.com user:

Here is a solution you can try:

import requests
from bs4 import BeautifulSoup

# mock browser request
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
}
site = requests.get('https://www.tuttomercatoweb.com/atalanta/', headers=headers)
soup = BeautifulSoup(site.content, 'html.parser')

news = soup.find_all('div', attrs={"class": "tcc-list-news"})

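for each in news:
    # Assumes each inner <div> of the list container is one news item
    for div in each.find_all("div"):
        print("-- Time ", div.find('span', attrs={'class': 'hh serif'}).text)
        print("-- Href ", div.find("a")['href'])
        print("-- Text ", " ".join([span.text for span in div.select("a > span")]))
        print("-" * 30)  # separator between items, as in the output below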

Output

-- Time  11:36
-- Href  https://www.tuttomercatoweb.com/atalanta/?action=read&idtmw=1661241
-- Text  focus Serie A, punti nel 2022: Juve prima, ma un solo punto in più rispetto a Milan e Napoli
------------------------------
-- Time  11:24
-- Href  https://www.tuttomercatoweb.com/atalanta/?action=read&idtmw=1661233
-- Text  focus Serie A, chi più in forma? Le ultime 5 gare: Sassuolo e Juve in vetta, crisi Venezia
------------------------------
-- Time  11:15
-- Href  https://www.tuttomercatoweb.com/atalanta/?action=read&idtmw=1661229
-- Text  Le pagelle di Cissé: come nelle migliori favole. Dalla seconda categoria al gol in serie A
------------------------------
...
...
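
A note on why the code in the question prints nothing. When `find_all` matches no element it returns an empty list, so the body of the `for` loop never runs and nothing reaches the console. That would explain the empty output if the class being searched for is not present on the page. Separately, `{'class','$00'}` in the question is a Python set, not a dict; the dict form is `{'class': '...'}`. The sketch below illustrates both points on a small inline HTML sample. The sample markup, the example.com links, and the `parse_news` helper are hypothetical and only modelled on the class names used in the answer; the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the class names used in the answer above.
SAMPLE_HTML = """
<div class="tcc-list-news">
  <div>
    <span class="hh serif">11:36</span>
    <a href="https://example.com/news/1"><span>focus</span> <span>First headline</span></a>
  </div>
  <div>
    <span class="hh serif">11:24</span>
    <a href="https://example.com/news/2"><span>Second headline</span></a>
  </div>
  <div>A block with no time and no link</div>
</div>
"""


def parse_news(html, container_class="tcc-list-news"):
    """Return a list of {'time', 'href', 'text'} dicts (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    # attrs must be a dict: {"class": "..."}, not a set like {'class', '...'}
    for container in soup.find_all("div", attrs={"class": container_class}):
        for div in container.find_all("div"):
            time_tag = div.find("span", attrs={"class": "hh serif"})
            link_tag = div.find("a", href=True)
            # find() returns None when nothing matches; skip such blocks
            # instead of failing on None.text
            if time_tag is None or link_tag is None:
                continue
            text = " ".join(
                span.get_text(strip=True) for span in link_tag.find_all("span")
            )
            items.append({
                "time": time_tag.get_text(strip=True),
                "href": link_tag["href"],
                "text": text,
            })
    return items


# A class that does not occur in the markup gives an empty list,
# so a loop over the result would print nothing.
no_match = parse_news(SAMPLE_HTML, container_class="$00")

# The container class from the answer matches the two complete items.
items = parse_news(SAMPLE_HTML)
for item in items:
    print(item["time"], item["href"], item["text"])
```

Also note that the question's loop calls `requests.get(url)` although no `url` variable is defined in the code shown; the freshly extracted `link` is probably what was intended there.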

Edit:

Why are the headers needed here? See "How to use Python requests to fake a browser visit and generate a User-Agent?"
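
As background to that question: without a custom header, requests identifies itself with a `python-requests/<version>` User-Agent, and some sites refuse such clients or serve them different content. That is presumably why the answer sends a browser-like string; whether this particular site behaves that way is an assumption. The minimal sketch below prepares two requests without sending them, so the headers can be compared offline. `build_request` is a hypothetical helper, not part of requests.

```python
import requests

# The User-Agent string used in the answer above
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
)
URL = "https://www.tuttomercatoweb.com/atalanta/"


def build_request(url, user_agent=None):
    """Prepare, but do not send, a GET request (hypothetical helper)."""
    session = requests.Session()
    if user_agent is not None:
        # Overrides the default User-Agent for every request on this session
        session.headers.update({"User-Agent": user_agent})
    return session.prepare_request(requests.Request("GET", url))


default_request = build_request(URL)
browser_request = build_request(URL, BROWSER_UA)

# What the server would see in each case
print(default_request.headers["User-Agent"])  # python-requests/<version>
print(browser_request.headers["User-Agent"])  # the browser-like string
```

Passing `headers=headers` to `requests.get`, as the answer does, has the same effect for a single request; a session is used here only so the header is set once and reused.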


Tags: Python html python-3.x web-scraping beautifulsoup
