How to keep iterating to the next page in Python with BeautifulSoup

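The code below parses the first page of the site. It picks up all the article links, but it also picks up the link to the next page.

Looking at the structure of the site, the link to the next page has this form: https://slow-communication.jp/news/?pg=2.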

import re
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

main_url = 'https://slow-communication.jp'
req = Request(main_url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()

soup = BeautifulSoup(webpage, "lxml")

for link in soup.findAll('a'):
    _link = str(link.get('href'))
    if '/news/' in _link:
        artice_id = _link.split("/news/")[-1]
        if len(artice_id) > 0:
            print(_link)

With this code, I get

https://slow-communication.jp/news/3589/
https://slow-communication.jp/news/3575/
https://slow-communication.jp/news/3546/
https://slow-communication.jp/news/?pg=2

What I want is to keep every article link and then continue on to the next page. So I would keep

https://slow-communication.jp/news/3589/
https://slow-communication.jp/news/3575/
https://slow-communication.jp/news/3546/

and then visit https://slow-communication.jp/news/?pg=2 and do the same thing again, until the site has no more next pages.

How can I do that?

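Reply from a uj5u.com user:

You can paginate with a for loop and the range function, together with the format method. The answerer claims this style of pagination is about twice as fast as the alternatives, though no measurement is given. Increase or decrease the page range as needed.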

import re
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

main_url = 'https://slow-communication.jp/news/?pg={page}'
for page in range(1,11):

    req = Request(main_url.format(page=page), headers={'User-Agent': 'Mozilla/5.0'})
    webpage = urlopen(req).read()

    soup = BeautifulSoup(webpage, "lxml")

    for link in soup.findAll('a'):
        _link = str(link.get('href'))
        if '/news/' in _link:
            artice_id = _link.split("/news/")[-1]
            if len(artice_id) > 0:
                print(_link)
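
The fixed range(1, 11) above assumes the site has exactly ten pages. A minimal sketch of the same page-number approach that stops by itself is shown below. It rests on two assumptions that are not confirmed by the thread: article URLs end in /news/<digits>/ (as in the output in the question), and a page number past the end returns either no articles or only articles already seen. The names fetch, article_links, crawl, ARTICLE_RE and max_pages are made up for illustration.

```python
import re
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

BASE_URL = 'https://slow-communication.jp/news/?pg={page}'
# Assumption: article URLs end in /news/<digits>/, unlike the ?pg= pagination link.
ARTICLE_RE = re.compile(r'/news/\d+/?$')


def fetch(url):
    """Download one page and return its raw HTML."""
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    return urlopen(req).read()


def article_links(html):
    """Return the article links found in one page, in order, without duplicates."""
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    for a in soup.find_all('a', href=True):
        href = a['href']
        if ARTICLE_RE.search(href) and href not in links:
            links.append(href)
    return links


def crawl(fetch_page=fetch, max_pages=50):
    """Walk ?pg=1, ?pg=2, ... and stop at the first page that adds nothing new."""
    collected = []
    for page in range(1, max_pages + 1):
        found = article_links(fetch_page(BASE_URL.format(page=page)))
        new = [link for link in found if link not in collected]
        if not new:
            break  # empty page, or a repeat of an earlier page
        collected.extend(new)
    return collected
```

Calling crawl() with no arguments would request the live site; passing a different fetch_page function (for example one that reads saved HTML files) lets the loop be tried offline. max_pages is only a safety cap in case the stop condition never triggers.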

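Reply from a uj5u.com user:

You can set the number of pages to scrape. If there is no next page, the function returns all the news articles it has found so far.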

import requests
from bs4 import BeautifulSoup

LINK = "https://slow-communication.jp/news/"

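def get_news(link, pages=1, news=None):
    # Use None rather than a mutable default so separate calls do not share one list.
    if news is None:
        news = []
    if pages == 0:
        return news

    res = requests.get(link, headers={'User-Agent': 'Mozilla/5.0'})
    if res.status_code == 200:
        print("getting posts from", link)
        posts, link = extract_news_and_link(res.text)
        news.extend(posts)
        if link:
            return get_news(link, pages-1, news)
        return news
    else:
        print("error getting news")
        # Return what was collected so far instead of an implicit None.
        return news

def extract_news_and_link(html):
    soup = BeautifulSoup(html, "html.parser")
    news = [post.get("href") for post in soup.select(".post-arc")]
    # select_one returns None when the page has no next-page link.
    next_link = soup.select_one("main > a")
    if next_link and next_link.get("href"):
        return news, next_link.get("href")
    return news, None
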

def main():
    news = get_news(LINK, 10)
    print("Posts:")
    for post in news:
        print(post)

if __name__ == "__main__":
    main()

Reply from a uj5u.com user:

You can use a while loop to move on to each next page, and break out of it when no further next page is available:

while True:
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    webpage = urlopen(req).read()
    soup = BeautifulSoup(webpage, "lxml")
    
    ###perform some action
    
    if soup.select_one('a[href*="?pg="]'):
        url = soup.select_one('a[href*="?pg="]')['href']
        print(url)
    else:
        break
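
One thing to watch with this loop: the selector a[href*="?pg="] matches any pagination link, so if a page also links back to the previous page the loop could bounce between two URLs forever, and a relative href such as ?pg=2 cannot be passed to Request directly. The sketch below is one possible guard, not the answerer's method. It assumes nothing about the site beyond the ?pg= pattern; walk_pages, fetch and max_pages are hypothetical names.

```python
from collections import deque
from urllib.parse import urljoin
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def fetch(url):
    """Download one page and return its raw HTML."""
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    return urlopen(req).read()


def walk_pages(start_url, fetch_page=fetch, max_pages=100):
    """Yield (url, soup) for every pagination URL reachable from start_url, once each."""
    seen = {start_url}
    queue = deque([start_url])
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        soup = BeautifulSoup(fetch_page(url), 'html.parser')
        fetched += 1
        yield url, soup
        for a in soup.select('a[href*="?pg="]'):
            # Resolve relative links against the page they were found on.
            candidate = urljoin(url, a['href'])
            if candidate not in seen:
                seen.add(candidate)
                queue.append(candidate)
```

A loop such as `for url, soup in walk_pages('https://slow-communication.jp/news/'):` would then replace the while True block. Because the start page and ?pg=1 may show the same articles, it is still worth de-duplicating the collected article URLs.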

You can also collect some data and store it in a structured way in a global list:

for a in soup.select('a.post-arc'):
    data.append({
        'title':a.h2.text,
        'url':a['href']
    })
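
The snippet above raises an AttributeError if an a.post-arc element has no h2 inside it, and a KeyError if it has no href. A slightly more defensive sketch, written as functions that take an HTML string so they can be tried without network access, might look like this. extract_posts and merge_posts are hypothetical helper names, and the a.post-arc selector is taken from the answer, not verified against the site.

```python
from bs4 import BeautifulSoup


def extract_posts(html):
    """Return one {'title', 'url'} dict per a.post-arc element that has an href."""
    soup = BeautifulSoup(html, 'html.parser')
    posts = []
    for a in soup.select('a.post-arc'):
        url = a.get('href')
        if not url:
            continue  # skip anchors without a link target
        # Fall back to the anchor's own text when there is no <h2> inside it.
        heading = a.find('h2')
        source = heading if heading is not None else a
        posts.append({'title': source.get_text(strip=True), 'url': url})
    return posts


def merge_posts(data, posts):
    """Append posts to data in place, skipping URLs that are already present."""
    known = {row['url'] for row in data}
    for post in posts:
        if post['url'] not in known:
            known.add(post['url'])
            data.append(post)
    return data
```

Inside the page loop, `merge_posts(data, extract_posts(webpage))` would take the place of the for loop over soup.select('a.post-arc').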

Example

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import pandas as pd

main_url = 'https://slow-communication.jp'
url = main_url

data = []

while True:
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    webpage = urlopen(req).read()
    soup = BeautifulSoup(webpage, "lxml")
    
    for a in soup.select('a.post-arc'):
        data.append({
            'title':a.h2.text,
            'url':a['href']
        })
    
    if soup.select_one('a[href*="?pg="]'):
        url = soup.select_one('a[href*="?pg="]')['href']
        print(url)
    else:
        break
        
pd.DataFrame(data)

Output

	title	url
0	都立高校から「ブラック校則」がなくなる	https://slow-communication.jp/news/3589/
1	北京パラリンピックがおわった	https://slow-communication.jp/news/3575/
2	「優生保護法で手術された人に國はおわびのお金を払え」という判決が出た	https://slow-communication.jp/news/3546/
3	ロシアがウクライナを攻撃している	https://slow-communication.jp/news/3535/
4	東京都が「同性パートナーシップ制度」を作る	https://slow-communication.jp/news/3517/


Tags: Python, python-3.x, web-scraping, beautifulsoup
