Getting duplicate results when scraping a website

I have the script below, which scrapes all the information from a website. However, when I run it, I get duplicate blog records.

import requests
import pandas as pd
import numpy as np
from time import sleep
from random import randint
from bs4 import BeautifulSoup
from urllib.parse import urlparse
import re

blog_topics = []
page = "https://www.bartonassociates.com/blog/"
soup = BeautifulSoup(requests.get(page).content, 'html.parser')
for link in soup.find_all(href=re.compile("/blog/tag")):
    url = link.get('href')
    # Keep topic links only; skip paginated topic URLs such as /blog/tag/p2
    if '/blog/tag/p' not in urlparse(link.get('href')).path:
        blog_topics.append(url)
    else:
        pass
 
# VARIABLE TO DEFINE A RANGE BASED ON NO. OF PAGES
pages = np.arange(1)

# DEFINING CUSTOM VARIABLES
title_blognames_links_ = []
author_and_dates_ = []

# LOOP TO RETRIEVE TITLE, BLOG NAMES, LINKS, AUTHORS AND DATE PUBLISHED
for page in pages:
    for blogs in blog_topics:
        blog_url = blogs + '/p' + str(page)
        sleep(randint(2, 7))
        soup = BeautifulSoup(requests.get(blog_url).content, 'html.parser')

        # Information about the title, the blog name, and their links
        for h4 in soup.select("h4"):
            for h2 in soup.select("h2"):
                title_blognames_links_.append((h4.get_text(strip=True), h4.a["href"], h2.get_text(strip=True).replace('"', "")[11:]))

        # Information about the author and the date
        for tag in soup.find_all(class_="author"):
            author_and_dates_.append(tag.get_text(strip=True))

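I believe this has something to do with the range I pass in pages = np.arange(1). P.S. The (1) is just a placeholder; I have already tried (1,17), (1), and (2).

Background: the maximum number of pages for one of my blog topics is 17, and each topic has about 10 blogs.

What I am looking for is a way to get all unique blog information from all blog topics.

I am not sure what I am doing wrong here.
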
Answer from a uj5u.com user:

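To get all the information from all topics, you can first collect all the topic links (as you already do in your code), then for each topic fetch every page and all of its entries (rather than the other way around):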

import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.bartonassociates.com/blog"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# Topic links: the <a> elements in the <ul> right after the "Blog Topics" heading
topics = [
    a["href"] for a in soup.select('h3:-soup-contains("Blog Topics") + ul a')
]

all_data = []
for t in topics:
    # Walk through one topic page by page until there is no "View More" link
    while True:
        soup = BeautifulSoup(requests.get(t).content, "html.parser")
        # The topic name is the quoted part of the page's <h2> heading
        topic_name = re.search(
            r'"([^"]+)"', soup.select_one("h2").get_text(strip=True)
        ).group(1)

        for entry in soup.select(".blog-entry"):
            title = entry.h4.get_text(strip=True)
            link = entry.a["href"]
            tmp = entry.select_one(".author").get_text(strip=True)
            if tmp:
                # The author block has the form "author | date"
                author, date = map(str.strip, tmp.split("|"))
            else:
                author, date = "N/A", "N/A"
            all_data.append([topic_name, title, link, author, date])
            print(topic_name, title, link, author, date, sep=" ")

        print()

        t = soup.select_one('a:-soup-contains("View More")')
        if not t:
            break
        t = t["href"]

df = pd.DataFrame(
    all_data, columns=["topic", "title", "link", "author", "date"]
)
print(df)
df.to_csv("data.csv", index=False)
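
The loop structure used above (topic first, then follow the next-page link until none is left) can be tried offline. The following is only a minimal sketch under stated assumptions: `FAKE_SITE`, `fetch`, and `scrape_topic` are hypothetical names introduced for illustration, and the in-memory dictionary stands in for the real `requests` and `BeautifulSoup` calls.

```python
# Hypothetical in-memory "site": each URL maps to (entries, next_url).
# It stands in for requests.get + BeautifulSoup so the loop can run offline.
FAKE_SITE = {
    "/blog/tag/news": (["post-1", "post-2"], "/blog/tag/news/p2"),
    "/blog/tag/news/p2": (["post-3"], None),
    "/blog/tag/tips": (["post-4"], None),
}


def fetch(url):
    """Return (entries, next_url) for one page (hypothetical stand-in for a request)."""
    return FAKE_SITE[url]


def scrape_topic(start_url):
    """Collect entries from every page of one topic by following the next link."""
    rows = []
    visited = set()
    url = start_url
    while url is not None and url not in visited:
        visited.add(url)  # guard against a next link that loops back
        entries, next_url = fetch(url)
        rows.extend((start_url, entry) for entry in entries)
        url = next_url
    return rows


all_rows = []
for topic in ["/blog/tag/news", "/blog/tag/tips"]:
    all_rows.extend(scrape_topic(topic))

print(all_rows)
```

Because each topic is paginated on its own, there is no need to guess a shared page range in advance: a topic with 17 pages and a topic with one page are both walked exactly to their end.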

Prints:

Healthcare News and Trends DO vs. MD: What's the Difference? https://www.bartonassociates.com/blog/whats-the-difference-do-md Tayla Holman September 9, 2021
Healthcare News and Trends What is the "Great Resignation"? https://www.bartonassociates.com/blog/what-is-the-great-resignation Chris Keeley September 2, 2021
Healthcare News and Trends CME Requirements for Physicians by State https://www.bartonassociates.com/blog/cme-requirements-for-physicians-by-state Teresa Otto, MD July 15, 2021

...and so on.
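
Each printed row corresponds to a single blog entry because the title, link, and author are all read from the same `.blog-entry` element. By contrast, looping over every `h4` and, inside that, over every `h2` on the page pairs each title with each heading, which may be one source of repeated records when a page has more than one `h2`. A small sketch of the difference, using made-up lists in place of the parsed tags:

```python
# Made-up stand-ins for the tags a parser might return from one page.
titles = ["Post A", "Post B", "Post C"]   # in place of soup.select("h4")
headings = ["Topic X", "Sidebar"]         # in place of soup.select("h2")

# Independent nested loops: every title is paired with every heading.
nested = [(title, heading) for title in titles for heading in headings]

# Per-entry extraction: each record carries its own fields, one row per entry.
entries = [
    {"title": "Post A", "topic": "Topic X"},
    {"title": "Post B", "topic": "Topic X"},
    {"title": "Post C", "topic": "Topic X"},
]
per_entry = [(entry["title"], entry["topic"]) for entry in entries]

print(len(nested), len(per_entry))
```

The nested version yields the product of the two list lengths, while the per-entry version yields exactly one row per entry.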

The script also saves the results to data.csv (LibreOffice screenshot).
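
If the same post is listed under more than one topic, it will appear once per topic in the collected rows. Should only one row per post be wanted, the rows can be filtered by link. This is a sketch, assuming rows in the `[topic, title, link, author, date]` layout used above; `unique_by_link` and the sample rows are hypothetical and not taken from the site.

```python
def unique_by_link(rows):
    """Keep the first row seen for each link (index 2), preserving order."""
    seen = set()
    unique = []
    for row in rows:
        link = row[2]
        if link not in seen:
            seen.add(link)
            unique.append(row)
    return unique


# Hypothetical sample rows: the second post is listed under two topics.
sample = [
    ["Topic X", "Post A", "https://example.com/a", "Author 1", "Date 1"],
    ["Topic X", "Post B", "https://example.com/b", "Author 2", "Date 2"],
    ["Topic Y", "Post B", "https://example.com/b", "Author 2", "Date 2"],
]

deduped = unique_by_link(sample)
print(deduped)
```

With the data already in a DataFrame, `df.drop_duplicates(subset="link")` should give a comparable result; note that either approach discards the extra topic assignments.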
