Counting how often a specific word appears on a specific URL (Python)

I want to find out how often a specific word appears on a given URL. I currently have a way to do this for a small set of URLs and a single word:

import requests
from bs4 import BeautifulSoup

url_list = ["https://www.example.org/","https://www.example.com/"]

#the_word = input()
the_word = 'Python'

total_words = []
for url in url_list:
    r = requests.get(url, allow_redirects=False)
    soup = BeautifulSoup(r.content.lower(), 'lxml')
    words = soup.find_all(text=lambda text: text and the_word.lower() in text)
    count = len(words)
    words_list = [ ele.strip() for ele in words ]
    for word in words:
        total_words.append(word.strip())

    print('\nUrl: {}\ncontains {} of word: {}'.format(url, count, the_word))
    print(words_list)


#print(total_words)
total_count = len(total_words)

However, I would like to be able to do this for a set of words that are each mapped to their own URL, as in the DataFrame below.

target word	target URL
word1	www.example.com/topic-1/
word2	www.example.com/topic-2/

Ideally, the output would give me a new column containing how often each word appears on its associated URL, for example how often 'word1' appears on 'www.example.com/topic-1/'.

Any and all help is greatly appreciated!

Answer from a uj5u.com user:

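Just iterate over your structure (a dict, a list of dicts, ...). The example below only points in one direction, because your question is not that clear and lacks the exact expected result. I'm sure you can adapt it to your specific needs.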

Example

import requests
from bs4 import BeautifulSoup
import pandas as pd

data = [
    {'word':'Python','url':'https://stackoverflow.com/questions/tagged/python'},
    {'word':'Question','url':'https://stackoverflow.com/questions/tagged/python'}
]

for item in data:
    r = requests.get(item['url'], allow_redirects=False)
    soup = BeautifulSoup(r.content.lower(), 'lxml')
    count = soup.body.get_text(strip=True).lower().count(item['word'].lower())
    item['count'] = count

pd.DataFrame(data)

Output

word	url	count
Python	https://stackoverflow.com/questions/tagged/python	93
Question	https://stackoverflow.com/questions/tagged/python	13

Note: Depending on which word frequency you want to determine, you should consider the following:

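The human-readable text should be extracted separately from the HTML, for example with BeautifulSoup.
The tool has to be chosen according to whether the page content is served statically or dynamically. For dynamic content, selenium is the better choice because, unlike requests, it also renders JavaScript.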
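
The loop above can also be wrapped in a small helper that returns one count per word/URL pair, which is the "new column" the question asks for. The following is only a sketch under a few assumptions: the names `count_word`, `fetch_page_text` and `add_counts` are hypothetical, the URLs are assumed to include a scheme such as `https://`, and the page is assumed to be static. The fetcher is passed in as a parameter so that it can be swapped for a selenium-based one when the content is rendered by JavaScript.

```python
from typing import Callable, Dict, List


def count_word(text: str, word: str) -> int:
    """Count case-insensitive occurrences of `word` as a substring of `text`."""
    return text.lower().count(word.lower())


def fetch_page_text(url: str) -> str:
    """Download a static page and return its human-readable text.

    The imports are local so the counting helpers stay usable
    without the network libraries installed.
    """
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "lxml")
    # Join text nodes with a space so words from adjacent tags do not merge.
    return soup.get_text(" ", strip=True)


def add_counts(
    rows: List[Dict[str, str]],
    fetch_text: Callable[[str], str] = fetch_page_text,
) -> List[Dict[str, object]]:
    """Return copies of `rows` with an extra 'count' key for each word/URL pair."""
    cache: Dict[str, str] = {}  # fetch each distinct URL only once
    result = []
    for row in rows:
        url = row["url"]
        if url not in cache:
            cache[url] = fetch_text(url)
        result.append({**row, "count": count_word(cache[url], row["word"])})
    return result
```

With the `data` list from the example, `pd.DataFrame(add_counts(data))` would then give a frame with `word`, `url` and `count` columns. Caching by URL avoids downloading the same page twice when several words point to it.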

Answer from a uj5u.com user:

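You should try the string count() method. Note that it has to be called on the text of the page, not on the url string itself. With your code, inside the loop, it would look like this:

# soup was built from lower-cased content, so compare against the lower-cased word
count = soup.get_text().count(the_word.lower())
print('\nUrl: {}\ncontains {} of word: {}'.format(url, count, the_word))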
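
One thing to keep in mind is that str.count() counts substrings, so "Python" is also found inside "Pythonic". If only whole words should be counted, a regular expression with word boundaries is one possible alternative. The sketch below uses only the standard library; the function names and the sample sentence are made up for illustration.

```python
import re


def count_substring(text, word):
    """Case-insensitive substring count, equivalent to lower-casing and using str.count()."""
    return text.lower().count(word.lower())


def count_whole_word(text, word):
    """Case-insensitive count of `word` only where it stands as a whole word."""
    # re.escape keeps characters such as '.' in the word from acting as regex syntax.
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


sample = "Python is popular. Pythonic code is idiomatic python."
print(count_substring(sample, "Python"))   # 3 (also matches inside "Pythonic")
print(count_whole_word(sample, "Python"))  # 2
```

The `\b` boundary only behaves as expected for words that start and end with a letter, digit or underscore, so a term like "C++" would need a different pattern.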
