beautifulsoup: How to scrape multiple URLs that end differently

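I want to scrape this dictionary for its different verbs. Each verb appears at "https://www.spanishdict.com/conjugate/" plus the verb. So, for example, for the verb "hacer" we have: https://www.spanishdict.com/conjugate/hacer

I want to scrape all possible links that contain a verb conjugation and return them as a list of strings. So I did the following: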

import requests
from bs4 import BeautifulSoup
url = 'https://www.spanishdict.com/conjugate/' 

for i in url:
    reqs = requests.get(url + str())
    soup = BeautifulSoup(reqs.text, 'html.parser')

    urls = []
    for link in soup.find_all('a'):
        urls.append(link.get('href'))

    print(urls)

But when I print the URLs, I only get some empty lists.

Sample of the expected output:

['https://www.spanishdict.com/conjugate/hacer', 'https://www.spanishdict.com/conjugate/tener',...etc]

uj5u.com user reply:

When you iterate over "url", you are iterating over a string. Look at this code:

url = 'https://www.spanishdict.com/conjugate/' 

for i in url:
    print(i)

This yields each character of the URL:

h
t
t
p
s
:
/
/
w
w
w
<truncated>

You are also doing something wrong here:

reqs = requests.get(url + str())

I'm not sure what you are trying to do, but 'url + str()' is just the URL plus an empty string, which is the URL.
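If the intent was to request one page per verb, the loop would need to run over a collection of verbs rather than over the characters of the URL. A minimal sketch, assuming a hypothetical `verbs` list that you would have to supply yourself (the question gives no source for it):

```python
BASE_URL = 'https://www.spanishdict.com/conjugate/'

def build_conjugation_urls(verbs, base=BASE_URL):
    """Return one conjugation URL per verb by appending the verb to the base."""
    return [base + verb for verb in verbs]

# Hypothetical input: these verbs are only examples taken from the question.
verbs = ['hacer', 'tener']
urls = build_conjugation_urls(verbs)
print(urls)
```

Each resulting URL could then be passed to requests.get() inside a loop over `urls`.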

If you remove the for loop and the unnecessary empty string, you get what I think you were trying to get:

import requests
from bs4 import BeautifulSoup
url = 'https://www.spanishdict.com/conjugate/' 

reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
urls = []
for link in soup.find_all('a'):
    urls.append(link.get('href'))

print(urls)

This yields:

['/', '/learn', '/translation', '/conjugation', '/vocabulary', '#', '/translation', '/conjugation', '/vocabulary', '/guide', '/pronunciation', '/wordoftheday', '/learn', '/guide/spanish-present-tense-forms', '/guide/spanish-present-progressive-forms', '/guide/spanish-preterite-tense-forms', '/guide/spanish-imperfect-tense-forms', '/guide/simple-future-regular-forms-and-tenses', '/guide/spanish-present-subjunctive', '/guide/commands', '/guide/spanish-imperfect-subjunctive', '/guide', '/drill?drill_start_source=conjugation hubpage', 'https://play.google.com/store/apps/details?id=com.spanishdict.spanishdict&referrer=utm_campaign=adhesion', '/wordoftheday', '/translate/patinar', '/', 'https://www.ingles.com/verbos', 'https://www.curiositymedia.com/', 'https://help.spanishdict.com/', '/company/privacy', '/company/tos', '/sitemap', '/', 'https://www.ingles.com/verbos', '/translation', '/conjugation', '/vocabulary', '/learn', '/guide', '/wordoftheday', 'https://www.curiositymedia.com/', '/company/privacy', '/company/tos', '/sitemap', 'https://help.spanishdict.com/', 'https://help.spanishdict.com/contact', 'https://www.facebook.com/pages/SpanishDict/92805940179', 'https://twitter.com/spanishdict', 'https://www.instagram.com/spanishdict/', 'https://itunes.apple.com/us/app/spanishdict/id332510494', 'https://play.google.com/store/apps/details?id=com.spanishdict.spanishdict&referrer=utm_source=sd-footer']

Is this list of links what you are after?
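Most of the hrefs in that list are relative paths. If absolute URLs are wanted, one option is urllib.parse.urljoin from the standard library. The helper below is only a sketch: the name `absolute_links` and the decision to skip '#' anchors and duplicates are my own assumptions, not part of the original code.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.spanishdict.com/conjugate/'

def absolute_links(hrefs, base=BASE_URL):
    """Resolve hrefs against base, skipping empty values, '#' anchors and duplicates."""
    seen = set()
    result = []
    for href in hrefs:
        if not href or href.startswith('#'):
            continue
        full = urljoin(base, href)
        if full not in seen:
            seen.add(full)
            result.append(full)
    return result

# A few values taken from the list above, plus None (link.get('href') can return None).
sample = ['/', '/learn', '#', '/translate/patinar',
          'https://help.spanishdict.com/', '/learn', None]
print(absolute_links(sample))
```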

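uj5u.com user reply:

EDIT

Hopefully I understand what you mean; if so, the question should be improved. To get the information out of the JavaScript, you can parse the response with a regular expression: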

import requests
import json
import re

r = requests.get('https://www.spanishdict.com/conjugation')
m = re.search(r'window.SD_COMPONENT_DATA = ({.*})', r.text)
['https://www.spanishdict.com/conjugate/' + w for x in json.loads(m.group(1))['searchQuickLinkSections'] for w in x['words']]

Output

['https://www.spanishdict.com/conjugate/tener',
 'https://www.spanishdict.com/conjugate/hacer',
 'https://www.spanishdict.com/conjugate/ser',
 'https://www.spanishdict.com/conjugate/estar',
 'https://www.spanishdict.com/conjugate/haber',
 'https://www.spanishdict.com/conjugate/ir',
 'https://www.spanishdict.com/conjugate/poder',
 'https://www.spanishdict.com/conjugate/decir',
 'https://www.spanishdict.com/conjugate/cerrar',
 'https://www.spanishdict.com/conjugate/mentir',
 'https://www.spanishdict.com/conjugate/dormir',
 'https://www.spanishdict.com/conjugate/recordar',
 'https://www.spanishdict.com/conjugate/seguir',
 'https://www.spanishdict.com/conjugate/medir',
 'https://www.spanishdict.com/conjugate/adquirir',
 'https://www.spanishdict.com/conjugate/jugar',
 'https://www.spanishdict.com/conjugate/vestirse',
 'https://www.spanishdict.com/conjugate/divertirse',
 'https://www.spanishdict.com/conjugate/acostarse',
 'https://www.spanishdict.com/conjugate/ponerse',
 'https://www.spanishdict.com/conjugate/despertarse',
 'https://www.spanishdict.com/conjugate/sentirse',
 'https://www.spanishdict.com/conjugate/levantarse',
 'https://www.spanishdict.com/conjugate/sentarse',
 'https://www.spanishdict.com/conjugate/gustar',
 'https://www.spanishdict.com/conjugate/alegrar',
 'https://www.spanishdict.com/conjugate/quedar',
 'https://www.spanishdict.com/conjugate/encantar',
 'https://www.spanishdict.com/conjugate/parecer',
 'https://www.spanishdict.com/conjugate/faltar',
 'https://www.spanishdict.com/conjugate/doler',
 'https://www.spanishdict.com/conjugate/interesar']
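The same extraction can be tried offline. The sketch below swaps the regular expression for json.JSONDecoder.raw_decode, which stops at the end of the first complete JSON value. The `sample_html` string is a made-up stand-in for the real page; only the `window.SD_COMPONENT_DATA`, `searchQuickLinkSections` and `words` names come from the code above, and the helper names are hypothetical.

```python
import json

MARKER = 'window.SD_COMPONENT_DATA'
BASE = 'https://www.spanishdict.com/conjugate/'

def extract_component_data(html):
    """Return the JSON object assigned to the marker variable, or None if absent."""
    start = html.find(MARKER)
    if start == -1:
        return None
    brace = html.find('{', start)
    if brace == -1:
        return None
    # raw_decode parses one JSON value and ignores whatever follows it.
    data, _end = json.JSONDecoder().raw_decode(html[brace:])
    return data

def conjugation_urls(data):
    """Build one conjugation URL per word across all quick-link sections."""
    return [BASE + word
            for section in data.get('searchQuickLinkSections', [])
            for word in section.get('words', [])]

# Made-up sample that mimics the assumed page structure.
sample_html = ('<script>window.SD_COMPONENT_DATA = {"searchQuickLinkSections": '
               '[{"words": ["tener", "hacer"]}, {"words": ["ser"]}]};</script>')
print(conjugation_urls(extract_component_data(sample_html)))
```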

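To get the expected output, you need a list of verbs. Your question does not provide a source for one, so as a starting point for generating this kind of information I used the verbs-top-500 list and a list comprehension.

For every <a> whose href contains translate, it concatenates your URL with the verb, which is the text of the <div> that is a direct child of the <a>:

['https://www.spanishdict.com/conjugate/' + a.div.text for a in soup.select('a[href*="translate"]')]

Example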

import requests
from bs4 import BeautifulSoup

url = 'https://www.spanishdict.com/lists/1690101/verbs-top-500'
# Send a browser-like User-Agent with the request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'lxml')

# Each matching <a> wraps a <div> whose text is the verb
urls = ['https://www.spanishdict.com/conjugate/' + a.div.text for a in soup.select('a[href*="translate/"]')]

Output

['https://www.spanishdict.com/conjugate/procurar', 'https://www.spanishdict.com/conjugate/podar', 'https://www.spanishdict.com/conjugate/pillar', 'https://www.spanishdict.com/conjugate/perrear', 'https://www.spanishdict.com/conjugate/perfeccionar', 'https://www.spanishdict.com/conjugate/perdonar', 'https://www.spanishdict.com/conjugate/pegar', 'https://www.spanishdict.com/conjugate/pasear', 'https://www.spanishdict.com/conjugate/ordenar', 'https://www.spanishdict.com/conjugate/ondear', 'https://www.spanishdict.com/conjugate/ojalar', 'https://www.spanishdict.com/conjugate/ocultar', 'https://www.spanishdict.com/conjugate/nombrar',...]
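To see what the selector and `a.div.text` do without hitting the site, here is a small offline sketch. The `sample_html` markup is an assumption about how the list page is structured, not a copy of it, and the `None` check is an extra guard for anchors that have no <div> child.

```python
from bs4 import BeautifulSoup

BASE = 'https://www.spanishdict.com/conjugate/'

# Hypothetical markup: two verb links, one unrelated link, one link without a <div>.
sample_html = """
<a href="/translate/procurar"><div>procurar</div></a>
<a href="/translate/podar"><div>podar</div></a>
<a href="/guide"><div>Guide</div></a>
<a href="/translate/pillar">pillar</a>
"""

def verb_urls(html):
    """Collect conjugation URLs from anchors whose href contains 'translate/'."""
    soup = BeautifulSoup(html, 'html.parser')
    urls = []
    for a in soup.select('a[href*="translate/"]'):
        if a.div is None:  # skip anchors that do not wrap a <div>
            continue
        urls.append(BASE + a.div.get_text(strip=True))
    return urls

print(verb_urls(sample_html))
```

The built-in 'html.parser' is used here so the sketch does not depend on lxml being installed.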
