我制作此代碼是為了從網站上提取歌詞,告知藝術家和音樂名稱。
代碼正在運行,問題是我有一個包含 10000 首音樂的 DataFrame(名為 years_1920_2020),檢索所有這些歌詞需要 1:30h。
有沒有辦法更快地做到這一點?
def url_lyric(music,artist):
url_list = ("https://www.letras.mus.br/", str(artist),"/", str(music),"/")
url = ''.join(url_list)
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
try:
webpage = urlopen(req).read()
bs = BeautifulSoup(webpage, 'html.parser')
lines =bs.find('div', {'class':'cnt-letra p402_premium'})
final_lines = lines.find_all('p')
return final_lines
except:
return 0
final_lyric_series = pd.Series(name = "lyrics")
for year in range (1920,2021):
lyrics_serie = lyrics_from_year(year)
final_lyric_series = pd.concat([final_lyric_series, lyrics_serie])
print(year)
函式 Lyric_from_year(year) 使用函式 url_lyric,執行一些重新任務并回傳包含所選年份的所有歌詞的 pd.series
uj5u.com熱心網友回復:
我們可以使用 pythons asyncio 模塊獲得解決方案。請參考這篇文章這不是一個確切的解決方案,但與您的問題相似。
import asyncio
from concurrent.futures import ThreadPoolExecutor
import pandas as pd
def url_lyric(music, artist):
pass
def lyrics_from_year(year):
music = None
artist = None
return url_lyric(music, artist)
async def get_work_done():
with ThreadPoolExecutor(max_workers=10) as executor:
loop = asyncio.get_event_loop()
tasks = [
loop.run_in_executor(
executor,
lyrics_from_year,
*(year) # Allows us to pass in arguments to `lyrics_from_year`
)
for year in range(1920, 2021)
]
return await asyncio.gather(*tasks)
loop = asyncio.get_event_loop()
future = asyncio.ensure_future(get_work_done())
loop.run_until_complete(future)
final_lyric_series = pd.Series(name="lyrics")
for result in future:
final_lyric_series = pd.concat([final_lyric_series, result])
print(result)
uj5u.com熱心網友回復:
這是一個簡單的例子,說明如何做到這一點:
import aiohttp
import asyncio
import requests, bs4
async def main():
async with aiohttp.ClientSession() as session:
urls = [f"https://www.letras.mus.br{x['href']}" for x in bs4.BeautifulSoup(requests.get(
url = 'https://www.letras.mus.br/adele/mais-tocadas.html'
).content, 'html.parser').find_all('a', {'class':'song-name'})]
for url in urls:
async with session.get(url) as r:
lyrics = bs4.BeautifulSoup(await r.text(), 'html.parser').find('div', {'class':'cnt-letra'}).text
print('\n'.join(x.strip() for x in lyrics.strip().split('\n')))
loop = asyncio.get_event_loop()
loop.run_until_complete(main())
轉載請註明出處,本文鏈接:https://www.uj5u.com/caozuo/365335.html
上一篇:從表中抓取Selenium資料
