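I have a set of URLs stored in a list, and I want to write a script that collects song lyrics from the Genius site and stores them in txt files.
I've written the script, but for some reason the content it returns is incomplete.
Here is the code: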
import requests
import re
import pandas as pd
from bs4 import BeautifulSoup
from time import time

urls = ['https://genius.com/The-Stooges-1969-lyrics','https://genius.com/The-Stooges-1970-lyrics',
        'https://genius.com/The-Rolling-Stones-19th-Nervous-Breakdown-lyrics','https://genius.com/Lil-Wayne-3-Peat-lyrics',
        'https://genius.com/RunDMC-30-Days-lyrics','https://genius.com/Bob-marley-and-the-wailers-four-hundred-years-lyrics',
        'https://genius.com/The-Clash-48-Hours-lyrics']

start = time()
for u in urls:
    soup = BeautifulSoup(requests.get(u).content, 'lxml')
    for tag in soup.select('div[class^="Lyrics__Container"], .song_body-lyrics p'):
        lyrics = tag.get_text(strip=True, separator='\n')
        if lyrics:
            with open("PATH\\" + str(urls.index(u)) + ".txt", 'w') as f:
                f.write(lyrics)
print(f'Time taken: {time() - start}')
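For example, see the lyrics of the song at this URL: https://genius.com/Rundmc-30-days-lyrics.
Now look at what was returned:
"[DMC] If you need a vacation, we can fly around the world You'll know I'd never look at another girl I'm a one-woman man, my heart is set You're the one of the '80s I'm going to get [Both] If you find you don't like my ways Well, you can send me back within 30 days"
Somehow I can get at the lyrics, but something seems to be missing to make the script robust, because in some cases it cuts the content short.
Does anyone know what I might be doing wrong?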
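Answer from a uj5u.com user:
I don't really understand why it does this, but it may just be that the site sometimes renders differently. I made a few adjustments and so far haven't seen the problem. It could come from the way the text is parsed and then how you write it to the file, so I adjusted some of the indentation in the for loop and how it concatenates the string: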
import requests
import re
import pandas as pd
from bs4 import BeautifulSoup
from time import time

urls = ['https://genius.com/The-Stooges-1969-lyrics','https://genius.com/The-Stooges-1970-lyrics',
        'https://genius.com/The-Rolling-Stones-19th-Nervous-Breakdown-lyrics','https://genius.com/Lil-Wayne-3-Peat-lyrics',
        'https://genius.com/RunDMC-30-Days-lyrics','https://genius.com/Bob-marley-and-the-wailers-four-hundred-years-lyrics',
        'https://genius.com/The-Clash-48-Hours-lyrics']

start = time()
headers = {'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Mobile Safari/537.36'}

for u in urls:
    response = requests.get(u, headers=headers)
    #print(response)
    soup = BeautifulSoup(response.text, 'lxml')

    # Collect the text of every lyrics container into one string
    lyrics = ''
    for tag in soup.find_all("div", {"class":re.compile(r'^Lyrics__Container')}):
        lyrics += tag.get_text(strip=True, separator='\n') + '\n'

    # Write the file once per URL, after all containers have been read
    if lyrics:
        with open("D:/test/lyrics/" + str(urls.index(u)) + ".txt", 'w') as f:
            f.write(lyrics)
    #print(lyrics)
print(f'Time taken: {time() - start}')
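One plausible reading of why moving the write out of the inner loop helps (an inference from the code, not something the thread confirms): if the file is opened in 'w' mode once per lyrics container, each open truncates the file, so only the last container's text survives. The sketch below reproduces that effect with the standard library only. The function names and the sample chunks are hypothetical stand-ins for the text of the containers found on one page.

```python
import tempfile
from pathlib import Path


def write_each_chunk(path, chunks):
    """Reopen the file in 'w' mode for every chunk (write inside the inner loop)."""
    for chunk in chunks:
        if chunk:
            # Mode 'w' truncates the file on every open, so earlier chunks are lost.
            with open(path, "w", encoding="utf-8") as f:
                f.write(chunk)


def write_joined_chunks(path, chunks):
    """Build one string from all chunks, then write the file a single time."""
    text = ""
    for chunk in chunks:
        text += chunk + "\n"
    if text:
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)


# Hypothetical stand-ins for the text of three lyrics containers on one page.
chunks = ["verse one", "chorus", "verse two"]

with tempfile.TemporaryDirectory() as tmp:
    overwritten = Path(tmp) / "overwritten.txt"
    joined = Path(tmp) / "joined.txt"
    write_each_chunk(overwritten, chunks)
    write_joined_chunks(joined, chunks)
    overwritten_text = overwritten.read_text(encoding="utf-8")
    joined_text = joined.read_text(encoding="utf-8")

print(repr(overwritten_text))  # only the last chunk survives
print(repr(joined_text))       # all chunks, one per line
```

Opening the file in append mode ('a') inside the loop would also keep every chunk, but rerunning the script would then add to the old output, so accumulating first and writing once is probably the simpler choice here.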
