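I'm trying to scrape the book descriptions for the books listed in the Goodreads Choice Awards. I'm using the following function to get the individual URLs listed for a particular genre: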
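import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

def get_genre_url(genre):
    all_links = []
    # One Choice Awards page per year, 2011 through 2021
    for year in range(2011, 2022):
        url = 'https://www.goodreads.com/choiceawards/best-' + genre + '-books-' + str(year)
        page = requests.get(url)
        soup = bs(page.content, 'html.parser')
        # Each nominee links to its book page with a relative href
        for link in soup.find_all('a', {'class': 'pollAnswer__bookLink'}):
            all_links.append('https://www.goodreads.com' + link.get('href'))
    return all_links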
Once I have the book URLs, I go on to scrape those URLs to get the book descriptions.
def get_description(genre_list):
    urls = []
    authors = []
    titles = []
    index = 0
    for url in genre_list:
        #print(index,url)
        page = requests.get(url)
        soup = bs(page.content, 'html.parser')
        # The page <title> has the form "<title> by <author>"
        authors.append(soup.find('title').get_text().split(' by ')[1])
        #print(index,authors)
        description_df = pd.DataFrame(authors, columns=['author'])
        titles.append(soup.find('title').get_text().split(' by ')[0])
        description_df['title'] = titles
        if soup.find('div', {'class': 'readable stacked'}) is None:
            #print('This is a NoneType page:', url)
            # Note: this branch stores the tag itself, not its text
            description = soup.find('div', {'class': 'TruncatedText__text TruncatedText__text--5'})
        else:
            description = soup.find('div', {'class': 'readable stacked'}).get_text()
        urls.append(description)
        index += 1
    description_df['description'] = urls
    return description_df
To get the final DataFrame I would call (for example):
mystery_thriller_list = get_genre_url('mystery-thriller')
description_myster_thriller = get_description(mystery_thriller_list)
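However, I'd like to pass a list of genres (e.g. genres = ['fiction', 'mystery-thriller']) to the function and create a final DataFrame for each genre, where each DataFrame's name follows the naming convention description_'selected genre'. I haven't figured this out so far, and the for loop takes a while because it loads the information for 220 books per genre.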
Answer:
You can store all the DataFrames in a dictionary, using the genre names as keys.
all_genres_descriptions = {}
genres = ['fiction', 'mystery-thriller']
for genre in genres:
    genre_list = get_genre_url(genre)
    description_genre = get_description(genre_list)
    all_genres_descriptions[f'description_{genre}'] = description_genre
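If you'd rather have this wrapped in a single function that takes the genre list, a minimal sketch could look like the one below. The name build_genre_frames and its two callable parameters are hypothetical (they are not in the question). Passing the fetch functions in as arguments is only an assumption that makes the function easy to try with stubs before hitting Goodreads.

```python
def build_genre_frames(genres, fetch_links, fetch_descriptions):
    """Return a dict mapping 'description_<genre>' to that genre's result.

    fetch_links and fetch_descriptions are hypothetical parameters; with the
    question's code they would be get_genre_url and get_description.
    """
    frames = {}
    for genre in genres:
        links = fetch_links(genre)                      # book URLs for this genre
        frames[f'description_{genre}'] = fetch_descriptions(links)
    return frames
```

With the functions from the question, the call would be build_genre_frames(genres, get_genre_url, get_description), and a single result is then available as, for example, the value under the key 'description_mystery-thriller'.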
Answer:
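A couple of things. For testing, you don't need to go through all the years and books; I only looked at one year and the first two books. To do what you're looking for, you can use globals(). You may also want to create just one DataFrame, adding a "genre" column on each iteration and concatenating. In the long run, having all the data in a single DataFrame will probably be easier.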
genres = ['fiction', 'mystery-thriller']
for genre in genres:
    mystery_thriller_list = get_genre_url(genre)
    globals()[f"{genre.replace('-', '_')}_selected_genre"] = get_description(mystery_thriller_list)
print(fiction_selected_genre)
author title description
0 Haruki Murakami 1Q84 (1Q84 #1-3) \nThe year is 1984 and the city is Tokyo.A you...
1 Sarah Addison Allen The Peach Keeper \nThe New York Times bestselling author of The...
print(mystery_thriller_selected_genre)
author title description
0 Janet Evanovich | Goodreads Smokin' Seventeen (Stephanie Plum, #17) [[[<p><b><i>Where there’s smoke there’s fire, ...
1 J.D. Robb New York to Dallas (In Death, #33) \nTwelve years ago, Eve Dallas was just a rook...
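The single-DataFrame idea mentioned above could be sketched roughly as follows. The helper name combine_genres and the fetch_frame parameter are made up for illustration; with the question's functions, fetch_frame would be something like lambda g: get_description(get_genre_url(g)). It assumes the genre list is not empty, since pd.concat raises an error when given nothing to concatenate.

```python
import pandas as pd

def combine_genres(genres, fetch_frame):
    """Build one DataFrame with a 'genre' column from per-genre frames.

    fetch_frame is a hypothetical callable returning the DataFrame for a
    single genre. Assumes genres is non-empty.
    """
    frames = []
    for genre in genres:
        df = fetch_frame(genre).copy()   # copy so the caller's frame is untouched
        df['genre'] = genre              # tag every row with its genre
        frames.append(df)
    # Stack the per-genre frames and renumber the rows from 0
    return pd.concat(frames, ignore_index=True)
```

A single genre can then be pulled back out with an ordinary filter such as combined[combined['genre'] == 'fiction'], which avoids creating variables dynamically.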
