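I'm using BeautifulSoup to scrape links from a page. The artist names and links come out fine, but I'm not sure how to access the nationality in the second <td> tag.
Here is the code: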
import requests
import csv
from bs4 import BeautifulSoup

def findName():
    page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anB1.htm')
    soup = BeautifulSoup(page.text, 'html.parser')
    # Drop the alphabetical navigation links so they are not scraped as artists
    last_links = soup.find(class_='AlphaNav')
    last_links.decompose()
    f = csv.writer(open('h-artist_lastname.csv', 'w'))  # Create a file to write
    f.writerow(['Last Name, First Name', 'Nationality', 'Link'])
    artist_name_list = soup.find(class_='BodyText')
    artist_name_list_items = artist_name_list.find_all('a')
    artist_nationality_list_items = artist_name_list.find_all('td')
    print(artist_nationality_list_items)
    for artist_name in artist_name_list_items:
        names = artist_name.contents[0]
        #nationalities = artist_nationality_list_items.contents[0]
        links = 'https://web.archive.org' + artist_name.get('href')
        #print(nationalities)
        f.writerow([names, links])

findName()
If I uncomment the lines in the for loop, I get the runtime error I would expect. The print statement gives me this value for artist_nationality_list_items:
<td><a href="/web/20121007172915/http://www.nga.gov/cgi-bin/tsearch?artistid=32727">Babbitt, Platt D.</a></td>, <td>American, died 1879</td>, ..... <- follows this pattern for every artist
Basically, I want the "American, died 1879" part.
Reply from a uj5u.com user:
Instead of find_all, you can use select, which accepts CSS selectors, together with :nth-child() to pick the second <td> in each <tr>. So:
artist_nationality_list_items = artist_name_list.find_all('td')
becomes:
artist_nationality_list_items = artist_name_list.select('td:nth-child(2)')
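To see what this selector returns, here is a minimal sketch that runs against a small inline sample instead of the live page. The markup and the name SAMPLE_HTML are assumptions modeled on the output printed in the question, and the second row is a made-up placeholder rather than real data from the page.

```python
from bs4 import BeautifulSoup

# Assumed sample markup, modeled on the output printed in the question.
# The second row is a made-up placeholder, not real data from the page.
SAMPLE_HTML = """
<div class="BodyText">
<table>
<tr><td><a href="/web/20121007172915/http://www.nga.gov/cgi-bin/tsearch?artistid=32727">Babbitt, Platt D.</a></td><td>American, died 1879</td></tr>
<tr><td><a href="/example">Doe, Jane</a></td><td>Placeholder nationality</td></tr>
</table>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
body = soup.find(class_='BodyText')

# td:nth-child(2) matches every <td> that is the second child element
# of its parent, which here is the nationality cell of each <tr>.
nationality_cells = body.select('td:nth-child(2)')
nationalities = [cell.get_text(strip=True) for cell in nationality_cells]
print(nationalities)
```

Because select returns a list of tags, each nationality still has to be read from an individual element, for example with get_text().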
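Reply from a uj5u.com user:
You can still use contents, but don't get bogged down in all the lists; select a more specific target and collect all the information in one pass.
What is happening?
You are treating artist_nationality_list_items (a list) as if it were a single element, which does not work.
How to fix it?
To get the right values out of artist_nationality_list_items, you have to iterate over it as well.
(Works, but a bad idea):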
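for i, artist_name in enumerate(artist_name_list_items):
    names = artist_name.contents[0]
    # Assuming the td list alternates name cell, nationality cell (as in the
    # printed output in the question), artist i has its nationality at 2*i + 1.
    nationalities = artist_nationality_list_items[2 * i + 1].contents[0]
    links = 'https://web.archive.org' + artist_name.get('href')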
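Since the <td> list printed in the question alternates between a name cell and a nationality cell, one hedged alternative to index arithmetic is to slice the list into two halves and zip them. The sketch below runs on an assumed inline sample; the markup, the short href values, and the second row are hypothetical placeholders.

```python
from bs4 import BeautifulSoup

# Assumed sample markup; the hrefs and the second row are hypothetical.
SAMPLE_HTML = (
    '<table>'
    '<tr><td><a href="/a1">Babbitt, Platt D.</a></td><td>American, died 1879</td></tr>'
    '<tr><td><a href="/a2">Doe, Jane</a></td><td>Placeholder nationality</td></tr>'
    '</table>'
)

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
cells = soup.find_all('td')

# Even positions hold the name cells, odd positions the nationality cells.
pairs = []
for name_cell, nationality_cell in zip(cells[0::2], cells[1::2]):
    link = 'https://web.archive.org' + name_cell.a.get('href')
    pairs.append((name_cell.get_text(), nationality_cell.get_text(), link))

for pair in pairs:
    print(pair)
```

This still depends on the cells strictly alternating, so it shares the fragility of the index-based version.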
An alternative, leaner approach
import requests, csv
from bs4 import BeautifulSoup

def findName():
    page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anB1.htm')
    soup = BeautifulSoup(page.text, 'html.parser')
    f = csv.writer(open('h-artist_lastname.csv', 'w'))  # Create a file to write
    f.writerow(['Last Name, First Name', 'Nationality', 'Link'])
    # Each table row holds one artist: name cell first, nationality cell second
    for row in soup.select('div.BodyText h3 table tr'):
        names = row.contents[0].text
        nationalities = row.contents[1].text
        links = 'https://web.archive.org' + row.a.get('href')
        #print([names,nationalities,links])
        f.writerow([names, nationalities, links])

findName()
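The row-based loop can be tried without network access by parsing an inline sample and writing to an in-memory buffer. This is only a sketch under assumptions: the sample markup, the placeholder second row, and the helper name rows_to_csv are all hypothetical. It also reads the cells with find_all('td') per row rather than contents, so that whitespace nodes between cells do not shift the indices.

```python
import csv
import io

from bs4 import BeautifulSoup

# Assumed sample markup; the hrefs and the second row are hypothetical.
SAMPLE_HTML = """
<div class="BodyText"><table>
<tr>
  <td><a href="/a1">Babbitt, Platt D.</a></td>
  <td>American, died 1879</td>
</tr>
<tr>
  <td><a href="/a2">Doe, Jane</a></td>
  <td>Placeholder nationality</td>
</tr>
</table></div>
"""

def rows_to_csv(html, out):
    """Write one CSV line per table row: name, nationality, link."""
    soup = BeautifulSoup(html, 'html.parser')
    writer = csv.writer(out)
    writer.writerow(['Last Name, First Name', 'Nationality', 'Link'])
    for row in soup.select('div.BodyText tr'):
        cells = row.find_all('td')
        # Skip rows that do not look like an artist entry.
        if len(cells) < 2 or cells[0].a is None:
            continue
        name = cells[0].get_text(strip=True)
        nationality = cells[1].get_text(strip=True)
        link = 'https://web.archive.org' + cells[0].a.get('href')
        writer.writerow([name, nationality, link])

buffer = io.StringIO()
rows_to_csv(SAMPLE_HTML, buffer)
print(buffer.getvalue())
```

To write a real file instead, the same function could be called inside `with open('h-artist_lastname.csv', 'w', newline='') as fh:`, which also closes the file when done.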
Reply from a uj5u.com user:
This is a somewhat sloppy workaround and a bit of a hacky answer, but it got me what I needed:
import requests
import csv
from bs4 import BeautifulSoup

def findName():
    page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anB1.htm')
    soup = BeautifulSoup(page.text, 'html.parser')
    last_links = soup.find(class_='AlphaNav')
    last_links.decompose()
    f = csv.writer(open('b-artist_lastname.csv', 'w'))  # Create a file to write
    f.writerow(['Last Name, First Name', 'Nationality', 'Link'])
    artist_name_list = soup.find(class_='BodyText')
    artist_name_list_items = artist_name_list.find_all('a')
    i = 2
    for artist_name in artist_name_list_items:
        # Build 'td:nth-of-type(<i>)' by replacing the placeholder character at index 15
        str_list = list('td:nth-of-type(i)')
        str_list[15] = str(i)
        selection = "".join(str_list)
        names = artist_name.contents[0]
        nationality = artist_name_list.select(selection)
        links = 'https://web.archive.org' + artist_name.get('href')
        # Strip the leading '[<td>' and trailing '</td>]' from the stringified result
        nat_to_str = str(nationality)
        nat_str_final = nat_to_str[5:len(nat_to_str) - 6]
        #print(nat_str_final)
        f.writerow([names, nat_str_final, links])
        i += 2  # move on to the next nationality cell

findName()
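Thanks to everyone who answered. Using 'td:nth-of-type()' seems to work, but to get every artist on the page I had to increase the value inside nth-of-type on each pass, so I used a list of characters and joined it back into a string after incrementing i on each iteration.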
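The character-list step can probably be replaced with plain string formatting. A minimal sketch, assuming the counter starts at 2 and advances by 2 on each pass as described above; nth_of_type_selector is a hypothetical helper name.

```python
def nth_of_type_selector(i):
    """Build a CSS selector such as 'td:nth-of-type(2)' for position i."""
    return f'td:nth-of-type({i})'

# Positions 2, 4, 6 mirror a counter that starts at 2 and steps by 2.
selectors = [nth_of_type_selector(i) for i in range(2, 8, 2)]
print(selectors)
```

Each resulting string can be passed to select() in place of the joined character list.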
Please credit the source when reposting. Link to this article: https://www.uj5u.com/gongcheng/346626.html
