Trying to retrieve text from <td></td> tags using BeautifulSoup

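I'm using BeautifulSoup to scrape the links from a page. The artist names and links come out fine, but I'm not sure how to access the nationality in the second <td> tag.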

Here is the code:

import requests
import csv
from bs4 import BeautifulSoup

def findName():
  page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anB1.htm')

  soup = BeautifulSoup(page.text, 'html.parser')

  last_links = soup.find(class_='AlphaNav')
  last_links.decompose()

  f = csv.writer(open('h-artist_lastname.csv', 'w')) # Create a file to write
  f.writerow(['Last Name, First Name', 'Nationality', 'Link'])

  artist_name_list = soup.find(class_='BodyText') 
  artist_name_list_items = artist_name_list.find_all('a') 
  artist_nationality_list_items = artist_name_list.find_all('td')

  print(artist_nationality_list_items)

  for artist_name in artist_name_list_items:        
    names = artist_name.contents[0]
    #nationalities = artist_nationality_list_items.contents[0]  
    links = 'https://web.archive.org' + artist_name.get('href')

    #print(nationalities)

    f.writerow([names, links])

findName()

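If I uncomment the lines in the for loop, I get a runtime error, which I expected. The print statement gives me this value for the artist nationality list items: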

<td><a href="/web/20121007172915/http://www.nga.gov/cgi-bin/tsearch?artistid=32727">Babbitt, Platt D.</a></td>, <td>American, died 1879</td>, ..... <- follows this pattern for every artist

Basically, I want the "American, died 1879" part.

Answer:

You can use select, which accepts CSS selectors, together with :nth-child() to pick the second <td> in each <tr> instead of using find_all. So:

artist_nationality_list_items = artist_name_list.find_all('td')

becomes:

artist_nationality_list_items = artist_name_list.select('td:nth-child(2)')
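
To see what the selector picks up, here is a minimal sketch run against a small inline table rather than the archived page. The markup in SAMPLE_HTML is a hypothetical stand-in modelled on the output shown in the question (the second row is invented for illustration), so the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modelled on the print output in the question.
# The second row is made up purely for illustration.
SAMPLE_HTML = """
<div class="BodyText">
  <table>
    <tr><td><a href="/web/a1">Babbitt, Platt D.</a></td><td>American, died 1879</td></tr>
    <tr><td><a href="/web/a2">Example, Artist</a></td><td>French, 1800 - 1850</td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
body = soup.find(class_="BodyText")

# td:nth-child(2) matches every <td> that is the second child element
# of its parent, which here is the nationality cell of each row.
nationality_cells = body.select("td:nth-child(2)")

# get_text() returns the cell text without the surrounding <td> tags.
nationalities = [cell.get_text(strip=True) for cell in nationality_cells]
print(nationalities)
```

Because select returns a list of tags, each nationality still has to be read from its own element, for example with get_text() as above.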

Answer:

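You can still use contents, but don't get lost in all these lists. Select a more specific target and collect all the information in a single pass.

What happens?

You are treating artist_nationality_list_items (a list) as a single element, which will not work.

How to fix it?

To get the right values out of artist_nationality_list_items, you have to iterate over it as well.

(Works, but a bad idea):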

for i,artist_name in enumerate(artist_name_list_items):        
    names = artist_name.contents[0]
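    # The td list alternates name cell, nationality cell (see the printed output
    # in the question), so the nationality for artist i sits at index 2*i + 1.
    nationalities = artist_nationality_list_items[2 * i + 1].contents[0]
    links = 'https://web.archive.org' + artist_name.get('href')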
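
If you want to keep the two separate lists, one way to avoid index arithmetic is to pair them with zip. This is only a sketch under the assumption that the td list strictly alternates name cell and nationality cell, as in the output shown in the question. SAMPLE_HTML is hypothetical markup and its second row is invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the second row is made up for illustration.
SAMPLE_HTML = """
<div class="BodyText">
  <table>
    <tr><td><a href="/web/a1">Babbitt, Platt D.</a></td><td>American, died 1879</td></tr>
    <tr><td><a href="/web/a2">Example, Artist</a></td><td>French, 1800 - 1850</td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
body = soup.find(class_="BodyText")

links = body.find_all("a")
cells = body.find_all("td")

rows = []
# cells[1::2] keeps every second cell starting at index 1,
# i.e. only the nationality cells, so zip pairs them with the links.
for link, nationality_cell in zip(links, cells[1::2]):
    rows.append([
        link.get_text(strip=True),
        nationality_cell.get_text(strip=True),
        "https://web.archive.org" + link.get("href"),
    ])

print(rows)
```

This still breaks silently if the page contains extra links or cells, which is why iterating row by row is usually the safer choice.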

Alternative and leaner approach

import requests, csv
from bs4 import BeautifulSoup

def findName():
    page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anB1.htm')

    soup = BeautifulSoup(page.text, 'html.parser')

    f = csv.writer(open('h-artist_lastname.csv', 'w')) # Create a file to write
    f.writerow(['Last Name, First Name', 'Nationality', 'Link'])
    
    for row in soup.select('div.BodyText h3 table tr'):

        names = row.contents[0].text
        nationalities = row.contents[1].text
        links = 'https://web.archive.org' + row.a.get('href')

        #print([names,nationalities,links])

        f.writerow([names,nationalities,links])

findName()
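
A side note on the CSV output: the snippets in this thread hand an open file straight to csv.writer and never close it. A minimal sketch of the more usual pattern follows, using a with block and newline=''. The helper name write_artists and the temporary output path are assumptions made for this example, and the row reuses the artist shown in the question.

```python
import csv
import os
import tempfile


def write_artists(path, rows):
    """Write a header line plus the given rows, closing the file afterwards."""
    # newline='' lets the csv module control line endings itself,
    # which avoids blank lines between rows on Windows.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Last Name, First Name", "Nationality", "Link"])
        writer.writerows(rows)


# One row taken from the output shown in the question.
sample_rows = [[
    "Babbitt, Platt D.",
    "American, died 1879",
    "https://web.archive.org/web/20121007172915/http://www.nga.gov/cgi-bin/tsearch?artistid=32727",
]]

out_path = os.path.join(tempfile.mkdtemp(), "artists.csv")
write_artists(out_path, sample_rows)

# Read the file back to confirm what was written.
with open(out_path, newline="", encoding="utf-8") as handle:
    written = list(csv.reader(handle))
print(written)
```

Fields that contain commas, such as the artist name, are quoted automatically by csv.writer, so they survive the round trip as single values.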

Answer:

This is a somewhat sloppy workaround and a bit of a clumsy answer, but it got me what I needed:

import requests
import csv
from bs4 import BeautifulSoup

def findName():
  page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anB1.htm')

  soup = BeautifulSoup(page.text, 'html.parser')

  last_links = soup.find(class_='AlphaNav')
  last_links.decompose()

  f = csv.writer(open('b-artist_lastname.csv', 'w')) # Create a file to write
  f.writerow(['Last Name, First Name', 'Nationality', 'Link'])

  artist_name_list = soup.find(class_='BodyText') 
  artist_name_list_items = artist_name_list.find_all('a') 

  i = 2

  for artist_name in artist_name_list_items:   
    str_list = list('td:nth-of-type(i)')
    str_list[15] = str(i)

    selection = "".join(str_list)

    names = artist_name.contents[0]
    nationality = artist_name_list.select(selection)  
    links = 'https://web.archive.org' + artist_name.get('href')

    nat_to_str = str(nationality)
    nat_str_final = nat_to_str[5:len(nat_to_str) - 6]

    #print(nat_str_final)

    f.writerow([names, nat_str_final, links])
    i += 2

findName()

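Thanks to everyone who answered. Using 'td:nth-of-type()' seemed to work, but to get every artist on the page I had to increase the value inside nth-of-type on each pass. So I turned the selector string into a list of characters, replaced the placeholder with the current value of i, and joined it back into a string on each iteration.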
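
The character-list trick described above can probably be replaced with an f-string, which builds the same selector text directly. The sketch below compares the two. Both function names are hypothetical and exist only for this comparison.

```python
def selector_via_list(i):
    """Build the selector the way the answer above does."""
    chars = list('td:nth-of-type(i)')
    chars[15] = str(i)  # index 15 is the placeholder character 'i'
    return "".join(chars)


def selector_via_fstring(i):
    """Build the same selector with an f-string."""
    return f"td:nth-of-type({i})"


# The values 2, 4, 6 mirror the loop above, where i starts at 2 and grows by 2.
selectors = [selector_via_fstring(i) for i in range(2, 8, 2)]
print(selectors)
```

In the same spirit, calling get_text() on the selected tag would avoid slicing the string form of the result list to strip the tags.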


Tags: Python, HTML, web scraping, BeautifulSoup
