How do I scrape list elements that have no class name?

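I'm building a web scraper for Bookdepository, but I've run into a problem with the site's HTML elements. A book's page has a section called Product Details, and I need to get every element from that list. However, some elements (not all), such as Language, have this structure (sample image): a label followed by a span with no class name or itemprop attribute. How can I get this element?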

This is what I have so far. Thanks a lot in advance.

import bs4
from urllib.request import urlopen


book_isbn = ("9781399703994")
book_urls = "https://www.bookdepository.com/Enid-Blytons-Christmas-Tales-Enid-Blyton/" + book_isbn

source = urlopen(book_urls).read()
soup = bs4.BeautifulSoup(source,'lxml')
book_description = soup.find('div', class_='item-excerpt trunc')
book_title = soup.find('h1').text
book_info = soup.find('ul', class_='biblio-info')
book_pages = book_info.find('span', itemprop='numberOfPages').text
book_ibsn = book_info.find('span', itemprop='isbn').text
book_publication_date = book_info.find('span', itemprop='datePublished').text
book_publisher = book_info.find('span', itemprop='name').text
book_author = soup.find('span', itemprop="author").text
book_cover = soup.find('div', class_='item-img-content').img
book_language = book_info.find_next(string='Language',)
book_format = book_info.find_all(string='Format', )




print('Number of Pages: ' + book_pages.strip())
print('ISBN Number: ' + book_ibsn)
print('Publication Date: ' + book_publication_date)
print('Publisher Name: ' + book_publisher.strip())
print('Author: ' + book_author.strip())
print(book_cover)
print(book_language)
print(book_format)

A uj5u.com user replied:

To get the <span> that corresponds to your label, you can use:

book_info.find_next(string='Language').find_next('span').get_text(strip=True)
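
The same chain can be tried offline against a small piece of markup. The sketch below is only an illustration: SAMPLE_HTML is hypothetical, simplified markup modeled on the Product Details list (not copied from the site), and the regular expression is an added assumption so that the match still works if the label text is surrounded by whitespace.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modeled on the Product Details list.
SAMPLE_HTML = """
<ul class="biblio-info">
  <li><label>Publisher</label> <span itemprop="name">Hodder &amp; Stoughton</span></li>
  <li><label>Language</label> <span>English</span></li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
book_info = soup.find("ul", class_="biblio-info")

# Find the text node "Language"; the regex tolerates surrounding whitespace.
label_text = book_info.find_next(string=re.compile(r"^\s*Language\s*$"))

# From that text node, move forward in document order to the first <span>.
language = label_text.find_next("span").get_text(strip=True) if label_text else None
print(language)
```

If the label is missing, label_text is None and language stays None instead of raising AttributeError.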

A more generic way to get all of these product details could be:

import bs4, re
from urllib.request import urlopen

book_isbn = ("9781399703994")
book_urls = "https://www.bookdepository.com/Enid-Blytons-Christmas-Tales-Enid-Blyton/" + book_isbn

source = urlopen(book_urls).read()
soup = bs4.BeautifulSoup(source,'lxml')

book = {
    'description':soup.find('div', class_='item-excerpt trunc').get_text(strip=True),
    'title':soup.find('h1').text
}

book.update({e.label.text.strip(): re.sub(r'\s+', ' ', e.span.text).strip() for e in soup.select('.biblio-info li')})
book

Output:

{'description': "'A breathtaking memoir...I was so moved by this book.' Oprah'It is startlingly honest and, at times, a jaw-dropping read, charting her rise from poverty and abuse to becoming the first African-American to win the triple crown of an Oscar, Emmy and Tony for acting.' BBC NewsTHE DEEPLY PERSONAL, BRUTALLY HONEST ACCOUNT OF VIOLA'S INSPIRING LIFEIn my book, you will meet a little girl named Viola who ran from her past until she made a life changing decision to stop running forever.This is my story, from a crumbling apartment in Central Falls, Rhode Island, to the stage in New York City, and beyond. This is the path I took to finding my purpose and my strength, but also to finding my voice in a world that didn't always see me.As I wrote Finding Me, my eyes were open to the truth of how our stories are often not given close examination. They are bogarted, reinvented to fit into a crazy, competitive, judgmental world. So I wrote this for anyone who is searching for a way to understand and overcome a complicated past, let go of shame, and find acceptance. For anyone who needs reminding that a life worth living can only be born from radical honesty and the courage to shed facades and be...you.Finding Me is a deep reflection on my past and a promise for my future. My hope is that my story will inspire you to light up your own life with creative expression and rediscover who you were before the world put a label on you.show more",
 'title': 'Finding Me : A Memoir - THE INSTANT SUNDAY TIMES BESTSELLER',
 'Format': 'Hardback | 304 pages',
 'Dimensions': '160 x 238 x 38mm | 520g',
 'Publication date': '26 Apr 2022',
 'Publisher': 'Hodder & Stoughton',
 'Imprint': 'Coronet Books',
 'Publication City/Country': 'London, United Kingdom',
 'Language': 'English',
 'ISBN10': '1399703994',
 'ISBN13': '9781399703994',
 'Bestsellers rank': '31'}
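
The dictionary approach assumes every list item contains both a label and a span. A slightly more defensive variant, sketched here against hypothetical sample markup (SAMPLE_HTML and the helper name parse_details are made up for illustration), skips incomplete rows and collapses whitespace with str.split instead of a regular expression:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the last row has no <label> and should be skipped.
SAMPLE_HTML = """
<ul class="biblio-info">
  <li><label>Format</label> <span>Hardback |
      304 pages</span></li>
  <li><label>Language</label> <span>English</span></li>
  <li><span>Orphan value without a label</span></li>
</ul>
"""

def parse_details(html):
    """Return {label: value} for every row that has both a label and a span."""
    soup = BeautifulSoup(html, "html.parser")
    details = {}
    for li in soup.select(".biblio-info li"):
        label, span = li.find("label"), li.find("span")
        if label is None or span is None:
            continue  # skip rows that do not follow the label/span pattern
        key = label.get_text(strip=True)
        # split() with no arguments collapses newlines and runs of spaces
        details[key] = " ".join(span.get_text().split())
    return details

details = parse_details(SAMPLE_HTML)
print(details)
```

Whether the real page ever contains incomplete rows is not confirmed by the thread; the guard simply avoids an AttributeError if it does.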

A uj5u.com user replied:

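You can check whether the label text equals Language and then print the text of its span. I have also added a better approach that parses the whole Product Details section in a single iteration.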

Check the code given below:

import bs4
from urllib.request import urlopen
import re

book_isbn = ("9781399703994")
book_urls = "https://www.bookdepository.com/Enid-Blytons-Christmas-Tales-Enid-Blyton/" + book_isbn

source = urlopen(book_urls).read()
soup = bs4.BeautifulSoup(source,'lxml')

book_info = soup.find('ul', class_='biblio-info')
lis=book_info.find_all('li')

# Check if the label name is Language and then print the span text
for val in lis:
    label=val.find('label')
    if label.text.strip()=='Language':
        span=val.find('span')
        span_text=(span.text.strip())
        print('Language--> ' + span_text)

# A better approach to get all the Name and value pairs in the Product details section in a single iteration
for val in lis:
    label=val.find('label')
    span=val.find('span')
    span_text=(span.text.strip())
    modified_text = re.sub('\n', ' ', span_text)
    modified_text = re.sub(' +', ' ', modified_text)
    print(label.text.strip() + '--> ' + modified_text)

A uj5u.com user replied:

You can use CSS selectors to pull the required data out of the details section:

import bs4
from urllib.request import urlopen
import re

book_isbn = ("9781399703994")
book_urls = "https://www.bookdepository.com/Enid-Blytons-Christmas-Tales-Enid-Blyton/" + book_isbn
#print(book_urls)
source = urlopen(book_urls).read()
soup = bs4.BeautifulSoup(source,'lxml')
book_description = soup.find('div', class_='item-excerpt trunc')
book_title = soup.find('h1').text
book_info = soup.find('ul', class_='biblio-info')
book_pages = book_info.find('span', itemprop='numberOfPages').text
book_ibsn = book_info.find('span', itemprop='isbn').text
book_publication_date = book_info.find('span', itemprop='datePublished').text
book_publisher = book_info.find('span', itemprop='name').text
book_author = soup.find('span', itemprop="author").text
book_cover = soup.find('div', class_='item-img-content').img.get('src')
book_language =soup.select_one('.biblio-info > li:nth-child(7) span').get_text(strip=True)
book_format = soup.select_one('.biblio-info > li:nth-child(1) span').get_text(strip=True)
book_format = re.sub(r'\s+', ' ', book_format).replace('|', '')


print('Number of Pages: ' + book_pages.strip())
print('ISBN Number: ' + book_ibsn)
print('Publication Date: ' + book_publication_date)
print('Publisher Name: ' + book_publisher.strip())
print('Author: ' + book_author.strip())
print(book_cover)
print(book_language)
print(book_format)

Output:

Number of Pages: 304 pages
ISBN Number: 9781399703994
Publication Date: 26 Apr 2022
Publisher Name: Hodder & Stoughton
Author: Viola Davis
https://d1w7fb2mkkr3kw.cloudfront.net/assets/images/book/lrg/9781/3997/9781399703994.jpg
English
Hardback 304 pages
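
One caveat with positional selectors such as li:nth-child(7): they depend on the row order, which may differ from book to book. The sketch below compares a positional lookup with a lookup by label text on two hypothetical pages (PAGE_A, PAGE_B, and both helper names are invented for illustration and are not taken from the site):

```python
from bs4 import BeautifulSoup

PAGE_A = """<ul class="biblio-info">
<li><label>Format</label> <span>Hardback</span></li>
<li><label>Language</label> <span>English</span></li>
</ul>"""

# Hypothetical second page with an extra row before Language.
PAGE_B = """<ul class="biblio-info">
<li><label>Format</label> <span>Paperback</span></li>
<li><label>Imprint</label> <span>Coronet Books</span></li>
<li><label>Language</label> <span>English</span></li>
</ul>"""

def by_position(html, n):
    """Read the span of the n-th list item, whatever field it holds."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one(f".biblio-info > li:nth-child({n}) span")
    return node.get_text(strip=True) if node else None

def by_label(html, name):
    """Read the span of the list item whose label text equals name."""
    soup = BeautifulSoup(html, "html.parser")
    for li in soup.select(".biblio-info > li"):
        label = li.find("label")
        if label and label.get_text(strip=True) == name:
            span = li.find("span")
            return span.get_text(strip=True) if span else None
    return None

print(by_position(PAGE_A, 2), "/", by_position(PAGE_B, 2))
print(by_label(PAGE_A, "Language"), "/", by_label(PAGE_B, "Language"))
```

On these samples the positional lookup returns the Imprint value for the second page, while the label lookup returns the language for both, so matching on the label is probably the safer choice when scraping many books.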

Tags: Python html web-scraping beautifulsoup