I am trying to extract data from the following HTML and organize the extracted output:
html_doc = """
<html>
<body>
<ul >
<li >
<div >Birds Toys</div>
<div >Toys belonging to the Bird Category</div>
<ul >
<li >
<div >
<span >Eagle</span>
<span >$40.00</span>
</div>
<p >Eagle is the national bird of the US.</p>
</li>
<li >
<div >
<span >Parrot</span>
<span >$14.00</span>
</div>
<p >Parrot is found in tropical and subtropical region.</p>
</li>
<li >
<div >
<span >Owls</span>
<span >$23.00</span>
</div>
<p >Owls are nocturnal.</p>
</li>
</ul>
<ul >
<li >
<div >
<span >Kingfisher</span>
<span >$13.00</span>
</div>
<p >Kigfisher hunts in the water</p>
</li>
<li >
<div >
<span >Quail</span>
<span >$22.00</span>
</div>
<p ></p>
</li>
</ul>
</li>
</ul>
<ul >
<li >
<div >Reptiles Toys</div>
<div >Toys belonging to Reptiles Category</div>
<ul >
<li >
<div >
<span >Snake</span>
<span >$7.00</span>
</div>
<p >Snakes can be poisonous.</p>
</li>
</ul>
<ul >
<li >
<div >
<span >Lizard</span>
<span >$7.00</span>
</div>
<p >Lizards are found both at homes and in jungle</p>
</li>
</ul>
</li>
</ul>
<ul >
<li >
<div >Germs Toys</div>
<div >Toys that belong to germs category</div>
<ul >
<li >
<div >
<span >Bacteria</span>
<span >$12.95</span>
</div>
<p >Bacteria can cause tuberclausis</p>
</li>
</ul>
<ul >
<li >
<div >
<span >Protozoa</span>
<span >$11.95</span>
</div>
<p ></p>
</li>
</ul>
<ul >
<li >
<div >
<span >Virus</span>
<span >$12.95</span>
</div>
<p >Viruses are known to cause Corona, Aids, etc.</p>
</li>
</ul>
</li>
</ul>
</body>
</html>
"""
I was able to extract the div-class, span-class, and p-class combinations successfully with the following code:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")
with open("output.txt", "w") as output:
    # CATEGORY: find all category-heading div elements
    divitemscatg = soup.find_all('div', {'class': 'h4 category-name section-title'})
    linesdivitemscatg = [div.get_text() for div in divitemscatg]
    print(linesdivitemscatg)
    # ITEM TITLE: find all item-title span elements
    spansitemtitle = soup.find_all('span', {'class': 'item-title'})
    linesitemtitle = [span.get_text() for span in spansitemtitle]
    print(linesitemtitle)
    # ITEM PRICE: find all item-price span elements
    spansitemprice = soup.find_all('span', {'class': 'item-price'})
    linesitemprice = [span.get_text() for span in spansitemprice]
    print(linesitemprice)
    # DESC: find all description p elements
    spansitemdesc = soup.find_all('p', {'class': 'description'})
    linesitemdesc = [p.get_text() for p in spansitemdesc]
    print(linesitemdesc)
The output I get is:
['Birds Toys', 'Reptiles Toys', 'Germs Toys']
['Eagle', 'Parrot', 'Owls', 'Kingfisher', 'Quail', 'Snake', 'Lizard', 'Bacteria', 'Protozoa', 'Virus']
['$40.00', '$14.00', '$23.00', '$13.00', '$22.00', '$7.00', '$7.00', '$12.95', '$11.95', '$12.95']
['Eagle is the national bird of the US.', 'Parrot is found in tropical and subtropical region.', 'Owls are nocturnal.', 'Kigfisher hunts in the water', '', 'Snakes can be poisonous.', 'Lizards are found both at homes and in jungle', 'Bacteria can cause tuberclausis', '', 'Viruses are known to cause Corona, Aids, etc.']
But I need the output organized differently, like this:
Birds Toys|Eagle|$40.00|Eagle is the national bird of the US.
Birds Toys|Parrot|$14.00|Parrot is found in tropical and subtropical region.
Birds Toys|Owls|$23.00|Owls are nocturnal.
Birds Toys|Kingfisher|$13.00|Kigfisher hunts in the water
Birds Toys|Quail|$22.00|
Reptiles Toys|Snake|$7.00|Snakes can be poisonous.
Reptiles Toys|Lizard|$7.00|Lizards are found both at homes and in jungle
Germs Toys|Bacteria|$12.95|Bacteria can cause tuberclausis
Germs Toys|Protozoa|$11.95|
Germs Toys|Virus|$12.95|Viruses are known to cause Corona, Aids, etc.
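What changes are needed in the code above to produce the second format? I have not been able to arrange the content in the required layout.
Thanks in advance.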
A uj5u.com user replied:
You can get there this way: select each menu item, find the category heading that precedes it, and add that heading to the item's row:
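from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")
with open("output.txt", "w") as output:
    for item in soup.select('.menu-items'):
        # find_previous walks backwards in document order to the nearest category heading
        data = [
            item.find_previous('div', {'class': 'h4'}).text,
            item.select_one('.item-title').text,
            item.select_one('.item-price').text,
            item.select_one('.description').text
        ]
        output.write('|'.join(data) + '\n')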
Output
Birds Toys|Eagle|$40.00|Eagle is the national bird of the US.
Birds Toys|Parrot|$14.00|Parrot is found in tropical and subtropical region.
Birds Toys|Owls|$23.00|Owls are nocturnal.
Birds Toys|Kingfisher|$13.00|Kigfisher hunts in the water
Birds Toys|Quail|$22.00|
Reptiles Toys|Snake|$7.00|Snakes can be poisonous.
Reptiles Toys|Lizard|$7.00|Lizards are found both at homes and in jungle
Germs Toys|Bacteria|$12.95|Bacteria can cause tuberclausis
Germs Toys|Protozoa|$11.95|
Germs Toys|Virus|$12.95|Viruses are known to cause Corona, Aids, etc.
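For anyone who wants to try the idea in isolation, here is a minimal, self-contained sketch of the same "find the preceding category" approach. The HTML shown in the question carries no class attributes, so the class names in the trimmed sample below (`category-name` on the heading `div`, `menu-items` on each item's `li`, and so on) are assumptions inferred from the selectors used in the code above. The `text_or_empty` helper is hypothetical and only there to avoid an `AttributeError` when an item has no matching element.

```python
from bs4 import BeautifulSoup

# Trimmed sample. The class names are ASSUMPTIONS inferred from the selectors
# used in the thread, not copied from the original page.
SAMPLE_HTML = """
<ul class="menu">
  <li>
    <div class="h4 category-name section-title">Birds Toys</div>
    <ul>
      <li class="menu-items">
        <div><span class="item-title">Eagle</span><span class="item-price">$40.00</span></div>
        <p class="description">Eagle is the national bird of the US.</p>
      </li>
      <li class="menu-items">
        <div><span class="item-title">Quail</span><span class="item-price">$22.00</span></div>
        <p class="description"></p>
      </li>
    </ul>
  </li>
</ul>
<ul class="menu">
  <li>
    <div class="h4 category-name section-title">Reptiles Toys</div>
    <ul>
      <li class="menu-items">
        <div><span class="item-title">Snake</span><span class="item-price">$7.00</span></div>
      </li>
    </ul>
  </li>
</ul>
"""


def text_or_empty(parent, selector):
    """Return the stripped text of the first match, or '' when nothing matches."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else ""


def extract_rows(html):
    """Build one [category, title, price, description] row per menu item."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select(".menu-items"):
        # Nearest category heading that appears before this item in the document
        category = item.find_previous("div", class_="category-name")
        rows.append([
            category.get_text(strip=True) if category is not None else "",
            text_or_empty(item, ".item-title"),
            text_or_empty(item, ".item-price"),
            text_or_empty(item, ".description"),
        ])
    return rows


rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print("|".join(row))
```

Note that `find_previous` depends on document order: it works here because each category heading comes before its items. If a page placed the heading after the items, a different lookup (for example `find_parent` on the enclosing category `li`) would probably be needed.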
