I have already scraped the titles and website links, but I can't extract the phone numbers and addresses. How can I get them?
This is what I have:
import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.constructionplacements.com/top-construction-companies-in-india/'
req = requests.get(url)
soup = BeautifulSoup(req.content, 'lxml')

# Company headings are <h4> tags whose text starts with a number and a dot, e.g. "1."
for h4 in soup.find_all(lambda tag: tag.name == 'h4' and re.search(r'^\d+\.', tag.text)):
    title = h4.text
    # The first link after the heading is the company website
    website = h4.find_next('a')['href']
A uj5u.com user replied:
You might want to try this:
Note: not all companies have a phone number.
import requests
from bs4 import BeautifulSoup


def extractor(search_for: str) -> list:
    # `soup` is the module-level list of <p> tags built below
    return [
        p.getText() for p in soup if p.getText(strip=True).startswith(search_for)
    ]


url = 'https://www.constructionplacements.com/top-construction-companies-in-india/'
# Select every <p> inside the element with class "post"
soup = BeautifulSoup(requests.get(url).text, "lxml").select(".post p")

phone_numbers = extractor("Phone")
addresses = extractor("Address")
print(len(phone_numbers), len(addresses))
Output:
62 70
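What this does:
def extractor(search_for: str) -> list:
    return [
        p.getText() for p in soup if p.getText(strip=True).startswith(search_for)
    ]
It iterates over all <p> elements in the .post section. If the stripped text of a paragraph starts with the given phrase search_for, it keeps that element p and extracts its text value.
The same logic works for paragraphs that start with either Phone or Address.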
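Since the two lists have different lengths (62 phone numbers against 70 addresses), they cannot be paired with the titles by index. One possible way around that, sketched below, is to walk from each <h4> heading to the paragraphs that follow it and stop at the next heading, so a missing phone number becomes None. This is only a sketch: the sample markup, the helper names field_after and collect, and the assumption that the headings and paragraphs are siblings are all hypothetical and may not match the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page's structure may differ.
SAMPLE_HTML = """
<div class="post">
  <h4>1. Alpha Builders</h4>
  <p><a href="https://alpha.example">Website</a></p>
  <p>Address: 1 Example Road, Mumbai</p>
  <p>Phone: 000-111-2222</p>
  <h4>2. Beta Infra</h4>
  <p><a href="https://beta.example">Website</a></p>
  <p>Address: 2 Sample Street, Delhi</p>
</div>
"""


def field_after(heading, label):
    """Return the text of the first <p> after `heading` that starts with
    `label`, stopping at the next <h4>. Returns None if there is none."""
    for sibling in heading.find_next_siblings():
        if sibling.name == "h4":
            break  # reached the next company
        text = sibling.get_text(strip=True)
        if sibling.name == "p" and text.startswith(label):
            return text
    return None


def collect(html):
    """Build one record per <h4> heading inside the .post element."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for h4 in soup.select(".post h4"):
        records.append({
            "title": h4.get_text(strip=True),
            "phone": field_after(h4, "Phone"),
            "address": field_after(h4, "Address"),
        })
    return records


records = collect(SAMPLE_HTML)
for record in records:
    print(record)
```

Each record keeps the title, phone and address of one company together, and the second sample company simply gets None for its phone number.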
