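I'm new to Python and learning as I go. I have made API calls to pull information from databases before, but I am now working with a particular Indian database whose HTML is confusing, and I cannot extract the specific information I am looking for. Basically, my input is a list of links to herb pages like the ones below (only the ID changes):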
http://envis.frlht.org/plantdetails/3315/fd01bd598f0869d65fe5a2861845f9f8
http://envis.frlht.org/plantdetails/2133/fd01bd598f0869d65fe5a2861845f9f9
http://envis.frlht.org/plantdetails/845/fd01bd598f0869d65fe5a2861845f9f10
http://envis.frlht.org/plantdetails/363/fd01bd598f0869d65fe5a2861845f9f11
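When I open each link, I want to extract the "Distribution" details for that herb from the web page. That's all. However, I can't work out which element in the HTML holds those details. I tried a lot before coming here. Could someone help me? Thanks in advance.
Code: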
import json
import os
import time
import urllib.request
from pathlib import Path
from pprint import pprint

import pandas as pd
import requests
from bs4 import BeautifulSoup

user_home = os.path.expanduser('~')
OUTPUT_DIR = os.path.join(user_home, 'vk_frlht')
Path(OUTPUT_DIR).mkdir(parents=True, exist_ok=True)

# Fetch the search page and look for a hidden token field
herb_url = 'http://envis.frlht.org/bot_search'
response = requests.get(herb_url)
soup = BeautifulSoup(response.text, "html.parser")
# find() takes a tag name first; a hidden form field is an <input> tag
token = soup.find('input', {'type': 'hidden', 'name': 'token'})

# Fetch the details page for a single plant
herb_query_url = 'http://envis.frlht.org/plantdetails/3315/fd01bd598f0869d65fe5a2861845f9f8'
response = requests.get(herb_query_url)

# optional code for many links at once
# ids_file is a placeholder name: a text file with one "<id>/<hash>" per line
ids_file = Path(OUTPUT_DIR) / 'frlht_ids.txt'
with open(ids_file, 'r') as f:
    frlht = [line.strip() for line in f if line.strip()]
for line in frlht:
    out = requests.get(f'http://envis.frlht.org/plantdetails/{line}')
# end of the optional code

herb_query_soup = BeautifulSoup(response.text, "html.parser")
text = herb_query_soup.find('div', {'id': 'result-details'})
pprint(text)
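Reply from a uj5u.com user:
This is what the page looks like when it is scraped:

The loading symbol in the middle means the content only appears after the page's JavaScript has run. In other words, the details are not in the initial HTML response but are loaded by JS afterwards. To read them from this page you would need to render it in a browser, for example with Selenium, rather than using BS4 on the raw response.
See the tutorial here to learn how to use it.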
Reply from a uj5u.com user:
Try this:
import requests
from bs4 import BeautifulSoup
from pprint import pprint

plant_ids = ["3315", "2133", "845", "363"]
results = []

for plant_id in plant_ids:
    # The public page URL is sent as the Referer of the data request
    herb_query_url = f"http://envis.frlht.org/plantdetails/{plant_id}/fd01bd598f0869d65fe5a2861845f9f8"
    headers = {
        "Referer": herb_query_url,
    }
    # The plant details are requested from this separate URL
    response = requests.get(
        f"http://envis.frlht.org/bot_search/plantdetails/plantid/{plant_id}/nocache/0.7763327765552295/referredfrom/extplantdetails",
        headers=headers,
    )
    herb_query_soup = BeautifulSoup(response.text, "html.parser")
    result = herb_query_soup.find_all("div", {"class": "initbriefdescription"})
    for r in result:
        # Split "Label: value" once; skip any div that has no colon
        key, sep, value = r.text.partition(":")
        if sep:
            results.append({key.strip(): value.strip()})

pprint(results)
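Since the question only asks for the Distribution field, the label/value strings can be folded into one dictionary per plant and looked up by label. This is a minimal sketch rather than part of the original answer: `fields_from_texts` and `sample_texts` are hypothetical names, and the sample strings are assumptions standing in for the `r.text` values of the `initbriefdescription` divs.

```python
def fields_from_texts(texts):
    """Fold 'Label: value' strings into a single {label: value} dict."""
    fields = {}
    for text in texts:
        # partition() never raises, even when a string has no colon.
        label, sep, value = text.partition(":")
        if sep:
            fields[label.strip()] = value.strip()
    return fields


# Hypothetical sample input standing in for [r.text for r in result].
sample_texts = [
    "Family: FABACEAE",
    "Used in: Ayurveda, Siddha, Unani, Folk, Sowa Rigpa",
    "Distribution: This species is native to India.",
    "text without a label",
]

fields = fields_from_texts(sample_texts)
# .get() yields None when a plant has no Distribution entry.
print(fields.get("Distribution"))
```

Using `.get("Distribution")` instead of indexing means a plant without that entry gives `None` rather than raising `KeyError`.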
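Reply from a uj5u.com user:
The information is fetched from a different URL that is derived from the URL you have. You first need to build that URL (it can be found in the browser) and then request it.
Try the following: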
import requests
from bs4 import BeautifulSoup

urls = [
    "http://envis.frlht.org/plantdetails/3315/fd01bd598f0869d65fe5a2861845f9f8",
    "http://envis.frlht.org/plantdetails/2133/fd01bd598f0869d65fe5a2861845f9f9",
    "http://envis.frlht.org/plantdetails/845/fd01bd598f0869d65fe5a2861845f9f10",
    "http://envis.frlht.org/plantdetails/363/fd01bd598f0869d65fe5a2861845f9f11",
]

for url in urls:
    print('\n', url)
    # url.split('/') gives ['http:', '', 'envis.frlht.org', 'plantdetails', <id>, <hash>]
    url_split = url.split('/')
    url_details = f"http://envis.frlht.org/bot_search/plantdetails/plantid/{url_split[4]}/nocache/{url_split[5]}/referredfrom/extplantdetails"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36',
        'Referer': url,
    }
    req = requests.get(url_details, headers=headers)
    soup = BeautifulSoup(req.content, "html.parser")
    # Each field of the plant description sits in its own div
    for div in soup.find_all('div', class_="initbriefdescription"):
        print(" ", div.get_text(strip=True))
For your 4 IDs, it prints:
 http://envis.frlht.org/plantdetails/3315/fd01bd598f0869d65fe5a2861845f9f8
  Accepted Name:Amaranthus hybridusL. subsp.cruentusvar.paniculatusTHELL.
  Family:AMARANTHACEAE
  Synonyms:Amaranthus paniculatusL.
  Used in:Ayurveda, Siddha, Folk
  Distribution:This species is globally distributed in Africa, Asia and India. It is said to be cultivated as a leafy vegetable in Maharashtra, Karnataka (Coorg) and on the Nilgiri hills of Tamil Nadu. It is also found as an escape.

 http://envis.frlht.org/plantdetails/2133/fd01bd598f0869d65fe5a2861845f9f9
  Accepted Name:Triticum aestivumL.
  Family:POACEAE
  Synonyms:Triticum sativumLAM.Triticum vulgareWILL.
  Used in:Ayurveda, Siddha, Unani, Folk, Chinese, Modern

 http://envis.frlht.org/plantdetails/845/fd01bd598f0869d65fe5a2861845f9f10
  Accepted Name:Dolichos biflorusL.
  Family:FABACEAE
  Synonyms:Dolichos uniflorusLAMK.Macrotyloma uniflorum(LAM.) VERDC.
  Used in:Ayurveda, Siddha, Unani, Folk, Sowa Rigpa
  Distribution:This species is native to India, globally distributed in the Paleotropics. Within India, it occurs all over up to an altitude of 1500 m. It is an important pulse crop particularly in Madras, Mysore, Bombay and Hyderabad.

 http://envis.frlht.org/plantdetails/363/fd01bd598f0869d65fe5a2861845f9f11
  Accepted Name:Brassica oleraceaL.
  Family:BRASSICACEAE
  Used in:Ayurveda, Siddha
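The URL-building step in the script above can also be written without fixed list indices, so that a link with an unexpected shape fails loudly instead of producing a wrong request. This is only a sketch: `build_details_url` is a hypothetical helper, and the `/plantdetails/<id>/<hash>` layout and the `bot_search/...` target pattern are assumed from the URLs in this thread, not from any documentation of the site.

```python
from urllib.parse import urlsplit


def build_details_url(page_url):
    """Derive the data URL from a plantdetails page URL (assumed layout)."""
    parts = urlsplit(page_url)
    segments = [s for s in parts.path.split("/") if s]
    # Expected path: /plantdetails/<plant_id>/<hash>
    if len(segments) != 3 or segments[0] != "plantdetails":
        raise ValueError(f"unexpected URL layout: {page_url}")
    _, plant_id, url_hash = segments
    return (
        f"{parts.scheme}://{parts.netloc}/bot_search/plantdetails"
        f"/plantid/{plant_id}/nocache/{url_hash}/referredfrom/extplantdetails"
    )


details_url = build_details_url(
    "http://envis.frlht.org/plantdetails/845/fd01bd598f0869d65fe5a2861845f9f10"
)
print(details_url)
```

The returned string can be passed to `requests.get` together with the `Referer` header, exactly as in the loop above.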
Please credit the source when reposting. Link to this article: https://www.uj5u.com/qiye/459898.html
