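I am trying to extract data from this site, which has a table with 382 rows. This is the site:
I am using BeautifulSoup for scraping, and I want this program to run on a schedule every 5 minutes. I am trying to insert the values into a JSON list containing the 382 rows (excluding the header and the first column, which holds the row numbers). Here is my code: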
import requests
from bs4 import BeautifulSoup

def convert_to_html5lib(URL, my_list):
    r = requests.get(URL)
    # Create a BeautifulSoup object
    soup = BeautifulSoup(r.content, 'html5lib')
    soup.prettify()
    # result = soup.find_all("div")[1].get_text()
    result = soup.find('table', {'class': 'table table-bordered background-white shares-table fixedHeader'}).get_text()
    # result = result.find('tbody')
    print(result)
    for item in result.split():
        my_list.append(item)
    print(my_list)
    # return

details_list = []
convert_to_html5lib("http://www.dsebd.org/latest_share_price_scroll_l.php", details_list)
counter = 0
while counter < len(details_list):
    if counter == 0:
        company_name = details_list[counter]
        counter += 1
        last_trading_price = details_list[counter]
        counter += 1
        last_change_price_in_value = details_list[counter]
        counter += 1
schedule.every(5).minutes.do(scrape_stock)
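But I am not getting all the values from the table. I want all the data from the 382 table rows as a list so that I can save it to a database later. I am not getting any results, and the scheduler is not working either. What am I doing wrong here?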
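Reply from a uj5u.com user:
You can use BeautifulSoup for this.
A few things are wrong here:
- Only 1 row is being scraped.
- The schedule library is not being used correctly. (Reference: https://www.geeksforgeeks.org/python-schedule-library/)
Here is your code with the changes applied: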
import schedule
import time
from bs4 import BeautifulSoup
import requests

def convert_to_html5lib(url, details_list):
    # Make a GET request to fetch the raw HTML content
    html_content = requests.get(url).text
    # Parse the HTML content (the "lxml" parser requires the lxml package)
    soup = BeautifulSoup(html_content, "lxml")
    # Extract the table from the web page
    table = soup.find("table", {"class": "table table-bordered background-white shares-table fixedHeader"})
    rows = table.find_all('tr')
    for row in rows:
        cols = row.find_all('td')
        # Remove the first element (the row number) from the row
        cols = [x.text.strip() for x in cols[1:]]
        details_list.append(cols)
        print(cols)
    # return

details_list = []
counter = 0
url = "http://www.dsebd.org/latest_share_price_scroll_l.php"
# Schedule the job to run every 5 minutes
schedule.every(5).minutes.do(convert_to_html5lib, url, details_list)
# Same as your logic (details_list is still empty here, because the job has not run yet)
while counter < len(details_list):
    if counter == 0:
        company_name = details_list[counter]
        counter += 1
        last_trading_price = details_list[counter]
        counter += 1
        last_change_price_in_value = details_list[counter]
        counter += 1
# Check for pending jobs every 5 seconds; the job itself runs every 5 minutes
while True:
    schedule.run_pending()
    time.sleep(5)
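The scraped rows still have to be turned into the JSON list the question asks for. The following is only a minimal sketch, assuming `details_list` holds rows in the shape produced by the code above, where the header row comes through as an empty list because it has no `td` cells. The helper `rows_to_json` and the names in `FIELDS` are hypothetical; they are borrowed from the question's variable names and may not match the real column order.

```python
import json

# Hypothetical field names for illustration (an assumption, not read from the page)
FIELDS = ["company_name", "last_trading_price"]

def rows_to_json(rows, fields=FIELDS):
    """Convert scraped rows (lists of strings) into a JSON array of objects."""
    records = []
    for row in rows:
        # Skip rows without data cells, such as the header row
        if not row:
            continue
        # zip stops at the shorter sequence, so extra cells are dropped
        records.append(dict(zip(fields, row)))
    return json.dumps(records)

# Sample rows in the shape produced by the scraper
sample = [[], ["1JANATAMF", "6.7"], ["YPL", "11.3"]]
print(rows_to_json(sample))
```

Because the scheduled job appends to the same list on every run, calling `details_list.clear()` at the start of each run would keep rows from earlier runs from piling up.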
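Reply from a uj5u.com user:
You can look at my code first; it gets all the data in the table. Since the data on this page is updated constantly, I think it is better to use selenium.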
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
import pandas as pd

url = "https://www.dsebd.org/latest_share_price_scroll_l.php"
# Insert your webdriver path here (newer Selenium versions take a Service object instead of executable_path)
driver = webdriver.Firefox(executable_path="")
driver.get(url)
html = driver.page_source
# Close the browser once the page source has been captured
driver.quit()
soup = BeautifulSoup(html)
table = soup.find_all('table', {'class': 'table table-bordered background-white shares-table fixedHeader'})
# read_html returns a list of DataFrames, one per table found
df = pd.read_html(str(table))
print(df)
Output:
[ Unnamed: 0 Unnamed: 1 Unnamed: 2 ... Unnamed: 8 Unnamed: 9 Unnamed: 10
0 1 1JANATAMF 6.7 ... 137 4.022 605104
1 2 1STPRIMFMF 21.5 ... 215 5.193 243258
2 3 AAMRANET 52.4 ... 1227 65.793 1264871
3 4 AAMRATECH 31.5 ... 675 37.861 1218353
4 5 ABB1STMF 5.9 ... 57 2.517 428672
.. ... ... ... ... ... ... ...
377 378 WMSHIPYARD 11.2 ... 835 14.942 1374409
378 379 YPL 11.3 ... 247 4.863 434777
379 380 ZAHEENSPIN 8.8 ... 174 2.984 342971
380 381 ZAHINTEX 7.7 ... 111 1.301 174786
381 382 ZEALBANGLA 120.0 ... 102 0.640 5271
[382 rows x 11 columns]]
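The question also mentions saving the rows to a database later. The following is only a minimal sketch using Python's built-in `sqlite3` module, assuming each row has already been reduced to a trading code and a last trading price. The table name `share_prices`, its column names, and the helper `save_rows` are hypothetical.

```python
import sqlite3

def save_rows(conn, rows):
    """Insert (trading_code, last_trading_price) pairs and return the table's row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS share_prices ("
        "trading_code TEXT, last_trading_price REAL, "
        "scraped_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    # Parameterized query: values are passed separately from the SQL text
    conn.executemany(
        "INSERT INTO share_prices (trading_code, last_trading_price) VALUES (?, ?)",
        rows,
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM share_prices").fetchone()[0]

# In-memory database for demonstration; pass a file path to persist the data
conn = sqlite3.connect(":memory:")
rows = [("1JANATAMF", 6.7), ("ZEALBANGLA", 120.0)]  # two rows from the output above
print(save_rows(conn, rows))
```

With the pandas DataFrame from this answer, `DataFrame.to_sql` would be an alternative to writing the inserts by hand.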
