Scrape table data every 5 minutes on a Python production server and insert it into a list

I am trying to extract data from a table with 382 rows on this site: http://www.dsebd.org/latest_share_price_scroll_l.php

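I am using BeautifulSoup for the scraping, and I want this program to run on a schedule every 5 minutes. I am trying to insert the values into a JSON list containing all 382 rows (excluding the header and the first column, which holds the row number). Here is my code: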

import requests
from bs4 import BeautifulSoup


def convert_to_html5lib(URL, my_list):
    r = requests.get(URL)
    # Create a BeautifulSoup object
    soup = BeautifulSoup(r.content, 'html5lib')
    soup.prettify()

    # result = soup.find_all("div")[1].get_text()
    result = soup.find('table', {'class': 'table table-bordered background-white shares-table fixedHeader'}).get_text()
    # result = result.find('tbody')
    print(result)
    for item in result.split():
        my_list.append(item)
    print(my_list)

    # return


details_list = []
convert_to_html5lib("http://www.dsebd.org/latest_share_price_scroll_l.php", details_list)
counter = 0
while counter < len(details_list):
    if counter == 0:
        company_name = details_list[counter]
        counter += 1
    last_trading_price = details_list[counter]
    counter += 1
    last_change_price_in_value = details_list[counter]
    counter += 1
schedule.every(5).minutes.do(scrape_stock)

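But I am not getting all the values in the table. I want all the data from the 382 rows as a list so that I can save it to a database later. Instead I get no results, and the scheduler does not work either. What am I doing wrong here?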

A uj5u.com user replied:

BeautifulSoup is enough for this task.

A few things are wrong here:

Only one row is being scraped.
The Schedule library is not being used correctly. (Reference: https://www.geeksforgeeks.org/python-schedule-library/)
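
On the second point, the underlying pattern is: run the job, then keep a loop alive that waits until the next run is due. The following is a minimal standard-library sketch of that pattern, not the Schedule library's API. The names `run_every`, `max_runs`, and the injectable `sleep` parameter are hypothetical, introduced here only so that the loop can be stopped and tried out quickly.

```python
import time


def run_every(interval_seconds, job, max_runs=None, sleep=time.sleep):
    """Call job() right away, then again every interval_seconds.

    max_runs and sleep are illustrative hooks (assumptions of this sketch):
    max_runs stops the loop after that many calls, and sleep can be swapped
    for a fake so the loop does not actually wait.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        started = time.monotonic()
        job()
        runs += 1
        if runs == max_runs:
            break
        # Subtract the time the job itself took so that runs stay
        # roughly interval_seconds apart.
        elapsed = time.monotonic() - started
        sleep(max(0.0, interval_seconds - elapsed))
    return runs


# Demonstration with a fake sleep that only records the requested delays.
calls = []
naps = []
run_every(300, lambda: calls.append("scraped"), max_runs=3, sleep=naps.append)
print(len(calls), len(naps))
```

In a long-running service the job would be the scraping function, and `max_runs` would be left as `None` so that the loop never exits.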

Here is your code with the changes applied:

import schedule
import time
from bs4 import BeautifulSoup
import requests

def convert_to_html5lib(url, details_list):
    # Make a GET request to fetch the raw HTML content
    html_content = requests.get(url).text
    # Parse the HTML content
    soup = BeautifulSoup(html_content, "lxml")
    # Extract the table from the web page
    table = soup.find("table", {"class": "table table-bordered background-white shares-table fixedHeader"})
    rows = table.find_all('tr')
    for row in rows:
        cols = row.find_all('td')
        # Drop the first cell of the row (the row number)
        cols = [x.text.strip() for x in cols[1:]]
        details_list.append(cols)
        print(cols)

details_list = []
counter = 0
url="http://www.dsebd.org/latest_share_price_scroll_l.php"
# schedule job for every 5 mins   
schedule.every(5).minutes.do(convert_to_html5lib,url,details_list)
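# Same logic as in your code. Note that details_list is still empty at this
# point, because the scheduled job has not run yet.
while counter < len(details_list):
    if counter == 0:
        company_name = details_list[counter]
        counter += 1
    last_trading_price = details_list[counter]
    counter += 1
    last_change_price_in_value = details_list[counter]
    counter += 1
# Keep the process alive so the scheduler can run the job every 5 minutes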
while True:
    schedule.run_pending()
    time.sleep(5)
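
Each entry that `convert_to_html5lib` appends to `details_list` is a list of cell strings for one row, and any row without `<td>` cells (a header row built from `<th>` cells, for example) yields an empty list. A minimal sketch of turning those lists into labelled records might look like the following. The helper names and the column names `trading_code` and `ltp` are assumptions made for illustration, and only the first two columns are labelled; the sample values are taken from the output shown further down this page.

```python
def to_number(text):
    """Return text as a float when it parses as one, otherwise unchanged."""
    try:
        return float(text.replace(",", ""))
    except ValueError:
        return text


def rows_to_records(rows, columns):
    """Pair each non-empty row with the given column names."""
    records = []
    for row in rows:
        if not row:
            # Rows without <td> cells come through as empty lists; skip them.
            continue
        record = dict(zip(columns, row))
        # Keep the first column as text and convert the rest where possible.
        for name in columns[1:]:
            if name in record:
                record[name] = to_number(record[name])
        records.append(record)
    return records


# Hypothetical column labels; the real table has more columns than this.
COLUMNS = ("trading_code", "ltp")
sample = [[], ["1JANATAMF", "6.7"], ["AAMRANET", "52.4"]]
records = rows_to_records(sample, COLUMNS)
print(records)
```

Because `zip` stops at the shorter sequence, any cells beyond the labelled columns are dropped; extending `COLUMNS` to cover the full row would keep them.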

A uj5u.com user replied:

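You can start by looking at my code, which gets all the data in the table. Since the data on this page is updated constantly, I think Selenium is the better choice.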

from bs4 import BeautifulSoup
import requests
from selenium import webdriver
import pandas as pd
url = "https://www.dsebd.org/latest_share_price_scroll_l.php"

driver = webdriver.Firefox(executable_path="")  # Insert your webdriver path please

driver.get(url)

html = driver.page_source
soup = BeautifulSoup(html)
table = soup.find_all('table', {'class': 'table table-bordered background-white shares-table fixedHeader'})

df = pd.read_html(str(table))

print(df)

Output:

[     Unnamed: 0  Unnamed: 1  Unnamed: 2  ...  Unnamed: 8  Unnamed: 9  Unnamed: 10
0             1   1JANATAMF         6.7  ...         137       4.022       605104
1             2  1STPRIMFMF        21.5  ...         215       5.193       243258
2             3    AAMRANET        52.4  ...        1227      65.793      1264871
3             4   AAMRATECH        31.5  ...         675      37.861      1218353
4             5    ABB1STMF         5.9  ...          57       2.517       428672
..          ...         ...         ...  ...         ...         ...          ...
377         378  WMSHIPYARD        11.2  ...         835      14.942      1374409
378         379         YPL        11.3  ...         247       4.863       434777
379         380  ZAHEENSPIN         8.8  ...         174       2.984       342971
380         381    ZAHINTEX         7.7  ...         111       1.301       174786
381         382  ZEALBANGLA       120.0  ...         102       0.640         5271

[382 rows x 11 columns]]
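
The question also mentions saving the rows to a database later. As a rough sketch of that step, assuming SQLite through Python's built-in `sqlite3` module, the rows could be stored as shown below. The table name `share_prices`, its two columns, and the `save_rows` helper are hypothetical, and the two sample rows reuse values from the output above.

```python
import sqlite3


def save_rows(connection, rows):
    """Insert (trading_code, ltp) pairs into a hypothetical share_prices table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS share_prices (trading_code TEXT, ltp REAL)"
    )
    # Parameter placeholders let sqlite3 handle quoting of the values.
    connection.executemany(
        "INSERT INTO share_prices (trading_code, ltp) VALUES (?, ?)", rows
    )
    connection.commit()


# An in-memory database keeps the sketch self-contained;
# a file path would be used for real storage.
connection = sqlite3.connect(":memory:")
save_rows(connection, [("1JANATAMF", 6.7), ("AAMRANET", 52.4)])
stored = connection.execute(
    "SELECT trading_code, ltp FROM share_prices ORDER BY trading_code"
).fetchall()
print(stored)
```

Calling `save_rows` once per scheduled scrape would append a new batch each time, so a timestamp column would likely be needed to tell the batches apart.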

Please credit the source when reposting. Link to this article: https://www.uj5u.com/caozuo/377777.html
