Selecting values from each column separately

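I need to write a script that sums the values in each column (each column is a separate day). In addition, I want to split the values into planned (blue) and unplanned (red). In the HTML, I found that unplanned values have the class name "colBox cal-unplanned" and planned values have the class name "colBox cal-planned".

My code: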

import pandas as pd
import requests
from bs4 import BeautifulSoup

URL = 'http://gpi.tge.pl/zestawienie-ubytkow' 
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

# Here I tried to convert the data into a dataframe, but then you don't know which values are planned and which are unplanned

table = soup.find_all('table')
df = pd.read_html(str(table),header=2)[0]

# Here the values are correct, but they are collected from the whole table 
total = 0  # named "total" so the built-in sum() is not shadowed
for td in soup.find_all('td', class_='colBox cal-unplanned'):
    val = int(td.text)
    total += val  # accumulate instead of overwriting
print(total)

for tr in soup.find_all('td', class_='colBox cal-planned'):
    print(tr.text)

This is my problem: how do I select the values from each column separately?

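uj5u.com user reply:

So, if I understand correctly, you want to work with a single column of the DataFrame? You can use df['column_name'] to access a column of df, and then filter that column to get the values you need, for example: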

df['column_name'] == filter_value

Then again, I'm not sure I understood your question. This approach has helped me a lot with selecting values from DataFrames.
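Note that the comparison on its own returns a boolean Series (a mask), not the filtered rows. Here is a minimal sketch of applying such a mask. The column names `unit`, `kind` and `mw` and all the values are made up for illustration and do not come from the scraped page.

```python
import pandas as pd

# Toy data standing in for a scraped table; names and values are hypothetical.
df = pd.DataFrame({
    "unit": ["A", "B", "C"],
    "kind": ["planned", "unplanned", "planned"],
    "mw": [100, 50, 25],
})

# The comparison yields one True/False per row.
mask = df["kind"] == "planned"

# Indexing the DataFrame with the mask keeps only the matching rows.
planned = df[mask]
planned_total = planned["mw"].sum()
print(planned_total)
```

The same pattern only helps with the original question if the planned/unplanned distinction is available as a column, which is not the case after a plain read_html().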

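uj5u.com user reply:

Not sure if there is a better way, but you can iterate through the table and store the planned and unplanned values separately under a key named after the column. Then sum those values and convert the resulting dictionary into a DataFrame.

But you are right: with .read_html() you can no longer tell which values are planned and which are unplanned.

This works, but I'm not sure how robust it is for your case.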

import pandas as pd
import requests
from bs4 import BeautifulSoup

URL = 'http://gpi.tge.pl/zestawienie-ubytkow' 
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

table = soup.find('table')

data = {}
headers = [x.text.strip() for x in table.find_all('tr')[2].find_all('th')]
for header in headers:
    data[header] = {'planned':[],
                    'unplanned':[]}

rows = table.find_all('tr')[3:]
for row in rows:
    tds = row.find_all('td')[3:len(headers) + 3]
    for idx, value in enumerate(tds):
        if value.has_attr("class"):
            if 'cal-planned' in value['class']:
                data[headers[idx]]['planned'].append(int(value.text.strip()))
            elif 'cal-unplanned' in value['class']:
                data[headers[idx]]['unplanned'].append(int(value.text.strip()))


sum_of_columns = {}
for col, values in data.items():
    planned_sum = sum(values['planned'])
    unplanned_sum = sum(values['unplanned'])
    
    sum_of_columns[col] = {'planned':planned_sum,
                           'unplanned':unplanned_sum}

   
df = pd.DataFrame.from_dict(sum_of_columns, orient="columns")

Output:

print(df.to_string())
           Cz 14  Pt 15  So 16  N 17  Pn 18  Wt 19  Śr 20  Cz 21  Pt 22  So 23  N 24  Pn 25  Wt 26  Śr 27
planned     8808   8301   7750  6863   6069   6199   6069   5627   5627   5695  5695   5235   5235   5376
unplanned   2320   2020   2313  2783    950    950    950    950    950    950   950    910    910    910
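To try the class-based idea without hitting the live site, here is a small offline sketch. The inline HTML is a hypothetical miniature of the table (two day columns, no leading label cells and no colspan), so the column matching is simplified to a positional zip. The real page may need the slicing used above.

```python
from collections import defaultdict

from bs4 import BeautifulSoup

# Hypothetical miniature of the page's table; the numbers are invented.
html = """
<table>
  <tr><th>Cz 14</th><th>Pt 15</th></tr>
  <tr><td class="colBox cal-planned">10</td><td class="colBox cal-unplanned">5</td></tr>
  <tr><td class="colBox cal-unplanned">3</td><td class="colBox cal-planned">7</td></tr>
  <tr><td class="colBox">0</td><td class="colBox cal-planned">1</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table").find_all("tr")
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]

# totals[day][kind] accumulates the sum for that column and category.
totals = defaultdict(lambda: {"planned": 0, "unplanned": 0})
for row in rows[1:]:
    # Pair each cell with its column header by position.
    for day, td in zip(headers, row.find_all("td")):
        classes = td.get("class") or []  # bs4 returns class as a list of tokens
        if "cal-planned" in classes:
            totals[day]["planned"] += int(td.get_text(strip=True))
        elif "cal-unplanned" in classes:
            totals[day]["unplanned"] += int(td.get_text(strip=True))

print(dict(totals))
```

Accumulating the sums directly avoids the intermediate lists. Cells that carry neither class are skipped.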

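uj5u.com user reply:

Not sure this is necessarily a bs4 question, since I think the information is already present in the DataFrame as totals.

How to access it?

Take a look at the tail() of your DataFrame:

df.tail(3)

Example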

import pandas as pd

URL = 'http://gpi.tge.pl/zestawienie-ubytkow' 

df = pd.read_html(URL,header=2)[0]
df.tail(3).iloc[:,2:]

Output

     Moc Osiągalna (MW)  Cz 14  Pt 15  So 16   N 17  Pn 18  Wt 19  Śr 20  Cz 21  Pt 22  So 23   N 24  Pn 25  Wt 26  Śr 27
219  Planowane           11279  10604   8391   6863   6069   6432   6069   5627   5627   5695   5695   5235   5235   5376
220  Nieplanowane         5520   5620   2313   2783    950    950    950    950    950    950    950    910    910    910
221  Łącznie ubytki      16799  16224  10704   9646   7019   7382   7019   6577   6577   6645   6645   6145   6145   6286
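Relying on tail(3) assumes the summary rows are always the last three. A slightly more defensive sketch selects them by label instead. The toy DataFrame below only mimics the shape of the real one. Its values, the "Blok 1" row, and the assumption that the labels sit in the "Moc Osiągalna (MW)" column are illustrative.

```python
import pandas as pd

# Toy frame mimicking the end of the scraped table; values are invented.
df = pd.DataFrame({
    "Moc Osiągalna (MW)": ["Blok 1", "Planowane", "Nieplanowane", "Łącznie ubytki"],
    "Cz 14": [200, 120, 80, 200],
    "Pt 15": [150, 100, 50, 150],
})

label_col = "Moc Osiągalna (MW)"  # assumed location of the row labels

# Keep only the two summary rows and index them by English names.
summary = (
    df[df[label_col].isin(["Planowane", "Nieplanowane"])]
    .set_index(label_col)
    .rename(index={"Planowane": "planned", "Nieplanowane": "unplanned"})
)
print(summary)
```

With the rows indexed this way, a single value can be read with summary.loc["planned", "Cz 14"].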
