How do I scrape a web table that has rows within rows?

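I'm trying to scrape a table where whoever set it up put a bunch of information into a single list, and each row contains many lines.

I want to grab each piece of information inside a row and turn it into one row of a DataFrame. I also want the information inside <strong></strong> to become a column for the whole DataFrame.

Is there a way to do this in Python? I've been using Selenium and pandas read_html, but I think I've hit a wall here. Ultimately, I want to concatenate all of this information into one DataFrame.

The HTML looks like this: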

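<td>
    <strong> Important Information 1 </strong>
    <br> Some information
    <br> Some information
    Some information
    Some information
    Some information; <br> Some information
    <br> Some information
</td>
<td>
    <strong> Important Information 2 </strong>
    <br> Some information 2
    <br> Some information 2
    <br> Some information 2
    <br> Some information 2
    <br> Some information 2
    <br> Some information 2
    <br> Some information 2
    <br> Some information 2
    <br> Some information 2
    <br> Some information 2
</td>
<td>
    <strong> Important Information 3 </strong>
    <br> Some information 3
    <br> Some information 3
    <br> Some information 3
    <br> Some information 3
</td>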

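Expected result:

    Important Heading        Some Information Heading
0   Important Information 1  Some information
1   Important Information 1  Some information
2   Important Information 1  Some information
3   Important Information 1  Some information
4   Important Information 1  Some information
5   Important Information 1  Some information
6   Important Information 2  Some information 2
7   Important Information 2  Some information 2
8   Important Information 2  Some information 2
9   Important Information 2  Some information 2
10  Important Information 2  Some information 2
11  Important Information 2  Some information 2
12  Important Information 2  Some information 2
13  Important Information 2  Some information 2
14  Important Information 2  Some information 2
15  Important Information 2  Some information 2
16  Important Information 3  Some information 3
17  Important Information 3  Some information 3
18  Important Information 3  Some information 3
19  Important Information 3  Some information 3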

A uj5u.com user replied:

If I understand you correctly, you want to convert the HTML into a 3-column pandas DataFrame:

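import pandas as pd
from bs4 import BeautifulSoup

html_doc = """
<td>
    <strong> Important Information 1 </strong>
    <br> Some information
    <br> Some information
    Some information
    Some information
    Some information; <br> Some information
    <br> Some information
</td>
<td>
    <strong> Important Information 2 </strong>
    <br> Some information 2
    <br> Some information 2
    <br> Some information 2
    <br> Some information 2
    <br> Some information 2
    <br> Some information 2
    <br> Some information 2
    <br> Some information 2
    <br> Some information 2
    <br> Some information 2
</td>
<td>
    <strong> Important Information 3 </strong>
    <br> Some information 3
    <br> Some information 3
    <br> Some information 3
    <br> Some information 3
</td>
"""

soup = BeautifulSoup(html_doc, "html.parser")

cols = []
for td in soup.select("td"):
    # The first text fragment is the <strong> heading; the rest are the data lines
    col_name, *data = td.get_text(strip=True, separator="|").split("|")
    cols.append(pd.Series(data, name=col_name))

print(pd.concat(cols, axis=1))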

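Prints:

  Important Information 1 Important Information 2 Important Information 3
0        Some information      Some information 2      Some information 3
1        Some information      Some information 2      Some information 3
2        Some information      Some information 2      Some information 3
3        Some information      Some information 2      Some information 3
4        Some information      Some information 2                     NaN
5        Some information      Some information 2                     NaN
6                     NaN      Some information 2                     NaN
7                     NaN      Some information 2                     NaN
8                     NaN      Some information 2                     NaN
9                     NaN      Some information 2                     NaN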
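
The question asks for a long layout instead: one row per piece of information, with the <strong> text repeated in its own column. A minimal sketch of that variation follows. The sample HTML is shortened and made up, the column names are taken from the expected result in the question, and `stripped_strings` is used in place of joining and splitting on "|" so that text containing that character is not broken apart. It assumes every cell starts with a <strong> heading.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Shortened, made-up sample in the same shape as the question's HTML
html_doc = """
<table><tr>
<td><strong> Important Information 1 </strong><br> Some information A<br> Some information B</td>
<td><strong> Important Information 2 </strong><br> Some information C</td>
</tr></table>
"""

soup = BeautifulSoup(html_doc, "html.parser")

records = []
for td in soup.select("td"):
    # First non-empty text fragment is the <strong> heading,
    # the remaining fragments are the <br>-separated lines
    heading, *infos = td.stripped_strings
    for info in infos:
        records.append(
            {"Important Heading": heading, "Some Information Heading": info}
        )

df = pd.DataFrame(records)
print(df)
```

Building a list of dictionaries first and creating the DataFrame once at the end avoids repeated concatenation, and the default integer index then runs continuously across all cells.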

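A uj5u.com user replied:

Without an example of how you intend to scrape these elements, it's hard to say what will work best for you, but assuming you're starting from scratch, I'd suggest getting an element first and then getting that element's children.

It will probably need some error handling to be robust. Many people like to use CSS selectors as identifiers, but I personally prefer XPaths.

It might look something like: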

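from selenium.webdriver.common.by import By

elements_you_want = driver.find_elements(By.XPATH, 'xpath to parent')
for element in elements_you_want:
    pass  # do something with each element, e.g. look up its children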

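Some logic will be needed to select each parent element, but that really depends on the specific page you're scraping.

This is covered in more detail in this Stack Overflow post: Get all child elements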
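
The parent-then-children pattern can be tried out without a browser. The sketch below is only an illustration: it swaps Selenium for the standard library's `xml.etree.ElementTree`, which understands a small XPath subset, and it uses a made-up, well-formed snippet with self-closing <br/> tags. The function name `collect_rows` is hypothetical. With Selenium the same shape would use the driver's `find_elements` calls on each parent element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample. Real pages are rarely valid XML,
# so this only illustrates the traversal pattern.
SAMPLE = """
<tr>
  <td><strong>Important Information 1</strong><br/>Some information<br/>Some information</td>
  <td><br/>No heading here</td>
</tr>
"""


def collect_rows(root):
    """Return (heading, info) pairs for every <td> that has a <strong> heading."""
    rows = []
    for td in root.findall(".//td"):      # step 1: select each parent element
        strong = td.find("./strong")      # step 2: look up a child relative to the parent
        if strong is None or not (strong.text or "").strip():
            continue                      # error handling: skip cells without a heading
        heading = strong.text.strip()
        for br in td.findall("./br"):
            # In ElementTree the text that follows <br/> is stored in br.tail
            info = (br.tail or "").strip()
            if info:
                rows.append((heading, info))
    return rows


rows = collect_rows(ET.fromstring(SAMPLE))
print(rows)
```

Looking up children relative to each parent (the leading `./`) keeps the heading and its lines tied to the same cell, which is what makes it safe to skip a malformed cell without shifting the rest.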

A uj5u.com user replied:

Make sure to import the environment:

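# >> Get ready: import the required packages
import os
from selenium import webdriver

# >> Set up the Chrome browser
chromedriver = r"C:\Program Files\Python39\Scripts\chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
# Passing the driver path positionally works on older Selenium releases;
# newer ones take a Service object instead
driver = webdriver.Chrome(chromedriver)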

Code snippet to do the scraping:

# - Scraping
from selenium.webdriver.common.by import By

element_list = driver.find_elements(By.TAG_NAME, 'td')
_i_ = 0
Data = [[]]  # index 0 is an empty placeholder, so Data[1] belongs to the first <td>
for _item_ in element_list:
    _i_ += 1
    Title = _item_.find_element(By.XPATH, './strong').text.strip()
    Data.append([_i_, Title])
    # <br> elements have no text of their own, so split the cell's rendered text
    # into lines and skip the first one (the title, assuming a <br> follows it)
    for Value in _item_.text.splitlines()[1:]:
        Data[_i_].append(Value.strip())

# - Show the results
print('- Data[1] = ', Data[1])
print('- Data[2] = ', Data[2])
print('- Data[3] = ', Data[3])

Update: code to export to CSV

import csv

def pad(data):
    # Pad every list with empty strings up to the length of the longest one
    max_n = max(len(x) for x in data.values())
    for field in data:
        data[field] += [''] * (max_n - len(data[field]))
    return data

def merge_dicts(*dict_args):
    """
    Given any number of dictionaries, shallow-copy and merge them into a new dictionary.
    Key-value pairs in later dictionaries take precedence.
    """
    result = {}
    for dictionary in dict_args:
        result.update(dictionary)
    return result

# Data[0] is the empty placeholder, so the three cells are Data[1] to Data[3]
Data_1 = Data[1]
Data_2 = Data[2]
Data_3 = Data[3]

sdata_1 = {"Data_1": Data_1, "Data_2": Data_2}
sdata_2 = {"Data_3": Data_3}
data = merge_dicts(sdata_1, sdata_2)
print(data)

import pandas as pd
df = pd.DataFrame(pad(data))
df.to_csv("output.csv", index=False)

print('>> Finished exporting to CSV')
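
As a possible alternative to padding by hand, pandas can align lists of unequal length on its own when each list is wrapped in a `Series`. The sketch below is an illustration with made-up stand-in data, not the scraped lists themselves. Missing cells become NaN in the DataFrame and are written as empty fields in the CSV text.

```python
import pandas as pd

# Hypothetical stand-ins for the scraped lists of unequal length
data = {
    "Data_1": ["Important Information 1", "Some information", "Some information"],
    "Data_2": ["Important Information 2", "Some information 2"],
}

# Wrapping each list in a Series lets pandas align on the index
# and fill the gaps with NaN
df = pd.DataFrame({name: pd.Series(values) for name, values in data.items()})

# With no path given, to_csv returns the CSV text instead of writing a file;
# NaN is written as an empty field by default
csv_text = df.to_csv(index=False)
print(csv_text)
```

Passing a file name such as "output.csv" as the first argument of `to_csv` writes the same content to disk.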

Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/309763.html
