Clicking multiple links with Selenium in Python

I am trying to scrape data from a structure like the one shown below:

<div class = "tables">
        <div class = "table1">
            <div class = "row">
                <div class = 'data'>Useful Data</div>
                <a href = "url1">
            </div>
            <div class = "row">
                <div class = 'data'>Useful Data</div>
                <a href = "url1">
            </div>
        </div>
        
        <div class = "table2">
            <div class = "row">
                <div class = 'data'>Useful Data</div>
                <a href = "url3">
            </div>
            <div class = "row">
                <div class = 'data'>Useful Data</div>
                <a href = "url4">
            </div>
        </div>
     </div>

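The data I want is in the "data" divs, and also on some other pages that are reachable by clicking the URLs. I use BeautifulSoup to iterate over the "tables" and try to click the links with Selenium, as follows: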

tables = soup.find_all('div', class_ = 'tables')
for line in tables:
    row = line.find_all('div', class_ = "row")
    for element in row:
        link = driver.find_element_by_xpath('//a[contains(@href,"href")]')
        #some code

In my script, this line

link = driver.find_element_by_xpath('//a[contains(@href,"href")]')

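always returns the first URL, whereas I want it to "follow" BeautifulSoup and return the next href on each iteration. Is there a way to adapt the href to the URL found in the source? I should add that all my URLs are very similar except for the last part (for example: url1 = questions/ask/1000, url2 = questions/ask/1001).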

I also tried finding all the hrefs on the page so that I could iterate over them:

links = self.driver.find_element_by_xpath('//a[@href]')

But this does not work either. Since the page contains many links that are of no use to me, I am not sure this is the best approach.

Answer from a uj5u.com user:

Mixing the two seems a bit complicated. Why not extract the href directly with BeautifulSoup?

for a in soup.select('.tables a[href]'):
    link = a['href']
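
Since the question also needs the text of each "data" div, a minimal sketch of pairing every row's data with its own link in a single BeautifulSoup pass might look like this. The sample HTML, the `extract_rows` helper name, and the use of the built-in `html.parser` are assumptions for illustration, not part of the original answer:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup mirroring the structure in the question.
SAMPLE_HTML = """
<div class="tables">
  <div class="table1">
    <div class="row"><div class="data">Useful Data A</div><a href="/questions/ask/1000">more</a></div>
    <div class="row"><div class="data">Useful Data B</div><a href="/questions/ask/1001">more</a></div>
  </div>
</div>
"""


def extract_rows(html):
    """Return a list of (data_text, href) tuples, one per row that has both."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for row in soup.select(".tables .row"):
        # Search inside the current row, not from the document root.
        data = row.select_one(".data")
        link = row.select_one("a[href]")
        if data is None or link is None:
            continue  # skip rows missing either piece
        pairs.append((data.get_text(strip=True), link["href"]))
    return pairs


if __name__ == "__main__":
    for text, href in extract_rows(SAMPLE_HTML):
        print(text, href)
```

Looking up the link relative to each row is what makes every iteration return that row's own href, rather than the first link in the whole document.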

You can also adapt this: concatenate each href with baseUrl and store the results in a list to iterate over:

urls = [baseUrl + a['href'] for a in soup.select('.tables a[href]')]
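
Plain concatenation works when every href is a root-relative path and the base has no trailing slash. As a hedged alternative, the standard library's `urllib.parse.urljoin` also copes with relative paths and with hrefs that are already absolute. The `base_url` value and the sample hrefs below are made up for illustration:

```python
from urllib.parse import urljoin

base_url = "http://www.example.com/questions/"  # assumed base, not from the original page
hrefs = ["/url1", "ask/1000", "http://other.example.org/x"]  # hypothetical href values

# urljoin resolves each href against the base the way a browser would
urls = [urljoin(base_url, href) for href in hrefs]

for url in urls:
    print(url)
```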

Or use Selenium itself, with find_elements instead of find_element, to get all the links rather than just the first one:

for a in driver.find_elements_by_xpath('//div[@class="tables"]//a[@href]'):
    print(a.get_attribute('href'))
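
If the links have to be opened in the browser, one common pitfall is that navigating away invalidates the elements found earlier. A minimal sketch, assuming a Selenium 4 style `driver` (where `find_elements` takes a locator strategy and a value), is to collect the href strings first and only then visit them. The helper names `collect_hrefs` and `visit_all` are hypothetical, and the string `"xpath"` is the value behind `By.XPATH`:

```python
def collect_hrefs(driver, xpath='//div[@class="tables"]//a[@href]'):
    """Read the href of every matching link before any navigation happens."""
    # "xpath" is the same locator string as selenium's By.XPATH
    elements = driver.find_elements("xpath", xpath)
    return [el.get_attribute("href") for el in elements]


def visit_all(driver, urls):
    """Open each URL in turn and return the page titles that were seen."""
    titles = []
    for url in urls:
        driver.get(url)  # navigating makes previously found elements stale
        titles.append(driver.title)
    return titles
```

With a real browser this could be called as `visit_all(driver, collect_hrefs(driver))` once the page containing the tables has been loaded.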

Example

from bs4 import BeautifulSoup

baseUrl = 'http://www.example.com'

html='''
<div class = "tables">
        <div class = "table1">
            <div class = "row">
                <div class = 'data'>Useful Data</div>
                <a href = "/url1">
            </div>
            <div class = "row">
                <div class = 'data'>Useful Data</div>
                <a href = "/url1">
            </div>
        </div>

        <div class = "table2">
            <div class = "row">
                <div class = 'data'>Useful Data</div>
                <a href = "/url3">
            </div>
            <div class = "row">
                <div class = 'data'>Useful Data</div>
                <a href = "/url4">
            </div>
        </div>
     </div>'''
soup = BeautifulSoup(html, 'lxml')

# prepend the base URL to every href found inside .tables
urls = [baseUrl + a['href'] for a in soup.select('.tables a[href]')]

for url in urls:
    print(url)  # or request the page, ...

Output

http://www.example.com/url1
http://www.example.com/url1
http://www.example.com/url3
http://www.example.com/url4

Please credit the source when reposting. Link to this article: https://www.uj5u.com/caozuo/407889.html
