How to scrape only image posts from 9gag

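I want to grab the first image post and add its URL to a blacklist, so that the next search skips URLs that have already been used and moves on to the next image post. I tried the following to find the first image, but it does not work.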

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://9gag.com/funny')
time.sleep(2)
driver.find_element(By.XPATH, value='//*[@id="qc-cmp2-ui"]/div[2]/div/button[1]/span').click()
time.sleep(2)
gagpost = driver.find_element(By.CSS_SELECTOR,value=".image-post img")
gagpostsurl = gagpost.get_attribute('src')
gagposttitle = gagpost.get_attribute('alt')
print(gagpostsurl)
print(gagposttitle)

Error:

Traceback (most recent call last):
  File "C:\Users\klaus\PycharmProjects\testTEST\main.py", line 37, in <module>
    gagposttitle = gagpost.find_element(By,value='img').get_attribute('alt')
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\site-packages\selenium\webdriver\remote\webelement.py", line 763, in find_element
    return self._execute(Command.FIND_CHILD_ELEMENT,
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\site-packages\selenium\webdriver\remote\webelement.py", line 740, in _execute
    return self._parent.execute(command, params)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 428, in execute
    response = self.command_executor.execute(driver_command, params)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\site-packages\selenium\webdriver\remote\remote_connection.py", line 345, in execute
    data = utils.dump_json(params)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\site-packages\selenium\webdriver\remote\utils.py", line 23, in dump_json
    return json.dumps(json_struct)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\json\__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\json\encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\json\encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\json\encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type type is not JSON serializable

Process finished with exit code 1
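
The last line of the traceback, together with the call `gagpost.find_element(By,value='img')` on line 37, suggests that the `By` class itself was passed as the locator strategy instead of one of its string constants such as `By.TAG_NAME`. The client then tries to JSON-encode a class, which fails. The following is a minimal sketch that reproduces the same failure with only the standard library; the `By` class and the `encode_locator` helper here are hypothetical stand-ins, not Selenium's own code.

```python
import json


class By:
    """Hypothetical stand-in for selenium.webdriver.common.by.By."""
    TAG_NAME = "tag name"


def encode_locator(using, value):
    """Serialize a locator to JSON, as a JSON-based client would before sending it."""
    return json.dumps({"using": using, "value": value})


# Passing a string constant serializes without trouble.
ok_payload = encode_locator(By.TAG_NAME, "img")

# Passing the class itself raises the TypeError seen in the traceback.
try:
    encode_locator(By, "img")
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(ok_payload)
print(error_message)
```

If this reading is right, a call such as `gagpost.find_element(By.TAG_NAME, 'img')` would avoid the `TypeError`, provided `gagpost` actually contains an `img` child.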

I also tried this. Sometimes it works and sometimes it does not.

driver = webdriver.Chrome()

driver.get('https://9gag.com/funny')
time.sleep(2)
driver.find_element(By.XPATH, value='//*[@id="qc-cmp2-ui"]/div[2]/div/button[1]/span').click()
time.sleep(2)
gagpost = driver.find_element(By.CSS_SELECTOR,value=".image-post img")
gagpostsurl = gagpost.get_attribute('src')
gagposttitle = gagpost.get_attribute('alt')
print(gagpostsurl)
print(gagposttitle)

I would appreciate any help.

Answer from a uj5u.com user:

You can do it like this:

from selenium.common.exceptions import NoSuchElementException
...
# Get the feed element
feed = driver.find_element(By.CSS_SELECTOR, "div.main-wrap section#list-view-2")
# Get the streams from the feed
streams = feed.find_elements(By.CLASS_NAME, "list-stream")
# Debug number of streams
print(f"Streams: {len(streams)}")
# Iterate over each stream
for stream in streams:
    # Find articles within the stream; these are the 'posts'
    articles = stream.find_elements(By.TAG_NAME, "article")
    # Debug number of articles
    print(f"Articles: {len(articles)}")
    # Iterate over each article
    for article in articles:
        # Try/except here because some articles are adverts, these are skipped
        try:
            # Find the article title
            title = article.find_element(By.CSS_SELECTOR, "header > a")
        except NoSuchElementException:
            continue
        # Print the article title
        print(f"Title: {title.text}")

This prints:

Streams: 1
Articles: 3
Title: Hahahahaha Git Gud
Title: How to impress your guests

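This does not print every post on the page, because the posts are lazy-loaded: they are fetched from the server as you scroll. To load them, you need to add scrolling to the code above. Fortunately, the Python Selenium documentation has an example for exactly this case. You can also refer to my previous answer to see what an implementation looks like.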
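
The scrolling step could look roughly like the sketch below. It follows the common pattern of scrolling to the bottom and comparing `document.body.scrollHeight` before and after. The function name `scroll_until_loaded` and the `pause` and `max_scrolls` parameters are assumptions made for illustration; the cap is there because a feed like this may keep loading indefinitely.

```python
import time


def scroll_until_loaded(driver, pause=2.0, max_scrolls=10):
    """Scroll to the bottom until the page height stops growing or the cap is hit.

    `driver` is expected to be a Selenium WebDriver (anything with an
    `execute_script` method works). Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0
    while scrolls < max_scrolls:
        # Jump to the current bottom of the page to trigger lazy loading.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        scrolls += 1
        # Give the newly requested posts time to arrive.
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # Nothing new was added, so stop scrolling.
            break
        last_height = new_height
    return scrolls
```

Calling `scroll_until_loaded(driver)` before locating the feed, and then querying the streams and articles again, should pick up the posts added while scrolling.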

I only added enough code to get the titles; you can extract the rest of the information you need from the `article` variable in the nested loop.
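
For the blacklist part of the question, one possible approach is to keep a set of image URLs that have already been used and return the first article whose image is not in it. The sketch below assumes the `.image-post img` selector from the question still matches image posts inside each `article`; the helper names `first_unseen_image`, `load_seen`, and `save_seen`, and the idea of storing the blacklist in a text file, are assumptions for illustration only.

```python
from pathlib import Path

# Same string value as selenium's By.CSS_SELECTOR; used so the sketch has no imports
# beyond the standard library.
CSS_SELECTOR = "css selector"


def load_seen(path):
    """Read previously used image URLs from a text file, one per line."""
    file = Path(path)
    if not file.exists():
        return set()
    lines = file.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def save_seen(path, seen_urls):
    """Write the used image URLs back to the text file."""
    Path(path).write_text("\n".join(sorted(seen_urls)), encoding="utf-8")


def first_unseen_image(articles, seen_urls, selector=".image-post img"):
    """Return (src, alt) of the first image post whose src is not in seen_urls.

    Articles with no matching <img> (for example adverts or videos) are skipped.
    The returned URL is added to seen_urls so the next call moves on.
    """
    for article in articles:
        # find_elements returns an empty list instead of raising when nothing matches.
        images = article.find_elements(CSS_SELECTOR, selector)
        if not images:
            continue
        src = images[0].get_attribute("src")
        if not src or src in seen_urls:
            continue
        seen_urls.add(src)
        return src, images[0].get_attribute("alt")
    return None
```

In the loop above, this would be called as `first_unseen_image(articles, seen)` with `seen = load_seen("seen.txt")` loaded at startup and `save_seen("seen.txt", seen)` called before exiting; the file name is arbitrary.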
