Python BeautifulSoup findAll finds some but not all

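With a little Python knowledge, I am trying to scrape the posts of some companies on LinkedIn.

Using the following code, which I got from this site, I first find all the posts on a company's LinkedIn page before extracting their content. The problem is that I know, because I have counted them, that there are more posts than findAll returns, no matter which parser I use (lxml, html5lib, or html.parser). In one case it returned 43 of 67 posts, in another 10 of 14. Typically it finds about 3 or 4 posts, then skips 4 or 5, then finds some more, and so on.

How can I find out why this is happening?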

#!/usr/bin/env python
# coding: utf-8

# Imports
import time

from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

# Get credentials to log in to LinkedIn
username = input('Enter your linkedin username: ')
password = input('Enter your linkedin password: ')
company_name = input('Name of the company: ')

# Access Webdriver
s = Service(ChromeDriverManager().install())
browser = webdriver.Chrome(service=s)
browser.maximize_window()

# Define page to open
page = "https://www.linkedin.com/company/{}/posts/?feedView=all".format(company_name)

# Open login page
browser.get('https://www.linkedin.com/login?fromSignIn=true&trk=guest_homepage-basic_nav-header-signin')

# Enter login info
# (find_element_by_id was removed in Selenium 4.3; use find_element with By.ID)
elementID = browser.find_element(By.ID, 'username')
elementID.send_keys(username)
elementID = browser.find_element(By.ID, 'password')
elementID.send_keys(password)
elementID.submit()

# Go to webpage (`page` already contains the posts path and query string)
browser.get(page)

# Define scrolling time
SCROLL_PAUSE_TIME = 1.5

# Get scroll height
last_height = browser.execute_script("return document.body.scrollHeight")

# Scroll all the way to the bottom of the page
while True:

    # Scroll down to bottom
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    
    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = browser.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# Get content of page
content = browser.page_source.encode('utf-8').strip()

# Create soup
linkedin_soup = bs(content, "html5lib")
linkedin_soup.prettify()

# Find entities that contain posts
containers = linkedin_soup.findAll("div",{"class":"occludable-update ember-view"})
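
One way to narrow this down is to count the matching elements in the live browser DOM and compare that with what BeautifulSoup finds in the page source. The following is only a sketch: `compare_post_counts` is a hypothetical helper name, and the selector `div.occludable-update` is an assumption that matches on a single class instead of the exact two-class string used above.

```python
def compare_post_counts(browser, soup, css_selector="div.occludable-update"):
    """Return (count in the live DOM, count in the parsed soup).

    `browser` needs an `execute_script` method (as a Selenium WebDriver has)
    and `soup` needs a `select` method (as a BeautifulSoup object has).
    """
    # Ask the browser itself how many elements match the selector
    dom_count = browser.execute_script(
        "return document.querySelectorAll(arguments[0]).length;", css_selector
    )
    # Count the same selector in the HTML that was handed to the parser
    soup_count = len(soup.select(css_selector))
    return dom_count, soup_count
```

Called as `compare_post_counts(browser, linkedin_soup)`, matching numbers that are both lower than the hand count would suggest the parser is not at fault and the missing posts were never in the DOM under that selector. Differing numbers would point at the parsing step or the class match instead.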

Answer:

So @chitown88 put me on the right track. This is the final code I have now, which gets me the results I need:

# Define scrolling height and time
SCROLL_PAUSE_TIME = 1.5 # [sec]
SCROLL_HEIGHT = 1000

# Pause to be sure page is loaded
time.sleep(SCROLL_PAUSE_TIME)

# Scroll all the way to the bottom of the page
new_height = SCROLL_HEIGHT
while True:

    # Get maximal scroll height
    max_height = browser.execute_script("return document.body.scrollHeight")

    # Check whether maximal scroll height has been exceeded
    if new_height > max_height:
        break

    # Scroll to position
    browser.execute_script("window.scrollTo(0, {});".format(new_height))
    time.sleep(SCROLL_PAUSE_TIME)

    # Get current scroll position
    #current_height = browser.execute_script("return window.pageYOffset")

    # Increase scroll position
    new_height += SCROLL_HEIGHT

# Make sure to reach last position
browser.execute_script("window.scrollTo(0, {});".format(max_height))

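I left the current_height variable in (commented out) because I am not sure whether it will be needed again. This code still needs more validation, but it may be worth keeping.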
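
For reuse and easier validation, the stepwise scroll above could be wrapped in a function. This is a minimal sketch, not a tested LinkedIn scraper: the name `scroll_in_steps` and the `max_steps` safety limit are my own additions, and the position is passed to the script through Selenium's `arguments[0]` mechanism instead of `str.format`.

```python
import time


def scroll_in_steps(browser, step=1000, pause=1.5, max_steps=500):
    """Scroll down `step` pixels at a time until the page bottom is passed.

    `browser` only needs an `execute_script` method (as a Selenium WebDriver has).
    Returns the list of scroll positions that were visited.
    """
    visited = []
    position = step
    for _ in range(max_steps):  # hypothetical guard against endless feeds
        # Re-read the height on every pass: it grows as more posts load
        max_height = browser.execute_script("return document.body.scrollHeight")
        if position > max_height:
            break
        browser.execute_script("window.scrollTo(0, arguments[0]);", position)
        visited.append(position)
        time.sleep(pause)
        position += step

    # Make sure to finish at the real bottom of the page
    max_height = browser.execute_script("return document.body.scrollHeight")
    browser.execute_script("window.scrollTo(0, arguments[0]);", max_height)
    visited.append(max_height)
    return visited
```

Returning the visited positions makes it possible to check the scrolling logic against a stand-in object without opening a browser.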

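Answer:

The problem is that when you scroll straight down to the bottom, some posts are skipped and never rendered. There is probably a better way to do this, but basically I scroll to 1/4, then 1/2, then the full height (hoping to catch all the posts). Try this adjustment: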

# Scroll all the way to the bottom of the page
while True:

    # Scroll down to bottom
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight/4);")
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight/2);")
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    
    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = browser.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
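
Note that the three scrollTo calls in the loop above run back to back, so the browser may not get a chance to render anything at the 1/4 and 1/2 positions. A possible variation, sketched here under the assumption that a pause after each partial scroll helps, is shown below. The names `scroll_by_fractions`, `fractions`, and `max_rounds` are hypothetical and not part of the original code.

```python
import time


def scroll_by_fractions(browser, fractions=(0.25, 0.5, 1.0), pause=1.5, max_rounds=100):
    """Scroll to each fraction of the page height, pausing after every scroll.

    Repeats until the page height stops growing and returns the number of
    rounds performed. `browser` only needs an `execute_script` method.
    """
    last_height = browser.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        for fraction in fractions:
            # The fraction is passed to the script as arguments[0]
            browser.execute_script(
                "window.scrollTo(0, document.body.scrollHeight * arguments[0]);",
                fraction,
            )
            # Give the page time to render before moving on
            time.sleep(pause)

        # Stop once a full round no longer makes the page taller
        new_height = browser.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds
        last_height = new_height
    return max_rounds
```

More fractions (for example eighths) can be passed in if posts are still skipped, at the cost of a longer run time.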


Tags: python, html, selenium, selenium-webdriver, beautifulsoup
