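I recently modified an old crawler from our lab and found I had gotten a bit rusty, so I am writing down some notes here.
I will use the site https://nonfungible.com/ as the running example.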

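requests + Chrome browser
- Using Chrome to analyze the target site
- requests GET requests
- Adding headers in requests
- IP proxies in requests
Using a simulated browser to get information that cannot be found through request analysis
- Installing and using ChromeDriver (Windows, macOS)
- Using BeautifulSoup to parse the page source
- Simulating clicks with selenium
- Setting up a headless browser
Tips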

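requests + Chrome browser

Using Chrome to analyze the target site

First, open the target page in Chrome, press F12 to open the developer tools, and select the Network tab. Before you start, you can click the clear icon to empty the list of requests.
Then click the next-page icon on the page; a batch of new requests appears in the list below.

Next, copy a piece of the information you want to scrape from the page and look for it in the Response of those requests, using Ctrl + F to search for the keyword.
Here I searched for the id 9696 shown on the page, and luckily the very first request was the one we need.
Open the Headers tab to see all the details of the request:
Request URL is the address the request is sent to.
Request Method shows that this is a GET request.

With this information we can move on to the next step: sending the request ourselves.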
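Before sending the request, it can help to decode the Request URL and see what the browser actually asked for. The sketch below uses only the standard library and the URL used throughout this article. Reading `start` as the paging offset and `length` as the page size is an assumption based on how the URL changes when the next-page icon is clicked, not something the site documents.

```python
import json
from urllib.parse import urlsplit, parse_qs

# The request URL copied from the Network tab (the same URL used in this article).
url = (
    "https://nonfungible.com/api/v4/market/history/boredapeclub"
    "?filter=%5B%7B%22id%22%3A%22saleType%22%2C%22value%22%3A%22%22%7D%5D"
    "&internal=true&length=10"
    "&sort=%5B%7B%22id%22%3A%22blockTimestamp%22%2C%22desc%22%3Atrue%7D%5D"
    "&start=10"
)

parts = urlsplit(url)
# parse_qs percent-decodes every value and returns {name: [values]}.
query = {name: values[0] for name, values in parse_qs(parts.query).items()}

print(parts.path)                        # the endpoint, including the collection name
print(json.loads(query["filter"]))       # the filter is a JSON list
print(json.loads(query["sort"]))         # so is the sort order
print(query["start"], query["length"])   # assumed: paging offset and page size
```

Once decoded, the unreadable percent-escapes turn out to be two small JSON lists plus a few plain parameters.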

requests GET requests

import requests

# ret = requests.get(<Request URL>)  # fill in the URL of the GET request
ret = requests.get('https://nonfungible.com/api/v4/market/history/boredapeclub?filter=%5B%7B%22id%22%3A%22saleType%22%2C%22value%22%3A%22%22%7D%5D&internal=true&length=10&sort=%5B%7B%22id%22%3A%22blockTimestamp%22%2C%22desc%22%3Atrue%7D%5D&start=10')
print(ret.text)  # the response body as text

You may also run into other request types, such as POST; for those, see the requests documentation.
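The long GET URL above can also be assembled from its parts, which makes it easier to change the collection or the page. This is a minimal sketch using only the standard library; `build_history_url` is a hypothetical helper name, and advancing `start` by `length` for each page is an assumption about how the site pages its results.

```python
import json
from urllib.parse import urlencode

BASE = "https://nonfungible.com/api/v4/market/history/"

def build_history_url(collection, start=0, length=10):
    """Rebuild the request URL for one page of a collection (hypothetical helper)."""
    compact = (",", ":")  # no spaces in the JSON, matching what the browser sends
    params = {
        "filter": json.dumps([{"id": "saleType", "value": ""}], separators=compact),
        "internal": "true",
        "length": length,
        "sort": json.dumps([{"id": "blockTimestamp", "desc": True}], separators=compact),
        "start": start,
    }
    # urlencode percent-escapes the JSON text for us.
    return BASE + collection + "?" + urlencode(params)

# Assumption: each page advances `start` by `length`.
for page in range(3):
    print(build_history_url("boredapeclub", start=page * 10))
```

Each printed URL can be passed to requests.get() exactly like the hand-copied one.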

Adding headers in requests

import requests

# Headers copied from the request shown in the browser's Network tab
headers = {
    "Accept": "application/json",
    "Content-Type": "application/json",
    "Referer": "https://nonfungible.com/market/history/cyberkongz?filter=saleType%3D&length=10&sort=blockTimestamp%3Ddesc&start=10",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"
}
ret = requests.get('https://nonfungible.com/api/v4/market/history/boredapeclub?filter=%5B%7B%22id%22%3A%22saleType%22%2C%22value%22%3A%22%22%7D%5D&internal=true&length=10&sort=%5B%7B%22id%22%3A%22blockTimestamp%22%2C%22desc%22%3Atrue%7D%5D&start=10', headers=headers)

IP proxies in requests

Sometimes a site blocks you because you sent too many requests in a short time. Besides waiting a little before every request ( time.sleep(2.5 + random.random()) ), you can also rotate through several proxy IPs.

import random
import requests

# Each entry maps a URL scheme to the proxy used for it
proxies_list = [
    {'http': '103.103.3.6:8080', 'https': '103.103.3.6:8080'},
    {'https': '211.24.95.49:47615'},
]

proxies = random.choice(proxies_list)  # pick one proxy at random from the list

# headers is the dict defined in the previous section
html = requests.get('https://nonfungible.com/api/v4/market/history/boredapeclub?filter=%5B%7B%22id%22%3A%22saleType%22%2C%22value%22%3A%22%22%7D%5D&internal=true&length=10&sort=%5B%7B%22id%22%3A%22blockTimestamp%22%2C%22desc%22%3Atrue%7D%5D&start=10', headers=headers, proxies=proxies)
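Free proxies fail often, so picking one at random may not be enough. The sketch below combines the two ideas from this section, the short random wait and the proxy list, and moves on to the next proxy when one fails. `fetch_with_proxies` and its parameters are hypothetical names; the actual request is passed in as a `fetch` callable so the rotation logic does not depend on any particular HTTP library.

```python
import random
import time

def fetch_with_proxies(url, proxies_list, fetch, pause=(2.5, 3.5), sleep=time.sleep):
    """Try the proxies in random order and return the first successful result.

    `fetch(url, proxies)` is supplied by the caller and should raise on failure.
    """
    candidates = list(proxies_list)
    random.shuffle(candidates)
    last_error = None
    for proxies in candidates:
        sleep(random.uniform(*pause))  # wait a little before every request
        try:
            return fetch(url, proxies)
        except Exception as error:  # this proxy failed, move on to the next one
            last_error = error
    raise RuntimeError("all proxies failed") from last_error
```

With requests, `fetch` could be `lambda url, proxies: requests.get(url, headers=headers, proxies=proxies, timeout=10)`. Note that requests.get only raises on connection problems; call raise_for_status() inside `fetch` if HTTP error codes should also count as failures.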

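Using a simulated browser to get information that cannot be found through request analysis

Installing and using ChromeDriver (Windows, macOS)

Download the ChromeDriver build that matches your Chrome version from its download page, then install selenium:

pip install selenium

import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

chrome_options = Options()

# Selenium 4 takes the driver path through Service and the options through options=
# (older versions used webdriver.Chrome(path, chrome_options=...))
chromedrive = "/usr/local/bin/chromedriver"  # macOS
# chromedrive = r"C:\Program Files\Google\Chrome\Application\chromedriver"  # Windows (raw string keeps the backslashes)
driver = webdriver.Chrome(service=Service(chromedrive), options=chrome_options)

driver.get("https://nonfungible.com/market/history/")  # similar to requests.get()
time.sleep(10)  # give the page time to render
pageSource = driver.page_source  # get the page source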

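Explicit wait

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Browser: driver; maximum wait: 20 s
wait = WebDriverWait(driver, 20)
# Condition: wait until the element with id='__next' is present in the page
root_element = wait.until(EC.presence_of_element_located((By.ID, '__next')))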

Using BeautifulSoup to parse the page source

from bs4 import BeautifulSoup

pageSource = driver.page_source
soup = BeautifulSoup(pageSource, 'lxml')  # the 'lxml' parser requires: pip install lxml
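Once the soup object exists, the data still has to be pulled out of it. The sketch below shows one common pattern on a small, made-up HTML snippet; the table id and the class names are hypothetical and will not match the real page, whose structure you need to read from the Elements panel.

```python
from bs4 import BeautifulSoup

# Hypothetical page source; a real driver.page_source has a different structure.
page_source = """
<table id="sales">
  <tr><td class="asset">9696</td><td class="price">12.5</td></tr>
  <tr><td class="asset">1234</td><td class="price">8.0</td></tr>
</table>
"""

# 'html.parser' ships with Python; 'lxml' also works if it is installed.
soup = BeautifulSoup(page_source, "html.parser")

rows = []
for tr in soup.select("#sales tr"):  # CSS selector: every row of the table
    asset = tr.find("td", class_="asset").get_text(strip=True)
    price = float(tr.find("td", class_="price").get_text(strip=True))
    rows.append((asset, price))

print(rows)
```

select() takes CSS selectors, while find() and find_all() filter by tag name and attributes; either style works, so use whichever matches the page more directly.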

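Simulating clicks with selenium

Right-click the element you want on the page and choose Inspect to jump to its position in the Elements panel; there, Copy > Copy XPath gives you the element's XPath.
Then locate the element you want to click with find_element (the find_element_by_xxx functions in older selenium versions).
Finally, call the function for the click action you want to perform.

Example:

from selenium.webdriver.common.by import By

# XPath copied from the Elements panel; //* matches an element with any tag name
next_button = driver.find_element(By.XPATH, "//*[@id=\"__next\"]/div/div/section/div[4]/div/div/div[2]/div[2]/div/div[3]/div[3]")
next_button.click()

For the other click actions (double-click, right-click, and so on), see the selenium ActionChains documentation.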

Setting up a headless browser

# Imports
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

chrome_options = Options()
chrome_options.add_argument('--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36')
# chrome_options.add_argument('--no-sandbox')              # fixes the "DevToolsActivePort file doesn't exist" error
# chrome_options.add_argument('--window-size=1920,3000')   # set the browser window size
chrome_options.add_argument('--disable-gpu')               # the Google documentation mentions this flag as a workaround for a bug
chrome_options.add_argument('--hide-scrollbars')           # hide scrollbars, for some special pages
chrome_options.add_argument('--blink-settings=imagesEnabled=false')  # do not load images, to speed things up
chrome_options.add_argument('--headless')                  # run without a visible window; comment out while debugging
# chrome_options.add_argument('--proxy-server=http://59.188.24.24:10152')
# Avoid being detected as an automated browser
# chrome_options.add_experimental_option('excludeSwitches', ['enable-automation'])

chromedrive = r"C:\Program Files\Google\Chrome\Application\chromedriver"  # Windows
# chromedrive = "/usr/local/bin/chromedriver"  # macOS
driver = webdriver.Chrome(service=Service(chromedrive), options=chrome_options)

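Tips

The network often fluctuates, so to make the program more robust we need to add some safeguards:

# Keep retrying until the code runs without an exception
while True:
    try:
        ...  # your code goes here
        break
    except Exception as e:
        print("have some problem ", e)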
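The loop above retries forever and immediately, which can hammer a site that is already struggling. A bounded variant with a growing wait is sketched below; `retry` and its parameters are hypothetical names, and the doubling delay is just one common choice.

```python
import time

def retry(task, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call task() until it succeeds, waiting longer after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception as error:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the real error
            delay = base_delay * 2 ** (attempt - 1)  # 1 s, 2 s, 4 s, ...
            print(f"attempt {attempt} failed ({error}); retrying in {delay:.0f}s")
            sleep(delay)
```

Wrap the flaky step in a function or lambda and pass it in, for example `retry(lambda: requests.get(url, timeout=10))`.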

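A function for writing rows of data to a CSV file comes in handy.
Pass it the file path, the line to write, and the write mode.

import csv
import numpy as np

# Write
def myWriteLine(filepath, line, model="a+"):
    # line is one row (1-D list) or a list of rows (2-D list); model is the file mode
    with open(filepath, model, newline="", encoding="utf-8") as f:
        csv_write = csv.writer(f)
        x = np.array(line)
        if x.ndim == 1:
            csv_write.writerow(line)  # a single row
        elif x.ndim == 2:
            for row in line:  # several rows
                csv_write.writerow(row)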
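NumPy is used above only to tell a single row from a list of rows. If you would rather not depend on it, the same decision can be made with a plain type check. This is an alternative sketch, not the function above; `write_rows` is a hypothetical name.

```python
import csv

def write_rows(filepath, data, mode="a+"):
    """Append one row (a flat list) or many rows (a list of lists) to a CSV file."""
    # Treat the input as many rows only if every item is itself a list or tuple.
    # An empty list counts as a single empty row.
    many = bool(data) and all(isinstance(item, (list, tuple)) for item in data)
    rows = data if many else [data]
    with open(filepath, mode, newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
```

For example, `write_rows("sales.csv", ["9696", 12.5])` appends one row, and `write_rows("sales.csv", [["1", 2], ["3", 4]])` appends two.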

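A function for saving a web page

def saveHtml(file_name, file_content):
    # Mind the characters Windows forbids in file names, such as /
    with open(file_name + ".html", 'w', encoding='utf-8') as f:
        # The file is opened in text mode, so file_content must be a str; it is encoded as UTF-8 on write
        f.write(file_content)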
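Since page titles and URLs often contain characters that Windows forbids in file names, it can be convenient to clean the name before saving. A minimal sketch follows; `safe_file_name` is a hypothetical helper, and replacing each forbidden character with an underscore is just one possible policy.

```python
import re

# Characters Windows does not allow in file names: \ / : * ? " < > |
_FORBIDDEN = re.compile(r'[\\/:*?"<>|]')

def safe_file_name(name, replacement="_"):
    """Replace characters that are not allowed in Windows file names."""
    # Windows also rejects names that end in a space or a dot.
    cleaned = _FORBIDDEN.sub(replacement, name).rstrip(" .")
    return cleaned or "untitled"

print(safe_file_name("market/history?start=10"))  # market_history_start=10
```

The cleaned name can then be passed as the file_name argument when saving the page.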

Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/333810.html

Tags: python

Python crawler: a simple crawler tutorial (requests + selenium)
