Concept:

A crawler uses code to simulate a user, sending network requests in bulk and collecting data in bulk.

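Robots protocol:

The robots protocol, also known as robots.txt (always lowercase), is an ASCII-encoded text file stored in a website's root directory. It tells search engine crawlers (also called web spiders) which content on the site they should not fetch and which content they may fetch.

Because URLs are case-sensitive on some systems, the robots.txt filename should be all lowercase, and the file should be placed in the site's root directory.
To define crawler behavior for a subdirectory separately, either merge the custom settings into the root robots.txt or use robots metadata (the robots meta tag).
Compliance with the robots protocol is voluntary. It is a convention that crawlers are expected to follow, not an enforcement mechanism, so it cannot guarantee a site's privacy.

Put simply, robots.txt states whether crawlers (general-purpose crawlers) are allowed to fetch certain content.

Note: focused crawlers generally do not follow robots.txt.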
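
To make the rules above concrete, here is a minimal sketch of how a crawler could check a URL against robots.txt using the standard library's `urllib.robotparser`. The robots.txt content, the `MyCrawler` user agent, and the `example.com` URLs are made up for illustration and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_parser(robots_text):
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "MyCrawler" is an assumed user-agent name.
print(parser.can_fetch("MyCrawler", "http://example.com/index.html"))         # True
print(parser.can_fetch("MyCrawler", "http://example.com/private/data.html"))  # False
```

For a live site, `set_url()` followed by `read()` on the same parser class would download the file instead of parsing an in-memory string.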


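Crawling workflow:

In most cases the requirement calls for a focused crawler, one that extracts specific data values from a page rather than the data of the whole page.

Specify the URL
Send the request
Get the response data
Parse the data
Persist the data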
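
The five steps above can be sketched as one small pipeline. This is only an illustration under assumptions: the function names (`fetch`, `parse_title`, `save`, `crawl`) are hypothetical, and extracting the page title stands in for whatever "specific data" a real focused crawler would target.

```python
import urllib.request
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch(url):
    """Steps 1-3: take a URL, send the request, return the response text."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def parse_title(html):
    """Step 4: parse the data, keeping only the page title."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def save(text, path):
    """Step 5: persist the extracted value to a local file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text + "\n")


def crawl(url, path):
    """Run all five steps in order and return the extracted title."""
    title = parse_title(fetch(url))
    save(title, path)
    return title
```

Calling `crawl("http://www.baidu.com/", "title.txt")` would run the whole sequence. The output file name here is an arbitrary choice.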

Tests:

Test 1:

import urllib.request

def load_data():
    url = "http://www.baidu.com/"
    # Send an HTTP GET request
    # response: the HTTP response object
    response = urllib.request.urlopen(url)
    print(response)

load_data()

urllib

Test 2:

import urllib.request

def load_data():
    url = "http://www.baidu.com/"
    # Send an HTTP GET request
    # response: the HTTP response object
    response = urllib.request.urlopen(url)
    print(response)
    # Read the body; the result is of type bytes
    data = response.read()
    print(data)

load_data()

Reading the content

Test 3:

import urllib.request

def load_data():
    url = "http://www.baidu.com/"
    # Send an HTTP GET request
    # response: the HTTP response object
    response = urllib.request.urlopen(url)
    #print(response)
    # Read the body; the result is of type bytes
    data = response.read()
    #print(data)
    # Decode the fetched bytes into a string
    str_data = data.decode("UTF-8")
    print(str_data)

load_data()

Reading the content as a string
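
Test 3 hard-codes UTF-8 when decoding. As a hedged alternative, the sketch below reads the charset from the Content-Type header and falls back to UTF-8 when none is declared. The helper name `decode_body` is hypothetical. The header object is built by hand here so the example runs without a network connection; with a real response, `response.headers` offers the same `get_content_charset()` method.

```python
from email.message import Message


def decode_body(data, headers, fallback="utf-8"):
    """Decode response bytes using the charset declared in the headers.

    Falls back to `fallback` if the Content-Type header names no charset.
    """
    charset = headers.get_content_charset() or fallback
    return data.decode(charset, errors="replace")


# Hand-built headers standing in for response.headers (assumed values).
headers = Message()
headers["Content-Type"] = "text/html; charset=gbk"

body = "中文".encode("gbk")           # bytes as a GBK-encoded page would send them
decoded = decode_body(body, headers)  # decoded back into the original string
```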

Test 4:

import urllib.request

def load_data():
    url = "http://www.baidu.com/"
    # Send an HTTP GET request
    # response: the HTTP response object
    response = urllib.request.urlopen(url)
    #print(response)
    # Read the body; the result is of type bytes
    data = response.read()
    #print(data)
    # Decode the fetched bytes into a string
    str_data = data.decode("UTF-8")
    #print(str_data)
    # Write the data to a file
    with open("baidu.html", "w", encoding="utf-8") as f:
        f.write(str_data)

load_data()

Writing the data to a file
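
Test 4 decodes the bytes and writes text. As a sketch of the other option, the fetched bytes can be written unchanged by opening the file in binary mode, which skips the decode step. The helper names `save_bytes` and `save_text` are hypothetical.

```python
def save_bytes(data, path):
    """Write raw bytes as-is; binary mode takes no encoding argument."""
    with open(path, "wb") as f:
        f.write(data)


def save_text(text, path):
    """Write a str; text mode encodes it with the given encoding."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
```

Both produce the same file when the bytes are UTF-8: `save_bytes(data, "baidu.html")` and `save_text(data.decode("utf-8"), "baidu.html")` are interchangeable in that case.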

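Note:

In these tests the https:// version of the URL could not be opened, while the http:// version could. urllib itself supports HTTPS, so a failure like this usually comes from certificate verification or from the site answering non-browser clients differently, not from HTTPS being unsupported.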
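
As a sketch of how an https:// URL can be requested with urllib, the code below passes a default SSL context and a browser-like User-Agent header. Both helper names and the User-Agent value are assumptions for illustration. Whether a given site then returns its full page depends on that site.

```python
import ssl
import urllib.request


def build_https_request(url):
    """Build a Request carrying a browser-like User-Agent (illustrative value)."""
    headers = {"User-Agent": "Mozilla/5.0"}
    return urllib.request.Request(url, headers=headers)


def open_https(url):
    """Fetch the URL over HTTPS with default certificate verification."""
    context = ssl.create_default_context()
    request = build_https_request(url)
    with urllib.request.urlopen(request, context=context, timeout=10) as response:
        return response.read()
```

`open_https("https://www.baidu.com/")` returns the body as bytes, just like `response.read()` in the tests above.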

Test 5:

str_name = "baidu"
bytes_name = str_name.encode("utf-8")
print(bytes_name)

Converting a string to bytes

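Note:

Data handled while crawling in Python comes in two types: str and bytes.

If the fetched data is bytes but a string is needed for writing => decode("utf-8");

If the fetched data is str but bytes are needed for writing => encode("utf-8").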
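
The two conversions in this note are inverses of each other. A minimal round-trip sketch, where the sample string is arbitrary:

```python
text = "爬蟲 crawler"               # str
raw = text.encode("utf-8")          # str -> bytes
restored = raw.decode("utf-8")      # bytes -> str

print(type(raw).__name__)           # bytes
print(type(restored).__name__)      # str
print(restored == text)             # True
```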

Combined code for Test 1 to Test 5:

import urllib.request

def load_data():
    url = "http://www.baidu.com/"
    # Send an HTTP GET request
    # response: the HTTP response object
    response = urllib.request.urlopen(url)
    #print(response)
    # Read the body; the result is of type bytes
    data = response.read()
    #print(data)
    # Decode the fetched bytes into a string
    str_data = data.decode("UTF-8")
    #print(str_data)
    # Write the data to a file
    with open("baidu.html", "w", encoding="utf-8") as f:
        f.write(str_data)
    # Convert a string to bytes
    str_name = "baidu"
    bytes_name = str_name.encode("utf-8")
    print(bytes_name)


load_data()


Test 6:

import urllib.request
import urllib.parse
import string

def get_method_params():
    url = "http://www.baidu.com/?wd="
    # Append the keyword (Chinese characters) to the URL
    name = "爬蟲"
    final_url = url + name
    #print(final_url)
    # The URL now contains Chinese characters, which ASCII cannot represent,
    # so percent-encode them before sending the request
    encode_new_url = urllib.parse.quote(final_url, safe=string.printable)
    # Passing the unescaped final_url instead raises:
    # UnicodeEncodeError: 'ascii' codec can't encode characters in position 15-16: ordinal not in range(128)
    #response = urllib.request.urlopen(final_url)
    print(encode_new_url)
    # Reason for the error: the HTTP request line must be ASCII (0 - 127), so urllib
    # cannot send raw Chinese characters. Python strings themselves handle Chinese fine.


get_method_params()

GET - params
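
An alternative to quoting the whole URL is to encode only the query parameters with `urllib.parse.urlencode`, which percent-encodes non-ASCII values as UTF-8 by default. The sketch below reuses the same base URL and `wd` parameter as the test. The helper name `build_search_url` is hypothetical.

```python
from urllib.parse import unquote, urlencode


def build_search_url(base_url, keyword):
    """Return base_url with ?wd=<keyword>, where the keyword is percent-encoded."""
    query = urlencode({"wd": keyword})
    return base_url + "?" + query


url = build_search_url("http://www.baidu.com/", "爬蟲")
print(url)  # the keyword appears as %XX escapes, so the URL is pure ASCII
```

Encoding only the parameter values also escapes characters such as `&` or `=` inside a value, which `quote(..., safe=string.printable)` on the full URL leaves untouched.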

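import urllib.request
import urllib.parse
import string

def get_method_params():
    url = "http://www.baidu.com/?wd="
    # Append the keyword (Chinese characters) to the URL
    name = "爬蟲"
    final_url = url + name
    #print(final_url)
    # The URL now contains Chinese characters, which ASCII cannot represent,
    # so percent-encode them before sending the request
    encode_new_url = urllib.parse.quote(final_url, safe=string.printable)
    response = urllib.request.urlopen(encode_new_url)
    print(response)
    # Without the quote() step, urlopen(final_url) raises:
    # UnicodeEncodeError: 'ascii' codec can't encode characters in position 15-16: ordinal not in range(128)
    # Reason: the HTTP request line must be ASCII (0 - 127), so urllib cannot send
    # raw Chinese characters. Python strings themselves handle Chinese fine.


get_method_params()

Running the request directly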

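import urllib.request
import urllib.parse
import string

def get_method_params():
    url = "http://www.baidu.com/?wd="
    # Append the keyword (Chinese characters) to the URL
    name = "爬蟲"
    final_url = url + name
    #print(final_url)
    # The URL now contains Chinese characters, which ASCII cannot represent,
    # so percent-encode them before sending the request
    encode_new_url = urllib.parse.quote(final_url, safe=string.printable)
    response = urllib.request.urlopen(encode_new_url)
    print(response)
    # Read the body and decode it (decode() defaults to UTF-8)
    data = response.read().decode()
    print(data)
    # Without the quote() step, urlopen(final_url) raises:
    # UnicodeEncodeError: 'ascii' codec can't encode characters in position 15-16: ordinal not in range(128)
    # Reason: the HTTP request line must be ASCII (0 - 127), so urllib cannot send
    # raw Chinese characters. Python strings themselves handle Chinese fine.


get_method_params()

Reading the content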

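import urllib.request
import urllib.parse
import string

def get_method_params():
    url = "http://www.baidu.com/?wd="
    # Append the keyword (Chinese characters) to the URL
    name = "爬蟲"
    final_url = url + name
    #print(final_url)
    # The URL now contains Chinese characters, which ASCII cannot represent,
    # so percent-encode them before sending the request
    encode_new_url = urllib.parse.quote(final_url, safe=string.printable)
    response = urllib.request.urlopen(encode_new_url)
    print(response)
    # Read the body and decode it (decode() defaults to UTF-8)
    data = response.read().decode()
    print(data)
    # Save to a local file
    with open("encode_test.html", "w", encoding="utf-8") as f:
        f.write(data)
    # Without the quote() step, urlopen(final_url) raises:
    # UnicodeEncodeError: 'ascii' codec can't encode characters in position 15-16: ordinal not in range(128)
    # Reason: the HTTP request line must be ASCII (0 - 127), so urllib cannot send
    # raw Chinese characters. Python strings themselves handle Chinese fine.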


get_method_params()