[Python] Web crawler for software engineering job listings on 51job (前程無憂) https://www.51job.com

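First, open the site's https://www.51job.com/robots.txt page.

It returns the following message:

Page not found       File not found

The page you are looking for has been removed, renamed, or is temporarily unavailable.

Please try the following:
If you typed the page address in the address bar, make sure it is spelled correctly.
Open the www.51job.com home page, then look for links to the information you want.
Click the Back button to try another link.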

Notes:

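Web crawlers: check robots.txt first, either automatically or by hand, and only then crawl the content.
Binding force: the robots protocol is advisory rather than binding, but ignoring it may carry legal risk.

A site that serves no robots.txt declares no crawling restrictions, so by convention its content is treated as crawlable. Since the request above returns "File not found", the job listings are treated as crawlable here.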
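The automatic check described in the notes can be sketched with Python's standard `urllib.robotparser` module. This is a minimal illustration, not part of the original script: the helper name `is_allowed` and the sample rules are assumptions made for the example, and the rules are passed in as text so the sketch runs without network access.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, user_agent, url):
    """Return True if the given robots.txt lines permit user_agent to fetch url.

    is_allowed is a hypothetical helper written for this example.
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# An empty rule set declares no restrictions, so every URL is allowed.
print(is_allowed([], "*", "https://search.51job.com/list/"))

# Made-up rules that block one directory for all crawlers.
rules = ["User-agent: *", "Disallow: /private/"]
print(is_allowed(rules, "*", "https://example.com/private/a.html"))
print(is_allowed(rules, "*", "https://example.com/public/a.html"))
```

When the parser downloads the file itself through `set_url()` and `read()`, it treats most 4xx responses such as 404 as "everything allowed", and 401 or 403 as "everything disallowed", which matches the convention stated in the notes.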

The source code is as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @File  : HtmlParser.py
# @Author: 趙路倉
# @Date  : 2020/2/28
# @Desc  : Crawler for the 51job job-search site
# @Contact : [email protected]

from bs4 import BeautifulSoup
import requests
import csv

# Request headers
head = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'
}
# Search URL (first results page)
url = "https://search.51job.com/list/000000,000000,0000,00,9,99,%25E8%25BD%25AF%25E4%25BB%25B6,2,1.html?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare="


# Write the CSV header row
def headcsv():
    with open('/position.csv', 'w', encoding='utf-8', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["Position", "Company", "Location", "Salary", "Date", "URL"])


# Write the TXT header row
def headtxt():
    with open('E:/data/position.txt', 'w', encoding='utf-8') as ftxt:
        ftxt.write("Position Company Location Salary Date URL\n")


# Scrape one results page and append its listings to the TXT file
def position(url, head):
    try:
        r = requests.get(url, headers=head, timeout=3)
        # Decode the page using the encoding detected from its content
        r.encoding = r.apparent_encoding
        print(r.apparent_encoding)
        # Print the HTTP status code
        print(r.status_code)
        soup = BeautifulSoup(r.text, 'html.parser')
        # Each element with class "el" holds one job listing
        item = soup.find_all(class_='el', recursive=True)
        # The with block closes the file even if an error occurs
        with open('E:/data/position.txt', 'a', encoding='utf-8') as ftxt:
            for num, i in enumerate(item, start=1):
                # Skip the first 16 matches, as the original script does
                if num <= 16:
                    continue
                link = i.find("a")
                if link is None:
                    continue
                itemdetail = i.text.replace(" ", "").replace("\n", " ").replace("   ", " ").lstrip() + link.get('href', '')
                print(itemdetail)
                ftxt.write(itemdetail.replace("\n", "") + '\n')
                print("Written successfully")
    except Exception as e:
        print("Error while scraping job listings:", e)


# Walk through the result pages one by one
def write(url, head):
    for i in range(1, 2000):
        url = "https://search.51job.com/list/000000,000000,0000,00,9,99,%25E8%25BD%25AF%25E4%25BB%25B6,2,"+str(i)+".html?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare="
        print(url)
        position(url, head)


if __name__ == "__main__":
    # Create the output file with its header row before appending listings
    headtxt()
    write(url, head)
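The search keyword in the script's URL, `%25E8%25BD%25AF%25E4%25BB%25B6`, is the word 软件 ("software") percent-encoded twice. To search for a different keyword, the URL can be built rather than hard-coded. The sketch below is an illustration only: `build_search_url`, `BASE`, and `QUERY` are names introduced here, and it assumes the site still accepts the URL layout used in the script.

```python
from urllib.parse import quote

# URL layout copied from the script; {keyword} and {page} are the only variable parts.
BASE = "https://search.51job.com/list/000000,000000,0000,00,9,99,{keyword},2,{page}.html"
QUERY = ("lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99"
         "&companysize=99&ord_field=0&dibiaoid=0&line=&welfare=")


def build_search_url(keyword, page):
    """Build the results-page URL for a keyword and a 1-based page number.

    build_search_url is a hypothetical helper written for this example.
    """
    # Percent-encode twice: the first pass turns 软件 into %E8%BD%AF...,
    # the second pass turns each % into %25.
    encoded = quote(quote(keyword))
    return BASE.format(keyword=encoded, page=page) + "?" + QUERY


print(build_search_url("软件", 1))
```

Called with 软件 and page 1, this reproduces the `url` constant at the top of the script.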

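Each scraped entry consists of position, company, location, salary, date, and URL. Results are saved to E:/data/position.txt; change the path or the file format as needed.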
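As one way of changing the file format, the entries could be written as CSV rows instead of space-separated text. The script defines `headcsv()` but never writes any CSV rows, so the following is only a sketch of how that might look: `save_rows_csv` and `FIELDS` are names introduced here, and it assumes each listing has already been split into the six fields.

```python
import csv

# The six fields the article lists for each entry.
FIELDS = ["Position", "Company", "Location", "Salary", "Date", "URL"]


def save_rows_csv(rows, path):
    """Write a header row followed by one CSV row per listing.

    save_rows_csv is a hypothetical helper written for this example.
    """
    # newline='' lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(FIELDS)
        writer.writerows(rows)
```

A call such as `save_rows_csv(rows, "E:/data/position.csv")` would then replace the text-file writes, with `rows` being a list of six-item lists. Unlike the space-separated text file, CSV quoting keeps fields that contain spaces or commas intact.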

