個人灌水博文#1
本文使用python爬蟲爬取學校內部網信箱內容,并將內容做成詞云來直觀獲取學生最需要解決的問題
涉及到了爬蟲,需要登陸驗證網頁的爬蟲爬取,詞云的制作
主要實作思路:用帶有cookie資訊的爬蟲爬取學校內部網校務信箱資訊,將資訊通過jieba庫分詞并通過wordcloud庫來生成詞庫
程式主體分為五個部分:
1、程式所使用的庫的資訊:
# coding:utf-8
import requests
from bs4 import BeautifulSoup
import re
import jieba
import wordcloud
from cv2 import imread
其中requests,BeautifulSoup,re庫用于爬取資訊,jieba,wordcloud,imread庫用于生成詞云
2、爬取網頁部分:
def getHTML(Cookie,url): #用于爬取網頁內容
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/73.0.3683.75 Chrome/73.0.3683.75 Safari/537.36',
'Cookie': Cookie #Cookie資訊用于登陸網頁
}
session = requests.Session()
response = session.get(url, headers=headers)
response.encoding = response.apparent_encoding
return response.text
關鍵在于使用cookie資訊來登陸網頁進行資訊的爬取
3、將爬取到的資訊放入串列中并過濾掉一些無用的資訊:
def fill_List(List,HTML): #將爬取的資訊放于串列中
soup = BeautifulSoup(HTML,"html.parser")
for i in soup.find_all(string = re.compile("\xa0")):
o = "".join(i.split())
if o != '' and o != '檢索結果:2018年起信件共' and o != '共358頁直接跳到第' and o != '信件內容':
List.append(o)
pass
因為筆者爬取的網頁是學校的校務信箱,較多內容固定,所以可以剔除掉一些無用的資訊
4、由串列生成詞云部分:
def WordCloud(list): #詞云部分,將串列分詞并生成詞云
#mk = imread('C:/Users/BoletusAo/Desktop/wordcloud.png')
w = wordcloud.WordCloud(width=1920,height=1080,font_path="msyh.ttc",stopwords = {"問題","的","疑問","關于","對于","故障","可以","情況","建議","投訴","反饋","真的","能","不能","能否","呢","還是","樓","你","中","與","為什么","的一些","請問","已經","回復","要","疑惑","點","了嗎","人","怎么","嗎","是","又","也","我們","級","無法","一直","很","是不是","等","意見","以及","處理","部分","好","多","這","為","被","未","后","就","吧","啊","里","了","時候","什么","還","一點","一個","使用","在","晚上","希望","何時","想","存在","不","和","有","讓","沒","及","請","到","通知","是否","有關","為何","用","對","嚴重","解決","不合理","讓","沒有","我","都","不了","新","正常","導致","出現","一下","開", \
"深大","經常","差","說","作為","一些","最近","服務","穩定","人員","安排","吃","上","上課","再"},background_color="white")
str = ",".join(list)
jieba.setLogLevel(jieba.logging.INFO)
w.generate(" ".join(jieba.lcut(str)))
w.to_file('C:/Users/BoletusAo/Desktop/SZU2.png')
pass
生成詞云,關鍵在于屏蔽詞的設定,信件標題有較多無用的詞語需要屏蔽,例如:問題,的等
5、main函式:
if __name__ == '__main__':
List = []
for i in range (1,359): #校務信箱一共有358頁
if i == 1:
url = 'url資訊'
Cookie = 'cooki資訊'
HTML = getHTML(Cookie,url)
fill_List(List,HTML)
else:
url = 'url資訊{}'.format(i) #此處涉及到換頁的相關操作,使用的是直接更改url資訊的方法
Cookie = 'coookie資訊'
HTML = getHTML(Cookie, url)
fill_List(List, HTML)
WordCloud(List)
具體url,cookie和翻頁的處理因涉及到學校資訊固不提供
最終結果展示:

撰寫程式中遇到的問題:
1、需要登陸驗證的網頁該如何爬取
解決方法:使用cookie資訊
2、如何在爬取的程序中進行翻頁
更改url的相關資訊實作翻頁
轉載請註明出處,本文鏈接:https://www.uj5u.com/houduan/292498.html
標籤:python
