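Scraping data and storing it in a database

1. Import the required packages

from bs4 import BeautifulSoup  # parse HTML and extract data
import re  # regular expressions for text matching
import xlwt  # write Excel (.xls) files
import urllib.request, urllib.parse  # build the request for a URL and fetch the page
import sqlite3  # SQLite database operations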

2. Fetch and parse the page

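url = "<page URL>"
(Depending on the pattern in the URL, you can request one specific page or loop over several pages.)
# Chinese characters in the URL must be percent-encoded. Here they are encoded twice, which some sites expect; for most sites a single parse.quote() is enough
# Format: from urllib import parse   keyword = parse.quote(parse.quote("中文"))
# Fill in the request headers (needed when the site has anti-scraping checks, so the request looks like a real browser)
head = {"User-Agent": "…"}  # paste your browser's User-Agent; add further headers the same way
req = urllib.request.Request(url, headers=head)
html = ""
# The default is a GET request. To submit form data such as account details, send a POST request instead by passing data = bytes(urllib.parse.urlencode({"xxx": "xxx"}), encoding="utf-8")
# Fetch the page source into response (with a 3 s timeout), read it into the variable html, then parse it to extract data
response = urllib.request.urlopen(req, timeout=3)  # also pass data=data to send a POST
html = response.read().decode('utf-8')  # use the page's actual encoding; check it with F12
soup = BeautifulSoup(html, "html.parser")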
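The request-building steps above can be sketched without touching the network. This is only an illustration: `BASE_URL`, `build_request`, and the `Mozilla/5.0` header value are made-up placeholders, and the double encoding follows the note above rather than any rule that holds for every site.

```python
import urllib.parse
import urllib.request

# Hypothetical URL pattern; substitute the real site you are scraping.
BASE_URL = "https://example.com/search/{keyword}/{page}.html"


def build_request(keyword, page, form=None):
    """Build (but do not send) the request for one result page."""
    # Encode twice, as described above; use a single quote() if the site expects that.
    encoded = urllib.parse.quote(urllib.parse.quote(keyword))
    url = BASE_URL.format(keyword=encoded, page=page)
    headers = {"User-Agent": "Mozilla/5.0"}  # placeholder; copy your browser's value
    # Supplying data switches urllib from GET to POST.
    data = None
    if form is not None:
        data = urllib.parse.urlencode(form).encode("utf-8")
    return urllib.request.Request(url, data=data, headers=headers)


# One request per page, built in a loop.
requests_to_send = [build_request("中文", page) for page in range(1, 4)]
# The same page requested as a POST with form fields.
post_request = build_request("中文", 1, form={"xxx": "xxx"})
```

Each object in `requests_to_send` can then be passed to `urllib.request.urlopen(req, timeout=3)` exactly as in the notes.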
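data = []
# Extraction method 1: locate the element you need with F12 and select down to it level by level, e.g. div.el  p.p1  span  a
for link in soup.select(".el > .p1 > span > a"):  # select() returns a list of matching tags
    data.append(link["href"])   # link["attribute"] reads an attribute; link.text returns the tag's text
    # data.append({"link": link["href"]})  # alternatively, store each record as a dict
# Use strip() to trim surrounding whitespace, replace() to substitute characters, split() to split a string, and [slices] to pick a range: [0:2] keeps the first two characters, [0:-2] drops the last two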
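The string helpers mentioned above are easiest to see on a concrete value. The sketch below uses a made-up scraped string (`raw`); the field layout is an assumption chosen only for illustration.

```python
# Made-up text as it might come out of link.text, with stray whitespace.
raw = "  Python Developer | Shanghai | 15-20k/month \n"

cleaned = raw.strip()                                    # trim both ends
fields = [part.strip() for part in cleaned.split("|")]  # split into columns
salary = fields[2].replace("k/month", "")               # drop the unit
low, high = salary.split("-")                            # split the range

first_two = fields[0][0:2]          # first two characters
without_last_two = fields[0][0:-2]  # everything except the last two characters
```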
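# Extraction method 2:
# 1. Get every div tag whose class attribute is item
# 2. Filter further: convert each item to a string, match it against predefined patterns, and save the results. A pattern may match more than once, but all matches come back in one list, so you can filter again as needed
findLink = re.compile(r'<a href="(.*?)">')  # regular expression, compiled once
for item in soup.find_all('div', class_='item'):
    item = str(item)
    link = re.findall(findLink, item)  # a list; drop the extra entries if len(link) > 1
    if link:
        link[0] = re.sub(r'<br(\s+)?/>(\s+)?', "", link[0])  # remove <br/> tags and the whitespace after them; strip() trims whitespace at both ends
        data.append(link[0])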
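To see what the regular expressions in method 2 actually return, here is a small sketch that runs them on a hard-coded snippet instead of a live page. The HTML and the `find_note` pattern are invented for the example; only the link and `<br/>` patterns come from the notes.

```python
import re

# Made-up HTML standing in for str(item) from soup.find_all('div', class_='item').
item = '''
<div class="item">
  <a href="https://example.com/a">First</a>
  <p class="note">line one<br/>  line two<br />line three</p>
</div>
'''

find_link = re.compile(r'<a href="(.*?)">')
find_note = re.compile(r'<p class="note">(.*?)</p>', re.S)  # re.S lets . match newlines
find_br = re.compile(r'<br(\s+)?/>(\s+)?')

links = find_link.findall(item)           # list of captured href values
note_raw = find_note.findall(item)[0]     # first (and only) match
note = find_br.sub(" ", note_raw).strip() # replace each <br/> with one space
```

Because `findall` always returns a list, checking that it is non-empty before indexing `[0]` avoids an `IndexError` on items that lack the element.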

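3. Save the data

# Save to Excel -> create a workbook, add a sheet to it, write the cell contents, then save the workbook to a path
book = xlwt.Workbook(encoding="utf-8")
sheet = book.add_sheet('Sheet', cell_overwrite_ok=True)  # cell_overwrite_ok is optional; it allows writing to the same cell more than once
sheet.write(0, 0, data[0])  # row, column, value
book.save("我的檔案.xls")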
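# Save to SQLite -> open a connection, get a cursor, execute SQL statements with the cursor, commit, close the connection
con = sqlite3.connect("MyWord.db")
cur = con.cursor()
sql = '''
create table table1
(id integer primary key autoincrement,
word text)
'''
cur.execute(sql)
con.commit()
sql = "insert into table1(word) values (?)"  # a ? placeholder instead of string formatting, so quotes in the data cannot break the statement
cur.execute(sql, (data[0],))
con.commit()
con.close()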
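Saving a whole list rather than a single value follows the same connect, cursor, execute, commit, close sequence. The sketch below is a minimal variant under a few assumptions: it uses an in-memory database so it leaves no file behind, `words` is made-up sample data, and `create table if not exists` is added so the script can be rerun.

```python
import sqlite3

words = ["spider", "it's quoted", 'say "hi"']  # made-up sample data

con = sqlite3.connect(":memory:")  # swap in "MyWord.db" to write a real file
cur = con.cursor()
cur.execute(
    "create table if not exists table1 ("
    "id integer primary key autoincrement, word text)"
)
# executemany runs the insert once per tuple; the ? placeholder handles quoting.
cur.executemany("insert into table1(word) values (?)", [(w,) for w in words])
con.commit()

rows = cur.execute("select id, word from table1 order by id").fetchall()
con.close()
```

Values containing single or double quotes are stored unchanged here, which is the main reason to prefer placeholders over building the SQL string by hand.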

Data visualization with Flask

from flask import Flask, render_template

app=Flask(__name__)
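@app.route('/index/<string:s>')
def index(s):
    # You can run SQLite queries here and pass the results into the HTML template
    # Data passed into the template may be mangled by HTML escaping,
    # e.g. a single quote ' becomes &#39;, so use {{ s|tojson }} in the HTML page
    return render_template("index.html", s=s)
if __name__ == '__main__':
    app.run()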
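The escaping problem described in the comments can be reproduced with the standard library alone. This is an approximation, not Flask's own code path: `html.escape` stands in for the template engine's autoescaping and `json.dumps` for the `tojson` filter, and `rows` is a made-up query result.

```python
import html
import json

rows = [("python", 3), ("it's", 1)]  # made-up query result

# Roughly what autoescaping does to a plain {{ s }}: quotes turn into entities.
# (The standard library writes the apostrophe as &#x27;; Jinja2 writes &#39;.)
escaped = html.escape(str(rows))

# Roughly what {{ s|tojson }} emits instead: a JSON literal that JavaScript can read.
as_json = json.dumps(rows)
```

The escaped form is safe to display as text but is no longer valid data for a script, while the JSON form can be assigned directly to a JavaScript variable.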

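HTML problems and fixes

Problem 1: images fail to load
Fix: the path is wrong; use ../ to step up a directory so that the full path resolves correctly
Problem 2: the image is affected by CSS, so simply changing its size has no effect
Fix: add <style> .class-name{height:650px !important}</style> to the head
Problem 3: links do not go back to the root directory
Fix: /team goes straight to the /team page, whereas team is resolved relative to the current page and ends up appended to its path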

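Using ECharts

<!-- Include the echarts.js file in the head -->
<script src="../static/js/echarts.min.js"></script>
<!-- Where the chart should appear, prepare a DOM element with an explicit size for ECharts -->
<div id="main" style="width: 1200px;height: 600px;">
</div> <script type="text/javascript">
    // Initialize an ECharts instance on the prepared DOM element
    var myChart=echarts.init(document.getElementById('main'));
    // Insert the example code from the official site here (it defines option)
    // Render the chart with the configuration and data just specified
    myChart.setOption(option);
    </script>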
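Instead of pasting a fixed `option` from the official examples, the configuration can also be built on the Python side from scraped data. The sketch below assumes a simple bar chart; `bar_option` and the sample counts are hypothetical, and only the common `title`, `tooltip`, `xAxis`, `yAxis`, and `series` keys are used.

```python
import json


def bar_option(title, counts):
    """Build an ECharts option dict for a basic bar chart from {label: value}."""
    labels = list(counts)
    return {
        "title": {"text": title},
        "tooltip": {},
        "xAxis": {"type": "category", "data": labels},
        "yAxis": {"type": "value"},
        "series": [{"type": "bar", "data": [counts[k] for k in labels]}],
    }


option = bar_option("Words", {"python": 3, "flask": 2})  # made-up counts
option_json = json.dumps(option, ensure_ascii=False)     # what the page would receive
```

In a Flask view the dict itself could be passed to `render_template` and written into the page with `var option = {{ option|tojson }};` before the call to `myChart.setOption(option)`.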

Using WordCloud

import jieba
from matplotlib import pyplot as plt
from wordcloud import WordCloud
from PIL import Image
import numpy as np
import sqlite3
# Open the database connection and read the data into a string variable (text); code omitted
# jieba segmentation yields many words -> join them into one string, with spaces between the words ->
# open the mask image -> convert it to an array -> configure the word cloud -> feed it the string -> draw -> save the image
cut = jieba.cut(text)
string = ' '.join(cut)
img = Image.open(r'.\static\img\p6.jpg')
img_array = np.array(img)
wc = WordCloud(
    background_color='white',
    mask=img_array,
    # choose a font from C:/Windows/Fonts
    font_path="STZHONGS.TTF"
)
wc.generate_from_text(string)
fig = plt.figure(1)  # create figure number 1
plt.imshow(wc)  # draw the word cloud
plt.axis('off')  # hide the axes
plt.savefig(r'.\static\img\p8.png', dpi=800)  # save the figure to a file; dpi controls the resolution
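The omitted step that fills `text` from the database might look like the following. This is a guess at the missing code, not the original: it assumes the words live in the `table1` / `word` layout used earlier, and `load_text` is a hypothetical helper. The demo uses an in-memory database with made-up rows so that it runs on its own.

```python
import sqlite3


def load_text(con):
    """Concatenate every stored word into one string for jieba.cut()."""
    rows = con.execute("select word from table1 order by id").fetchall()
    # Skip NULL or empty values so they do not end up in the text.
    return "".join(row[0] for row in rows if row[0])


# Self-contained demo; in the real script connect to "MyWord.db" instead.
con = sqlite3.connect(":memory:")
con.execute("create table table1 (id integer primary key autoincrement, word text)")
con.executemany(
    "insert into table1(word) values (?)",
    [("爬蟲",), (None,), ("可視化",)],
)
text = load_text(con)
con.close()
```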

Acknowledgement: the learning material comes from the bilibili course "Python爬蟲編程基礎5天速成（2021全新合集）Python入門+資料分析".

Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/300752.html

