Python Web Scraping for Beginners (2)
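In the previous post I scraped only a single page, which clearly wasn't enough. This time I scrape the entire Top 250 and save it as both txt and Excel.
First, let's see how to scrape all the movies:
# Compare the URLs of the pages (the first three are listed):
#https://movie.douban.com/top250?start=0&filter=
#https://movie.douban.com/top250?start=25&filter=
#https://movie.douban.com/top250?start=50&filter=
# So the pages can be iterated over
Only the number after start= differs, so it can be turned into a parameter that increases by 25 each time, and the list is walked through in groups of 25.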
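The paging scheme can be sketched roughly like this. It is only a minimal illustration: `page_urls` is a hypothetical helper name that does not appear in the script, and the total of 250 with a page size of 25 is taken from the URLs listed above.

```python
BASE_URL = 'https://movie.douban.com/top250?start={}&filter='

def page_urls(total=250, page_size=25):
    """Return one URL per page; start takes the values 0, 25, 50, ..."""
    return [BASE_URL.format(start) for start in range(0, total, page_size)]

urls = page_urls()
print(len(urls))   # 10 pages in total
print(urls[0])     # first page, start=0
print(urls[-1])    # last page, start=225
```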
Here is the txt version first, since it is the more straightforward one (every step is commented):
import time

import requests
from lxml import etree

# Module-level lists filled by parse_html; they are reused for the Excel export.
n1 = []  # titles
n2 = []  # scores
n3 = []  # one-line comments


class Movie(object):
    def __init__(self):
        self.headers = {'user-agent': 'mozilla/5.0'}
        # Unlike the previous post, the URL now has a {} placeholder that format() fills with the page offset.
        self.url = 'https://movie.douban.com/top250?start={}&filter='

    def get_html(self, url):
        # Request one page and hand its HTML to the parser.
        resp = requests.get(url, headers=self.headers)
        html = resp.text
        self.parse_html(html)

    def parse_html(self, html):
        xp_html = etree.HTML(html)
        titles = xp_html.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[1]/a/span[1]/text()')
        scores = xp_html.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[2]/div/span[2]/text()')
        comments = xp_html.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[2]/p[2]/span/text()')
        # zip pairs the three lists by position and stops at the shortest one.
        for x, y, z in zip(titles, scores, comments):
            print(x, y, z)
            with open('movie', 'a+', encoding='utf-8') as f:
                f.writelines([x, '--', y, '--', z])
                f.write('\n')  # writelines adds no line break, so add one by hand
            n1.append(x)
            n2.append(y)
            n3.append(z)

    def main(self):
        start = time.time()
        # start = 0, 25, ..., 225: one request per page of 25 movies
        for i in range(0, 250, 25):
            url = self.url.format(i)
            self.get_html(url)
        end = time.time()
        print('time :', end - start)


spider = Movie()
spider.main()
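In the code above, the start={} placeholder in the URL is filled in by format() inside the for loop. Each resulting URL is passed to get_html, which requests the page; parse_html then parses the response and saves the results.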
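One possible refinement, offered only as a sketch: the script reopens the output file for every movie. A hypothetical helper such as `save_rows` below (not part of the original code) opens the file once and writes all rows in one go. The sample rows are made-up placeholders, not scraped data, and the temporary path is just there so the sketch runs anywhere.

```python
import os
import tempfile

def save_rows(path, rows, sep='--'):
    """Append one line per (title, score, comment) tuple to a UTF-8 text file."""
    with open(path, 'a', encoding='utf-8') as f:
        for row in rows:
            f.write(sep.join(row) + '\n')

# Placeholder data for illustration only; real rows would come from parse_html.
sample = [('Title A', '9.7', 'Comment A'), ('Title B', '9.6', 'Comment B')]
out_path = os.path.join(tempfile.mkdtemp(), 'movie.txt')
save_rows(out_path, sample)

with open(out_path, encoding='utf-8') as f:
    lines = f.read().splitlines()
print(lines)
```

Inside parse_html this would amount to something like `save_rows('movie', zip(titles, scores, comments))`, which keeps the same title--score--comment line format.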
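For saving to Excel, I referred to a blog post titled "Saving scraped data to Excel for the first time"
and then adapted it: the for loop in parse_html appends to three lists, which are then iterated over when writing to Excel. The lists have to exist before spider.main() is called: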
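# The three lists that parse_html appends to. They must be created before
# spider.main() runs; re-creating them afterwards would discard the scraped data.
n1 = []  # titles
n2 = []  # scores
n3 = []  # comments

import xlwt

book = xlwt.Workbook()  # create a workbook
sheet1 = book.add_sheet('first')  # create a sheet named 'first'
i = 0
i1 = 0
i2 = 0
for j in n1:  # iterate over the titles
    sheet1.write(i, 1, j)  # write j to row i, column index 1 (indices start at 0)
    i += 1  # move on to the next row
for q in n2:  # iterate over the scores
    sheet1.write(i1, 2, q)  # write q to row i1, column index 2
    i1 += 1  # move on to the next row
for k in n3:  # iterate over the comments
    sheet1.write(i2, 3, k)  # write k to row i2, column index 3
    i2 += 1  # move on to the next row
book.save('Movie.xls')  # xlwt writes the legacy .xls format, so use an .xls extension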
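The three loops with their separate counters could also be collapsed into a single row-wise pass. The sketch below is an assumption-laden illustration, not the original approach: `write_rows` and `fake_write` are hypothetical names, and `fake_write` merely stands in for a worksheet's write(row, col, value) method so that the example runs without xlwt installed.

```python
def write_rows(write_cell, titles, scores, comments, first_col=1):
    """Write one movie per row; write_cell(row, col, value) does the actual write."""
    for row, values in enumerate(zip(titles, scores, comments)):
        for offset, value in enumerate(values):
            write_cell(row, first_col + offset, value)

# Stand-in for a worksheet: records each cell in a dict keyed by (row, col).
cells = {}

def fake_write(row, col, value):
    cells[(row, col)] = value

# Placeholder data for illustration only.
write_rows(fake_write, ['Title A', 'Title B'], ['9.7', '9.6'], ['Comment A', 'Comment B'])
print(cells[(0, 1)], cells[(1, 3)])
```

With xlwt, calling `write_rows(sheet1.write, n1, n2, n3)` should fill the same cells as the three loops, provided the lists have equal length, since zip stops at the shortest one.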
Checking in again: day 2!!!
