Scraping Mtime's TOP 100 Japanese films by Mtime rating with the Scrapy framework and saving them to a MySQL database
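- First, create a Scrapy project named shiguang: scrapy startproject shiguang (shiguang is your project name).
- In the generated shiguang project, find the spiders directory and create a Python file named movie.py inside it.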

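Analyzing the site's pages
We want to crawl several pages, so let's look at how Mtime builds the link to the next page.
Page 1: http://movie.mtime.com/list/1709.html
Page 2: http://movie.mtime.com/list/1709-2.html
Page 3: http://movie.mtime.com/list/1709-3.html
From this we can see that the page number in the URL goes up by 1 each time, and that only the first page has no numeric suffix.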
# class attributes of the spider
start_urls = ['http://movie.mtime.com/list/1709.html']
page = 1
page_url = "http://movie.mtime.com/list/1709-%d.html"

# at the end of parse(): page starts at 1, so this schedules pages 2, 3 and 4
if self.page <= 3:
    self.page += 1
    new_page_url = self.page_url % self.page
    print(new_page_url)
    yield scrapy.Request(url=new_page_url, callback=self.parse)
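The URL pattern can be checked on its own before running the spider. The following is a minimal sketch, not part of the original project: the function name list_page_urls and its last_page parameter are hypothetical, and the value 4 in the usage line reflects that the condition self.page <= 3, with page starting at 1, ends up requesting pages 2 through 4.

```python
FIRST_PAGE = "http://movie.mtime.com/list/1709.html"
PAGE_TEMPLATE = "http://movie.mtime.com/list/1709-%d.html"


def list_page_urls(last_page):
    """Return the list-page URLs from page 1 through last_page (inclusive)."""
    if last_page < 1:
        return []
    # Page 1 has no numeric suffix; later pages use the "-N" form.
    urls = [FIRST_PAGE]
    for page in range(2, last_page + 1):
        urls.append(PAGE_TEMPLATE % page)
    return urls


if __name__ == "__main__":
    for url in list_page_urls(4):
        print(url)
```

Building the list up front like this is only for inspection; the spider itself generates one request at a time from inside parse().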

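The XPath expressions can be worked out by inspecting the page, and the remaining fields are found the same way. Without further ado, here is the code.
Put the following code in movie.py: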
import scrapy
# the package name matches the project created with "scrapy startproject shiguang"
from shiguang.items import ShiguangItem


class MovieSpider(scrapy.Spider):
    name = 'movie'
    allowed_domains = ['movie.mtime.com']
    start_urls = ['http://movie.mtime.com/list/1709.html']
    page = 1
    page_url = "http://movie.mtime.com/list/1709-%d.html"

    def parse(self, response):
        # each <dd> holds one film entry
        list_selector = response.xpath("//div[@class='top_nlist']/dl/dd")
        for one_selector in list_selector:
            # get() returns None instead of raising when nothing matches
            name = one_selector.xpath("./div/h3/a/text()").get()  # film title
            director = one_selector.xpath("./div/p[1]/a/text()").get()  # director
            performer = one_selector.xpath("./div/p[2]/a/text()").get()  # lead actor
            content = one_selector.xpath("./div/p[3]/text()").get()  # synopsis
            item = ShiguangItem()
            item["name"] = name
            item["director"] = director
            item["performer"] = performer
            item["content"] = content
            yield item
            # print(item)
            # print('--' * 22)
        # follow the next list page; page starts at 1, so pages 2 to 4 are requested
        if self.page <= 3:
            self.page += 1
            new_page_url = self.page_url % self.page
            print(new_page_url)
            yield scrapy.Request(url=new_page_url, callback=self.parse)
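To see what each XPath expression selects without hitting the site, the same paths can be tried against a small hand-written fragment. This is only a sketch under stated assumptions: it uses the standard library's xml.etree.ElementTree (which supports a limited XPath subset) rather than Scrapy's selectors, the SAMPLE markup is made up to mirror the structure the spider expects and is not copied from the real page, and extract_rows is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment mirroring the structure the spider's XPath assumes.
SAMPLE = """
<html><body>
<div class="top_nlist"><dl>
  <dd><div>
    <h3><a>Sample Film A</a></h3>
    <p>Director: <a>Director A</a></p>
    <p>Cast: <a>Actor A</a></p>
    <p>A short synopsis.</p>
  </div></dd>
  <dd><div>
    <h3><a>Sample Film B</a></h3>
    <p>Director: <a>Director B</a></p>
    <p>Cast: <a>Actor B</a></p>
    <p>Another synopsis.</p>
  </div></dd>
</dl></div>
</body></html>
"""


def extract_rows(markup):
    """Apply the spider's paths (minus text()) to well-formed markup."""
    root = ET.fromstring(markup.strip())
    rows = []
    for dd in root.findall(".//div[@class='top_nlist']/dl/dd"):
        rows.append({
            "name": dd.findtext("./div/h3/a"),
            "director": dd.findtext("./div/p[1]/a"),
            "performer": dd.findtext("./div/p[2]/a"),
            "content": dd.findtext("./div/p[3]"),
        })
    return rows


if __name__ == "__main__":
    for row in extract_rows(SAMPLE):
        print(row)
```

ElementTree only parses well-formed XML, so this works for a tidy fragment but not for arbitrary real-world HTML; inside the spider, response.xpath() remains the tool to use.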
Put the following code in items.py:
import scrapy


class ShiguangItem(scrapy.Item):
    # define the fields for your item here
    name = scrapy.Field()       # film title
    director = scrapy.Field()   # director
    performer = scrapy.Field()  # lead actor
    content = scrapy.Field()    # synopsis
Connect to the database in the pipelines file, pipelines.py:
import MySQLdb


class ShiguangPipeline:
    def process_item(self, item, spider):
        return item


class MySQLPipeline(object):
    def open_spider(self, spider):
        # read connection settings from settings.py, falling back to defaults
        db_name = spider.settings.get("MYSQL_DB_NAME", "mtime")
        host = spider.settings.get("MYSQL_HOST", "localhost")
        user = spider.settings.get("MYSQL_USER", "root")
        pwd = spider.settings.get("MYSQL_PASSWORD", "123456")
        self.db_conn = MySQLdb.connect(db=db_name,
                                       host=host,
                                       user=user,
                                       password=pwd,
                                       charset="utf8")
        self.db_cursor = self.db_conn.cursor()

    def process_item(self, item, spider):
        values = (item['name'],
                  item["director"],
                  item["performer"],
                  item["content"])
        # parameterized insert; the dianyin table must already exist
        sql = 'insert into dianyin(name,director,performer,content) values(%s,%s,%s,%s)'
        self.db_cursor.execute(sql, values)
        return item

    def close_spider(self, spider):
        # commit once at the end, then release the cursor and the connection
        self.db_conn.commit()
        self.db_cursor.close()
        self.db_conn.close()
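The pipeline inserts into a table named dianyin, but the post never shows how that table is defined. The sketch below is one way to try the same open/process/close lifecycle without a MySQL server, using the standard library's sqlite3 as a stand-in. Everything beyond the four column names is an assumption: the column types, the id column, the class name SQLitePipelineSketch and the count_rows helper are all hypothetical.

```python
import sqlite3

# Assumed schema: the column names come from the INSERT statement in the
# pipeline; the types and the id column are guesses for illustration.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS dianyin (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT,
    director TEXT,
    performer TEXT,
    content TEXT
)
"""


class SQLitePipelineSketch:
    """Hypothetical stand-in that mirrors MySQLPipeline's three hooks."""

    def open_spider(self, spider):
        self.db_conn = sqlite3.connect(":memory:")
        self.db_cursor = self.db_conn.cursor()
        self.db_cursor.execute(CREATE_TABLE_SQL)

    def process_item(self, item, spider):
        values = (item["name"], item["director"],
                  item["performer"], item["content"])
        # sqlite3 uses "?" placeholders where MySQLdb uses "%s".
        self.db_cursor.execute(
            "INSERT INTO dianyin (name, director, performer, content) "
            "VALUES (?, ?, ?, ?)",
            values,
        )
        return item

    def count_rows(self):
        self.db_cursor.execute("SELECT COUNT(*) FROM dianyin")
        return self.db_cursor.fetchone()[0]

    def close_spider(self, spider):
        self.db_conn.commit()
        self.db_cursor.close()
        self.db_conn.close()
```

For the real MySQL database, a table with the same four columns has to be created in the mtime database before the crawl starts; suitable column types there are a separate choice that this sketch does not settle.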
Finally, make the following changes in settings.py:
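The original settings are not preserved here, so the fragment below is a guess at the minimum needed, not the author's actual file. ITEM_PIPELINES is Scrapy's standard setting for enabling a pipeline, and the four MYSQL_* names are the ones MySQLPipeline reads in open_spider; the values simply repeat that pipeline's defaults. The module path assumes the project was created as shiguang, and the priority 300 is an arbitrary choice.

```python
# settings.py (fragment): assumed values, not the author's original settings

# Enable the MySQL pipeline; lower numbers run earlier (0 to 1000).
ITEM_PIPELINES = {
    "shiguang.pipelines.MySQLPipeline": 300,
}

# Connection settings read by MySQLPipeline.open_spider().
MYSQL_DB_NAME = "mtime"
MYSQL_HOST = "localhost"
MYSQL_USER = "root"
MYSQL_PASSWORD = "123456"
```

Without an ITEM_PIPELINES entry the pipeline class is never called, so items would be scraped but nothing would reach the database.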




That completes the whole Mtime crawl. The result is shown in the screenshot below.


This is my first blog post, so it may not be very well written. If you have any questions, feel free to message me privately.
Q:502037970
Please credit the source when reposting. Link to this article: https://www.uj5u.com/qita/139209.html
