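How can I use Scrapy to scrape the results of multiple pages into one item?
Pages that should be considered:
- an original page o (e.g. given by start_requests())
- all pages url in urls, where urls is a field created by scraping o according to parse()
Note that the urls of different o are not necessarily disjoint.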
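Concrete example
I have a spider that yields the following fields for an item i (i.e. a scraped page):
- id
- prio
- urls
urls is a list of URLs. For each url (that is not dead) I want to scrape some information from url to extend i with the fields:
- image_list
- head_list
Finally, I want to filter the resulting items so that, for each id, only the item with the highest prio is kept.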
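The final filtering step can be pictured independently of Scrapy. The following is only a minimal sketch in plain Python, assuming that items are dicts, that prio values are comparable numbers, and that the first item seen wins a tie; the function name keep_highest_prio and the sample data are made up for illustration.

```python
def keep_highest_prio(items):
    """Return one item per id: the one with the highest prio."""
    best = {}
    for item in items:
        current = best.get(item["id"])
        # Strict ">" means the first item seen wins a tie.
        if current is None or item["prio"] > current["prio"]:
            best[item["id"]] = item
    return list(best.values())


# Hypothetical sample data, not taken from the real site.
items = [
    {"id": "a", "prio": 1, "urls": ["https://example.com/1"]},
    {"id": "a", "prio": 3, "urls": ["https://example.com/2"]},
    {"id": "b", "prio": 2, "urls": []},
]
filtered = keep_highest_prio(items)
```

Because the filter needs to see every item before it can decide, it fits most naturally as a step that runs once the crawl has finished, for example on the exported file.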
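What I tried
Since I have read that all scraping should be done inside the spider (and not inside an item pipeline component), I thought the best approach would be to separate scraping from post-processing:
- Use a spider that collects all data from the start page: parse parses the data into i, then calls response.follow(url, callback=self.parse_given_url, meta={'item': i}) for each url in i's urls
- parse_given_url takes i from the meta data, parses the given url, and adds image_list and head_list to i
- Do all post-processing (merging and filtering) of the scraped data in an item pipeline component to obtain the final items.
Minimal reproducible example of my approach: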
import scrapy


class Minimal(scrapy.Spider):
    name = "minimal"

    def start_requests(self):
        url = 'https://www.arztsuche-bw.de/index.php?suchen=1&id_fachgruppe=441&arztgruppe=facharzt&plz=761&direction=ASC'
        yield scrapy.Request(url=url, method="POST", callback=self.parse)

    def parse(self, response):
        # Result rows alternate between the "even" and "odd" classes.
        for office in response.css('li.row.resultrow.even') + response.css('li.row.resultrow.odd'):
            full_name = office.css('dd.name dl').xpath('string(.//dt[1])').get()
            contact_selectors = office.css('dd.adresse dl dd')
            urls = contact_selectors.xpath('.//a[@title="Homepage aufrufen"]/@href').getall()
            office_data = {
                'name': full_name,
                'url': urls,
            }
            if urls:
                # One request per homepage; the office data travels along in meta.
                for url in urls:
                    yield response.follow(url, callback=self.parse_hp, meta={'item': office_data})
            else:
                # No homepage listed: emit the office data as it is.
                yield office_data

    def parse_hp(self, response):
        office_data = response.meta['item']
        return {
            **office_data,
            'hp_head': response.xpath('//h1/text()').get(),
            'hp_logo_image': response.xpath('//img/@src').get(),
        }
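However, since the urls fields of different items are not disjoint, some of the requests from the response.follow_all() calls (response.follow() in the minimal example above) are dropped by the duplicate filter, so some resulting items are missing. I could pass dont_filter=True to these calls, but then a URL might be scraped several times, which I want to avoid. So I feel my approach is wrong.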
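One way to think about the problem is to scrape every distinct URL exactly once into a mapping keyed by URL, and only afterwards join that mapping back onto the items. The following is a minimal sketch of that join in plain Python, not Scrapy API; the helper names unique_urls and merge_page_data, the page_data layout with "head" and "image" keys, and the sample data are all assumptions made for illustration.

```python
def unique_urls(items):
    """Collect every URL referenced by any item once, in first-seen order."""
    seen = {}
    for item in items:
        for url in item["urls"]:
            seen.setdefault(url, None)
    return list(seen)


def merge_page_data(items, page_data):
    """Extend each item with head_list and image_list from a url -> data mapping.

    URLs missing from page_data (e.g. dead links) are skipped.
    """
    merged = []
    for item in items:
        pages = [page_data[u] for u in item["urls"] if u in page_data]
        merged.append({
            **item,
            "head_list": [p["head"] for p in pages],
            "image_list": [p["image"] for p in pages],
        })
    return merged


# Hypothetical sample data: two items share one URL, and one URL is dead.
items = [
    {"id": 1, "prio": 1, "urls": ["https://example.com/a", "https://example.com/b"]},
    {"id": 2, "prio": 2, "urls": ["https://example.com/b", "https://example.com/dead"]},
]
page_data = {
    "https://example.com/a": {"head": "A", "image": "a.png"},
    "https://example.com/b": {"head": "B", "image": "b.png"},
}
to_fetch = unique_urls(items)
result = merge_page_data(items, page_data)
```

With this split the default duplicate filter can stay enabled, since no item depends on its own copy of a request.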
Answer from a uj5u.com user:
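To combine the information from the main website with the information picked from the individual clinic websites, you can do the following (EDIT: added custom_settings, plus a fallback request to "google.com" for doctors without a website; it now yields 56 of 63 results, so further debugging is needed):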
import scrapy
from german_medical.items import GermanMedicalItem


class DoctorsSpider(scrapy.Spider):
    name = 'doctors'
    custom_settings = {
        # BaseDupeFilter filters nothing, so a URL shared by several doctors is requested once per doctor.
        'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
    }
    allowed_domains = []
    start_urls = ['https://www.arztsuche-bw.de/index.php?suchen=1&offset=0&id_z_arzt_praxis=0&id_fachgruppe=441&id_zusatzbezeichnung=0&id_genehmigung=0&id_dmp=0&id_zusatzvertraege=0&id_sprache=0&vorname=&nachname=ohne Titel (Dr.)&arztgruppe=facharzt&geschlecht=alle&wochentag=alle&zeiten=alle&fa_name=&plz=761&ort=&strasse=&schluesselnr=&schluesseltyp=lanr7&landkreis=&id_leistungsort_art=0&id_praxis_zusatz=0&sorting=name&direction=ASC&checkbox_content=&name_schnellsuche=&fachgebiet_schnellsuche=']
    offset = 20  # offset of the next result page to request

    def parse(self, response):
        doctor_cards = response.xpath('//ul[contains(@class, "resultlist")]/li[contains(@class, "resultrow")]')
        for d in doctor_cards:
            # NOTE: the attribute predicates inside "[@]" were lost in the source text.
            # Fill them in from the page markup before running (the question uses dd.name and dd.adresse).
            full_name = ' '.join(d.xpath('.//dd[@]/dl/dt/text()').extract())
            address = ', '.join(d.xpath('.//dd[@]/p[@]/text()').extract()[1:])
            urls = [x for x in d.xpath('.//dd[@]/p[@]/following-sibling::dl//a/@href').extract() if 'mailto:' not in x]
            resp_meta = {
                'full_name': full_name,
                'address': address,
                'urls': urls
            }
            if not urls:
                # No website listed: request a placeholder page so that an item is still produced.
                urls = ['https://google.com']
            for url in urls:
                print(url)
                yield response.follow(url=url, callback=self.parse_doctor_clinik, meta=resp_meta)

        next_page = 'https://www.arztsuche-bw.de/index.php?suchen=1&offset=' + str(self.offset) + '&id_z_arzt_praxis=0&id_fachgruppe=441&id_zusatzbezeichnung=0&id_genehmigung=0&id_dmp=0&id_zusatzvertraege=0&id_sprache=0&vorname=&nachname=ohne Titel (Dr.)&arztgruppe=facharzt&geschlecht=alle&wochentag=alle&zeiten=alle&fa_name=&plz=761&ort=&strasse=&schluesselnr=&schluesseltyp=lanr7&landkreis=&id_leistungsort_art=0&id_praxis_zusatz=0&sorting=name&direction=ASC&checkbox_content=&name_schnellsuche=&fachgebiet_schnellsuche='
        print(next_page)
        if self.offset < 80:
            # Advance the offset for the following page, then request this one.
            self.offset += 20
            yield response.follow(next_page, callback=self.parse)

    def parse_doctor_clinik(self, response):
        items = GermanMedicalItem()
        try:
            # .get() returns None when nothing matches.
            website_header = response.xpath('//h1/text()').get()
            logo_url = response.xpath('//img/@src').get()
        except Exception:
            # E.g. the response is not an HTML/text document.
            website_header = 'Not specified'
            logo_url = 'Not specified'
        items['full_name'] = response.request.meta['full_name']
        items['address'] = response.request.meta['address']
        items['office_urls'] = response.request.meta['urls']
        items['website_header'] = website_header
        items['logo_url'] = logo_url
        yield items
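The next_page string in the spider repeats the whole query just to change offset. As an alternative, the parameter can be rewritten with the standard library. This is only a sketch: the helper name with_offset is made up, and the URL below is a shortened stand-in for the full query used above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_offset(url, offset):
    """Return url with its "offset" query parameter set to the given value."""
    parts = urlsplit(url)
    # keep_blank_values keeps empty parameters such as "vorname=".
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query["offset"] = str(offset)
    # Note: urlencode percent-encodes values and writes spaces as "+".
    return urlunsplit(parts._replace(query=urlencode(query)))


# Shortened, hypothetical version of the search URL.
base = "https://www.arztsuche-bw.de/index.php?suchen=1&offset=0&plz=761&direction=ASC"
pages = [with_offset(base, n) for n in range(0, 80, 20)]
```

Here pages holds the URLs for offsets 0, 20, 40 and 60, which are the result pages the spider above ends up requesting.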
Your items.py file should look like this:
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class GermanMedicalItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    full_name = scrapy.Field()
    office_urls = scrapy.Field()
    address = scrapy.Field()
    website_header = scrapy.Field()
    logo_url = scrapy.Field()
Run scrapy crawl doctors -o doctors_germ.json and you will get a JSON file like this:
[
{"full_name": "Dr. med. Jan Gestrich Sprechstundenzeiten ", "address": "Zeppelinstr. 2, 76185 Karlsruhe, Ortsteil: Gr\u00fcnwinkel, Landkreis: Karlsruhe - Stadt", "office_urls": ["https://www.ka-nephrologie.de"], "website_header": "Diagnostik und Therapie in unserer Nephrologischen Praxis", "logo_url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAAC0lEQVQYV2NgAAIAAAUAAarVyFEAAAAASUVORK5CYII="},
{"full_name": "Dr. med. Martin Andre Sprechstundenzeiten ", "address": "S\u00fcdendstr. 47-49, 76137 Karlsruhe, Ortsteil: S\u00fcdweststadt, Landkreis: Karlsruhe - Stadt", "office_urls": ["https://www.nephrologie-karlsruhe.de"], "website_header": null, "logo_url": "https://static.wixstatic.com/media/689a07_b6517c8c92574851a08a4b37c9a23142~mv2.jpg/v1/fill/w_101,h_72,al_c,q_80,usm_0.66_1.00_0.01,enc_auto/Logo_Nephro_neu.jpg"},
{"full_name": "Dr. med. Kathrin Drognitz Sprechstundenzeiten ", "address": "Moltkestr. 90, 76133 Karlsruhe, Ortsteil: Nordstadt, Landkreis: Karlsruhe - Stadt", "office_urls": ["https://www.klinikum-karlsruhe.de/einrichtungen/spezielle-medizinische-einrichtungen/"], "website_header": "Spezielle medizinische Einrichtungen", "logo_url": "data:image/svg+xml;charset=utf-8,"},
{"full_name": "Dr. med. Thorsten Dorn Sprechstundenzeiten ", "address": "Kriegsstr. 140, 76133 Karlsruhe, Ortsteil: Innenstadt-West, Landkreis: Karlsruhe - Stadt", "office_urls": ["https://www.hormone-karlsruhe.de"], "website_header": null, "logo_url": "/templates/web_joomla_neu/images/spacer.gif"},
{"full_name": "Dr. med. Wilhelm Hausch Sprechstundenzeiten ", "address": "Lammstr. 21, 76133 Karlsruhe, Ortsteil: Innenstadt-West, Landkreis: Karlsruhe - Stadt", "office_urls": ["https://www.gastroenterologie-karlsruhe.de"], "website_header": "Herzlich Willkommen in der Praxis f\u00fcr Gastroenterologie am Ettlinger Tor.", "logo_url": "/assets/asset.babb34fd.png"},
{"full_name": "Dr. med. Norbert Bruhn Sprechstundenzeiten ", "address": "Gartenstr. 71, 76135 Karlsruhe, Ortsteil: S\u00fcdweststadt, Landkreis: Karlsruhe - Stadt", "office_urls": ["https://www.praxis-bruhn.com"], "website_header": null, "logo_url": "https://www.praxis-bruhn.com/s/img/emotionheader7307447.jpg?1472391703.667px.483px"},
{"full_name": "Dr. med. Kurt Beier Sprechstundenzeiten ", "address": "Ludwig-Erhard-Allee 24, 76131 Karlsruhe, Ortsteil: Innenstadt-Ost, Landkreis: Karlsruhe - Stadt", "office_urls": ["https://www.deRossi.de", "https://www.medGAIN.de"], "website_header": "\r\n\t\t\t\t\r\n\t\t\t\t\tmedGAIN | Praxis Dr. med. Thomas de Rossi und Kollegen\r\n\t\t\t\t\r\n\t\t\t\t", "logo_url": "img/med_gain_logo.svg"},
{"full_name": "Dr. med. Kai Haberl Sprechstundenzeiten ", "address": "Waldstra\u00dfe 41-43, 76133 Karlsruhe, Ortsteil: Innenstadt-West, Landkreis: Karlsruhe - Stadt", "office_urls": ["https://www.kardiologie-waldstrasse.de"], "website_header": " Unser Team hei\u00dft Sie herzlich willkommen! ", "logo_url": "images/logo_kardiologie_karlsruhe.svg"},
{"full_name": "Dr. med. Lutz Krieglstein Sprechstundenzeiten ", "address": "Hans-Sachs-Str. 1, 76133 Karlsruhe, Ortsteil: Weststadt, Landkreis: Karlsruhe - Stadt", "office_urls": ["https://www.praxis-muehlburger-tor.de"], "website_header": "Gastroenterologische Gemeinschaftspraxis in Karlsruhe", "logo_url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAAC0lEQVQYV2NgAAIAAAUAAarVyFEAAAAASUVORK5CYII="},
{"full_name": "Dr. med. Mirko Krivokuca Sprechstundenzeiten ", "address": "Kaiserallee 30, 76185 Karlsruhe, Ortsteil: Weststadt, Landkreis: Karlsruhe - Stadt", "office_urls": ["https://www.kardiologie-musikerviertel.de"], "website_header": "Fieber\n?\u00a0\u00a0\u00a0 Husten?\u00a0\u00a0\u00a0 Atemwegsinfekt?", "logo_url": "https://image.jimcdn.com/app/cms/image/transf/none/path/sb3d393a4e68b5222/image/i855f937e8779839c/version/1608138272/image.jpg"},
....
]
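Several logo_url values in this output are relative paths (for example "/assets/asset.babb34fd.png"), which are not usable on their own. A minimal sketch of resolving them against the page URL with the standard library follows; the helper name absolute_logo_url is made up, and using the office URL as the base is an assumption, since the page actually served may differ after redirects. Inside a Scrapy callback, response.urljoin(...) does the same against response.url.

```python
from urllib.parse import urljoin


def absolute_logo_url(page_url, logo_url):
    """Resolve a possibly relative logo URL against the page it was found on."""
    if logo_url is None:
        return None
    # Absolute URLs and data: URIs come back unchanged.
    return urljoin(page_url, logo_url)


rooted = absolute_logo_url(
    "https://www.gastroenterologie-karlsruhe.de", "/assets/asset.babb34fd.png"
)
relative = absolute_logo_url(
    "https://www.kardiologie-waldstrasse.de", "images/logo_kardiologie_karlsruhe.svg"
)
```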
