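I am using Scrapy to scrape my website into 4 columns (stock quantity / name / price / URL). I would like the output file to be sorted alphabetically by the name column. I could open the CSV and sort it by hand, but surely some wizard knows a way to do this in the script?
Code: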
import scrapy
from scrapy.crawler import CrawlerProcess
import csv

# Open the output CSV once and write the header row
cs = open('results/2x2_results.csv', 'w', newline="", encoding='utf-8')
header_names = ['stk', 'name', 'price', 'url']
csv_writer = csv.DictWriter(cs, fieldnames=header_names)
csv_writer.writeheader()

class SCXX(scrapy.Spider):
    name = 'SCXX'
    start_urls = [
        'https://website.com'
    ]

    def parse(self, response):
        # Follow every product link on the listing page
        product_urls = response.css('div.grid-uniform a.product-grid-item::attr(href)').extract()
        for product_url in product_urls:
            yield scrapy.Request(url='https://website.com' + product_url, callback=self.next_parse_two)
        # Follow the pagination link, if there is one
        next_url = response.css('ul.pagination-custom li a[title="Next ?"]::attr(href)').get()
        if next_url is not None:
            yield scrapy.Request(url='https://website.com' + next_url, callback=self.parse)

    def next_parse_two(self, response):
        # Extract one product and write it to the CSV immediately
        item = dict()
        item['stk'] = response.css('script#swym-snippet::text').get().split('stk:')[1].split(',')[0]
        item['name'] = response.css('h1.h2::text').get()
        item['price'] = response.css('span#productPrice-product-template span.visually-hidden::text').get()
        item['url'] = response.url
        csv_writer.writerow(item)
        cs.flush()

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(SCXX)
process.start()
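Answer from a uj5u.com user:
Solution
Scrapy works asynchronously, so requests are not processed in order. Picture a group of workers: some come back with apples, some with bananas, some with oranges. How would you sort them? You could tell each worker to put every fruit into its proper place in the basket as it arrives (this is what we call insertion sort), but in programming that is cumbersome too. I suggest simply collecting the data and calling sort() on it afterwards.
The data is not written in any particular order. Every request is fired off at once and each result is written as soon as it arrives. What you can do is run a post-scrape step that sorts the output at the end. This is probably the best approach.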
import json

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition (its callbacks must yield the items so the feed export can write them)
    ...

process = CrawlerProcess(settings={
    "FEEDS": {
        "items.json": {"format": "json"},
    },
})
process.crawl(MySpider)
process.start()  # the script will block here until the crawling is finished

# Load the JSON list of items, sort it, and write it out again.
with open('items.json') as file:
    data = json.load(file)
# A plain data.sort() fails on a list of dicts, so sort by a specific key.
# 'name' is assumed here because it is the field the question wants sorted.
data.sort(key=lambda item: item['name'] or '')
with open('output.json', 'w') as outfile:
    json.dump(data, outfile)  # write the sorted items to a file
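Since the question writes a CSV rather than JSON, the same post-scrape idea can be sketched for a CSV file. This is only a minimal sketch: the function name sort_csv_by_column and its parameters are made up for illustration, and it assumes the file has a header row that contains the column to sort by (here 'name', as in the question).

```python
import csv

def sort_csv_by_column(src_path, dst_path, column="name"):
    """Read a CSV with a header row, sort its rows by one column, write the result."""
    with open(src_path, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames or []
        rows = list(reader)
    # casefold() makes the ordering case-insensitive; empty values sort first
    rows.sort(key=lambda row: (row.get(column) or "").casefold())
    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

It would be called after process.start() has returned and the spider's CSV file has been closed, for example sort_csv_by_column('results/2x2_results.csv', 'results/2x2_results_sorted.csv'), where the second path is a hypothetical output name.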
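Additional notes
It would be better to write the data to an in-memory stream using the io library, but I am guessing you do not know how to do that, which is why it is easier to write it to a file and then operate on that file.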
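For reference, the in-memory variant might look roughly like the sketch below. It assumes the scraped items have been collected in a plain Python list of dicts instead of being written row by row; the helper name items_to_sorted_csv is hypothetical, and the field names are the ones from the question.

```python
import csv
import io

def items_to_sorted_csv(items, fieldnames, sort_key="name"):
    """Return CSV text for the items, sorted by one field, built in an in-memory buffer."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    # Sort case-insensitively; items whose value is missing or None sort first
    ordered = sorted(items, key=lambda item: (item.get(sort_key) or "").casefold())
    writer.writerows(ordered)
    return buffer.getvalue()
```

The returned text can then be written to disk in one go once the crawl has finished, so the file is only ever written in sorted order.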
Let me know if you have any questions.
