How to remove a suffix from scraped links?

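I am looking for a way to get full-size images from a website.

Using code that I recently finished with someone's help on Stack Overflow, I can download both full-size and reduced-size images.

What I want is for every downloaded image to be full-size.

For example, some image filenames have "-625x417.jpg" as a suffix, and some do not.

https://www.bikeexif.com/1968-harley-davidson-shovelhead (with suffix)
https://www.bikeexif.com/harley-panhead-walt-siegl (without suffix)

If this suffix is removed, the URL points to the full-size image.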

https://kickstart.bikeexif.com/wp-content/uploads/2018/01/1968-harley-davidson-shovelhead-625x417.jpg (as scraped)
https://kickstart.bikeexif.com/wp-content/uploads/2018/01/1968-harley-davidson-shovelhead.jpg (filename of the full-size image, with -625x417 removed)

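Since the filenames may carry other image resolutions, suffixes with different sizes need to be removed as well.

I think I may need a regular expression to filter out '-3digit x 3digit' from the URLs collected below.

But I really don't know how to do that.

If you can, please help me with this. Thanks!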

images_url = selector_article.xpath('//div[@id="content"]//img/@src').getall() + \
             selector_article.xpath('//div[@id="content"]//img/@data-src').getall()

Full code:

import requests
import parsel
import os

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}

for page in range(1, 310):
    print(f'======= Scraping data from page {page} =======')

    url = f'https://www.bikeexif.com/page/{page}'

    response = requests.get(url, headers=headers)
    selector = parsel.Selector(response.text)

    containers = selector.xpath('//div[@]/div/article[@]')

    for v in containers:

        old_title = v.xpath('.//div[2]/h2/a/text()').get()
        
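        if old_title is None:
            continue  # skip containers that have no title link

        # Replace characters that are awkward in directory names
        title = old_title.replace(':', ' -').replace('?', '')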

        title_url = v.xpath('.//div[2]/h2/a/@href').get()
        print(title, title_url)

        os.makedirs( os.path.join('bikeexif', title), exist_ok=True )

        response_article = requests.get(url=title_url, headers=headers)
        selector_article = parsel.Selector(response_article.text)

        # Need to get full-size images only
        # (* remove if suffix exist, such as -625x417, if different size of suffix exist, also need to remove)
        images_url = selector_article.xpath('//div[@id="content"]//img/@src').getall() + \
                     selector_article.xpath('//div[@id="content"]//img/@data-src').getall()
        print('len(images_url):', len(images_url))

        for img_url in images_url:

            response_image = requests.get(url=img_url, headers=headers)

            filename = img_url.split('/')[-1]
            
            with open( os.path.join('bikeexif', title, filename), 'wb') as f:
                f.write(response_image.content)
                print('Download complete!!:', filename)

Answer from a uj5u.com user:

I would use something like this:

import re

url = 'https://kickstart.bikeexif.com/wp-content/uploads/2018/01/1968-harley-davidson-shovelhead-625x417.jpg'

new_url = re.sub(r'(.*)-\d+x\d+(\.jpg)', r'\1\2', url)
#https://kickstart.bikeexif.com/wp-content/uploads/2018/01/1968-harley-davidson-shovelhead.jpg

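Explanation:

The regular expression has three parts. (.*) means any run of characters of any length, and the parentheses group it together.
-\d+x\d+ means a dash, followed by one or more digits, followed by x, followed by one or more digits.
The last part is simply .jpg. We write \. because . is a special character in regular expressions (it matches any single character), so the backslash escapes it to mean a literal dot.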

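In the second argument of re.sub, \1\2 means "whatever the first pair of parentheses in the first argument captured" followed by "whatever the second pair of parentheses captured".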
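To make the group numbering concrete, here is a minimal sketch that inspects the two captured groups before substituting. It assumes the pattern uses \d+ (one or more digits) on each side of the x. The names pattern, prefix, and extension are illustrative only.

```python
import re

url = 'https://kickstart.bikeexif.com/wp-content/uploads/2018/01/1968-harley-davidson-shovelhead-625x417.jpg'

# Group 1 captures everything before the size suffix, group 2 the extension.
pattern = re.compile(r'(.*)-\d+x\d+(\.jpg)')

match = pattern.search(url)
prefix = match.group(1)
extension = match.group(2)

# r'\1\2' joins the two groups again, dropping the text that sat between them.
new_url = pattern.sub(r'\1\2', url)
```

Here new_url is simply prefix followed by extension, which is why the -625x417 part disappears.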

Finally, the last argument is simply the string you want to process.
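One way this could be applied to the whole images_url list from the question is sketched below. The helper names strip_size_suffix and full_size_urls are hypothetical, not part of the original answer. The pattern is a variation that assumes the size suffix always sits directly before the file extension at the very end of the URL (no query string), and it accepts any extension, not just .jpg.

```python
import re

# Matches a "-<width>x<height>" suffix only when it is directly followed
# by the file extension at the end of the string.
SIZE_SUFFIX = re.compile(r'-\d+x\d+(?=\.[A-Za-z0-9]+$)')


def strip_size_suffix(url):
    """Return url without a trailing -WxH suffix; other URLs pass through unchanged."""
    return SIZE_SUFFIX.sub('', url)


def full_size_urls(urls):
    """Strip size suffixes and drop duplicates while keeping the original order."""
    return list(dict.fromkeys(strip_size_suffix(u) for u in urls))
```

In the question's script this would be called as images_url = full_size_urls(images_url) just before the download loop. It also assumes the site actually serves a file at the suffix-less URL, so checking response_image.status_code before writing the file is probably worthwhile.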
