How to extract some URLs from HTML?

I need to extract all image links from a local HTML file. Unfortunately, I can't install bs4 or cssutils to process the HTML.

html = """<img src="https://s2.example.com/path/image0.jpg?lastmod=1625296911"><br>
<div><a style="background-image:url(https://s2.example.com/path/image1.jpg?lastmod=1625296911)"</a><a style="background-image:url(https://s2.example.com/path/image2.jpg?lastmod=1625296912)"></a><a style="background-image:url(https://s2.example.com/path/image3.jpg?lastmod=1625296912)"></a></div>"""

I tried to extract the data with a regular expression:

import re

images = []
for line in html.split('\n'):
    images.append(re.findall(r'(https://s2.*\?lastmod=\d+)', line))
print(images)

[['https://s2.example.com/path/image0.jpg?lastmod=1625296911'],
 ['https://s2.example.com/path/image1.jpg?lastmod=1625296911)"</a><a style="background-image:url(https://s2.example.com/path/image2.jpg?lastmod=1625296912)"></a><a style="background-image:url(https://s2.example.com/path/image3.jpg?lastmod=1625296912']]

I think my regular expression is greedy because I used .* in it. How can I get the following result?

images = ['https://s2.example.com/path/image0.jpg',
          'https://s2.example.com/path/image1.jpg',
          'https://s2.example.com/path/image2.jpg',
          'https://s2.example.com/path/image3.jpg']

If it helps, all the links are enclosed in src="..." or url(...).

Thanks for your help.

A uj5u.com user replied:

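import re

# Start just past each opening marker: src=" is 5 characters, url( is 4.
indices_start = sorted(
    [m.start() + 5 for m in re.finditer(r'src="', html)] +
    [m.start() + 4 for m in re.finditer(r'url\(', html)])
# End right after each ".jpg"; the dot is escaped so it matches literally.
indices_end = [m.end() for m in re.finditer(r'\.jpg', html)]

image_list = []

for start, end in zip(indices_start, indices_end):
    image_list.append(html[start:end])

print(image_list)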

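Here is one solution I came up with. It finds the start and end indices of each image path string. It would obviously need adjusting if other image types are present.

Edit: changed the start condition in case the file contains other URLs.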
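As a rough sketch of the adjustment mentioned above, the index-based idea can be extended to several image types by looking, for each start position, for the nearest known extension after it. The `EXTENSIONS` tuple, the `extract_by_index` helper, and the `.png` URL in the sample are assumptions made for illustration; they do not come from the original question.

```python
import re

html = '''<img src="https://s2.example.com/path/image0.jpg?lastmod=1625296911"><br>
<div><a style="background-image:url(https://s2.example.com/path/image1.png?lastmod=1625296911)"></a></div>'''

# Hypothetical list of extensions; extend it to match the files you expect.
EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")

def extract_by_index(text, extensions=EXTENSIONS):
    # Positions just past each opening marker, src=" or url(.
    starts = sorted(m.end() for m in re.finditer(r'src="|url\(', text))
    found = []
    for start in starts:
        # End position of every known extension that occurs after this start.
        ends = []
        for ext in extensions:
            pos = text.find(ext, start)
            if pos != -1:
                ends.append(pos + len(ext))
        if ends:
            # The nearest extension is assumed to close the current URL.
            found.append(text[start:min(ends)])
    return found

image_list = extract_by_index(html)
print(image_list)
```

Unlike pairing two lists with zip, this looks up an end for each start separately, so one marker without a matching extension does not shift every later pair. It is still fragile: a src or url( that points at something other than an image will swallow text up to the next image extension.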

A uj5u.com user replied:

You can use

import re
html = """<img src="https://s2.example.com/path/image0.jpg?lastmod=1625296911"><br>
<div><a style="background-image:url(https://s2.example.com/path/image1.jpg?lastmod=1625296911)"</a><a style="background-image:url(https://s2.example.com/path/image2.jpg?lastmod=1625296912)"></a><a style="background-image:url(https://s2.example.com/path/image3.jpg?lastmod=1625296912)"></a></div>"""
images = re.findall(r'https://s2[^\s?]*(?=\?lastmod=\d)', html)
print(images)

Output:

['https://s2.example.com/path/image0.jpg',
 'https://s2.example.com/path/image1.jpg',
 'https://s2.example.com/path/image2.jpg', 
 'https://s2.example.com/path/image3.jpg']

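The pattern means:

https://s2 - the literal text https://s2
[^\s?]* - zero or more characters other than whitespace and ?
(?=\?lastmod=\d) - immediately to the right there must be ?lastmod= followed by a digit (this text is not added to the match, because it sits inside a positive lookahead, a non-consuming pattern).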
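To compare the lookahead pattern with two possible alternatives, here is a small sketch. The first variant keeps the structure of the question's pattern but makes the quantifier lazy; the third relies on the hint that every link sits inside src="..." or url(...). The variable names `lazy`, `lookahead`, and `anchored` are only illustrative.

```python
import re

html = """<img src="https://s2.example.com/path/image0.jpg?lastmod=1625296911"><br>
<div><a style="background-image:url(https://s2.example.com/path/image1.jpg?lastmod=1625296911)"></a><a style="background-image:url(https://s2.example.com/path/image2.jpg?lastmod=1625296912)"></a></div>"""

# Lazy quantifier: .*? stops at the first "?lastmod=" rather than the last one.
# The group keeps the query string out of the returned value.
lazy = re.findall(r'(https://s2.*?)\?lastmod=\d+', html)

# Negated character class plus lookahead, as explained above.
lookahead = re.findall(r'https://s2[^\s?]*(?=\?lastmod=\d)', html)

# Anchor on the surrounding src=" or url( and stop at the first
# quote, question mark, closing parenthesis, or whitespace.
anchored = re.findall(r'(?:src="|url\()([^"?)\s]+)', html)

print(lazy)
print(lookahead)
print(anchored)
```

On this sample all three give the same list. The anchored variant does not depend on the host name or on the lastmod parameter, so it may be the better fit if other URLs in the file should be picked up as well.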

A uj5u.com user replied:

import re

xx = '<img src="https://s2.example.com/path/image0.jpg?lastmod=1625296911" alt="How to extract some URLs from HTML?"><img a src="https://s2.example.com/path/image0.jpg?lastmod=1625296911">'
# Collect every <img ...> tag first.
img_tags = re.findall(r"<img(?=\s|>)[^>]*>", xx)
urls = []
for tag in img_tags:
    # Find the src attribute; "?" is not in the character class, so the match stops before the query string.
    src = re.findall(r"src\s*=\s*['\"][\w:/.=]*", tag)
    if not src:
        continue
    # Keep only the URL part of the attribute.
    found = re.findall(r"https?[\w:/.=]*", src[0])
    if not found:
        continue
    urls.append(found[0])
print(urls)
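Since bs4 cannot be installed, another option is the standard library's html.parser module, which needs no installation. The following is a minimal sketch, assuming reasonably well-formed markup; the `ImageURLCollector` class and its `_strip_query` helper are names invented for this example.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlsplit, urlunsplit

class ImageURLCollector(HTMLParser):
    """Collects URLs from img src attributes and from url(...) in style attributes."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:
                continue
            if tag == "img" and name == "src":
                self.urls.append(self._strip_query(value))
            elif name == "style":
                # A style value may hold one or more url(...) references.
                for raw in re.findall(r'url\(\s*[\'"]?([^\'")]+)', value):
                    self.urls.append(self._strip_query(raw.strip()))

    @staticmethod
    def _strip_query(url):
        # Drop the query string and fragment, keep scheme, host, and path.
        parts = urlsplit(url)
        return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

html = """<img src="https://s2.example.com/path/image0.jpg?lastmod=1625296911"><br>
<div><a style="background-image:url(https://s2.example.com/path/image1.jpg?lastmod=1625296911)"></a><a style="background-image:url(https://s2.example.com/path/image2.jpg?lastmod=1625296912)"></a></div>"""

collector = ImageURLCollector()
collector.feed(html)
collector.close()
print(collector.urls)
```

The parser takes care of locating tags and attributes, so the only regular expression left is the small one that pulls url(...) out of a style value.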


Tags: Python, HTML, regular expressions
