How do I extract the substring between a specific start and end sequence of characters?

I'm trying to clean up data scraped from a set of links. The CSV I'm cleaning contains more than 100 links.

This is what a link in the CSV looks like:

"https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"

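I've found that scraping this link for HTML data doesn't go well, so I have to get the URL embedded inside it. I want the substring that starts after &url= and ends at &ct, since that's where the real URL sits.

I've read posts like this one, but couldn't find one that also handles an ending string. I tried an approach that uses the substring package, but it doesn't work with delimiters longer than one character.

How can I do this? Preferably without third-party packages?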

Reply from a uj5u.com user:

I don't see the problem.

If you have a string, you can use the string method .find() and slicing [start:end]:

text = "https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"

# Skip past the marker itself so the slice begins at the embedded URL
start = text.find('url=') + len('url=')
end   = text.find('&ct=')

print(text[start:end])

But url= and ct= may appear in a different order, so it is better to search for the first & that comes after url=:

text = "https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"

# Skip past the marker, then stop at the first & that follows it
start = text.find('url=') + len('url=')
end   = text.find('&', start)

print(text[start:end])
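One thing the snippets above do not cover is a link that lacks the marker: .find() returns -1 in that case, and the slice then silently yields the wrong text. A minimal sketch of a reusable helper follows; the name substring_between and the shortened example.com link are illustrative assumptions, not part of the original answer.

```python
def substring_between(text, start_marker, end_marker):
    """Return the text between start_marker and the next end_marker.

    Returns None when start_marker is absent. When end_marker does not
    occur after start_marker, returns everything up to the end of text.
    """
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)          # move past the marker itself
    end = text.find(end_marker, start)  # search only after the start
    if end == -1:
        return text[start:]
    return text[start:end]


# Hypothetical, shortened link used only for illustration
link = "https://www.google.com/url?rct=j&sa=t&url=https://www.example.com/news/1&ct=ga&cd=abc"
print(substring_between(link, "&url=", "&"))
```

Note that this still cuts the result short if the embedded URL itself contains an unescaped &.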

EDIT:

There is also the standard module urllib.parse for working with URLs, which can split them apart or join them together.

text = "https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"

import urllib.parse

url, query = urllib.parse.splitquery(text)
data       = urllib.parse.parse_qs(query)

data['url'][0]

data now holds a dictionary:

{'cd': ['SldisGkopisopiasenjA6Y28Ug'],
 'ct': ['ga'],
 'rct': ['j'],
 'sa': ['t'],
 'url': ['https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428'],
 'usg': ['AFQjaskdfYJkasKugowe896fsdgfsweF']}
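As the dictionary shows, parse_qs() maps every key to a list of values, which is why the code indexes with [0]. It also decodes percent-encoded characters, which helps when the embedded URL is escaped. A small sketch, using a made-up query string as an assumption:

```python
from urllib.parse import parse_qs

# Hypothetical query string in which the embedded URL is percent-encoded
query = "sa=t&url=https%3A%2F%2Fwww.example.com%2Fnews%3Fid%3D1%26lang%3Den&ct=ga"

data = parse_qs(query)

# .get() with a default avoids a KeyError when the 'url' key is missing
target = data.get("url", [None])[0]
print(target)
```

Because the inner & is encoded as %26 here, it stays part of the url value instead of being treated as a parameter separator.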

EDIT:

Python shows a warning that splitquery() is deprecated as of 3.8 and that code should use urlparse() instead:

text = "https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"

import urllib.parse

parts = urllib.parse.urlparse(text)
data  = urllib.parse.parse_qs(parts.query)

data['url'][0]
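Since the question involves a CSV with more than 100 links, the urlparse() approach can be wrapped in a function and applied row by row. The sketch below assumes one link per row in the first column and no header row; the function name extract_target_url and the in-memory sample data are illustrative stand-ins for the real file.

```python
import csv
import io
from urllib.parse import urlparse, parse_qs


def extract_target_url(link):
    """Return the value of the 'url' query parameter, or None if absent."""
    query = urlparse(link).query
    values = parse_qs(query).get("url")
    return values[0] if values else None


# Stand-in for the real CSV file (assumed layout: one link per row)
sample_csv = io.StringIO(
    '"https://www.google.com/url?rct=j&sa=t&url=https://www.example.com/a&ct=ga"\n'
    '"https://www.google.com/url?rct=j&sa=t&url=https://www.example.com/b&ct=ga"\n'
    '"https://www.example.com/no-redirect"\n'
)

# Skip empty rows; links without a 'url' parameter come back as None
cleaned = [extract_target_url(row[0]) for row in csv.reader(sample_csv) if row]
print(cleaned)
```

For a file on disk, sample_csv could be replaced with a file object opened with newline="" as the csv module documentation recommends.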
