I have a file named urls.tmp that contains 3 URLs:
https://site1.com.br/wp-content/uploads/2020/06/?SD
https://site2.com.br/wp-content/uploads/tp-datademo/home-4/data/tp-hotel-booking/?SD
https://site3.com.br/wp-content/uploads/revslider/hotel-home/?MD
I want to remove everything after "com.br/" in each URL.
I tried this code:
# open the file
sys.stdout = open("urls.tmp", "w")
# start remove
for i in "urls.tmp":
    url_parts = urllib.parse.urlparse(i)
    result = '{uri.scheme}://{uri.netloc}/'.format(uri=url_parts)
    print(result) #overwrite the file
# close the file
sys.stdout.close()
But the output gave me this strange result:
:///
:///
:///
:///
:///
:///
:///
:///
I'm a beginner. What am I doing wrong?
Reply from a uj5u.com user:
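You are iterating over the string "urls.tmp" itself, character by character, when what you want is to go through the opened file object line by line.
So try this: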
import urllib.parse

# iterate over the lines of the file, not over the file name
with open("urls.tmp", "r") as urls_file:
    for line in urls_file:
        url_parts = urllib.parse.urlparse(line)
        result = "{uri.scheme}://{uri.netloc}/".format(uri=url_parts)
        print(result)
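To see why the original loop printed `:///`, here is a minimal sketch that runs the same formatting over a string and over a list of URLs. The helper name `base_urls` is made up for illustration and is not part of the question. Note also that the question's code opens urls.tmp in "w" mode before reading anything, which empties the file.

```python
import urllib.parse

def base_urls(items):
    """Format scheme://netloc/ for each item, as the question's loop did."""
    results = []
    for item in items:
        parts = urllib.parse.urlparse(item)
        results.append("{uri.scheme}://{uri.netloc}/".format(uri=parts))
    return results

# Iterating over a string yields its characters: 'u', 'r', 'l', 's', '.', 't', 'm', 'p'.
# None of them has a scheme or a host, so each one formats to ":///".
from_string = base_urls("urls.tmp")

# Iterating over a list of URLs yields whole URLs.
from_urls = base_urls(["https://site1.com.br/wp-content/uploads/2020/06/?SD"])

print(from_string)
print(from_urls)
```

The eight `:///` lines in the question correspond to the eight characters of the file name.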
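Edit: the author updated the original question to say that the source file should be overwritten with the processed URLs. Here is an example: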
import urllib.parse

new_urls = []
# read all lines first, so the file can be reopened for writing afterwards
with open("urls.tmp", "r") as urls_file:
    old_urls = urls_file.readlines()
for line in old_urls:
    url_parts = urllib.parse.urlparse(line)
    proc_url = "{uri.scheme}://{uri.netloc}/\n".format(uri=url_parts)
    new_urls.append(proc_url)
# overwrite the file with the processed URLs
with open("urls.tmp", "w") as urls_file:
    urls_file.writelines(new_urls)
Reply from a uj5u.com user:
See Savva Surenkov's answer for the fix to your problem.
As an alternative, you can use the string split method, for example:
url = r"https://site1.com.br/wp-content/uploads/2020/06/?SD"
split_by = "com.br/"
new_url = url.split(split_by)[0] + split_by
# this gives you the part before <split_by> and then we attach <split_by> again
assert new_url == r"https://site1.com.br/"
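If you want to add some extra checks, you could look into regular expressions.
Something you did not ask about, but that may help you as a beginner: I recommend using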
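As one possible form of such a check, the sketch below only accepts URLs whose host itself ends in ".com.br". The pattern and the helper name `trim_to_com_br` are assumptions chosen for illustration, not something from the answers above.

```python
import re

# Hypothetical pattern: http or https scheme, a host ending in ".com.br",
# then the first slash after the host.
COM_BR_PREFIX = re.compile(r"^(https?://[^/\s]+\.com\.br/)")

def trim_to_com_br(url):
    """Return the URL cut right after 'com.br/', or None if it does not match."""
    match = COM_BR_PREFIX.match(url.strip())
    return match.group(1) if match else None

print(trim_to_com_br("https://site1.com.br/wp-content/uploads/2020/06/?SD"))
print(trim_to_com_br("https://example.org/com.br/page"))
```

Unlike a plain split, this rejects a URL where "com.br/" only appears in the path, at the cost of a stricter idea of what counts as a valid input.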
with open("urls.tmp", "w") as f:
    # do something with f
    pass
or
import pathlib
urls = pathlib.Path("urls.tmp").read_text()
# which gives you all lines in single string
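instead of a plain open call. If you want to learn more, I suggest reading about context managers.
f-strings are available from Python 3.6 onward, and I think they are more readable than "{}".format.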
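Putting these suggestions together, a minimal sketch of the file rewrite using pathlib and an f-string might look like the following. The function name `rewrite_urls` is hypothetical, and skipping blank lines is an assumption about the desired behavior.

```python
import pathlib
import urllib.parse

def rewrite_urls(path):
    """Overwrite the file at `path` with scheme://host/ for each URL line."""
    path = pathlib.Path(path)
    trimmed = []
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are dropped
        parts = urllib.parse.urlparse(line)
        # f-string equivalent of "{uri.scheme}://{uri.netloc}/".format(uri=parts)
        trimmed.append(f"{parts.scheme}://{parts.netloc}/")
    path.write_text("\n".join(trimmed) + "\n")
    return trimmed

# Example call (would modify the file in place):
# rewrite_urls("urls.tmp")
```

Because read_text and write_text open and close the file internally, no explicit with block is needed here.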
Reply from a uj5u.com user:
You can also use the string find() method.
urllist = [
    'https://site1.com.br/wp-content/uploads/2020/06/?SD',
    'https://site2.com.br/wp-content/uploads/tp-datademo/home-4/data/tp-hotel-booking/?SD',
    'https://site3.com.br/wp-content/uploads/revslider/hotel-home/?MD']
newlist = []
breaktext = 'com.br/'
for item in urllist:
    position = item.find(breaktext)
    # keep everything up to and including breaktext
    newlist.append(item[:position + len(breaktext)])
print(newlist)
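One thing to watch with find(): it returns -1 when the text is not present, so the slice above would then cut at index 6 and leave only "https:". A small sketch that guards against this follows; the helper name `cut_after` and the choice to return the URL unchanged are assumptions.

```python
def cut_after(url, marker="com.br/"):
    """Cut `url` right after the first `marker`; leave it unchanged if absent."""
    position = url.find(marker)
    if position == -1:
        # marker not found: return the input as-is instead of slicing wrongly
        return url
    return url[:position + len(marker)]

print(cut_after("https://site2.com.br/wp-content/uploads/tp-datademo/home-4/data/tp-hotel-booking/?SD"))
print(cut_after("https://example.org/page"))
```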