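I've just started using a Python routine to scrape links from a number of web pages based on the server name. It works, but the output is not in the expected format:
Expected output: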
https://www.someserver.com/files/1
https://www.someserver.com/files/2
https://www.someserver.com/files/3....
Actual output:
[None, '//server.org', '//server.org', '//server.org/recent', '//server.org/popular', '//server.org/trolls', 'https://server.org/software/', 'https://www.serverstore.com', '//server.org/submission', '//server.org/my/login', '//server.org/my/newuser', '//devices.server.org', '//build.server.org', '//entertainment.server.org', '//technology.server.org', '//server.org/?fhfilter=somefilter', '//science.server.org', '//yro.server.org', 'http://rss.server.org/server/serverMain', 'http://www.facebook.com/server', 'https://server.org', '#', '//server.org/blog', '#', '#', '#', '//server.org']
So how can I customize the links to get the expected format instead of //server.org, or how should I format the soup.findAll loop and the append?
Thanks very much.
Code
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re
req = Request("https://somepagewithlinks.com")
html_page = urlopen(req)
soup = BeautifulSoup(html_page, "lxml")
links = []
for link in soup.findAll('a'):
    links.append(link.get('href'))
print(links)
file = open("lk", "w")
lista = repr(links)
file.write(str(links))
file.close
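Update
Thanks, uingtea, but after changing the link handling it fails, and with
file.close
it displays
<built-in method close of _io.TextIOWrapper object at 0x7ffe8ec74b40>
and when I use file.close() instead, it generates an empty file. I know a list (links) has to be defined and then referenced as links.instruction(). What am I missing?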
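A note on the file.close symptom: without parentheses, file.close only refers to the method and never calls it, which is why the method object is displayed. Below is a minimal sketch of one way to sidestep this, assuming a hypothetical helper named write_links and made-up sample links (neither is from the original post). It uses a with block so the file is closed automatically, and writes one link per line rather than the repr of the whole list.

```python
import os
import tempfile


def write_links(links, path):
    """Write one link per line, skipping None entries (<a> tags without href)."""
    # The with block closes the file even if an error occurs inside it.
    with open(path, "w", encoding="utf-8") as fh:
        for link in links:
            if link is not None:
                fh.write(link + "\n")


# Hypothetical sample data, only for illustration.
sample_links = [
    None,
    "https://www.someserver.com/files/1",
    "https://www.someserver.com/files/2",
]

out_path = os.path.join(tempfile.gettempdir(), "lk_example.txt")
write_links(sample_links, out_path)

# Read the file back to check what was written.
with open(out_path, encoding="utf-8") as fh:
    print(fh.read(), end="")
```

Whether this also explains the empty file is not certain from the post alone; it only removes the need to call close by hand.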
Reply from a uj5u.com user:
Check what the string starts with:
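for link in soup.findAll('a'):
    link = link.get('href')
    if link is None:
        continue  # skip <a> tags that have no href attribute
    if link.startswith('//'):
        link = 'https:' + link  # protocol-relative link: prepend the scheme
    elif link.startswith('#'):
        link = 'https://domainname/' + link  # domainname is a placeholder
    links.append(link)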
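As an alternative to checking prefixes by hand, the standard library's urllib.parse.urljoin can resolve protocol-relative and site-relative links against the page URL. This is only a sketch, not the answerer's method: the helper name absolutize, the base URL and the sample hrefs (including /files/1) are assumptions for illustration. It skips None entries and fragment-only links such as '#', on the assumption that those are not wanted in the output.

```python
from urllib.parse import urljoin


def absolutize(hrefs, base_url):
    """Return absolute URLs for the given href values.

    None entries and fragment-only links (starting with '#') are skipped.
    """
    result = []
    for href in hrefs:
        if href is None or href.startswith("#"):
            continue
        # urljoin fills in the scheme and host from base_url where missing.
        result.append(urljoin(base_url, href))
    return result


# Hypothetical sample values, in the style of the output shown in the question.
sample = [
    None,
    "//server.org/recent",
    "#",
    "/files/1",
    "https://www.serverstore.com",
]

for url in absolutize(sample, "https://server.org/"):
    print(url)
```

In the original loop, the list passed in would be the link.get('href') values, and base_url would be the address of the page that was fetched.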
