I have an HTML page, and I want to extract the links from it and convert them to JSON. This is the link to the search page
Here is what I have tried.
from html.parser import HTMLParser


class HtmltoJsonParser(HTMLParser):
    def __init__(self, raise_exception=True):
        HTMLParser.__init__(self)
        # self.reset()
        self.doc = {}
        self.path = []
        self.cur = self.doc
        self.line = 0
        self.raise_exception = raise_exception

    @property
    def json(self):
        return self.doc

    @staticmethod
    def to_json(content, raise_exception=True):
        parser = HtmltoJsonParser(raise_exception=raise_exception)
        parser.feed(content)
        return parser.json

    def handle_starttag(self, tag, attrs):
        # Only parse the 'anchor' tag.
        if tag == "a":
            for name, link in attrs:
                if name == "href" and link.startswith("http"):
                    self.cur["" + name] = link
                    # print(link)
I got help from this blog. I want to get output like this:
{
  "ads": [
    {
      "position": 1,
      "link": "https://www.googleadservices.com/pagead/aclk?sa=L&ai=DChcSEwitk5Ou2qX6AhVK07IKHdyyCwQYABADGgJscg&ohost=www.google.com&cid=CAASJeRoa3Q-GtJJqeqbQ0EjhhL22QNYj4Sg_79Man_cWa0tjzSi8Ho&sig=AOD64_3-qhJH4tfcxt1VMfxwOTF8BKeFXA&q&adurl&ved=2ahUKEwikz4uu2qX6AhVXAxAIHfwECwoQ0Qx6BAgFEAM"
    },
    {
      "position": 2,
      "link": "https://www.googleadservices.com/pagead/aclk?sa=L&ai=DChcSEwitk5Ou2qX6AhVK07IKHdyyCwQYABAAGgJscg&ohost=www.google.com&cid=CAASJeRoa3Q-GtJJqeqbQ0EjhhL22QNYj4Sg_79Man_cWa0tjzSi8Ho&sig=AOD64_1ZUcXQhcCFUYnBHo3jqlckXL2agg&q&adurl&ved=2ahUKEwikz4uu2qX6AhVXAxAIHfwECwoQ0Qx6BAgCEAE"
    }
  ]
}
But I got this instead:
{'href': 'https://policies.google.com/terms?hl=en-PL&fg=1'}
Why aren't the links being appended to the JSON in self.cur? I have tried appending them, but I get a KeyError every time.
Reply from a uj5u.com user:
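The problem is here:
self.cur["" + name] = link
Because name == 'href' every time this line runs, the assignment always overwrites the value stored under that one key instead of appending to it. Try this: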
    def handle_starttag(self, tag, attrs):
        # Only parse the 'anchor' tag.
        if tag == "a":
            for name, link in attrs:
                if name == "href" and link.startswith("http"):
                    # Count from 1 so positions match the desired output.
                    self.line += 1
                    cur = {}
                    cur["position"] = self.line
                    cur["link"] = link
                    # Create the "ads" list on first use to avoid a KeyError.
                    self.doc.setdefault("ads", []).append(cur)
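To see why assigning to a single key kept only the last link, here is a minimal sketch comparing the two approaches outside the parser. The sample URLs and the names `links`, `overwritten`, and `collected` are made up for illustration.

```python
# Hypothetical sample links, for illustration only.
links = ["https://example.com/a", "https://example.com/b"]

# Assigning to the same key replaces the previous value on every iteration.
overwritten = {}
for link in links:
    overwritten["href"] = link

# Appending to a list stored under one key keeps every link.
collected = {"ads": []}
for position, link in enumerate(links, start=1):
    collected["ads"].append({"position": position, "link": link})

print(overwritten)
print(collected)
```

The first dictionary ends up with one entry no matter how many links are seen, while the second grows by one record per link.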
Note, though, that based on what you described, you should change link.startswith("http") to link.startswith("/url?q=").
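Putting the pieces together, one possible end-to-end sketch follows. It assumes the search page wraps result links as `/url?q=<target>&...`, as suggested above. The class name `LinkCollector`, the helper `links_to_json`, and the sample HTML are hypothetical, and decoding the target URL with `urllib.parse` is an extra step that is not part of the original answer.

```python
import json
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlsplit


class LinkCollector(HTMLParser):
    """Collect redirect-style anchors into {"ads": [{"position", "link"}, ...]}."""

    def __init__(self):
        super().__init__()
        self.doc = {"ads": []}

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Assumption: result links look like /url?q=<target>&...
            if name == "href" and value and value.startswith("/url?q="):
                # Pull the real destination out of the "q" query parameter.
                target = parse_qs(urlsplit(value).query).get("q", [""])[0]
                self.doc["ads"].append(
                    {"position": len(self.doc["ads"]) + 1, "link": target}
                )


def links_to_json(html):
    """Parse the HTML text and return the collected links as a JSON string."""
    parser = LinkCollector()
    parser.feed(html)
    return json.dumps(parser.doc, indent=2)


# Made-up markup standing in for the real search page.
sample_html = (
    '<a href="/url?q=https://example.com/one&amp;sa=U">one</a>'
    '<a href="https://policies.google.com/terms">terms</a>'
    '<a href="/url?q=https://example.org/two&amp;sa=U">two</a>'
)
print(links_to_json(sample_html))
```

Deriving the position from the current list length avoids keeping a separate counter in sync, and `json.dumps` produces real JSON text rather than the Python dict repr shown in the question.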
