Converting links from an HTML page to JSON format in Python

I have an HTML page (a search results page) and I want to extract the links from it and convert them to JSON format.

This is what I have tried.

from html.parser import HTMLParser

class HtmltoJsonParser(HTMLParser):
    def __init__(self, raise_exception=True):
        HTMLParser.__init__(self)
        #self.reset()
        self.doc = {}
        self.path = []
        self.cur = self.doc
        self.line = 0
        self.raise_exception = raise_exception

    @property
    def json(self):
        return self.doc

    @staticmethod
    def to_json(content, raise_exception=True):
        parser = HtmltoJsonParser(raise_exception=raise_exception)
        parser.feed(content)
        return parser.json

    def handle_starttag(self, tag, attrs):
        # Only parse the 'anchor' tag.
        if tag == "a":
            for name, link in attrs:
                if name == "href" and link.startswith("http"):
                    self.cur["" + name] = link
                    #print (link)

I got help from this blog. I would like to get output like this:

{
    "ads": [
        {
            "position": 1,
            "link": "https://www.googleadservices.com/pagead/aclk?sa=L&ai=DChcSEwitk5Ou2qX6AhVK07IKHdyyCwQYABADGgJscg&ohost=www.google.com&cid=CAASJeRoa3Q-GtJJqeqbQ0EjhhL22QNYj4Sg_79Man_cWa0tjzSi8Ho&sig=AOD64_3-qhJH4tfcxt1VMfxwOTF8BKeFXA&q&adurl&ved=2ahUKEwikz4uu2qX6AhVXAxAIHfwECwoQ0Qx6BAgFEAM"
        },
        {
            "position": 2,
            "link": "https://www.googleadservices.com/pagead/aclk?sa=L&ai=DChcSEwitk5Ou2qX6AhVK07IKHdyyCwQYABAAGgJscg&ohost=www.google.com&cid=CAASJeRoa3Q-GtJJqeqbQ0EjhhL22QNYj4Sg_79Man_cWa0tjzSi8Ho&sig=AOD64_1ZUcXQhcCFUYnBHo3jqlckXL2agg&q&adurl&ved=2ahUKEwikz4uu2qX6AhVXAxAIHfwECwoQ0Qx6BAgCEAE"
        }
    ]
}

But I get this instead:

{'href': 'https://policies.google.com/terms?hl=en-PL&fg=1'}

Why are the links not being appended to the JSON in self.cur? I have tried appending them, but I get a KeyError every time.

Answer from a uj5u.com user:

The problem is here:

self.cur["" + name] = link

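Because name is always 'href' at this point, the assignment writes to the same key every time, so each link overwrites the previous one instead of being appended. Try this instead: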

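    def handle_starttag(self, tag, attrs):
        # Only parse the 'anchor' tag.
        if tag == "a":
            for name, link in attrs:
                # Attributes without a value are reported as None, so check link first.
                if name == "href" and link and link.startswith("http"):
                    self.line += 1  # positions start at 1, as in the desired output
                    cur = {"position": self.line, "link": link}
                    # self.doc starts out empty, so create the "ads" list on first use;
                    # self.doc["ads"].append(cur) on its own raises KeyError.
                    self.doc.setdefault("ads", []).append(cur)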
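
To see why the original assignment keeps only one link, and why appending raised a KeyError, here is a minimal sketch that is independent of the parser. The example URLs are made up for illustration. Assigning to a fixed key replaces the previous value, while appending to a list stored under a key keeps every entry, provided the list exists first.

```python
# Hypothetical example links, used only for illustration.
links = ["https://example.com/a", "https://example.com/b"]

# Assigning to the same key on every iteration keeps only the last value.
overwritten = {}
for link in links:
    overwritten["href"] = link

# Appending under a key that was never created fails with KeyError.
doc = {}
try:
    doc["ads"].append({"position": 1, "link": links[0]})
    key_error_raised = False
except KeyError:
    key_error_raised = True

# setdefault creates the list on first use and returns it for appending.
collected = {}
for position, link in enumerate(links, start=1):
    collected.setdefault("ads", []).append({"position": position, "link": link})

print(overwritten)
print(key_error_raised)
print(collected)
```

Initializing the dictionary as {"ads": []} in __init__ would work just as well as setdefault; either way the list must exist before the first append.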

Also, given the page you described, you should change link.startswith("http") to link.startswith("/url?q=").
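
Putting the pieces together, the following is a minimal, self-contained sketch rather than a definitive implementation. The class name LinkCollector, its prefix parameter, and the sample HTML are hypothetical and chosen only for illustration. It also assumes that the real target of a /url?q=... link sits in the q query parameter, which may not hold for every page.

```python
import json
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlsplit


class LinkCollector(HTMLParser):
    """Collect anchor hrefs that start with a given prefix (hypothetical helper)."""

    def __init__(self, prefix="/url?q="):
        super().__init__()
        self.prefix = prefix
        self.doc = {"ads": []}  # create the list up front to avoid KeyError

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes that have no value
            if name == "href" and value and value.startswith(self.prefix):
                # Assumption: the real target is in the "q" query parameter.
                query = parse_qs(urlsplit(value).query)
                target = query.get("q", [value])[0]
                self.doc["ads"].append(
                    {"position": len(self.doc["ads"]) + 1, "link": target}
                )


# Made-up sample markup, not taken from the page in the question.
sample_html = """
<a href="/url?q=https://example.com/first&amp;sa=U">First</a>
<a href="/search?q=ignored">Ignored</a>
<a href="/url?q=https://example.org/second&amp;sa=U">Second</a>
"""

parser = LinkCollector()
parser.feed(sample_html)
result = json.dumps(parser.doc, indent=4)
print(result)
```

Passing prefix="http" instead would collect absolute links as in the original attempt; in that case the q lookup finds nothing for most links and the href itself is stored.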
