Web scraping difficulty: the URL stays the same when searching or paging, and individual search results have no URL of their own

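I have been trying to work out how to scrape this page.

https://www.mercadopublico.cl/Home is an open marketplace run by the Chilean government, where you can apply to provide services to the state.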

So I searched for "camas" (Spanish for "beds").

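The first obstacle I found is that the URL does not change at all when I search: https://www.mercadopublico.cl/Home/BusquedaLicitacion is the same for every search.

The second obstacle is that it does not change when I go to the next page either. So I cannot build a list of URLs by varying a page parameter, which is what I wanted to do.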

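The third obstacle is that most of the information I want is in a popup window opened from the main window, whose URL also stays the same.

The information there can be downloaded as CSV or JSON, or scraped from the popup itself.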

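So far, though, I have not found a way around the fact that the URL does not change when I change the search or the page. I cannot plan the rest because I cannot get past this first part.

I think scraping the popup will be easier, because at that point I do have a URL. (The popups do have distinct URLs!)

If you know how to do this, or whether I need a different approach (so far I have only ever used BS4 for this), please tell me which direction to take.

This is my first error, which I do not know how to solve with my usual code, and I cannot get any further without help: I cannot vary the URL to build a list of URLs, because the range approach does not work here.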

# -*- coding: utf-8 -*-
"""
Spyder Editor

This is a temporary script file.
"""

import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.mercadopublico.cl/Home/BusquedaLicitacion'

# problem here: I can't page through the results because the site loads them via AJAX
params = {
    'page': 0,
    'page1': 40,
}

results = []

for offset in range(0, 121, 40):  # this approach doesn't work on an AJAX page

    params['start'] = offset

    response = requests.get(url, params=params)
    print('url:', response.url)
    #print('status:', response.status_code)
                    
    soup = bs(response.text, "html.parser")

    all_products = soup.find_all('div', {'class': 'product-tile'})

    for product in all_products:
        itemid = product.get('data-itemid') 
        print('itemid:', itemid)

        data = product.get('data-product') 
        print('data:', data)
        
        name = product.find('span', {'itemprop': 'name'}).text
        print('name:', name)
        
        all_prices = product.find_all('div', {'class': 'price__text'})
        print('len(all_prices):', len(all_prices))
        
        price = all_prices[0].get('aria-label')
        print('price:', price)
        
        results.append( (itemid, name, price, data) )
        print('results')

# ---

# ... here you can save all `results` in file ...
import pandas as pd
# Name the columns explicitly; using results[0] as the header would drop the first row.
df = pd.DataFrame(results, columns=['itemid', 'name', 'price', 'data'])
df.to_excel('results.xlsx', index=False)  # write to Excel file

So now I am trying to get the URLs with this modified code:

import requests
from bs4 import BeautifulSoup as bs    
from selenium import webdriver

# set chromedriver.exe path
driver = webdriver.Chrome(executable_path="C:\\chromedriver.exe")
#implicit wait
driver.implicitly_wait(0.5)
#maximize browser
driver.maximize_window()
#launch URL
driver.get('https://www.mercadopublico.cl/Home/BusquedaLicitacion')
#identify element
l =driver.find_element_by_xpath("//button[text()='Check it Now']")
#perform click
driver.execute_script("arguments[0].click();", l);

    
url = 'https://www.mercadopublico.cl/Home/BusquedaLicitacion'
    
response = requests.get(url)
print('url:', response.url)
#print('status:', response.status_code)
                        
soup = bs(response.text, "html.parser")
    
all_products = soup.find_all('a', {'href': '#'})
    
for product in all_products:
    itemurl = product.get('onclick') 
    print('itemurl:', itemurl)  # up to here

#close browser
driver.quit()

But nothing gets printed, and I am not sure what is failing.

Thanks a lot.

Answer from a uj5u.com user:

The URL does not change because the search is sent as a POST request.

POST https://www.mercadopublico.cl/BuscarLicitacion/Home/Buscar

The request data is:

{
  "textoBusqueda":"camas",
  "idEstado":"5",
  "codigoRegion":"-1",
  "idTipoLicitacion":"-1",
  "fechaInicio":null,
  "fechaFin":null,
  "registrosPorPagina":"10",
  "idTipoFecha":[],
  "idOrden":"1",
  "compradores":[],
  "garantias":null,
  "rubros":[],
  "proveedores":[],
  "montoEstimadoTipo":[0],
  "esPublicoMontoEstimado":null,
  "pagina":0
}
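Since paging is driven by the `pagina` field of this body rather than by the URL, the usual `range` loop can be applied to the payload instead. A minimal sketch, assuming `pagina` is a zero-based page index and `registrosPorPagina` is the page size (both inferred from the field names, not confirmed by the site); `build_payload` is a hypothetical helper:

```python
import copy

# Base request body copied from the captured request above.
BASE_PAYLOAD = {
    "textoBusqueda": "camas",
    "idEstado": "5",
    "codigoRegion": "-1",
    "idTipoLicitacion": "-1",
    "fechaInicio": None,
    "fechaFin": None,
    "registrosPorPagina": "10",
    "idTipoFecha": [],
    "idOrden": "1",
    "compradores": [],
    "garantias": None,
    "rubros": [],
    "proveedores": [],
    "montoEstimadoTipo": [0],
    "esPublicoMontoEstimado": None,
    "pagina": 0,
}


def build_payload(query, page, per_page=10):
    """Return a copy of the captured body with search text and page swapped in.

    Assumption: "pagina" is a zero-based page index.
    """
    payload = copy.deepcopy(BASE_PAYLOAD)
    payload["textoBusqueda"] = query
    payload["registrosPorPagina"] = str(per_page)
    payload["pagina"] = page
    return payload


# One payload per page, analogous to looping over URLs with range().
payloads = [build_payload("camas", page) for page in range(3)]
print([p["pagina"] for p in payloads])
```

Each payload would then be sent in its own POST to the search endpoint.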

There is also a cookie, __RequestVerificationToken_L0hvbWU1, that may be required.

You can then get the link to the popup from the returned HTML. It is in the link's onclick attribute.
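As an illustration of that step, the URL can be pulled out of the `onclick` value with a regular expression. This is only a sketch: it assumes the attribute has the form `$.Busqueda.verFicha('<url>')`, which is the form this answer's Python example strips. `extract_ficha_url` is a hypothetical helper name and the sample value is made up:

```python
import re

# Matches $.Busqueda.verFicha('...') and captures the quoted argument.
FICHA_RE = re.compile(r"\$\.Busqueda\.verFicha\('([^']*)'\)")


def extract_ficha_url(onclick):
    """Return the URL inside a verFicha(...) onclick value, or None if absent."""
    match = FICHA_RE.search(onclick or "")
    return match.group(1) if match else None


# Made-up attribute value, for illustration only.
sample = "$.Busqueda.verFicha('https://example.com/ficha?id=123')"
print(extract_ficha_url(sample))
```

Compared with chained `str.replace` calls, a pattern like this returns `None` for links whose `onclick` has some other shape instead of passing them through unchanged.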

If you need more help, ask in the comments.

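Python example: so far I have it working up to the last step. When I looked at the CSV and JSON files, I realized that both are invalid. The site seems to append some HTML at the bottom of each. I suggest scraping the data from the final page rather than downloading the CSV/JSON.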
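If the JSON download is still wanted, one possible workaround is to parse only the leading JSON value and discard whatever follows it. The sketch below assumes the valid JSON comes first and the stray HTML is appended after it; that layout has not been verified against the site, and `parse_leading_json` is a hypothetical helper:

```python
import json


def parse_leading_json(text):
    """Decode the first JSON value in text and return (value, trailing_text)."""
    stripped = text.lstrip()
    # raw_decode stops at the end of the first JSON value instead of
    # raising on extra data, and reports where that value ended.
    value, end = json.JSONDecoder().raw_decode(stripped)
    return value, stripped[end:]


# Made-up download body: valid JSON followed by stray HTML.
body = '[{"id": 1, "nombre": "camas"}]\n<div>footer</div>'
data, junk = parse_leading_json(body)
print(data)
```

If the file does not start with JSON at all, `raw_decode` raises `json.JSONDecodeError`, in which case scraping the page is the safer route.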

import requests
from bs4 import BeautifulSoup


def get_headers(session):
    # Visit the home page first so the session stores the cookies the site sets.
    res = session.get("https://www.mercadopublico.cl/Home")
    if res.status_code == 200:
        print("Got headers")
        # return res.text
    else:
        print("Failed to get headers")



def search(session):
    data = {
        "textoBusqueda": "Camas",
        "idEstado": "5",
        "codigoRegion": "-1",
        "idTipoLicitacion": "-1",
        "fechaInicio": None,
        "fechaFin": None,
        "registrosPorPagina": "10",
        "idTipoFecha": [],
        "idOrden": "1",
        "compradores": [],
        "garantias": None,
        "rubros": [],
        "proveedores": [],
        "montoEstimadoTipo": [0],
        "esPublicoMontoEstimado": None,
        "pagina": 0
    }
    res = session.post(
        "https://www.mercadopublico.cl/BuscarLicitacion/Home/Buscar",
        data=data)
    if res.status_code == 200:
        print("Search succeeded")
        return res.text
    else:
        print("Search failed with error:", res.reason)



def get_popup_link(html):
    soup = BeautifulSoup(html, "html.parser")
    dirty_links = [link["onclick"] for link in soup.select(".lic-block-body a")]
    # clean onclick links
    clean_links = [link.replace("$.Busqueda.verFicha('", "").replace("')", "") for link in dirty_links]
    return clean_links


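def get_download_html(s, links):
    # Fetch every popup page and collect the HTML,
    # rather than returning as soon as the first link succeeds.
    pages = []
    for link in links:
        res = s.get(link)
        if res.status_code == 200:
            print("fetch succeeded")
            pages.append(res.text)
        else:
            print("fetch failed with error:", res.reason)
    return pages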

def get_download_links(html):
    # Unused so far; currently identical to get_popup_link.
    soup = BeautifulSoup(html, "html.parser")
    dirty_links = [link["onclick"] for link in soup.select(".lic-block-body a")]
    # clean onclick links
    clean_links = [link.replace("$.Busqueda.verFicha('", "").replace("')", "") for link in dirty_links]
    return clean_links

def main():
    with requests.Session() as s:
        get_headers(s)
        html = search(s)
        popup_links = get_popup_link(html)
        print(popup_links)
        download_html = get_download_html(s, popup_links)
        # print(download_html)

main()
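To go beyond the first page, the search step can be repeated with an increasing page number until a page comes back empty. A rough sketch, assuming a page with no links marks the end of the results (not confirmed for this site). `collect_links` and its `fetch_links` argument are hypothetical; in practice `fetch_links(page)` would POST the search body with `pagina` set to `page` and return the cleaned popup links:

```python
def collect_links(fetch_links, max_pages=50):
    """Call fetch_links(page) for page 0, 1, 2, ... until it returns nothing."""
    all_links = []
    for page in range(max_pages):
        links = fetch_links(page)
        if not links:
            break  # assumed end-of-results signal
        all_links.extend(links)
    return all_links


# Stand-in for the real network call, so the loop can be run offline.
fake_pages = {0: ["a", "b"], 1: ["c"]}
result = collect_links(lambda page: fake_pages.get(page, []))
print(result)
```

The `max_pages` cap is there so that a site which never returns an empty page cannot keep the loop running indefinitely.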

Please credit the source when reposting. Link to this article: https://www.uj5u.com/caozuo/448274.html
