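I'm having a problem with Puppeteer: I want to extract a list of items, and it works when headless is false but not when it is true.
First, I want to get the elements before mapping over them.
Here is my script; you should be able to copy and run it, it's really basic.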
const puppeteer = require("puppeteer");
const chalk = require("chalk");
const baseUrl = "https://www.interencheres.com/recherche/lots?search=";
const searchTerm = "Apple";
const searchUrl = baseUrl + searchTerm;
(async () => {
  const browser = await puppeteer.launch({
    headless: false,
    ignoreHTTPSErrors: true,
    args: [`--window-size=1920,1080`],
    defaultViewport: {
      width: 1920,
      height: 1080,
    },
  });
  const page = await browser.newPage();
  // Begin navigation
  console.log(chalk.yellow("Beginning navigation."));
  await page.goto(searchUrl);
  // Await list of elements
  console.log(chalk.yellow("Wait for Network Idle..."));
  await page.waitForNetworkIdle();
  // Get items
  const findElements = await page.evaluate(() => {
    const elements = document.querySelectorAll(".sale-item");
    console.log(elements);
    return elements;
  });
  console.log(findElements);
  console.log(chalk.blue("Waiting..."));
  await page.waitForTimeout(10000);
  await browser.close();
  console.log(chalk.red("Closed."));
})();
Expected results : {
'0': { _prevClass: 'sale-item pa-1 col-sm-6 col-md-4 col-lg-3 col-12' },
'1': { _prevClass: 'sale-item pa-1 col-sm-6 col-md-4 col-lg-3 col-12' },
'2': { _prevClass: 'sale-item pa-1 col-sm-6 col-md-4 col-lg-3 col-12' },
'3': { _prevClass: 'sale-item pa-1 col-sm-6 col-md-4 col-lg-3 col-12' },
'4': { _prevClass: 'sale-item pa-1 col-sm-6 col-md-4 col-lg-3 col-12' },
.
.
}
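Answer from a uj5u.com community member:
For starters, I prefer page.waitForSelector(yourSelector) over page.waitForNetworkIdle(). In most cases it is a more direct guarantee that the data you want is on the page, whereas waiting for network idle can block on all sorts of requests that have nothing to do with the data you are trying to scrape.
Some sites check headers to block scrapers. You can try adding a user agent header, as described in the Puppeteer GitHub issue "Different behavior between { headless: false } and { headless: true }" #665: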
const puppeteer = require("puppeteer");
const baseUrl = "https://www.interencheres.com/recherche/lots?search=";
const searchTerm = "Apple";
const searchUrl = baseUrl + searchTerm;
let browser;
(async () => {
  browser = await puppeteer.launch({headless: true});
  const [page] = await browser.pages();
  // Present a regular desktop Chrome user agent instead of the headless default
  await page.setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36");
  await page.goto(searchUrl);
  // Wait for the items themselves rather than for network idle
  await page.waitForSelector(".sale-item");
  const elements = await page.$$(".sale-item");
  console.log(elements.length); // => 48
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close())
;
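Note that page.$$ returns element handles, and only serializable values come back from page.evaluate, so DOM nodes arrive as plain objects at best. To get at the data itself, one option is to map each element to a plain object inside the page. The following is a minimal sketch, assuming the .sale-item selector from the question; the helper names toPlainItems and scrapeItems and the chosen fields (className, text) are illustrative and not part of the original answer:

```javascript
// Runs inside the page (via $$eval), so it must not reference outer variables.
// Hypothetical helper: maps DOM elements to plain, serializable objects.
function toPlainItems(nodes) {
  return Array.from(nodes, (el) => ({
    className: el.className,
    text: (el.textContent || "").trim(),
  }));
}

// `page` is a Puppeteer Page; the default selector is taken from the question.
async function scrapeItems(page, selector = ".sale-item") {
  await page.waitForSelector(selector);
  // $$eval passes the matched elements to toPlainItems in the browser context
  return page.$$eval(selector, toPlainItems);
}
```

Inside the async block of the script, console.log(await scrapeItems(page)) would then print an array of plain objects rather than handles.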
Using puppeteer-extra, as described in "Why does headless need to be false for Puppeteer to work?", is another option you can try. It also anonymizes the user agent header.
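For a dependency-free approximation of the user agent part, a commonly reported difference is that headless Chrome announces itself as HeadlessChrome in its default user agent. The sketch below assumes that this token is what the site keys on, which is not confirmed for this site; normalizeUserAgent and applyNormalUserAgent are hypothetical helper names:

```javascript
// Replace the "HeadlessChrome" product token with "Chrome".
// Strings without that token pass through unchanged.
function normalizeUserAgent(ua) {
  return ua.replace("HeadlessChrome/", "Chrome/");
}

// `browser` and `page` are Puppeteer Browser and Page instances.
async function applyNormalUserAgent(browser, page) {
  const ua = await browser.userAgent(); // the browser's own default user agent
  await page.setUserAgent(normalizeUserAgent(ua));
}
```

Compared with a hard-coded string, this keeps the real Chrome version in the header, so it should not go stale when the bundled browser is updated. It would be called before page.goto.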
