Using Python, BeautifulSoup, and Pandas to read URLs from a .csv and prepend them to the scraped results

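Despite my inexperience, I have this code almost working. Please help me bring it home!

Problem 1: Input:

I have a long list of URLs (1000+) to read, stored in a single column of a .csv file. I would rather read them from that file than paste them into the code as shown below.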
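
A minimal sketch of one way to read the URLs with pandas, assuming the file has no header row and the URLs sit in the first column. The helper name `read_urls` and the sample URLs are hypothetical; the demo reads from an in-memory file so it runs without any file on disk.

```python
import io

import pandas as pd


def read_urls(source, column=0):
    """Return the URLs found in one column of a CSV file as a list of strings."""
    # header=None assumes the file has no header row (adjust if yours has one)
    frame = pd.read_csv(source, header=None, usecols=[column], dtype=str)
    # Drop empty cells and surrounding whitespace so requests.get() gets clean URLs
    cleaned = frame[column].dropna().str.strip()
    return [u for u in cleaned if u]


# Demo with an in-memory file; in practice pass a path to your own .csv instead
sample = io.StringIO("https://example.com/a/\nhttps://example.com/b/\n\n")
urls = read_urls(sample)
print(urls)
```

The resulting `urls` list can replace the hard-coded list in the loop below.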

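Problem 2: Output:

The source pages actually have 3 drivers and 3 challenges. In a separate Python file, the code below finds, prints, and saves all 3, but it does not when I use the DataFrame approach shown here (see further down: it saves only 2).

Problem 3: Output:

I want the output (two files) to contain the URL in column 0, followed by the drivers (or challenges) in the columns after it. But what I wrote here (probably the "drop") not only pushes them down a row, it also shifts them over by 2 columns.

Finally, I show the input, the current output, and the desired output. Sorry for the long question. I would be very grateful for your help!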

import requests
from bs4 import BeautifulSoup
import pandas as pd

urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']
dataframes = []
dataframes2 = []

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    toc = soup.find("div", id="toc")

    def get_drivers():
        data = []
        for x in toc.select('li:-soup-contains-own("Market drivers") li'):
            data.append(x.get_text(strip=True))
        df = pd.DataFrame(data, columns=[url])
        dataframes.append(pd.DataFrame(df).drop(0, axis=0))
        df2 = pd.concat(dataframes)
        tdata = df2.T
        tdata.to_csv(f'detail-dr.csv', header=True)

    get_drivers()


    def get_challenges():
        data = []
        for y in toc.select('li:-soup-contains-own("Market challenges") li'):
            data.append(y.get_text(strip=True).replace('Table Impact of drivers and challenges', ''))
        df = pd.DataFrame(data, columns=[url])
        dataframes2.append(pd.DataFrame(df).drop(0, axis=0))
        df2 = pd.concat(dataframes2)
        tdata = df2.T
        tdata.to_csv(f'detail-ch.csv', header=True)

    get_challenges()

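The input at each URL looks like this. They are just lists:

Market drivers

Growing investment in fabs
Miniaturization of electronic products
Increasing demand for IoT devices

Market challenges

Rapid technological changes in semiconductor industry
Volatility in semiconductor industry
Impact of technology chasmTable Impact of drivers and challenges

The output I want for the drivers is: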

0	1	2	3
http/.../Global-Induction-Hobs-30196623/	Product innovations and new designs	Increasing demand for convenient home appliances with changes in lifestyle patterns	Growing adoption of energy-efficient appliances
http/.../Global-Human-Capital-Management-30196628/	Demand for automated recruitment processes	Increasing demand for unified solutions for all HR functions	Increasing workforce diversity
http/.../Global-Probe-Card-30196643/	Growing investment in fabs	Miniaturization of electronic products	Increasing demand for IoT devices

But instead I get:

0	1	2	3	4	5	6
http/.../Global-Induction-Hobs-30196623/	Increasing demand for convenient home appliances with changes in lifestyle patterns	Growing adoption of energy-efficient appliances
http/.../Global-Human-Capital-Management-30196628/			Increasing demand for unified solutions for all HR functions	Increasing workforce diversity
http/.../Global-Probe-Card-30196643/					Miniaturization of electronic products	Increasing demand for IoT devices
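
The staggered result appears to come from two things in the code above: `drop(0, axis=0)` removes the row labelled 0 (the first item of every list), and `pd.concat` stacks frames whose single column is named after a different URL, so each URL's items land in their own block of rows before the transpose. A small sketch with stand-in data (the keys `url-a` and `url-b` and the item names are made up) reproduces the effect offline and contrasts it with a one-row-per-URL layout:

```python
import pandas as pd

# Stand-in data instead of live scraping; the keys play the role of the URLs
scraped = {
    "url-a": ["a1", "a2", "a3"],
    "url-b": ["b1", "b2", "b3"],
}

frames = []
for url, items in scraped.items():
    df = pd.DataFrame(items, columns=[url])
    # drop(0) removes the row labelled 0, which is the first scraped item
    frames.append(df.drop(0, axis=0))

# Different column names mean concat stacks the rows and fills the gaps with NaN
staggered = pd.concat(frames).T
print(staggered)

# One row per URL, items in columns 0..n-1, nothing dropped
aligned = pd.DataFrame.from_dict(scraped, orient="index")
print(aligned)
```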

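A uj5u.com user replied:

Store your data in a list of dicts and create a DataFrame from it. Split the lists of drivers/challenges into individual columns and concat them to the final DataFrame.

Example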

import requests
from bs4 import BeautifulSoup
import pandas as pd

urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']
data = []

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    toc = soup.find("div", id="toc")

    def get_drivers():
        data.append({
            'url':url,
            'type':'driver',
            'list':[x.get_text(strip=True) for x in toc.select('li:-soup-contains-own("Market drivers") li')]
        })

    get_drivers()


    def get_challenges():
        data.append({
            'url':url,
            'type':'challenges',
            'list':[x.text.replace('Table Impact of drivers and challenges','') for x in toc.select('li:-soup-contains-own("Market challenges") ul li') if x.text != 'Table Impact of drivers and challenges']
        })

    get_challenges()

    
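# Keep url/type as the leading columns and expand each list into one column per item
df = pd.DataFrame(data)
result = pd.concat([df[['url', 'type']], pd.DataFrame(df['list'].tolist())], axis=1)
print(result)  # or save it, e.g. result.to_csv(sep='|')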

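Output

url	type	0	1	2
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/	driver	Product innovations and new designs	Increasing demand for convenient home appliances with changes in lifestyle patterns	Growing adoption of energy-efficient appliances
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/	challenges	High cost limiting the adoption in the mass market	Health hazards related to induction hobs	Limitation of using only flat-surface utensils and induction-specific cookware
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/	driver	Demand for automated recruitment processes	Increasing demand for unified solutions for all HR functions	Increasing workforce diversity
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/	challenges	Threat from open-source software	High implementation and maintenance cost	Threat to data security
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/	driver	Growing investment in fabs	Miniaturization of electronic products	Increasing demand for IoT devices
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/	challenges	Rapid technological changes in semiconductor industry	Volatility in semiconductor industry	Impact of technology chasm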
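
The question asks for two separate files with the URL in the first column. One possible way to get there from the same list of dicts is to filter by `type` before expanding the lists. This is only a sketch: the helper `to_wide` is hypothetical, and the `data` list below is a shortened stand-in for what the scraping loop would collect.

```python
import pandas as pd

# Shortened stand-in for the `data` list built by the scraping loop
data = [
    {"url": "url-a", "type": "driver", "list": ["d1", "d2", "d3"]},
    {"url": "url-a", "type": "challenges", "list": ["c1", "c2", "c3"]},
    {"url": "url-b", "type": "driver", "list": ["d4", "d5"]},
    {"url": "url-b", "type": "challenges", "list": ["c4", "c5", "c6"]},
]


def to_wide(records, kind):
    """Return one row per URL for the given type, with list items in columns 0..n-1."""
    rows = [r for r in records if r["type"] == kind]
    # Lists of unequal length are padded with missing values
    wide = pd.DataFrame([r["list"] for r in rows])
    # Put the URL in front so it ends up as the first column of the CSV
    wide.insert(0, "url", [r["url"] for r in rows])
    return wide


drivers = to_wide(data, "driver")
challenges = to_wide(data, "challenges")

# Printing the CSV text here; pass a file name to to_csv to write a file instead
print(drivers.to_csv(index=False))
print(challenges.to_csv(index=False))
```

Calling `drivers.to_csv('detail-dr.csv', index=False)` and `challenges.to_csv('detail-ch.csv', index=False)` would write the two files named in the question.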


Tags: python pandas web-scraping beautifulsoup export-to-csv
