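Despite my inexperience, I have this code almost working. Please help me get it over the line!
- Problem 1: Input:
I have a long list of URLs (around 1000) to read, stored in a single column of a .csv file. I would rather read them from that file than paste them into the code as shown below.
- Problem 2: Output:
Each source page actually has 3 drivers and 3 challenges. In a separate Python file, the code below finds, prints and saves all 3, but not when I use the dataframe approach below (see further down: it only saves 2).
- Problem 3: Output:
I want the output (two files) to have the URL in column 0, followed by the drivers (or challenges) in the next columns. But what I have written here (probably the "drop") not only drops one row, it also shifts the values over by 2 columns.
At the end I show the input, the current output and the desired output. Sorry for the long question. I would really appreciate your help!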
import requests
from bs4 import BeautifulSoup
import pandas as pd

urls = [
    'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/',
    'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/',
    'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/',
]
dataframes = []
dataframes2 = []

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    toc = soup.find("div", id="toc")

    def get_drivers():
        data = []
        for x in toc.select('li:-soup-contains-own("Market drivers") li'):
            data.append(x.get_text(strip=True))
        df = pd.DataFrame(data, columns=[url])
        dataframes.append(pd.DataFrame(df).drop(0, axis=0))
        df2 = pd.concat(dataframes)
        tdata = df2.T
        tdata.to_csv(f'detail-dr.csv', header=True)

    get_drivers()

    def get_challenges():
        data = []
        for y in toc.select('li:-soup-contains-own("Market challenges") li'):
            data.append(y.get_text(strip=True).replace('Table Impact of drivers and challenges', ''))
        df = pd.DataFrame(data, columns=[url])
        dataframes2.append(pd.DataFrame(df).drop(0, axis=0))
        df2 = pd.concat(dataframes2)
        tdata = df2.T
        tdata.to_csv(f'detail-ch.csv', header=True)

    get_challenges()
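The input at each URL looks like this. They are just lists:
Market drivers
- Growing investment in fabs
- Miniaturization of electronic products
- Increasing demand for IoT devices
Market challenges
- Rapid technological changes in the semiconductor industry
- Volatility in the semiconductor industry
- Impact of technology chasm Table Impact of drivers and challenges
The output I want for the drivers is: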
| 0 | 1 | 2 | 3 |
|---|---|---|---|
| http/.../Global-Induction-Hobs-30196623/ | Product innovations and new designs | Increasing demand for convenient home appliances with changes in lifestyle patterns | Growing adoption of energy-efficient appliances |
| http/.../Global-Human-Capital-Management-30196628/ | Demand for automated recruitment processes | Increasing demand for unified solutions for all HR functions | Increasing workforce diversity |
| http/.../Global-Probe-Card-30196643/ | Growing investment in fabs | Miniaturization of electronic products | Increasing demand for IoT devices |
But instead I get:
| 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| http/.../Global-Induction-Hobs-30196623/ | Increasing demand for convenient home appliances with changes in lifestyle patterns | Growing adoption of energy-efficient appliances | | | | |
| http/.../Global-Human-Capital-Management-30196628/ | Increasing demand for unified solutions for all HR functions | Increasing workforce diversity | | | | |
| http/.../Global-Probe-Card-30196643/ | Miniaturization of electronic products | Increasing demand for IoT devices | | | | |
Reply from a uj5u.com user:
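Before changing the approach, it may help to see where the two symptoms in the question likely come from. The sketch below reproduces them offline; the keys `url_a` and `url_b` and the item names are made-up placeholders, not scraped data. `drop(0, axis=0)` removes the row labelled 0, which is the first item of each list, and `pd.concat` on frames whose column names differ (one URL each) stacks the rows and pads the rest with NaN, so after transposing, each URL's values sit in their own pair of columns.

```python
import pandas as pd

# Hypothetical stand-ins for two scraped pages (not real data).
scraped = {
    "url_a": ["a1", "a2", "a3"],
    "url_b": ["b1", "b2", "b3"],
}

frames = []
for url, items in scraped.items():
    df = pd.DataFrame(items, columns=[url])
    # drop(0, axis=0) deletes the row labelled 0, i.e. the first item.
    frames.append(df.drop(0, axis=0))

# Each frame has a different column name, so concat stacks the rows
# and fills the cells that do not belong to that URL with NaN.
wide = pd.concat(frames).T
print(wide)
```

In the transposed result the second URL's values start two columns to the right of the first URL's, and the first item of every list is missing.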
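Store your data in a list of dicts and create a data frame from it. Then split the list of drivers/challenges into single columns and concatenate them to the final data frame.
Example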
import requests
from bs4 import BeautifulSoup
import pandas as pd

urls = [
    'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/',
    'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/',
    'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/',
]
data = []  # one dict per (url, type) pair

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    toc = soup.find("div", id="toc")

    def get_drivers():
        data.append({
            'url': url,
            'type': 'driver',
            'list': [x.get_text(strip=True) for x in toc.select('li:-soup-contains-own("Market drivers") li')]
        })

    get_drivers()

    def get_challenges():
        data.append({
            'url': url,
            'type': 'challenges',
            # skip the standalone "Table Impact ..." entry and strip that text from the other items
            'list': [x.text.replace('Table Impact of drivers and challenges', '') for x in toc.select('li:-soup-contains-own("Market challenges") ul li') if x.text != 'Table Impact of drivers and challenges']
        })

    get_challenges()

# Expand each 'list' into its own columns and place them next to url/type
pd.concat([pd.DataFrame(data)[['url', 'type']], pd.DataFrame(pd.DataFrame(data).list.tolist())], axis=1)  # .to_csv(sep='|')
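Output
| url | type | 0 | 1 | 2 |
|---|---|---|---|---|
| https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/ | driver | Product innovations and new designs | Increasing demand for convenient home appliances with changes in lifestyle patterns | Growing adoption of energy-efficient appliances |
| https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/ | challenges | High cost limiting mass-market adoption | Health hazards related to induction hobs | Limitation of using only flat-bottomed and induction-specific cookware |
| https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/ | driver | Demand for automated recruitment processes | Increasing demand for unified solutions for all HR functions | Increasing workforce diversity |
| https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/ | challenges | Threat from open-source software | High implementation and maintenance costs | Threats to data security |
| https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/ | driver | Growing investment in fabs | Miniaturization of electronic products | Increasing demand for IoT devices |
| https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/ | challenges | Rapid technological changes in the semiconductor industry | Volatility in the semiconductor industry | Impact of technology chasm |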
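The table above is one combined result built from a hard-coded URL list. For the other two parts of the question (reading the URLs from a .csv, and writing separate driver and challenge files with the URL in the first column), something along these lines might work. This is only a sketch under assumptions: the helper names `read_urls` and `to_wide`, the input file name `urls.csv`, and the idea that the CSV has no header row are all hypothetical, and `records` is expected to be a list of dicts with `url`, `type` and `list` keys, as in the example.

```python
import pandas as pd


def read_urls(csv_path):
    """Read URLs from the first column of a CSV (assumed: no header row)."""
    col = pd.read_csv(csv_path, header=None, usecols=[0]).iloc[:, 0]
    return col.dropna().astype(str).str.strip().tolist()


def to_wide(records, kind):
    """Return one row per URL: the URL first, then one column per list item."""
    df = pd.DataFrame(records)
    subset = df[df["type"] == kind].reset_index(drop=True)
    items = pd.DataFrame(subset["list"].tolist())
    return pd.concat([subset[["url"]], items], axis=1)


# Usage sketch, with `data` filled by a scraping loop like the one in the example:
# urls = read_urls("urls.csv")
# to_wide(data, "driver").to_csv("detail-dr.csv", index=False)
# to_wide(data, "challenges").to_csv("detail-ch.csv", index=False)
```

Resetting the index before the column-wise `concat` keeps the URL column and the item columns aligned row by row after filtering, and `index=False` keeps the URL in the first column of each output file.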
