Scraping web data into a CSV file in Python, and extracting the article links

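1 - When I check the csv file, I only find the data from the last link (Tugende). But when I print the data, I get everything I want. How can I get all of the data into the csv file?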

2 - For the "source" variable, how can I extract only the article link from it and add that to the csv file?

import requests
from bs4 import BeautifulSoup as bs
import csv

url = "https://digestafrica.com/companies/{}"
startups = ['OBM-Education','Crafty-Workshop','Planet42','Paylend','Tugende']
for startup in startups:
    u = url.format(startup)
    html_text = requests.get(u).text
    soup = bs(html_text, 'lxml')
    
    list1 = soup.find_all('div', class_='d-flex flex-wrap content mt-24 border p-2 border-dark')
    source1 =soup.find_all('div',class_='col-md-2 mt-3 mt-lg-0')
    file = open('funding.csv', 'w',newline='')
    writer = csv.writer(file)
    mama = (['Name', 'Type', 'date','amount','investors'])
    writer.writerow(mama)



    for L in list1:      
        name1 = L.find('span', class_="line-height-1").text
        amount1 = L.find('div', class_='p-0').text.replace('Amount','').strip()
        date1 = L.find('span', class_="pt-0").text
        funding_type1 = L.find('div', class_="col-md-2 mt-2 mt-lg-0").text.replace('Funding Round','')
        investor1 = L.find('div',class_='col-md-3 mt-3 mt-lg-0').text.replace('investors','')
        source =L.find('div',class_="col-md-2 mt-3 mt-lg-0")
        
        print(name1, funding_type1, date1,amount1, investor1)

        writer.writerow([name1, funding_type1, date1,amount1, investor1])
    file.close()

A uj5u.com user replied:

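1: You should use a context manager to handle the csv file when you write to it. I have fixed your code below. First I write the header row in 'w' mode (so the file is created, or emptied, each time the script runs), then I append the data in 'a' mode as each page is scraped.

2: You need to find the 'a' tag that holds the source link and then read its href attribute, like this: find('a')['href']. See below.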

import requests
from bs4 import BeautifulSoup as bs
import csv

#write header
with open('funding.csv','w',newline='') as file:
    writer = csv.writer(file)
    mama = (['Name', 'Type', 'date','amount','investors','source'])
    writer.writerow(mama)

url = "https://digestafrica.com/companies/{}"
startups = ['OBM-Education','Crafty-Workshop','Planet42','Paylend','Tugende']

for startup in startups:

    html_text = requests.get(url.format(startup))
    soup = bs(html_text.text,'lxml')

    for list1 in soup.find_all('div', class_='d-flex flex-wrap content mt-24 border p-2 border-dark'):
        name1 = list1.find('span', class_="line-height-1").text
        amount1 = list1.find('div', class_='p-0').text.replace('Amount','').strip()
        date1 = list1.find('span', class_="pt-0").text
        funding_type1 = list1.find('div', class_="col-md-2 mt-2 mt-lg-0").text.replace('Funding Round','')
        investor1 = list1.find('div',class_='col-md-3 mt-3 mt-lg-0').text.replace('investors','')
        source = list1.find('div',class_="col-md-2 mt-3 mt-lg-0").find('a')['href']

        print(name1, funding_type1, date1,amount1, investor1, source)

        with open('funding.csv','a',newline='') as file:
            writer = csv.writer(file)
            writer.writerow([name1, funding_type1, date1,amount1, investor1, source])
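
Note that find('a')['href'] raises an error if a funding row has no link in its source cell. A minimal sketch of a more defensive lookup, run on a made-up HTML snippet instead of the live page. The markup and the helper name extract_source are assumptions for illustration; only the class string comes from the code above:

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating two funding rows; the real page may differ.
SAMPLE_HTML = """
<div class="row-with-link">
  <div class="col-md-2 mt-3 mt-lg-0"><a href="https://example.com/article">Source</a></div>
</div>
<div class="row-without-link">
  <div class="col-md-2 mt-3 mt-lg-0">Source</div>
</div>
"""

def extract_source(row):
    """Return the href of the first <a> in the source cell, or '' if absent."""
    cell = row.find('div', class_='col-md-2 mt-3 mt-lg-0')
    link = cell.find('a') if cell is not None else None
    if link is None:
        return ''
    # Tag.get() returns None instead of raising when the attribute is missing.
    return link.get('href') or ''

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
# recursive=False limits the search to the two top-level row divs.
sources = [extract_source(row) for row in soup.find_all('div', recursive=False)]
print(sources)
```

Returning an empty string keeps the csv columns aligned for rows that have no article link.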

A uj5u.com user replied:

The reason you only get the data for the last startup is the way you open the output file:

    file = open('funding.csv', 'w',newline='')

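This opens the file for writing, as requested, but 'w' mode also empties the file and puts the write position at the very beginning. That is fine on the first pass through the loop, but on every later pass it wipes out the rows written before.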

If you really want to open the file inside the loop, you need to use 'a' (for "open for writing, but in append mode if it already exists").
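
A small sketch of the difference between the two modes, using placeholder rows and temporary files instead of the real scrape (the helper names write_in_loop and read_rows are made up for this illustration):

```python
import csv
import os
import tempfile

def write_in_loop(path, mode, rows):
    """Reopen the file with the given mode on every iteration, as in the question."""
    for row in rows:
        with open(path, mode, newline='') as f:
            csv.writer(f).writerow(row)

def read_rows(path):
    """Read the whole csv file back as a list of rows."""
    with open(path, newline='') as f:
        return list(csv.reader(f))

rows = [['OBM-Education'], ['Planet42'], ['Tugende']]  # placeholder data

with tempfile.TemporaryDirectory() as tmp:
    w_path = os.path.join(tmp, 'w_mode.csv')
    a_path = os.path.join(tmp, 'a_mode.csv')
    write_in_loop(w_path, 'w', rows)  # each pass truncates the file
    write_in_loop(a_path, 'a', rows)  # each pass appends to the file
    w_result = read_rows(w_path)
    a_result = read_rows(a_path)

print(w_result)  # only the last row survives
print(a_result)  # all three rows are present
```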

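However, opening the file inside the loop is inefficient. I suggest opening the file for writing before the for loop starts, and creating the writer object there as well: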

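# Keep a reference to the file object so it can be closed later.
file = open('funding.csv', 'w', newline='')
writer = csv.writer(file)
for startup in startups:
    ...
    # [do loop operations]
    ...
file.close()

Then call close() on the file object once the loop has finished. The csv writer itself has no close() method.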
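
The same idea can be sketched with a context manager, which closes the file even if something fails partway through the loop. Here fetch_rows is a hypothetical stand-in for the scraping step and returns made-up values, and the output goes to a temporary directory so the sketch runs offline:

```python
import csv
import os
import tempfile

HEADER = ['Name', 'Type', 'date', 'amount', 'investors']

def fetch_rows(startup):
    """Hypothetical stand-in for the scraping step; returns made-up rows."""
    return [[startup, 'Seed', '2021', 'Undisclosed', 'Example Investor']]

def write_funding_csv(path, startups):
    # Open the file once; the with-block closes it when the loop is done.
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        for startup in startups:
            writer.writerows(fetch_rows(startup))

startups = ['OBM-Education', 'Crafty-Workshop', 'Planet42', 'Paylend', 'Tugende']
out_path = os.path.join(tempfile.mkdtemp(), 'funding.csv')
write_funding_csv(out_path, startups)

# Read the file back to confirm every startup was written.
with open(out_path, newline='', encoding='utf-8') as f:
    written = list(csv.reader(f))
print(len(written))
```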

A uj5u.com user replied:

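Printing element.find() and saving the element do not give the same result.
element.find() actually returns a bs4.element.Tag, not a str.
You do not notice this in your case because Python applies str(element.find()) when it prints something.
You need to convert the value explicitly, otherwise you may get unwanted results.
Example: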

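from bs4 import BeautifulSoup

# Name the parser explicitly to avoid the "no parser specified" warning.
element = BeautifulSoup('<div></div>', 'html.parser')
print(type(element.find()))       # bs4.element.Tag
print(type(str(element.find())))  # str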
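
To show how this can affect a csv file, here is a short sketch on a made-up one-tag snippet, written to an in-memory buffer. csv.writer converts non-string fields with str(), so passing a Tag writes its markup, while get_text() writes only the text:

```python
import csv
import io
from bs4 import BeautifulSoup

# Made-up snippet for illustration; not taken from the real page.
soup = BeautifulSoup('<span>Tugende</span>', 'html.parser')
tag = soup.find('span')  # a bs4.element.Tag, not a str

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow([tag])                       # str(tag) is written: the full markup
writer.writerow([tag.get_text(strip=True)])  # only the text content is written

output_lines = buf.getvalue().splitlines()
print(output_lines)
```

The code in the question already reads .text from each element, so it writes plain strings. The difference only matters if a Tag itself is passed to writerow.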

Please credit the source when reposting. Link to this article: https://www.uj5u.com/caozuo/407896.html
