Different values between print and pd.DataFrame

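I'm trying to scrape several news articles by reading their URLs from a DataFrame and converting that column to a list. When I put the results into a DataFrame, though, I only get the last scraped value, even though print shows a different result for each URL. My sample df looks like this: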

df = {'data': ['https://www.liputan6.com/bisnis/read/4661489/erick-thohir-apresiasi-transformasi-digital-pos-indonesia',
               'https://ekonomi.bisnis.com/read/20211010/98/1452514/hari-pos-sedunia-pos-indonesia-kasih-diskon-70-persen-paket-kilat']}

This is my code:

import pandas as pd
import newspaper
from newspaper import Article
df = pd.read_excel(' 1.xlsx')
urls = df['data'].to_list()


for url in urls:
    try:
        a = Article(url, language='id')
        a.download()
        a.parse()

        author = a.authors
        dates = a.publish_date
        add_data = a.additional_data
        text = a.text
        tag = a.tags
        title = a.title
        keywords = a.keywords

        new_df = pd.DataFrame({'author':[author]})  # wrapped in [] because an article can have multiple authors
        print(author,dates,add_data,text,tag,title,keywords)

    except Exception as e:
        print(e)

When I run print(author), it shows:

['S. Dian Andryanto', 'Reporter', 'Editor']
['Ali Akhmad Noor Hidayat', 'Reporter', 'Editor']

But when I put it into a DataFrame, only the last value comes back:

new_data = {"author":['Ali Akhmad Noor Hidayat', 'Reporter', 'Editor']}

Can anyone explain how to get all of my authors into the DataFrame?

A uj5u.com user replied:

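You are iterating over the urls list and, on every iteration, storing a brand-new DataFrame in new_df, which overwrites the one from the previous iteration. To avoid this, create a dictionary outside the loop, append to it inside the loop, and build the whole DataFrame once the loop has finished, as in the code below: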

import pandas as pd
import newspaper
from newspaper import Article
df = pd.read_excel(' 1.xlsx')
urls = df['data'].to_list()

# Accumulator that lives outside the loop, so it survives every iteration
all_authors = {"author": []}
for url in urls:
    try:
        a = Article(url, language='id')
        a.download()
        a.parse()

        author = a.authors
        dates = a.publish_date
        add_data = a.additional_data
        text = a.text
        tag = a.tags
        title = a.title
        keywords = a.keywords

        # a.authors is already a list, so each article becomes one row holding that list
        all_authors['author'].append(author)

    except Exception as e:
        print(e)

# Build the DataFrame once, after all URLs have been processed
new_df = pd.DataFrame(data=all_authors)
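
To see the difference without touching the network, here is a minimal sketch that swaps the Article calls for a hypothetical lookup table. FAKE_AUTHORS and the example.com URLs are assumptions made up for illustration; only the author lists come from the printed output in the question.

```python
import pandas as pd

# Hypothetical stand-in for Article(url).authors. The URLs are placeholders;
# the author lists mirror the printed output shown in the question.
FAKE_AUTHORS = {
    "https://example.com/a": ["S. Dian Andryanto", "Reporter", "Editor"],
    "https://example.com/b": ["Ali Akhmad Noor Hidayat", "Reporter", "Editor"],
}
urls = list(FAKE_AUTHORS)

# Pattern from the question: the name is rebound on every pass,
# so only the DataFrame from the final iteration is left afterwards.
for url in urls:
    overwritten_df = pd.DataFrame({"author": [FAKE_AUTHORS[url]]})

# Pattern from this answer: accumulate first, build the DataFrame once.
all_authors = {"author": []}
for url in urls:
    all_authors["author"].append(FAKE_AUTHORS[url])
accumulated_df = pd.DataFrame(data=all_authors)

print(len(overwritten_df))  # number of rows left by the overwrite pattern
print(len(accumulated_df))  # number of rows left by the accumulate pattern
```

The print inside the original loop showed every author list because it ran once per URL, while new_df was only inspected after the loop had already replaced it for the last time.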

A uj5u.com user replied:

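Collect each new_df in a list and concatenate them at the end.

I modified your code slightly, because catching every exception is a bad idea; use newspaper.ArticleException instead.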

import pandas as pd
import newspaper
from newspaper import Article

urls = ['https://www.liputan6.com/bisnis/read/4661489/erick-thohir-apresiasi-transformasi-digital-pos-indonesia',
        'https://ekonomi.bisnis.com/read/20211010/98/1452514/hari-pos-sedunia-pos-indonesia-kasih-diskon-70-persen-paket-kilat']

data = []
for url in urls:
    try:
        a = Article(url, language='id')
        a.download()
        a.parse()

    except newspaper.ArticleException as e:
        print(e)

    else:    
        author = a.authors
        dates = a.publish_date
        add_data = a.additional_data
        text = a.text
        tag = a.tags
        title = a.title
        keywords = a.keywords

        new_df = pd.DataFrame({'author':[author]})
        data.append(new_df)        

df = pd.concat(data)
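
One detail worth knowing about this approach: every single-row new_df carries index 0, so a plain pd.concat keeps those duplicate labels. The sketch below uses hand-written stand-in frames (an assumption, shaped like the new_df objects above) to show the effect of passing ignore_index=True, and guards against the case where every download failed and the list is empty, since pd.concat raises ValueError on an empty list.

```python
import pandas as pd

# Stand-in frames shaped like the new_df objects built in the loop;
# the author lists are taken from the output printed in the question.
frames = [
    pd.DataFrame({"author": [["S. Dian Andryanto", "Reporter", "Editor"]]}),
    pd.DataFrame({"author": [["Ali Akhmad Noor Hidayat", "Reporter", "Editor"]]}),
]

# Plain concat keeps each frame's own index, so both rows are labelled 0.
plain = pd.concat(frames)

# ignore_index=True renumbers the rows 0..n-1.
renumbered = pd.concat(frames, ignore_index=True)

# If no article could be fetched, fall back to an empty frame
# instead of letting pd.concat raise on an empty list.
no_frames = []
fallback = (
    pd.concat(no_frames, ignore_index=True)
    if no_frames
    else pd.DataFrame(columns=["author"])
)

print(list(plain.index))
print(list(renumbered.index))
```

Whether the renumbered index matters depends on how the result is used later; it mainly helps when rows are selected by label with .loc.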


Tags: Python, pandas, dataframe, web scraping
