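Thank you for your help with this problem. I am trying to scrape forum posts, including the emojis. Getting the text works fine, but the emojis are not included. I would like to scrape them together with the text using the function shown below. Thanks for your help!
For the link below, the emoji images have class = 'smilies'.
Here is my code: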
### import
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
### first, create an empty dataframe where the final results will be stored
df = pd.DataFrame()
### second, create a function to get all the user comments
def get_comments(lst_name):
    # find all user comments and save them to a list
    comment = bs.find_all(class_="content")
    # iterate over the list comment to get the text and strip the strings
    for c in comment:
        lst_name.append(c.get_text(strip=True))
    # return the list
    return lst_name
### third, start the scraping
link = 'https://vegan-forum.de/viewtopic.php?f=54&t=8325&start=120'
# create the lists for the functions
user_comments = []
# get the content
page = requests.get(link)
html = page.content
bs = BeautifulSoup(html, 'html.parser')
# call the functions to get the information
get_comments(user_comments)
# create a pandas dataframe for the comments
comments_dict = {
    'user_comments': user_comments
}
df_comments_info = pd.DataFrame(data=comments_dict)
# add the temporary dataframe to the dataframe we created earlier
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
df = pd.concat([df, df_comments_info])
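A uj5u.com user replied:
One approach is to replace every smiley <img> tag with its alt text before extracting the comment text. For example: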
### import
import requests
import pandas as pd
from bs4 import BeautifulSoup
### first, create an empty dataframe where the final results will be stored
df = pd.DataFrame()
### second, create a function to get all the user comments
def get_comments(lst_name):
    # replace all <img> smilies with their alt text:
    for img in bs.select("img.smilies"):
        img.replace_with(img["alt"])
    # merge the adjacent text nodes created by the replacement
    bs.smooth()
    # find all user comments and save them to a list
    comment = bs.find_all(class_="content")
    # iterate over the list comment to get the text and strip the strings
    for c in comment:
        lst_name.append(c.get_text(strip=True))
    # return the list
    return lst_name
### third, start the scraping
link = "https://vegan-forum.de/viewtopic.php?f=54&t=8325&start=120"
# create the lists for the functions
user_comments = []
# get the content
page = requests.get(link)
html = page.content
bs = BeautifulSoup(html, "html.parser")
# call the functions to get the information
get_comments(user_comments)
# create a pandas dataframe for the comments
comments_dict = {"user_comments": user_comments}
df_comments_info = pd.DataFrame(data=comments_dict)
# add the temporary dataframe to the dataframe we created earlier
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
df = pd.concat([df, df_comments_info])
print(df)
Prints:
...
Danke!Erst Mal sollte ich bei den Tabletten bleiben. Hab die ja schon Mal genommen. Genau die gleichen wie ihr mir empfiehlt. Aber die sind fast leer und auf Amazon gibt's die nicht mehr.Soll ich bei der Quelle nachfragen oder denkst du findest es günstiger? :)
...
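The replacement step above can also be tried offline, without requesting the forum page. The following is only a minimal sketch: the HTML snippet, the file name `smile.gif`, and the helper name `comments_with_smilies` are made up for illustration and are not taken from the forum. It passes the HTML in as an argument instead of relying on the global `bs`, and it falls back to an empty string when an image has no `alt` attribute, which is an assumption about how missing values should be handled.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup that imitates two forum posts;
# the real page structure may differ.
SAMPLE_HTML = """
<div class="content">Danke! <img class="smilies" src="smile.gif" alt=":)"> Bis bald</div>
<div class="content">No smiley here</div>
"""


def comments_with_smilies(html):
    """Return the text of every .content element, with smilies as alt text."""
    soup = BeautifulSoup(html, "html.parser")
    # Swap each smiley image for its alt text (empty string if alt is missing).
    for img in soup.select("img.smilies"):
        img.replace_with(img.get("alt", ""))
    # Merge the neighbouring text nodes so that get_text(strip=True)
    # does not strip the spaces around the inserted smiley.
    soup.smooth()
    return [c.get_text(strip=True) for c in soup.find_all(class_="content")]


if __name__ == "__main__":
    for comment in comments_with_smilies(SAMPLE_HTML):
        print(comment)
```

The call to `smooth()` matters here: without it, the text before the smiley, the smiley itself, and the text after it remain three separate strings, and `get_text(strip=True)` strips each one before joining them, so the surrounding spaces are lost.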
