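I want to use Beautiful Soup to extract the author affiliation text from this page.
I know of a workaround with Selenium: click the "Show more" link and scan the page again. I'm not sure what kind of elements these are. Are they hidden? They only appear in the inspector after the button is clicked.
Is there a way to extract this information with Beautiful Soup alone, or do I need Selenium or something equivalent to make the elements appear in the HTML?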
from bs4 import BeautifulSoup
import requests

url = 'https://www.sciencedirect.com/science/article/abs/pii/S0920379621007596'
# Fetch the page first, then hand the response body to Beautiful Soup.
r = requests.get(url)
sp = BeautifulSoup(r.content, 'html.parser')
author_data = sp.find('div', id='author-group')
affiliations = author_data.find('dl', class_='affiliation').text
print(affiliations)
Answer from a uj5u.com user:
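The information is inside a script tag, but you need to map the affiliation letters to the actual affiliations. The code below extracts the JavaScript object that contains the information you want and parses it with the json library.
A series of steps then determines dynamically which entries hold the information of interest, and a letter-to-affiliation mapping is built and used to assign each author the correct affiliation.
The authors' given names and surnames are also determined dynamically and joined with a space.
The aim is to avoid hard-coding indices that may change over time.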
import re
import json
import requests

r = requests.get('https://www.sciencedirect.com/science/article/abs/pii/S0920379621007596',
                 headers={'User-Agent': 'Mozilla/5.0'})
# The page embeds its data as a JavaScript object in a script tag; parse it as JSON.
data = json.loads(re.search(r'(\{"abstracts".*})', r.text).group(1))

# Locate the author-group node by name rather than by a hard-coded index.
base = [i for i in data['authors']['content']
        if i.get('#name') == 'author-group'][0]['$$']
affiliation_data = [i for i in base if i['#name'] == 'affiliation']
author_data = [i for i in base if i['#name'] == 'author']

# Flat list of alternating given names and surnames.
name_info = [i['_'] for author in author_data for i in author['$$']
             if i['#name'] in ['given-name', 'surname']]

# Map each affiliation letter (label) to the affiliation text.
affiliations = dict(zip(
    [j['_'] for i in affiliation_data for j in i['$$'] if j['#name'] == 'label'],
    [j['_'] for i in affiliation_data for j in i['$$']
     if isinstance(j, dict) and '_' in j and j['_'][0].isupper()]))
# print(affiliations)

# Pair each "given-name surname" with the affiliation its cross-ref letter points to,
# skipping cross-refs whose text is the '?' marker instead of an affiliation letter.
author_affiliations = dict(zip(
    [' '.join([i[0], i[1]]) for i in zip(name_info[0::2], name_info[1::2])],
    [affiliations[j['_']] for author in author_data for i in author['$$']
     if i['#name'] == 'cross-ref' for j in i['$$'] if j['_'] != '?']))
print(author_affiliations)
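For anyone who wants to try the letter-to-affiliation mapping without hitting the site, here is a minimal offline sketch of the same idea. Everything in it is an assumption made for illustration: the `payload` is a made-up, heavily simplified stand-in for the embedded object (the real one is much larger and may be laid out differently), the author and institution names are invented, and the helper names `extract_payload`, `children`, `text_of`, `affiliation_text` and `map_affiliations` are hypothetical. It only reuses the keys that appear in the code above (`#name`, `$$`, `_`).

```python
import json
import re

# Hypothetical, simplified stand-in for the object embedded in the page.
payload = {
    "abstracts": {},
    "authors": {"content": [{
        "#name": "author-group",
        "$$": [
            {"#name": "author", "$$": [
                {"#name": "given-name", "_": "Ada"},
                {"#name": "surname", "_": "Lovelace"},
                {"#name": "cross-ref", "$$": [{"#name": "sup", "_": "a"}]},
            ]},
            {"#name": "author", "$$": [
                {"#name": "given-name", "_": "Alan"},
                {"#name": "surname", "_": "Turing"},
                {"#name": "cross-ref", "$$": [{"#name": "sup", "_": "a"}]},
                {"#name": "cross-ref", "$$": [{"#name": "sup", "_": "b"}]},
                # A marker that is not an affiliation letter.
                {"#name": "cross-ref", "$$": [{"#name": "sup", "_": "*"}]},
            ]},
            {"#name": "affiliation", "$$": [
                {"#name": "label", "_": "a"},
                {"#name": "textfn", "_": "Example University"},
            ]},
            {"#name": "affiliation", "$$": [
                {"#name": "label", "_": "b"},
                {"#name": "textfn", "_": "Sample Institute"},
            ]},
        ],
    }]},
}
SAMPLE_HTML = ('<html><body><script type="application/json">'
               + json.dumps(payload) + '</script></body></html>')


def extract_payload(html):
    """Pull the embedded object out of the page source and parse it as JSON."""
    match = re.search(r'\{"abstracts".*\}', html, re.S)
    if match is None:
        raise ValueError("embedded JSON object not found")
    return json.loads(match.group(0))


def children(node, name):
    """Child nodes of `node` whose '#name' equals `name`."""
    return [c for c in node.get("$$", [])
            if isinstance(c, dict) and c.get("#name") == name]


def text_of(node, name):
    """Text of the first child called `name`, or '' if there is none."""
    found = children(node, name)
    return found[0].get("_", "") if found else ""


def affiliation_text(aff):
    """Text of the first child of an affiliation node that is not its label."""
    for child in aff.get("$$", []):
        if isinstance(child, dict) and child.get("#name") != "label" and "_" in child:
            return child["_"]
    return ""


def map_affiliations(data):
    """Return {author full name: [affiliation text, ...]}."""
    group = next(i for i in data["authors"]["content"]
                 if i.get("#name") == "author-group")
    by_label = {text_of(a, "label"): affiliation_text(a)
                for a in children(group, "affiliation")}
    result = {}
    for author in children(group, "author"):
        name = " ".join(filter(None, [text_of(author, "given-name"),
                                      text_of(author, "surname")]))
        labels = [leaf["_"] for ref in children(author, "cross-ref")
                  for leaf in ref.get("$$", [])
                  if isinstance(leaf, dict) and "_" in leaf]
        # Keep only cross-refs that are known affiliation letters.
        result[name] = [by_label[l] for l in labels if l in by_label]
    return result


result = map_affiliations(extract_payload(SAMPLE_HTML))
print(result)
```

Collecting a list per author, rather than zipping two flat lists, is a design choice in this sketch: it keeps names and affiliations aligned when an author carries more than one letter, and cross-refs that are not affiliation letters are dropped because they are missing from the label map. Whether the live page needs this is not something the sketch verifies.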
