BeautifulSoup: how do I extract specific paragraphs of text?

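I am using BeautifulSoup to extract information from individual MP pages, for example https://publications.parliament.uk/pa/cm/cmregmem/211115/cox_geoffrey.htm

I want to extract the text under each numbered bold heading (e.g. "1. Employment and earnings") and save each section separately. The headings differ from MP to MP (for example, some declare "3. Gifts, benefits and hospitality from UK sources" and some do not), so I want a script that works for any MP's page.

At the moment I am trying to do this with loops, and it is turning into a mess. I am new to BS (and Python), so I suspect I am missing a trick. Does anyone have any ideas?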

import requests
from bs4 import BeautifulSoup

#urls
home_url = "https://publications.parliament.uk/pa/cm/cmregmem/211101/"

# extract the list of MP names and links; save as tuples in a list (mp_list)
home_page = requests.get(home_url + 'contents.htm')
home_soup = BeautifulSoup(home_page.content, "html.parser")

mp_list = []
mp_elements = home_soup.find_all("p", attrs={'class':None, 'xmlns':'http://www.w3.org/1999/xhtml'})

for mp_element in mp_elements:
    try:
        mp_name = list(mp_element.children)[1].text.strip()
        mp_url = list(mp_element.children)[1]['href']
        mp_list.append((mp_name,mp_url))
    except Exception:
        # skip paragraphs that do not contain an MP link
        pass

# extract text from an MP page
mp_url = home_url + mp_list[115][1]  # this just picks out an example MP page to test with
print(mp_url)
mp_page = requests.get(mp_url)
mp_soup = BeautifulSoup(mp_page.content, "html.parser")
mp_text_all = mp_soup.find_all("p")

mp_text_list = []
for item in mp_text_all:
    mp_text_list.append(item.text)

A uj5u.com user replied:

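You can do it like this.

The text you need is inside the <p> tags with class=indent. Select all of those <p> tags using .find_all().
If you also want the heading, you need to select the <p> that comes before the <p> tag selected above. I used .findPreviousSibling() for that here.

Here is complete code that works for any MP's page. You just need to call get_data(), passing in the MP's URL.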

import requests
from bs4 import BeautifulSoup

def get_data(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    p = soup.find_all('p', class_='indent')

    for i in p:
        heading = i.findPreviousSibling('p').find('strong')
        if heading:
            heading = heading.text.strip()
            print(heading)
        print(f'{i.text.strip()}\n')


url1 = 'https://publications.parliament.uk/pa/cm/cmregmem/211101/bridgen_andrew.htm'
url2 = 'https://publications.parliament.uk/pa/cm/cmregmem/211101/robinson_mary.htm'

print(' URL-1 '.center(50, '*'))
get_data(url1)
print(' URL-2 '.center(50, '*'))
get_data(url2)

This works for any MP's page. Here is the output for two different MP links.

********************* URL-1 **********************
1. Employment and earnings
From 6 May 2020 to 5 May 2022, Adviser to Mere Plantations Ltd of Unit 1 Cherry Tree Farm, Cherry Tree Lane, Rostherne WA14 3RZ; a company which grows teak in Ghana. I provide advice on business and international politics. I will be paid £12,000 a year for an expected monthly commitment of 8 hrs. (Registered 17 June 2020; updated 23 December 2020)

Payments from Open Dialogus Ltd, 14 London Street, Andover SP11 6UA, for writing articles:

7. (i) Shareholdings: over 15% of issued share capital
AB Produce PLC; processing and distribution of fresh vegetables.

AB Produce Trading Ltd; holding company.

Bridgen Investments Ltd; investment company, investing in shares, property, building projects.

From 6 February 2017, AB Farms Ltd; potato production and storage. (Registered 21 March 2017)

********************* URL-2 **********************
2. (a) Support linked to an MP but received by a local party organisation or indirectly via a central party organisation
Name of donor: IX Wireless LtdAddress of donor: 4 Lockside Office Park, Lockside Road, Riversway, Preston PR2 2YSAmount of donation or nature and value if donation in kind: £2,000 to my local associationDonor status: company, registration 11008144(Registered 30 July 2021)

7. (i) Shareholdings: over 15% of issued share capital
Mary Felicity Design Ltd; clothing design company. (Registered 03 June 2015)

8. Miscellaneous
From 31 January 2020, member of Cheadle Towns Fund Board. This is an unpaid role. (Registered 28 January 2020)

From 20 June 2021, unpaid director of the Northern Research Group Ltd, a shared services company for northern MPs. (Registered 04 August 2021)
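
The code above only prints its results. Since the question asks for each section to be saved separately, here is a minimal sketch of one way to group the detail paragraphs under their headings in a dictionary. It assumes the same page structure the answer relies on (a heading <p> containing <strong>, followed by one or more <p class="indent"> paragraphs). The names group_by_heading and SAMPLE_HTML are made up for illustration, and the sample markup is invented rather than copied from a real page.

```python
from collections import defaultdict

from bs4 import BeautifulSoup

# Invented markup that mimics the assumed structure of an MP page.
SAMPLE_HTML = """
<p><strong>Doe, Jane (Example)</strong></p>
<p><strong>1. Employment and earnings</strong></p>
<p class="indent">First entry.</p>
<p class="indent">Second entry.</p>
<p><strong>8. Miscellaneous</strong></p>
<p class="indent">Unpaid role.</p>
"""


def group_by_heading(html):
    """Return {heading: [detail texts]} for the HTML of one page."""
    soup = BeautifulSoup(html, "html.parser")
    sections = defaultdict(list)
    current_heading = None

    for p in soup.find_all("p"):
        classes = p.get("class") or []
        if "indent" in classes:
            # A detail paragraph: file it under the most recent heading.
            if current_heading is not None:
                sections[current_heading].append(p.get_text(strip=True))
        else:
            strong = p.find("strong")
            if strong:
                # A bold paragraph: remember it as the current heading.
                current_heading = strong.get_text(strip=True)

    # Headings with no detail paragraphs (such as the MP's name) never
    # receive an entry, so they are absent from the result.
    return dict(sections)


if __name__ == "__main__":
    for heading, details in group_by_heading(SAMPLE_HTML).items():
        print(heading)
        for detail in details:
            print("   ", detail)
```

To use it on a live page, you could pass requests.get(url).text instead of SAMPLE_HTML. Whether every MP page follows this structure is something to check against the real pages.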

A uj5u.com user replied:

The required solution so far is as follows:

import pandas as pd
import requests
from bs4 import BeautifulSoup
data = []

def get_data(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    # text of every <p> that carries the XHTML namespace attribute
    h1 = [x.get_text(strip=True) for x in soup.select('p[xmlns="http://www.w3.org/1999/xhtml"]')]
    print(h1)

url1 = 'https://publications.parliament.uk/pa/cm/cmregmem/211101/bridgen_andrew.htm'
url2 = 'https://publications.parliament.uk/pa/cm/cmregmem/211101/robinson_mary.htm'

print(' URL-1 '.center(50, '*'))
get_data(url1)
print(' URL-2 '.center(50, '*'))
get_data(url2)

cols = ["heading", "details"]

df = pd.DataFrame(data, columns= cols)
#print(df)
#df.to_csv('info.csv',index = False)

Output:

['Bridgen, Andrew (North West Leicestershire)', '1. Employment and earnings', 'From 6 May 2020 to 5 May 2022, Ady, building projects.', 'From 6 February 2017, AB Farms Ltd; potato production and storage. (Registered 21 March 2017)', '']
********************* URL-2 **********************
['Robinson, Mary (Cheadle)', '2. (a) Support linked to an MP but received by a local party organisation or indirectly via a central party organisation', 'Name of donor: IX Wireless LtdAddress of donor: 4 Lockside Office Park, Lockside Road, Riversway, Preston PR2 2YSAmount of donation or nature and value if donation in kind: £2,000 to my local associationDonor status: company, registration 11008144(Registered 30 July 2021)', '7. (i) Shareholdings: over 15% of issued share capital', 'Mary Felicity Design Ltd; clothing design company. (Registered 03 June 2015)', '8. Miscellaneous', 'From 31 January 2020, member of Cheadle Towns Fund Board. This is an unpaid role. (Registered 28 January 2020)', 'From 20 June 2021, unpaid director of the Northern Research Group Ltd, a shared services company for northern MPs. (Registered 04 August 2021)', '']
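
Note that the code above prints a flat list and never appends anything to data, so the DataFrame built from it would have no rows. Here is a rough sketch of one way to turn such a flat list into (heading, details) pairs. It assumes that a heading is any item starting with a number followed by a dot and a space, which matches the headings shown in the output but is only a guess about the pages in general. The names to_rows and HEADING_RE are hypothetical.

```python
import re

# Assumption: headings look like "1. ...", "7. (i) ...", "8. ...".
HEADING_RE = re.compile(r"^\d+\.\s")


def to_rows(paragraphs):
    """Pair each detail paragraph with the heading that precedes it."""
    rows = []
    heading = None
    for text in paragraphs:
        if not text:
            continue  # skip empty strings such as the trailing ''
        if HEADING_RE.match(text):
            heading = text
        elif heading is not None:
            rows.append((heading, text))
        # Items before the first heading (the MP's name) are ignored.
    return rows


if __name__ == "__main__":
    sample = [
        'Robinson, Mary (Cheadle)',
        '7. (i) Shareholdings: over 15% of issued share capital',
        'Mary Felicity Design Ltd; clothing design company. (Registered 03 June 2015)',
        '8. Miscellaneous',
        'From 31 January 2020, member of Cheadle Towns Fund Board. This is an unpaid role. (Registered 28 January 2020)',
        '',
    ]
    for heading, details in to_rows(sample):
        print(heading, '|', details)
```

If get_data() returned h1 instead of printing it, something like data.extend(to_rows(h1)) would give pd.DataFrame(data, columns=cols) one row per detail paragraph.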
