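I've run into a problem preparing data for a database.
This is my first time doing this: I scrape the text from the HTML dt and dd tags, so I end up with a lot of information, only some of which I actually need.
My output looks like this: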
{'Plotas:': '49,16 m2', 'Kambarių sk.:': '2', 'Aukštas:': '2', 'Aukštų sk.:': '7', 'Metai:': '2022', 'Pastato tipas:': 'Mūrinis', 'Šildymas:': 'Centrinis kolektorinis', 'Įrengimas:': 'Dalinė apdaila NAUDINGA:\nInterjero dizaineriai', 'Pastato energijos suvartojimo klasė:': 'A ', 'Reklama/pasiūlymas:': 'Pasirinkite geriausią internetą namams', 'Ypatybės:': 'Nauja kanalizacija\nNauja elektros instaliacija', 'Papildomos patalpos:': 'Sandėliukas\nVieta automobiliui', 'Apsauga:': 'Šarvuotos durys\nKodinė laiptinės spyna\nVaizdo kameros'}
My code looks like this:
import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup
import re
import time
import csv

# Raw string so the backslashes in the Windows path are taken literally
PATH = r'C:\Program Files (x86)\chromedriver.exe'
driver = webdriver.Chrome(PATH)

for puslapis in range(2, 3):
    driver.get(f'https://www.aruodas.lt/butai/vilniuje/puslapis/{puslapis}')
    response = driver.page_source
    soup = BeautifulSoup(response, 'html.parser')
    blocks = soup.find_all('tr', class_= 'list-row')
    stored_urls = []
    for url in blocks:
        try:
            stored_urls.append(url.a['href'])
        except:
            pass
    for link in stored_urls:
        driver.get(link)
        response = driver.page_source
        soup = BeautifulSoup(response, 'html.parser')
        try:
            # TODO: tidy up the address with a regex
            adress = soup.find('h1','obj-header-text').text.strip()
            # print(adress)
        except:
            adress = 'n/a'

        def get_dl(soup):
            keys, values = [], []
            for dl in soup.findAll("dl", {"class": "obj-details"}):
                for dt in dl.findAll("dt"):
                    keys.append(dt.text.strip())
                for dd in dl.findAll("dd"):
                    values.append(dd.text.strip())
            return dict(zip(keys, values))

        dl_dict = get_dl(soup)
Question: how can I filter and prepare just the data I need? For example, my desired output would look like this:
Plotas: 49,16 m2
Kambariu_sk: 2
Metai: 2022
And how should I put that information into a database so that it is easy to transfer?
Answer from a uj5u.com user:
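I suggest changing the loop so that it walks the dt and dd entries together, in document order, and then stores only the keys that appear in a list of required labels.
Try the following approach: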
from selenium import webdriver
from bs4 import BeautifulSoup

# Labels to keep; every other dt/dd pair on the page is ignored
WANTED = ['Plotas:', 'Kambarių sk.:', 'Metai:']


def get_dl(soup):
    d = {}
    key = None  # label of the most recent <dt>; None until one is seen
    for dl in soup.find_all("dl", {"class": "obj-details"}):
        # Visit <dt> and <dd> in document order so each value follows its label
        for el in dl.find_all(["dt", "dd"]):
            if el.name == 'dt':
                key = el.get_text(strip=True)
            elif key in WANTED:
                d[key] = el.get_text(strip=True)
    return d


# Raw string so the backslashes in the Windows path are taken literally
PATH = r'C:\Program Files (x86)\chromedriver.exe'
# Selenium 3 style call; Selenium 4 expects webdriver.Chrome(service=Service(PATH))
driver = webdriver.Chrome(PATH)
data = []

for puslapis in range(2, 3):
    driver.get(f'https://www.aruodas.lt/butai/vilniuje/puslapis/{puslapis}')
    response = driver.page_source
    soup = BeautifulSoup(response, 'html.parser')
    blocks = soup.find_all('tr', class_='list-row')
    stored_urls = []
    for url in blocks:
        try:
            stored_urls.append(url.a['href'])
        except (TypeError, KeyError):
            # Row without a link (no <a>, or <a> without href): skip it
            pass
    for link in stored_urls:
        driver.get(link)
        response = driver.page_source
        soup = BeautifulSoup(response, 'html.parser')
        h1 = soup.find('h1', 'obj-header-text')
        if h1:
            address = h1.get_text(strip=True)
        else:
            address = 'n/a'
        data.append({'Address': address, **get_dl(soup)})

for entry in data:
    print(entry)
This gives you data starting with:
{'Address': 'Vilnius, Markučiai, Pakraščio g., 2 kambarių butas', 'Plotas:': '44,9 m2', 'Kambarių sk.:': '2', 'Metai:': '2023'}
{'Address': 'Vilnius, Pašilaičiai, Budiniškių g., 2 kambarių butas', 'Plotas:': '49,16 m2', 'Kambarių sk.:': '2', 'Metai:': '2022'}
{'Address': 'Vilnius, Senamiestis, Liejyklos g., 4 kambarių butas', 'Plotas:': '55 m2', 'Kambarių sk.:': '4', 'Metai:': '1940'}
{'Address': 'Vilnius, Žirmūnai, Kareivių g., 2 kambarių butas', 'Plotas:': '24,3 m2', 'Kambarių sk.:': '2', 'Metai:': '2020'}
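The desired output in the question uses keys such as Kambariu_sk rather than the raw page labels, so a small clean-up pass over each entry may help before storing it. The following is only a sketch: the helper names clean_key, parse_area and clean_entry, and the Plotas_m2 column name, are made up for illustration, and it assumes the room count and year are plain digits.

```python
import re
import unicodedata


def clean_key(label):
    """Turn a page label such as 'Kambarių sk.:' into 'Kambariu_sk'."""
    # Decompose accented letters, then drop the combining marks
    decomposed = unicodedata.normalize('NFKD', label)
    plain = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Collapse anything that is not a letter or digit into one underscore
    return re.sub(r'[^0-9A-Za-z]+', '_', plain).strip('_')


def parse_area(text):
    """Extract the number from a value such as '49,16 m2' as a float."""
    match = re.search(r'\d+(?:[.,]\d+)?', text)
    if match is None:
        return None
    # The site uses a decimal comma; float() needs a decimal point
    return float(match.group().replace(',', '.'))


def clean_entry(entry):
    """Return a copy of one scraped entry with tidy keys and typed values."""
    cleaned = {}
    for key, value in entry.items():
        name = clean_key(key)
        if name == 'Plotas':
            cleaned['Plotas_m2'] = parse_area(value)
        elif name in ('Kambariu_sk', 'Metai'):
            # Assumption: these fields are plain digits; anything else becomes None
            cleaned[name] = int(value) if value.isdigit() else None
        else:
            cleaned[name] = value
    return cleaned
```

Something like cleaned = [clean_entry(e) for e in data] would then give rows with numeric area, room count and year, which tend to be easier to sort and filter later.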
You could write it to output.csv like this:
import csv

# Column order comes from the first entry; every later entry must use the same keys
with open('output.csv', 'w', encoding='utf-8', newline='') as f_output:
    csv_output = csv.DictWriter(f_output, fieldnames=data[0].keys())
    csv_output.writeheader()
    csv_output.writerows(data)
This gives you an output.csv starting with:
Address,Plotas:,Kambarių sk.:,Metai:
"Vilnius, Markučiai, Pakraščio g., 2 kambarių butas","44,9 m2",2,2023
"Vilnius, Pašilaičiai, Budiniškių g., 2 kambarių butas","49,16 m2",2,2022
"Vilnius, Senamiestis, Liejyklos g., 4 kambarių butas",55 m2,4,1940
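The question also asks about getting the data into a database. CSV is one portable option; another might be SQLite, which ships with Python as the sqlite3 module and keeps everything in a single file. The sketch below is an assumption about how that could look rather than part of the original answer: the table name listings, the column names and the helper save_listings are all hypothetical, and the labels in COLUMNS need to match exactly what the site returns.

```python
import sqlite3

# Hypothetical mapping: database column -> key in each scraped dict
COLUMNS = {
    'address': 'Address',
    'plotas': 'Plotas:',
    'kambariu_sk': 'Kambarių sk.:',
    'metai': 'Metai:',
}


def save_listings(conn, rows):
    """Insert scraped dicts into a 'listings' table, creating it if needed."""
    # "with conn" commits on success and rolls back if an error is raised
    with conn:
        conn.execute(
            'CREATE TABLE IF NOT EXISTS listings ('
            'address TEXT, plotas TEXT, kambariu_sk TEXT, metai TEXT)'
        )
        conn.executemany(
            'INSERT INTO listings (address, plotas, kambariu_sk, metai) '
            'VALUES (?, ?, ?, ?)',
            # dict.get returns None for a missing label, stored as NULL
            [tuple(row.get(key) for key in COLUMNS.values()) for row in rows],
        )


# Example usage with an in-memory database; pass a file name to keep the data
conn = sqlite3.connect(':memory:')
save_listings(conn, [
    {'Address': 'Vilnius, Senamiestis, Liejyklos g.', 'Plotas:': '55 m2',
     'Kambarių sk.:': '4', 'Metai:': '1940'},
])
for record in conn.execute('SELECT * FROM listings'):
    print(record)
conn.close()
```

The values are stored as text here, exactly as scraped. The ? placeholders let sqlite3 handle quoting, so commas or quotes in an address do not need special treatment.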
