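I am scraping real-estate data with Python and want to save it to a CSV file.
When I write the data to CSV, any field that is missing from the first scraped listing is dropped for every row, even though other listings do have that value. The column is not created at all, not even with empty values.
My scraping code: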
from selenium import webdriver
from bs4 import BeautifulSoup
import re
import csv
import time

# Raw string so the backslashes in the Windows path are not read as escapes
PATH = r'C:\Program Files (x86)\chromedriver.exe'
# Selenium 3 style call; newer Selenium 4 releases take the driver path via a Service object
driver = webdriver.Chrome(PATH)

data = []

# Detail labels (the <dt> texts) that should become CSV columns
WANTED_KEYS = [
    'Plotas:', 'Buto numeris:', 'Metai:', 'Namo numeris:', 'Kambarių sk.:',
    'Aukštas:', 'Aukštų sk.:', 'Pastato tipas:', 'Šildymas:', 'Įrengimas:',
    'Pastato energijos suvartojimo klasė:', 'Ypatybės:', 'Papildomos patalpos:',
    'Papildoma įranga:', 'Apsauga:',
]

def get_dl(soup):
    """Collect the wanted <dt>/<dd> pairs of a listing page into a dict."""
    d_list = {}
    for dl in soup.find_all("dl", {"class": "obj-details"}):
        key = None  # no label seen yet in this <dl>
        for el in dl.find_all(["dt", "dd"]):
            if el.name == 'dt':
                key = el.get_text(strip=True)
            elif key in WANTED_KEYS:
                # Join lines with commas, cut trailing 'NAUDINGA'/'m2' text, collapse whitespace
                d_list[key] = ' '.join(el.text.strip().replace("\n", ", ").split('NAUDINGA')[0].split('m2')[0].split())
    return d_list

for puslapis in range(1, 2):
    driver.get(f'https://www.aruodas.lt/butai/kaune/puslapis/{puslapis}')
    response = driver.page_source
    soup = BeautifulSoup(response, 'html.parser')
    blocks = soup.find_all('tr', class_='list-row')

    # Collect the listing URLs from the result rows
    stored_urls = []
    for url in blocks:
        try:
            stored_urls.append(url.a['href'])
        except Exception:
            pass  # row without a link

    for link in stored_urls:
        driver.get(link)
        response = driver.page_source
        soup = BeautifulSoup(response, 'html.parser')
        h1 = soup.find('h1', 'obj-header-text')
        price = soup.find('div', class_='price-left')

        try:
            address1 = h1.get_text(strip=True)
            # Drop everything after the last comma, then split the rest into three parts
            address2 = re.findall(r'(.*),[^,]*$', address1)
            address = ''.join(address2)
            city, district, street = address.split(',')
        except Exception:
            # Assign the placeholder to each name; unpacking 'NaN' would yield 'N', 'a', 'N'
            city = district = street = 'NaN'

        try:
            full_price = price.find('span', class_='price-eur').text.strip()
            full_price1 = full_price.replace('€', '').replace(' ', '').strip()
        except Exception:
            full_price1 = 'NaN'

        try:
            price_sq_m = price.find('span', class_='price-per').text.strip()
            price_sq_m1 = price_sq_m.replace('€/m2)', '').replace('(domina keitimas)', '').replace('(', '').replace(' ', '').strip()
        except Exception:
            price_sq_m1 = 'NaN'

        try:
            price_change = price.find('div', class_='price-change').text.strip()
            price_change1 = price_change.replace('%', '').strip()
        except Exception:
            price_change1 = 'NaN'

        data.append({'city': city, 'district': district, 'street': street, 'full_price': full_price1, 'price_sq_m': price_sq_m1, 'price_change': price_change1, **get_dl(soup)})
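For example, the key list contains:
['Ypatybės:']
but the page of the first flat I scrape has no such value, so that column is never created, which is not what I need.
The code that writes the CSV: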
with open('output_kaunas.csv', 'w', encoding='utf-8', newline='') as f_output:
    csv_output = csv.DictWriter(f_output, fieldnames=data[0].keys(), extrasaction='ignore')
    csv_output.writeheader()
    csv_output.writerows(data)
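The behaviour can be reproduced without any scraping. A minimal sketch, assuming two made-up rows (the values are hypothetical, not real listings) where only the second one has 'Ypatybės:'. The header comes from the first row alone, and extrasaction='ignore' silently discards any key that is not in that header:

```python
import csv
import io

# Hypothetical rows: the first listing lacks 'Ypatybės:', the second has it.
rows = [
    {'city': 'Kaunas', 'full_price': '100000'},
    {'city': 'Kaunas', 'full_price': '120000', 'Ypatybės:': 'Nauja elektros instaliacija'},
]

# Write to an in-memory buffer instead of a file so the result is easy to inspect.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=rows[0].keys(), extrasaction='ignore')
writer.writeheader()
writer.writerows(rows)

output = buffer.getvalue()
print(output)  # the 'Ypatybės:' column never appears
```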
So my question is: how can I get a column for every feature I need, even when that feature is missing from the first scraped listing?
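Answer from a uj5u.com user:
To store the data in a CSV file, you can use a pandas DataFrame. It builds its columns from the keys of all the dictionaries in the list, so a key that is missing from the first listing still gets a column, and the gaps are written as empty cells:
pd.DataFrame(data).to_csv('output_kaunas.csv', index=False)
Applied to your full code: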
from selenium import webdriver
from bs4 import BeautifulSoup
import re
import pandas as pd
import time

# Raw string so the backslashes in the Windows path are not read as escapes
PATH = r'C:\Program Files (x86)\chromedriver.exe'
# Selenium 3 style call; newer Selenium 4 releases take the driver path via a Service object
driver = webdriver.Chrome(PATH)

data = []

# Detail labels (the <dt> texts) that should become CSV columns
WANTED_KEYS = [
    'Plotas:', 'Buto numeris:', 'Metai:', 'Namo numeris:', 'Kambarių sk.:',
    'Aukštas:', 'Aukštų sk.:', 'Pastato tipas:', 'Šildymas:', 'Įrengimas:',
    'Pastato energijos suvartojimo klasė:', 'Ypatybės:', 'Papildomos patalpos:',
    'Papildoma įranga:', 'Apsauga:',
]

def get_dl(soup):
    """Collect the wanted <dt>/<dd> pairs of a listing page into a dict."""
    d_list = {}
    for dl in soup.find_all("dl", {"class": "obj-details"}):
        key = None  # no label seen yet in this <dl>
        for el in dl.find_all(["dt", "dd"]):
            if el.name == 'dt':
                key = el.get_text(strip=True)
            elif key in WANTED_KEYS:
                # Join lines with commas, cut trailing 'NAUDINGA'/'m2' text, collapse whitespace
                d_list[key] = ' '.join(el.text.strip().replace("\n", ", ").split('NAUDINGA')[0].split('m2')[0].split())
    return d_list

for puslapis in range(1, 2):
    driver.get(f'https://www.aruodas.lt/butai/kaune/puslapis/{puslapis}')
    response = driver.page_source
    soup = BeautifulSoup(response, 'html.parser')
    blocks = soup.find_all('tr', class_='list-row')

    # Collect the listing URLs from the result rows
    stored_urls = []
    for url in blocks:
        try:
            stored_urls.append(url.a['href'])
        except Exception:
            pass  # row without a link

    for link in stored_urls:
        driver.get(link)
        response = driver.page_source
        soup = BeautifulSoup(response, 'html.parser')
        h1 = soup.find('h1', 'obj-header-text')
        price = soup.find('div', class_='price-left')

        try:
            address1 = h1.get_text(strip=True)
            # Drop everything after the last comma, then split the rest into three parts
            address2 = re.findall(r'(.*),[^,]*$', address1)
            address = ''.join(address2)
            city, district, street = address.split(',')
        except Exception:
            # Assign the placeholder to each name; unpacking 'NaN' would yield 'N', 'a', 'N'
            city = district = street = 'NaN'

        try:
            full_price = price.find('span', class_='price-eur').text.strip()
            full_price1 = full_price.replace('€', '').replace(' ', '').strip()
        except Exception:
            full_price1 = 'NaN'

        try:
            price_sq_m = price.find('span', class_='price-per').text.strip()
            price_sq_m1 = price_sq_m.replace('€/m2)', '').replace('(domina keitimas)', '').replace('(', '').replace(' ', '').strip()
        except Exception:
            price_sq_m1 = 'NaN'

        try:
            price_change = price.find('div', class_='price-change').text.strip()
            price_change1 = price_change.replace('%', '').strip()
        except Exception:
            price_change1 = 'NaN'

        data.append({'city': city, 'district': district, 'street': street, 'full_price': full_price1, 'price_sq_m': price_sq_m1, 'price_change': price_change1, **get_dl(soup)})

# to_csv returns None when given a path, so there is nothing useful to assign
pd.DataFrame(data).to_csv('output_kaunas.csv', index=False)
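If you would rather stay with the csv module, one option is to build the header from the keys of every row instead of only the first one. This is a minimal sketch, not part of the answer above: the helper name collect_fieldnames and the sample rows are hypothetical, and an in-memory buffer stands in for the output file.

```python
import csv
import io

def collect_fieldnames(rows):
    """Return every key seen in any row, in first-seen order."""
    seen = {}
    for row in rows:
        for key in row:
            seen.setdefault(key, None)  # dicts preserve insertion order
    return list(seen)

# Hypothetical rows standing in for the scraped `data` list.
sample_data = [
    {'city': 'Kaunas', 'full_price': '100000'},
    {'city': 'Kaunas', 'full_price': '120000', 'Ypatybės:': 'Nauja elektros instaliacija'},
]

# In real use, replace the buffer with
# open('output_kaunas.csv', 'w', encoding='utf-8', newline='').
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=collect_fieldnames(sample_data), restval='NaN')
writer.writeheader()
writer.writerows(sample_data)

output = buffer.getvalue()
print(output)
```

Here restval supplies the value written for keys a row does not have, so every listing gets a cell in every column. Because the header now covers all keys, extrasaction='ignore' is no longer needed.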
