I have a data-scraping script that returns my data in arrays like these:
price_per_m2 = [742.0, 1210.0, 954.0, 1078.0, 910.0, 1553.0, 0, 1.0, 417.0, 553.0, 41.0, 550.0, 367.0, 11.0, 533.0, 2.0, 1139.0, 1466.0, 1042.0, 800.0, 906.0, 60.0, 91.0, 812.0, 412.0, 1000.0, 64.0, 778.0, 63.0, 1043.0, 899.0, 951.0]
type_of_property = ['Магазин', 'Двустаен апартамент', 'Тристаен апартамент', 'Тристаен апартамент', 'Тристаен апартамент', 'Тристаен апартамент', 'Парцел', 'Парцел', 'Гараж', 'Офис', 'Заведение', 'Офис', 'Гараж', 'Парцел', 'Офис', 'Парцел', 'Офис', 'Офис', 'Магазин', 'Магазин', 'Гараж', 'Земеделски имот', 'Парцел', 'Магазин', 'Офис', 'Двустаен апартамент', 'Парцел', 'Магазин', 'Парцел', 'Двустаен апартамент', 'Едностаен апартамент', 'Двустаен апартамент', 'Офис', 'Едностаен апартамент', 'Земеделски имот', 'Офис', 'Едностаен апартамент', 'Едностаен апартамент', 'Магазин', 'Двустаен апартамент', 'Офис', 'Двустаен апартамент', 'Едностаен апартамент', 'Двустаен апартамент']
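- Note that the two arrays shown here may differ in length, because they are too long and I did not paste them in full.
The end goal is to create one Excel file per day from all the arrays (scraped daily).
For now, though, the goals are:
- Create a pandas array from one of the arrays above
- Save that array to an Excel file.
What I have done so far: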
df_price_per_m2 = pd.DataFrame(data=price_per_m2)
df_type_of_property = pd.DataFrame(type_of_property)
df_price_per_m2.to_excel('sqm.xlsx')
df_type_of_property.to_excel('sqm.xlsx')
As you will notice, I have tried both with and without the data= keyword. My program returns an error on the first line of this code.
Full program:
from requests_html import HTMLSession
from bs4 import BeautifulSoup
import pandas as pd
import re
import os
s = HTMLSession()
url = 'https://www.imoti.net/bg/obiavi/r/prodava/sofia/?page=1&sid=fSNNpb'
r = s.get(url)
soup_for_last_page = BeautifulSoup(r.text, 'html.parser')
# Get all the data from the page
def getdata(url):
    r = s.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    # print(soup)
    return soup
def getnextpage(soup):
    page = soup.find('nav', {'class': 'paginator'})
    if page.find('a', {'class': 'next-page-btn'}):
        url = str(page.find('a', {'class': 'next-page-btn'})['href'])
        return url
    else:
        return
last_page = soup_for_last_page.find('a', {'class': 'last-page'})
last_page_number = int(last_page.get_text())
urls = []
for page in range(1, last_page_number + 1):
    url = f'https://www.imoti.net/bg/obiavi/r/prodava/sofia/?page={page}&sid=fSNNpb'
    urls.append(url)
# while True:
#     soup = getdata(url)
#     url = getnextpage(soup)
#     if not url:
#         break
#     urls.append(url)
#     #print(url)
prices = []
type_of_property = []
sqm_area = []
locations = []
publisher = []
price_per_m2 = []
def price_per_m2_0(x):
    if x.get_text().strip().find('/:') == -1:
        return 0
    else:
        return float(x.get_text().strip().split('/:')[1].strip().replace('EUR', '').strip().replace(' ', ''))
def get_sqm(links):
    for i in links:
        soup = getdata(i)
        for sqm in soup.find('ul', {'class': 'list-view real-estates'}).find_all('div', {'class': 'inline-group'}):
            sqm_value = sqm.get_text().split(',')[1].split()[0]
            sqm_area.append(sqm_value)
    return sqm_area
def get_location(links):
    for i in links:
        soup = getdata(i)
        for location in soup.find('ul', {'class': 'list-view real-estates'}).find_all('div', {'class': 'inline-group'}):
            location_value = location.get_text().split(',')[-1].strip()
            locations.append(location_value)
    return locations
def get_type(links):
    for i in links:
        soup = getdata(i)
        for property_type in soup.find('ul', {'class': 'list-view real-estates'}).find_all('div', {'class': 'inline-group'}):
            property_type_value = ' '.join(
                property_type.get_text().split(',')[0].split()[1:3])
            type_of_property.append(property_type_value)
    return type_of_property
def get_publisher(links):
    for i in links:
        soup = getdata(i)
        for publish in soup.find('ul', {'class': 'list-view real-estates'}).find_all('span', {'class': 're-offer-type'})[1::2]:
            publish_value = publish.get_text().strip()
            publisher.append(publish_value)
    return publisher
def get_price_per_m2(links):
    for i in links:
        soup = getdata(i)
        for price_per_m2_ in soup.find('ul', {'class': 'list-view real-estates'}).find_all('ul', {'class': 'parameters'}):
            price_per_m2_value = price_per_m2_0(price_per_m2_)
            price_per_m2.append(price_per_m2_value)
    return price_per_m2
def total_price(links):
    for i in links:
        soup = getdata(i)
        for price in soup.find('ul', {'class': 'list-view real-estates'}).find_all('strong', {'class': 'price'}):
            price_text = price.get_text()
            # Collect every run of digits, then join them into one number string
            price_arr = re.findall('[0-9]+', price_text)
            final_price = ''
            for each_sub_price in price_arr:
                final_price += each_sub_price
            prices.append(final_price)
    return prices
print(get_sqm(urls))
print(get_location(urls))
print(get_type(urls))
print(get_publisher(urls))
print(get_price_per_m2(urls))
print(total_price(urls))
df_get_sqm = pd.DataFrame(data=get_sqm)
df_get_location = pd.DataFrame(get_location)
df_get_type = pd.DataFrame(get_type)
df_get_publisher = pd.DataFrame(get_publisher)
df_get_price_per_m2 = pd.DataFrame(get_price_per_m2)
df_total_price = pd.DataFrame(total_price)
df_get_sqm.to_excel('sqm.xlsx')
Edit: the error message I receive:
Traceback (most recent call last):
File "/Users/tdonov/Desktop/Python/Realestate Scraper/real_estate_test.py", line 130, in <module>
df_get_sqm = pd.DataFrame(data=get_sqm)
File "/opt/anaconda3/lib/python3.8/site-packages/pandas/core/frame.py", line 590, in __init__
raise ValueError("DataFrame constructor not properly called!")
ValueError: DataFrame constructor not properly called!
[Finished in 70.198s]
A uj5u.com user replied:
Try:
df_price_per_m2 = pd.DataFrame(data={'price':price_per_m2})
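A possible reason for the error, judging from the traceback in the question: pd.DataFrame(data=get_sqm) receives the function get_sqm itself rather than the list it returns. The sketch below contrasts the two calls. The get_sqm body and the urls value are hypothetical stand-ins so that the example runs without network access, and the column name 'sqm' is an assumption.

```python
import pandas as pd

# Hypothetical stand-in for the question's scraper function (no network access).
def get_sqm(links):
    return ['60', '75', '120']

urls = ['page-1']  # placeholder for the list of page URLs

# Passing the function object, as in the traceback, is rejected by pandas.
try:
    pd.DataFrame(data=get_sqm)
except ValueError as exc:
    error_message = str(exc)

# Passing the list returned by the call, wrapped in a dict, gives a named column.
sqm_values = get_sqm(urls)
df_sqm = pd.DataFrame(data={'sqm': sqm_values})

# Writing to Excel needs an engine such as openpyxl to be installed:
# df_sqm.to_excel('sqm.xlsx', index=False)
```

Calling the function once and reusing its result also avoids scraping every page a second time.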
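A uj5u.com user replied:
Create a pandas array from one of the arrays above
Note that something like [1,2,3] is usually called a list rather than an array in Python. If you have a single flat list (like your price_per_m2), a pandas.Series should be enough. Try the following: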
import pandas as pd
price_per_m2 = [742.0, 1210.0, 954.0, 1078.0, 910.0, 1553.0, 0, 1.0, 417.0, 553.0, 41.0, 550.0, 367.0, 11.0, 533.0, 2.0, 1139.0, 1466.0, 1042.0, 800.0, 906.0, 60.0, 91.0, 812.0, 412.0, 1000.0, 64.0, 778.0, 63.0, 1043.0, 899.0, 951.0]
s = pd.Series(price_per_m2)
s.to_excel('sqm.xlsx')
If you want to know more about writing to Excel files, read the pandas.Series.to_excel documentation.
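For the longer-term goal of putting all the lists into one Excel file, one option is to wrap each list in a Series before building a DataFrame, so that lists of different lengths are padded with NaN instead of raising an error. This is a minimal sketch, not the asker's code: the helper names lists_to_frame and save_sheets are made up for illustration, and the lists are shortened samples.

```python
import pandas as pd

def lists_to_frame(columns):
    """Build one DataFrame from a dict of lists that may differ in length."""
    # Wrapping each list in a Series lets pandas pad shorter columns with NaN.
    return pd.DataFrame({name: pd.Series(values) for name, values in columns.items()})

def save_sheets(frames, path):
    """Write each DataFrame in `frames` to its own sheet of one workbook."""
    with pd.ExcelWriter(path) as writer:
        for sheet_name, frame in frames.items():
            frame.to_excel(writer, sheet_name=sheet_name, index=False)

price_per_m2 = [742.0, 1210.0, 954.0]
type_of_property = ['Магазин', 'Гараж']  # deliberately shorter than the prices

df = lists_to_frame({
    'price_per_m2': price_per_m2,
    'type_of_property': type_of_property,
})
```

A call such as save_sheets({'listings': df}, 'listings.xlsx') would then write the workbook, provided an Excel engine such as openpyxl is installed. Calling to_excel twice with the same file name, as in the question, replaces the first file with the second, so separate sheets or separate file names are needed to keep both.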
