文章目錄
- 前言
- 一、繪制房源分布地圖
- 1.匯入基本模塊
- 2.資料清洗
- 3.繪制房源分布地圖
- 二、單因素方差分析
- 1.Entire home/apt 下地區對房租價格的影響
- 2.Private room 下地區對房租價格的影響
- 3. Shared room 下地區對房租價格的影響
前言
共享,通過讓渡閑置資源的使用權,在有限增加邊際成本的前提下,提高了資源利用效率,隨著資訊的透明化,越來越多的共享發生在陌生人之間,短租,共享空間的一種模式,不論是否體驗過入住陌生人的家中,你都可以從短租的資料里挖掘有趣的資訊,本文主要根據愛彼迎平臺公布的資料繪制房源分布地圖以及進行在三種租房模式下地區因子水平對租房價格的影響,
資料集鏈接:天池大賽資料集
一、繪制房源分布地圖
1.匯入基本模塊
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pyecharts.charts import Geo
from pyecharts import options as opts
from pyecharts.globals import GeoType
from pyecharts.charts import Map
from scipy import stats
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
import warnings
warnings.filterwarnings('ignore')#忽略生成圖片時的報錯
plt.rcParams['font.sans-serif']='SimHei'
plt.rcParams['axes.unicode_minus']=False#是圖片顯示中文
%matplotlib inline
2.資料清洗
data = pd.read_csv(r'D:\天池大賽\短租資料集\listings.csv')
data.head()
| id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | |
| 0 | 44054 | Modern and Comfortable Living in CBD | 192875 | East Apartments | NaN | 朝陽區 / Chaoyang | 39.89503 | 116.45163 | Entire home/apt | 792 | 1 | 89 | 2019-03-04 | 0.85 | 9 | 341 |
| 1 | 100213 | The Great Wall Box Deluxe Suite A團園長城小院東院套房 | 527062 | Joe | NaN | 密云縣 / Miyun | 40.68434 | 117.17231 | Private room | 1201 | 1 | 2 | 2017-10-08 | 0.10 | 4 | 0 |
| 2 | 128496 | Heart of Beijing: House with View 2 | 467520 | Cindy | NaN | 東城區 | 39.93213 | 116.42200 | Entire home/apt | 389 | 3 | 259 | 2019-02-05 | 2.70 | 1 | 93 |
| 3 | 161902 | cozy studio in center of Beijing | 707535 | Robert | NaN | 東城區 | 39.93357 | 116.43577 | Entire home/apt | 376 | 1 | 26 | 2016-12-03 | 0.28 | 5 | 290 |
| 4 | 162144 | nice studio near subway, sleep 4 | 707535 | Robert | NaN | 朝陽區 / Chaoyang | 39.93668 | 116.43798 | Entire home/apt | 537 | 1 | 37 | 2018-08-01 | 0.40 | 5 | 352 |
觀察資料,可以看到共有16列,分別代表著房源id,房源名稱、房主id、房主名稱、北京的行政區劃分、經度、維度、房源型別、價格、最小租住天數、最后評論日期等, 下面我們進一步粗略瀏覽整體資料情況
data.info()#使用info方法先整體查看資料集
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28452 entries, 0 to 28451
Data columns (total 16 columns):
id 28452 non-null int64
name 28451 non-null object
host_id 28452 non-null int64
host_name 28452 non-null object
neighbourhood_group 0 non-null float64
neighbourhood 28452 non-null object
latitude 28452 non-null float64
longitude 28452 non-null float64
room_type 28452 non-null object
price 28452 non-null int64
minimum_nights 28452 non-null int64
number_of_reviews 28452 non-null int64
last_review 17294 non-null object
reviews_per_month 17294 non-null float64
calculated_host_listings_count 28452 non-null int64
availability_365 28452 non-null int64
dtypes: float64(4), int64(7), object(5)
memory usage: 3.5+ MB
通過總覽資料我們可以看到,name列有一個空值,neighbourhood_group列全是空值,last_review、reviews_per_month 兩列有一部分是空值,
data.describe().T#使用轉置方法使結果更可視化
| count | mean | std | min | 25% | 50% | 75% | max | |
| id | 28452.0 | 2.628583e+07 | 6.403312e+06 | 44054.00000 | 2.245616e+07 | 2.787765e+07 | 3.134482e+07 | 3.395441e+07 |
| host_id | 28452.0 | 1.442821e+08 | 7.057051e+07 | 192875.00000 | 8.708958e+07 | 1.525464e+08 | 2.061464e+08 | 2.563498e+08 |
| neighbourhood_group | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| latitude | 28452.0 | 3.998323e+01 | 1.869841e-01 | 39.45581 | 3.989733e+01 | 3.993090e+01 | 3.999047e+01 | 4.094966e+01 |
| longitude | 28452.0 | 1.164420e+02 | 2.047957e-01 | 115.47339 | 1.163553e+02 | 1.164347e+02 | 1.164911e+02 | 1.174953e+02 |
| price | 28452.0 | 6.112033e+02 | 1.623535e+03 | 0.00000 | 2.350000e+02 | 3.890000e+02 | 5.770000e+02 | 6.898300e+04 |
| minimum_nights | 28452.0 | 2.729685e+00 | 1.792093e+01 | 1.00000 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.125000e+03 |
| number_of_reviews | 28452.0 | 7.103156e+00 | 1.681507e+01 | 0.00000 | 0.000000e+00 | 1.000000e+00 | 6.000000e+00 | 3.220000e+02 |
| reviews_per_month | 17294.0 | 1.319757e+00 | 1.581243e+00 | 0.01000 | 2.900000e-01 | 8.000000e-01 | 1.750000e+00 | 2.000000e+01 |
| calculated_host_listings_count | 28452.0 | 1.281829e+01 | 2.926132e+01 | 1.00000 | 2.000000e+00 | 5.000000e+00 | 1.100000e+01 | 2.220000e+02 |
| availability_365 | 28452.0 | 2.203421e+02 | 1.384307e+02 | 0.00000 | 8.700000e+01 | 2.090000e+02 | 3.610000e+02 | 3.650000e+02 |
洗掉neighbourhood_group一列
data = data.drop('neighbourhood_group',axis = 1)
洗掉name列的空值行
data = data.dropna(axis = 0,subset = ['name'])
規范neighbourhood列,使其只含有中文名
def neighbourhood_str(data):
neighbourhoods=[]
list=data["neighbourhood"].str.findall("\w+").tolist()
for i in list:
neighbourhoods.append(i[0])
return neighbourhoods
data["neighbourhood"]=neighbourhood_str(data)
data.head()
| id | name | host_id | host_name | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | |
| 0 | 44054 | Modern and Comfortable Living in CBD | 192875 | East Apartments | 朝陽區 | 39.89503 | 116.45163 | Entire home/apt | 792 | 1 | 89 | 2019-03-04 | 0.85 | 9 | 341 |
| 1 | 100213 | The Great Wall Box Deluxe Suite A團園長城小院東院套房 | 527062 | Joe | 密云縣 | 40.68434 | 117.17231 | Private room | 1201 | 1 | 2 | 2017-10-08 | 0.10 | 4 | 0 |
| 2 | 128496 | Heart of Beijing: House with View 2 | 467520 | Cindy | 東城區 | 39.93213 | 116.42200 | Entire home/apt | 389 | 3 | 259 | 2019-02-05 | 2.70 | 1 | 93 |
| 3 | 161902 | cozy studio in center of Beijing | 707535 | Robert | 東城區 | 39.93357 | 116.43577 | Entire home/apt | 376 | 1 | 26 | 2016-12-03 | 0.28 | 5 | 290 |
| 4 | 162144 | nice studio near subway, sleep 4 | 707535 | Robert | 朝陽區 | 39.93668 | 116.43798 | Entire home/apt | 537 | 1 | 37 | 2018-08-01 | 0.40 | 5 | 352 |
3.繪制房源分布地圖
我們現在比較感興趣的是這這兩萬多個房源在北京16個行政區的分布情況,
data.neighbourhood.value_counts()
朝陽區 10810
東城區 3346
海淀區 3197
豐臺區 1758
西城區 1701
通州區 1290
昌平區 1034
密云縣 935
順義區 920
懷柔區 833
大興區 823
延慶縣 718
房山區 578
石景山區 213
門頭溝區 152
平谷區 143
Name: neighbourhood, dtype: int64
data.neighbourhood.hist(bins = 30,figsize = (20,8))

def test_geo():
city = '北京'
g = Geo()
g.add_schema(maptype=city,itemstyle_opts=opts.ItemStyleOpts(color="#D9D9D9", border_color="#111"))
# 定義坐標對應的名稱,添加到坐標庫中 add_coordinate(name, lng, lat)
list1 = data['id'].tolist()
list2 = data.longitude.tolist()
list3 = data.latitude.tolist()
for x,y,z in zip(list1,list2,list3):
g.add_coordinate(str(x),y,z)
#將坐標點名稱及坐標點值添加到圖表中
b = []
for i in zip(data['id'].map(str),data['id'].value_counts()):
b.append(i)
g.add('', b, type_='scatter', symbol_size=3,color = '#68228B')
# 設定樣式成不顯示圖例
g.set_series_opts(label_opts=opts.LabelOpts(is_show=False))
#設定標題
g.set_global_opts(
title_opts=opts.TitleOpts(title="{}-房源分布".format(city))
)
return g
g = test_geo()
g.render_notebook()
效果如下圖,房源分布地圖繪制完畢!

二、單因素方差分析
接下來進行方差分析,本來想進行短租房屋型別因子下對于房屋價格的影響分析,但后來查資料了解到Entire home/apt 代表的是全職房,Private room 代表的是獨立房間,shared room 代表的是合住房間,那么他們對于價格的影響必然有顯著性差異,所以我們做三種型別下的地區對房價的影響,尤其像 了解共享房間這類新型合租模式地區會對其產生顯著性影響嗎?
1.Entire home/apt 下地區對房租價格的影響
e_data = data[data['room_type'] == 'Entire home/apt']
e_data.head()
| id | name | host_id | host_name | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | |
| 0 | 44054 | Modern and Comfortable Living in CBD | 192875 | East Apartments | 朝陽區 | 39.89503 | 116.45163 | Entire home/apt | 792 | 1 | 89 | 2019-03-04 | 0.85 | 9 | 341 |
| 2 | 128496 | Heart of Beijing: House with View 2 | 467520 | Cindy | 東城區 | 39.93213 | 116.42200 | Entire home/apt | 389 | 3 | 259 | 2019-02-05 | 2.70 | 1 | 93 |
| 3 | 161902 | cozy studio in center of Beijing | 707535 | Robert | 東城區 | 39.93357 | 116.43577 | Entire home/apt | 376 | 1 | 26 | 2016-12-03 | 0.28 | 5 | 290 |
| 4 | 162144 | nice studio near subway, sleep 4 | 707535 | Robert | 朝陽區 | 39.93668 | 116.43798 | Entire home/apt | 537 | 1 | 37 | 2018-08-01 | 0.40 | 5 | 352 |
| 5 | 279078 | Nice Apartment in Beijing | 1455726 | Fiona | 東城區 | 39.93958 | 116.43485 | Entire home/apt | 403 | 1 | 29 | 2018-11-02 | 0.33 | 7 | 353 |
e_data.price.describe()#瀏覽Entire home/apt下的價格
count 16955.000000
mean 746.479151
std 1705.645806
min 0.000000
25% 356.000000
50% 470.000000
75% 658.000000
max 68983.000000
Name: price, dtype: float64
觀察到價格最低有0元/晚,最高有68983元/晚顯然不合理,需要洗掉
e_data = e_data[e_data['price']>0]#洗掉價格為0的房源
import seaborn as sns
sns.boxplot(e_data.price,whis=2,orient='h')#選取2倍四分位距仍然有很多例外值,需要洗掉

def box_plot_outliers(data,data_ser, box_scale):
iqr = box_scale * (data_ser.quantile(0.75) - data_ser.quantile(0.25))
val_low = data_ser.quantile(0.25) - iqr
val_up = data_ser.quantile(0.75) + iqr
a = data[(data_ser> val_low) & (data_ser<val_up)] #洗掉例外值
b = a[['price','neighbourhood']]#由于方差分析只需要各地區因子水平,以及價格,所以洗掉其他列
return b
e_data = box_plot_outliers(e_data,e_data.price,2)
e_data
| price | neighbourhood | |
| 0 | 792 | 朝陽區 |
| 2 | 389 | 東城區 |
| 3 | 376 | 東城區 |
| 4 | 537 | 朝陽區 |
| 5 | 403 | 東城區 |
| ... | ... | ... |
| 28444 | 832 | 延慶縣 |
| 28446 | 228 | 房山區 |
| 28447 | 396 | 朝陽區 |
| 28449 | 329 | 朝陽區 |
| 28451 | 295 | 豐臺區 |
sns.boxplot(e_data.price,whis = 2)

e_data.price.describe()
count 15299.000000
mean 481.556115
std 201.162288
min 54.000000
25% 342.000000
50% 443.000000
75% 584.000000
max 1248.000000
Name: price, dtype: float64
neighbourhood_to_list = {'朝陽區':1,
'海淀區':2,
'東城區':3,
'豐臺區':4,
'西城區':5,
'通州區':6,
'昌平區':7,
'大興區':8,
'順義區':9,
'石景山區':10,
'房山區':11,
'密云縣':12,
'門頭溝區':13,
'平谷區':14,
'懷柔區':15,
'延慶縣':16}
e_data['neighbourhood'] = e_data['neighbourhood'].map(neighbourhood_to_list)
e_data
| price | neighbourhood | |
| 0 | 792 | 1 |
| 2 | 389 | 3 |
| 3 | 376 | 3 |
| 4 | 537 | 1 |
| 5 | 403 | 3 |
| ... | ... | ... |
| 28444 | 832 | 16 |
| 28446 | 228 | 11 |
| 28447 | 396 | 1 |
| 28449 | 329 | 1 |
| 28451 | 295 | 4 |
model = ols('price ~ neighbourhood',e_data).fit()
anovat = anova_lm(model)
print(anovat)
df sum_sq mean_sq F PR(>F)
neighbourhood 1.0 2.967648e+05 296764.791065 7.336672 0.006764
Residual 15297.0 6.187562e+08 40449.511263 NaN NaN
可以看到對于全租房來說,房價與地區有強烈的顯著相關性,
2.Private room 下地區對房租價格的影響
p_data = data[data['room_type'] == 'Private room']
p_data.price.describe()
count 9838.000000
mean 430.681236
std 1203.643527
min 0.000000
25% 181.000000
50% 248.000000
75% 389.000000
max 66667.000000
Name: price, dtype: float64
p_data = p_data[p_data['price']>0]
p_data = box_plot_outliers(p_data,p_data.price,2)
sns.boxplot(p_data.price,whis = 2)

p_data['neighbourhood'] = p_data['neighbourhood'].map(neighbourhood_to_list)
model = ols('price ~ neighbourhood',p_data).fit()
anovat = anova_lm(model)
print(anovat)
df sum_sq mean_sq F PR(>F)
neighbourhood 1.0 1.294708e+07 1.294708e+07 593.986554 4.194081e-127
Residual 9002.0 1.962160e+08 2.179693e+04 NaN NaN
3. Shared room 下地區對房租價格的影響
s_data = data[data['room_type'] == 'Shared room']
s_data.price.describe()
count 1658.000000
mean 293.343185
std 2521.130124
min 27.000000
25% 87.000000
50% 107.000000
75% 148.000000
max 67909.000000
Name: price, dtype: float64
s_data = box_plot_outliers(s_data,s_data.price,2)
sns.boxplot(s_data.price,whis = 2)

p_data['neighbourhood'] = p_data['neighbourhood'].map(neighbourhood_to_list)
model = ols('price ~ neighbourhood',s_data).fit()
anovat = anova_lm(model)
print(anovat)
df sum_sq mean_sq F PR(>F)
neighbourhood 13.0 7.287987e+04 5606.144166 2.838746 0.00048
Residual 1503.0 2.968225e+06 1974.866697 NaN NaN
單因子方差分析完畢
轉載請註明出處,本文鏈接:https://www.uj5u.com/qita/163561.html
標籤:其他
