某壳网房屋租赁信息抓取与浅析

Catalogue

摘要
收集整合房屋租赁信息，并加以简单的分析及可视化～

项目描述

项目背景：最近在为租房的事情所烦恼，在网站上寻找各种房源信息，看的眼花缭乱。萌生了将房源基本信息汇总起来便于查看的想法，便用python将基本信息爬下来放到excel，并整理成可分析的数据。

数据来源：利用python获取网页代码，并用lxml中的xpath提取信息，最后将数据保存到execl文件

字段意义：

community: 小区名称

address：租房地址

landlord：房屋来源

postime：发布时间

introduction：租房说明

price：房屋价格

area：房屋面积

orientation：房屋朝向

pattern：房屋户型

floor：房屋楼层

date_time：查看日期

项目目的：整合某壳网房源信息，方便查看及分析

环境解释：

基于jupyter notebook

爬虫所用库：requests、lxml、xlsxwriter

数据处理所用库：pandas、numpy

可视化：execl

数据获取

利用requests库获取网页信息

lxml解析网页信息

利用xlsxwriter模块将数据保存至execl

# 导入模块

import requests
import time
from lxml import etree
import xlsxwriter

def get_html(page):
    """获取网站html代码"""
    url = "https://dg.zu.ke.com/zufang/pg{}/#contentList".format(page)
    headers = {
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
    }
    response = requests.get(url, headers=headers).text
    return response

def parse_html(htmlcode, data):
    """解析html代码"""
    content = etree.HTML(htmlcode)
    results = content.xpath('///div[@class="content__article"]/div[1]/div')
    for result in results[:]:
        community = result.xpath('./div[1]/p[@class="content__list--item--title twoline"]/a/text()')[0].replace('\n','').strip().split()[0]
        address = "-".join(result.xpath('./div/p[@class="content__list--item--des"]/a/text()'))
        landlord = result.xpath('./div/p[@class="content__list--item--brand oneline"]/text()')[0].replace('\n','').strip()
        #if len(result.xpath('./div/p[@class="content__list--item--brand oneline"]/text()')) > 0 else ""
        postime = result.xpath('./div/p[@class="content__list--item--time oneline"]/text()')[0]
        introduction = ",".join(result.xpath('./div/p[@class="content__list--item--bottom oneline"]/i/text()'))
        price = result.xpath('./div/span/em/text()')[0]
        description = "".join(result.xpath('./div/p[2]/text()')).replace('\n', '').replace('-', '').strip().split()
        area = description[0]
        count = len(description)
        if count == 6:
            orientation = description[1] + description[2] + description[3] + description[4]
        elif count == 5:
            orientation = description[1] + description[2] + description[3]
        elif count == 4:
            orientation = description[1] + description[2]
        elif count == 3:
            orientation = description[1]
        else:
            orientation = ""
        pattern = description[-1]
        floor = "".join(result.xpath('./div/p[2]/span/text()')[1].replace('\n', '').strip().split()).strip() if len(
            result.xpath('./div/p[2]/span/text()')) > 1 else ""
        date_time = time.strftime("%Y-%m-%d", time.localtime())
        """数据存入字典"""
        data_dict = {
            "community": community,
            "address": address,
            "landlord": landlord,
            "postime": postime,
            "introduction": introduction,
            "price": '￥' + price,
            "area": area,
            "orientation": orientation,
            "pattern": pattern,
            "floor": floor,
            "date_time": date_time
        }

        data.append(data_dict)

def excel_storage(response):
    """将字典数据写入excel"""
    workbook = xlsxwriter.Workbook('./beikeHouse.xlsx')
    worksheet = workbook.add_worksheet()
    """设置标题加粗"""
    bold_format = workbook.add_format({'bold': True})
    worksheet.write('A1', '小区名称', bold_format)
    worksheet.write('B1', '租房地址', bold_format)
    worksheet.write('C1', '房屋来源', bold_format)
    worksheet.write('D1', '发布时间', bold_format)
    worksheet.write('E1', '租房说明', bold_format)
    worksheet.write('F1', '房屋价格', bold_format)
    worksheet.write('G1', '房屋面积', bold_format)
    worksheet.write('H1', '房屋朝向', bold_format)
    worksheet.write('I1', '房屋户型', bold_format)
    worksheet.write('J1', '房屋楼层', bold_format)
    worksheet.write('K1', '查看日期', bold_format)

    row = 1
    col = 0
    for item in response:
        worksheet.write_string(row, col, item['community'])
        worksheet.write_string(row, col + 1, item['address'])
        worksheet.write_string(row, col + 2, item['landlord'])
        worksheet.write_string(row, col + 3, item['postime'])
        worksheet.write_string(row, col + 4, item['introduction'])
        worksheet.write_string(row, col + 5, item['price'])
        worksheet.write_string(row, col + 6, item['area'])
        worksheet.write_string(row, col + 7, item['orientation'])
        worksheet.write_string(row, col + 8, item['pattern'])
        worksheet.write_string(row, col + 9, item['floor'])
        worksheet.write_string(row, col + 10, item['date_time'])
        row += 1
    workbook.close()

def main():
    all_datas = []
    """网页循环"""
    for page in range(1, 100):
        html = get_html(page)
        parse_html(html, all_datas)
    excel_storage(all_datas)


if __name__ == '__main__':
    main()

数据处理

获取到数据后，进行初步的整理和清洗。

import pandas as pd
import numpy as np
import os
os.chdir(r'/Users/nanb/Documents/数据存放')
import warnings
warnings.filterwarnings('ignore')

df = pd.read_excel('beikeHouse.xlsx')
df.drop(['房屋来源','租房说明','查看日期'],axis=1,inplace = True)

房屋朝向

# 房屋朝向

df.apply(lambda x:sum(x.isnull()))
df = df[~df['房屋朝向'].isnull()]
# df['房屋朝向'].value_counts()

# 类似'在租'值 -- 处理

df = df[~df['房屋朝向'].str.contains('在租')]
# df['房屋朝向'].value_counts()

# 错误值简单处理

df['房屋朝向'][df['房屋朝向'].isin(['东南南','东东南','东东南南'])] = '东南'
df['房屋朝向'][df['房屋朝向'].isin(['西西北','西北北'])] = '西北'
df['房屋朝向'][df['房屋朝向'].isin(['东东北'])] = '东北'
df['房屋朝向'][df['房屋朝向'].isin(['南西南'])] = '西南'
# df['房屋朝向'].value_counts()


# 数据筛选

fwcx_counts = df['房屋朝向'].value_counts()
to_remove = fwcx_counts[fwcx_counts <= 5].index
df.replace(to_remove,np.nan,inplace=True)
df = df[~df['房屋朝向'].isnull()]

租房区域

根据租房地址生成新字段：租房区域

1
2
3

# 租房区域

df['租房区域'] = df['租房地址'].apply(lambda x:x[0:3])

房屋价格

对房屋价格进行处理并生成新字段：价格类型

因为本次获取的数据仅为租房信息，依据数据观察，房屋价格大多处于1000～2500之间

故价格类型分成5个level：1000以下、1001～1500、1501～2000、2001~2500、2500以上

# 字段值处理

df['房屋价格'] = df['房屋价格'].apply(lambda x:x.replace('￥',''))

df['房屋价格'][df['房屋价格'].str.contains('-')] = df['房屋价格'][df['房屋价格'].str.contains('-')].apply(lambda x:x.split('-',1)[1])

df['房屋价格'] = df['房屋价格'].apply(lambda x:int(x))

# 根据价格分组

lst_bins = [0,1001,1501,2001,2501,1000000]
lst_labels = ['1000以下','1001~1500','1501~2000','2001~2500','2500以上']

df['价格类型'] = pd.cut(df['房屋价格'],bins = lst_bins,labels = lst_labels,include_lowest=True)

户型区分

为方便数据查看以及选择，根据房屋户型生成3个新字段：

户型(室)、户型(厅)、户型(卫)

# 户型(室)

df = df[~df['房屋户型'].str.contains('未知')]
df['户型(室)'] = df['房屋户型'].apply(lambda x:x.partition('室')[0])

# 户型(厅)

df['户型(厅)'] = df['房屋户型'].apply(lambda x:x.split('室')[1][0])

# 户型(卫)

df['户型(卫)'] = df['房屋户型'].apply(lambda x:x.split('厅')[1][0])

房屋楼层

根据数据观察，房屋楼层中地下室在网页中主图显示基本为车位，故将地下室更换成车位，便于区分

# 房屋楼层

df['房屋楼层'] = df['房屋楼层'].apply(lambda x:str(x)[0:3])
# df[df['房屋楼层'] == 'nan'] = '未知'

df['房屋楼层'][df['房屋楼层'] == '地下室'] = '车位'
#df['房屋楼层'].value_counts()

租赁类型

根据小区名称生成新字段：租赁类型

租赁类型：整租、合租

1 2	df['租赁类型'] = df['小区名称'].apply(lambda x:x[0:2]) df['小区名称'] = df['小区名称'].apply(lambda x:x.split('·')[1])

当月数据提取

以天为单位

将房源的上新时间为天的数据提取出来，生成当月上新数据。

处理逻辑：发布时间为类似 ‘1天前发布’ 这种数据整合为当月上新数据，并更改数据类型为int

# 日报

df1 = df[df['发布时间'].str.contains('天')]

# 数据截取生成新字段

df1['上新时间'] = df1['发布时间'].apply(lambda x:x[0:2])

# 字段值处理

df1['上新时间'] = df1['上新时间'].str.replace('今','0')

df1['上新时间'] = df1['上新时间'].str.replace('天','')

df1['上新时间'] = df1['上新时间'].apply(lambda x:int(x))

以周为单位

将当月上新数据，以周为单位做区分

说明：这里仅以一周为7天来做区间分类，并未加入实际日期进行换算，仅做简单处理

处理逻辑：上新时间为0～7天为第一周，8～14天为第二周，依此类推

# 周报

df2 = df1.copy()
lst_bins = [0,8,15,22,31]
lst_labels = [1,2,3,4]

df2['上新周数'] = pd.cut(df1['上新时间'],bins = lst_bins,labels = lst_labels,include_lowest=True)

单间分类

提取户型(室)为1的数据整合为单间数据，并依据房屋面积大小生成新字段：单间分类

单间分类区间为：

0～30：极小单间、31～40：普通单间、41～50：大单间、51及以上：超大单间

# 一居室面积

df3 = df[df['户型(室)'] == '1']
df3['房屋面积'] = df3['房屋面积'].apply(lambda x:x.replace('㎡',''))
df3['房屋面积'] = df3['房屋面积'].astype('int')
df3 = df3[df3['房屋面积'] <= 100]

# 区间分类

lst_bins = [0,31,41,51,1000000]
lst_labels = ['极小单间','普通单间','大单间','超大单间']

df3['单间分类'] = pd.cut(df3['房屋面积'],bins = lst_bins,labels = lst_labels,include_lowest=True)

数据保存

# 删除无用字段

df.drop(['发布时间','房屋户型'],axis=1,inplace = True)
df1.drop(['发布时间','房屋户型'],axis=1,inplace = True)
df2.drop(['发布时间','房屋户型'],axis=1,inplace = True)
df3.drop(['发布时间','房屋户型'],axis=1,inplace = True)

写入execl

with pd.ExcelWriter('finall_bkHouse.xlsx') as writer:
    df.to_excel(writer, sheet_name='源数据',index=False)
    df1.to_excel(writer, sheet_name='日报数据',index=False)
    df2.to_excel(writer, sheet_name='周报数据',index=False)
    df3.to_excel(writer, sheet_name='一居室面积数据',index=False)