House Price Prediction ML Model with Python and AWS Lambda

Francia Riesco
12 min read · Jan 20, 2022

We will predict the best market price for a property based on its location and on features such as size, bedrooms, bathrooms, amenities, and other attributes relevant to a property's value. First, we evaluate the dataset to build an effective predictive model for real estate prices; then we create a web app to show similar results.

Github repository

Prediction Model

Create the python environment

pipenv install --python 3.8
pipenv install Cython Flask gunicorn Jinja2 numpy pandas plotly python-dateutil requests scikit-learn matplotlib seaborn jupyter

Pipfile

[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"
[packages]
matplotlib = "*"
cython = "*"
flask = "*"
gunicorn = "*"
jinja2 = "*"
numpy = "*"
pandas = "*"
plotly = "*"
python-dateutil = "*"
requests = "*"
scikit-learn = "*"
seaborn = "*"
jupyter = "*"
[dev-packages]

[requires]
python_version = "3.8"

Merge dataset

head -1 train/CON__OCT_2016.csv > train.csv
# -q keeps tail from printing "==> file <==" headers when given multiple files
tail -q -n +2 train/*.csv >> train.csv
head -1 test/test_CON_Jan_2018.csv > test.csv
tail -q -n +2 test/*.csv >> test.csv
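
The same merge can also be done in pandas, which sidesteps the header handling entirely; a minimal sketch, assuming the same train/ and test/ folder layout:

# Hypothetical pandas-based merge; assumes every CSV in the folder shares one schema.
import glob
import pandas as pd

def merge_csvs(folder: str, out_path: str) -> None:
    # Read every CSV in the folder and concatenate them; headers are handled per file.
    frames = [pd.read_csv(path, low_memory=False) for path in sorted(glob.glob(f"{folder}/*.csv"))]
    pd.concat(frames, ignore_index=True).to_csv(out_path, index=False)

merge_csvs("train", "train.csv")
merge_csvs("test", "test.csv")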

Load dataset

# Load data from CSV
train_dataset = pd.read_csv("static/data/train.csv", dtype={'zip':'category', 'mlsnum':'string', 'proptype':'string'})
test = pd.read_csv("static/data/test.csv", dtype={'zip':'category', 'mlsnum':'string','proptype':'string'})
train_dataset.head()

Column List

['otherfeatures', 'status', 'solddate', 'city', 'showinginstructions', 'garage', 'address', 'agentname', 'listdate', 'cooling', 'dom', 'soldprice', 'level', 'housenum1', 'mlsnum', 'baths', 'officename', 'style', 'expireddate', 'area', 'elementaryschool', 'zip', 'sqft', 'officephone', 'listprice', 'beds', 'highschool', 'heating', 'juniorhighschool', 'dto', 'age', 'remarks', 'photourl', 'row_id', 'proptype', 'state', 'streetname', 'housenum2', 'lotsize']

Dataset Description

train_dataset.describe()
test.describe()

Dataset dimensions

train_dataset.shape, test.shape
# ((500468, 39), (13346, 39))

Join Train and test

train_dataset.index = train_dataset['row_id']
test.index = test['row_id']
frames = [train_dataset, test]
df_combined = pd.concat(frames)
df_combined = df_combined.reset_index(drop=True)
df_combined.shape
# (513814, 39)

Create Binary Variables for Open Answers

df_combined['hasotherfeatures'] = np.where(df_combined['otherfeatures'].isnull(), 0, 1)
df_combined['hasproptype'] = np.where(df_combined['proptype'].isnull(), 0, 1)
df_combined['hasstreetname'] = np.where(df_combined['streetname'].isnull(), 0, 1)
df_combined['hashousenum1'] = np.where(df_combined['housenum1'].isnull(), 0, 1)
df_combined['hashousenum2'] = np.where(df_combined['housenum2'].isnull(), 0, 1)
df_combined['hasagentname'] = np.where(df_combined['agentname'].isnull(), 0, 1)
df_combined['hasofficename'] = np.where(df_combined['officename'].isnull(), 0, 1)
df_combined['hasofficephone'] = np.where(df_combined['officephone'].isnull(), 0, 1)
df_combined['hasshowinginstructions'] = np.where(df_combined['showinginstructions'].isnull(), 0, 1)
df_combined['hasstyle'] = np.where(df_combined['style'].isnull() , 0, 1)
df_combined['haslevel'] = np.where(df_combined['level'].isnull(), 0, 1)
df_combined['hasgarage'] = np.where(df_combined['garage'].isnull(), 0, 1)
df_combined['hasheating'] = np.where(df_combined['heating'].isnull(), 0, 1)
df_combined['hascooling'] = np.where(df_combined['cooling'].isnull(), 0, 1)
df_combined['haselementaryschool'] = np.where(df_combined['elementaryschool'].isnull(), 0, 1)
df_combined['hasjuniorhighschool'] = np.where(df_combined['juniorhighschool'].isnull(), 0, 1)
df_combined['hashighschool'] = np.where(df_combined['highschool'].isnull(), 0, 1)
df_combined['haslistprice'] = np.where(df_combined['listprice'].isnull(), 0, 1)
df_combined['haslistdate'] = np.where(df_combined['listdate'].isnull(), 0, 1)
df_combined['hasaddress'] = np.where(df_combined['address'].isnull(), 0, 1)
df_combined['hascity'] = np.where(df_combined['city'].isnull(), 0, 1)
df_combined['hasstate'] = np.where(df_combined['state'].isnull(), 0, 1)
df_combined['hasarea'] = np.where(df_combined['area'].isnull(), 0, 1)
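
Since every flag follows the same pattern (1 when the source column is present, 0 when it is null), the same columns can equivalently be generated with a short loop:

import numpy as np

flag_columns = [
    'otherfeatures', 'proptype', 'streetname', 'housenum1', 'housenum2',
    'agentname', 'officename', 'officephone', 'showinginstructions', 'style',
    'level', 'garage', 'heating', 'cooling', 'elementaryschool',
    'juniorhighschool', 'highschool', 'listprice', 'listdate', 'address',
    'city', 'state', 'area',
]
for col in flag_columns:
    # 0 if the original value is missing, 1 otherwise
    df_combined[f'has{col}'] = np.where(df_combined[col].isnull(), 0, 1)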

New List of columns

df_combined.columns
Index(['mlsnum', 'status', 'listprice', 'soldprice', 'listdate', 'solddate',
'expireddate', 'dom', 'dto', 'address', 'city', 'state', 'zip', 'area',
'beds', 'baths', 'sqft', 'age', 'lotsize', 'agentname', 'officename',
'officephone', 'showinginstructions', 'remarks', 'style', 'level',
'garage', 'heating', 'cooling', 'elementaryschool', 'juniorhighschool',
'highschool', 'otherfeatures', 'proptype', 'streetname', 'housenum1',
'housenum2', 'photourl', 'row_id', 'hasotherfeatures', 'hasproptype',
'hasstreetname', 'hashousenum1', 'hashousenum2', 'hasagentname',
'hasofficename', 'hasofficephone', 'hasshowinginstructions', 'hasstyle',
'haslevel', 'hasgarage', 'hasheating', 'hascooling',
'haselementaryschool', 'hasjuniorhighschool', 'hashighschool',
'haslistprice', 'haslistdate', 'hasaddress', 'hascity', 'hasstate',
'hasarea'],
dtype='object')

Convert columns to their respective numeric values and fill nulls with zeros

df_combined['listprice'] =  df_combined['listprice'].replace(',','', regex=True).astype(float)
df_combined['listprice'] = df_combined['listprice'].astype('float32')
df_combined['baths'] = df_combined['baths'].astype('float32')
df_combined['beds'] = df_combined['beds'].astype('int')
df_combined['city'] = df_combined['city'].astype('string')
df_combined['state'] = df_combined['state'].astype('string') # category
df_combined['photourl'] = df_combined['photourl'].astype('string')
df_combined['zip'] = df_combined['zip'].astype('string')
df_combined.city.fillna("00000", inplace=True)
df_combined.zip.fillna("00000", inplace=True)
df_combined.state.fillna("NA", inplace=True)
df_combined.photourl.fillna("NA", inplace=True)
# keep these variables as original string for csv export
df_combined['proptype_str'] = df_combined['proptype'].astype('string')
df_combined['zip_str'] = df_combined['zip'].astype('string')
df_combined['listprice'] = pd.to_numeric(df_combined['listprice'], errors='ignore')
df_combined['beds'] = pd.to_numeric(df_combined['beds'], errors='ignore')
df_combined['baths'] = pd.to_numeric(df_combined['baths'], errors='ignore')
df_combined['sqft'] = pd.to_numeric(df_combined['sqft'], errors='ignore')
df_combined['zip'] = pd.to_numeric(df_combined['zip'], errors='ignore')
df_combined.listprice.fillna(0, inplace=True)
df_combined.zip.fillna(0, inplace=True)
df_combined.sqft.fillna(0, inplace=True)
df_combined.beds.fillna(0, inplace=True)
df_combined.baths.fillna(0, inplace=True)
df_combined.loc[df_combined.proptype == 'CC', 'proptype'] = '1'
df_combined.loc[df_combined.proptype == 'MF', 'proptype'] = '2'
df_combined.loc[df_combined.proptype == 'SF', 'proptype'] = '3'
df_combined.loc[df_combined.proptype == 'EXT', 'proptype'] = '4'
df_combined[df_combined.eq('00nan').any(axis=1)]
df_combined.loc[df_combined['zip'] == '00nan', ['zip']] = 0
df_combined.loc[df_combined['zip'] == 'Centerville', ['zip']] = 0
df_combined['proptype'] = df_combined['proptype'].astype('int')

Create a new dataset CSV

  • This will be used to create the database
df_combined.to_csv('df_combined.csv', sep=',', encoding='utf-8')

Drop all non-numeric columns

new_drop_columns = ['status', 'listdate', 'address', 'area',
                    'agentname', 'officename', 'officephone',
                    'showinginstructions', 'remarks', 'style',
                    'level', 'garage', 'heating', 'cooling',
                    'elementaryschool', 'juniorhighschool',
                    'highschool', 'otherfeatures', 'streetname',
                    'housenum1', 'housenum2', 'mlsnum', 'soldprice', 'solddate',
                    'dom', 'dto', 'expireddate', 'age', 'lotsize',
                    'row_id', 'proptype_str', 'zip_str', 'city', 'state', 'photourl']
df_combined.drop(new_drop_columns, axis=1, inplace=True)

Only Numerical columns

Index(['listprice', 'zip', 'beds', 'baths', 'sqft', 'proptype',
'hasotherfeatures', 'hasproptype', 'hasstreetname', 'hashousenum1',
'hashousenum2', 'hasagentname', 'hasofficename', 'hasofficephone',
'hasshowinginstructions', 'hasstyle', 'haslevel', 'hasgarage',
'hasheating', 'hascooling', 'haselementaryschool',
'hasjuniorhighschool', 'hashighschool', 'haslistprice', 'haslistdate',
'hasaddress', 'hascity', 'hasstate', 'hasarea'],
dtype='object')

Visualization

List Price

The column listprice is our target (dependent) variable, and all the other columns are our independent variables.

  • Minimum price: $1,000.00
  • Maximum price: $90,000,000.00
  • Mean price: $631,906.00
  • Median price $415,900.00
  • Standard deviation of prices: $1,073,829.25
  • First quartile of prices: $269,900.00
  • Third quartile of prices: $654,900.00
  • Interquartile (IQR) of prices: $385,000.00
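
The figures above, and the beds and baths summaries below, can be reproduced directly from the frame with pandas; a minimal sketch:

def summarize(series):
    # Basic location and spread statistics used throughout this section.
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    return {
        'min': series.min(), 'max': series.max(),
        'mean': series.mean(), 'median': series.median(),
        'std': series.std(),
        'q1': q1, 'q3': q3, 'iqr': q3 - q1,
    }

print(summarize(df_combined['listprice']))
print(summarize(df_combined['beds']))
print(summarize(df_combined['baths']))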

Beds

The column beds is one of the independent variables.

  • Minimum beds: 0.00
  • Maximum beds: 20.00
  • Mean beds: 3.45
  • Median beds 3.00
  • The standard deviation of beds: 1.69
  • First quartile of beds: 3.00
  • Third quartile of beds: 4.00
  • Interquartile (IQR) of beds: 1.00

Baths

The column baths is one of the independent variables.

  • Minimum baths: 0.00
  • Maximum baths: 15.00
  • Mean baths: 2.15
  • Median baths 2.00
  • Standard deviation of baths: 1.24
  • First quartile of baths: 1.00
  • Third quartile of baths: 2.50
  • Interquartile (IQR) of baths: 1.50

Plots

Beds vs List Price top 100

Baths vs List Price top 100

Beds vs List Price

Baths vs List Price

SQFT vs List Price

Beds vs Proptype vs Listprice

Baths vs Proptype vs Listprice

Boxplot

Boxplot List Price

Boxplot Beds

Boxplot Baths

Boxplot SQFT

Histogram

Beds Histogram

Baths Histogram

Model Analysis

We tried Linear Regression and Random Forest as base models to analyze our dataset.
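
Before either model can be fit, the combined frame has to be split back into training and test rows and into features and target. That step is not shown explicitly here, so the sketch below is an assumption based on the row counts reported earlier (the first 500,468 rows of df_combined came from train.csv):

# Assumed split: the first len(train_dataset) rows of df_combined are the training data.
n_train = 500468  # from train_dataset.shape reported above

train_numeric = df_combined.iloc[:n_train]
test_numeric = df_combined.iloc[n_train:]

# 'listprice' is the target; everything else is a feature.
X_train = train_numeric.drop(columns=['listprice'])
y_train = train_numeric['listprice']
X_test = test_numeric.drop(columns=['listprice'])
y_test = test_numeric['listprice']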

Linear Regression

Because price prediction is a regression problem, we approached our model methodology with a Linear Regression baseline. R-squared is a measure of how well a linear regression model “fits” a dataset. In our regression output we see R² = 0.25, which indicates that about 25% of the variance in list price is explained by the model's features. Values of roughly 26% or above are often taken to indicate a large effect size; by that standard, our model sits in the low-to-medium range.
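
A minimal sketch of such a baseline fit, assuming the X_train/y_train split above:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Fit the baseline on the numeric feature columns and score it on held-out data.
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
print("R^2:", r2_score(y_test, lin_reg.predict(X_test)))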

Random forest

We then ran a RandomForestRegressor with the best parameters found during the linear-regression experiments. In the random forest output we see R² = -0.23, which means the model generalizes poorly. This is not too surprising for a random forest, which tends to fit the training set extremely well because of how exhaustive the algorithm is, and can therefore struggle on unseen data.
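
A corresponding sketch with scikit-learn; the hyperparameter values here are placeholders rather than the tuned settings:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Placeholder hyperparameters; the tuned values are not listed here.
rf = RandomForestRegressor(n_estimators=100, max_depth=None, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)
print("R^2:", r2_score(y_test, rf.predict(X_test)))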

Model Tested

array([557572.21951763])
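
The array above is a single predicted list price returned by the fitted model. A sketch of how one such prediction could be produced, with an illustrative feature row (not the exact row behind the number above):

import pandas as pd

# One hypothetical property, using the same numeric columns as X_train.
sample = pd.DataFrame([{
    'zip': 2053, 'beds': 3, 'baths': 2.0, 'sqft': 2000, 'proptype': 3,
    **{col: 1 for col in X_train.columns if col.startswith('has')},
}], columns=X_train.columns).fillna(0)

print(lin_reg.predict(sample))  # or rf.predict(sample)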

PostgreSQL Database

The properties table was created from the CSV file exported earlier:

df_combined.to_csv('df_combined.csv', sep=',', encoding='utf-8')

Table structure

create table properties
(
row_id text,
soldprice text,
listdate text,
listprice numeric,
expireddate text,
dom numeric,
dto integer,
address text,
city text,
state text,
zip integer,
solddate text,
beds integer,
baths numeric,
sqft integer,
age integer,
lotsize integer,
agentname text,
officename text,
officephone text,
showinginstructions text,
area integer,
style text,
level integer,
garage integer,
heating text,
cooling text,
elementaryschool text,
juniorhighschool text,
highschool text,
remarks text,
proptype integer,
streetname text,
housenum1 text,
housenum2 numeric,
photourl text,
otherfeatures text,
hasotherfeatures integer,
hasproptype integer,
hasstreetname integer,
hashousenum1 integer,
hashousenum2 integer,
hasagentname integer,
hasofficename integer,
hasofficephone integer,
hasshowinginstructions integer,
hasstyle integer,
haslevel integer,
hasgarage integer,
hasheating integer,
hascooling integer,
haselementaryschool integer,
hasjuniorhighschool integer,
hashighschool integer,
haslistprice integer,
haslistdate integer,
hasaddress integer,
hascity integer,
hasstate integer,
hasarea integer,
proptype_str text,
zip_str text
);
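
One way to populate the table from df_combined.csv is to let pandas write directly into Postgres; a minimal sketch with SQLAlchemy and a Postgres driver installed, using placeholder connection details:

import pandas as pd
from sqlalchemy import create_engine

# Placeholder credentials/host; substitute the RDS endpoint used by the app.
engine = create_engine("postgresql://postgres:password@localhost:5432/properties_db")

# The CSV was exported with the default index, hence index_col=0.
df = pd.read_csv("df_combined.csv", index_col=0)
# Append rows into the existing 'properties' table created above (columns matched by name).
df.to_sql("properties", engine, if_exists="append", index=False)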

Table Optimization

ALTER TABLE properties ADD PRIMARY KEY (row_id);
CREATE INDEX search_price ON properties (listprice, zip, photourl);
CREATE INDEX search_price1 ON properties (zip, photourl);
CREATE INDEX search_price2 ON properties (listprice, zip);
CREATE INDEX search_price3 ON properties (listprice, photourl);

Users table

create table users
(
uid serial
constraint users_pkey
primary key,
username text not null,
password text not null
);

Properties Table Query

Query Sample

zipcode = 2053
beds = 3
baths = 2.0
sqft = 200
proptype = 1
hasotherfeatures = 1
hasproptype = 1
hasstreetname = 1
hashousenum1 = 0
hashousenum2 = 1
hasagentname = 1
hasofficename = 1
hasofficephone = 1
hasshowinginstructions = 1
hasstyle = 1
haslevel = 1
hasgarage = 1
hasheating = 1
hascooling = 0
haselementaryschool = 0
hasjuniorhighschool = 0
hashighschool = 0
haslistprice = 1
haslistdate = 0
hasaddress = 1
hascity = 1
hasstate = 1
hasarea = 1
query = (
"select * from properties where 1=1 and "
"zip = {zipcode} and beds ={beds} and baths = {baths} and sqft = {sqft} and proptype = {proptype} and hasotherfeatures = {hasotherfeatures} and "
"hasproptype = {hasproptype} and hasstreetname = {hasstreetname} and hashousenum1 = {hashousenum1} and hashousenum2 = {hashousenum2} and "
"hasagentname = hasagentname and hasofficename = hasofficename and hasofficephone = hasofficephone and hasshowinginstructions = hasshowinginstructions"
" and hasstyle = {hasstyle} and haslevel = {haslevel} and hasgarage = {hasgarage} and hasheating = {hasheating} and hascooling = {hascooling} and "
"haselementaryschool = {haselementaryschool} and hasjuniorhighschool = {hasjuniorhighschool} and "
"hashighschool = {hashighschool} and haslistprice = {haslistprice} and haslistdate = {haslistdate} and hasaddress = {hasaddress} and "
"hascity = {hascity} and hasstate = {hasstate} and hasarea = {hasarea}".format(
zipcode=zipcode,
beds=beds,
baths=baths,
sqft=sqft,
proptype=proptype,
hasotherfeatures=hasotherfeatures,
hasproptype=hasproptype,
hasstreetname=hasstreetname,
hashousenum1=hashousenum1,
hashousenum2=hashousenum2,
hasagentname=hasagentname,
hasofficename=hasofficename,
hasofficephone=hasofficephone,
hasshowinginstructions=hasshowinginstructions,
hasstyle=hasstyle,
haslevel=haslevel,
hasgarage=hasgarage,
hasheating=hasheating,
hascooling=hascooling,
haselementaryschool=haselementaryschool,
hasjuniorhighschool=hasjuniorhighschool,
hashighschool=hashighschool,
haslistprice=haslistprice,
haslistdate=haslistdate,
hasaddress=hasaddress,
hascity=hascity,
hasstate=hasstate,
hasarea=hasarea,
)
)
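
Building the SQL with str.format works, but it is fragile and injection-prone. A hedged alternative using a parameterized query (psycopg2 assumed as the driver, and only a few of the filters shown for brevity):

import psycopg2

# Placeholder connection settings; substitute the app's RDS credentials.
conn = psycopg2.connect(host="localhost", dbname="properties_db",
                        user="postgres", password="password")

sql = """
    SELECT * FROM properties
    WHERE zip = %(zipcode)s AND beds = %(beds)s AND baths = %(baths)s
      AND sqft = %(sqft)s AND proptype = %(proptype)s
"""
with conn, conn.cursor() as cur:
    # The driver handles quoting/escaping of every parameter value.
    cur.execute(sql, {"zipcode": zipcode, "beds": beds, "baths": baths,
                      "sqft": sqft, "proptype": proptype})
    rows = cur.fetchall()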

Cloud Architecture

Network Topology

The Price Prediction app is a two-tier architecture: the database tier lives in a private subnet, and an AWS Lambda function runs in a public subnet that can reach both the private subnet and the internet through gateways.

AWS Network Configuration

The following are the components of our application architecture.

Internet Gateway

Virtual private cloud (VPC)

Subnetworks

Relational Database Service

Amazon Simple Storage Service

Images in S3 Services

Serverless Platform

AWS Lambda is an event-driven, serverless computing platform.

GitHub Actions

Deploy to Lambda

name: AWS Deploy
defaults:
  run:
    working-directory: "./predictions-app"
on: [push]
env:
  AWS_PROFILE: ${{ secrets.AWS_PROFILE }}
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  AWS_DEFAULT_REGION: ${{ secrets.AWS_REGION }}
  SYSTEM_VERSION_COMPAT: ${{ secrets.SYSTEM_VERSION_COMPAT }}
  BUCKET_URI: ${{ secrets.BUCKET_URI }}
jobs:
  test:
    name: test develop
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: "./predictions-app"
    env:
      working-directory: "./predictions-app"
    if: github.ref == 'refs/heads/develop'
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: 3.8
      - name: Install apt-get update
        run: |
          sudo apt-get update
      - name: Install python app
        run: |
          python -m pip install --upgrade pipenv
          pipenv install --deploy
        working-directory: ${{ env.working-directory }}
  submit:
    name: Submit to AWS
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: "./predictions-app"
    env:
      working-directory: "./predictions-app"
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: 3.8
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}
      - name: Set up Node.js
        uses: actions/setup-node@v1
        with:
          node-version: "12"
      - run: npm install
        working-directory: ${{ env.working-directory }}
      - name: Install apt-get update
        run: |
          sudo apt-get update
      - name: Install serverless
        run: |
          npm install serverless -g
        working-directory: ${{ env.working-directory }}
      - name: Install awscli
        run: |
          sudo apt-get install awscli
      - name: Install python app
        run: |
          python -m pip install --upgrade pipenv
          pipenv install --deploy
        working-directory: ${{ env.working-directory }}
      - name: Set up Docker
        uses: docker-practice/actions-setup-docker@master
      - name: Deploy Python App to AWS Lambda
        run: |
          aws configure set aws_access_key_id ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws configure set aws_secret_access_key ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws configure set default.region ${{ secrets.AWS_REGION }}
          serverless deploy
        working-directory: ${{ env.working-directory }}

Serverless

The Python web app runs on a serverless service that executes code in response to events and automatically manages the computing resources the app requires.

Handling sessions on a stateless server

We have to handle the statelessness of the Lambda by setting a cookie alongside the server-side session

  • Login
session["username"] = username
response = make_response(redirect(url_for("index")))
response.set_cookie('username', session["username"])
print('cookie login pass:', request.cookies.get('username'))
  • Logout
response = make_response(redirect(url_for("index")))
response.set_cookie('username', '', expires=0)
session.clear()
  • Navigation
print('cookie:', request.cookies.get('username'))
username_value = session['username'] if 'username' in session else request.cookies.get('username')
if username_value is not None:
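
Pulled together, the pattern looks roughly like the sketch below; the route names and responses are assumptions, and only the cookie-handling fragments above come from the app itself:

from flask import Flask, make_response, redirect, request, session, url_for

app = Flask(__name__)
app.secret_key = "change-me"  # placeholder; in the real app this comes from configuration

@app.route("/login", methods=["POST"])
def login():
    username = request.form["username"]
    # ...credential check against the users table goes here...
    session["username"] = username
    response = make_response(redirect(url_for("index")))
    # Mirror the session into a cookie so a fresh Lambda container can still identify the user.
    response.set_cookie("username", username)
    return response

@app.route("/logout", methods=["POST"])
def logout():
    response = make_response(redirect(url_for("index")))
    response.set_cookie("username", "", expires=0)
    session.clear()
    return response

@app.route("/")
def index():
    # Fall back to the cookie when the in-memory session is empty (new container).
    username_value = session.get("username") or request.cookies.get("username")
    return f"Hello, {username_value or 'guest'}"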

YAML configuration

service: serverless-flask-pp-app
frameworkVersion: "2 || 3"
plugins:
  - serverless-python-requirements
  - serverless-apigwy-binary
  - serverless-prune-plugin
  - serverless-wsgi
custom:
  apigwBinary:
    types:
      - "image/jpeg"
      - "image/png"
  prune:
    automatic: true
    number: 3
  pythonRequirements:
    slim: true
    strip: false
    zip: true
  scriptable:
    hooks:
      after:package:createDeploymentArtifacts:
        - ./shake.sh
  tableName: "properties"
  wsgi:
    app: routes.app
    packRequirements: false
provider:
  name: aws
  runtime: python3.8
  stage: dev
  region: us-east-1
  lambdaHashingVersion: 20201221
  environment:
    PROPERTIES_TABLE: "properties"
    USERS_TABLE: "users"
    POSTGRES_HOST:
    POSTGES_DATABASE_URI: ""
    POSTGRES_DB: ""
    POSTGRES_USER: "postgres"
    POSTGRES_PASSW: ""
    POSTGRES_PORT: "5432"
  exclude:
    - .git/**
    - .vscode/**
    - node_modules/**
    - venv/**
  iamRoleStatements:
    - Effect: "Allow"
      Action:
        - ec2:CreateNetworkInterface
        - ec2:DescribeNetworkInterfaces
        - ec2:DeleteNetworkInterface
      Resource: "*"
functions:
  app:
    handler: wsgi_handler.handler
    events:
      - http: ANY /
        contentHandling: CONVERT_TO_BINARY
      - http: POST /predict
      - http: POST /logout
      - http: ANY {proxy+}
        contentHandling: CONVERT_TO_BINARY
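
The environment block above is how the Lambda receives its database settings; inside the Flask app they would presumably be read with os.environ, roughly:

import os

# Read connection settings injected by the serverless environment block above.
DB_SETTINGS = {
    "host": os.environ.get("POSTGRES_HOST", ""),
    "dbname": os.environ.get("POSTGRES_DB", ""),
    "user": os.environ.get("POSTGRES_USER", "postgres"),
    "password": os.environ.get("POSTGRES_PASSW", ""),
    "port": os.environ.get("POSTGRES_PORT", "5432"),
}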

Web Application

Registration

The Price Prediction app allows users to register as authorized users so they can manage their saved listings.

Register

The “Register” link at the top takes the guest to register as a Price Predictor user. This simple form only requires the user to enter a username and a password.

Login

Once users have registered, they can log in and save, review, or remove saved homes.

Search

This is the main page where the user can search for homes using the options provided. ZIP code is required; the rest of the fields are optional but preset with default values.

Search Results

Homes matching the search criteria are displayed once the user hits the search button. The listings are shown with the address and brief highlights.

Property Details

Details about the individual property can be viewed once the user clicks on the address from the search results page. This page displays the pricing and detailed information about the property.

Favorites

Registered users can mark a property as a favorite for future viewing.
