House Price Prediction: an ML Model with Python and AWS Lambda
We will predict the best market price for a property based on its location and features such as size, bedrooms, bathrooms, amenities, and other attributes relevant to a property's value. First, we evaluate the dataset and build an effective predictive model for real estate prices. Then, we build a web app to serve the predictions.
Github repository
Prediction Model
Create the python environment
pipenv install --python 3.8
pipenv install Cython Flask gunicorn Jinja2 numpy pandas plotly python-dateutil requests scikit-learn matplotlib seaborn jupyter
Pipfile
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
matplotlib = "*"
cython = "*"
flask = "*"
gunicorn = "*"
jinja2 = "*"
numpy = "*"
pandas = "*"
plotly = "*"
python-dateutil = "*"
requests = "*"
scikit-learn = "*"
seaborn = "*"
jupyter = "*"

[dev-packages]

[requires]
python_version = "3.8"
Merge dataset
head -1 train/CON__OCT_2016.csv > train.csv
tail -q -n +2 train/*.csv >> train.csv

head -1 test/test_CON_Jan_2018.csv > test.csv
tail -q -n +2 test/*.csv >> test.csv
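The same merge can be sketched in pandas instead of the shell; `merge_csvs` is a hypothetical helper, and the `train/*.csv` pattern matches the paths the shell commands use.

```python
import glob

import pandas as pd

# Read every CSV matched by the pattern, stack them, and write one file.
# merge_csvs is a hypothetical helper, not part of the original notebook.
def merge_csvs(pattern, out_path):
    frames = [pd.read_csv(p) for p in sorted(glob.glob(pattern))]
    merged = pd.concat(frames, ignore_index=True)
    merged.to_csv(out_path, index=False)
    return merged

# merged = merge_csvs("train/*.csv", "train.csv")
```

Unlike the `tail` approach, `pd.concat` aligns columns by name, so it also works if the monthly files list their columns in different orders.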
Load dataset
# Load data from CSV
train_dataset = pd.read_csv("static/data/train.csv", dtype={'zip':'category', 'mlsnum':'string', 'proptype':'string'})
test = pd.read_csv("static/data/test.csv", dtype={'zip':'category', 'mlsnum':'string', 'proptype':'string'})

train_dataset.head()
Column List
['otherfeatures', 'status', 'solddate', 'city', 'showinginstructions', 'garage', 'address', 'agentname', 'listdate', 'cooling', 'dom', 'soldprice', 'level', 'housenum1', 'mlsnum', 'baths', 'officename', 'style', 'expireddate', 'area', 'elementaryschool', 'zip', 'sqft', 'officephone', 'listprice', 'beds', 'highschool', 'heating', 'juniorhighschool', 'dto', 'age', 'remarks', 'photourl', 'row_id', 'proptype', 'state', 'streetname', 'housenum2', 'lotsize']
Dataset Description
train.describe()
test.describe()
Dataset dimensions
train_dataset.shape, test.shape
# ((500468, 39), (13346, 39))
Join Train and test
train_dataset.index = train_dataset['row_id']
test.index = test['row_id']
frames = [train_dataset, test]
df_combined = pd.concat(frames)
df_combined = df_combined.reset_index(drop=True)
df_combined.shape
# (513814, 39)
Create Binary Variables for Open Answers
df_combined['hasotherfeatures'] = np.where(df_combined['otherfeatures'].isnull(), 0, 1)
df_combined['hasproptype'] = np.where(df_combined['proptype'].isnull(), 0, 1)
df_combined['hasstreetname'] = np.where(df_combined['streetname'].isnull(), 0, 1)
df_combined['hashousenum1'] = np.where(df_combined['housenum1'].isnull(), 0, 1)
df_combined['hashousenum2'] = np.where(df_combined['housenum2'].isnull(), 0, 1)
df_combined['hasagentname'] = np.where(df_combined['agentname'].isnull(), 0, 1)
df_combined['hasofficename'] = np.where(df_combined['officename'].isnull(), 0, 1)
df_combined['hasofficephone'] = np.where(df_combined['officephone'].isnull(), 0, 1)
df_combined['hasshowinginstructions'] = np.where(df_combined['showinginstructions'].isnull(), 0, 1)
df_combined['hasstyle'] = np.where(df_combined['style'].isnull() , 0, 1)
df_combined['haslevel'] = np.where(df_combined['level'].isnull(), 0, 1)
df_combined['hasgarage'] = np.where(df_combined['garage'].isnull(), 0, 1)
df_combined['hasheating'] = np.where(df_combined['heating'].isnull(), 0, 1)
df_combined['hascooling'] = np.where(df_combined['cooling'].isnull(), 0, 1)
df_combined['haselementaryschool'] = np.where(df_combined['elementaryschool'].isnull(), 0, 1)
df_combined['hasjuniorhighschool'] = np.where(df_combined['juniorhighschool'].isnull(), 0, 1)
df_combined['hashighschool'] = np.where(df_combined['highschool'].isnull(), 0, 1)
df_combined['haslistprice'] = np.where(df_combined['listprice'].isnull(), 0, 1)
df_combined['haslistdate'] = np.where(df_combined['listdate'].isnull(), 0, 1)
df_combined['hasaddress'] = np.where(df_combined['address'].isnull(), 0, 1)
df_combined['hascity'] = np.where(df_combined['city'].isnull(), 0, 1)
df_combined['hasstate'] = np.where(df_combined['state'].isnull(), 0, 1)
df_combined['hasarea'] = np.where(df_combined['area'].isnull(), 0, 1)
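The repetitive `np.where` calls above can be collapsed into one loop; `add_presence_flags` is a hypothetical helper, shown here on toy data rather than the real dataframe.

```python
import numpy as np
import pandas as pd

# Hypothetical helper: build a has<col> binary flag for each free-text
# column, exactly as the per-column np.where assignments above do.
def add_presence_flags(df, columns):
    for col in columns:
        df['has' + col] = np.where(df[col].isnull(), 0, 1)
    return df

demo = pd.DataFrame({'garage': ['2 car', None], 'style': [None, 'Colonial']})
demo = add_presence_flags(demo, ['garage', 'style'])
print(demo[['hasgarage', 'hasstyle']].values.tolist())  # [[1, 0], [0, 1]]
```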
New List of columns
df_combined.columns

Index(['mlsnum', 'status', 'listprice', 'soldprice', 'listdate', 'solddate',
'expireddate', 'dom', 'dto', 'address', 'city', 'state', 'zip', 'area',
'beds', 'baths', 'sqft', 'age', 'lotsize', 'agentname', 'officename',
'officephone', 'showinginstructions', 'remarks', 'style', 'level',
'garage', 'heating', 'cooling', 'elementaryschool', 'juniorhighschool',
'highschool', 'otherfeatures', 'proptype', 'streetname', 'housenum1',
'housenum2', 'photourl', 'row_id', 'hasotherfeatures', 'hasproptype',
'hasstreetname', 'hashousenum1', 'hashousenum2', 'hasagentname',
'hasofficename', 'hasofficephone', 'hasshowinginstructions', 'hasstyle',
'haslevel', 'hasgarage', 'hasheating', 'hascooling',
'haselementaryschool', 'hasjuniorhighschool', 'hashighschool',
'haslistprice', 'haslistdate', 'hasaddress', 'hascity', 'hasstate',
'hasarea'],
dtype='object')
Convert columns to their respective numeric types and fill nulls with zeroes
df_combined['listprice'] = df_combined['listprice'].replace(',','', regex=True).astype(float)
df_combined['listprice'] = df_combined['listprice'].astype('float32')
df_combined['baths'] = df_combined['baths'].astype('float32')
df_combined['beds'] = df_combined['beds'].astype('int')

df_combined['city'] = df_combined['city'].astype('string')
df_combined['state'] = df_combined['state'].astype('string') # category
df_combined['photourl'] = df_combined['photourl'].astype('string')
df_combined['zip'] = df_combined['zip'].astype('string')
df_combined.city.fillna("00000", inplace=True)
df_combined.zip.fillna("00000", inplace=True)
df_combined.state.fillna("NA", inplace=True)
df_combined.photourl.fillna("NA", inplace=True)
# keep these variables as original string for csv export
df_combined['proptype_str'] = df_combined['proptype'].astype('string')
df_combined['zip_str'] = df_combined['zip'].astype('string')

df_combined['listprice'] = pd.to_numeric(df_combined['listprice'], errors='ignore')
df_combined['beds'] = pd.to_numeric(df_combined['beds'], errors='ignore')
df_combined['baths'] = pd.to_numeric(df_combined['baths'], errors='ignore')
df_combined['sqft'] = pd.to_numeric(df_combined['sqft'], errors='ignore')
df_combined['zip'] = pd.to_numeric(df_combined['zip'], errors='ignore')

df_combined.listprice.fillna(0, inplace=True)
df_combined.zip.fillna(0, inplace=True)
df_combined.sqft.fillna(0, inplace=True)
df_combined.beds.fillna(0, inplace=True)
df_combined.baths.fillna(0, inplace=True)

df_combined.loc[df_combined.proptype == 'CC', 'proptype'] = '1'
df_combined.loc[df_combined.proptype == 'MF', 'proptype'] = '2'
df_combined.loc[df_combined.proptype == 'SF', 'proptype'] = '3'
df_combined.loc[df_combined.proptype == 'EXT', 'proptype'] = '4'
df_combined[df_combined.eq('00nan').any(1)]
df_combined.loc[df_combined['zip'] == '00nan', ['zip']] = 0
df_combined.loc[df_combined['zip'] == 'Centerville', ['zip']] = 0
df_combined['proptype'] = df_combined['proptype'].astype('int')
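The four `.loc` assignments for `proptype` amount to a single mapping; a sketch with `Series.map` on toy data, using the same codes (CC=1, MF=2, SF=3, EXT=4):

```python
import pandas as pd

# One mapping dict instead of four .loc assignments; codes match the
# ones used above.
proptype_codes = {'CC': 1, 'MF': 2, 'SF': 3, 'EXT': 4}

s = pd.Series(['CC', 'SF', 'MF', 'EXT'])
encoded = s.map(proptype_codes).astype('int')
print(encoded.tolist())  # [1, 3, 2, 4]
```

`map` also leaves unmatched values as NaN, which surfaces unexpected property types instead of silently keeping the original string.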
Create a new dataset csv
- This will be used to create the database
df_combined.to_csv('df_combined.csv', sep=',', encoding='utf-8')
Drop all non-numeric columns
new_drop_columns = [ 'status','listdate', 'address','area',
'agentname','officename', 'officephone',
'showinginstructions', 'remarks', 'style',
'level', 'garage', 'heating', 'cooling',
'elementaryschool', 'juniorhighschool',
'highschool', 'otherfeatures', 'streetname',
'housenum1', 'housenum2', 'mlsnum', 'soldprice', 'solddate',
'dom', 'dto', 'expireddate', 'age', 'lotsize',
'row_id', 'proptype_str', 'zip_str', 'city', 'state', 'photourl']
df_combined.drop(new_drop_columns, axis=1, inplace=True)
Only Numerical columns
Index(['listprice', 'zip', 'beds', 'baths', 'sqft', 'proptype',
'hasotherfeatures', 'hasproptype', 'hasstreetname', 'hashousenum1',
'hashousenum2', 'hasagentname', 'hasofficename', 'hasofficephone',
'hasshowinginstructions', 'hasstyle', 'haslevel', 'hasgarage',
'hasheating', 'hascooling', 'haselementaryschool',
'hasjuniorhighschool', 'hashighschool', 'haslistprice', 'haslistdate',
'hasaddress', 'hascity', 'hasstate', 'hasarea'],
dtype='object')
Visualization
List Price
The column listprice is our target (dependent) variable; all the other columns are our independent variables.
- Minimum price: $1,000.00
- Maximum price: $90,000,000.00
- Mean price: $631,906.00
- Median price $415,900.00
- Standard deviation of prices: $1,073,829.25
- First quartile of prices: $269,900.00
- Second quartile of prices: $654,900.00
- Interquartile (IQR) of prices: $385,000.00
Beds
The column beds is one of the independent variables.
- Minimum beds: 0.00
- Maximum beds: 20.00
- Mean beds: 3.45
- Median beds 3.00
- The standard deviation of beds: 1.69
- First quartile of beds: 3.00
- Second quartile of beds: 4.00
- Interquartile (IQR) of beds: 1.00
Baths
The column baths is one of the independent variables.
- Minimum baths: 0.00
- Maximum baths: 15.00
- Mean baths: 2.15
- Median baths 2.00
- Standard deviation of baths: 1.24
- First quartile of baths: 1.00
- Second quartile of baths: 2.50
- Interquartile (IQR) of baths: 1.50
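The summary statistics listed for listprice, beds, and baths can all be derived the same way; a sketch on toy data (`describe_column` is a hypothetical helper, and `prices` is not the real listprice series):

```python
import pandas as pd

# Compute the min/max/mean/median/std and quartile statistics reported
# in the lists above for any numeric column.
def describe_column(s):
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    return {'min': s.min(), 'max': s.max(), 'mean': s.mean(),
            'median': s.median(), 'std': s.std(),
            'q1': q1, 'q3': q3, 'iqr': q3 - q1}

prices = pd.Series([100000, 250000, 400000, 650000, 900000])
stats = describe_column(prices)
print(stats['iqr'])  # 400000.0
```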
Plots
Beds vs List Price top 100
Baths vs List Price top 100
Beds vs List Price
Baths vs List Price
SQFT vs List Price
Beds vs Proptype vs Listprice
Baths vs Proptype vs Listprice
Boxplot
Boxplot List Price
Boxplot Beds
Boxplot Baths
Boxplot SQFT
Histogram
Beds Histogram
Baths Histogram
Model Analysis
We try Linear Regression and Random Forest as base models to analyze our dataset.
Linear Regression
Because the Price Prediction project is a regression problem, we approached the modeling with a Linear Regression base model. R-squared measures how well a linear regression model fits a dataset. In our regression output we see R² = 0.25, indicating that 25% of the variance in price is explained by the model's features. In many cases, values of 26% or above indicate a high effect size; in this respect, our model sits between a low and a medium effect size.
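A minimal sketch of this base-model workflow on synthetic data; the real model is trained on the numeric columns kept above (zip, beds, baths, sqft, proptype, and the has* flags) with listprice as the target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the numeric feature matrix and price target.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = X @ np.array([3.0, 1.5, -2.0, 0.5]) + rng.normal(scale=2.0, size=500)

# Fit on a train split and score R² on the held-out split, as in the
# analysis above.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)
model = LinearRegression().fit(X_tr, y_tr)
r2 = r2_score(y_te, model.predict(X_te))
print(round(r2, 2))
```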
Random forest
We run RandomForestRegressor with the best parameters found in the linear regression analysis. In the Random Forest output we see R² = -0.23, which means the model performs worse on the test set than simply predicting the mean. This is not too surprising for a random forest, which tends to fit the training set extremely well due to how exhaustive the algorithm is, and can therefore generalize poorly.
Model Tested
array([557572.21951763])
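A sketch of fitting the RandomForestRegressor and producing a single-element prediction array like the one shown above; the data and feature values here are synthetic, not real listings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the feature matrix and list prices.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 4))
y = 500000 + 100000 * X[:, 0] + rng.normal(scale=5000, size=200)

# Fit the forest and predict a single (illustrative) listing.
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
sample = np.array([[0.5, 0.2, 0.7, 0.1]])
print(rf.predict(sample))  # a one-element array, like the output above
```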
PostgreSQL Database
The properties table was created from the CSV file:
df_combined.to_csv('df_combined.csv', sep=',', encoding='utf-8')
table structure
create table properties
(
row_id text,
soldprice text,
listdate text,
listprice numeric,
expireddate text,
dom numeric,
dto integer,
address text,
city text,
state text,
zip integer,
solddate text,
beds integer,
baths numeric,
sqft integer,
age integer,
lotsize integer,
agentname text,
officename text,
officephone text,
showinginstructions text,
area integer,
style text,
level integer,
garage integer,
heating text,
cooling text,
elementaryschool text,
juniorhighschool text,
highschool text,
remarks text,
proptype integer,
streetname text,
housenum1 text,
housenum2 numeric,
photourl text,
otherfeatures text,
hasotherfeatures integer,
hasproptype integer,
hasstreetname integer,
hashousenum1 integer,
hashousenum2 integer,
hasagentname integer,
hasofficename integer,
hasofficephone integer,
hasshowinginstructions integer,
hasstyle integer,
haslevel integer,
hasgarage integer,
hasheating integer,
hascooling integer,
haselementaryschool integer,
hasjuniorhighschool integer,
hashighschool integer,
haslistprice integer,
haslistdate integer,
hasaddress integer,
hascity integer,
hasstate integer,
hasarea integer,
proptype_str text,
zip_str text
);
Table Optimization
ALTER TABLE properties ADD PRIMARY KEY (row_id);

CREATE INDEX search_price ON properties (listprice, zip, photourl);
CREATE INDEX search_price1 ON properties ( zip, photourl);
CREATE INDEX search_price2 ON properties (listprice, zip);
CREATE INDEX search_price3 ON properties (listprice, photourl);
Users table
create table users
(
uid serial
constraint users_pkey
primary key,
username text not null,
password text not null
);
Properties Table Query
Query Sample
zipcode = 2053
beds = 3
baths = 2.0
sqft = 200
proptype = 1
hasotherfeatures = 1
hasproptype = 1
hasstreetname = 1
hashousenum1 = 0
hashousenum2 = 1
hasagentname = 1
hasofficename = 1
hasofficephone = 1
hasshowinginstructions = 1
hasstyle = 1
haslevel = 1
hasgarage = 1
hasheating = 1
hascooling = 0
haselementaryschool = 0
hasjuniorhighschool = 0
hashighschool = 0
haslistprice = 1
haslistdate = 0
hasaddress = 1
hascity = 1
hasstate = 1
hasarea = 1
query = (
"select * from properties where 1=1 and "
"zip = {zipcode} and beds ={beds} and baths = {baths} and sqft = {sqft} and proptype = {proptype} and hasotherfeatures = {hasotherfeatures} and "
"hasproptype = {hasproptype} and hasstreetname = {hasstreetname} and hashousenum1 = {hashousenum1} and hashousenum2 = {hashousenum2} and "
"hasagentname = {hasagentname} and hasofficename = {hasofficename} and hasofficephone = {hasofficephone} and hasshowinginstructions = {hasshowinginstructions}"
" and hasstyle = {hasstyle} and haslevel = {haslevel} and hasgarage = {hasgarage} and hasheating = {hasheating} and hascooling = {hascooling} and "
"haselementaryschool = {haselementaryschool} and hasjuniorhighschool = {hasjuniorhighschool} and "
"hashighschool = {hashighschool} and haslistprice = {haslistprice} and haslistdate = {haslistdate} and hasaddress = {hasaddress} and "
"hascity = {hascity} and hasstate = {hasstate} and hasarea = {hasarea}".format(
zipcode=zipcode,
beds=beds,
baths=baths,
sqft=sqft,
proptype=proptype,
hasotherfeatures=hasotherfeatures,
hasproptype=hasproptype,
hasstreetname=hasstreetname,
hashousenum1=hashousenum1,
hashousenum2=hashousenum2,
hasagentname=hasagentname,
hasofficename=hasofficename,
hasofficephone=hasofficephone,
hasshowinginstructions=hasshowinginstructions,
hasstyle=hasstyle,
haslevel=haslevel,
hasgarage=hasgarage,
hasheating=hasheating,
hascooling=hascooling,
haselementaryschool=haselementaryschool,
hasjuniorhighschool=hasjuniorhighschool,
hashighschool=hashighschool,
haslistprice=haslistprice,
haslistdate=haslistdate,
hasaddress=hasaddress,
hascity=hascity,
hasstate=hasstate,
hasarea=hasarea,
)
)
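String-formatted SQL like the query above is open to SQL injection; passing the values as query parameters is safer. A sketch assuming psycopg2 as the driver (the connection details are assumptions, and only a few of the filters are shown):

```python
# Values travel separately from the SQL text, so the driver quotes them
# safely; %(name)s is the psycopg2 named-placeholder style.
params = {'zipcode': 2053, 'beds': 3, 'baths': 2.0, 'sqft': 200}
query = (
    "select * from properties "
    "where zip = %(zipcode)s and beds = %(beds)s "
    "and baths = %(baths)s and sqft = %(sqft)s"
)

# with psycopg2.connect(host=..., dbname=...) as conn:
#     with conn.cursor() as cur:
#         cur.execute(query, params)
#         rows = cur.fetchall()
print(query)
```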
Cloud Architecture
Network Topology
The Price Prediction app is a two-tier architecture: the database tier lives in a private subnet, and an AWS serverless Lambda in a public subnet connects the private subnet to the internet through gateways.
AWS Network Configuration
The following is the configuration of our application architecture.
Internet Gateway
Virtual private cloud (VPC)
Subnetworks
Relational Database Service
Amazon Simple Storage Service
Images in S3 Services
Serverless Platform
AWS Lambda is an event-driven, serverless computing platform
Github Actions
deploy to Lambda
name: AWS Deploy
defaults:
run:
working-directory: "./predictions-app"
on: [push]
env:
AWS_PROFILE: ${{ secrets.AWS_PROFILE }}
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
AWS_DEFAULT_REGION: ${{ secrets.AWS_REGION }}
SYSTEM_VERSION_COMPAT: ${{ secrets.SYSTEM_VERSION_COMPAT }}
BUCKET_URI: ${{ secrets.BUCKET_URI}}
jobs:
test:
name: test develop
runs-on: ubuntu-latest
defaults:
run:
working-directory: "./predictions-app"
env:
working-directory: "./predictions-app"
if: github.ref == 'refs/heads/develop'
steps:
- uses: actions/checkout@v2
- name: Set up Python
uses: actions/setup-python@v2
with:
python-version: 3.8
- name: Run apt-get update
run: |
sudo apt-get update
- name: Install python app
run: |
python -m pip install --upgrade pipenv
pipenv install --deploy
working-directory: ${{env.working-directory}}
submit:
name: Submit to AWS
runs-on: ubuntu-latest
defaults:
run:
working-directory: "./predictions-app"
env:
working-directory: "./predictions-app"
if: github.ref == 'refs/heads/main'
steps:
- uses: actions/checkout@v2
- name: Set up Python
uses: actions/setup-python@v2
with:
python-version: 3.8
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v1
with:
aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
aws-region: ${{ secrets.AWS_REGION }}
- name: Set up Node.js
uses: actions/setup-node@v1
with:
node-version: "12"
- run: npm install
working-directory: ${{env.working-directory}}
- name: Run apt-get update
run: |
sudo apt-get update
- name: Install serverless
run: |
npm install serverless -g
working-directory: ${{env.working-directory}}
- name: Install awscli
run: |
sudo apt-get install awscli
- name: Install python app
run: |
python -m pip install --upgrade pipenv
pipenv install --deploy
working-directory: ${{env.working-directory}}
- name: Set up Docker
uses: docker-practice/actions-setup-docker@master
- name: Deploy Python App to AWS Lambda
run: |
aws configure set aws_access_key_id ${{ secrets.AWS_ACCESS_KEY_ID }}
aws configure set aws_secret_access_key ${{ secrets.AWS_SECRET_ACCESS_KEY }}
aws configure set default.region ${{ secrets.AWS_REGION }}
serverless deploy
working-directory: ${{env.working-directory}}
Serverless
The Python web app runs on a serverless service that executes code in response to events and automatically manages the computing resources the app requires.
Handle session stateless server
We have to handle the stateless nature of Lambda by adding a cookie alongside the server-side session:
- Login
session["username"] = username
response = make_response(redirect(url_for("index")))
response.set_cookie('username', session["username"])
print('cookie login pass:', request.cookies.get('username'))
- Logout
response = make_response(redirect(url_for("index")))
response.set_cookie('username', '', expires=0)
session.clear()
- Navigation
print('cookie:', request.cookies.get('username'))
username_value = session['username'] if 'username' in session else request.cookies.get('username')
if username_value is not None:
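Put together, the three fragments above fit into Flask routes like these (a minimal sketch; the app setup, secret key, and route names are assumptions for illustration):

```python
from flask import Flask, make_response, redirect, request, session, url_for

app = Flask(__name__)
app.secret_key = "change-me"  # placeholder; use a real secret in production

@app.route("/")
def index():
    # Fall back to the cookie when a fresh Lambda container lost the session.
    username = session.get("username") or request.cookies.get("username")
    return f"user: {username}"

@app.route("/login/<username>")
def login(username):
    # Store the username in both the session and a cookie.
    session["username"] = username
    response = make_response(redirect(url_for("index")))
    response.set_cookie("username", username)
    return response

@app.route("/logout")
def logout():
    # Expire the cookie and clear the server-side session.
    response = make_response(redirect(url_for("index")))
    response.set_cookie("username", "", expires=0)
    session.clear()
    return response
```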
YAML configuration
service: serverless-flask-pp-app
frameworkVersion: "2 || 3"

plugins:
- serverless-python-requirements
- serverless-apigwy-binary
- serverless-prune-plugin
- serverless-wsgi

custom:
apigwBinary:
types:
- "image/jpeg"
- "image/png"
prune:
automatic: true
number: 3
pythonRequirements:
slim: true
strip: false
zip: true
scriptable:
hooks:
after:package:createDeploymentArtifacts:
- ./shake.sh
tableName: "properties"
wsgi:
app: routes.app
packRequirements: false

provider:
name: aws
runtime: python3.8
stage: dev
region: us-east-1
lambdaHashingVersion: 20201221
environment:
PROPERTIES_TABLE: "properties"
USERS_TABLE: "users"
POSTGRES_HOST:
POSTGES_DATABASE_URI: ""
POSTGRES_DB: ""
POSTGRES_USER: "postgres"
POSTGRES_PASSW: ""
POSTGRES_PORT: "5432"
exclude:
- .git/**
- .vscode/**
- node_modules/**
- venv/**
iamRoleStatements:
- Effect: "Allow"
Action:
- ec2:CreateNetworkInterface
- ec2:DescribeNetworkInterfaces
- ec2:DeleteNetworkInterface
Resource: "*"

functions:
app:
handler: wsgi_handler.handler
events:
- http: ANY /
contentHandling: CONVERT_TO_BINARY
- http: POST /predict
- http: POST /logout
- http: ANY {proxy+}
contentHandling: CONVERT_TO_BINARY
Web Application
Registration
The Price Prediction app allows users to register as authorized users, which lets them manage their saved listings.
Register
The “Register” link at the top takes the guest user to register as a Price Predictor user. This simple form only requires the user to enter a username and a password.
Login
Once a user has registered, they can log in and save, review, or remove saved homes.
Search
This is the main page where users can search for homes using the options provided. ZipCode is required; the rest of the fields are optional but preset with default values.
Search Results
Homes matching the search criteria are displayed once the user hits the search button. The listings are shown with the address and brief highlights.
Property Details
Details about the individual property can be viewed once the user clicks on the address from the search results page. This page displays the pricing and detailed information about the property.
Favorites
Registered users can mark a property as a favorite for future viewing.