House Price Prediction ML Model with Python and AWS Lambda

Francia Riesco
12 min read · Jan 20, 2022

We will predict the best market price for a property based on its location and on features such as size, bedrooms, bathrooms, amenities, and other attributes relevant to a property's value. First, we evaluate the dataset to build an effective predictive model for real estate prices; then we create a web app to show similar results.

Github repository

Prediction Model

Create the python environment

pipenv install --python 3.8
pipenv install Cython Flask gunicorn Jinja2 numpy pandas plotly python-dateutil requests scikit-learn matplotlib seaborn jupyter

Pipfile

[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"
[packages]
matplotlib = "*"
cython = "*"
flask = "*"
gunicorn = "*"
jinja2 = "*"
numpy = "*"
pandas = "*"
plotly = "*"
python-dateutil = "*"
requests = "*"
scikit-learn = "*"
seaborn = "*"
jupyter = "*"
[dev-packages]

[requires]
python_version = "3.8"

Merge dataset

head -1 train/CON__OCT_2016.csv > train.csv
# -q keeps tail from printing "==> file <==" headers when given multiple files
tail -q -n +2 train/*.csv >> train.csv
head -1 test/test_CON_Jan_2018.csv > test.csv
tail -q -n +2 test/*.csv >> test.csv
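
The same merge can also be done in pandas, which sidesteps the header handling entirely; a minimal sketch, assuming the same train/ and test/ folder layout:

# Hypothetical pandas-based merge; assumes every CSV in the folder shares one schema.
import glob
import pandas as pd

def merge_csvs(folder: str, out_path: str) -> None:
    # Read every CSV in the folder and concatenate them; headers are handled per file.
    frames = [pd.read_csv(path, low_memory=False) for path in sorted(glob.glob(f"{folder}/*.csv"))]
    pd.concat(frames, ignore_index=True).to_csv(out_path, index=False)

merge_csvs("train", "train.csv")
merge_csvs("test", "test.csv")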

Load dataset

# Load data from CSV
train_dataset = pd.read_csv("static/data/train.csv", dtype={'zip':'category', 'mlsnum':'string', 'proptype':'string'})
test = pd.read_csv("static/data/test.csv", dtype={'zip':'category', 'mlsnum':'string','proptype':'string'})
train_dataset.head()

Column List

['otherfeatures', 'status', 'solddate', 'city', 'showinginstructions', 'garage', 'address', 'agentname', 'listdate', 'cooling', 'dom', 'soldprice', 'level', 'housenum1', 'mlsnum', 'baths', 'officename', 'style', 'expireddate', 'area', 'elementaryschool', 'zip', 'sqft', 'officephone', 'listprice', 'beds', 'highschool', 'heating', 'juniorhighschool', 'dto', 'age', 'remarks', 'photourl', 'row_id', 'proptype', 'state', 'streetname', 'housenum2', 'lotsize']

Dataset Description

train_dataset.describe()
test.describe()

Dataset dimensions

train_dataset.shape, test.shape
# ((500468, 39), (13346, 39))

Join Train and test

train_dataset.index = train_dataset['row_id']
test.index = test['row_id']
frames = [train_dataset, test]
df_combined = pd.concat(frames)
df_combined = df_combined.reset_index(drop=True)
df_combined.shape
# (513814, 39)

Create Binary Variables for Open Answers

df_combined['hasotherfeatures'] = np.where(df_combined['otherfeatures'].isnull(), 0, 1)
df_combined['hasproptype'] = np.where(df_combined['proptype'].isnull(), 0, 1)
df_combined['hasstreetname'] = np.where(df_combined['streetname'].isnull(), 0, 1)
df_combined['hashousenum1'] = np.where(df_combined['housenum1'].isnull(), 0, 1)
df_combined['hashousenum2'] = np.where(df_combined['housenum2'].isnull(), 0, 1)
df_combined['hasagentname'] = np.where(df_combined['agentname'].isnull(), 0, 1)
df_combined['hasofficename'] = np.where(df_combined['officename'].isnull(), 0, 1)
df_combined['hasofficephone'] = np.where(df_combined['officephone'].isnull(), 0, 1)
df_combined['hasshowinginstructions'] = np.where(df_combined['showinginstructions'].isnull(), 0, 1)
df_combined['hasstyle'] = np.where(df_combined['style'].isnull() , 0, 1)
df_combined['haslevel'] = np.where(df_combined['level'].isnull(), 0, 1)
df_combined['hasgarage'] = np.where(df_combined['garage'].isnull(), 0, 1)
df_combined['hasheating'] = np.where(df_combined['heating'].isnull(), 0, 1)
df_combined['hascooling'] = np.where(df_combined['cooling'].isnull(), 0, 1)
df_combined['haselementaryschool'] = np.where(df_combined['elementaryschool'].isnull(), 0, 1)
df_combined['hasjuniorhighschool'] = np.where(df_combined['juniorhighschool'].isnull(), 0, 1)
df_combined['hashighschool'] = np.where(df_combined['highschool'].isnull(), 0, 1)
df_combined['haslistprice'] = np.where(df_combined['listprice'].isnull(), 0, 1)
df_combined['haslistdate'] = np.where(df_combined['listdate'].isnull(), 0, 1)
df_combined['hasaddress'] = np.where(df_combined['address'].isnull(), 0, 1)
df_combined['hascity'] = np.where(df_combined['city'].isnull(), 0, 1)
df_combined['hasstate'] = np.where(df_combined['state'].isnull(), 0, 1)
df_combined['hasarea'] = np.where(df_combined['area'].isnull(), 0, 1)
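
Since every flag follows the same pattern (1 when the source column is present, 0 when it is null), the same columns can equivalently be generated with a short loop:

import numpy as np

flag_columns = [
    'otherfeatures', 'proptype', 'streetname', 'housenum1', 'housenum2',
    'agentname', 'officename', 'officephone', 'showinginstructions', 'style',
    'level', 'garage', 'heating', 'cooling', 'elementaryschool',
    'juniorhighschool', 'highschool', 'listprice', 'listdate', 'address',
    'city', 'state', 'area',
]
for col in flag_columns:
    # 0 if the original value is missing, 1 otherwise
    df_combined[f'has{col}'] = np.where(df_combined[col].isnull(), 0, 1)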

New List of columns

df_combined.columns
Index(['mlsnum', 'status', 'listprice', 'soldprice', 'listdate', 'solddate',
'expireddate', 'dom', 'dto', 'address', 'city', 'state', 'zip', 'area',
'beds', 'baths', 'sqft', 'age', 'lotsize', 'agentname', 'officename',
'officephone', 'showinginstructions', 'remarks', 'style', 'level',
'garage', 'heating', 'cooling', 'elementaryschool', 'juniorhighschool',
'highschool', 'otherfeatures', 'proptype', 'streetname', 'housenum1',
'housenum2', 'photourl', 'row_id', 'hasotherfeatures', 'hasproptype',
'hasstreetname', 'hashousenum1', 'hashousenum2', 'hasagentname',
'hasofficename', 'hasofficephone', 'hasshowinginstructions', 'hasstyle',
'haslevel', 'hasgarage', 'hasheating', 'hascooling',
'haselementaryschool', 'hasjuniorhighschool', 'hashighschool',
'haslistprice', 'haslistdate', 'hasaddress', 'hascity', 'hasstate',
'hasarea'],
dtype='object')

Convert columns to their respective numeric values and fill nulls with zeros

df_combined['listprice'] =  df_combined['listprice'].replace(',','', regex=True).astype(float)
df_combined['listprice'] = df_combined['listprice'].astype('float32')
df_combined['baths'] = df_combined['baths'].astype('float32')
df_combined['beds'] = df_combined['beds'].astype('int')
df_combined['city'] = df_combined['city'].astype('string')
df_combined['state'] = df_combined['state'].astype('string') # category
df_combined['photourl'] = df_combined['photourl'].astype('string')
df_combined['zip'] = df_combined['zip'].astype('string')
df_combined.city.fillna("00000", inplace=True)
df_combined.zip.fillna("00000", inplace=True)
df_combined.state.fillna("NA", inplace=True)
df_combined.photourl.fillna("NA", inplace=True)
# keep these variables as original string for csv export
df_combined['proptype_str'] = df_combined['proptype'].astype('string')
df_combined['zip_str'] = df_combined['zip'].astype('string')
df_combined['listprice'] = pd.to_numeric(df_combined['listprice'], errors='ignore')
df_combined['beds'] = pd.to_numeric(df_combined['beds'], errors='ignore')
df_combined['baths'] = pd.to_numeric(df_combined['baths'], errors='ignore')
df_combined['sqft'] = pd.to_numeric(df_combined['sqft'], errors='ignore')
df_combined['zip'] = pd.to_numeric(df_combined['zip'], errors='ignore')
df_combined.listprice.fillna(0, inplace=True)
df_combined.zip.fillna(0, inplace=True)
df_combined.sqft.fillna(0, inplace=True)
df_combined.beds.fillna(0, inplace=True)
df_combined.baths.fillna(0, inplace=True)
df_combined.loc[df_combined.proptype == 'CC', 'proptype'] = '1'
df_combined.loc[df_combined.proptype == 'MF', 'proptype'] = '2'
df_combined.loc[df_combined.proptype == 'SF', 'proptype'] = '3'
df_combined.loc[df_combined.proptype == 'EXT', 'proptype'] = '4'
df_combined[df_combined.eq('00nan').any(axis=1)]
df_combined.loc[df_combined['zip'] == '00nan', ['zip']] = 0
df_combined.loc[df_combined['zip'] == 'Centerville', ['zip']] = 0
df_combined['proptype'] = df_combined['proptype'].astype('int')

Create a new dataset CSV

  • This will be used to create the database
df_combined.to_csv('df_combined.csv', sep=',', encoding='utf-8')

Drop all non-numeric columns

new_drop_columns = ['status', 'listdate', 'address', 'area',
                    'agentname', 'officename', 'officephone',
                    'showinginstructions', 'remarks', 'style',
                    'level', 'garage', 'heating', 'cooling',
                    'elementaryschool', 'juniorhighschool',
                    'highschool', 'otherfeatures', 'streetname',
                    'housenum1', 'housenum2', 'mlsnum', 'soldprice', 'solddate',
                    'dom', 'dto', 'expireddate', 'age', 'lotsize',
                    'row_id', 'proptype_str', 'zip_str', 'city', 'state', 'photourl']
df_combined.drop(new_drop_columns, axis=1, inplace=True)

Only Numerical columns

Index(['listprice', 'zip', 'beds', 'baths', 'sqft', 'proptype',
'hasotherfeatures', 'hasproptype', 'hasstreetname', 'hashousenum1',
'hashousenum2', 'hasagentname', 'hasofficename', 'hasofficephone',
'hasshowinginstructions', 'hasstyle', 'haslevel', 'hasgarage',
'hasheating', 'hascooling', 'haselementaryschool',
'hasjuniorhighschool', 'hashighschool', 'haslistprice', 'haslistdate',
'hasaddress', 'hascity', 'hasstate', 'hasarea'],
dtype='object')

Visualization

List Price

The column listprice is our target (dependent) variable, and all the other columns are our independent variables.

  • Minimum price: $1,000.00
  • Maximum price: $90,000,000.00
  • Mean price: $631,906.00
  • Median price $415,900.00
  • Standard deviation of prices: $1,073,829.25
  • First quartile of prices: $269,900.00
  • Third quartile of prices: $654,900.00
  • Interquartile (IQR) of prices: $385,000.00
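
The figures above, and the beds and baths summaries below, can be reproduced directly from the frame with pandas; a minimal sketch:

def summarize(series):
    # Basic location and spread statistics used throughout this section.
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    return {
        'min': series.min(), 'max': series.max(),
        'mean': series.mean(), 'median': series.median(),
        'std': series.std(),
        'q1': q1, 'q3': q3, 'iqr': q3 - q1,
    }

print(summarize(df_combined['listprice']))
print(summarize(df_combined['beds']))
print(summarize(df_combined['baths']))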

Beds

The column beds is one of the independent variables.

  • Minimum beds: 0.00
  • Maximum beds: 20.00
  • Mean beds: 3.45
  • Median beds 3.00
  • The standard deviation of beds: 1.69
  • First quartile of beds: 3.00
  • Third quartile of beds: 4.00
  • Interquartile (IQR) of beds: 1.00

Baths

The column baths is one of the independent variables.

  • Minimum baths: 0.00
  • Maximum baths: 15.00
  • Mean baths: 2.15
  • Median baths 2.00
  • Standard deviation of baths: 1.24
  • First quartile of baths: 1.00
  • Third quartile of baths: 2.50
  • Interquartile (IQR) of baths: 1.50

Plots

Beds vs List Price top 100

Baths vs List Price top 100

Beds vs List Price

Baths vs List Price

SQFT vs List Price

Beds vs Proptype vs Listprice

Baths vs Proptype vs Listprice

Boxplot

Boxplot List Price

Boxplot Beds

Boxplot Baths

Boxplot SQFT

Histogram

Beds Histogram

Baths Histogram

Model Analysis

We tried Linear Regression and Random Forest as base models to analyze our dataset.
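
Before either model can be fit, the combined frame has to be split back into training and test rows and into features and target. That step is not shown explicitly here, so the sketch below is an assumption based on the row counts reported earlier (the first 500,468 rows of df_combined came from train.csv):

# Assumed split: the first len(train_dataset) rows of df_combined are the training data.
n_train = 500468  # from train_dataset.shape reported above

train_numeric = df_combined.iloc[:n_train]
test_numeric = df_combined.iloc[n_train:]

# 'listprice' is the target; everything else is a feature.
X_train = train_numeric.drop(columns=['listprice'])
y_train = train_numeric['listprice']
X_test = test_numeric.drop(columns=['listprice'])
y_test = test_numeric['listprice']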

Linear Regression

Because price prediction is a regression problem, we approached our model methodology with a Linear Regression baseline. R-squared is a measure of how well a linear regression model “fits” a dataset. In our regression output we see R² = 0.25, which indicates that about 25% of the variance in list price is explained by the model's features. Values of roughly 26% or above are often taken to indicate a large effect size; by that standard, our model sits in the low-to-medium range.
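
A minimal sketch of such a baseline fit, assuming the X_train/y_train split above:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Fit the baseline on the numeric feature columns and score it on held-out data.
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
print("R^2:", r2_score(y_test, lin_reg.predict(X_test)))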

Random forest

We then ran a RandomForestRegressor with the best parameters found during the linear-regression experiments. In the random forest output we see R² = -0.23, which means the model generalizes poorly. This is not too surprising for a random forest, which tends to fit the training set extremely well because of how exhaustive the algorithm is, and can therefore struggle on unseen data.
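
A corresponding sketch with scikit-learn; the hyperparameter values here are placeholders rather than the tuned settings:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Placeholder hyperparameters; the tuned values are not listed here.
rf = RandomForestRegressor(n_estimators=100, max_depth=None, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)
print("R^2:", r2_score(y_test, rf.predict(X_test)))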

Model Tested

array([557572.21951763])
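
The array above is a single predicted list price returned by the fitted model. A sketch of how one such prediction could be produced, with an illustrative feature row (not the exact row behind the number above):

import pandas as pd

# One hypothetical property, using the same numeric columns as X_train.
sample = pd.DataFrame([{
    'zip': 2053, 'beds': 3, 'baths': 2.0, 'sqft': 2000, 'proptype': 3,
    **{col: 1 for col in X_train.columns if col.startswith('has')},
}], columns=X_train.columns).fillna(0)

print(lin_reg.predict(sample))  # or rf.predict(sample)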

PostgreSQL Database

The properties table was created from the CSV file exported earlier:

df_combined.to_csv('df_combined.csv', sep=',', encoding='utf-8')

Table structure

create table properties
(
row_id text,
soldprice text,
listdate text,
listprice numeric,
expireddate text,
dom numeric,
dto integer,
address text,
city text,
state text,
zip integer,
solddate text,
beds integer,
baths numeric,
sqft integer,
age integer,
lotsize integer,
agentname text,
officename text,
officephone text,
showinginstructions text,
area integer,
style text,
level integer,
garage integer,
heating text,
cooling text,
elementaryschool text,
juniorhighschool text,
highschool text,
remarks text,
proptype integer,
streetname text,
housenum1 text,
housenum2 numeric,
photourl text,
otherfeatures text,
hasotherfeatures integer,
hasproptype integer,
hasstreetname integer,
hashousenum1 integer,
hashousenum2 integer,
hasagentname integer,
hasofficename integer,
hasofficephone integer,
hasshowinginstructions integer,
hasstyle integer,
haslevel integer,
hasgarage integer,
hasheating integer,
hascooling integer,
haselementaryschool integer,
hasjuniorhighschool integer,
hashighschool integer,
haslistprice integer,
haslistdate integer,
hasaddress integer,
hascity integer,
hasstate integer,
hasarea integer,
proptype_str text,
zip_str text
);
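
One way to populate the table from df_combined.csv is to let pandas write directly into Postgres; a minimal sketch with SQLAlchemy and a Postgres driver installed, using placeholder connection details:

import pandas as pd
from sqlalchemy import create_engine

# Placeholder credentials/host; substitute the RDS endpoint used by the app.
engine = create_engine("postgresql://postgres:password@localhost:5432/properties_db")

# The CSV was exported with the default index, hence index_col=0.
df = pd.read_csv("df_combined.csv", index_col=0)
# Append rows into the existing 'properties' table created above (columns matched by name).
df.to_sql("properties", engine, if_exists="append", index=False)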

Table Optimization

ALTER TABLE properties ADD PRIMARY KEY (row_id);
CREATE INDEX search_price ON properties (listprice, zip, photourl);
CREATE INDEX search_price1 ON properties (zip, photourl);
CREATE INDEX search_price2 ON properties (listprice, zip);
CREATE INDEX search_price3 ON properties (listprice, photourl);

Users table

create table users
(
uid serial
constraint users_pkey
primary key,
username text not null,
password text not null
);

Properties Table Query

Query Sample

zipcode = 2053
beds = 3
baths = 2.0
sqft = 200
proptype = 1
hasotherfeatures = 1
hasproptype = 1
hasstreetname = 1
hashousenum1 = 0
hashousenum2 = 1
hasagentname = 1
hasofficename = 1
hasofficephone = 1
hasshowinginstructions = 1
hasstyle = 1
haslevel = 1
hasgarage = 1
hasheating = 1
hascooling = 0
haselementaryschool = 0
hasjuniorhighschool = 0
hashighschool = 0
haslistprice = 1
haslistdate = 0
hasaddress = 1
hascity = 1
hasstate = 1
hasarea = 1
query = (
"select * from properties where 1=1 and "
"zip = {zipcode} and beds ={beds} and baths = {baths} and sqft = {sqft} and proptype = {proptype} and hasotherfeatures = {hasotherfeatures} and "
"hasproptype = {hasproptype} and hasstreetname = {hasstreetname} and hashousenum1 = {hashousenum1} and hashousenum2 = {hashousenum2} and "
"hasagentname = hasagentname and hasofficename = hasofficename and hasofficephone = hasofficephone and hasshowinginstructions = hasshowinginstructions"
" and hasstyle = {hasstyle} and haslevel = {haslevel} and hasgarage = {hasgarage} and hasheating = {hasheating} and hascooling = {hascooling} and "
"haselementaryschool = {haselementaryschool} and hasjuniorhighschool = {hasjuniorhighschool} and "
"hashighschool = {hashighschool} and haslistprice = {haslistprice} and haslistdate = {haslistdate} and hasaddress = {hasaddress} and "
"hascity = {hascity} and hasstate = {hasstate} and hasarea = {hasarea}".format(
zipcode=zipcode,
beds=beds,
baths=baths,
sqft=sqft,
proptype=proptype,
hasotherfeatures=hasotherfeatures,
hasproptype=hasproptype,
hasstreetname=hasstreetname,
hashousenum1=hashousenum1,
hashousenum2=hashousenum2,
hasagentname=hasagentname,
hasofficename=hasofficename,
hasofficephone=hasofficephone,
hasshowinginstructions=hasshowinginstructions,
hasstyle=hasstyle,
haslevel=haslevel,
hasgarage=hasgarage,
hasheating=hasheating,
hascooling=hascooling,
haselementaryschool=haselementaryschool,
hasjuniorhighschool=hasjuniorhighschool,
hashighschool=hashighschool,
haslistprice=haslistprice,
haslistdate=haslistdate,
hasaddress=hasaddress,
hascity=hascity,
hasstate=hasstate,
hasarea=hasarea,
)
)
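
Building the SQL with str.format works, but it is fragile and injection-prone. A hedged alternative using a parameterized query (psycopg2 assumed as the driver, and only a few of the filters shown for brevity):

import psycopg2

# Placeholder connection settings; substitute the app's RDS credentials.
conn = psycopg2.connect(host="localhost", dbname="properties_db",
                        user="postgres", password="password")

sql = """
    SELECT * FROM properties
    WHERE zip = %(zipcode)s AND beds = %(beds)s AND baths = %(baths)s
      AND sqft = %(sqft)s AND proptype = %(proptype)s
"""
with conn, conn.cursor() as cur:
    # The driver handles quoting/escaping of every parameter value.
    cur.execute(sql, {"zipcode": zipcode, "beds": beds, "baths": baths,
                      "sqft": sqft, "proptype": proptype})
    rows = cur.fetchall()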

Cloud Architecture

Network Topology

The Price Prediction app is a two-tier architecture: the database tier lives in a private subnet, and an AWS Lambda function runs in a public subnet that can reach both the private subnet and the internet through gateways.

AWS Network Configuration

The following are the components of our application architecture.

Internet Gateway

Virtual private cloud (VPC)

Subnetworks

Relational Database Service

Amazon Simple Storage Service

Images in S3 Services

Serverless Platform

AWS Lambda is an event-driven, serverless computing platform.

GitHub Actions

Deploy to Lambda

name: AWS Deploy
defaults:
  run:
    working-directory: "./predictions-app"
on: [push]
env:
  AWS_PROFILE: ${{ secrets.AWS_PROFILE }}
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  AWS_DEFAULT_REGION: ${{ secrets.AWS_REGION }}
  SYSTEM_VERSION_COMPAT: ${{ secrets.SYSTEM_VERSION_COMPAT }}
  BUCKET_URI: ${{ secrets.BUCKET_URI }}
jobs:
  test:
    name: test develop
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: "./predictions-app"
    env:
      working-directory: "./predictions-app"
    if: github.ref == 'refs/heads/develop'
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: 3.8
      - name: Install apt-get update
        run: |
          sudo apt-get update
      - name: Install python app
        run: |
          python -m pip install --upgrade pipenv
          pipenv install --deploy
        working-directory: ${{ env.working-directory }}
  submit:
    name: Submit to AWS
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: "./predictions-app"
    env:
      working-directory: "./predictions-app"
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: 3.8
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}
      - name: Set up Node.js
        uses: actions/setup-node@v1
        with:
          node-version: "12"
      - run: npm install
        working-directory: ${{ env.working-directory }}
      - name: Install apt-get update
        run: |
          sudo apt-get update
      - name: Install serverless
        run: |
          npm install serverless -g
        working-directory: ${{ env.working-directory }}
      - name: Install awscli
        run: |
          sudo apt-get install awscli
      - name: Install python app
        run: |
          python -m pip install --upgrade pipenv
          pipenv install --deploy
        working-directory: ${{ env.working-directory }}
      - name: Set up Docker
        uses: docker-practice/actions-setup-docker@master
      - name: Deploy Python App to AWS Lambda
        run: |
          aws configure set aws_access_key_id ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws configure set aws_secret_access_key ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws configure set default.region ${{ secrets.AWS_REGION }}
          serverless deploy
        working-directory: ${{ env.working-directory }}

Serverless

The Python web app runs on a serverless service that executes code in response to events and automatically manages the computing resources the app requires.

Handling sessions on a stateless server

We have to handle the statelessness of the Lambda by setting a cookie alongside the server-side session

  • Login
session["username"] = username
response = make_response(redirect(url_for("index")))
response.set_cookie('username', session["username"])
print('cookie login pass:', request.cookies.get('username'))
  • Logout
response = make_response(redirect(url_for("index")))
response.set_cookie('username', '', expires=0)
session.clear()
  • Navigation
print('cookie:', request.cookies.get('username'))
username_value = session['username'] if 'username' in session else request.cookies.get('username')
if username_value is not None:
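
Pulled together, the pattern looks roughly like the sketch below; the route names and responses are assumptions, and only the cookie-handling fragments above come from the app itself:

from flask import Flask, make_response, redirect, request, session, url_for

app = Flask(__name__)
app.secret_key = "change-me"  # placeholder; in the real app this comes from configuration

@app.route("/login", methods=["POST"])
def login():
    username = request.form["username"]
    # ...credential check against the users table goes here...
    session["username"] = username
    response = make_response(redirect(url_for("index")))
    # Mirror the session into a cookie so a fresh Lambda container can still identify the user.
    response.set_cookie("username", username)
    return response

@app.route("/logout", methods=["POST"])
def logout():
    response = make_response(redirect(url_for("index")))
    response.set_cookie("username", "", expires=0)
    session.clear()
    return response

@app.route("/")
def index():
    # Fall back to the cookie when the in-memory session is empty (new container).
    username_value = session.get("username") or request.cookies.get("username")
    return f"Hello, {username_value or 'guest'}"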

YAML configuration

service: serverless-flask-pp-app
frameworkVersion: "2 || 3"
plugins:
  - serverless-python-requirements
  - serverless-apigwy-binary
  - serverless-prune-plugin
  - serverless-wsgi
custom:
  apigwBinary:
    types:
      - "image/jpeg"
      - "image/png"
  prune:
    automatic: true
    number: 3
  pythonRequirements:
    slim: true
    strip: false
    zip: true
  scriptable:
    hooks:
      after:package:createDeploymentArtifacts:
        - ./shake.sh
  tableName: "properties"
  wsgi:
    app: routes.app
    packRequirements: false
provider:
  name: aws
  runtime: python3.8
  stage: dev
  region: us-east-1
  lambdaHashingVersion: 20201221
  environment:
    PROPERTIES_TABLE: "properties"
    USERS_TABLE: "users"
    POSTGRES_HOST:
    POSTGES_DATABASE_URI: ""
    POSTGRES_DB: ""
    POSTGRES_USER: "postgres"
    POSTGRES_PASSW: ""
    POSTGRES_PORT: "5432"
  exclude:
    - .git/**
    - .vscode/**
    - node_modules/**
    - venv/**
  iamRoleStatements:
    - Effect: "Allow"
      Action:
        - ec2:CreateNetworkInterface
        - ec2:DescribeNetworkInterfaces
        - ec2:DeleteNetworkInterface
      Resource: "*"
functions:
  app:
    handler: wsgi_handler.handler
    events:
      - http: ANY /
        contentHandling: CONVERT_TO_BINARY
      - http: POST /predict
      - http: POST /logout
      - http: ANY {proxy+}
        contentHandling: CONVERT_TO_BINARY
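
The environment block above is how the Lambda receives its database settings; inside the Flask app they would presumably be read with os.environ, roughly:

import os

# Read connection settings injected by the serverless environment block above.
DB_SETTINGS = {
    "host": os.environ.get("POSTGRES_HOST", ""),
    "dbname": os.environ.get("POSTGRES_DB", ""),
    "user": os.environ.get("POSTGRES_USER", "postgres"),
    "password": os.environ.get("POSTGRES_PASSW", ""),
    "port": os.environ.get("POSTGRES_PORT", "5432"),
}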

Web Application

Registration

The Price Prediction app allows users to register as authorized users so they can manage their saved listings.

Register

The “Register” link at the top takes the guest to register as a Price Predictor user. This simple form only requires the user to enter a username and a password.

Login

Once users have registered, they can log in and save, review, or remove saved homes.

Search

This is the main page where the user can search for homes using the options provided. ZIP code is required; the rest of the fields are optional but preset with default values.

Search Results

Homes matching the search criteria are displayed once the user hits the search button. The listings are shown with the address and brief highlights.

Property Details

Details about the individual property can be viewed once the user clicks on the address from the search results page. This page displays the pricing and detailed information about the property.

Favorites

Registered users can mark a property as a favorite for future viewing.
