An end-to-end open-source data stack for crawling and visualizing real estate data, to provide insights into market trends.
- Introduction
- Prerequisites
- Setup and Run
- Architecture
- Components of the Data Stack
  - Data Crawling: Requests
  - Data Transformation: DBT, Apache Spark, Trino
  - Data Warehousing and Storage: Rustfs, Iceberg, PostgreSQL
  - Data Visualization and Analysis: Metabase, Jupyter Notebook
  - Project Orchestration: Dagster
- Project Overview
- Visualization
- Acknowledgements
## Introduction

This project is a holistic open-source data solution designed to systematically collect real estate data from Ho Chi Minh City and present it through interactive visualizations, enabling users to gain insights into market trends.
By leveraging this data stack, users can efficiently collect and analyze real-time data from multiple sources within the local real estate market. The system provides capabilities for data acquisition, processing, storage, and visualization, allowing users to explore market dynamics, track property trends, and identify potential investment opportunities.
Below is a list of technologies used in this project:
| Component | Description | URL |
|---|---|---|
| Docker | Containerization | - |
| Spark | Big Data processing framework | Master: http://localhost:8061<br>History: http://localhost:18080 |
| Jupyter Notebook | Interactive computing and data analysis | http://localhost:8888 |
| Rustfs | Object storage service | http://localhost:9001 |
| Iceberg | Table format for large-scale data | - |
| Data Build Tool (DBT) | Data transformation and modeling | - |
| Dagster | Data orchestrator | http://localhost:3070 |
| Trino | Distributed SQL query engine | - |
| PostgreSQL | Relational database for the serving (gold) layer | - |
## Prerequisites

Docker must be installed, with at least 8GB of RAM allocated to it.
## Setup and Run

1. Clone the project from the repository:

   ```bash
   git clone https://github.com/Quocc1/OpenStack
   ```

2. Start the Docker engine.

3. Change into the project directory, then spin up the containers:

   ```bash
   cd OpenStack
   make run
   ```

   Note: Run `make help` or refer to the Makefile for details on commands and execution. Use `make down` to stop the containers. If you encounter issues running the Makefile on Windows, refer to this Stack Overflow post for potential solutions.

4. Run the end-to-end job in Dagster: select the `end_to_end` job and click **Materialize All**.

## Architecture

The diagram illustrates the conceptual view of the data pipeline (from bottom to top).
- Real estate advertisements are obtained through an API.
- The advertisements are then stored in Rustfs S3, leveraging Apache Iceberg for efficient data management.
- The data undergoes transformation through each medallion stage: `bronze`, `silver`, and `gold`, ensuring quality and consistency.
- Gold-layer data is stored in PostgreSQL for persistent storage.
- Data is visualized with Metabase for analysis and insights, and Jupyter Notebook is utilized for machine learning.
The orchestration of these steps is managed by Dagster, while data transformation is handled by DBT.
The purpose of this project is to offer a comprehensive end-to-end open-source data stack tailored for analyzing real estate trends in Ho Chi Minh City, Vietnam. It aims to seamlessly acquire, process, store, and visualize real estate data specific to the city.
By leveraging this data stack, users can gain valuable insights into the dynamic real estate market of Ho Chi Minh City, enabling informed decision-making, trend analysis, and identification of investment opportunities in the region.
(See details in the Visualization section below)
## Components of the Data Stack

### Data Crawling: Requests

Data crawling is the preliminary phase in which raw data is gathered from diverse sources. Within our infrastructure, we employ the following technology:
- Requests: This Python library streamlines the process of making HTTP requests, thereby enabling seamless retrieval of data from APIs and web pages.
API endpoint: `gateway.chotot.com`
Here is an example response to a request:
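A minimal sketch of how such a request might look with the `requests` library; the endpoint path, query parameters, and response field names below are assumptions for illustration, not necessarily the exact ones used in `crawl.py`:

```python
import requests

# Hypothetical endpoint path and query parameters -- the real values used by
# crawl.py may differ.
BASE_URL = "https://gateway.chotot.com/v1/public/ad-listing"

params = {
    "region_v2": 13000,  # assumed region code for Ho Chi Minh City
    "cg": 1000,          # assumed category code for real estate
    "limit": 50,
    "o": 0,              # offset, for paginating through listings
}

response = requests.get(BASE_URL, params=params, timeout=30)
response.raise_for_status()

# The gateway returns JSON; each ad carries fields such as subject and price
# (field names assumed here).
for ad in response.json().get("ads", []):
    print(ad.get("subject"), ad.get("price"))
```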
### Data Transformation: DBT, Apache Spark, Trino

Data transformation involves processing and refining raw data into a structured format suitable for analysis. We leverage the following technologies for this purpose:
- DBT (Data Build Tool): DBT orchestrates the transformation process, enabling the creation of data models and the execution of SQL transformations.
- Apache Spark: As a powerful distributed computing framework, Apache Spark processes large-scale data efficiently, facilitating complex transformations and computations.
- Trino (formerly Presto): Trino serves as a distributed SQL query engine, enabling interactive analysis across various data sources (see the query sketch after this list).
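For instance, once the medallion tables exist, they can be queried interactively through Trino. A minimal sketch using the `trino` Python client; the port, catalog, schema, table, and column names are assumptions for illustration:

```python
from trino.dbapi import connect

conn = connect(
    host="localhost",
    port=8080,          # assumed Trino coordinator port
    user="trino",
    catalog="iceberg",  # assumed catalog backing the Iceberg tables
    schema="silver",
)

# Aggregate listing prices by district across the silver layer.
cur = conn.cursor()
cur.execute(
    "SELECT district, avg(price) AS avg_price "
    "FROM silver_refined_data "
    "GROUP BY district "
    "ORDER BY avg_price DESC"
)
for district, avg_price in cur.fetchall():
    print(district, avg_price)
```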
Representation of Data Flow:
### Data Warehousing and Storage: Rustfs, Iceberg, PostgreSQL

Data warehousing and storage form the foundation for storing and managing processed data. Our data stack incorporates the following technologies:
- Rustfs: Rustfs provides object storage capabilities, offering a scalable and cost-effective solution for storing large volumes of data.
- Iceberg: Iceberg manages structured data tables in object stores efficiently, providing features like atomic commits and time travel (see the sketch after this list).
- PostgreSQL: PostgreSQL serves as our relational database management system, offering robust data storage and querying capabilities.
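A minimal sketch of those Iceberg features from PySpark, assuming a Spark session already configured with an Iceberg catalog named `iceberg` (as the `spark_iceberg_kyuubi` image would provide); the table and column names are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-sketch").getOrCreate()

# Writing a DataFrame as an Iceberg table is an atomic commit: readers see
# either the old snapshot or the new one, never a partial write.
df = spark.createDataFrame(
    [("District 1", 5200.0), ("District 7", 3100.0)],
    ["district", "price_per_m2"],
)
df.writeTo("iceberg.bronze.bronze_raw_data").createOrReplace()

# Time travel: list the table's snapshots and read it as of the first one.
first_snapshot = spark.sql(
    "SELECT snapshot_id FROM iceberg.bronze.bronze_raw_data.snapshots "
    "ORDER BY committed_at LIMIT 1"
).collect()[0].snapshot_id

old_df = (
    spark.read.option("snapshot-id", first_snapshot)
    .table("iceberg.bronze.bronze_raw_data")
)
old_df.show()
```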
Connect to PostgreSQL using DBeaver (username: `postgres`, password: `postgres`):

Connect to Rustfs via `localhost:9001` (username: `admin`, password: `password`):
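The serving database can also be queried programmatically. A minimal sketch with `psycopg2`, reusing the credentials above; the port, database name, and table name are assumptions:

```python
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    port=5432,          # assumed published PostgreSQL port
    dbname="postgres",  # assumed database name
    user="postgres",
    password="postgres",
)

# Count rows in the gold-layer table loaded by the pipeline.
with conn, conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM gold_analytics_data")
    print("rows in the gold layer:", cur.fetchone()[0])

conn.close()
```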
### Data Visualization and Analysis: Metabase, Jupyter Notebook

Data visualization and analysis are paramount in aiding data exploration and decision-making. Our preferred tools for visualization and analysis are:
- Metabase (Community Edition): Metabase provides a user-friendly interface for creating interactive dashboards and visualizations, empowering users to effortlessly derive insights from their data.
- Jupyter Notebook: Jupyter Notebook allows users to create and share documents containing live code, equations, visualizations, and narrative text, providing a versatile environment for data exploration and experimentation.
Examples of machine learning in Jupyter Notebook:
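A minimal sketch of the kind of price-prediction workflow `Predict_Price_Real_Estate.ipynb` performs with Spark MLlib; the feature columns, file path, and model choice are assumptions, and the notebook may differ:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("predict-price").getOrCreate()

# Assumed: the staged listings include numeric area/bedroom columns and a
# price label.
df = (
    spark.read.csv("data/rustfs/warehouse/raw_input/houses.csv",
                   header=True, inferSchema=True)
    .dropna(subset=["area", "bedrooms", "price"])
)

# Pack the feature columns into a single vector, as Spark MLlib expects.
assembler = VectorAssembler(inputCols=["area", "bedrooms"],
                            outputCol="features")
train, test = assembler.transform(df).randomSplit([0.8, 0.2], seed=42)

model = LinearRegression(featuresCol="features", labelCol="price").fit(train)
print("test RMSE:", model.evaluate(test).rootMeanSquaredError)
```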
### Project Orchestration: Dagster

Project orchestration involves coordinating and managing the various components and processes within our data pipeline. We employ:
- Dagster: Dagster serves as our project orchestration tool, enabling the definition, scheduling, and monitoring of data workflows with a focus on data quality and reliability.
End-to-end pipeline illustration:
## Project Overview

```
OpenStack/
├── assets/
│   └── pictures
├── code/
│   ├── dbt_real_estate/
│   │   ├── bronze/
│   │   │   └── models/
│   │   │       └── bronze_raw_data.sql
│   │   ├── silver/
│   │   │   └── models/
│   │   │       └── silver_refined_data.sql
│   │   └── gold/
│   │       └── models/
│   │           └── gold_analytics_data.sql
│   └── dagster_real_estate/
│       └── src/dagster_real_estate/
│           └── defs/
│               ├── crawl.py
│               ├── database.py
│               ├── dbt.py
│               ├── jobs.py
│               └── ...
├── data/
│   ├── notebooks/
│   │   └── Predict_Price_Real_Estate.ipynb
│   └── rustfs/
│       └── warehouse/raw_input/houses.csv
├── docker/
│   ├── dagster_dbt/
│   ├── metabase/
│   ├── spark_iceberg_kyuubi/
│   └── trino/
├── docker-compose.yaml
├── Makefile
└── README.md
```
The `defs/` directory contains the Dagster job definitions:

```
defs/
├── crawl.py
├── database.py
├── dbt.py
├── jobs.py
└── ...
```
- `crawl.py`: A Dagster job responsible for retrieving data via the API and storing it in Rustfs at `warehouse/raw_input/houses.csv`.
- `database.py`: A Dagster job used to initialize the databases for Rustfs, Iceberg, and PostgreSQL.
- `dbt.py`: A Dagster job that executes the DBT models.
- `end_to_end.py`: Combines all Dagster jobs, including `database.py`, `crawl.py`, and `dbt.py`, to orchestrate the end-to-end data pipeline.
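A minimal sketch, with assumed asset names, of how these jobs could be chained into the `end_to_end` job; the actual code in `defs/` may be organized differently:

```python
import dagster as dg

@dg.asset
def raw_houses() -> None:
    """Crawl the API and land houses.csv in Rustfs (the role of crawl.py)."""
    ...

@dg.asset(deps=[raw_houses])
def dbt_models() -> None:
    """Run the bronze/silver/gold DBT models (the role of dbt.py)."""
    ...

# Selecting "*" materializes every asset, mirroring what Materialize All
# does for the end_to_end job in the Dagster UI.
end_to_end = dg.define_asset_job("end_to_end", selection="*")

defs = dg.Definitions(assets=[raw_houses, dbt_models], jobs=[end_to_end])
```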
The DBT models follow the medallion architecture:

```
dbt/
├── bronze/
│   └── models/
│       └── bronze_raw_data.sql
├── silver/
│   └── models/
│       └── silver_refined_data.sql
└── gold/
    └── models/
        └── gold_analytics_data.sql
```
- `bronze_raw_data.sql`: SQL model defining transformations for raw data in the bronze layer.
- `silver_refined_data.sql`: SQL model defining transformations for refined data in the silver layer.
- `gold_analytics_data.sql`: SQL model defining transformations for analytics-ready data in the gold layer.
The `data/` directory holds the notebooks and staged data:

```
data/
├── notebooks/
│   └── Predict_Price_Real_Estate.ipynb
└── rustfs/
    └── warehouse/raw_input/houses.csv
```
- `Predict_Price_Real_Estate.ipynb`: Jupyter Notebook containing code for predicting real estate prices using Spark.
- `houses.csv`: CSV file containing staged real estate data.
## Visualization

For visualization using Metabase, access `localhost:3030` (username: `caobinhoh@gmail.com`, password: `password123456`).
After accessing Metabase with the provided credentials, choose the "HCMC Real Estate Insights" dashboard for viewing.
## Acknowledgements

This project draws inspiration and guidance from the following sources:
- ngods-stocks for its valuable insights and inspiration.
- hcmc-houses-analysis for generously providing code for data crawling.