An end-to-end open-source data stack for crawling and visualizing real estate data, to provide insights into market trends.
- Introduction
- Prerequisites
- Setup and Run
- Architecture
- Components of the Data Stack
  - Data Crawling: Requests
  - Data Transformation: DBT, Apache Spark, Trino
  - Data Warehousing and Storage: Rustfs, Iceberg, PostgreSQL
  - Data Visualization and Analysis: Metabase, Jupyter Notebook
  - Project Orchestration: Dagster
- Project Overview
- Visualization
- Acknowledgements
## Introduction

This project is a holistic open-source data solution designed to systematically collect real estate data from Ho Chi Minh City and present it through interactive visualizations, enabling users to gain insights into market trends.
By leveraging this data stack, users can efficiently collect and analyze real-time data from multiple sources within the local real estate market. The system provides capabilities for data acquisition, processing, storage, and visualization, allowing users to explore market dynamics, track property trends, and identify potential investment opportunities.
Below is a list of technologies used in this project:
| Component | Description | URL |
|---|---|---|
| Docker | Containerization | - |
| Spark | Big Data processing framework | Master: http://localhost:8061<br>History: http://localhost:18080 |
| Jupyter Notebook | Interactive computing and data analysis | http://localhost:8888 |
| Rustfs | Object storage service | http://localhost:9001 |
| Iceberg | Table format for large-scale data | - |
| Data Build Tool (DBT) | Data transformation and modeling | - |
| Dagster | Data orchestrator | http://localhost:3070 |
| Trino | Distributed SQL query engine | - |
| PostgreSQL | Relational database for the serving (gold) layer | - |
## Prerequisites

Docker must be installed, with at least 8GB of RAM allocated to it.
## Setup and Run

1. Clone the project from the repository:

   ```bash
   git clone https://github.com/Quocc1/OpenStack
   ```

2. Start the Docker engine.

3. Change into the project directory, then spin up the containers:

   ```bash
   cd OpenStack
   make run
   ```

   Note: Run `make help` or refer to the Makefile for details on commands and execution. Use `make down` to stop the containers. If you encounter issues running the Makefile on Windows, refer to this Stack Overflow post for potential solutions.

4. Run the end-to-end job in Dagster: select the `end_to_end` job and click **Materialize All**.

## Architecture

The diagram illustrates the conceptual view of the data pipeline (from bottom to top).
- Real estate advertisements are obtained through an API.
- The advertisements are then stored in Rustfs S3, leveraging Apache Iceberg for efficient data management.
- The data undergoes transformation through each medallion stage: `bronze`, `silver`, and `gold`, ensuring quality and consistency.
- Gold-layer data is stored in PostgreSQL for persistent storage.
- Data is visualized with Metabase for analysis and insights, and Jupyter Notebook is utilized for machine learning.
The orchestration of these steps is managed by Dagster, while data transformation is handled by DBT.
The purpose of this project is to offer a comprehensive end-to-end open-source data stack tailored for analyzing real estate trends in Ho Chi Minh City, Vietnam. It aims to seamlessly acquire, process, store, and visualize real estate data specific to the city.
By leveraging this data stack, users can gain valuable insights into the dynamic real estate market of Ho Chi Minh City, enabling informed decision-making, trend analysis, and identification of investment opportunities in the region.
(See details in the Visualization section below)
## Components of the Data Stack

### Data Crawling: Requests

Data crawling is the preliminary phase in which raw data is gathered from diverse sources. Within our infrastructure, we employ the following technology:
- Requests: This Python library streamlines the process of making HTTP requests, thereby enabling seamless retrieval of data from APIs and web pages.
API endpoint: `gateway.chotot.com`
Here is an example response to a request:
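A minimal sketch of how such a request might look with the `requests` library; the endpoint path, query parameters, and response field names below are assumptions for illustration, not necessarily the exact ones used in `crawl.py`:

```python
import requests

# Hypothetical endpoint path and query parameters -- the real values used by
# crawl.py may differ.
BASE_URL = "https://gateway.chotot.com/v1/public/ad-listing"

params = {
    "region_v2": 13000,  # assumed region code for Ho Chi Minh City
    "cg": 1000,          # assumed category code for real estate
    "limit": 50,
    "o": 0,              # offset, for paginating through listings
}

response = requests.get(BASE_URL, params=params, timeout=30)
response.raise_for_status()

# The gateway returns JSON; each ad carries fields such as subject and price
# (field names assumed here).
for ad in response.json().get("ads", []):
    print(ad.get("subject"), ad.get("price"))
```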
### Data Transformation: DBT, Apache Spark, Trino

Data transformation involves processing and refining raw data into a structured format suitable for analysis. We leverage the following technologies for this purpose:
- DBT (Data Build Tool): DBT orchestrates the transformation process, enabling the creation of data models and the execution of SQL transformations.
- Apache Spark: As a powerful distributed computing framework, Apache Spark processes large-scale data efficiently, facilitating complex transformations and computations.
- Trino (formerly Presto): Trino serves as a distributed SQL query engine, enabling interactive analysis across various data sources (see the query sketch after this list).
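For instance, once the medallion tables exist, they can be queried interactively through Trino. A minimal sketch using the `trino` Python client; the port, catalog, schema, table, and column names are assumptions for illustration:

```python
from trino.dbapi import connect

conn = connect(
    host="localhost",
    port=8080,          # assumed Trino coordinator port
    user="trino",
    catalog="iceberg",  # assumed catalog backing the Iceberg tables
    schema="silver",
)

# Aggregate listing prices by district across the silver layer.
cur = conn.cursor()
cur.execute(
    "SELECT district, avg(price) AS avg_price "
    "FROM silver_refined_data "
    "GROUP BY district "
    "ORDER BY avg_price DESC"
)
for district, avg_price in cur.fetchall():
    print(district, avg_price)
```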
Representation of Data Flow:
### Data Warehousing and Storage: Rustfs, Iceberg, PostgreSQL

Data warehousing and storage form the foundation for storing and managing processed data. Our data stack incorporates the following technologies:
- Rustfs: Rustfs provides object storage capabilities, offering a scalable and cost-effective solution for storing large volumes of data.
- Iceberg: Iceberg manages structured data tables in object stores efficiently, providing features like atomic commits and time travel (see the sketch after this list).
- PostgreSQL: PostgreSQL serves as our relational database management system, offering robust data storage and querying capabilities.
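A minimal sketch of those Iceberg features from PySpark, assuming a Spark session already configured with an Iceberg catalog named `iceberg` (as the `spark_iceberg_kyuubi` image would provide); the table and column names are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-sketch").getOrCreate()

# Writing a DataFrame as an Iceberg table is an atomic commit: readers see
# either the old snapshot or the new one, never a partial write.
df = spark.createDataFrame(
    [("District 1", 5200.0), ("District 7", 3100.0)],
    ["district", "price_per_m2"],
)
df.writeTo("iceberg.bronze.bronze_raw_data").createOrReplace()

# Time travel: list the table's snapshots and read it as of the first one.
first_snapshot = spark.sql(
    "SELECT snapshot_id FROM iceberg.bronze.bronze_raw_data.snapshots "
    "ORDER BY committed_at LIMIT 1"
).collect()[0].snapshot_id

old_df = (
    spark.read.option("snapshot-id", first_snapshot)
    .table("iceberg.bronze.bronze_raw_data")
)
old_df.show()
```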
Connect to PostgreSQL using DBeaver (username: `postgres`, password: `postgres`):

Connect to Rustfs via `localhost:9001` (username: `admin`, password: `password`):
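The serving database can also be queried programmatically. A minimal sketch with `psycopg2`, reusing the credentials above; the port, database name, and table name are assumptions:

```python
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    port=5432,          # assumed published PostgreSQL port
    dbname="postgres",  # assumed database name
    user="postgres",
    password="postgres",
)

# Count rows in the gold-layer table loaded by the pipeline.
with conn, conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM gold_analytics_data")
    print("rows in the gold layer:", cur.fetchone()[0])

conn.close()
```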
### Data Visualization and Analysis: Metabase, Jupyter Notebook

Data visualization and analysis are paramount in aiding data exploration and decision-making. Our preferred tools for visualization and analysis are:
- Metabase (Community Edition): Metabase provides a user-friendly interface for creating interactive dashboards and visualizations, empowering users to effortlessly derive insights from their data.
- Jupyter Notebook: Jupyter Notebook allows users to create and share documents containing live code, equations, visualizations, and narrative text, providing a versatile environment for data exploration and experimentation.
Examples of machine learning in Jupyter Notebook:
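A minimal sketch of the kind of price-prediction workflow `Predict_Price_Real_Estate.ipynb` performs with Spark MLlib; the feature columns, file path, and model choice are assumptions, and the notebook may differ:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("predict-price").getOrCreate()

# Assumed: the staged listings include numeric area/bedroom columns and a
# price label.
df = (
    spark.read.csv("data/rustfs/warehouse/raw_input/houses.csv",
                   header=True, inferSchema=True)
    .dropna(subset=["area", "bedrooms", "price"])
)

# Pack the feature columns into a single vector, as Spark MLlib expects.
assembler = VectorAssembler(inputCols=["area", "bedrooms"],
                            outputCol="features")
train, test = assembler.transform(df).randomSplit([0.8, 0.2], seed=42)

model = LinearRegression(featuresCol="features", labelCol="price").fit(train)
print("test RMSE:", model.evaluate(test).rootMeanSquaredError)
```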
### Project Orchestration: Dagster

Project orchestration involves coordinating and managing the various components and processes within our data pipeline. We employ:
- Dagster: Dagster serves as our project orchestration tool, enabling the definition, scheduling, and monitoring of data workflows with a focus on data quality and reliability.
End-to-end pipeline illustration:
## Project Overview

```
OpenStack/
├── assets/
│   └── pictures
├── code/
│   ├── dbt_real_estate/
│   │   ├── bronze/
│   │   │   └── models/
│   │   │       └── bronze_raw_data.sql
│   │   ├── silver/
│   │   │   └── models/
│   │   │       └── silver_refined_data.sql
│   │   └── gold/
│   │       └── models/
│   │           └── gold_analytics_data.sql
│   └── dagster_real_estate/
│       └── src/dagster_real_estate/
│           └── defs/
│               ├── crawl.py
│               ├── database.py
│               ├── dbt.py
│               ├── jobs.py
│               └── ...
├── data/
│   ├── notebooks/
│   │   └── Predict_Price_Real_Estate.ipynb
│   └── rustfs/
│       └── warehouse/raw_input/houses.csv
├── docker/
│   ├── dagster_dbt/
│   ├── metabase/
│   ├── spark_iceberg_kyuubi/
│   └── trino/
├── docker-compose.yaml
├── Makefile
└── README.md
```
The `defs/` directory contains the Dagster job definitions:

```
defs/
├── crawl.py
├── database.py
├── dbt.py
├── jobs.py
└── ...
```
- `crawl.py`: A Dagster job responsible for retrieving data via the API and storing it in Rustfs at `warehouse/raw_input/houses.csv`.
- `database.py`: A Dagster job used to initialize the databases for Rustfs, Iceberg, and PostgreSQL.
- `dbt.py`: A Dagster job that executes the DBT models.
- `end_to_end.py`: Combines all Dagster jobs, including `database.py`, `crawl.py`, and `dbt.py`, to orchestrate the end-to-end data pipeline.
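A minimal sketch, with assumed asset names, of how these jobs could be chained into the `end_to_end` job; the actual code in `defs/` may be organized differently:

```python
import dagster as dg

@dg.asset
def raw_houses() -> None:
    """Crawl the API and land houses.csv in Rustfs (the role of crawl.py)."""
    ...

@dg.asset(deps=[raw_houses])
def dbt_models() -> None:
    """Run the bronze/silver/gold DBT models (the role of dbt.py)."""
    ...

# Selecting "*" materializes every asset, mirroring what Materialize All
# does for the end_to_end job in the Dagster UI.
end_to_end = dg.define_asset_job("end_to_end", selection="*")

defs = dg.Definitions(assets=[raw_houses, dbt_models], jobs=[end_to_end])
```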
The DBT models follow the medallion architecture:

```
dbt/
├── bronze/
│   └── models/
│       └── bronze_raw_data.sql
├── silver/
│   └── models/
│       └── silver_refined_data.sql
└── gold/
    └── models/
        └── gold_analytics_data.sql
```
- `bronze_raw_data.sql`: SQL model defining transformations for raw data in the bronze layer.
- `silver_refined_data.sql`: SQL model defining transformations for refined data in the silver layer.
- `gold_analytics_data.sql`: SQL model defining transformations for analytics-ready data in the gold layer.
The `data/` directory holds the notebooks and staged data:

```
data/
├── notebooks/
│   └── Predict_Price_Real_Estate.ipynb
└── rustfs/
    └── warehouse/raw_input/houses.csv
```
- `Predict_Price_Real_Estate.ipynb`: Jupyter Notebook containing code for predicting real estate prices using Spark.
- `houses.csv`: CSV file containing staged real estate data.
## Visualization

For visualization using Metabase, access `localhost:3030` (username: `caobinhoh@gmail.com`, password: `password123456`).
After accessing Metabase with the provided credentials, choose the "HCMC Real Estate Insights" dashboard for viewing.
## Acknowledgements

This project draws inspiration and guidance from the following sources:
- ngods-stocks for its valuable insights and inspiration.
- hcmc-houses-analysis for generously providing code for data crawling.