Data Lakehouse Architecture on GCP | S-Logix

Data Lakehouse Architecture on GCP: Performance Evaluation of Querying Iceberg Tables with Spark on Dataproc vs. BigQuery

Use Case: Organizations working with petabyte-scale structured and semi-structured data need a data lakehouse architecture that combines the flexibility of data lakes (open table and file formats such as Iceberg and Parquet) with the query performance of data warehouses (such as BigQuery). This use case evaluates query performance and cost-efficiency when accessing Iceberg tables through Spark on Dataproc versus querying them directly in BigQuery.

Objective

To build and compare a Data Lakehouse architecture on GCP using Apache Iceberg.

To evaluate query performance, cost, and scalability of Spark on Dataproc (open-source engine running on provisioned clusters) against BigQuery (serverless, optimized warehouse).

To derive guidelines for workload placement: when to use Spark on Dataproc vs. BigQuery for ETL, analytics, and interactive queries.

Project Description

Data Ingestion & Storage

Large datasets (transactional logs, IoT sensor data, or clickstream data) are ingested into Cloud Storage in raw formats (CSV, JSON, or Parquet).

The raw data is then written into Iceberg tables on Cloud Storage, which provide schema evolution and ACID transactions on top of the underlying data files.
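As a minimal sketch of what registering Iceberg tables on Cloud Storage could look like from the Spark side, the helper below builds the Spark properties for an Iceberg catalog and a matching `CREATE TABLE` statement. The catalog name `lake`, the bucket `gs://example-bucket/warehouse`, the table name, and its columns are illustrative assumptions, not part of the project description. A Hadoop-type catalog is used only because it is the simplest to show; a real deployment might choose a different catalog.

```python
from typing import Dict, List, Optional, Tuple


def iceberg_spark_conf(catalog: str, warehouse_uri: str) -> Dict[str, str]:
    """Build Spark properties that register an Iceberg catalog on a warehouse path."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Enables Iceberg's SQL extensions in the Spark session.
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Registers the catalog under the chosen (hypothetical) name.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "hadoop",
        f"{prefix}.warehouse": warehouse_uri,
    }


def create_table_sql(
    table: str,
    columns: List[Tuple[str, str]],
    partition_by: Optional[str] = None,
) -> str:
    """Render a Spark SQL statement that creates an Iceberg table."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    sql = f"CREATE TABLE IF NOT EXISTS {table} ({cols}) USING iceberg"
    if partition_by:
        sql += f" PARTITIONED BY ({partition_by})"
    return sql


# Hypothetical names, chosen only for illustration.
conf = iceberg_spark_conf("lake", "gs://example-bucket/warehouse")
ddl = create_table_sql(
    "lake.events.clicks",
    [("user_id", "bigint"), ("event_ts", "timestamp")],
    partition_by="days(event_ts)",
)
```

The properties would typically be supplied when the Spark job is submitted or set on the session builder, and the statement passed to `spark.sql(...)` inside the job.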

Lakehouse Querying (Spark on Dataproc)

Deploy a Dataproc cluster with Spark + Iceberg runtime.

Run ETL transformations and SQL queries on Iceberg tables.

Measure performance metrics (execution time, cluster utilization, scalability).
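Execution time can be measured with a small harness like the one below. This is a sketch under stated assumptions: `run_query` is a hypothetical zero-argument callable that runs one query to completion (in a Spark job it might wrap something like `spark.sql(query).collect()`), and the stand-in workload here only exists so the example runs without a cluster.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(
    run_query: Callable[[], object],
    repeats: int = 5,
    warmups: int = 1,
) -> Dict[str, float]:
    """Time a query callable several times and summarize the wall-clock samples."""
    # Warm-up runs are executed but not recorded, to reduce cold-start effects.
    for _ in range(warmups):
        run_query()

    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)

    return {
        "min_s": min(samples),
        "median_s": statistics.median(samples),
        "max_s": max(samples),
    }


# Stand-in workload; replace with a callable that runs the real query.
summary = benchmark(lambda: sum(range(100_000)), repeats=5, warmups=1)
```

Reporting the median alongside the minimum and maximum makes run-to-run variance visible, which matters when cluster caching or autoscaling affects individual runs.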

Warehouse Querying (BigQuery)

Register the same Iceberg tables in BigQuery through BigLake (BigQuery’s unified lakehouse layer), so that they are queried in place on Cloud Storage rather than copied.

Run equivalent SQL queries directly in BigQuery.

Benchmark query latency, concurrency, and cost efficiency.
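To make the BigLake step more concrete, the sketch below renders a BigQuery DDL statement for an external table over an Iceberg metadata file. The statement shape follows BigQuery's `CREATE EXTERNAL TABLE ... WITH CONNECTION ... OPTIONS (format = 'ICEBERG', ...)` form as best understood here and should be checked against the current documentation. The dataset, table, connection, and metadata path are all hypothetical placeholders.

```python
def biglake_iceberg_ddl(
    dataset: str,
    table: str,
    connection: str,
    metadata_uri: str,
) -> str:
    """Render DDL for a BigQuery external table backed by an Iceberg metadata file."""
    return (
        f"CREATE EXTERNAL TABLE `{dataset}.{table}`\n"
        f"WITH CONNECTION `{connection}`\n"
        "OPTIONS (\n"
        "  format = 'ICEBERG',\n"
        f"  uris = ['{metadata_uri}']\n"
        ")"
    )


# All identifiers below are illustrative assumptions.
statement = biglake_iceberg_ddl(
    dataset="lakehouse",
    table="clicks",
    connection="example-project.us.example-connection",
    metadata_uri="gs://example-bucket/warehouse/events/clicks/metadata/v1.metadata.json",
)
print(statement)
```

Because the table points at a specific metadata file, it may need to be refreshed after Spark commits new snapshots; how that is handled depends on the catalog setup chosen for the project.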

Performance Evaluation

Compare query latency, scalability under concurrent users, and total cost of ownership (TCO) between Spark (Dataproc) and BigQuery.
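Scalability under concurrent users could be probed with a thread pool that issues queries in parallel and records each latency. The following is a rough sketch, not a full load-testing tool: `run_query` is again a hypothetical callable standing in for a real Spark or BigQuery call, and the `time.sleep` workload is a placeholder.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def timed(run_query: Callable[[], object]) -> float:
    """Return the wall-clock seconds taken by one query call."""
    start = time.perf_counter()
    run_query()
    return time.perf_counter() - start


def run_concurrent(
    run_query: Callable[[], object],
    users: int,
    queries_per_user: int = 1,
) -> Dict[str, float]:
    """Issue queries from `users` parallel workers and summarize latency and throughput."""
    total = users * queries_per_user
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=users) as pool:
        latencies = sorted(pool.map(lambda _: timed(run_query), range(total)))
    wall = time.perf_counter() - wall_start

    # Nearest-rank 95th percentile over the sorted latencies.
    p95_index = max(0, math.ceil(0.95 * total) - 1)
    return {
        "queries": total,
        "wall_s": wall,
        "mean_latency_s": sum(latencies) / total,
        "p95_latency_s": latencies[p95_index],
        "throughput_qps": total / wall,
    }


# Placeholder workload: each "query" just sleeps for 10 ms.
stats = run_concurrent(lambda: time.sleep(0.01), users=4, queries_per_user=2)
```

Repeating the run with increasing `users` values and plotting p95 latency against throughput gives a simple picture of where each engine starts to saturate.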

Identify best-fit scenarios (e.g., Dataproc for complex ETL, BigQuery for fast analytics).
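A first-order cost comparison can be sketched as follows. It assumes a deliberately simplified model: Dataproc cost as nodes times runtime times an all-in hourly rate, and BigQuery on-demand cost as data scanned times a per-TiB rate. The rates used in the example are placeholders, not actual list prices, and the model ignores storage, slot reservations, and idle cluster time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """Hypothetical unit prices; substitute current rates for the target region."""
    dataproc_node_hour_usd: float  # assumed all-in cost of one cluster node per hour
    bigquery_usd_per_tib: float    # assumed on-demand cost per TiB scanned


def dataproc_cost(nodes: int, runtime_hours: float, prices: Prices) -> float:
    """Cluster cost for one workload run under the simplified model."""
    return nodes * runtime_hours * prices.dataproc_node_hour_usd


def bigquery_on_demand_cost(tib_scanned: float, prices: Prices) -> float:
    """On-demand query cost for one workload run under the simplified model."""
    return tib_scanned * prices.bigquery_usd_per_tib


def cheaper_engine(nodes: int, runtime_hours: float, tib_scanned: float, prices: Prices) -> str:
    """Name the engine with the lower modeled cost (ties go to BigQuery)."""
    spark = dataproc_cost(nodes, runtime_hours, prices)
    bq = bigquery_on_demand_cost(tib_scanned, prices)
    return "bigquery" if bq <= spark else "dataproc"


# Placeholder rates and workload sizes, for illustration only.
prices = Prices(dataproc_node_hour_usd=0.50, bigquery_usd_per_tib=6.00)
etl_choice = cheaper_engine(nodes=10, runtime_hours=2.0, tib_scanned=1.5, prices=prices)
```

Under these placeholder numbers the Dataproc run is modeled at 10 × 2 × 0.50 = 10.00 USD and the BigQuery run at 1.5 × 6.00 = 9.00 USD, so the function picks BigQuery; a scan-heavy job with a short cluster runtime would flip the result.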

Visualization & Insights

Store query results in BigQuery.
Use Looker Studio, optionally accelerated by BigQuery BI Engine, for interactive dashboards.

Google Cloud Services & Technologies

Google Cloud Service / Technology	Role
Cloud Storage	Acts as the data lake for storing raw datasets and Iceberg tables.
Dataproc (Spark)	Provides distributed processing and querying of Iceberg tables using Spark clusters.
Apache Iceberg	Open table format ensuring ACID compliance, schema evolution, and time travel for lakehouse datasets.
BigQuery + BigLake	Enables serverless querying and federated access to Iceberg tables stored in Cloud Storage.
Looker Studio / BigQuery BI Engine	Looker Studio provides visualization dashboards and performance insights for queries; BI Engine accelerates the BigQuery queries behind them.
Cloud Monitoring & Logging	Captures benchmarking metrics, query latency, job execution times, and system performance logs.
IAM & Cloud KMS	IAM handles data access control and permissions; Cloud KMS manages the encryption keys used to protect data at rest.
