Data Lakehouse Architecture on GCP: Performance Evaluation of Querying Iceberg Tables with Spark on Dataproc vs. BigQuery

  • Use Case: Organizations working with petabyte-scale structured and semi-structured data need a data lakehouse architecture that unifies the flexibility of data lakes (open formats such as Iceberg and Parquet) with the query performance of data warehouses (such as BigQuery). This use case evaluates query performance and cost-efficiency when accessing Iceberg tables with Spark on Dataproc versus querying them directly in BigQuery.

Objective

  • To build and compare a Data Lakehouse architecture on GCP using Apache Iceberg.
  • To evaluate query performance, cost, and scalability between Dataproc Spark (open-source, compute-intensive) and BigQuery (serverless, optimized warehouse).
  • To derive guidelines for workload placement: when to use Spark on Dataproc vs. BigQuery for ETL, analytics, and interactive queries.

Project Description

Data Ingestion & Storage

  • Large datasets (transactional logs, IoT sensor data, or clickstream data) are ingested into Cloud Storage in raw formats (CSV, JSON, or Parquet).
  • The data is then registered as Iceberg tables stored on Cloud Storage, which adds schema evolution and ACID transactions.
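
The registration step above can be sketched as follows. This is a minimal sketch, not the project's actual setup: it assumes a Hadoop-type Iceberg catalog rooted in a Cloud Storage bucket and that the Iceberg Spark runtime JAR is already on the cluster's classpath. The catalog name, bucket, table name, and columns are hypothetical placeholders, and `build_session` needs PySpark installed.

```python
from typing import Dict, Optional


def iceberg_spark_conf(catalog: str, warehouse_uri: str) -> Dict[str, str]:
    """Spark properties that register a Hadoop-type Iceberg catalog at warehouse_uri."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "hadoop",
        f"{prefix}.warehouse": warehouse_uri,
    }


def create_table_sql(catalog: str, namespace: str, table: str,
                     columns: Dict[str, str],
                     partition_by: Optional[str] = None) -> str:
    """Build a Spark SQL statement that creates an Iceberg table."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    ddl = (f"CREATE TABLE IF NOT EXISTS {catalog}.{namespace}.{table} "
           f"({cols}) USING iceberg")
    if partition_by:
        ddl += f" PARTITIONED BY ({partition_by})"
    return ddl


def build_session(conf: Dict[str, str]):
    """Create a SparkSession with the given properties (requires PySpark)."""
    # Imported lazily so the helpers above can be used without Spark installed.
    from pyspark.sql import SparkSession
    builder = SparkSession.builder.appName("iceberg-lakehouse")
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


if __name__ == "__main__":
    # Hypothetical names: replace the bucket, catalog, and schema with your own.
    conf = iceberg_spark_conf("lake", "gs://example-bucket/warehouse")
    ddl = create_table_sql("lake", "db", "events",
                           {"event_id": "bigint", "ts": "timestamp"},
                           partition_by="days(ts)")
    print(conf)
    print(ddl)
```

On a cluster, the statement would be executed with `build_session(conf).sql(ddl)`. A Hadoop-type catalog is chosen here only because it needs no external metastore; other catalog types may suit concurrent writers better.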

Lakehouse Querying (Spark on Dataproc)

  • Deploy a Dataproc cluster with Spark + Iceberg runtime.
  • Run ETL transformations and SQL queries on Iceberg tables.
  • Measure performance metrics (execution time, cluster utilization, scalability).
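
Execution time could be measured with a small engine-agnostic harness such as the one below. It is a sketch under assumptions: the function names and the choice of summary statistics (median and nearest-rank 95th percentile) are illustrative, and cluster utilization would still have to come from Cloud Monitoring rather than from this code.

```python
import math
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def time_query(run_query: Callable[[], object]) -> float:
    """Return the wall-clock seconds taken by one call to run_query."""
    start = time.perf_counter()
    run_query()
    return time.perf_counter() - start


def benchmark(run_query: Callable[[], object],
              runs: int = 5, concurrency: int = 1) -> Dict[str, float]:
    """Call run_query `runs` times on `concurrency` threads and summarize latency."""
    if runs < 1 or concurrency < 1:
        raise ValueError("runs and concurrency must be at least 1")
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(lambda _: time_query(run_query), range(runs)))
    # Nearest-rank percentile: the smallest value with at least 95% of samples at or below it.
    p95_index = max(0, math.ceil(0.95 * len(latencies)) - 1)
    return {
        "runs": runs,
        "min_s": latencies[0],
        "median_s": statistics.median(latencies),
        "p95_s": latencies[p95_index],
        "max_s": latencies[-1],
    }


if __name__ == "__main__":
    # Stand-in workload; replace with a real query callable.
    print(benchmark(lambda: time.sleep(0.01), runs=5, concurrency=2))
```

For Spark, `run_query` might be `lambda: spark.sql(query).collect()`. Spark evaluates lazily, so the callable has to end in an action such as `collect()` for the timing to include query execution.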

Warehouse Querying (BigQuery)

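  • Register the same Iceberg tables in BigQuery through BigLake (BigQuery’s unified lakehouse layer), so the data stays in Cloud Storage rather than being copied into BigQuery storage.
  • Run equivalent SQL queries directly in BigQuery.
  • Benchmark query latency, concurrency, and cost efficiency.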
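
One way to expose the Iceberg tables to BigQuery is a BigLake external table defined over the table's Iceberg metadata file. The sketch below only builds the DDL string; it follows the `CREATE EXTERNAL TABLE ... WITH CONNECTION ... OPTIONS (format = 'ICEBERG', ...)` form as I understand it from the BigQuery documentation, so it should be checked against the current docs before use. The dataset, connection ID, and metadata path are hypothetical, and `run_in_bigquery` assumes the `google-cloud-bigquery` package and default credentials.

```python
def biglake_iceberg_ddl(dataset: str, table: str,
                        connection: str, metadata_uri: str) -> str:
    """Build DDL for a BigLake external table over an Iceberg metadata file."""
    return (
        f"CREATE EXTERNAL TABLE `{dataset}.{table}`\n"
        f"WITH CONNECTION `{connection}`\n"
        "OPTIONS (\n"
        "  format = 'ICEBERG',\n"
        f"  uris = ['{metadata_uri}']\n"
        ")"
    )


def run_in_bigquery(sql: str):
    """Execute a statement in BigQuery and wait for it to finish."""
    # Imported lazily so the DDL builder works without the client library.
    from google.cloud import bigquery
    client = bigquery.Client()
    return client.query(sql).result()


if __name__ == "__main__":
    # Hypothetical identifiers: substitute your own dataset, connection, and path.
    print(biglake_iceberg_ddl(
        "analytics", "events",
        "my-project.us.my-connection",
        "gs://example-bucket/warehouse/db/events/metadata/v1.metadata.json",
    ))
```

Because this form points at one specific metadata file, the table definition would likely need to be updated after Spark commits new snapshots, which is worth accounting for when comparing the two engines on the same data.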

Performance Evaluation

  • Compare query latency, scalability under concurrent users, and total cost of ownership (TCO) between Spark (Dataproc) and BigQuery.
  • Identify best-fit scenarios (e.g., Dataproc for complex ETL, BigQuery for fast analytics).
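
The cost side of this comparison can be approximated with a simple per-query model, sketched below. It assumes Dataproc cost scales with cluster runtime at a flat hourly rate and that BigQuery is billed on demand per TiB scanned; both prices are parameters, and the figures in the example are placeholders, not real prices. A full TCO estimate would also include storage, idle cluster time, and operations effort.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

TIB = 1024 ** 4  # bytes in one tebibyte


def dataproc_cost(runtime_s: float, cluster_usd_per_hour: float) -> float:
    """Cost of keeping a cluster busy for runtime_s seconds at a flat hourly rate."""
    return runtime_s / 3600 * cluster_usd_per_hour


def bigquery_on_demand_cost(bytes_scanned: int, usd_per_tib: float) -> float:
    """On-demand query cost, billed by the amount of data scanned."""
    return bytes_scanned / TIB * usd_per_tib


@dataclass
class EngineResult:
    engine: str
    median_latency_s: float
    cost_usd: float


def pick_engine(results: Iterable[EngineResult],
                max_latency_s: float) -> Optional[EngineResult]:
    """Return the cheapest engine that meets the latency target, or None."""
    eligible = [r for r in results if r.median_latency_s <= max_latency_s]
    if not eligible:
        return None
    return min(eligible, key=lambda r: r.cost_usd)


if __name__ == "__main__":
    # Placeholder measurements and prices, for illustration only.
    results = [
        EngineResult("dataproc-spark", 120.0, dataproc_cost(120.0, 10.0)),
        EngineResult("bigquery", 15.0, bigquery_on_demand_cost(TIB // 2, 5.0)),
    ]
    print(pick_engine(results, max_latency_s=30.0))
```

Selecting by cost under a latency target is one possible placement rule: a batch ETL job with a loose target may land on whichever engine is cheaper, while an interactive query with a tight target is limited to engines that can meet it.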

Visualization & Insights

  • Store query results in BigQuery.
  • Use Looker Studio, optionally accelerated by BigQuery BI Engine, for interactive dashboards.

Google Cloud Services & Technologies

  • Cloud Storage: Acts as the data lake for storing raw datasets and Iceberg tables.
  • Dataproc (Spark): Provides distributed processing and querying of Iceberg tables using Spark clusters.
  • Apache Iceberg: Open table format providing ACID transactions, schema evolution, and time travel for lakehouse datasets.
  • BigQuery + BigLake: Enables serverless querying and federated access to Iceberg tables stored in Cloud Storage.
  • Looker Studio / BigQuery BI Engine: Looker Studio provides data visualization dashboards and performance insights for queries; BI Engine accelerates the BigQuery queries behind them.
  • Cloud Monitoring & Logging: Captures benchmarking metrics, query latency, job execution times, and system performance logs.
  • IAM & Cloud KMS: IAM handles data access control and permissions; Cloud KMS manages encryption keys for data at rest. Google Cloud encrypts data in transit by default.