Energy-Efficient Data Lake on AWS

  • Use Case: Organizations in finance, healthcare, genomics, and IoT generate petabytes of structured, semi-structured, and unstructured data. A data lake built on Amazon S3 stores this data for real-time analytics. However, keeping everything in S3 Standard is costly, while moving data to Glacier cuts storage cost but increases query latency.

Objective

  • Build a multi-tier data lake using S3 Standard, S3 Intelligent-Tiering, and Glacier.
  • Study performance trade-offs when querying cold vs. hot data.
  • Design an ML-based intelligent data placement policy to:
      • Predict which data will be “hot” (frequently queried).
      • Automatically move less-used data to cheaper storage.
  • Evaluate energy savings, cost reduction, and query performance.

Project Description

  • Data Ingestion: Load datasets (IoT logs, healthcare records, financial transactions) into S3.
  • Storage Tiering: Store frequently queried data in S3 Standard, historical data in Glacier, and use Intelligent-Tiering for automatic movement.
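
The storage tiering step could be expressed as an S3 lifecycle rule. The following is a minimal sketch, not the project's actual configuration: the function name `build_lifecycle_config`, the `iot-logs/` prefix, the bucket name, and the 30-day and 90-day thresholds are all assumptions chosen for illustration. It only builds the configuration dictionary; the boto3 call that would apply it is left commented out because it needs AWS credentials.

```python
def build_lifecycle_config(prefix, days_to_intelligent_tiering=30, days_to_glacier=90):
    """Return an S3 lifecycle configuration dict for objects under `prefix`.

    The day thresholds are illustrative assumptions, not recommendations.
    """
    if not 0 <= days_to_intelligent_tiering < days_to_glacier:
        raise ValueError("expected 0 <= days_to_intelligent_tiering < days_to_glacier")
    # Derive a readable rule ID from the prefix (hypothetical naming scheme).
    rule_name = prefix.strip("/").replace("/", "-") or "all"
    return {
        "Rules": [
            {
                "ID": "tier-" + rule_name,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    # New objects start in S3 Standard, then move to cheaper tiers.
                    {"Days": days_to_intelligent_tiering,
                     "StorageClass": "INTELLIGENT_TIERING"},
                    {"Days": days_to_glacier, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


config = build_lifecycle_config("iot-logs/")

# Applying the rule requires boto3 and AWS credentials, so it is not executed here:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-data-lake-bucket",  # hypothetical bucket name
#     LifecycleConfiguration=config,
# )
```

A static rule like this moves data by age only; the ML-based policy described later in the project would replace or complement it with access-driven decisions.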

Query Processing

  • Use Athena (serverless SQL queries on S3).
  • Use Redshift Spectrum (query data across S3 and Redshift).
  • Performance Monitoring: Collect metrics (query latency, cost, energy consumed).
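
One way to collect the latency and data-scanned metrics is to read the `Statistics` section that Athena returns from `get_query_execution`. The sketch below works on dictionaries shaped like that response, so it runs without AWS access. The helper names `extract_metrics` and `summarize_by_tier`, the tier labels, and the sample numbers are hypothetical. Energy consumption is not part of this response and would need a separate estimate.

```python
from collections import defaultdict
from statistics import mean


def extract_metrics(query_execution, tier):
    """Pull latency and bytes scanned from an Athena get_query_execution response."""
    stats = query_execution["QueryExecution"]["Statistics"]
    return {
        "tier": tier,  # label supplied by the caller, e.g. "hot" or "cold"
        "latency_s": stats["TotalExecutionTimeInMillis"] / 1000.0,
        "bytes_scanned": stats["DataScannedInBytes"],
    }


def summarize_by_tier(records):
    """Aggregate per-query records into per-tier totals and mean latency."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["tier"]].append(record)
    return {
        tier: {
            "queries": len(rows),
            "mean_latency_s": mean([r["latency_s"] for r in rows]),
            "total_bytes_scanned": sum(r["bytes_scanned"] for r in rows),
        }
        for tier, rows in grouped.items()
    }


def fake_response(total_ms, scanned):
    """Build a response-shaped dict with made-up numbers for demonstration."""
    return {"QueryExecution": {"Statistics": {
        "TotalExecutionTimeInMillis": total_ms,
        "DataScannedInBytes": scanned,
    }}}


records = [
    extract_metrics(fake_response(1000, 500), "hot"),
    extract_metrics(fake_response(3000, 1500), "hot"),
    extract_metrics(fake_response(9000, 4000), "cold"),
]
summary = summarize_by_tier(records)
print(summary)
```

Cost can then be estimated by multiplying `total_bytes_scanned` by the per-byte query price that applies to the account and region.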

ML Model

  • Train a predictor (based on query logs) to forecast hot vs. cold data.
  • Automate movement between tiers (Standard ↔ Glacier ↔ Intelligent-Tiering).
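
The hot-vs-cold predictor could take many forms; the source does not specify a model. As a minimal sketch, the example below trains a small logistic regression in plain Python on two features derived from query logs and maps the predicted probability to a storage class. Everything here is an assumption made for illustration: the features (accesses in the last 7 days, days since last access), the toy training data, the 0.7 and 0.3 thresholds, and the function names.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def to_features(accesses_last_7d, days_since_last_access):
    """Scale raw query-log counts into a small numeric range."""
    return [math.log1p(accesses_last_7d), days_since_last_access / 30.0]


def train_logistic(features, labels, lr=0.1, epochs=2000):
    """Fit weights and bias with batch gradient descent on the log loss."""
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_hot_probability(model, x):
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def choose_tier(p_hot, hot_threshold=0.7, cold_threshold=0.3):
    """Map a hot-probability to an S3 storage class; uncertain cases go to Intelligent-Tiering."""
    if p_hot >= hot_threshold:
        return "STANDARD"
    if p_hot <= cold_threshold:
        return "GLACIER"
    return "INTELLIGENT_TIERING"


# Toy query-log history (made up): (accesses_last_7d, days_since_last_access, was_hot)
history = [(50, 0, 1), (30, 1, 1), (12, 2, 1), (8, 1, 1),
           (1, 20, 0), (0, 45, 0), (0, 90, 0), (2, 30, 0)]
X = [to_features(a, d) for a, d, _ in history]
y = [label for _, _, label in history]
model = train_logistic(X, y)

for accesses, age in [(40, 0), (0, 60)]:
    p = predict_hot_probability(model, to_features(accesses, age))
    print(accesses, age, choose_tier(p))
```

Routing uncertain predictions to Intelligent-Tiering is one possible design choice: it lets S3 handle objects the model cannot classify confidently. A real policy would also need to account for retrieval delays and fees when data moves out of Glacier.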

AWS Services & Purpose

    Service | Purpose
    Amazon S3 (Standard, Glacier, Intelligent-Tiering) | Core data lake storage across multiple cost-performance tiers.
    AWS Glue | ETL (Extract, Transform, Load): prepares datasets and builds the metadata catalog.
    Amazon Athena | Serverless query engine for querying S3 data directly with SQL.
    Amazon Redshift + Spectrum | Data warehouse for complex queries; Spectrum queries data in S3 without loading it.