Apache Hadoop is one of the most widely used frameworks for handling Big Data. It offers a robust ecosystem of tools that allow organizations to store, process, and analyze large datasets distributed across many machines. Hadoop’s architecture supports horizontal scaling, fault tolerance, and high availability, making it an ideal choice for Big Data operations.

HDFS is the foundational storage system in Hadoop. It stores data across many nodes in a distributed cluster and is designed for high-throughput access to large data files, making it ideal for applications with large datasets.
Apache Accumulo is a distributed key-value store designed to handle large amounts of structured data efficiently. It is built on top of Apache Hadoop and leverages its capabilities, making it suitable for big data applications.
Development Language : Java
Tools : Apache NetBeans IDE 22
Operating System : Ubuntu 20.04 LTS 64bit / Windows 10
Types : NoSQL wide column store
Version: Apache Accumulo 2.1.3
Availability : Open Source
Area of Research : Big Data
Scalable Storage: Accumulo supports horizontal scaling, allowing it to handle massive datasets across distributed systems.
Strong Consistency and Security: It provides strong consistency guarantees for read and write operations, along with fine-grained, cell-level access control over who may read each key-value pair.
Integration with Hadoop Ecosystem: As part of the Apache Hadoop ecosystem, Accumulo integrates seamlessly with other big data tools like Apache Spark, Hive, and MapReduce.
Rich Data Model: It supports a flexible data model that allows users to store and manage complex data structures.
Built-in Compression and Indexing: The system features built-in data compression and automatic indexing, which enhance storage efficiency and improve query performance.
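To make the key-value data model concrete, here is a minimal sketch of writing one row with the Accumulo 2.x Java client. The instance name, ZooKeeper address, credentials, and table name are all placeholders for an existing deployment:

```java
import org.apache.accumulo.core.client.Accumulo;
import org.apache.accumulo.core.client.AccumuloClient;
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.security.ColumnVisibility;

public class AccumuloWriteExample {
    public static void main(String[] args) throws Exception {
        // Connect to a placeholder Accumulo instance via its ZooKeeper quorum.
        try (AccumuloClient client = Accumulo.newClient()
                .to("myinstance", "zk1:2181")
                .as("user", "secret").build();
             BatchWriter writer = client.createBatchWriter("mytable")) {
            // One mutation groups all changes to a single row.
            Mutation m = new Mutation("row1");
            // Cell-level security: only principals holding the "public"
            // authorization can read this particular cell.
            m.put("family", "qualifier", new ColumnVisibility("public"), "value");
            writer.addMutation(m);
        }
    }
}
```

The ColumnVisibility label attached to each cell is what enables the cell-based access control listed below.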
• Server-side programming.
• Cell-based access control.
• Designed to scale.
• Stable.
• Column-Oriented Storage.
Apache Ambari is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.
Development Language : Java
Tools : Apache NetBeans IDE 22
Operating System : Ubuntu 20.04 LTS 64bit / Windows 10
Types : Distributed computing
Version: Apache Ambari 2.7.8
Availability : Open Source
Area of Research : Big Data
Centralized Management Interface: Apache Ambari provides a user-friendly web interface that centralizes the management of Hadoop clusters.
Real-Time Monitoring: Ambari allows for real-time monitoring of cluster health, including the performance of nodes and services.
Simplified Cluster Provisioning: The tool simplifies the provisioning of Hadoop clusters, allowing users to deploy new services and nodes with minimal effort.
Alerting and Notifications: The tool includes an alerting system that notifies administrators of potential issues, such as service failures or performance degradation.
Configuration Management: With Ambari, administrators can manage configurations for Hadoop services in a centralized manner.
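Because Ambari's web UI is backed by REST APIs, the same cluster information can be fetched programmatically. A minimal sketch, assuming an Ambari server on its default port 8080 and placeholder admin credentials:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class AmbariClustersExample {
    public static void main(String[] args) throws Exception {
        // Credentials and host are placeholders for a real Ambari server.
        String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes());
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://ambari-host:8080/api/v1/clusters"))
                .header("Authorization", "Basic " + auth)
                .header("X-Requested-By", "ambari") // required by Ambari on modifying requests
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON listing of managed clusters
    }
}
```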
• Platform independent.
• Pluggable component.
• Version management and upgrade.
• Extensibility.
• Failure recovery.
Atlas is a scalable and extensible set of core foundational governance services, enabling enterprises to effectively and efficiently meet their compliance requirements within Hadoop while allowing integration with the whole enterprise data ecosystem.
Development Language : Java
Tools : Apache NetBeans IDE 22
Operating System : Ubuntu 20.04 LTS 64bit / Windows 10
Types : Metadata Management
Version: Apache Atlas 2.0.0
Availability : Open Source
Area of Research : Big Data
Metadata Management: It allows users to define, store, and manage metadata associated with data assets, providing a centralized repository for all data-related information.
Data Lineage Tracking: Users can visualize the flow of data through various transformations and processes, helping to understand the origin and journey of data across the system.
Integration with Hadoop Ecosystem: Apache Atlas integrates seamlessly with various components of the Hadoop ecosystem, including Apache Hive, Apache HBase, and Apache Kafka.
Search and Query Capabilities: Atlas provides rich search and query functionality that allows data stewards and data scientists to quickly find relevant data assets and understand their relationships.
Classification and Tagging: Apache Atlas allows users to classify and tag data assets based on predefined taxonomies. This capability helps in organizing data assets and making them easily discoverable.
• Metadata types & instances.
• Classification.
• Lineage.
• Search/Discovery.
• Security & Data Masking.
Falcon is a feed processing and feed management system aimed at making it easier for end consumers to onboard their feed processing and feed management onto Hadoop clusters.
Development Language : Java
Tools : Apache NetBeans IDE 22
Operating System : Ubuntu 20.04 LTS 64bit / Windows 10
Types : Data Governance and Integration
Version: Apache Falcon 0.11
Availability : Open Source
Area of Research : Big Data
Data Lifecycle Management: Falcon provides a comprehensive framework for managing the lifecycle of data, including ingestion, processing, retention, and archival.
Workflow Orchestration: Falcon orchestrates data-processing workflows, streamlining the management of data pipelines across different Hadoop components.
Integration with Hadoop Ecosystem: Apache Falcon integrates well with various Hadoop components, including Apache Hive, Apache Pig, and HDFS, facilitating seamless data movement and processing within the ecosystem.
Data Replication and Archiving: The tool provides features for data replication across clusters and archiving data based on retention policies.
Data Ingestion: Falcon supports data ingestion from multiple sources, enabling organizations to pull in data from various systems and formats.
• Server side extension.
• Operability.
• Entity Specification.
• Hive Integration.
• Security.
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple, extensible data model that allows for online analytic applications.
Development Language : Java
Tools : Apache NetBeans IDE 22
Operating System : Ubuntu 20.04 LTS 64bit / Windows 10
Types : Data transfer into HDFS
Version: Apache Flume 1.11.0
Availability : Open Source
Area of Research : Big Data
Stream Data Collection: Flume is primarily used to collect and aggregate large amounts of streaming data from multiple sources, such as web servers, IoT devices, and various application logs.
Architecture: Flume operates on a flexible and extensible architecture that consists of three main components: Sources, Channels, and Sinks.
Integration with Hadoop: As part of the Hadoop ecosystem, Flume integrates seamlessly with other components like HDFS and Apache Hive.
Reliability: Ensuring data reliability during ingestion and transmission can be a challenge for Flume, necessitating fault tolerance and data recovery mechanisms to minimize data loss or corruption.
Complexity: Flume's configuration and deployment can be complex, especially for users with limited experience, demanding simplified interfaces or tools to streamline the setup process.
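The source/channel/sink architecture described above is wired together in a plain properties file. A minimal illustrative agent configuration, where the agent name, port, and HDFS path are placeholders:

```
# Hypothetical Flume agent "a1": one source, one memory channel, one HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen for newline-terminated events on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# Channel: buffer events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: write events into HDFS, bucketed by date
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.channel = c1
```

An agent with this configuration would be started with flume-ng agent --conf-file example.conf --name a1.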
• Data Transformation.
• Real time processing.
• Interoperability.
• Flexible Data Routing.
• Configuration Management.
Hadoop is an open-source framework from Apache used to store, process, and analyze data that is very large in volume. It is used for batch/offline processing.
Development Language : Java
Tools : Apache NetBeans IDE 22
Operating System : Ubuntu 20.04 LTS 64bit / Windows 10
Types : Batch Processing
Version: Apache Hadoop 3.4.1
Availability : Open Source
Area of Research : Big Data
HDFS Operations: The Hadoop Distributed File System was developed on the basis of Google's GFS paper. Files are broken into blocks and stored on nodes across the distributed architecture.
YARN Operations: Yet Another Resource Negotiator is used for job scheduling and cluster management.
Map Reduce Operations: This is a framework which helps Java programs perform parallel computation on data using key-value pairs. The Map task takes input data and converts it into a data set that can be computed over as key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the reducer gives the desired result (a minimal example follows this list).
Hadoop Common Operations: These Java libraries are used to start Hadoop and are used by other Hadoop modules.
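As a concrete illustration of the Map and Reduce tasks described above, here is the classic word-count pair of classes written against the org.apache.hadoop.mapreduce API (job setup and input/output paths are omitted for brevity):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map task: turn each input line into (word, 1) key-value pairs.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            context.write(word, ONE);   // emit (word, 1)
        }
    }
}

// Reduce task: sum the counts emitted for each word.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        context.write(key, new IntWritable(sum));   // emit (word, total)
    }
}
```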
• Highly Scalable.
• Hadoop Distributed File System (HDFS).
• MapReduce.
• Distributed data storage.
• Replication.
Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop. It has a master/slave architecture, consisting of a single NameNode that performs the role of master and multiple DataNodes that perform the role of slaves.
Development Language : Java
Tools : Apache NetBeans IDE 22
Operating System : Ubuntu 20.04 LTS 64bit / Windows 10
Types : Bigdata Storage
Version: Apache Hadoop HDFS 3.1.4
Availability : Open Source
Area of Research : Big Data
Handling Hardware Failure: HDFS runs on many server machines, and if any machine fails, HDFS aims to detect the fault and recover from it quickly.
Streaming Data Access: Applications that run on HDFS require streaming access to their data sets; unlike applications on general-purpose file systems, they are served with high-throughput sequential reads rather than low-latency access.
Coherence Model: Applications that run on HDFS follow a write-once-read-many approach, so a file, once created, need not be changed; however, it can be appended to or truncated.
Hadoop Ecosystem: The Hadoop ecosystem consists of various modules, including Hadoop Distributed File System (HDFS) for storage, MapReduce for processing, and additional tools like Apache Hive and Apache HBase.
Distributed Computing Framework: Hadoop allows for the distributed processing of data across clusters of computers.
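A minimal sketch of the write-once-read-many pattern using the HDFS Java FileSystem API, where the NameNode address and file path are placeholders:

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // placeholder NameNode address
        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/demo/hello.txt");
            // Write once: create the file and write its contents.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }
            // Read many: stream the file back.
            try (FSDataInputStream in = fs.open(path)) {
                byte[] buf = new byte[(int) fs.getFileStatus(path).getLen()];
                in.readFully(buf);
                System.out.println(new String(buf, StandardCharsets.UTF_8));
            }
        }
    }
}
```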
• Highly Scalable.
• Distributed data storage.
• Replication.
• Fault Tolerance.
• Portable.
YARN (Yet Another Resource Negotiator) takes Hadoop beyond the original MapReduce-only model, letting other applications such as HBase and Spark work on the cluster. Different YARN applications can co-exist on the same cluster, so MapReduce, HBase, and Spark can all run at the same time, bringing great benefits for manageability and cluster utilization.
Development Language : Java
Tools : Apache NetBeans IDE 22
Operating System : Ubuntu 20.04 LTS 64bit / Windows 10
Types : Data Management
Version: Apache Hadoop YARN 3.4.0
Availability : Open Source
Area of Research : Big Data
Decoupled Processing Framework: Unlike the original Hadoop MapReduce, which tightly coupled resource management and job scheduling, YARN separates these concerns, allowing different data processing frameworks (e.g., Apache Spark, Apache Flink) to run on the same cluster.
Resource Management System: YARN acts as a centralized resource manager that allocates resources (CPU, memory, etc.) to various applications running in a Hadoop cluster, optimizing resource utilization.
Centralized Resource Management: It allows multiple data processing engines to share resources dynamically, making it highly efficient for big data applications.
Reservation System: The Reservation System tracks resources over time, performs admission control for reservations, and dynamically instructs the underlying scheduler to ensure that each reservation is fulfilled.
Federation: Federation transparently wires together multiple YARN (sub-)clusters and makes them appear as a single massive cluster.
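A small sketch of inspecting a cluster through the YarnClient API, the same resource-management view the frameworks above rely on. It assumes a yarn-site.xml on the classpath pointing at a real ResourceManager:

```java
import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnClusterInfoExample {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());   // reads yarn-site.xml from the classpath
        yarnClient.start();

        // Every live worker node and its resource capability.
        for (NodeReport node : yarnClient.getNodeReports(NodeState.RUNNING)) {
            System.out.println(node.getNodeId() + " capability=" + node.getCapability());
        }
        // All applications (Spark, MapReduce, etc.) sharing the cluster.
        List<ApplicationReport> apps = yarnClient.getApplications();
        apps.forEach(app -> System.out.println(
                app.getApplicationId() + " " + app.getApplicationType()));

        yarnClient.stop();
    }
}
```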
• Scalability.
• Utilization.
• Multitenancy.
• ResourceManager and NodeManager.
• Scheduling Policies.
HBase is a distributed, scalable, NoSQL database built on top of the Hadoop Distributed File System (HDFS). It is designed to handle large volumes of data across a cluster of commodity hardware. HBase is particularly well-suited for real-time read/write access to large datasets, making it an integral part of big data analytics workflows.
Development Language : Java
Tools : Apache NetBeans IDE 22
Operating System : Ubuntu 20.04 LTS 64bit / Windows 10
Types : NoSQL Database
Version: Apache HBase 3.3.1
Availability : Open Source
Area of Research : Big Data
Read Operation: HBase stores data in tables, where each table is organized by rows. Reading data involves accessing these rows using row keys.
Write Operation: HBase uses a Write-Ahead Log (WAL) to ensure data durability. When a write operation is performed, the data is first written to the WAL, which logs changes before they are applied to the MemStore.
Memory Management: The MemStore holds recently written data in memory, providing fast access for read operations. It improves performance by reducing the need for disk I/O during frequent updates.
Multiple MemStores: HBase maintains a MemStore for each column family in a table, allowing for fine-grained control over memory usage and write performance.
Compaction: HBase periodically performs compaction to merge smaller HFiles into larger ones, optimizing storage and read performance.
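To tie the read and write paths together, here is a minimal sketch with the HBase Java client. The table users and its column family info are placeholders for an existing table:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWriteExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Write: goes to the WAL first, then the MemStore, as described above.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);
            // Read: the row is located by its row key.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}
```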
• Horizontally scalable.
• Automatic Failover.
• Integrations with Map/Reduce framework.
• Multidimensional sorted map.
• Storing and Retrieving data.
Hive is a data warehouse system used to analyze structured data. It is built on top of Hadoop. Hive provides the functionality of reading, writing, and managing large datasets residing in distributed storage. It runs SQL-like queries, called HQL (Hive Query Language), which are internally converted into MapReduce jobs.
Development Language : Java
Tools : Apache NetBeans IDE 22
Operating System : Ubuntu 20.04 LTS 64bit / Windows 10
Types : Data Warehouse
Version: Apache Hive 3.2.4
Availability : Open Source
Area of Research : Big Data
SQL-Like Query Language: Hive uses a query language called HiveQL, which resembles SQL, making it easier for users familiar with relational databases to interact with big data.
Data Warehousing: It allows users to summarize, query, and analyze large datasets efficiently.
Schema on Read: Unlike traditional databases that enforce a schema on write, Hive applies a schema on read, meaning data can be stored in its raw form and interpreted during query execution.
Partitioning and Bucketing: Hive supports partitioning and bucketing, which helps in organizing large datasets for efficient querying and data retrieval, thereby improving performance.
Batch Processing: Hive is optimized for batch processing rather than real-time queries. It is suitable for ETL (Extract, Transform, Load) operations and periodic data analysis tasks.
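Because HiveServer2 speaks JDBC, HiveQL can be issued like ordinary SQL. A minimal sketch, assuming a HiveServer2 on its default port 10000 and a hypothetical sales table:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Host and credentials are placeholders for a real HiveServer2.
        try (Connection con = DriverManager.getConnection(
                "jdbc:hive2://hive-host:10000/default", "user", "");
             Statement stmt = con.createStatement()) {
            // HiveQL looks like SQL but is compiled into distributed batch jobs.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT category, COUNT(*) FROM sales GROUP BY category")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }
}
```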
• Fast and scalable.
• Provides SQL-like queries.
• Indexing.
• Analyzing large datasets.
• User-Defined Functions (UDFs).
Apache Kafka is an open-source stream-processing software platform used to handle real-time data feeds. It works as a broker between two parties, i.e., a sender and a receiver. It can handle trillions of data events in a day.
Development Language : Java
Tools : Apache NetBeans IDE 22
Operating System : Ubuntu 20.04 LTS 64bit / Windows 10
Types : Stream Processing
Version: Apache Kafka 3.8.0
Availability : Open Source
Area of Research : Big Data
Topics: In Kafka, the word topic refers to a category or a common name used to store and publish a particular stream of data.
Partitions: A topic is split into several parts, which are known as the partitions of the topic; the partitions are numbered in order.
Brokers: A broker is a container that holds several topics with their multiple partitions.
Producers: A producer is the one that publishes or writes data to topics within different partitions. A producer knows which data should be written to which partition and broker.
Consumers: A consumer is the one that consumes or reads data from the Kafka cluster via a topic. A consumer also knows from which broker it should read the data.
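A minimal producer sketch showing topics, keys, and partitions in practice, assuming a placeholder broker address and a hypothetical page-views topic:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KafkaProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // placeholder broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key are routed to the same partition,
            // which preserves their relative order.
            producer.send(new ProducerRecord<>("page-views", "user-42", "/home"));
            producer.send(new ProducerRecord<>("page-views", "user-42", "/checkout"));
        }
    }
}
```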
• Publish-Subscribe Messaging.
• Durability and Reliability.
• High Throughput.
• Stream Processing.
• Distributed System.
Apache Knox is a REST API gateway that authenticates users and acts as a single access point for a Hadoop cluster.
Development Language : Java
Tools : Apache NetBeans IDE 22
Operating System : Ubuntu 20.04 LTS 64bit / Windows 10
Types : Security Entry Point
Version: Apache Knox 2.1.0
Availability : Open Source
Area of Research : Big Data
Proxying Services: One of the primary goals of the Apache Knox project is to provide access to Apache Hadoop via proxying of HTTP resources.
Authentication Services: Authentication for REST API access as well as WebSSO flow for UIs. LDAP/AD, Header based PreAuth, Kerberos, SAML, OAuth are all available options.
Client Services: Client development can be done with scripting through DSL or using the Knox Shell classes directly as SDK.
Single Sign-On (SSO): Knox enables single sign-on capabilities, allowing users to authenticate once and gain access to multiple Hadoop services without needing to log in repeatedly.
KnoxSSO Service: It is an integration service that provides a normalized SSO token for representing the authenticated user.
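In practice, a client talks only to the gateway URL, which proxies the request to the underlying service. A sketch of calling WebHDFS through Knox, where the host, the sandbox topology name, and the credentials are all placeholders (a real deployment would also need the gateway's TLS certificate trusted):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class KnoxWebHdfsExample {
    public static void main(String[] args) throws Exception {
        String auth = Base64.getEncoder().encodeToString("guest:guest-password".getBytes());
        // One gateway endpoint fronts the whole cluster; the topology name
        // ("sandbox" here) selects which set of services is exposed.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://knox-host:8443/gateway/sandbox/webhdfs/v1/tmp?op=LISTSTATUS"))
                .header("Authorization", "Basic " + auth)
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());   // WebHDFS JSON answered via the gateway
    }
}
```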
• Protocol Transformation.
• Token-Based Authentication.
• Audit Logging.
• Gateway for Hadoop Services.
• Enhances Security.
Apache Oozie is a scheduler system to run and manage Hadoop jobs in a distributed environment. It allows multiple complex jobs to be combined and run in sequential order to achieve a bigger task.
Development Language : Java
Tools : Apache NetBeans IDE 22
Operating System : Ubuntu 20.04 LTS 64bit / Windows 10
Types : Workflow Scheduler
Version: Apache Oozie 5.2.0
Availability : Open Source
Area of Research : Big Data
Workflow Management: Oozie allows users to define workflows that can include a mix of different types of Hadoop jobs, such as MapReduce, Pig, Hive, and more, providing a unified approach to job scheduling.
Job Scheduling: It supports both time-based scheduling and dependency-based scheduling, allowing jobs to be triggered based on time intervals or upon the completion of other jobs.
Integration with YARN: Oozie leverages YARN's capabilities to manage resources across the cluster efficiently.
Coordinators and Bundles: Coordinators and bundles allow multiple workflows to be grouped and managed as a single entity, facilitating large-scale job management.
Extensibility: Users can extend Oozie's capabilities by creating custom actions and supporting new job types, making it adaptable to various data processing needs.
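A minimal sketch of submitting a workflow with the Oozie Java client API, assuming an Oozie server on its default port 11000 and a workflow.xml already deployed to a placeholder HDFS path:

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class OozieSubmitExample {
    public static void main(String[] args) throws Exception {
        // Server URL, HDFS path, and user name are placeholders.
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        Properties conf = oozie.createConfiguration();
        // Points at a workflow.xml definition already deployed to HDFS.
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/apps/my-workflow");
        conf.setProperty("user.name", "etl-user");

        String jobId = oozie.run(conf);   // submit and start the workflow
        System.out.println("Workflow job submitted: " + jobId);
        System.out.println("Status: " + oozie.getJobInfo(jobId).getStatus());
    }
}
```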
• Support for Multiple Job Types.
• Error Handling and Retry Mechanisms.
• Dynamic Job Configuration.
• Support Hadoop jobs.
• Event and time triggers.
Apache Phoenix is an open source, massively parallel relational database layer built on Apache HBase. Phoenix allows you to use SQL-like queries over HBase.
Development Language : Java
Tools : Apache NetBeans IDE 22
Operating System : Ubuntu 20.04 LTS 64bit / Windows 10
Types : SQL database
Version: Apache Phoenix 5.2.0
Availability : Open Source
Area of Research : Big Data
SQL Layer for HBase: Apache Phoenix provides a SQL abstraction over HBase, making it easy for developers familiar with SQL to work with large datasets stored in HBase.
Performance Optimization: Phoenix optimizes query performance by compiling SQL queries into native HBase scans and operations.
Indexing for Fast Retrieval: It supports secondary indexing, allowing faster retrieval of data.
Columnar Storage: Phoenix leverages HBase's columnar storage model, enabling efficient storage and retrieval of sparse data in wide tables, which is particularly useful in big data.
Transactions: While HBase provides row-level transactions, Phoenix integrates with Tephra to add cross-row and cross-table transaction support with full ACID semantics.
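A minimal JDBC sketch, assuming a placeholder ZooKeeper host for the backing HBase cluster; note Phoenix's UPSERT statement in place of INSERT/UPDATE:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PhoenixJdbcExample {
    public static void main(String[] args) throws Exception {
        // The JDBC URL names the ZooKeeper quorum of the backing HBase cluster.
        try (Connection con = DriverManager.getConnection("jdbc:phoenix:zk-host:2181")) {
            con.createStatement().execute(
                    "CREATE TABLE IF NOT EXISTS metrics (host VARCHAR PRIMARY KEY, cpu DECIMAL)");
            // Phoenix uses UPSERT rather than separate INSERT and UPDATE.
            try (PreparedStatement ps = con.prepareStatement("UPSERT INTO metrics VALUES (?, ?)")) {
                ps.setString(1, "web-01");
                ps.setBigDecimal(2, new java.math.BigDecimal("0.75"));
                ps.executeUpdate();
            }
            con.commit();   // Phoenix connections are not auto-commit by default
            try (ResultSet rs = con.createStatement()
                    .executeQuery("SELECT host, cpu FROM metrics")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " " + rs.getBigDecimal(2));
                }
            }
        }
    }
}
```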
• Multi-tenancy.
• Schema at read-time.
• Dynamic Job Configuration.
• Real time queries.
• Built on top of the HBase database.
Apache Pig is an open-source platform, part of the Apache Software Foundation, used for processing and analyzing large data sets in Hadoop. Pig provides a high-level scripting language, called Pig Latin, which simplifies the development of complex data processing tasks and workflows.
Development Language : Java
Tools : Apache NetBeans IDE 22
Operating System : Ubuntu 20.04 LTS 64bit / Windows 10
Types : High level scripting language
Version: Apache Pig 0.17.0
Availability : Open Source
Area of Research : Big Data
Data Flow Model: Pig allows users to define a data flow, specifying how data should be loaded, transformed, and stored.
Support for Structured and Unstructured Data: Unlike SQL, which is limited to structured data, Pig can handle unstructured and semi-structured data sources like logs, text files, and JSON, giving it flexibility in various big data scenarios.
Integration with Hadoop: Apache Pig is tightly integrated with Hadoop, running on top of the Hadoop Distributed File System (HDFS) and converting Pig Latin scripts into MapReduce jobs.
Optimization: Pig scripts are optimized by automatically converting them into an efficient series of MapReduce tasks.
Iterative Data Processing: Pig excels in iterative data processing tasks where multiple passes over the same data are required, making it suitable for machine learning pipelines or complex analytical tasks in big data.
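A minimal sketch of the data flow model, using the PigServer Java API to run embedded Pig Latin in local mode (the input and output paths are placeholders):

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWordCountExample {
    public static void main(String[] args) throws Exception {
        // Local mode for illustration; use ExecType.MAPREDUCE against a cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);
        // Each registerQuery line is one Pig Latin statement in the data flow.
        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
        pig.store("counts", "wordcount-output");   // triggers execution of the whole flow
        pig.shutdown();
    }
}
```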
• Supports a rich set of operations.
• Optimization opportunities.
• Extensibility.
• In-built operators.
• Schema Flexibility.
Apache Ranger is a tool to manage access control policies for Hadoop/Hive and related object storage systems such as Delta Lake.
Development Language : Java
Tools : Apache NetBeans IDE 22
Operating System : Ubuntu 20.04 LTS 64bit / Windows 10
Types : Data Security Framework
Version: Apache Ranger 2.5.0
Availability : Open Source
Area of Research : Big Data
Centralized Security Administration: Ranger offers a single, centralized interface for managing security policies across various big data services.
Fine-Grained Access Control: It enables granular permissions, allowing organizations to set specific access rules for users, roles, and groups at various levels (e.g., database, table, column, or file level) for different Hadoop components.
Audit and Reporting: Ranger maintains a detailed audit trail of all access events. This makes it easier to track who accessed what data, when, and how, which is crucial for compliance with regulatory standards like GDPR and HIPAA.
Support for Role-Based Access Control (RBAC): It enables organizations to assign roles to users and manage permissions at the role level, streamlining the process of access control and making it easier to manage large numbers of users.
Policy Enforcement: Ranger enforces security policies dynamically, ensuring that access control rules are applied in real-time without interrupting ongoing data operations.
• Centralized Policy Management.
• Dynamic Role-Based Access Control (RBAC).
• Audit and Monitoring.
• LDAP and Active Directory Integration.
• Integration with Apache Atlas.
Apache Slider is a robust, open-source framework developed by the Apache Software Foundation. It is designed to deploy existing distributed applications over Apache Hadoop YARN, turning them into long-running YARN services.
Development Language : Java
Tools : Apache NetBeans IDE 22
Operating System : Ubuntu 20.04 LTS 64bit / Windows 10
Types : Framework for YARN
Version: Apache Slider 0.80.0
Availability : Open Source
Area of Research : Big Data
Dynamic Resource Allocation: Slider manages resources efficiently, providing dynamic scaling of applications based on workload.
Application Monitoring and Management: Slider keeps track of application health and provides fault detection and recovery functionality.
Support for Distributed Storage Systems: Slider supports distributed storage systems such as HDFS for application data and packages.
Easy Integration: It allows easy integration with applications, enabling them to be deployed as YARN applications without any code modifications.
YARN Integration: Apache Slider uses YARN’s capabilities to allocate resources dynamically, ensuring that applications get the required compute and memory resources from the Hadoop cluster.
• Configure different application instances.
• Stop or Restart application instances.
• Expand or shrink application instances.
• Deploying Long-Running Applications.
Apache Solr is an open-source search platform built on Apache Lucene, designed to handle large-scale data and provide advanced search capabilities.
Development Language : Java
Tools : Apache NetBeans IDE 22
Operating System : Ubuntu 20.04 LTS 64bit / Windows 10
Types : Search platform
Version: Apache Solr 9.7.0
Availability : Open Source
Area of Research : Big Data
Distributed Search & Indexing (SolrCloud): Apache Solr can handle distributed search across multiple servers using SolrCloud, which enables scaling to accommodate large datasets while maintaining high performance.
Real-Time Indexing: It supports near real-time indexing of data, ensuring that search results are always up-to-date, which is crucial in fast-moving big data environments.
Integration with Big Data Tools: Solr integrates well with other big data processing tools such as Hadoop, Apache Spark, and Apache Nutch.
Rest API Support: Solr provides comprehensive REST APIs, making it easy to integrate with other applications and services in big data environments.
Faceted Search: Faceted search allows the categorization of search results into multiple dimensions.
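A minimal SolrJ sketch that indexes one document and then runs a faceted full-text query against it. The host, the articles collection, and the dynamic field names are placeholders:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.Http2SolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class SolrIndexSearchExample {
    public static void main(String[] args) throws Exception {
        try (Http2SolrClient client =
                 new Http2SolrClient.Builder("http://solr-host:8983/solr/articles").build()) {
            // Index a document...
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");
            doc.addField("title_txt", "Big Data search with Solr");
            doc.addField("category_s", "search");
            client.add(doc);
            client.commit();   // make it visible to searches
            // ...then run a full-text query faceted by category.
            SolrQuery query = new SolrQuery("title_txt:solr");
            query.addFacetField("category_s");
            QueryResponse response = client.query(query);
            System.out.println("hits: " + response.getResults().getNumFound());
        }
    }
}
```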
• Advanced full-text search.
• Near real-time indexing.
• Standards-based open interfaces like XML.
• Linearly Scalable.
• Auto-index replication.
Apache Spark is an open-source cluster computing framework. Its primary purpose is to handle real-time generated data. Spark was built on top of Hadoop MapReduce and is optimized to run in memory, whereas alternative approaches like Hadoop's MapReduce write data to and from disk. As a result, Spark processes data much more quickly than the alternatives.
Development Language : Java
Tools : Apache NetBeans IDE 22
Operating System : Ubuntu 20.04 LTS 64bit / Windows 10
Types : Hybrid Framework(Batch and stream)
Version: Apache Spark 3.5.2
Availability : Open Source
Area of Research : Big Data
Data Integration: The data generated by different systems is rarely consistent enough to combine for analysis. Processes like extract, transform, and load (ETL) are used to fetch consistent data from these systems, and Spark reduces the cost and time required for this ETL work.
Stream Processing: It is always difficult to handle real-time generated data such as log files. Spark can operate on streams of data, for example to flag potentially fraudulent operations as they occur.
Interactive Analytics: Spark is able to respond rapidly, so instead of running only pre-defined queries, users can explore the data interactively.
In-Memory Processing: By keeping data in memory between operations, Spark achieves high throughput and lower latency for iterative algorithms.
Distributed Data Processing: Spark distributes data and processing tasks across a cluster of machines, enabling it to process massive datasets that would be infeasible for a single machine.
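A minimal word-count sketch in Spark's Java API, illustrating in-memory distributed processing; the local[*] master and the input path are placeholders for a real cluster and data set:

```java
import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

public class SparkWordCountExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("word-count")
                .master("local[*]")   // local threads for illustration; a cluster URL in production
                .getOrCreate();

        JavaRDD<String> lines = spark.read().textFile("input.txt").javaRDD();
        lines.flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
             .mapToPair(word -> new Tuple2<>(word, 1))
             .reduceByKey(Integer::sum)   // intermediate results stay in memory
             .collect()
             .forEach(t -> System.out.println(t._1() + ": " + t._2()));

        spark.stop();
    }
}
```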
• Speed.
• Support multiple languages.
• Advanced Analytics.
• Generality.
• Lightweight.
Apache Sqoop is an open-source data transfer tool. It is part of the Apache Software Foundation's ecosystem and is used for efficiently transferring large amounts of data between relational databases and Hadoop. Sqoop is designed to import data from relational databases (such as MySQL, PostgreSQL, Oracle, etc.) into Hadoop's HDFS or Hive, and to export data from Hadoop back to relational databases.
Development Language : Java
Tools : Apache NetBeans IDE 22
Operating System : Ubuntu 20.04 LTS 64bit / Windows 10
Types : Data transfer tool
Version: Apache Sqoop 1.4.7
Availability : Open Source
Area of Research : Big Data
Data Import: Sqoop translates imports from an RDBMS into equivalent MapReduce jobs for parallel execution, optimizing the transfer of large datasets.
Data Export: Sqoop allows for exporting data from HDFS, Hive, or HBase back into relational databases, enabling the results of big data analytics to be integrated back into structured data systems.
Bulk Transfer: It supports bulk transfers, handling large datasets efficiently, and optimizing data import and export processes by splitting the data into chunks and using parallel processing.
Incremental Data Transfer: Sqoop supports incremental data import, allowing it to import only new or modified data from relational databases since the last import.
Support for Multiple Databases: Sqoop is compatible with many popular databases, including MySQL, Oracle, PostgreSQL, SQL Server, DB2, and others.
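A representative import invocation, with placeholder host, database, table, and credentials; the --incremental flags implement the incremental transfer described above:

```
# Import a MySQL table into HDFS with four parallel map tasks
sqoop import \
  --connect jdbc:mysql://db-host/sales \
  --username etl --password-file /user/etl/.db-password \
  --table orders \
  --target-dir /data/orders \
  --num-mappers 4 \
  --incremental append --check-column order_id --last-value 100000
```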
• Fast data analysis.
• Load balancing.
• Parallel data transfer.
• Import of SQL query results.
• Support for Accumulo.
Apache Storm is a free and open-source distributed real-time computation system. Developed by the Apache Software Foundation, it's designed for processing large volumes of high-velocity data. It's capable of processing over a million tuples per second per node, making it highly suitable for real-time analytics.
Development Language : Java
Tools : Apache NetBeans IDE 22
Operating System : Ubuntu 20.04 LTS 64bit / Windows 10
Types : Distributed Stream Processing
Version: Apache Storm 2.6.3
Availability : Open Source
Area of Research : Big Data
Storm UI: Storm has a built-in web-based user interface that provides real-time statistics about topologies, including throughput, latency, and error rates.
Metrics Collection: Storm supports built-in metrics for monitoring tuple processing, including throughput, latency, and error rates. Metrics can be sent to external systems like Graphite or Prometheus for monitoring.
Authentication and Authorization: Apache Storm can be integrated with Kerberos for securing communication between components like Nimbus and Supervisor, and between the cluster and the client.
Rolling Restart: Storm supports rolling restarts of components to ensure that topologies continue processing without a full cluster restart.
Acking and Failing: Storm provides mechanisms for acknowledging or failing tuples. If a bolt processes a tuple successfully, it can acknowledge it. If there’s a failure in processing, the tuple can be marked as failed and replayed.
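A minimal runnable topology sketch: a spout that emits words and a bolt that counts them, wired with a fields grouping so that the same word always reaches the same bolt task. All names are illustrative, and the topology runs in an in-process LocalCluster for testing:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class StormWordCountExample {

    // Spout: endlessly emits random words as the stream source.
    public static class WordSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final String[] words = {"hadoop", "storm", "kafka"};

        public void open(Map<String, Object> conf, TopologyContext ctx,
                         SpoutOutputCollector collector) {
            this.collector = collector;
        }
        public void nextTuple() {
            Utils.sleep(100);
            collector.emit(new Values(words[(int) (Math.random() * words.length)]));
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    // Bolt: counts words; BaseBasicBolt acks each tuple automatically.
    public static class CountBolt extends BaseBasicBolt {
        private final Map<String, Integer> counts = new HashMap<>();

        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String word = tuple.getStringByField("word");
            counts.merge(word, 1, Integer::sum);
            System.out.println(word + " -> " + counts.get(word));
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new WordSpout(), 1);
        // fieldsGrouping routes the same word to the same bolt task.
        builder.setBolt("counter", new CountBolt(), 2)
               .fieldsGrouping("words", new Fields("word"));

        try (LocalCluster cluster = new LocalCluster()) {   // in-process cluster for testing
            cluster.submitTopology("word-count", new Config(), builder.createTopology());
            Utils.sleep(10_000);
        }
    }
}
```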
• Easy to operate.
• Integrate with any programming language.
• Latency of milliseconds.
• Fault Tolerance.
• Scalable.
Apache Tez is a framework that creates a complex directed acyclic graph (DAG) of tasks for processing data.
Development Language : Java
Tools : Apache NetBeans IDE 22
Operating System : Ubuntu 20.04 LTS 64bit / Windows 10
Types : Framework for YARN
Version: Apache Tez 0.9.2
Availability : Open Source
Area of Research : Big Data
DAG Execution: Tez represents data processing jobs as directed acyclic graphs, enabling more complex workflows than the traditional MapReduce model, which relies on a rigid two-stage processing structure.
Optimized Performance: By optimizing task execution and reducing overhead associated with MapReduce, Tez enhances performance for batch processing, reducing latency and improving resource utilization.
Support for Different Data Processing Paradigms: Tez supports a variety of processing paradigms, including batch processing, streaming, and iterative processing, making it versatile for different big data applications.
Execution Planning: Tez includes a sophisticated execution planning mechanism that optimizes the execution of jobs based on resource availability and data locality, further enhancing performance.
Interoperability: Tez is designed to work seamlessly with existing Hadoop components, including HDFS, Hive, and Pig.
• Directed Acyclic Graph (DAG) Execution.
• Support for Complex Workflows.
• Optimized Execution Plans.
• Compatibility with Hive and Pig.
• Dynamic Task Scheduling.
Apache Zeppelin is an open-source web-based notebook that enables interactive data analytics and collaborative data science.
Development Language : Java
Tools : Apache NetBeans IDE 22
Operating System : Ubuntu 20.04 LTS 64bit / Windows 10
Types : Web based Notebook
Version: Apache Zeppelin 0.10.0
Availability : Open Source
Area of Research : Big Data
Web-Based Notebook: Apache Zeppelin is a web-based notebook that enables interactive data exploration and visualization, allowing users to create, share, and collaborate on data analysis projects.
Dynamic Data Visualization: The platform provides built-in support for dynamic visualizations, enabling users to create interactive graphs and charts to better understand their data.
Collaboration Features: The notebook format encourages collaboration among data scientists and analysts, allowing them to work together on data exploration and share insights in real-time.
Cloud and On-Premises Deployment: The tool can be deployed in various environments, whether on-premises or in the cloud, making it adaptable to different infrastructure needs.
Interactive Data Analysis: Users can perform interactive data analysis through code execution, enabling rapid experimentation and adjustment based on immediate feedback from their data.
• Interactive Notebook Interface.
• Dynamic Visualization.
• Extensibility.
• Dynamic Forms.
• Collaboration and Sharing.
ZooKeeper is a distributed coordination service for managing a large set of hosts. Coordinating and managing a service in a distributed environment is a complicated process, which ZooKeeper simplifies.
Development Language : Java
Tools : Apache NetBeans IDE 22
Operating System : Ubuntu 20.04 LTS 64bit / Windows 10
Types : Distributed Computing
Version: Apache ZooKeeper 3.8.x
Availability : Open Source
Area of Research : Big Data
Configuration Management: Apache ZooKeeper helps manage configuration settings for distributed applications by providing a centralized repository.
Naming Service: It acts as a naming registry, allowing distributed applications to discover and access services using unique identifiers, simplifying service management.
Leader Election: In distributed systems, ZooKeeper can help elect a leader among nodes, facilitating coordination and management tasks among the cluster.
Group Management: It allows for dynamic group management, where applications can join or leave groups, and ZooKeeper maintains the list of current members, enabling efficient communication and coordination.
Persistent and Ephemeral Nodes: ZooKeeper supports both persistent and ephemeral nodes.
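A minimal sketch with the ZooKeeper Java client that connects to a placeholder ensemble, stores shared configuration in a persistent node, and registers an ephemeral node of the kind used for group membership and leader election:

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperConfigExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Connect to a placeholder ensemble; the watcher fires on connection events.
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Persistent node: survives client disconnects -- suited to shared configuration.
        if (zk.exists("/app-config", false) == null) {
            zk.create("/app-config", "batch.size=500".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        // Ephemeral node: vanishes when this session dies -- the basis of leader election.
        zk.create("/app-config/worker-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        byte[] data = zk.getData("/app-config", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}
```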
• Distributed Coordination.
• Atomic Operations.
• Real-Time Notifications.
• Hierarchical Namespace.
• Simple Architecture.
Druid is a column-oriented, open-source, distributed data store written in Java. Druid is designed to quickly ingest massive quantities of event data, and provide low-latency queries on top of the data.
Development Language : Java
Tools : Apache NetBeans IDE 22
Operating System : Ubuntu 20.04 LTS 64bit / Windows 10
Types : Data Store
Version: Apache Druid 30.0.0
Availability : Open Source
Area of Research : Big Data
Data Ingestion: Apache Druid supports various ingestion methods, including batch ingestion from files and real-time ingestion from streaming sources like Apache Kafka.
Data Storage: Druid uses a column-oriented storage format, which optimizes performance for analytical queries.
Query Execution: Druid provides a powerful SQL-like query language that allows users to perform complex analytical queries efficiently.
Real-Time Analytics: Druid is designed for real-time analytics, allowing users to run queries on data as it is ingested.
Cluster Management: Druid includes capabilities for managing and monitoring cluster health, performance, and resource utilization.
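Queries are typically posted as JSON to Druid's SQL endpoint. A minimal sketch, assuming a router on its default port 8888 and a hypothetical page_views datasource:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DruidSqlQueryExample {
    public static void main(String[] args) throws Exception {
        // Host and datasource are placeholders; __time is Druid's built-in timestamp column.
        String body = "{\"query\": \"SELECT page, COUNT(*) AS views "
                + "FROM page_views WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR "
                + "GROUP BY page ORDER BY views DESC LIMIT 10\"}";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://druid-host:8888/druid/v2/sql/"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());   // JSON result rows
    }
}
```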
• Interactive analytics.
• Real-time dashboards.
• High concurrency workloads.
• Real-time data ingestion.
• Sub-second query performance.
Apache Samza is a distributed stream processing framework, similar to Apache Storm or Apache Flink, that processes data in real time from a variety of sources.
Development Language : Java
Tools : Apache NetBeans IDE 22
Operating System : Ubuntu 20.04 LTS 64bit / Windows 10
Types : Stream Processing
Version: Apache Samza 1.5.0
Availability : Open Source
Area of Research : Big Data
Stream Processing: Apache Samza is designed for stream processing, enabling real-time data processing from sources like Apache Kafka.
Windowing and Time-based Operations: Samza allows users to define time-based windows for data processing, enabling the aggregation of events that occur within a specific timeframe.
State Management: Apache Samza maintains state across different processing tasks, allowing it to handle complex workflows and stateful operations.
Asynchronous Processing: Samza supports asynchronous processing, which allows tasks to handle incoming data streams without blocking, improving overall throughput and system responsiveness.
Metrics and Monitoring: Apache Samza provides metrics for monitoring task performance and system health.
• Real-time stream processing.
• Advanced data processing features.
• Simple API.
• Pluggable.
• Process Isolation.
Apache Flink is a real-time processing framework which can process streaming data. It is an open-source stream processing framework for high-performance, scalable, and accurate real-time applications. It has a true streaming model and does not take input data as batches or micro-batches.
Development Language : Java
Tools : Apache NetBeans IDE 22
Operating System : Ubuntu 20.04 LTS 64bit / Windows 10
Types : Stream Processing
Version: Apache Flink 1.20.0
Availability : Open Source
Area of Research : Big Data
Stream Processing: Apache Flink excels in real-time stream processing, enabling users to handle continuous data streams efficiently.
Stateful Computations: Flink provides robust state management, allowing users to maintain state across distributed processing tasks.
Windowing and Aggregation: Users can define time windows for processing streams, allowing for aggregations and analytics based on specified time intervals.
Complex Event Processing (CEP): Flink provides a CEP library that allows users to detect patterns and trends in streaming data.
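A minimal sketch of the streaming model: a socket source, a key-partitioned stream, and a ten-second tumbling window aggregation. The host and port are placeholders:

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class FlinkWindowedWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Unbounded source: lines arriving on a TCP socket (placeholder host/port).
        env.socketTextStream("stream-host", 9999)
           .flatMap((FlatMapFunction<String, Tuple2<String, Integer>>) (line, out) -> {
               for (String word : line.split("\\s+")) {
                   out.collect(Tuple2.of(word, 1));   // each record is processed as it arrives
               }
           })
           .returns(Types.TUPLE(Types.STRING, Types.INT))
           .keyBy(t -> t.f0)                                           // partition the stream by word
           .window(TumblingProcessingTimeWindows.of(Time.seconds(10))) // 10-second tumbling windows
           .sum(1)                                                     // stateful per-window aggregation
           .print();

        env.execute("windowed-word-count");
    }
}
```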
• Unified Batch and Stream Processing.
• High Throughput.
• High Level API.
• Stateful Processing.
• SQL support.