It is a framework that simplifies data pipeline processing and data management on Hadoop clusters
It abstracts complicated data management workflows into generalized entity definitions; cluster, feed, and process are the three entity types
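The three entity types are declared in XML. Assuming the framework described here is Apache Falcon (which uses exactly this cluster/feed/process model), a cluster entity might be sketched as follows; all names, endpoints, and versions below are illustrative, not taken from the original text:

```xml
<!-- Sketch of a Falcon cluster entity; endpoints and names are hypothetical -->
<cluster name="primaryCluster" colo="east" xmlns="uri:falcon:cluster:0.1">
  <interfaces>
    <!-- where Falcon reads, writes, and submits work for this cluster -->
    <interface type="readonly" endpoint="hftp://namenode:50070" version="2.2.0"/>
    <interface type="write" endpoint="hdfs://namenode:8020" version="2.2.0"/>
    <interface type="execute" endpoint="resourcemanager:8050" version="2.2.0"/>
    <interface type="workflow" endpoint="http://oozie:11000/oozie/" version="4.0.0"/>
  </interfaces>
  <locations>
    <location name="staging" path="/apps/falcon/staging"/>
    <location name="temp" path="/tmp"/>
    <location name="working" path="/apps/falcon/working"/>
  </locations>
</cluster>
```

Feed and process entities are declared the same way and reference the cluster by name, which is how dependencies between infrastructure, data, and processing logic are expressed.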
Its tooling can be set up to manage dependencies between system infrastructure, data, and processing logic
It supports declarative programming through simple APIs.
Example: Suppose hourly raw input data are processed with a Pig script and the results are saved for further processing. Although an Oozie workflow can manage the task itself, the process still needs automation because Oozie lacks high-level features such as retention and retry handling. The input data carry a 90-day retention policy, after which old data are discarded. If the processing step fails, it is retried a set number of times, and the output data carry a three-year retention policy.
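Under the same assumption that this describes Apache Falcon, the scenario above could be expressed as a feed entity with a 90-day retention policy plus a process entity that runs the Pig script hourly with retries. This is a minimal sketch; feed names, paths, dates, and the retry count are hypothetical placeholders:

```xml
<!-- Input feed: hourly raw data, retained for 90 days (illustrative names) -->
<feed name="rawInputFeed" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <clusters>
    <cluster name="primaryCluster" type="source">
      <validity start="2015-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <!-- discard input instances older than 90 days -->
      <retention limit="days(90)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/data/raw/${YEAR}/${MONTH}/${DAY}/${HOUR}"/>
  </locations>
  <ACL owner="etl" group="hadoop" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>

<!-- Process: runs the Pig script on each hourly instance, with retries -->
<process name="hourlyCleanProcess" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="primaryCluster">
      <validity start="2015-01-01T00:00Z" end="2099-12-31T00:00Z"/>
    </cluster>
  </clusters>
  <parallel>1</parallel>
  <order>FIFO</order>
  <frequency>hours(1)</frequency>
  <inputs>
    <input name="input" feed="rawInputFeed" start="now(0,0)" end="now(0,0)"/>
  </inputs>
  <outputs>
    <!-- the output feed would declare its own three-year retention policy -->
    <output name="output" feed="cleanOutputFeed" instance="now(0,0)"/>
  </outputs>
  <workflow engine="pig" path="/apps/pig/clean.pig"/>
  <!-- retry the processing step a set number of times on failure -->
  <retry policy="periodic" delay="minutes(15)" attempts="3"/>
</process>
```

The framework then materializes these declarations into the underlying Oozie workflows, so retention and retry do not have to be hand-coded.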
Centralized data lifecycle management
Compliance and audit
Data replication and archival