Pravega Hadoop Connectors

Enable Apache Hadoop to read and write Pravega streams


This post introduces connectors that allow the Apache Hadoop data processing framework to read and write Pravega Streams.

The connectors can be used to build end-to-end stream processing pipelines (see Samples) that use Pravega as the stream storage and message bus, and Apache Hadoop for computation over the streams.

Purpose

The connector implements both the input and the output format interfaces of Hadoop. It leverages the Pravega batch client to read existing events in parallel, and uses the Pravega writer API to write events to a Pravega Stream.
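As a flavor of how this wiring looks, the following minimal sketch attaches the connector's `PravegaInputFormat` to a standard MapReduce job. The configuration key names (`pravega.uri`, `pravega.scope`, `pravega.stream`, `pravega.deserializer`) are illustrative assumptions, not the connector's confirmed API; consult the connector README for the exact keys and any builder helpers it provides.

```java
import io.pravega.connectors.hadoop.PravegaInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class PravegaJobSetup {
    public static Job createJob() throws Exception {
        Configuration conf = new Configuration();
        // Pravega connection settings; these key names are illustrative --
        // check the connector README for the exact configuration keys.
        conf.set("pravega.uri", "tcp://127.0.0.1:9090");   // controller endpoint
        conf.set("pravega.scope", "myScope");              // scope containing the stream
        conf.set("pravega.stream", "myStream");            // stream to read as job input
        conf.set("pravega.deserializer",
                 "io.pravega.client.stream.impl.JavaSerializer"); // event deserializer class

        Job job = Job.getInstance(conf, "pravega-batch-read");
        // The input format splits the stream so that mappers can
        // read existing events in parallel via the batch client.
        job.setInputFormatClass(PravegaInputFormat.class);
        return job;
    }
}
```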

Build and Usage

The build script handles Pravega as a source dependency, meaning that the connector is linked to a specific commit of Pravega (as opposed to a specific release version) in order to facilitate co-development. This is accomplished with a combination of a git submodule and the use of Gradle’s composite build feature.

More details about building and using the connector are shown in the README of the Hadoop connector.

Pravega Hadoop Connector Examples

You can find the Pravega Hadoop connector examples in the pravega-samples repo. There are two code examples that give you a basic idea of how to use the hadoop-connectors for Pravega.

  • Wordcount: Counts the words from a Pravega Stream filled with random text to demonstrate the usage of the Hadoop connector for Pravega (a minimal mapper sketch follows this list).
  • Terasort: Sorts events from an input Pravega Stream and then writes the sorted events to one or more output streams.
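To illustrate the Wordcount example, here is a minimal sketch of a classic word-count mapper that could consume events delivered by the input format. It assumes each event is deserialized to a Hadoop `Text` value and leaves the key type generic; the actual types used by the sample may differ, so treat this as a sketch rather than the sample's exact code.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Classic word-count mapper: assumes each Pravega event arrives as a Text value.
// The input key type is left generic since it depends on the input format's contract.
public class WordTokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the event payload into words and emit (word, 1) pairs;
        // a standard sum reducer then produces the final counts.
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}
```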

Before running the applications, you may also need to set up the following environment prerequisites:
1. Pravega running (see here for instructions)
2. Build the pravega-samples repository
3. Set up and start HDFS (refer to this document from Apache Hadoop)


Documentation

To learn more about how to build and use the Hadoop Connector library, follow the connector documentation here.

More examples of how to use the connectors with Hadoop applications can be found in the Pravega Samples repository.

License

The Hadoop connectors for Pravega are 100% open source and community-driven. All components are available under the Apache 2.0 License on GitHub.

Posted on 06 Oct 2020