MarkLogic Connector for Hadoop

Download

Release 1.0-2 (zip package, 1.3 MB)


The MarkLogic Connector for Hadoop enables you to run Hadoop MapReduce jobs on data in a MarkLogic Server cluster. You can:

  • Leverage existing MapReduce and Java libraries to process MarkLogic data
  • Operate on data as Documents, Nodes, or Values
  • Access MarkLogic text, geospatial, value, and document structure indexes to send only the most relevant data to Hadoop for processing
  • Send Hadoop Reduce results to multiple MarkLogic forests in parallel
  • Rely on the connector to optimize data access (for both locality and streaming IO) across MarkLogic forests

The Connector is a drop-in set of Java classes that includes:

  • MarkLogic-specific implementations of the
  • Sample code for a variety of use cases

Apache Hadoop is a framework for distributed, scalable, and reliable computing. From the Hadoop site:
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
Hadoop is often used for computationally complex bulk processing and cheap offline storage of long-tail data. It provides services complementary to MarkLogic's real-time analytics, full-text search, delivery, and updates.
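
For readers new to the "simple programming model" mentioned above, the following is a minimal sketch of the map, shuffle, and reduce phases in plain Java. It runs in a single process and does not use the Hadoop API or the MarkLogic Connector; the class name `WordCountSketch`, its method names, and the sample input strings are hypothetical and chosen only for illustration. In a real job, Hadoop distributes these phases across a cluster, and the input records would come from a source such as MarkLogic rather than from an in-memory list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a single-process imitation of the MapReduce model.
// None of these names belong to Hadoop or to the MarkLogic Connector.
public class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in one input record.
    static List<Map.Entry<String, Integer>> mapRecord(String record) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : record.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key, with keys kept sorted.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse the list of values for one key into a single sum.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Driver: run map over every record, shuffle, then reduce each key.
    static Map<String, Integer> wordCount(List<String> records) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String record : records) {
            emitted.addAll(mapRecord(record));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(emitted).entrySet()) {
            result.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints {data=2, hadoop=1, marklogic=1, reads=1, serves=1}
        System.out.println(wordCount(List.of("Hadoop reads data", "MarkLogic serves data")));
    }
}
```

Because each call to `mapRecord` depends only on its own record and each call to `reduce` depends only on the values for its own key, both phases can run in parallel on separate machines, which is the property a framework like Hadoop exploits.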

Documentation
