Big Data
Using Hazelcast on Hadoop cluster with YARN
October 28, 2015
4
, , , , ,

We recently encountered the problem where a complex simulation program employing Hazelcast in-memory computation in some cases stored so much data on individual nodes that it resulted in an out of memory error. Besides the obvious solution of increasing the number of cluster nodes we decided to take advantage of the powerful resource negotiator abilities of Apache YARN. This post will provide an overview of the two technologies, the challenges developers may face during the integration process, and last, but not least, our solution.

Hazelcast provides a great open source in-memory grid framework that our program uses to cache large amounts of data and to compute optimization solutions in memory. Hazelcast is flexible, provides fast response time and supports complex processing that our simulation requires (but there is also some evidence that the recently graduated Apache Ignite may provide better results in certain use cases). It also makes it easy to setup and configure our Hazelcast cluster. Pretty much all you have to do is to download Hazelcast, modify an xml configuration file and launch a shell script.

I’m pretty sure you’re already familiar with the Apache Hadoop framework, but it’s important to reiterate the advantages (and disadvantages) of Apache YARN. Hadoop-0.23 provided a major overhaul of the MapReduce framework in response to serious limitations in scalability, reliability, availability, programming model support and resource allocation of the previous versions, giving birth to MapReduce 2.0 or Yet Another Resource Negotiator (YARN). YARN conducts resource management and job scheduling/monitoring in separate daemons. It is highly scalable, available, reliable and serviceable. YARN supports multiple tenants to operate on your cluster simultaneously (multitenancy), moves the process of computation to where your data is (locality awareness), utilizes the physical resources of your cluster flexibly and amazingly, as well as operating securely while enabling great auditability. Furthermore, it supports programming model diversity, provides complete backward compatibility and also support programs that are not MapReduce applications. Probably the greatest disadvantage of YARN is that application development is difficult and requires multiple hours of the programmer’s time to gain an understanding of the framework. There are tools (like Cloudera Kitten and Apache Twill) to simplify the process, but the wide range of issues we encountered while using these tools persuaded us to stick with plain Java. We recommend reading the Great YARN book written by Arun C. Murthy, Vinod Kumar Vavilapalli, Doug Eadline, Joseph Niemiec and Jeff Markham. Apache YARN’s official page about application development also provides a nice overview.

We modified YARN Distributed Shell’s source code to setup our Hazelcast cluster and to manage its resources. As mentioned before, YARN supports applications that fall outside the MapReduce category. In a nutshell, Distributed Shell launches shell scripts on the cluster nodes as well as making the necessary resources available to them. In our case, the shell script was the one that launces the Hazelcast server and the necessary resources were the Hazelcast server files. All of them were located in a zip file that had to be made available to all of our cluster nodes. This can be done using Distributed Shell ‘s Client class, which is responsible for launching the command-line interface, managing local resources and setting up the ApplicationMaster environment. We modified Client to upload our Hazelcast zip file to the YARN local file environment (which is pretty much a temporary HDFS folder). Distributed Shell’s ApplicationMaster decompiles the zip file and launches the final containers where the shell script needed to launch Hazelcast is executed. All of the containers successfully launch Hazelcast instances and they quickly discover each other to form a cluster (we use discovery by TCP to start our Hazelcast Cluster). This means that the Hazelcast has been setup without the need to install Hazelcast on all the nodes beforehand. All we needed was an operational Hadoop cluster and the YARN-Hazelcast application did the rest. The modified configuration files and the jar file had to be included in the zip file, of course. The application supports long-running sessions and manages the resources required by Hazelcast using the powerful YARN resource management capabilities.

You will find the source codes in our Github project page. Feel free to download, compile and use. Detailed instructions are provided in the readme file. Do not hesitate to contact us if you have questions or suggestions.

Balint Kubik
Balint Kubik

Latest posts by Balint Kubik (see all)

  • David Brimley

    Hi Balint,

    Great post, really interesting to see how Hazelcast and Hadoop work together in this way.

    One thing that intrigued me was your comment…

    “but there is also some evidence that the recently graduated Apache Ignite provides better results while not requiring you to pay for exceeding a certain number of nodes like Hazelcast does”

    I thought you should know that Hazelcast Open Source allows you to use as many nodes as you wish, there is no limit, you do not have to pay, and I’m aware of o/s clusters running into the high hundreds of gigabytes.

    Secondly, your statement about Apache Ignite performance may be a little wide of the mark, at the risk of getting into a benchmark war I think you’ll find that Hazelcast is quite a bit faster in most use cases than Apache Ignite. I believe the Gridgain team circulated some benchmarks that are pretty badly out of date now. In response Hazelcast published some newer ones using Gridgains own benchmarking tool. You can find them here …

    https://hazelcast.com/resources/benchmark-gridgain/

    BTW. In full disclosure, I’m a Senior Solutions Architect for Hazelcast. I personally think vendor benchmarks are dumb, never trust a vendor, always run your own tests on your own use cases 🙂

    Cheers
    David.

    • Bálint Kubik

      Hello David!
      Thank you for your interest and comment!
      In my comment I was referring to users having to pay to be able to monitor a cluster above a certain size in Hazelcast Management Center. We do not want to participate in the benchmarking war either, but our experiences led us to believe that there certainly are cases where Ignite may be “better” in terms of performance.
      Regards,

      Balint

    • Jon L

      David, Balint: Any chance this integration could be added to Hadoop somehow? It would be really great to be able to run HZ on top of Hadoop more easily. I think both ecosystems would benefit.

  • Jon L

    Love it. Thanks!