Hadoop Application Packaging

  • 27 May 2016, Andriy Kashcheyev

A relatively complex Hadoop application usually consists of multiple components and flows. In enterprise environments an application is often logically encapsulated into a single project at the management level and, more importantly, into a single deployable artifact at the deliverable level. In simple words, a Hadoop application must be released as a single deployable package to the Implementation or DevOps team.

At the time we were engaging in BigData projects, there was no standard in the Hadoop ecosystem for how Hadoop applications should be packaged. We considered utilizing Cloudera Parcels, Ansible scripts packaged in an RPM, or just a plain tarball with custom Python installation scripts. The last option was actually used as a quick solution for some time, until we implemented this functionality in the Hadoopware product.

Applications are called Tenants in Hadoopware's parlance and can be either created visually in the Web Interface or packaged from the command line with Maven. The second option is a particularly good fit for Dev teams that prefer non-visual tools. Tenants are stored in Hadoopware's private repository, although we are contemplating an option to integrate with a binary artifact repository like Nexus or Artifactory.

Tenants provide not only packaging but also a cleaner organization and taxonomy of application management, since everything is accessible, searchable, configured and deployed from a single interface and visible across all teams.

First of all, Hadoopware supports different components like Flume Agents and Oozie Workflows in a single package, with all the metadata required by the Release Manager, DevOps and Dev teams: component name, version, description, technology type, build time and build number, configurations, dependencies. The metadata is just a JSON file with mandatory and optional attributes, so future extension is quite simple. For example, we plan to track relations to JIRA tickets and iterations for Continuous Delivery flow tracking. This metadata is provided for each component within a package. Here is an example of metadata (a component.json descriptor):

{ 
  "type": "oozie-workflow", 
  "groupId": "com.intropro.hed.bigdata", 
  "artifactId": "whatshot-wf", 
  "name": "whatshot-wf", 
  "version": "2.0.0-SNAPSHOT", 
  "appConfigs": [ 
    "config-default.xml", 
    "job.properties", 
    "coordinator.properties" 
  ], 
  "dependencies": [ 
    { 
      "groupId": "com.intropro.hed.bigdata", 
      "artifactId": "whatshot-spark", 
      "version": "2.0.0-SNAPSHOT" 
    } 
  ] 
} 
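
Since the descriptor is plain JSON, tooling around it is trivial to write. Below is a minimal sketch of a pre-packaging sanity check; treating type, groupId, artifactId and version as the mandatory attributes is our assumption based on the example above, not a definitive Hadoopware rule:

# Minimal sketch: sanity-check a component.json descriptor before packaging.
# Assumption: "type", "groupId", "artifactId" and "version" are mandatory.
import json
import sys

MANDATORY = {"type", "groupId", "artifactId", "version"}

def validate(path):
    with open(path) as f:
        descriptor = json.load(f)
    missing = MANDATORY - set(descriptor)
    if missing:
        sys.exit("Missing mandatory attributes: " + ", ".join(sorted(missing)))
    # Unknown attributes pass through untouched, which is what keeps
    # future extensions (JIRA tickets, iterations, ...) simple.
    return descriptor

if __name__ == "__main__":
    validate(sys.argv[1])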

Configuration information is used by Hadoopware to provide configuration property management out of the box. Even if some component uses its own private configuration file, as long as it is referenced in the metadata descriptor, Hadoopware will be able to visualize it and expose it for management. For example, here job.properties is added to the configuration pane (it is used in the workflow Actions together with the default config-default.xml).
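
For reference, such a job.properties is a plain Java properties file; an illustrative example of what gets exposed (host names and paths below are placeholders, not taken from the project):

# Illustrative job.properties; host names and paths are placeholders
nameNode=hdfs://namenode.example.com:8020
jobTracker=jobtracker.example.com:8032
queueName=default
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/hed/whatshot-wf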

As mentioned before, the Development Team's work patterns and preferences are taken seriously into account, so Hadoopware is designed around reusing common and familiar tools as much as possible. All our BigData projects are built with Maven, so component metadata is a single JSON file which can be generated with a Maven plugin, and tenant packaging is a quite familiar Maven assembly file:

<assembly
        xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2
          http://maven.apache.org/xsd/assembly-1.1.2.xsd">
    <id>bin</id>
    <formats>
        <format>tar.bz2</format>
    </formats>
    <fileSets>
        <fileSet>
            <directory>../bookmarks/bookmarks-agent/target/package/</directory>
            <outputDirectory>/bookmarks</outputDirectory>
        </fileSet>
        <fileSet>
            <directory>../kafka-to-hdfs/kafka-to-hdfs-agent/target/package/</directory>
            <outputDirectory>/kafka-to-hdfs</outputDirectory>
        </fileSet>
        <fileSet>
            <directory>../ratings/ratings-agent/target/package/</directory>
            <outputDirectory>/ratings</outputDirectory>
        </fileSet>
        <fileSet>
            <directory>../title-history/title-history-agent/target/package/</directory>
            <outputDirectory>/title-history</outputDirectory>
        </fileSet>
        <fileSet>
            <directory>../whatshot/whatshot-wf/target/package/</directory>
            <outputDirectory>/whatshot-wf</outputDirectory>
        </fileSet>
    </fileSets>
</assembly>

As can be seen from the assembly file, the final package is just a tarball. We are trying to "keep it simple" for the benefit of troubleshooting, maintenance, compatibility, and the KISS principle in general.
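
Unpacking the archive produced by the assembly above gives one directory per component, each carrying its own component.json descriptor (the base directory is elided and the file names inside the components are illustrative, except those listed in the descriptor's appConfigs):

bookmarks/
    component.json
    ...
kafka-to-hdfs/
    component.json
    ...
ratings/
    component.json
    ...
title-history/
    component.json
    ...
whatshot-wf/
    component.json
    config-default.xml
    job.properties
    coordinator.properties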

Last but not least is integrating Hadoop application deployment into a Continuous Delivery pipeline, which requires automation of Tenant (application) upload, configuration and deployment. Since the server side of Hadoopware is a pure REST interface, it can be used out of the box. The API was initially designed for interfacing with the Web Console, so to simplify operations there is a command line client which wraps the complexities of authorization, tenant upload, deployment and URI paths. This is how our DevOps team deploys a full Hadoop application in a single step:

./HwRestClient.py http://dap01.ea.intropro.com:8089/ user pass bundle deploy HED-Recommendation "HED Recommendation Engine" HED-recommendation-tenant-2.0.tar.bz2 HED-Cluster "Cluster 1" hed/recommendation
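
Where the client script is not an option (for example on a locked-down CI agent), the same flow can be scripted directly against the REST API. The sketch below is purely illustrative: the endpoint paths, authentication scheme and response shapes are our assumptions for the example, not the documented Hadoopware API:

# Illustrative only: endpoints, auth and response shapes below are
# assumptions for the sketch, not the documented Hadoopware API.
import requests

BASE = "http://dap01.ea.intropro.com:8089"

session = requests.Session()

# 1. Authorize (hypothetical endpoint)
session.post(BASE + "/api/login",
             data={"user": "user", "password": "pass"}).raise_for_status()

# 2. Upload the tenant bundle (hypothetical endpoint)
with open("HED-recommendation-tenant-2.0.tar.bz2", "rb") as bundle:
    upload = session.post(BASE + "/api/tenants", files={"bundle": bundle})
upload.raise_for_status()
tenant_id = upload.json()["id"]  # assumed response shape

# 3. Deploy the uploaded tenant to a cluster (hypothetical endpoint)
session.post(BASE + "/api/tenants/%s/deploy" % tenant_id,
             json={"cluster": "HED-Cluster",
                   "path": "hed/recommendation"}).raise_for_status()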

All environment-specific attributes like the Namenode, Jobtracker, Oozie server and custom properties are applied by Hadoopware automatically from the Cluster configurations.
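
For context, these are exactly the values that would otherwise have to be hand-edited per environment in files like config-default.xml; an illustrative (placeholder) example of what gets resolved from the target Cluster configuration:

<!-- Illustrative config-default.xml; values are placeholders that
     would be resolved per environment from the Cluster configuration -->
<configuration>
    <property>
        <name>nameNode</name>
        <value>hdfs://namenode.example.com:8020</value>
    </property>
    <property>
        <name>jobTracker</name>
        <value>jobtracker.example.com:8032</value>
    </property>
    <property>
        <name>queueName</name>
        <value>default</value>
    </property>
</configuration>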