Cloudera Docker

11 December 2015, Danylo Denyshchenko

Companies that develop Big Data software often face competition for computing resources among the participants in the development process. At certain stages, developers use their own environments deployed on their computers, which requires installing all the necessary services locally or, more often, running virtual machines.

One of the major players in the Hadoop ecosystem is Cloudera. It provides CDH (Cloudera Distribution Including Apache Hadoop), which includes the Hadoop platform core (HDFS, MapReduce, Hadoop Common) and integrated open source projects such as Apache Spark, Apache HBase, Apache Pig, Apache Hive and others. Cloudera also offers the Cloudera Manager console for administering and managing the Hadoop ecosystem.

Cloudera publishes a QuickStart virtual machine and a QuickStart Docker image, both of which include Cloudera Manager and CDH.

Docker is a great technology that offers better performance, flexibility, control and task automation than classical virtualization platforms such as VMware, KVM and VirtualBox. Its main benefits are: 1) containers use less memory than virtual machines, and 2) they perform better because they share resources with the host on which they run. Moreover, it is a great tool for infrastructure automation. Docker uses a simple language for describing image creation in a special file, the Dockerfile. This is an easy and flexible way to build an environment that meets your requirements.

The Cloudera QuickStart Docker image is still unstable, and Cloudera does not provide the source Dockerfile, so how the image was built remains undisclosed. That is why it is difficult to modify it to suit your needs.

In our example, we will show how to create your own Docker image for Cloudera Manager with CDH 5.x.x, with the ability to choose the CDH version, the JDK version and even an option for creating a kerberized environment. We also provide instructions for setting up the latest versions of docker and docker-compose.

Prerequisites

If you have no experience with docker, we recommend taking a brief look at the Docker Cheat Sheet and the Overview of Docker Compose.

In this guide we will use Ubuntu 14.04 as the host and CentOS 6 as the base docker image.

First of all, you should install docker, docker-compose and git on your machine.

Docker installation

Follow the official Docker installation guide (https://docs.docker.com/installation/ubuntulinux/), or just run the following commands in your console:

# get root privileges
$ sudo su -
# add new gpg key
$ apt-key adv --keyserver hkp://pgp.mit.edu:80 --recv-keys 58118E89F3A912897C070ADBF76221572C52609D
# add APT docker repository
$ echo 'deb https://apt.dockerproject.org/repo ubuntu-trusty main' > /etc/apt/sources.list.d/docker.list
# Update the APT package index
$ apt-get update
# Install Docker
$ apt-get install docker-engine

Docker Compose installation

Docker-compose is a very useful tool for predefining container settings, managing multi-container environments and simplifying the management of docker containers. The best way to install it is with pip.

# Install pip
$ apt-get install python-pip
# Install docker-compose
$ pip install -U docker-compose 

Git installation

$ apt-get install git
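
Before moving on, it can help to confirm that all three tools are actually on your PATH. The snippet below is a minimal sketch of such a check; the helper name missing_tools is our own illustration and is not part of docker, docker-compose or git.

```shell
# Print the name of every given tool that is not found on PATH.
# (missing_tools is a hypothetical helper written for this guide.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Check the three prerequisites used in this guide.
missing=$(missing_tools docker docker-compose git)
if [ -n "$missing" ]; then
    echo "Not installed:" $missing
else
    echo "All prerequisites are installed"
fi
```

If any name is reported as not installed, repeat the corresponding installation step above.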

Single node Cloudera Manager docker

Getting docker source

The next step is to download the docker source to your machine.

$ git clone https://github.com/intropro/single-node-cloudera-env.git

Docker source content

Go to the folder with the downloaded docker source. You will see the following files there:

FILE                              DESCRIPTION
scripts/autodeploy.sh             the main script: starts the Cloudera services, deploys them and uploads the configuration file in JSON format
scripts/configcollector.sh,
scripts/users-config-builder.py   after Cloudera is launched and all services are deployed, these scripts download the most frequently used configuration files and create a simple web page for accessing them
scripts/start_web.sh              runs python SimpleHTTPServer
templates/dockerfile.tmpl         part of the Dockerfile that contains the instructions for automatically building the docker image
templates/etalon.json             description of the Cloudera configuration that is uploaded via the REST API
profiles/*                        profiles with lists of variables for the JDK and CDH versions and the download paths
make_dockerfile.sh                script that adds the variables from a profile to templates/dockerfile.tmpl and creates the Dockerfile for building the image
README.md                         docker description
docker-compose.yml                configuration file with the docker container options that are managed by docker-compose
configs/supervisord.conf          supervisord configuration file that describes the parameters of the services run inside the container

Profiles variables

If you look inside the Dockerfile, you can see what docker does while building the image. There are several variables that you can change in order to get an image with another version of JDK or CDH. To avoid manual changes to the Dockerfile, profiles were introduced. The variables are added to the Dockerfile from one of the prepared profiles by running make_dockerfile.sh <path to profile>.

For example:

$ ./make_dockerfile.sh profiles/jdk1.8_cdh5.3.3

The repository contains several predefined profiles, but you can create your own profile with the JDK and CDH versions you need and the paths for downloading them. To do this, just create a text file in the profiles directory with the parameters described below and run the make_dockerfile.sh script.

A profile should contain the following variables:

PARAMETER NAME             DESCRIPTION
CDH_VERSION                full version of CDH in dot-separated format
PARCEL_DIR_DOWNLOAD        path to the Cloudera repository folder where the CDH parcel is located
PARCEL_FILE_NAME           full name of the parcel in the PARCEL_DIR_DOWNLOAD folder
JDK_VERSION                full version of JDK
JDK_DOWNLOAD_PATH          full path to the JDK rpm package
JCE_POLICY_DOWNLOAD_PATH   full path to the JDK JCE policy package, needed for a kerberized environment to work correctly (will be used later)
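
When you write your own profile, it is easy to forget one of these variables. The following sketch lists the ones that are missing from a profile file. It assumes that a profile is a plain text file of NAME=value lines (optionally prefixed with export); check one of the predefined profiles to confirm the actual format. The helper name missing_profile_vars is hypothetical and not part of the repository.

```shell
# Variables a profile is expected to define (taken from the table above).
REQUIRED_VARS="CDH_VERSION PARCEL_DIR_DOWNLOAD PARCEL_FILE_NAME JDK_VERSION JDK_DOWNLOAD_PATH JCE_POLICY_DOWNLOAD_PATH"

# Print the name of each required variable that the given profile file
# does not define. Assumption: variables are written as NAME=value lines.
missing_profile_vars() {
    profile=$1
    for name in $REQUIRED_VARS; do
        # Match "NAME=" at the start of a line, optionally preceded by "export ".
        grep -Eq "^(export[[:space:]]+)?${name}=" "$profile" || echo "$name"
    done
}

# Example:
# missing_profile_vars profiles/jdk1.8_cdh5.3.3
```

An empty result means that all six variables are present; it does not validate their values.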

Getting docker image

To get a docker image with Cloudera Hadoop, you need to either build it or download it.

The docker-compose.yml file is configured to download a prebuilt image by default. If you want to build it yourself with your own parameters, edit docker-compose.yml as follows:

server:
    build: .
    # image: intropro/single-node-cloudera-env:5.3.3

…

Choose a profile:

$ ./make_dockerfile.sh profiles/jdk1.8_cdh5.3.3

After that run:

$ docker-compose build

or

$ docker build -t single-node-cloudera-env:5.3.3 .

After the process has completed successfully, you can check the new image by running:

$ docker images

REPOSITORY                     TAG      IMAGE ID       CREATED          VIRTUAL SIZE
singlenodeclouderaenv_server   5.3.3    8840da9463ba   21 seconds ago   3.745 GB

Starting and checking docker container

So now we are ready to run our docker container:

# Run in the background
$ docker-compose up -d

# Check docker status
$ docker-compose ps

# Check deployment process
$ docker-compose logs

Wait for the message:

INFO exited: ClouderaManagerDeploy (exit status 0; expected)
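
Instead of watching the log by hand, you can poll for this message. The function below is a sketch of that idea; wait_for_line is a hypothetical helper of ours, and it assumes that the log command prints the current log and then exits, so that it can be called repeatedly.

```shell
# Run a log-producing command repeatedly until a given line appears.
# Usage: wait_for_line PATTERN MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 when the pattern is found, 1 when all tries are used up.
wait_for_line() {
    pattern=$1
    max_tries=$2
    delay=$3
    shift 3
    try=1
    while [ "$try" -le "$max_tries" ]; do
        # grep -F treats the pattern as a fixed string; -q suppresses output.
        if "$@" 2>&1 | grep -qF -- "$pattern"; then
            return 0
        fi
        try=$((try + 1))
        sleep "$delay"
    done
    return 1
}

# Example (the 120 tries and 15 second delay are arbitrary values):
# wait_for_line 'exited: ClouderaManagerDeploy (exit status 0; expected)' 120 15 docker-compose logs
```

The function returns a non-zero status if the message never shows up, which makes it usable in scripts that should stop when deployment fails.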

Go to http://localhost to see the start page, or to http://localhost:7180 to open Cloudera Manager.
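
The same check can be done from the console. This is a minimal sketch that assumes curl is installed on the host; http_ready is a hypothetical helper name, not a docker or Cloudera command.

```shell
# Return 0 if the URL answers without an HTTP error, non-zero otherwise.
http_ready() {
    # -s: silent, -f: fail on HTTP status 400 and above,
    # -o /dev/null: discard the body, --max-time 5: give up after 5 seconds
    curl -sf -o /dev/null --max-time 5 "$1"
}

# Example:
# http_ready http://localhost      && echo "start page is up"
# http_ready http://localhost:7180 && echo "Cloudera Manager is up"
```

A successful check only shows that the web server responds; the services themselves may still be starting.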

In the next part of this article, we will show you how to run a kerberized environment and the available options for sharing your environment with other team members.