Companies that develop Big Data software often run into competition for computing resources among members of the development team. At certain stages, developers work in their own environment on their own computers, which means installing all the necessary services locally or, more often, running virtual machines.
One of the major players in the Hadoop ecosystem is Cloudera. It provides CDH (Cloudera Distribution Including Apache Hadoop), which includes the Hadoop platform core (HDFS, MapReduce, Hadoop Common) and integrated open source projects such as Apache Spark, Apache HBase, Apache Pig, Apache Hive and others. Cloudera also offers the Cloudera Manager console for administering and managing the Hadoop ecosystem.
Cloudera publishes a QuickStart virtual machine and a QuickStart docker image, both of which include Cloudera Manager and CDH.
Docker offers better performance, flexibility, control and task automation than classical virtualization platforms such as VMware, KVM and VirtualBox. Its main benefits are: 1) containers use less memory than virtual machines, and 2) they perform better because they share resources with the host on which they run. It is also a convenient tool for infrastructure automation: docker describes image creation in a simple language in a special file, the Dockerfile, which makes it easy to build an environment that fits your requirements.
The Cloudera QuickStart docker image is still unstable, and Cloudera does not provide the source Dockerfile, so it is not public how the image was built. That makes it difficult to modify according to your needs.
In our example, we will show how to create your own docker image for Cloudera Manager with CDH 5.x.x, with the ability to choose the CDH version and the JDK version, and even an option for creating a kerberized environment. We also provide instructions for setting up the latest versions of docker and docker-compose.
In this guide we will use Ubuntu 14.04 for the host and CentOS 6 as the base docker image.
First of all, you should install docker, docker-compose and git on your machine.
Follow the official Docker installation guide at https://docs.docker.com/installation/ubuntulinux/ or just run the following commands in your console:
# get root privileges
$ sudo su -
# add new gpg key
$ apt-key adv --keyserver hkp://pgp.mit.edu:80 --recv-keys 58118E89F3A912897C070ADBF76221572C52609D
# add APT docker repository
$ echo 'deb https://apt.dockerproject.org/repo ubuntu-trusty main' > /etc/apt/sources.list.d/docker.list
# Update the APT package index
$ apt-get update
# Install Docker
$ apt-get install docker-engine
Docker Compose installation
Docker-compose is a tool for predefining container settings and managing multi-container environments, which simplifies working with docker containers. The best way to install it is with pip.
# Install pip
$ apt-get install python-pip
# Install docker-compose
$ pip install -U docker-compose
# Install git
$ apt-get install git
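Before moving on, it can help to confirm that all three tools are actually on your PATH. The following is a minimal sketch; the helper name missing_tools is ours and is not part of docker or of the repository used later.

```shell
#!/bin/sh
# missing_tools NAME...
# Prints every NAME that cannot be found on PATH, one per line.
# Returns 0 only when nothing is missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# Check the three tools this guide relies on.
if missing=$(missing_tools docker docker-compose git); then
    echo "all required tools are installed"
else
    # $missing is left unquoted on purpose so the names end up on one line.
    echo "missing:" $missing
fi
```

If anything is reported as missing, repeat the corresponding installation step above.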
Single node Cloudera Manager docker
Getting docker source
The next step is to download the docker source to your machine.
$ git clone https://github.com/intropro/single-node-cloudera-env.git
Docker source content
Go to the folder with the downloaded docker source. You will see the following files there:
|scripts/autodeploy.sh||the main script; starts the Cloudera services, deploys them and uploads the configuration file in JSON format|
|scripts/configcollector.sh, scripts/users-config-builder.py||after Cloudera is launched and all services are deployed, these scripts download the most frequently used configuration files and create a simple web page for accessing them|
|scripts/start_web.sh||runs the python SimpleHTTPServer|
|templates/dockerfile.tmpl||the part of the Dockerfile that contains the instructions for automatically building the docker image|
|templates/etalon.json||describes the Cloudera configuration that will be uploaded via the REST API|
|profiles/*||profiles with lists of variables for the JDK, CDH and download paths|
|make_dockerfile.sh||script that adds the variables from a profile to templates/dockerfile.tmpl and creates the Dockerfile for building the image|
|docker-compose.yml||configuration file with the options of the docker containers managed by docker-compose|
|configs/supervisord.conf||supervisord configuration file; describes the parameters of the services run inside the container|
If you look inside the Dockerfile, you can see what docker does while building the image. Several variables can be changed to get an image with a different JDK or CDH version. To avoid editing the Dockerfile manually, we introduced profiles: the variables are added to the Dockerfile from one of the prepared profiles by running make_dockerfile.sh <path to profile>.
$ ./make_dockerfile.sh profiles/jdk1.8_cdh5.3.3
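To make the idea of filling a template from a profile more concrete, here is a rough sketch of how such a step could work. It is not the repository's make_dockerfile.sh. It assumes that a profile consists of KEY=VALUE lines and that the template marks variables as @KEY@; both the placeholder syntax and the function name render_template are hypothetical.

```shell
#!/bin/sh
# render_template PROFILE TEMPLATE
# Hypothetical sketch: replaces every @KEY@ placeholder in TEMPLATE with the
# value of the matching KEY=VALUE line in PROFILE and prints the result.
# Limitation: values must not contain '|', '&' or backslashes.
render_template() {
    profile=$1
    template=$2
    script=""
    # The "|| [ -n ... ]" part also handles a last line without a newline.
    while IFS='=' read -r key value || [ -n "$key" ]; do
        # Skip blank lines and comments.
        case $key in ''|'#'*) continue ;; esac
        # Use '|' as the sed delimiter because values are often URLs or paths.
        script="${script}s|@${key}@|${value}|g
"
    done < "$profile"
    sed -e "$script" "$template"
}
```

Under these assumptions, a call such as render_template profiles/jdk1.8_cdh5.3.3 templates/dockerfile.tmpl > Dockerfile would produce the final Dockerfile. In practice, use the make_dockerfile.sh script shipped with the repository.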
The repository contains several predefined profiles, but you can create your own with the JDK and CDH versions and download paths you need. To do this, create a text file in the profiles directory with the parameters described below and run the make_dockerfile.sh script.
A profile should contain the following variables:
|CDH_VERSION||full version of CDH in dot-separated format|
|PARCEL_DIR_DOWNLOAD||path to Cloudera repository folder where cdh parcel is located|
|PARCEL_FILE_NAME||full name of the parcel in the PARCEL_DIR_DOWNLOAD folder|
|JDK_VERSION||full version of JDK|
|JDK_DOWNLOAD_PATH||full path to jdk rpm package|
|JCE_POLICY_DOWNLOAD_PATH||full path to the JDK JCE policy package, required for a kerberized environment to work correctly (will be used later)|
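When you write your own profile, it is easy to forget one of the six variables. The sketch below checks a profile file for them before you run make_dockerfile.sh. It assumes that a profile is a plain text file with one KEY=VALUE assignment per line; the helper name check_profile is hypothetical and not part of the repository.

```shell
#!/bin/sh
# check_profile FILE
# Prints the name of every required variable that FILE does not define,
# one per line. Returns 0 only when all six variables are present.
check_profile() {
    file=$1
    rc=0
    for var in CDH_VERSION PARCEL_DIR_DOWNLOAD PARCEL_FILE_NAME \
               JDK_VERSION JDK_DOWNLOAD_PATH JCE_POLICY_DOWNLOAD_PATH; do
        # Match "VAR=" at the start of a line, with an optional "export ".
        if ! grep -Eq "^(export[[:space:]]+)?${var}=" "$file"; then
            echo "$var"
            rc=1
        fi
    done
    return "$rc"
}

# Example (the profile path is the one used elsewhere in this guide):
#   check_profile profiles/jdk1.8_cdh5.3.3 && echo "profile looks complete"
```

The check only confirms that the variables are defined; it does not verify that the versions or download paths are valid.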
Getting docker image
To get a docker image with Cloudera Hadoop, you need to either build the image or download it.
The docker-compose.yml file is configured to download a prebuilt image by default. If you want to build the image yourself with your own parameters, edit docker-compose.yml as follows:
server:
  build: .
  # image: intropro/single-node-cloudera-env:5.3.3
  …
Then generate the Dockerfile from the chosen profile:
$ ./make_dockerfile.sh profiles/jdk1.8_cdh5.3.3
After that run:
$ docker-compose build
# or build directly with docker (the trailing dot is the build context)
$ docker build -t single-node-cloudera-env:5.3.3 .
After the process completes successfully, you can check the new image by running:
$ docker images
REPOSITORY                     TAG     IMAGE ID       CREATED          VIRTUAL SIZE
singlenodeclouderaenv_server   5.3.3   8840da9463ba   21 seconds ago   3.745 GB
Starting and checking docker container
Now we are ready to run our docker container:
# Run in the background
$ docker-compose up -d
# Check docker status
$ docker-compose ps
# Check deployment process
$ docker-compose logs
Wait for the message:
INFO exited: ClouderaManagerDeploy (exit status 0; expected)
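If you would rather not watch the logs by hand, the wait can be scripted. The following is a minimal sketch of a polling loop; the function name wait_for_log, the POLL_INTERVAL variable and the timeout value are our own choices. It assumes that the log command prints the current logs and then exits. If your docker-compose version streams the logs instead, the loop blocks on the first call and you need a different approach.

```shell
#!/bin/sh
# wait_for_log PATTERN LIMIT_SECONDS COMMAND [ARGS...]
# Runs COMMAND repeatedly until its output contains PATTERN (fixed string).
# Returns 0 on a match and 1 if LIMIT_SECONDS pass without one.
wait_for_log() {
    pattern=$1
    limit=$2
    shift 2
    interval=${POLL_INTERVAL:-5}
    elapsed=0
    while [ "$elapsed" -lt "$limit" ]; do
        # grep -F treats the pattern literally; -q stops at the first match.
        if "$@" 2>&1 | grep -Fq -- "$pattern"; then
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 1
}

# Example (the 1800 second limit is an arbitrary choice):
#   wait_for_log "ClouderaManagerDeploy (exit status 0" 1800 docker-compose logs
```

The elapsed time only counts the sleep intervals, so the real wait can be somewhat longer than the limit when the log command itself is slow.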
In the next part of this article we will show you how to run a kerberized environment and the available options for sharing your environment with other team members.