How to create a real Hadoop cluster in 10 minutes?

Last update February 2nd 2017

We recently launched demo.gethue.com, which in one click lets you try out a real Hadoop cluster. We followed the exact same process as building a production ready cluster. Here is how we did it.

Before getting started, you will need to get your hands on some machines. Hadoop runs on commodity hardware, so any regular computer with a major Linux distribution will work. To follow along with the demo, take a look at the Amazon Cloud Computing service. If you already have a server or two, or don’t mind running Hadoop on your local Linux box, then go straight to Machine Setup!

Here is a video demoing how easy it is to boot your own cluster and start crunching data!

Machine setup

We picked AWS and started 4 r3.large instances with Ubuntu 14.04 and 100 GB of storage (instead of the default 8 GB). If you need less performance, one xlarge instance is enough, or you can install fewer services on an even smaller instance.

Then configure the security group as shown below. We allow everything between the instances (the first row, don’t forget it on a multi-machine cluster!) and open up the Cloudera Manager and Hue ports to the outside.

Type              Protocol    Port Range   Source
All TCP           TCP         0 – 65535    sg-e2db7777 (hue-demo)
SSH               TCP         22           0.0.0.0/0
Custom TCP Rule   TCP         7180         0.0.0.0/0
Custom TCP Rule   TCP         8888         0.0.0.0/0
Custom ICMP Rule  Echo Reply  N/A          0.0.0.0/0
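
If you prefer the command line to the AWS console, the TCP rules above could be recreated with the AWS CLI. The sketch below only prints the commands instead of running them, so you can review them first. It assumes the AWS CLI is installed and configured; the function names are made up for this example, and the group ID is the one from the table, which you should replace with your own. The ICMP rule is left out.

```shell
#!/usr/bin/env bash
# Sketch: print (not run) the AWS CLI calls that would recreate the TCP rules above.
# SG_ID defaults to the group ID shown in the table; substitute your own.
SG_ID="${SG_ID:-sg-e2db7777}"

# Emit one command that opens a TCP port to a CIDR range (default: everyone).
open_tcp_port() {
  local port="$1" cidr="${2:-0.0.0.0/0}"
  echo "aws ec2 authorize-security-group-ingress --group-id $SG_ID --protocol tcp --port $port --cidr $cidr"
}

# Emit the rule that lets members of the group reach each other on any TCP port.
allow_intra_group_tcp() {
  echo "aws ec2 authorize-security-group-ingress --group-id $SG_ID --protocol tcp --port 0-65535 --source-group $SG_ID"
}

allow_intra_group_tcp
for port in 22 7180 8888; do
  open_tcp_port "$port"
done
```

Once the printed commands look right, pipe the output to `sh` or paste the lines into a terminal to apply them.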

Hadoop Setup

Now that we have some machines, let’s install Hadoop. We used Cloudera Manager as it installs everything for us and just followed this guide. Moreover, post-install monitoring and configuration are also greatly simplified with the administration interface.

Start by connecting to one of the machines:

ssh -i ~/demo.pem ubuntu@ec2-11-222-333-444.compute-1.amazonaws.com
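
If ssh answers with "Permission denied (publickey)" or complains about the key file, two common causes are a wrong path to the .pem file and a key file that other users can read, which OpenSSH refuses to use. The helper below is a small sketch for checking both before connecting. The function name is made up for this example, and it relies on GNU stat as shipped with Ubuntu.

```shell
# Return 0 when the key file exists and grants no access to group or others.
key_is_private() {
  local key="$1" mode
  [ -f "$key" ] || { echo "missing key: $key" >&2; return 1; }
  mode="$(stat -c '%a' "$key")"   # GNU stat prints the octal mode, e.g. 400 or 644
  case "$mode" in
    *00) return 0 ;;
    *)   echo "key $key has mode $mode; try: chmod 400 $key" >&2; return 1 ;;
  esac
}

# Usage (key path and host are the placeholders from the command above):
# key_is_private ~/demo.pem && ssh -i ~/demo.pem ubuntu@ec2-11-222-333-444.compute-1.amazonaws.com
```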

 

Retrieve and start Cloudera Manager:

wget http://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin
chmod +x cloudera-manager-installer.bin
sudo ./cloudera-manager-installer.bin

Afterwards, log in with the default credentials admin/admin (note: you might need to wait 5 minutes before http://ec2-54-178-21-60.compute-1.amazonaws.com:7180/ becomes available).
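
Rather than refreshing the browser while Cloudera Manager boots, you can poll the port from a terminal. This is a minimal sketch, not part of the installer: the function name and its defaults (30 attempts, 10 seconds apart, so roughly 5 minutes) are assumptions for this example, and it uses bash's /dev/tcp feature together with the coreutils timeout command.

```shell
#!/usr/bin/env bash
# Poll a TCP port until it accepts connections or the attempts run out.
# Arguments: host, port, max attempts (default 30), seconds between attempts (default 10).
wait_for_port() {
  local host="$1" port="$2" attempts="${3:-30}" delay="${4:-10}" i=1
  while [ "$i" -le "$attempts" ]; do
    # bash opens a TCP connection for /dev/tcp/HOST/PORT; timeout guards against hangs.
    if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
      echo "$host:$port is reachable"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "$host:$port still unreachable after $attempts attempts" >&2
  return 1
}

# Usage (replace the host with the Public DNS name of your own instance):
# wait_for_port ec2-11-222-333-444.compute-1.amazonaws.com 7180
```

Running the same check against localhost from the instance itself helps tell a service that never started apart from a security group that blocks the port.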

Then enter the Public DNS names (e.g. ec2-11-222-333-444.compute-1.amazonaws.com) of all your machines in the Install Wizard and click go! Et voilà, Cloudera Manager will set up your whole cluster automatically for you!

Assign a dynamic IP to your machine with Hue and then go to IP:8888 and start playing with your fully functional Hadoop cluster and its examples!

As usual feel free to comment on the hue-user list or @gethue!

Note

If you are getting a “Bad Request (400)” error, you will need to add the following to hue.ini or to the CM safety valve:

[desktop]
allowed_hosts=*

Note

If you have several machines, it is recommended to move the services around in order to even out the memory/CPU usage. For example, split HBase, Oozie, Hive and Solr across different hosts.

Note

When running some MapReduce jobs with YARN, if all the jobs deadlock in ACCEPTED or READY states, you might be hitting this YARN bug.

The solution is to limit the number of concurrently running applications in the Dynamic Resource Pools to a low number like 2 or 3. Go to CM → Clusters → Other → Dynamic Resource Pools → Configuration → Edit → YARN and set ‘Max Running Apps’ to 2.

You can also try to decrease yarn.nodemanager.resource.memory-mb and the task memory and bump the memory of yarn.app.mapreduce.am.resource.mb.
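
One way to see why a low ‘Max Running Apps’ value can help: every running application first claims a container for its ApplicationMaster, and on a small cluster those containers alone can use up all the memory, leaving nothing for the actual tasks. The sketch below is only back-of-the-envelope arithmetic. The function name and the numbers are illustrative assumptions, not values taken from demo.gethue.com or from the YARN bug report.

```shell
# Memory (MB) left for task containers once every running application
# has claimed its ApplicationMaster container.
memory_left_for_tasks() {
  local cluster_mb="$1" am_mb="$2" running_apps="$3"
  echo $(( cluster_mb - am_mb * running_apps ))
}

# Illustrative numbers only: 4096 MB of container memory, 1024 MB per ApplicationMaster.
memory_left_for_tasks 4096 1024 4   # prints 0: no memory left to run any task
memory_left_for_tasks 4096 1024 2   # prints 2048
```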

56 Comments

  1. JeanClaude 3 years ago

    Very useful post!
    By default, Cloudera Manager has an AWS option to automatically launch and set up instances.
    However, when creating security groups, the setup is wrong and silently fails which leads to a lot of errors once your cluster is started.

  2. Amr 3 years ago

    Or should I use a specific version of CM so that it configures everything properly for me?
    Or is there any check that needs to be done?
    Kindly note that I have added a 100 GB EBS volume, formatted it and mounted it to /var.
    Do I need to add it to another directory / ?

    • Hue Team 3 years ago

      CM5.0+ does this for you (latest is 5.1):
      Hue: Thrift Server role must be configured in HBase service to use the Hue HBase Browser application.

      Make sure you distribute the services evenly in the cluster and allocate enough memory to monitoring.

      When doing the install I added the 100GB before installing the cluster, so make sure that the log paths are pointing to it.

  3. shivom 2 years ago

    This is a great video. Please follow all the steps; I used these steps on CentOS and it went the right way.
    Thanks

  4. muddywaterous 2 years ago

    All was well until I tried to create an AMI and terminate the original instance. When I launched a new instance none of the services started. What did I do wrong? Do I need to shut everything down before I terminate? Yikes. That was a lot of work to see it all go up in flames.

  5. Doriane 2 years ago

    Hello,
    I’m trying to install HUE on my instance (on AWS EC2), I followed the tutorial but when I want to see the cloudera manager of my instance I’m getting the “Unaccessible WebPage” error.
    Any idea to help me please ?

    Thanks

    Doriane

    • Hue Team 2 years ago

      Hi Doriane,
      did you open the CM (7180) and Hue (8888) ports to the external world?

  6. Lakers 2 years ago

    Very good tutorial. Appreciate HUE effort

    I followed the video without missing a step, and I was able to install Hue successfully. However, when I try to use Pig and load a file (600 MB), it gets stuck in “Accepted” status without providing any information to debug. The message below is displayed.

    Loading …
    The application might not be running yet or there is no Node Manager or Container available. This page will be automatically refreshed.

    I tried with different instances, m3.xlarge and m4.xlarge (min 100 GB), multiple times, and it gets stuck at the same point. Maybe the steps in the video need some update. Please consider giving some specific tips for newbies like me.

    I don’t want this beautiful tool to get stuck like this after fully installing it. Please help!

  7. Sid 2 years ago

    Internal error while querying the Host Monitor

    I’m getting this error message

    How to resolve this issue ??

    Sid

    • Hue Team 2 years ago

      This means the CM Service Monitoring is not up. You should check that it started properly.

  8. Ram Kishor Tak 2 years ago

    Please help. I set up an AWS instance and installed Cloudera Manager on it using PuTTY, but I am not able to connect to the Cloudera dashboard on port 7180 or 8888. I added all the security rules given above for the instance. Please help me start the dashboard so I can set up the cluster.

  9. Ram Kishor Tak 2 years ago

    Now I have successfully installed CM5 and added a cluster, but while setting up the cluster I am facing an error in step 6. The error is:
    Installation failed. Failed to receive heartbeat from agent.
    Ensure that the host’s hostname is configured properly.

    Please help me. I am using an Ubuntu image on EC2. What should I configure in the hosts file? Help ASAP.

  10. Matt 2 years ago

    Great tutorial – followed it to the letter – and successfully got the ubuntu instance up and the CM install installed successfully, but i can’t connect to it remotely or on the aws instance.

    Checked the running services and the database is running fine, but the CM Manager service didn’t start. No big deal, so I started the service manually with “sudo service cloudera-scm-server start”, which got it running, but I still can’t connect. Firewall rules are all set properly; I even opened all ports to all traffic from all IP addresses, to no avail. Just to be sure, I tried to connect from the ssh terminal with “nc -zv localhost 7180”: also nothing. Then I ran a netstat and there’s nothing listening on port 7180. What am I missing?

    My instance is Ubuntu 14.x and the CM version is 5.x (whatever came down from “wget http://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin”).

    any help would be appreciated!

    Cheers,
    Matt

    • Hue Team 2 years ago

      And when you do
      sudo service cloudera-scm-server status

      Does it say running?
      It might not boot, and you should see the errors in /var/log/cloudera-scm…

  11. Matt 2 years ago

    Checking for service cloudera-scm-server: * cloudera-scm-server is running

    but still can’t connect via browser.. so checked with nc..

    ubuntu@ip-10-0-0-24:~$ nc -zv localhost 7180
    nc: connect to localhost port 7180 (tcp) failed: Connection refused

    sudo cat cloudera-scm-server/cloudera-scm-server.log
    2015-08-30 23:54:03,880 INFO MainThread:com.cloudera.cmon.components.MetricSchemaUpdate: Creating new metric: total_oozie_coord_action_query_executor_update_coord_action_for_push_inputcheck_duration_timer_15min_rate_across_oozie_servers
    2015-08-30 23:54:03,880 INFO MainThread:com.cloudera.cmon.components.MetricSchemaUpdate: Creating new metric: oozie_coord_action_query_executor_update_coord_action_for_modified_date_duration_timer_min_across_oozie_servers
    2015-08-30 23:54:03,880 INFO MainThread:com.cloudera.cmon.components.MetricSchemaUpdate: Creating new metric: total_oozie_coord_action_query_executor_update_coord_action_for_modified_date_duration_timer_stddev_across_oozie_servers
    2015-08-30 23:54:03,880 INFO MainThread:com.cloudera.cmon.components.MetricSchemaUpdate: persisting 15662 new metrics
    2015-08-30 23:54:11,271 INFO MainThread:com.cloudera.cmon.components.MetricSchemaUpdate: persisting 0 updated metrics
    2015-08-30 23:54:11,278 INFO MainThread:com.cloudera.cmon.components.MetricSchemaManager: Cross entity aggregates processed.

    so it looks like it started ok.. ugh

    • Hue Team 2 years ago

      If it is running, it could be that /etc/hosts is not set up correctly for localhost, or a firewall issue

  12. Matt 2 years ago

    yeah, i was thinking it was probably something with the firewall but i just can’t figure it out (the AWS console shows the ports all open for in and out).. anyone have other thoughts?
    Cheers,
    Matt

  13. Matt 2 years ago

    I’ve been looking through the log files and I see that there are entries about not enough heap for Java to run, and the cloudera-scm-server dies after 30 seconds or so.
    I’m running a free EC2 instance “micro” with 8 gig of disk. How do i adjust the heap parm? i can’t find an scm server conf file..

    Cheers,
    Matt

  14. Matt 2 years ago

    *My Answer!*: OK.. so i’ve been wrestling with this for two days.. here’s the answer (at least for me)
    1) you run a *service --status-all* and you see the “cloudera-scm-server” service isn’t running, so you start it
    2) you start the service with a “sudo service cloudera-scm-server start” and it starts; when you check with service status (as in step one) you see it started, but after 30 seconds it’s dead again
    3) snoop around and find in the syslogs that there isn’t enough heap for the Java stack to run

    *So.. I changed the running EC2 instance from a *micro* to a *small* (have to stop the instance and then change it) and now it’s running fine!!*

    so I think the “small” instances and above aren’t free.. so we’ll see what happens. maybe resize the memory allocation in the /etc/default/cloudera-scm-server.conf file.. we’ll see.

    • Hue Team 2 years ago

      Nice! For demo.gethue.com we also use instances optimized for memory, with larger memory specs but less CPU (r3.large), and it works pretty well

  15. Ankur 2 years ago

    Hello Team,

    Is there anyway we can install hue on EMR 4.0.0?

    Thanks
    Ankur

    • Hue Team 2 years ago

      I think it is do it yourself for now (cf. Ubuntu guide of the Configure section for example) until Amazon bundles it in 4.0

  16. shermila 2 years ago

    Hello,
    I’m trying to install Hue on my instance (on AWS EC2). I followed the tutorial, but when I finally installed Cloudera Manager, the Cluster Setup step showed the message “Failed to perform First Run of services.” and the command progress showed the message “command (49) has failed”.
    Cheers,
    Shermila.

    • Hue Team 2 years ago

      We need more information, the full logs, if the machine has enough space, memory etc..

  17. Sumit Gupta 2 years ago

    Hi,

    I am experiencing this error
    “The application might not be running yet or there is no Node Manager or Container available. This page will be automatically refreshed”
    on demo.gethue.com as well as in locally setup cluster in VMplayer. Please help me out.

    • Hue Team 2 years ago

      Did you check if YARN was up? demo.gethue.com works well, I just looked

  18. erond 2 years ago

    Hey there, nice guide.

    I managed to install CDH 5.4.8 (Parcels), the latest available today, on an Ubuntu single m4.xlarge EC2 instance. I installed and started Hue successfully. BTW, I really wanted to play around with the Spark notebook which seems to be introduced starting from 3.8. Since it’s not gonna be a production system, I wanted to ask for your help on how to upgrade/install the latest Hue available which seems to be 3.9.

    Thanks in advance.

    • Hue Team 2 years ago

      The Spark Notebook will be available in CDH 5.7, it is currently a beta and not supported there. We recommend playing with it manually on the side from master: http://gethue.com/spark/

      • erond 2 years ago

        Thanks Hue Team,

        So isn’t there a way to manually install/upgrade Hue (up to 3.9 in this case) over a 5.4.8 CDH installation?

        • Hue Team 2 years ago

          Sorry, you are on your own until it is officially supported in CDH 😉

          • erond 2 years ago

            got it, thanks. any ETA on CDH 5.7 release date?

          • Hue Team 2 years ago

            AFAIK ETA is ~ early Q2 2016

  19. supriya 1 year ago

    Hi Team,

    I have to set up a fully distributed Hadoop cluster on physical servers, not in any cloud environment.

    Once the cluster is ready I have to install CDH and Cloudera Manager, but the servers don’t have internet access.
    Can you please help me with the installation steps?

    Thank you

  20. supriya 1 year ago

    Hi team, thanks for the reply.

    Can you please let me know whether Cloudera supports CentOS 6.7 for installing Cloudera Manager and CDH?
    Because, as per the Cloudera documentation: Supported Operating Systems for Cloudera Manager
    Cloudera Manager supports a range of operating systems including:

    Red Hat-compatible systems
    Red Hat Enterprise Linux 5.7 and CentOS 5.7, 64-bit
    Red Hat Enterprise Linux 6.2 and 6.4, and CentOS 6.2 and 6.4, 64-bit

    I am trying to install Cloudera Manager on CentOS 6.7 but I am not able to bring up the agent.

  21. supriya 1 year ago

    Hi team, thanks for your support.
    After the Cloudera installation, how can we give access to Pig and Hive to only one particular user, and to Spark and Hue to some other user?

    I came to know that we cannot restrict users’ access to applications like HDFS, Hive, Pig and Spark, but for Hue you need to create users manually in the Hue UI using the admin account.

    Can you please help me with the steps I need to follow in order to achieve this?

  22. David 1 year ago

    Hi Team,

    When I run
    ssh -i ~/demo.pem ubuntu@ec2-11-222-333-444.compute-1.amazonaws.com
    with my data, I get the error:

    No such file or directory.
    Permission denied (publickey).

    Can you help me, please?

    • Hue Team 1 year ago

      There is no data with ‘ssh’ command, are you confusing it with the ‘scp’ command?

  23. Melissa 1 year ago

    To install and deploy Hadoop using this method, is the free Cloudera Express enough ? Or do you need Cloudera Enterprise to keep using clusters built using this method.

    • Hue Team 1 year ago

      You can use it for free, but AFAIK the Express edition later won’t let you fully manage Spark / Impala / Search and use the monitoring in general.

      • Melissa 1 year ago

        Would that be a major impediment if my use case is running ElasticSearch / Nutch ?

        • Hue Team 1 year ago

          Currently Hue only supports the Solr API

  24. Nilesh 1 year ago

    facing the error “host monitor is not running”
    i did all installation part of cloudera but still facing this.
