Last updated February 2nd, 2017
We recently launched demo.gethue.com, which lets you try out a real Hadoop cluster in one click. We followed the exact same process as for building a production-ready cluster. Here is how we did it.
Before getting started, you will need to get your hands on some machines. Hadoop runs on commodity hardware, so any regular computer with a major Linux distribution will work. To follow along with the demo, take a look at Amazon's cloud computing service (AWS). If you already have a server or two, or don’t mind running Hadoop on your local Linux box, then go straight to Machine setup!
Here is a video demoing how easy it is to boot your own cluster and start crunching data!
Machine setup
We picked AWS and started 4 r3.large instances with Ubuntu 14.04 and 100 GB of storage each (instead of the default 8 GB). If you need less performance, a single xlarge instance is enough, or you can install fewer services on an even smaller instance.
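If you prefer the command line to the AWS console, the launch described above could be sketched with the AWS CLI roughly as follows. The AMI ID, key pair name and root device name (/dev/sda1) are placeholders we made up for illustration, not values from our setup. The script only prints the command unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch: launch 4 r3.large Ubuntu instances with a 100 GB root volume.
# AMI_ID and KEY_NAME are hypothetical placeholders; replace them with yours.
AMI_ID="${AMI_ID:-ami-xxxxxxxx}"
KEY_NAME="${KEY_NAME:-demo}"
SG_ID="${SG_ID:-sg-e2db7777}"

launch_cmd() {
  # Print the AWS CLI call instead of running it.
  # /dev/sda1 is assumed to be the root device of the chosen AMI.
  printf '%s\n' "aws ec2 run-instances --image-id $AMI_ID --count 4 --instance-type r3.large --key-name $KEY_NAME --security-group-ids $SG_ID --block-device-mappings DeviceName=/dev/sda1,Ebs={VolumeSize=100}"
}

launch_cmd

# Only talk to AWS when explicitly asked to.
if [ "${RUN:-0}" = "1" ]; then
  eval "$(launch_cmd)"
fi
```

Printing first makes it easy to review the instance type, count and volume size before anything is billed.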
Then configure the security group as shown below. We allow everything between the instances (the first row; don’t forget it on a multi-machine cluster!) and open the Cloudera Manager and Hue ports to the outside.
Type | Protocol | Port Range | Source |
All TCP | TCP | 0 – 65535 | sg-e2db7777 (hue-demo) |
SSH | TCP | 22 | 0.0.0.0/0 |
Custom TCP Rule | TCP | 7180 | 0.0.0.0/0 |
Custom TCP Rule | TCP | 8888 | 0.0.0.0/0 |
Custom ICMP Rule | Echo Reply | N/A | 0.0.0.0/0 |
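The TCP rules above can also be created from the command line. Here is a minimal sketch using the AWS CLI, assuming the security group already exists. The group ID is the one from the table and will differ in your account. The ICMP rule is left out to keep the sketch short, and the script only prints the commands so you can review them first.

```shell
#!/bin/sh
# Sketch: print AWS CLI calls that recreate the TCP rules of the table.
SG_ID="${SG_ID:-sg-e2db7777}"   # group ID from the table; yours will differ

sg_rules() {
  # Allow all TCP traffic between members of the group itself.
  echo "aws ec2 authorize-security-group-ingress --group-id $SG_ID --protocol tcp --port 0-65535 --source-group $SG_ID"
  # Open SSH (22), Cloudera Manager (7180) and Hue (8888) to the outside.
  for port in 22 7180 8888; do
    echo "aws ec2 authorize-security-group-ingress --group-id $SG_ID --protocol tcp --port $port --cidr 0.0.0.0/0"
  done
}

sg_rules
```

Piping the output to sh (sg_rules | sh) would apply the rules once they look right.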
Hadoop Setup
Now that we have some machines, let’s install Hadoop. We used Cloudera Manager as it installs everything for us and just followed this guide. Moreover, post-install monitoring and configuration are also greatly simplified with the administration interface.
Start by connecting to one of the machines:
ssh -i ~/demo.pem ubuntu@ec2-11-222-333-444.compute-1.amazonaws.com
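If ssh answers with "No such file or directory" or "Permission denied (publickey)", the key file is a common culprit. The helper below is a small sketch (check_key is a name we made up) that verifies the key exists and tightens its permissions, since ssh refuses private keys that are readable by other users.

```shell
#!/bin/sh
# Sketch: sanity-check a private key file before calling ssh.
# check_key is a hypothetical helper, not part of ssh or the AWS tools.
check_key() {
  key="$1"
  if [ ! -f "$key" ]; then
    echo "missing"
    return 1
  fi
  # Make the key readable by its owner only.
  chmod 400 "$key"
  echo "ok"
}

# Example usage (hostname is the placeholder used in this post):
# check_key ~/demo.pem && ssh -i ~/demo.pem ubuntu@ec2-11-222-333-444.compute-1.amazonaws.com
```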
Retrieve and start Cloudera Manager:
wget http://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin
chmod +x cloudera-manager-installer.bin
sudo ./cloudera-manager-installer.bin
Afterwards, log in with the default credentials admin/admin (note: you might need to wait 5 minutes before http://ec2-54-178-21-60.compute-1.amazonaws.com:7180/ becomes available).
Then enter the public DNS names (e.g. ec2-11-222-333-444.compute-1.amazonaws.com) of all your machines in the Install Wizard and click go! Et voilà, Cloudera Manager will set up your whole cluster automatically for you!
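Rather than refreshing the browser while Cloudera Manager boots, the wait can be scripted. The function below is a generic retry loop (wait_for is our own name, and the attempt count and delay in the example are arbitrary). It runs a check command until it succeeds or the attempts run out.

```shell
#!/bin/sh
# Sketch: retry a check command until it succeeds or attempts run out.
wait_for() {
  check="$1"      # command to run, e.g. a curl call
  attempts="$2"   # maximum number of tries
  delay="$3"      # seconds to sleep between tries
  i=1
  while [ "$i" -le "$attempts" ]; do
    if sh -c "$check" >/dev/null 2>&1; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not ready after $attempts attempt(s)"
  return 1
}

# Example (CM_HOST is a placeholder for your instance's public DNS name):
# wait_for "curl -sf -o /dev/null http://$CM_HOST:7180/" 30 10
```

With 30 attempts and a 10 second delay, the example gives Cloudera Manager about five minutes to come up.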
Assign a dynamic IP to the machine running Hue, then go to IP:8888 and start playing with your fully functional Hadoop cluster and its examples!
As usual feel free to comment on the hue-user list or @gethue!
Note
If you are getting a “Bad Request (400)” error, you will need to add the following to hue.ini or the CM safety valve:
[desktop]
allowed_hosts=*
Note
If you have several machines, it is recommended to move the services around in order to even out the memory/CPU usage. For example, split HBase, Oozie, Hive and Solr across different hosts.
Note
When running some MapReduce jobs with YARN, if all the jobs deadlock in ACCEPTED or READY states, you might be hitting this YARN bug.
The solution is to use a low number like 2 or 3 for the maximum number of running apps in the Dynamic Resource Pools. Go to CM → Clusters → Other → Dynamic Resource Pools → Configuration → Edit → YARN and set ‘Max Running Apps’ to 2.
You can also try to decrease yarn.nodemanager.resource.memory-mb and the task memory and bump the memory of yarn.app.mapreduce.am.resource.mb.
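To get a feel for why a small ‘Max Running Apps’ value helps, here is a back-of-the-envelope calculation. It is only a rough sketch under simplifying assumptions: every running application holds one ApplicationMaster container, and at least one task container must still fit for any job to make progress. The real YARN scheduler also rounds to allocation increments and tracks vcores, and the numbers below are illustrative, not measured on our cluster.

```shell
#!/bin/sh
# Rough arithmetic only; not the scheduler's actual algorithm.
max_apps() {
  total_mb="$1"   # cluster-wide sum of yarn.nodemanager.resource.memory-mb
  am_mb="$2"      # yarn.app.mapreduce.am.resource.mb
  task_mb="$3"    # memory of one map or reduce container
  # Leave room for at least one task container after all AMs are placed.
  echo $(( (total_mb - task_mb) / am_mb ))
}

max_apps 8192 1024 1024   # prints 7
max_apps 4096 1536 1024   # prints 2
```

If more applications than this are admitted at once, their ApplicationMasters can take all the memory and leave no room for the tasks they are waiting on.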
56 Comments
-
Very useful post!
By default, Cloudera Manager has an AWS option to automatically launch and set up instances.
However, when creating security groups, the setup is wrong and silently fails, which leads to a lot of errors once your cluster is started.
-
Please, I have a big problem with Cloudera Manager with AWS instances,
I configure everything right and CM says that everything is successfully installed and configured and once the cluster starts it fails !
I need your help with this issue please.
-
Hi, what fails? Are all the services up and running (green status icon) on CM?
-
Dear sir,
I have tried to install on 5 instances of AWS “r3.2xlarge” nodes.
I have used the same steps and configured ssh for easier communication between nodes.
I have installed as demonstrated in the video tutorial but no luck. All services are in bad health and all have errors such as:
Log Directory Free Space, Scratch Directory Free Space
Memory Overcommit Validation Threshold
Hue: Thrift Server role must be configured in HBase service to use the Hue HBase Browser application.
I hope you can guide me through that issue.
-
Something that could help you fine-tune and configure your clusters is the official Cloudera Manager documentation http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM5/latest/Cloudera-Manager-Installation-Guide/Cloudera-Manager-Installation-Guide.html and the CDH documentation too http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/cdh5ig_cdh5_install.html
-
or Should I use any specific version of CM to be able to configure everything for me in a good way ?
or Is there any check that needs to be done?
Kindly note that I have added EBS volume “100GB” and formatted it and mounted it to /var
do I need to add it to another directory / ?
-
CM5.0+ does this for you (latest is 5.1):
Hue: Thrift Server role must be configured in HBase service to use the Hue HBase Browser application.
Make sure you distribute the services evenly in the cluster and allow enough memory to monitoring.
When doing the install I added the 100GB before installing the cluster, so make sure that the log paths are pointing to it.
-
-
This is a great video. Please follow all the steps; I used these steps on CentOS and it is going the right way.
Thanks
-
All was well until I tried to create an AMI and terminate the original instance. When I launched a new instance none of the services started. What did I do wrong? Do I need to shut everything down before I terminate? Yikes. That was a lot of work to see it all go up in flames.
-
Hello,
I’m trying to install HUE on my instance (on AWS EC2), I followed the tutorial but when I want to see the cloudera manager of my instance I’m getting the “Unaccessible WebPage” error.
Any idea to help me please?
Thanks
Doriane
-
Hi Doriane,
did you open the CM (7180) and Hue (8888) ports to the external world?
-
-
Very good tutorial. Appreciate HUE effort
I followed the video without missing a step. And I am able to install HUE successfully. However when I try to use PIG and load a file (600 MB) it gets stuck in “Accepted” status without providing any information to debug. Below message is displayed.
Loading …
The application might not be running yet or there is no Node Manager or Container available. This page will be automatically refreshed.
I tried with different instances m3.xlarge & m4.xlarge (min 100GB) multiple times, gets stuck at the same point. Maybe the steps in the video need some update. Please consider and give some specific tips for newbies like me.
Don’t want this beautiful tool to get stuck like this after fully installing it. Please help !!
-
If you have a single node, could you check if you can tweak that and also see if there are enough resources for multiple containers to run? A Pig app requires 2 YARN applications to run. Also check if your Node Manager is running. You can also quickly test whether YARN is set up correctly by first running a Sleep job from the Oozie examples.
-
-
Internal error while querying the Host Monitor
I’m getting this error message
How to resolve this issue ??
Sid
-
This means the CM Service Monitoring is not up; you should check that it is started properly.
-
-
please help..i had set up a aws instance and install cloudera manager on that instance using putty. but i cant able to connect cloudera dashboard using port no 7180 or 8888. i mention all security rules for instance which are given above..plz help to start dashboard so i can set up cluster..
-
now i successfully installed cm5 and i add a cluster..but while setup cluster i m facing a error in step 6. the error is like-
Installation failed. Failed to receive heartbeat from agent.
Ensure that the host’s hostname is configured properly.
please help me i am using ubuntu image on ec2. what should i configure in hosts file..help asap..
-
Great tutorial – followed it to the letter – and successfully got the ubuntu instance up and the CM install installed successfully, but i can’t connect to it remotely or on the aws instance.
Checked the running services and the database is running fine, but the CM Manager service didn’t start. No big deal, so I started the service manually ” sudo service cloudera-scm-server start” which got it running, but still can’t connect. Firewall rules are all set properly, i even opened all ports to all traffic from all ip addresses to no avail. Just to be sure, i tried again to connect from the ssh terminal “nc -zv localhost 7180” also, nothing. Then i ran a netstat and there’s nothing listening on port 7180.. what am i missing?
My instance is Ubuntu 14.x and the CM version is 5.x (whatever came down from “wget http://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin”
any help would be appreciated!
Cheers,
Matt
-
And when you do
sudo service cloudera-scm-server status
Does it say running?
It might not boot, and you should see the errors in /var/log/cloudera-scm…
-
-
Checking for service cloudera-scm-server: * cloudera-scm-server is running
but still can’t connect via browser.. so checked with nc..
ubuntu@ip-10-0-0-24:~$ nc -zv localhost 7180
nc: connect to localhost port 7180 (tcp) failed: Connection refused
sudo cat cloudera-scm-server/cloudera-scm-server.log
2015-08-30 23:54:03,880 INFO MainThread:com.cloudera.cmon.components.MetricSchemaUpdate: Creating new metric: total_oozie_coord_action_query_executor_update_coord_action_for_push_inputcheck_duration_timer_15min_rate_across_oozie_servers
2015-08-30 23:54:03,880 INFO MainThread:com.cloudera.cmon.components.MetricSchemaUpdate: Creating new metric: oozie_coord_action_query_executor_update_coord_action_for_modified_date_duration_timer_min_across_oozie_servers
2015-08-30 23:54:03,880 INFO MainThread:com.cloudera.cmon.components.MetricSchemaUpdate: Creating new metric: total_oozie_coord_action_query_executor_update_coord_action_for_modified_date_duration_timer_stddev_across_oozie_servers
2015-08-30 23:54:03,880 INFO MainThread:com.cloudera.cmon.components.MetricSchemaUpdate: persisting 15662 new metrics
2015-08-30 23:54:11,271 INFO MainThread:com.cloudera.cmon.components.MetricSchemaUpdate: persisting 0 updated metrics
2015-08-30 23:54:11,278 INFO MainThread:com.cloudera.cmon.components.MetricSchemaManager: Cross entity aggregates processed.
so it looks like it started ok.. ugh
-
If it is running, it could be /etc/hosts not being set up correctly for localhost, or a firewall issue.
-
-
yeah, i was thinking it was probably something with the firewall but i just can’t figure it out (the AWS console shows the ports all open for in and out).. anyone have other thoughts?
Cheers,
Matt
-
I’ve been looking through the log files and i see that there are entries for not enough heap for JAVA to run – and the cloudera_scm-server dies after 30 seconds or so..
I’m running a free EC2 instance “micro” with 8 gig of disk. How do i adjust the heap parm? i can’t find an scm server conf file..
Cheers,
Matt
-
*My Answer!*: OK.. so i’ve been wrestling with this for two days.. here’s the answer (at least for me)
1) you run a *service --status-all* and you see the “cloudera-scm-server” service isn’t running, so you start it
2) you start the service with a “sudo service cloudera-scm-server start” and it starts, when you check with service status (as in step one) you see it started, but after 30 seconds it’s dead again
3) snoop around and find in the syslogs that there isn’t enough heap for the java stack to run
*So.. I changed the running EC2 instance from a *micro* to a *small* (have to stop the instance and then change it) and now it’s running fine!!*
so I think the “small” instances and above aren’t free.. so we’ll see what happens. maybe resize the memory allocation in the /etc/default/cloudera-scm-server.conf file.. we’ll see.
-
Nice! For demo.gethue.com we also use memory-optimized instances (more memory but less CPU) and it works pretty well (r3.large).
-
-
Hello Team,
Is there anyway we can install hue on EMR 4.0.0?
Thanks
Ankur
-
I think it is do it yourself for now (cf. Ubuntu guide of the Configure section for example) until Amazon bundles it in 4.0
-
-
Hello,
I’m trying to install HUE on my instance (on AWS EC2), I followed the tutorial but when I finally get install the cloudera manager by Hue in the level Cluster Setup show message “Failed to perform First Run of services.” and the command progress show the message “command (49) has falled” .
Cheers,
Shermila.
-
We need more information, the full logs, if the machine has enough space, memory etc..
-
-
Hi,
I am experiencing this error
“The application might not be running yet or there is no Node Manager or Container available. This page will be automatically refreshed”
on demo.gethue.com as well as in locally setup cluster in VMplayer. Please help me out.
-
Did you check if YARN was up? demo.gethue.com works well, I just looked
-
-
Hey there, nice guide.
I managed to install CDH 5.4.8 (Parcels), the latest available today, on an Ubuntu single m4.xlarge EC2 instance. I installed and started Hue successfully. BTW, I really wanted to play around with the Spark notebook which seems to be introduced starting from 3.8. Since it’s not gonna be a production system, I wanted to ask for your help on how to upgrade/install the latest Hue available which seems to be 3.9.
Thanks in advance.
-
The Spark Notebook will be available in CDH 5.7, it is currently a beta and not supported there. We recommend playing with it manually on the side from master: http://gethue.com/spark/
-
Thanks Hue Team,
So isn’t there a way to manually install/upgrade Hue (up to 3.9 in this case) over a 5.4.8 CDH installation?
-
Sorry, you are on your own until it is officially supported in CDH 😉
-
got it, thanks. any ETA on CDH 5.7 release date?
-
AFAIK ETA is ~ early Q2 2016
-
Hi Team,
I have to set up a fully distributed Hadoop cluster on physical servers, not on any cloud env.
Once the cluster is ready I have to install CDH and Cloudera Manager, but we don’t have internet access on the servers.
Can you please help me with the installation steps.
Thank you
-
We suggest you refer to the Cloudera documentation regarding CDH and Cloudera Manager: http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest.html
-
-
Hi team, thanks for the reply …,
can you please let me know whether Cloudera supports CentOS 6.7 to install Cloudera Manager and CDH.
Because, as per the Cloudera document:
Supported Operating Systems for Cloudera Manager
Cloudera Manager supports a range of operating systems including:
Red Hat-compatible systems
Red Hat Enterprise Linux 5.7 and CentOS 5.7, 64-bit
Red Hat Enterprise Linux 6.2 and 6.4, and CentOS 6.2 and 6.4, 64-bit
I am trying to install Cloudera Manager on CentOS 6.7 but I am not able to bring up the agent.
-
Centos 6.7 is supported in CDH5.5 http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cm_ig_cm_requirements.html#cmig_topic_4_1_unique_1
-
-
HI Team thanks for your support,
After Cloudera installation, if we have to give access to Pig and Hive only to one particular user, and to Spark and Hue to some other user, how can we do that? I also came to know that we cannot restrict users from accessing applications like HDFS, Hive, Pig, Spark, but for HUE, you need to create users manually in the HUE UI using the admin account.
can you please help me with the steps I need to follow in order to achieve this…
-
Did you see http://gethue.com/how-to-manage-permissions-in-hue/ ?
-
-
Hi Team,
When I run
ssh -i ~/demo.pem ubuntu@ec2-11-222-333-444.compute-1.amazonaws.com
with my data, I am getting the error:
No such file or directory.
Permission denied (publickey).
Can you help me, please?
-
There is no data with ‘ssh’ command, are you confusing it with the ‘scp’ command?
-
-
To install and deploy Hadoop using this method, is the free Cloudera Express enough ? Or do you need Cloudera Enterprise to keep using clusters built using this method.
-
You can use it for free, but AFAIK the Express edition later won’t let you fully manage Spark / Impala / Search and use the monitoring in general.
-
Would that be a major impediment if my use case is running ElasticSearch / Nutch ?
-
Currently Hue only supports Solr API
-
facing the error “host monitor is not running”
i did all installation part of cloudera but still facing this.
-
You could look at the logs of the process in CM or ask on the CM forum http://community.cloudera.com/t5/Cloudera-Manager-Installation/bd-p/CMInstall
-