Installing Jupyter (IPython Notebook) on HDP 2.4
Update: I revised the old article from January 2016 to work with the currently available Hortonworks Data Platform (HDP) 2.4 and Jupyter. Thanks to Carolyn Duby for mentioning the updated download location for the PyPA setuptools!
Since we're using the Hortonworks Data Platform at work, I toyed around with the HDP 2.4 Sandbox to see what's inside. One thing I found is that they upgraded Apache Spark from version 1.5 to the more recent 1.6 release. Needless to say, it sparked my interest... (you see what I did there?)
Being more comfortable with Python than with Scala, I decided to fire up the included pyspark shell, but found it not to be that engaging. I always liked Python's REPL interface and its direct way of programming, but it lacks the persistence of "normal" written scripts. So to get a better grip on how everything works, I decided to upgrade the distribution and install Jupyter, which lets me program like on a shell, but has the longevity of written code. I reaaaaally like that interface :)
So here is a step-by-step guide on how to get Jupyter running on the HDP 2.4 Sandbox:
Step 01: Download and Install HDP
- Download the Sandbox here
- Download VirtualBox here
- After installing VirtualBox and importing the virtual machine into it, you have to add a new port-forwarding rule. To do this, right-click the imported VM, go to Settings / Network, open the Advanced section and click on Port Forwarding. Add the following rule: name: ipython notebook, host IP: 127.0.0.1, host port: 8889, guest port: 8889
- After that, you may boot the VM
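If you prefer the host's command line over the settings dialog, the same port-forwarding rule can probably be added with VBoxManage. This is only a sketch: it assumes the VM uses NAT on its first network adapter, that the VM is still powered off when you run it, and the function name and the hyphenated rule name are my own choices.

```shell
# Hypothetical helper: forward host port 8889 to guest port 8889.
# Run on the host (not inside the VM) while the VM is powered off.
add_notebook_port_forward() {
    vm_name="$1"   # the VM name as shown in the VirtualBox manager
    # Rule format: name,protocol,host IP,host port,guest IP,guest port
    VBoxManage modifyvm "$vm_name" \
        --natpf1 "ipython-notebook,tcp,127.0.0.1,8889,,8889"
}
```

Call it with the name of your imported sandbox, for example: add_notebook_port_forward "Name of your VM".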
Step 02: Connect via ssh and install needed libraries
- Connect via ssh as root@127.0.0.1 on port 2222 (password: hadoop), for example with: ssh -p 2222 root@127.0.0.1
- Install the needed libraries:
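As a rough sketch of the library installation: centos-release-SCL is the package this guide relies on in Step 03, while the other package names below are an assumed, typical set of build dependencies for compiling Python libraries, not a confirmed list. The function name is hypothetical.

```shell
# Hypothetical helper: run as root inside the sandbox VM.
# centos-release-SCL enables the Software Collections repository (used in Step 03).
# The *-devel packages and the "Development tools" group are an assumed set
# of build dependencies for the Python libraries installed later.
install_build_deps() {
    yum install -y centos-release-SCL zlib-devel bzip2-devel openssl-devel \
        ncurses-devel sqlite-devel readline-devel libpng-devel atlas-devel &&
    yum groupinstall -y "Development tools"
}
```

Wrapping the commands in a function means nothing is installed until you call install_build_deps yourself.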
Step 03: Install Python 2.7
Since the HDP 2.4 Sandbox only comes with Python 2.6 installed, but Jupyter requires at least Python 2.7, we have to install it manually. To avoid compiling it ourselves, we can use the CentOS Software Collections repository, which we set up in the previous step (centos-release-SCL), and install the python27 collection from it with yum to get Python 2.7.
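The install itself can be sketched like this, assuming the collection is packaged under the name python27 and you are logged in as root. The function name is just for illustration.

```shell
# Hypothetical helper: install Python 2.7 from the Software Collections repository.
# Assumes centos-release-SCL was installed in the previous step.
install_python27() {
    yum install -y python27
}
```

Run install_python27 once inside the VM; the system Python 2.6 stays untouched, which is the point of using a software collection.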
Now we have to activate it for this session:
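A minimal sketch of the activation, assuming the collection puts its enable script in the usual Software Collections location /opt/rh/python27/enable:

```shell
# Assumed location of the enable script for the python27 collection.
SCL_ENABLE=/opt/rh/python27/enable

if [ -r "$SCL_ENABLE" ]; then
    # Source the script so PATH and library paths change in *this* shell.
    . "$SCL_ENABLE"
    python --version   # should now report a 2.7.x interpreter
else
    echo "not found: $SCL_ENABLE (is the python27 collection installed?)" >&2
fi
```

The change only lasts for the current shell session, so it has to be repeated after every new login (or placed in a startup script).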
Step 04: Install pip for Python 2.7
To get Jupyter and some more Python libraries, we will use pip - the Python package manager. So let's install pip and upgrade it to the latest version:
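One way to do this is sketched below. Note that it swaps in the get-pip.py bootstrap script rather than a setuptools-based install, and it assumes the VM has internet access, that curl is available, and that Python 2.7 is active in the current session. The function name is hypothetical.

```shell
# Hypothetical helper: bootstrap pip for the active Python 2.7, then upgrade it.
# Uses the get-pip.py variant published for Python 2.7 (an assumption of this
# sketch, not necessarily the method originally used in this guide).
install_pip27() {
    curl -fsSL -o get-pip.py https://bootstrap.pypa.io/pip/2.7/get-pip.py &&
    python get-pip.py &&
    pip install --upgrade pip
}
```

After calling install_pip27, pip --version should mention python 2.7 in its output path.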
Step 05: Install Jupyter (IPython Notebook)
First, we will install some more "standard" Python libraries for data scientists to have something to toy around with. Afterwards, we will install Jupyter (IPython Notebook). This might take a while.
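As a sketch, with the caveat that the library selection is my assumption of what counts as "standard" here (swap in whatever you need), and that the function name is hypothetical:

```shell
# Hypothetical helper: install an assumed set of data science libraries,
# then Jupyter itself. Requires the Python 2.7 session and pip from above.
install_notebook_stack() {
    pip install numpy scipy pandas scikit-learn matplotlib &&
    pip install jupyter
}
```

Some of these packages may be compiled from source during installation, which is why the build dependencies from Step 02 matter and why this step can take a while.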
Step 06: Create Jupyter startup script
Create a new file in your home directory and add the commands that activate Python 2.7 and start pyspark with the notebook as its front end. Afterwards, make this script executable.
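The whole step can be sketched in one go. The file name start_ipython_notebook.sh is chosen here for illustration, the enable path is the assumed Software Collections location, and the script relies on the IPYTHON_OPTS variable that the pyspark launcher of Spark 1.x reads. Port 8889 matches the port-forwarding rule from Step 01.

```shell
# Hypothetical file name for the startup script.
SCRIPT="$HOME/start_ipython_notebook.sh"

# Write the script; the quoted 'EOF' keeps the content literal.
cat > "$SCRIPT" <<'EOF'
#!/bin/bash
# Activate Python 2.7 from the Software Collections (assumed path).
source /opt/rh/python27/enable
# Spark 1.x: IPYTHON_OPTS makes pyspark start the notebook server as its driver.
# --ip=0.0.0.0 listens on all interfaces so the forwarded port can reach it.
IPYTHON_OPTS="notebook --port 8889 --ip=0.0.0.0 --no-browser" pyspark
EOF

# Make the script executable.
chmod +x "$SCRIPT"
```

With this in place, the notebook server would be started with ~/start_ipython_notebook.sh. Listening on all interfaces is convenient in a local sandbox, but not something I would do on a shared network.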
Step 07: Start IPython Notebook
You're done! You may start Jupyter from your home directory by running the startup script you created in Step 06.
After that you may open your web browser and go to http://127.0.0.1:8889 to get to the Jupyter start page, or to http://127.0.0.1:4040 to open the Spark UI, which gives you some insight into the memory consumption and duration of your Apache Spark jobs.
Cheers!
Troubleshooting guide
Error: Can't connect to the Jupyter notebook via IP address 127.0.0.1
Solution: Try to connect to the IP address of your virtual machine instead. To get the correct IP address, connect to the HDP virtual machine via ssh and run the ifconfig command.
Look for the entry "inet addr: XXX.XXX.XXX.XXX" of the device eth0 and write that IP address down. Afterwards, you should be able to connect by typing XXX.XXX.XXX.XXX:8889 in your browser. (For me it was 172.16.37.1:8889)
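If you don't want to scan the output by eye, a small filter can pull the address out. This sketch assumes the older ifconfig output format with an "inet addr:" field, as quoted above; the function name is hypothetical, and the Bcast and Mask values in the sample line are made up for the demonstration.

```shell
# Hypothetical helper: print the IPv4 address from "inet addr:" lines on stdin.
extract_inet_addr() {
    sed -n 's/.*inet addr:\([0-9.]*\).*/\1/p'
}

# Inside the VM you would run:  ifconfig eth0 | extract_inet_addr
# Demonstration on a sample line (Bcast/Mask values are illustrative only):
sample='          inet addr:172.16.37.1  Bcast:172.16.37.255  Mask:255.255.255.0'
printf '%s\n' "$sample" | extract_inet_addr
```

For the sample line this prints 172.16.37.1, which is the address to combine with port 8889 in the browser.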