Installing IPython Notebook on HDP 2.3
Since we're using the Hortonworks Data Platform at work, I toyed around with the HDP 2.3 Sandbox to see what's inside. One thing I found is that they upgraded Apache Spark from version 1.2 to the more recent 1.3.1 release. Needless to say, it sparked my interest... (see what I did there?)
Being more comfortable with Python than Scala, I decided to fire up the included pyspark shell, but found it not that engaging. I've always liked Python's REPL interface and its direct way of programming, but it lacked the sustainability of "normally" written scripts. So to get a better grip on how everything works, I decided to upgrade the distribution and install IPython Notebook, which lets me program as if on a shell, but with the longevity of written code. I reaaaaally like that interface :)
So here is a step-by-step guide on how to get IPython Notebook running on the HDP 2.3 Sandbox:
Step 01: Download and Install HDP
- Download the Sandbox here
- Download Virtualbox here
- After installing VirtualBox and importing the virtual machine, you have to add a new port-forwarding rule. Right-click the imported VM, go to Settings / Network, expand Advanced, and click Port Forwarding. Add the following rule: ipython notebook, 127.0.0.1, Host Port 8889, Guest Port 8889
- After that, you may boot the VM
Step 02: Connect via ssh and install needed libraries
- You may connect via ssh: root@127.0.0.1, port 2222 (password: hadoop)
- Install the needed libraries:
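The two bullets above can be sketched as follows. The exact library list from the original isn't preserved here; at minimum, the Software Collections repository package referenced in step 03 is needed:

```shell
# Connect to the sandbox through the forwarded SSH port (password: hadoop)
ssh root@127.0.0.1 -p 2222

# Inside the VM: enable the CentOS Software Collections repository,
# which step 03 relies on (the original package list is not preserved)
yum install -y centos-release-SCL
```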
Step 03: Install Python 2.7
Since the HDP 2.3 Preview Sandbox only comes with Python 2.6 installed, but IPython Notebook requires at least Python 2.7, we have to install it manually. To do so without recompiling it ourselves, we can use the CentOS Software Collections repository, which we installed in the previous step (centos-release-SCL), so a single yum command gets us Python 2.7.
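The command, assuming the usual SCL package name `python27`:

```shell
# Install Python 2.7 from the Software Collections repository
yum install -y python27
```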
Now we have to activate it for this session:
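With Software Collections, this is typically done by sourcing the collection's enable script:

```shell
# Activate Python 2.7 for the current shell session only
source /opt/rh/python27/enable
python --version   # should now report Python 2.7.x
```

Note that this only affects the current shell; a new session starts with the system Python 2.6 again, which is why the startup script in step 07 sources it as well.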
Step 04: Install pip for Python 2.7
To get IPython Notebook and some more Python libraries, we will use pip, the Python package manager. So let's install pip:
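One common way, assuming the Python 2.7 environment from the previous step is still active, is the official `get-pip.py` bootstrap script:

```shell
# Download and run the pip bootstrap script with the SCL Python 2.7
wget https://bootstrap.pypa.io/get-pip.py
python get-pip.py
```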
Step 05: Install IPython Notebook
First, we will install some more "standard" Python libraries for data scientists, to have something to toy around with. Afterwards, we will install IPython Notebook itself.
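A plausible selection (the exact library list is my own; adjust to taste):

```shell
# Common data-science libraries (this selection is an assumption)
pip install numpy scipy pandas matplotlib scikit-learn

# IPython together with its notebook dependencies
pip install "ipython[notebook]"
```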
Step 06: Configure IPython Notebook
To get IPython Notebook running, we first have to create a new profile.
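Assuming we call the profile `pyspark` (the name is my choice; it just has to match whatever the startup script in step 07 references):

```shell
# Create a new IPython profile named "pyspark" (name is an assumption);
# this generates ~/.ipython/profile_pyspark/ipython_notebook_config.py
ipython profile create pyspark
```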
Then open the profile's notebook configuration file and change some values (don't forget to remove the #-symbol in front of those lines!)
new values:
- c.NotebookApp.ip = '0.0.0.0'
- c.NotebookApp.open_browser = False
- c.NotebookApp.port = 8889
- c.NotebookApp.notebook_dir = u'/usr/hdp/2.3.0.0-2557/spark/'
You may also use this script to do the changes for you.
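If the linked script is not at hand, the edits can be sketched as appended overrides instead of in-place uncommenting, since assignments at the end of the file win over the commented defaults (the profile name `pyspark` is an assumption):

```shell
# Append the new settings to the profile's notebook config file
CONFIG="$HOME/.ipython/profile_pyspark/ipython_notebook_config.py"
mkdir -p "$(dirname "$CONFIG")"
cat >> "$CONFIG" <<'EOF'
c.NotebookApp.ip = '0.0.0.0'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8889
c.NotebookApp.notebook_dir = u'/usr/hdp/2.3.0.0-2557/spark/'
EOF
```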
Step 07: Create startup script
Create a new file in your home directory and add the following content to it.
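The original script isn't preserved here; a sketch, assuming the file is named `start_ipython.sh`, the profile from step 06 is called `pyspark`, and using the `IPYTHON`/`IPYTHON_OPTS` hooks of the Spark 1.x `pyspark` launcher:

```shell
#!/bin/bash
# start_ipython.sh -- launch pyspark inside IPython Notebook
# (file name and profile name are assumptions)
source /opt/rh/python27/enable                    # activate the SCL Python 2.7
export IPYTHON=1                                  # tell pyspark to use IPython...
export IPYTHON_OPTS="notebook --profile=pyspark"  # ...in notebook mode
cd /usr/hdp/2.3.0.0-2557/spark
./bin/pyspark
```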
Afterwards, make this script executable
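For example (the file name `start_ipython.sh` is an assumption):

```shell
# Make the startup script executable
chmod +x ~/start_ipython.sh
```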
Step 08: Start IPython Notebook
You're done! You may start IPython Notebook from your home directory with the command
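Assuming the script name from step 07:

```shell
# Launch IPython Notebook with pyspark attached
~/start_ipython.sh
```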
After that, you may open your web browser and go to http://127.0.0.1:8889 to reach the IPython Notebook start page, or http://127.0.0.1:4040 to open the Spark UI, which gives you some insight into memory consumption and duration of your Apache Spark jobs.
Cheers!