Setting up a Python Data Science Environment

Python
You can set up your environment using either method with three commands

Here are instructions for getting a Python stack installed on Windows, Mac, and Linux. These examples will use Python 3 (which I recommend as Python 2 is no longer being developed).

You have two main options for installing a Python environment. You can use the Python executable that you can get from the Python website, or you can use the Anaconda distribution. I will show you both. If you are new to Python, only doing "data science" work, or using Windows, I recommend using Anaconda.

Using Anaconda

Conda TLDR
TLDR: Use these to create an environment using Anaconda

Anaconda is both a product and a company. The company supports Python data tooling and makes the distribution available as a free product, but also as a supported enterprise paid for product. This page will demo the free option, but I'm assuming that the paid product is similar after install. There are versions for both Python 3.x and Python 2.7. Make sure you use the Python 3.x version.

The Anaconda distribution includes "Python" it just has some extras. It uses its own package system to make it simple to create working Python environments on most platforms. The steps you need to take if you want to use it are:

  • Install Anaconda (or Miniconda)
  • Create an environment
  • Install packages into the environment

Install Anaconda

You have two options for installing Anaconda (the product). The first is to install the whole kitchen sink, which is referred to as the "Anaconda Distribution" a 600MB+ download that includes many common libraries for machine learning, data analysis, and visualization. You can download it here:

The other option is to use Miniconda. This is a smaller download that does not include various libraries but rather lets you download and install them on demand. You can download it here:

(Many of my clients work inside a firewall and should use the Anaconda distribution that is available locally inside of their firewall.)

After you have installed Anaconda or Miniconda, you should have an executable called conda available. Test out that you have it. On UNIX platforms (Mac and Linux) open a terminal (if you are on Mac, you can hit Command-space and then type "Terminal" and hit enter). On Windows, go to the start menu and search for "Conda Prompt". Type into the terminal or command prompt:

conda --version

And it should print out the version, something like conda 4.7.5. If it does not do that, then you did not install Anaconda successfully.

Create an Environment

After you have installed the conda utility, you will need to make an "environment." An environment is a sandboxed installation of Python libraries and utilities. The purpose of an environment is to allow per project library dependencies.

For example, I consult with different clients. If I have multiple clients over the years, they might use different versions of libraries. Say, for example, client A is using Pandas 0.19, client B Pandas 0.25, and client C Pandas 1.0.1. Pandas in generally is pretty good about backward compatibility, but it does have breaking changes. Environments allow me to install specific packages for each client project. When I need to work on client A's work, I activate the environment for that project and have access to Pandas 0.19. When I need to shift to client C, I activate the corresponding environment and have access to Pandas 1.0.1.

To create an environment, you need to give it a name. For a machine learning class, you might create an environment named mlclass (you can specify another name if you like). You can also specify a Python version for the environment. To create an environment with Python 3.7 from a terminal or Anaconda Prompt (on Windows) type:

conda create --name mlclass python=3.7

You can create as many environments as you like. I create them on a per-project basis. That way, I can jump into a project and make sure that I can work on it without dependency conflicts.

After you create the environment, you should activate it. This updates the PATH environment variable so that when you invoke python or utilities, you use the executables found in the environment. Type:

conda activate mlclass

After running this, your prompt should indicate that you are in the environment, and it will change to look like this:

(mlclass)

Note that Anaconda, by default, stores its environments under your home directory. On my Mac, it would put it in ~/anaconda3/envs/mlclass.

Install Packages with Conda

At this point, you have installed conda, created an environment, and activated the environment. You are now ready to install some packages. Here is a command to install some of the common data analysis packages:

conda install notebook pandas scikit-learn seaborn xlrd districtdatalabs::yellowbrick py-xgboost notebook conda-forge::shap bokeh graphviz

The packages with colons in their names are specified as channel::package name. Anaconda ships with certain packages, but libraries can create their own "channel" to ship different versions of their library (or even package their library for Anaconda) on a different cadence than the Anaconda release cycle.

At this point, you should be ready to go with a basic installation. To launch Jupyter notebook type:

jupyter notebook

To learn more about the conda tool, check out the User Guide.

Installation with Python

Python TLDR
TLDR: Use these commands on Linux or Mac if you do not want to use Anaconda. On Windows replace the second line with env\Scripts\activate

Note: If you are using Windows, it is recommended that you use Anaconda rather than this mechanism. Some of the scientific packages require installing and configuring a cross-compilation development environment, which can be annoying. If you are using a Mac and want to follow these instructions, it is recommended to install Python from Homebrew.

Install Python

If you are on Linux, use your package manager to install Python 3.x. On Mac, use Homebrew to install Python 3.x. If these instructions do not make sense to you, scroll up and use the instructions for Anaconda.

After Python is installed, open a terminal and check the version. Type:

python3 --version

It should say something like python 3.7.3. Make sure that you installed Python 3 and not Python 2.

Create a Virtual Environment

Python 3 ships with the ability to create environments that are very similar to Anaconda environments. However, they are not the same. Python calls its environments "Virtual Environments", so I will follow suit here. A virtual environment will allow you to create a sandbox for Python on a per-project basis.

To create a virtual environment named env in the current directory type:

python3 -m venv env

To activate this environment, type:

source env/bin/activate

Note that source is a UNIX Bash command (so this will not work on Windows. If you really want to use Windows, even though I suggested against it with this mechanism, you would type env\Scripts\activate.bat).

At this point, your prompt should change to something like this:

(env)

Install Packages with Pip

With Python installed and your virtual environment activated, you are ready to install packages. We use the pip tool to install packages from the Python Package Index. If your virtual environment is activated, then pip will install libraries into the virtual environment.

You should probably update pip with a new version. Type:

pip install -U pip

To install a base data science system type:

pip install notebook seaborn pandas scikit-learn xlrd yellowbrick xgboost shap bokeh graphviz

To launch Jupyter Notebook type the following:

jupyter notebook

There is a User Guide for pip as well.

Other Options

Can you use Docker? Probably. The above are the simplest ways of getting your environment set up. There may be an existing Docker container you can leverage. But Docker is usually orthogonal to what I'm teaching. Also, the Docker images will probably be using one of the two techniques I described above.

What about hosted services? Google COLAB is a great service (disclaimer: I'm a subscriber to their PRO account), but requires access to Google COLAB, which many of my clients do not have.

Summary

Setting up a Python development environment is three lines of code once you have Python 3 or Anaconda installed. If you have questions about this, feel free to tweet them. I try to keep this page up to date as I point my clients to it.