Make the most out of your Jupyter Notebooks using Poetry.

Manuel Gil
5 min read · Nov 19, 2022


Exploratory Data Analysis (EDA) is the first stage of any data science project. By exploring and analyzing datasets we can summarize their main characteristics, recognize hidden patterns, and prepare the data for further processing. To perform an EDA we usually rely on Jupyter Notebooks, a web-based environment in which we can write Python code, display text, and visualize charts in the same place. There are many advantages to using notebooks: they are versatile and especially useful when creating something that should be presented.

Photo by Chris Ried on Unsplash

However, it is worth noting that we can end up with a messy notebook if we ignore basic coding practices and populate it without keeping desirable traits like reusability and maintainability in mind. There are tools that help keep Jupyter Notebooks tidy and organized; one of them is Poetry, which allows us to create a Python package and organize the code following best practices.

How can Poetry help us to keep a neat notebook?

When exploring a dataset we are usually interested in plotting descriptive graphs so we can better understand, at a single glance, the phenomena the data describes. To obtain a visual representation of the dataset we can tap into an extensive repertoire of Python libraries that leverage the interactive capabilities of Jupyter Notebooks.

As mentioned above, Jupyter Notebooks are a great option for Exploratory Data Analysis. However, the very interactivity that makes them good for visualizing data also makes it hard to follow good coding practices. There is no way to guarantee linear execution of the code, which makes notebooks difficult to debug and the code difficult to keep reusable and maintainable.

At this point a question might arise: how can I use Poetry to improve my code and therefore improve my analysis? The idea is simple: abstract most of the code used within our Jupyter Notebooks into general functions that can be reused throughout all our notebooks.

Using functions to encapsulate code.

To keep a notebook neat and organized, and to avoid repeated code as much as possible, we can take advantage of functions and the concept of abstraction: write reusable functions and call them within the notebook wherever they are needed. Let’s take a look at the next cell.

The code shown above plots some graphs using seaborn and matplotlib. When we take a look at this piece of code, a clear problem arises, and it is related to reusability: in many cases we need the same plot for different variables or columns in a dataset, and having to rewrite all the code each time we need a graph is, at the very least, bad coding practice.

To avoid the problem mentioned above, we can encapsulate the code that creates the graph in a function and call it each time we need this particular graph. Let’s see what the implementation of the same code looks like using functions.

Now we have a function called create_plot that can be declared once and then called anywhere within our notebook, each time we need this particular graph.

Having function definitions within a notebook may not be convenient, but what about importing those functions like any other library, such as NumPy or Pandas? We can tap into Poetry to create a customized Python library and use it within our Jupyter Notebooks wherever we need it.

Poetry and customized libraries.

Let’s create an elementary library with Python to illustrate creating a custom package. The first thing to do is to create a Poetry project, which we can do with the command:

poetry new custom_visualization

The command shown above will create a Python package structure in which we can write the code of our custom library. The project structure looks like this:

custom_visualization
├── pyproject.toml
├── README.rst
├── custom_visualization
│   └── __init__.py
└── tests
    ├── __init__.py
    └── test_custom_visualization.py

I would say the most important file in this structure is pyproject.toml, in which, among other things, we can specify the dependencies our custom library needs; in this example, we will use matplotlib and seaborn. To add dependencies we use the command:
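After adding matplotlib and seaborn, the pyproject.toml would look roughly like this (the version constraints and metadata below are illustrative, not the article's exact values):

```toml
[tool.poetry]
name = "custom_visualization"
version = "0.1.0"
description = "Reusable plotting helpers for notebooks"

[tool.poetry.dependencies]
python = "^3.9"
matplotlib = "^3.6"
seaborn = "^0.12"
```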

poetry add matplotlib seaborn

Once the project structure is created, the code we need for our particular application should be placed under the folder called custom_visualization. Inside it we can create a modular structure; in this example we only need one module called plotting, which contains the functions we can use.

To make this project a Python library, it is necessary to build a wheel file, the standard built-package format for Python distributions. The wheel file contains all the metadata a Python installer needs; we can use Poetry to build this file with the command:

poetry build

The command shown above creates the wheel file under a folder called dist. The file with the .whl extension can be installed using pip install; then the library custom_visualization can be used by importing it in our code.

The results.

Once the package is installed, we can use the library in our Jupyter Notebooks with an import statement. For example:

from custom_visualization.plotting import distribution

Then we can access all the functions under the module distribution using it as a regular class.

distribution.create_distribution_plot(data=revolv_financially_risky,
                                      title='Revolving credits - Clients with financial problems',
                                      x_label='Utilization percentage',
                                      y_label='Density')

By wrapping our code inside functions and creating customized libraries, the workflow shown in Jupyter Notebooks looks neater and focuses all the attention on the results. Additionally, it makes the code easier to maintain and feasible to manage with version control tools such as git. Although this is a simple use case, the package can be as complex as we need, containing functions and classes to perform particular activities.

The code developed so far can be found here.

I am passionate about data science and like to explain how these concepts can be used to solve problems in a simple way. If you have any questions or just want to connect, you can find me on LinkedIn or email me at manuelgilsitio@gmail.com.

