Jupyter — helpful extra magics

Ben Mizrahi
Plarium-engineering

After working with Jupyter for a while, both for ML processes and for data engineering ETLs, we built an interactive platform that makes our ETLs stable, reliable and maintainable. You can read more about how we built it in the following story:

https://medium.com/@benm_23166/tech-story-spark-jupyter-on-k8s-64b9d43a37ba

With the infrastructure in place, we started writing our ETL pipelines in Jupyter and scheduling them in Google Cloud Composer. Along the way we ran into several problems, which we decided to solve with Jupyter’s magics infrastructure. Our goal was a more modular, maintainable and flexible code base. Below we go over the problems we had and show the self-developed magic that addresses each one:

  1. Huge notebooks are harder to maintain: As time goes by, a notebook’s complexity becomes impossible to manage. You start with a small notebook and end up with 10k lines of messy, complex code. To make our notebooks more readable and maintainable, we developed a new magic we call SuperRun. SuperRun can run any notebook within the same kernel, which lets us split notebooks into smaller units. Think of working in any other IDE: you could write your whole program in a single Java/Python file, but you probably wouldn’t. You would split your code into logical files, each representing a piece of functionality your solution needs. I feel the same about notebooks: splitting the source code into smaller notebooks makes your job more readable and maintainable. The examples later in this post make this concrete.
  2. Notification of plots and tables: A lot of our notebooks contain reports, both technical and business-oriented. One main feature we were looking for was a notification builder within the notebook flow. Sending the whole notebook as a PDF is in most cases too much information, so we created the notify magic. This magic collects pieces of information while the notebook executes and sends the results when needed. The examples later in this post illustrate it.
  3. BigQuery scan/cost estimation: Another critical and very powerful magic is the estimator. In BigQuery, the main billing factor is the amount of data scanned. In the BigQuery UI, end users have a scan estimator that warns them when they are doing something wrong. This lets them avoid scanning large datasets and optimize cost before even running the query. In Jupyter the estimator doesn’t exist, and we wanted to give users the scan information without leaving Jupyter. That’s where the BQ Estimator magic comes in.

Let’s get started:

Installation:

pip install jupyter_extra_magics

OR

git clone https://github.com/benmizrahi/jupyter-extra-magics && cd jupyter-extra-magics && python3 setup.py sdist bdist_wheel && pip install dist/*.whl

Once the package is installed, modify jupyter_notebook_config.py so the magics are loaded automatically when the Jupyter instance starts:

c = get_config()
c.InteractiveShellApp.extensions = ['extra_magics']

Once that is configured, the extra magics are ready to use inside the Jupyter instance.
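For context, IPython loads each name listed in `extensions` by importing that module and calling its `load_ipython_extension` function. The sketch below shows the general shape of such a module. The `hello` magic is hypothetical and is not part of jupyter_extra_magics; it only illustrates the registration step.

```python
def hello(line):
    """A toy line magic: returns a greeting built from the magic's argument."""
    return f"hello {line}".strip()


def load_ipython_extension(ipython):
    """Entry point IPython calls when the extension module is loaded."""
    # register_magic_function is provided by IPython's InteractiveShell;
    # after this call, `%hello world` is available in the notebook.
    ipython.register_magic_function(hello, magic_kind="line", magic_name="hello")
```

With a module like this on the Python path, listing its name in `c.InteractiveShellApp.extensions` makes the magic available in every new kernel session.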

In Practice — Using the magics:

Declaration of the %%estimate magic:

%%estimate
optional params:
block_cost -
decimal/integer cost ($) threshold for running the query;
if the cost is above this amount, a Y/N question will be displayed
block_scan -
integer scan (GB) threshold for running the query;
if the scan is above this amount, a Y/N question will be displayed
dry-run -
Boolean - does the current cell contain the %%bigquery magic?

In practice:

%%estimate --block_cost 0.1 --block_scan 10 --dry-run True
%%bigquery
SELECT count(1) FROM `bigquery-public-data.worldpop.population_grid_1km` WHERE last_updated is not null

The estimate magic blocks this query by presenting a Y/N prompt. If the user answers Yes, the query runs and the result is displayed; if No, the query is blocked.
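The blocking decision is simple arithmetic once the number of bytes a query would scan is known (BigQuery can report this through a dry run). The following is a minimal sketch, not the package’s actual implementation. The function name, the binary GB/TB units and the `price_per_tib_usd` parameter are assumptions; the price has to come from your own billing terms.

```python
from dataclasses import dataclass

GIB = 2 ** 30  # assumption: "GB" is treated as a binary gigabyte
TIB = 2 ** 40


@dataclass
class Estimate:
    scan_gb: float
    cost_usd: float
    needs_confirmation: bool


def estimate_scan(total_bytes, price_per_tib_usd, block_cost=None, block_scan=None):
    """Turn a dry-run byte count into a scan/cost estimate.

    needs_confirmation is True when either threshold is exceeded,
    which is the point where a Y/N prompt would be shown.
    """
    scan_gb = total_bytes / GIB
    cost_usd = total_bytes / TIB * price_per_tib_usd
    over_cost = block_cost is not None and cost_usd > block_cost
    over_scan = block_scan is not None and scan_gb > block_scan
    return Estimate(scan_gb, cost_usd, over_cost or over_scan)


# Example: a 20 GB scan with thresholds of $0.1 and 10 GB (price is illustrative).
result = estimate_scan(20 * GIB, price_per_tib_usd=5.0, block_cost=0.1, block_scan=10)
print(result)
```

In this example the scan threshold triggers the prompt even though the estimated cost stays under the cost threshold, which is why the two limits are checked independently.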

Declaration of the %super_run magic:

%super_run 
notebook:
String, full path to the notebook location.
(example: /full/path/to/other/notebook.ipynb)
regex_filter:
String, a regex matched against the content of each cell: if the pattern
exists in the cell, the cell is executed; if not, the cell is skipped.
This regex lets us write "GENERIC" notebooks for multiple kernel types and
use them where needed.
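The behavior described above can be approximated in a few lines: read the notebook’s JSON, keep the code cells that match the regex, and execute them in a shared namespace. This is only a sketch under simplifying assumptions. The function names are hypothetical, and it uses plain `exec`, so cells containing magics or shell commands would not run; a real magic would hand each cell to the IPython kernel instead.

```python
import json
import re


def run_cells(notebook, namespace, regex_filter=None):
    """Execute matching code cells of a parsed .ipynb document in `namespace`.

    Returns the number of cells that were executed.
    """
    pattern = re.compile(regex_filter) if regex_filter else None
    executed = 0
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        if isinstance(source, list):  # .ipynb may store source as a list of lines
            source = "".join(source)
        if pattern and not pattern.search(source):
            continue  # the regex filter decides whether the cell runs
        exec(source, namespace)  # shared namespace stands in for the shared kernel
        executed += 1
    return executed


def run_notebook(path, namespace, regex_filter=None):
    """Load an .ipynb file from disk and run it with run_cells."""
    with open(path, encoding="utf-8") as fh:
        return run_cells(json.load(fh), namespace, regex_filter)
```

Calling `run_notebook("/full/path/to/other/notebook.ipynb", globals())` from a notebook would leave the variables defined there in the caller’s scope, which is the effect the magic is after.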

In practice:

A simple example of using super_run to execute a second notebook from a first one. In our Jupyter file explorer we created two notebooks:

Multiple Notebooks — Single Kernel

The second notebook contains a single variable declaration named other_param:

Params declaration

In the first notebook we declare a local variable named local_param. Our target is to execute the second notebook and bring other_param into the kernel’s local scope:

Params passed between notebooks — shared kernel

As you can see, both variables exist in the local kernel and we can continue the program. Using this technique, you can split your notebooks into smaller, logical parts and keep your notebook’s flow clean and simple. Another ability we added is downloading notebooks directly from GitHub. This is very useful when running a single notebook in papermill or on any other platform. Simply set the following environment parameters and the magic will download the notebook from GitHub before executing it:

IS_BATCH_JOB: if set to true, the magic takes the required notebook directly from GitHub; if false, the magic uses the local filesystem to run the notebook.

GIT_REPO: a git base repository to take the notebooks from.

GITHUB: a GitHub repository to take the notebooks from.

GIT_TOKEN: token to access the git repo in raw mode.
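The switch between the two modes can be pictured as a small lookup on these environment parameters. The sketch below is an assumption-heavy illustration: the function name and return shape are hypothetical, it assumes IS_BATCH_JOB is compared case-insensitively against "true", and it reads only GIT_REPO and GIT_TOKEN. It does not perform any download.

```python
import os


def notebook_source(notebook_path, env=None):
    """Decide where a notebook should be read from, based on the environment."""
    env = os.environ if env is None else env
    # Assumption: any casing of "true" enables batch mode.
    is_batch = env.get("IS_BATCH_JOB", "false").strip().lower() == "true"
    if not is_batch:
        return {"source": "local", "path": notebook_path}
    return {
        "source": "github",
        "repo": env["GIT_REPO"],        # raises KeyError if the repo is not configured
        "path": notebook_path,
        "has_token": bool(env.get("GIT_TOKEN")),
    }


print(notebook_source("etl/load.ipynb", env={"IS_BATCH_JOB": "false"}))
```

Passing `env` explicitly keeps the function easy to test; in a notebook it would fall back to `os.environ`.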

Declaration of the %%notify_collect, %notify_clean and %notify magics:

Using these magics requires the following environment parameters to be declared when the Jupyter instance is created:

SLACK_API_TOKEN: token for sending Slack notifications using the Slack SDK for Python. Read more about the token in the Slack repo: https://github.com/slackapi/python-slack-sdk

EMAIL_USER: username for the SMTP connection used to send email.

EMAIL_PASS: password for the SMTP connection used to send email.

ENV: current running environment (INT/PROD/INTERACTIVE).

%%notify_collect 
kind:
EMAIL|SLACK - the type of message we want to send
destination:
String - where to send the notification: a comma-separated list
of email addresses (example: "email@address.com") or a comma-separated
list of channels (example: #channel,#channel_two)
header:
String - the EMAIL header message
body:
String - the email body (will be appended before the result-set).

The collect magic works with keys and can send more than one email in a notebook’s flow. The key for each notification is made of kind, destination and header. Each distinct combination of these three parameters is treated as a new notification.
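One way to picture this keying is a dictionary of pending notifications, where each key accumulates outputs in order. This is a hypothetical sketch rather than the package’s code: the class and method names are invented for illustration, and it uses a tuple as the key instead of a joined string.

```python
from collections import defaultdict


class NotificationCollector:
    """Groups collected outputs by (kind, destination, header)."""

    def __init__(self):
        self._pending = defaultdict(list)

    def collect(self, kind, destination, header, output):
        """Append one cell output to the notification identified by the key."""
        key = (kind.upper(), destination, header)
        self._pending[key].append(output)
        return key

    def clean(self):
        """Drop everything collected so far."""
        self._pending.clear()

    def pending(self):
        """Return a copy of the collected outputs, keyed by notification."""
        return {key: list(outputs) for key, outputs in self._pending.items()}


collector = NotificationCollector()
collector.collect("EMAIL", "email@address.com", "Hello World", "first result")
collector.collect("EMAIL", "email@address.com", "Hello World", "second result")
collector.collect("SLACK", "#channel", "Hello World", "a plot")
print(len(collector.pending()))  # two distinct notifications
```

The first two calls share a key, so their outputs end up in the same notification, while the Slack call starts a new one.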

In practice, wrap any cell with %%notify_collect to add its result-set to the email body. The collect magic can handle plots, charts and string results. All outputs are added to the mail body one after another. As an example, let’s look at the following:

The result-set of the query is collected by the collect magic, and the key for the notification is: email_email@address.com_Hello_World. Each time this set of params appears in %%notify_collect, the result-set is appended to the same email. You can also use external params to make your notification dynamic:

Another use case we had was clearing the notifications while the notebook is running, so we introduced %notify_clean to discard all the emails collected so far:

%notify_clean - clears the mails collected so far

After collecting everything we want to send, we call the %notify magic and all collected notifications are sent to their destinations.

%notify 
show_env - whether the env value should be visible in the message header
([INTERACTIVE]|[INT]|[PROD])
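To make the sending step more concrete, here is a sketch of how one collected email could be assembled with Python’s standard library before being handed to an SMTP connection. The function name and argument list are hypothetical, and placing the env tag as a prefix of the subject is an assumption about how show_env is rendered.

```python
from email.message import EmailMessage


def build_email(header, body, results, destination, sender, env, show_env=True):
    """Assemble one email: optional [ENV] tag, body first, then collected results."""
    msg = EmailMessage()
    msg["Subject"] = f"[{env}] {header}" if show_env else header
    msg["From"] = sender
    # destination is a comma-separated string, as in %%notify_collect
    msg["To"] = ", ".join(d.strip() for d in destination.split(","))
    msg.set_content("\n\n".join([body, *results]))
    return msg


message = build_email(
    header="Hello World",
    body="Daily report",
    results=["rows: 10"],
    destination="a@example.com,b@example.com",
    sender="bot@example.com",
    env="PROD",
)
print(message["Subject"])
```

The resulting message could then be sent with `smtplib`, using EMAIL_USER and EMAIL_PASS for the login.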

So the final notebook result will be as follows:

Thanks for reading. Feel free to contribute to the repository or report bugs, issues, etc.