After working with Jupyter for a while, both for ML processes and data engineering ETLs, we came up with an interactive solution to make our ETLs stable, reliable and maintainable. Read more about how we built our interactive platform in the following story:
Once the infrastructure was ready, we started writing our ETL pipelines in Jupyter and scheduling them in Google Cloud Composer. In our ETLs we came across several problems, and we wanted to resolve them with Jupyter's magics infrastructure. Our goal was to create a more modular, maintainable and flexible code base. In the next section we go over the problems we had and give an example of the self-developed magic that resolved each one:
- Huge notebooks are harder to maintain: as time goes by, notebook complexity becomes impossible to manage. You start with a small notebook and end up with 10k lines of messy, complex code. To make our notebooks more readable and maintainable we developed a new magic we call SuperRun. SuperRun can run any notebook within the same kernel, which lets us split notebooks into smaller units. Think of working in any other IDE: you could write your program in a single Java/Python file, but you probably won't. You split your code into logical files that represent functionality based on your solution's needs. I feel the same about notebooks: splitting the source code into smaller notebooks makes your work more readable and maintainable. We will see some examples later to clarify the need.
- Notification of plots and tables: many of our notebooks contain reports, both technical and business. One main feature we wanted was a notification builder within the notebook flow, because sending the whole notebook as a PDF is in most cases too much information. For that we created the notify magic. It collects pieces of information while the notebook executes and sends the results when needed. Here too we will see examples to make it clearer.
- BigQuery scan/cost estimation: another critical and very powerful magic is the estimator. In BigQuery the main billing factor is the amount of data scanned. In the BigQuery UI the user has a scan estimator that warns them when they are doing something wrong, so they can avoid scanning large datasets and optimize cost before even running the query. In Jupyter the estimator doesn't exist, and we wanted to give the user scan information without leaving Jupyter. That's where the BQ Estimator magic comes in.
Let’s get started:
pip install jupyter_extra_magics
OR
git clone https://github.com/benmizrahi/jupyter-extra-magics && cd jupyter-extra-magics && python3 setup.py sdist bdist_wheel && pip install dist/*.whl
After installing the package, modify jupyter_notebook_config.py to autoload the magics when the Jupyter instance starts:
c = get_config()
c.InteractiveShellApp.extensions = ['extra_magics']
Once that is configured, the extra magics are ready to use inside the Jupyter instance.
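For context, each entry in c.InteractiveShellApp.extensions is a module name: IPython imports that module and calls its load_ipython_extension function. The sketch below shows that hook for a hypothetical magic named hello. It is not the code of the extra_magics package, only an illustration of the loading mechanism.

```python
def hello(line, cell=None):
    """Hypothetical magic: echo the arguments it was called with."""
    # For a line magic `cell` is None; for a cell magic it holds the cell body.
    body = cell if cell is not None else ""
    return f"args={line!r} body={body!r}"


def load_ipython_extension(ipython):
    """Hook that IPython calls when the extension module is loaded."""
    # register_magic_function is a method of IPython's InteractiveShell;
    # "line_cell" makes the function usable as both %hello and %%hello.
    ipython.register_magic_function(
        hello, magic_kind="line_cell", magic_name="hello"
    )
```

With a module like this listed in the extensions setting, %hello and %%hello would be available in every new kernel without calling %load_ext by hand.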
In Practice — Using the magics:
decimal/integer cost ($) blocker to run the query,
if the cost is above the amount, a Y/N question will be displayed
integer scan (GB) blocker to run the query,
if the scan is above the amount, a Y/N question will be displayed
Boolean - whether the current cell contains the %%bigquery magic
%%estimate --block_cost 0.1 --block_scan 10 --dry-run True
SELECT count(1) FROM `bigquery-public-data.worldpop.population_grid_1km` WHERE last_updated is not null
The estimate magic blocks this query by presenting a Y/N prompt. If the user answers Yes, the query proceeds and the result is presented; if No, the query is blocked.
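The threshold check itself can be sketched in a few lines. This is a minimal illustration, not the magic's actual implementation: the function names are hypothetical, and price_per_tb is a placeholder that should be replaced with the current BigQuery on-demand price. In the google-cloud-bigquery client, the byte count could come from a dry-run job (QueryJobConfig(dry_run=True)), which reports total_bytes_processed without running the query.

```python
BYTES_PER_GB = 1024 ** 3
BYTES_PER_TB = 1024 ** 4


def estimate(total_bytes, price_per_tb=5.0):
    """Return (scan_gb, cost_usd) for a query that would scan total_bytes.

    price_per_tb is an assumed placeholder, not an authoritative price.
    """
    scan_gb = total_bytes / BYTES_PER_GB
    cost_usd = total_bytes / BYTES_PER_TB * price_per_tb
    return scan_gb, cost_usd


def needs_confirmation(total_bytes, block_cost=None, block_scan=None,
                       price_per_tb=5.0):
    """True when the estimate exceeds either threshold (cost in $, scan in GB)."""
    scan_gb, cost_usd = estimate(total_bytes, price_per_tb)
    over_cost = block_cost is not None and cost_usd > block_cost
    over_scan = block_scan is not None and scan_gb > block_scan
    return over_cost or over_scan


# A 50 GB scan against the thresholds used in the example above.
print(needs_confirmation(50 * BYTES_PER_GB, block_cost=0.1, block_scan=10))
```

Only when this check returns True would the user need to be asked the Y/N question; queries under both thresholds can run straight away.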
String, full path to the notebook location.
String, a regex applied to the content of each cell: if the pattern
is found in the cell, the cell is executed; if not, the cell is skipped.
This regex lets us write a "GENERIC" notebook for multiple kernel types and
use it where needed.
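The regex filter can be illustrated with plain Python over the notebook's JSON structure. This is a sketch under assumptions: it relies on the standard .ipynb layout (a cells list whose items have cell_type and source), the function name is hypothetical, and the sample notebook is made up for the example.

```python
import re


def matching_code_cells(notebook, pattern):
    """Return the source of each code cell whose text matches pattern."""
    regex = re.compile(pattern)
    selected = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # markdown and raw cells are never executed
        source = cell.get("source", "")
        # The notebook format stores source as one string or a list of lines.
        text = "".join(source) if isinstance(source, list) else source
        if regex.search(text):
            selected.append(text)
    return selected


# Made-up notebook: only the cell tagged with a GENERIC comment should run.
notebook = {
    "cells": [
        {"cell_type": "markdown", "source": ["# GENERIC notebook"]},
        {"cell_type": "code", "source": ["# GENERIC\n", "other_param = 1"]},
        {"cell_type": "code", "source": "%%spark\nval x = 1"},
    ]
}
print(matching_code_cells(notebook, r"GENERIC"))
```

Tagging cells with a marker comment and filtering on it is one way to keep kernel-specific cells and shared cells in the same notebook.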
A simple example of using super_run to execute a second notebook from a first one. In our Jupyter file explorer we created 2 notebooks:
The content of the second notebook is a simple declaration of a variable named other_param:
In the first notebook we declare a local variable named local_param. Our target is to execute the second notebook and get other_param into the kernel's local scope.
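The idea of running another notebook "within the same kernel" comes down to executing its cells in the namespace the first notebook already uses. The sketch below imitates that with plain exec and a dictionary standing in for the kernel namespace. It is an assumption about the mechanism, not the actual super_run code, and run_cells is a hypothetical helper.

```python
def run_cells(cell_sources, namespace):
    """Execute each cell's source in the given namespace, in order."""
    for source in cell_sources:
        exec(compile(source, "<notebook cell>", "exec"), namespace)
    return namespace


# Stand-in for the kernel namespace of the first notebook.
kernel_namespace = {"local_param": "defined in the first notebook"}

# Stand-in for the code cells of the second notebook.
second_notebook_cells = ["other_param = 'defined in the second notebook'"]

run_cells(second_notebook_cells, kernel_namespace)
print(kernel_namespace["local_param"], "|", kernel_namespace["other_param"])
```

Because both notebooks write into one namespace, variables from the second notebook are visible to the first one afterwards, and name clashes overwrite silently, so small notebooks with distinct variable names work best.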
As you can see, both variables exist in the local kernel and we can continue the program. Using this technique you can split your notebooks into smaller (and logical) parts and keep your notebook flow clean and simple. Another ability we added is downloading notebooks directly from GitHub. This is very useful when running a single notebook in papermill or any other platform: simply change the environment parameters, and before executing the notebook the magic will download it from GitHub:
IS_BATCH_JOB — If set to true — the magic takes the required notebook directly from GitHub, if false the magic uses local filesystem to run the notebook.
GIT_REPO — a git base repository to take the notebooks from.
GITHUB — a GitHub repository to take the notebooks from.
GIT_TOKEN — Token to access the git repo in raw mode.
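The switch between local and GitHub execution can be pictured as a small decision based on the environment. The sketch below is only an illustration: notebook_source is a hypothetical helper, the accepted spelling of "true" is an assumption, and the repository and path values are made up. How the magic actually downloads the file is not shown here.

```python
import os


def notebook_source(path, env=os.environ):
    """Decide where a notebook should be loaded from, based on env settings."""
    # Assumption: any capitalization of "true" enables batch mode.
    is_batch = env.get("IS_BATCH_JOB", "false").strip().lower() == "true"
    if is_batch:
        return ("github", env.get("GIT_REPO"), path)
    return ("local", None, path)


# Made-up values for illustration.
batch_env = {"IS_BATCH_JOB": "true", "GIT_REPO": "my-org/my-notebooks"}
print(notebook_source("etl/load.ipynb", batch_env))
print(notebook_source("etl/load.ipynb", {}))
```

Keeping the decision in environment parameters means the same notebook runs unchanged in an interactive session and in a scheduled job.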
Using the notify magics requires the following environment parameters to be declared when the Jupyter instance is created:
SLACK_API_TOKEN - token for sending Slack notifications using the Slack SDK for Python. Read more about the token in the Slack repo: https://github.com/slackapi/python-slack-sdk
EMAIL_USER - username for the SMTP connection used to send email.
EMAIL_PASS - password for the SMTP connection used to send email.
ENV - current running environment (INT/PROD/INTERACTIVE).
EMAIL|SLACK - the type of message we want to send
String - where to send the notification: a comma-separated list
of destinations (example: "firstname.lastname@example.org") or comma-separated
channels (example: #channel,#channel_two)
String - the email header message
String - the email body (will be appended before the result set).
The collect magic works with keys and can send more than one email per notebook flow. The key for each notification is made up of kind, destination and header; each distinct combination of the 3 parameters is treated as a new notification.
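The keying rule can be sketched as a small collector class. This is an illustration under assumptions: NotificationCollector and its methods are hypothetical names, the key is modeled as a tuple, and the collected outputs are placeholder strings.

```python
from collections import defaultdict


class NotificationCollector:
    """Group collected outputs by (kind, destination, header)."""

    def __init__(self):
        self._parts = defaultdict(list)

    def collect(self, kind, destination, header, output):
        # The same key appends to an existing notification;
        # a new combination starts a new notification.
        self._parts[(kind, destination, header)].append(output)

    def clean(self):
        """Drop everything collected so far."""
        self._parts.clear()

    def pending(self):
        """Return a copy of the notifications waiting to be sent."""
        return {key: list(parts) for key, parts in self._parts.items()}


collector = NotificationCollector()
collector.collect("EMAIL", "email@example.com", "Hello World", "<table>daily totals</table>")
collector.collect("EMAIL", "email@example.com", "Hello World", "<img alt='plot'>")
collector.collect("SLACK", "#channel", "Hello World", "job finished")
print(len(collector.pending()))
```

Here the two EMAIL calls share a key and end up in one message, while the SLACK call creates a second, separate notification.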
In practice, wrap any paragraph with %%notify_collect to add the result set to the email body. The collect magic can handle plots, charts and string results, and all outputs are added to the mail body one by one. As an example, let's look at the following:
The result set of the query will be collected by the collect magic, and the key for the notification is email@example.com_Hello_World. Each time this set of params is passed to %%notify_collect, the result set is appended to the same email. You can also use external params to make your notification dynamic:
Another use case we had was clearing the notifications while running the notebook, so we introduced %notify_clean to clear all the emails collected so far:
%notify_clean - cleaning the mails loaded until now
After collecting all the wanted emails, we can send them via the %notify magic, and all collected notifications will be sent to their destinations.
show_env - whether the env value should be visible in the message header
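To make the last step concrete, here is a minimal sketch of turning collected parts into an email with Python's standard email.message module. The build_email helper and the "[ENV] header" subject format are assumptions for illustration; the real magic may format the header differently.

```python
import os
from email.message import EmailMessage


def build_email(destination, header, parts, show_env=False, env=None):
    """Assemble one email from the collected parts (nothing is sent here)."""
    env = env if env is not None else os.environ.get("ENV", "")
    # Assumed format: prefix the header with the environment when requested.
    subject = f"[{env}] {header}" if show_env and env else header
    message = EmailMessage()
    message["To"] = destination
    message["Subject"] = subject
    message.set_content("\n\n".join(parts))
    return message


msg = build_email(
    "email@example.com",
    "Hello World",
    ["first result", "second result"],
    show_env=True,
    env="PROD",
)
print(msg["Subject"])
```

A message built this way could then be handed to smtplib's send_message over a connection authenticated with EMAIL_USER and EMAIL_PASS.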
So the final notebook will look as follows:
And the resulting notification will look as follows:
Thanks for reading. Feel free to contribute to the repository or report bugs and issues.