Paperless: Dataproc Serverless Meets Jupyter

Ben Mizrahi
Feb 2, 2024

paperless is a Python package designed to streamline the execution of Jupyter Notebooks with a Spark kernel. Unlike traditional approaches that rely on server-based solutions such as a dedicated Dataproc cluster, paperless lets users run batch Jupyter Notebooks against a Spark kernel using the remote-kernel technique, providing a more flexible and cost-effective solution.

https://pypi.org/project/paperless/

What is Papermill? It's a project focused on building tools for working with Jupyter Notebooks programmatically. It provides a set of utilities to execute notebooks, making it an essential component of the remote-kernel technique employed by paperless. By leveraging Papermill, paperless communicates with a Spark kernel running remotely, executing Spark-enabled code directly within the Jupyter environment.
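As a rough sketch of the programmatic execution Papermill provides (this is not paperless's own API — the helper names, notebook paths, and parameter names below are hypothetical, and `papermill` must be installed separately):

```python
def build_run_params(run_date: str, input_path: str) -> dict:
    """Parameters Papermill injects into the notebook's tagged 'parameters' cell."""
    return {"run_date": run_date, "input_path": input_path}


def run_notebook(input_nb: str, output_nb: str, run_date: str, input_path: str) -> None:
    """Execute a notebook with Papermill; requires `pip install papermill`."""
    import papermill as pm  # imported lazily so this sketch loads without papermill

    pm.execute_notebook(
        input_nb,               # source notebook
        output_nb,              # executed copy, with cell outputs saved
        parameters=build_run_params(run_date, input_path),
        kernel_name="python3",  # a remote Spark kernel spec would be named here
    )
```

The executed copy preserves every cell's output, which makes batch notebook runs auditable after the fact.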

What is Dataproc Serverless? Dataproc's serverless offering lets users submit Spark jobs without maintaining a dedicated Spark cluster. It dynamically allocates resources based on workload requirements, providing an elastic and cost-effective solution for distributed data processing. From the Google docs:

Use Dataproc Serverless to run Spark batch workloads without provisioning and managing your own cluster. Specify workload parameters, and then submit the workload to the Dataproc Serverless service. The service will run the workload on a managed compute infrastructure, autoscaling resources as needed. Dataproc Serverless charges apply only to the time when the workload is executing
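For comparison, submitting a plain PySpark batch workload to Dataproc Serverless looks roughly like this (the bucket, region, script, and batch names are placeholders, not values from the project):

```shell
# Submit a PySpark batch to Dataproc Serverless; no cluster to create or tear down.
gcloud dataproc batches submit pyspark gs://my-bucket/jobs/etl.py \
    --region=us-central1 \
    --deps-bucket=gs://my-bucket/deps \
    --batch=etl-2024-02-02 \
    -- --run-date=2024-02-02
```

Resources autoscale while the job runs, and billing covers only the execution time, which is what makes the notebook-as-batch approach cost-effective.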

pip install paperless

Installation and usage are described in the project README, including virtual-environment setup and Google Cloud preparation.

Enjoy,
