Paperless: Dataproc Serverless Meets Jupyter

Ben Mizrahi
Feb 2, 2024

paperless is a Python package designed to streamline the execution of Jupyter Notebooks with a Spark kernel. Unlike traditional approaches that rely on server-based solutions such as a dedicated Dataproc cluster, paperless lets users run batch Jupyter Notebooks against a Spark kernel using the remote-kernel technique, providing a more flexible and cost-effective solution.

https://pypi.org/project/paperless/

What is Papermill? It's a project focused on building tools for working with Jupyter Notebooks programmatically. It provides a set of utilities to execute notebooks, making it an essential component of the remote-kernel technique employed by paperless. By leveraging Papermill, paperless communicates with a Spark kernel running remotely, executing Spark-enabled code directly within the Jupyter environment.
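As a rough sketch of the programmatic execution Papermill provides (this is not paperless's own API — the helper names, notebook paths, and parameter names below are hypothetical, and `papermill` must be installed separately):

```python
def build_run_params(run_date: str, input_path: str) -> dict:
    """Parameters Papermill injects into the notebook's tagged 'parameters' cell."""
    return {"run_date": run_date, "input_path": input_path}


def run_notebook(input_nb: str, output_nb: str, run_date: str, input_path: str) -> None:
    """Execute a notebook with Papermill; requires `pip install papermill`."""
    import papermill as pm  # imported lazily so this sketch loads without papermill

    pm.execute_notebook(
        input_nb,               # source notebook
        output_nb,              # executed copy, with cell outputs saved
        parameters=build_run_params(run_date, input_path),
        kernel_name="python3",  # a remote Spark kernel spec would be named here
    )
```

The executed copy preserves every cell's output, which makes batch notebook runs auditable after the fact.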

What is Dataproc Serverless? Dataproc's serverless offering lets users submit Spark jobs without maintaining a dedicated Spark cluster. It dynamically allocates resources based on workload requirements, providing an elastic and cost-effective solution for distributed data processing. From the Google docs:

Use Dataproc Serverless to run Spark batch workloads without provisioning and managing your own cluster. Specify workload parameters, and then submit the workload to the Dataproc Serverless service. The service will run the workload on a managed compute infrastructure, autoscaling resources as needed. Dataproc Serverless charges apply only to the time when the workload is executing
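For comparison, submitting a plain PySpark batch workload to Dataproc Serverless looks roughly like this (the bucket, region, script, and batch names are placeholders, not values from the project):

```shell
# Submit a PySpark batch to Dataproc Serverless; no cluster to create or tear down.
gcloud dataproc batches submit pyspark gs://my-bucket/jobs/etl.py \
    --region=us-central1 \
    --deps-bucket=gs://my-bucket/deps \
    --batch=etl-2024-02-02 \
    -- --run-date=2024-02-02
```

Resources autoscale while the job runs, and billing covers only the execution time, which is what makes the notebook-as-batch approach cost-effective.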

pip install paperless

Installation and usage are described in the project README, including virtual-environment setup and Google Cloud preparation.

Enjoy,
