Paperless: Dataproc Serverless Meets Jupyter
paperless
is a Python package designed to streamline the execution of Jupyter notebooks with a Spark kernel. Unlike traditional approaches that rely on server-based solutions such as a managed Dataproc cluster, paperless
lets users seamlessly run batch Jupyter notebooks against a Spark kernel through a remote kernel, providing a more flexible and cost-effective solution.
https://pypi.org/project/paperless/
What is Papermill? It’s a project focused on building tools for working with Jupyter notebooks programmatically. It provides a set of utilities to execute notebooks, making it an essential component of the remote-kernel technique employed by paperless
. By leveraging Papermill, paperless
communicates with a Spark kernel running remotely, allowing Spark-enabled code to be executed directly within the Jupyter environment.
What is Dataproc Serverless? Dataproc Serverless lets users submit Spark jobs without maintaining a dedicated Spark cluster. It dynamically allocates resources based on workload requirements, providing an elastic and cost-effective solution for distributed data processing. From the Google docs:
Use Dataproc Serverless to run Spark batch workloads without provisioning and managing your own cluster. Specify workload parameters, and then submit the workload to the Dataproc Serverless service. The service will run the workload on a managed compute infrastructure, autoscaling resources as needed. Dataproc Serverless charges apply only to the time when the workload is executing
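The submission flow quoted above can be sketched with the gcloud CLI. The script name, region, and bucket below are placeholder assumptions, not values from the project.

```shell
# Submit a PySpark script as a Dataproc Serverless batch workload.
# "wordcount.py", the region, and the bucket name are illustrative.
gcloud dataproc batches submit pyspark wordcount.py \
    --region=us-central1 \
    --deps-bucket=gs://my-staging-bucket

# Check batch state (PENDING, RUNNING, SUCCEEDED, ...).
gcloud dataproc batches list --region=us-central1
```

paperless builds on this same service, so the usual Dataproc Serverless billing applies: charges accrue only while the workload is executing.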
pip install paperless
Installation and usage are described in the project README, including virtual environment setup and Google Cloud preparation.
Enjoy,