If you've been in the data engineering industry for a while, you probably know all the existing notebook solutions out there. In this article I'm going to focus on what we use in our Data & ML platform, and how.
First, let's talk about us…
Plarium is a leading gaming company — Plarium's games aren't just played for fun, they're a way of life. With millions of active players every day, we rank among the top gaming companies in the RPG and strategy genres. Raid: Shadow Legends is just one example of a game that tops the charts every day.
All our games rely on data. In the gaming industry, data isn't just a "nice-to-have" part of the tech stack — with more and more companies competing for a limited pool of players, our game experience must be the best we can deliver, and the way to know whether it is relies on our data. All of that data is collected, transformed, analysed and exposed in our Data & ML platform — that's what drives our business forward and gives our players the best game experience.
So with that context — let’s start:
The first thing we did was decide on our main tech stack, and for that we needed to focus on the goals and requirements of our Data & ML platform's interactive system:
- Always Available — The main goal of a data platform in a fast-growing company is to give us the ability to take action based on numbers with minimum time and effort. To do that we need a flexible environment that everyone can use anytime, anywhere.
- Multiple Clients — As the data grows, the need to investigate and learn from it also grows. With that in mind, our platform should support every data consumer (engineers, analysts, scientists, visualisation, BI, etc.).
- Scale To Zero — Another principle is paying only for what we use. Defining the needs and estimating the cost is critical — when you need to process terabytes of data, take resources as you need them. When you no longer need them — destroy them.
- Minimum Maintenance Effort — We don't want to need an entire DevOps department to keep our system stable and healthy; as engineers we want to handle it on our own — with minimal Friday-night wake-up calls.
- Isolated User Environments — One of the lessons we learned during our investigation is that a shared user environment sucks: every change, load or installation in the environment affects all users, and with that it can break users' work. We prefer to handle things individually and keep each user's environment as isolated as possible.
Sure, we have lots of other requirements and needs, but those are the five general ideas we kept in mind during our journey. The next thing to put in place is our tech stack:
Making it Reliable — From VMs to Containers:
After a long time working with VMs in the cloud, we suffered a lot from updates, package deprecations and mismatches between the packages we needed to install — whether you work with Scala and share the JVM classpath, or work with Python and discover that the latest package update has breaking changes (of course it happens on a Friday at 8pm). So we wanted to avoid all that and base our solution on immutable containers — a technology that has proved itself and just works.
Making it Scale — Containers & K8s:
Since we had already been working with GCP for a long time, we had no problem scaling containers in the cloud — we used Google Kubernetes Engine to create, destroy and update our containers.
Making it Scheduled — Cloud Composer (Airflow):
We truly believe in solutions as a service. Google Cloud as a platform has tons of services that can really make our lives easier, especially for R&D teams that need to focus on the business rather than on discovering why the hell a process didn't run today. We chose to go with GCP's managed Airflow solution (AKA Cloud Composer). Composer really makes our lives easy — after a long time spent on other scheduler systems, we found it to be the most popular, simple (infrastructure as code) and stable solution. Choosing Composer also got us, for "free": containerized Airflow, authentication, GCP plugins and much more.
Making it Distributed — Apache Spark:
We had been using Apache Spark for a while — all our pipelines are built on Spark, and it plays a big part in our ability to process huge amounts of data in both streaming and batch. So as a starting point we wanted to bring this technology into our interactive solution. The latest release of Spark (v3.0.0) was a game changer for us, and we will discuss it a bit later.
JupyterHub — Connecting the Dots
After we clarified what tech stack we wanted to use, we started investigating the solutions out there in the open-source forest. The first thing we spotted was JupyterHub — a solution provided by the Jupyter community for multi-user environments. The best part is the awesome documentation on how to get things done in the cloud with our GCP Kubernetes stack: https://zero-to-jupyterhub.readthedocs.io/en/latest/. As a starting point we created the cluster and managed to launch the JupyterHub part, but that's where we needed to go deeper into the code.
To support Apache Spark in Jupyter we went into the forest again and started investigating solutions that would give us the ability to run interactive Spark on K8s, with auto-scaling and on-demand provisioning of resources. We ended up using Livy & SparkMagic.
Apache Livy — a REST API session manager over Apache Spark. The main idea is that during idle time we keep a minimum number of pods in the cluster, so resources can be freed and downscaling can take effect. Livy's main repo still doesn't support Spark on K8s, but we found a fork (https://github.com/jahstreet) that has everything we needed for our architecture. Using it we got the behaviour we wanted — Spark spins up pods when a session is initialized and destroys them when the session ends.
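To make the session flow concrete, here is a minimal sketch of talking to Livy's REST API (`POST /sessions`) using only Python's standard library. The Livy service URL and the dynamic-allocation settings in the payload are illustrative placeholders, not our production values:

```python
import json
import urllib.request

# Placeholder in-cluster address for the Livy service.
LIVY_URL = "http://livy.default.svc.cluster.local:8998"

def session_payload(kind="pyspark", min_executors=2):
    """Build the JSON body for Livy's POST /sessions endpoint."""
    return {
        "kind": kind,  # pyspark | spark (Scala) | sparkr
        "conf": {
            "spark.dynamicAllocation.enabled": "true",
            "spark.dynamicAllocation.minExecutors": str(min_executors),
        },
    }

def create_session(payload):
    """POST the payload to Livy and return the parsed session descriptor."""
    req = urllib.request.Request(
        f"{LIVY_URL}/sessions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # Requires a reachable Livy service; Livy answers with the new
    # session's id and state ("starting", then "idle" once ready).
    print(create_session(session_payload()))
```

When the session on the Livy side ends (or times out), the Spark driver and executor pods it created are destroyed, which is exactly the scale-to-zero behaviour described above.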
SparkMagic — the glue between Jupyter and Livy. It provides PySpark, Scala and R kernels, with which we can run interactive Spark jobs via the Livy service. This "magic show" gives the Jupyter instance distributed processing power that can handle practically any data size, just by connecting the components together.
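For reference, pointing the SparkMagic kernels at the Livy service is done in `~/.sparkmagic/config.json`. This is a trimmed sketch with a placeholder service URL, not our full production config:

```json
{
  "kernel_python_credentials": {
    "username": "",
    "password": "",
    "url": "http://livy.default.svc.cluster.local:8998",
    "auth": "None"
  },
  "kernel_scala_credentials": {
    "username": "",
    "password": "",
    "url": "http://livy.default.svc.cluster.local:8998",
    "auth": "None"
  }
}
```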
To make it clearer, the following diagram shows the components taking part in our system:
You can now spot the instances that take part in our ecosystem.
During cool-down periods the only pods that stay up are the Hub pod, the Proxy pod and the Livy service. These can run with minimal resources, so we can assign them to a "low-resources" node pool in Kubernetes.
During warm-up periods we have N Jupyter instances (one per user), allocated using the profile list each user chooses from:
- display_name: "Default Small Image"
- display_name: "Strong Compute Image"
- display_name: "Super Strong Image"
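In Zero-to-JupyterHub terms, such a profile list lives under `singleuser.profileList` in the Helm values, with a `kubespawner_override` per profile. A sketch of what ours could look like — the resource numbers and node-pool labels below are illustrative placeholders, not our actual values:

```yaml
singleuser:
  profileList:
    - display_name: "Default Small Image"
      default: true
      kubespawner_override:
        cpu_limit: 2
        mem_limit: "4G"
        node_selector: {"pool": "small"}      # placeholder node-pool label
    - display_name: "Strong Compute Image"
      kubespawner_override:
        cpu_limit: 8
        mem_limit: "16G"
        node_selector: {"pool": "compute"}
    - display_name: "Super Strong Image"
      kubespawner_override:
        cpu_limit: 16
        mem_limit: "64G"
        node_selector: {"pool": "highmem"}
```

The `node_selector` in each profile is what lets GKE's autoscaler bring up nodes in the matching pool only when someone actually picks that profile.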
When a user chooses the profile they need, JupyterHub spins up a pod with the matching node_selector key, which triggers an auto-scale by the K8s system to allocate new nodes (if needed).
So far so good — but what happens when users start running Spark code? Well, that's where the magic comes into play: our new Livy service spins up a driver pod per user and starts executing the statements the user requested.
You probably noticed I didn't say anything about the executors — that's where Spark 3.0.0 was a game changer for us. Apache Spark has a mechanism called dynamicAllocation: when it's enabled, Spark spins up more and more executors based on the workload to support the scale, and it can estimate during processing the number of executors needed for the next computation.
From Spark Documentation:
Spark provides a mechanism to dynamically adjust the resources your application occupies based on the workload. This means that your application may give resources back to the cluster if they are no longer used and request them again later when there is demand. This feature is particularly useful if multiple applications share resources in your Spark cluster.
Before Spark 3.0.0, this feature could only be enabled if the externalShuffleService was enabled — it's needed to keep the shuffle files around when Spark decides to destroy an executor once its workload is done. The main problem was that the externalShuffleService wasn't supported on Kubernetes before Spark 3.0.0, but today, using the shuffleTracking flags, you can control this behaviour on your own. So we ended up using the following Spark config:
"spark.dynamicAllocation.shuffleTracking.timeout": "10s"
This config makes sure every user gets 2 executors at the beginning of their work, and Spark spins up more executors as needed based on the statement's workload. Using this feature we can dynamically support workloads and give users a performant environment.
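Put together, the dynamic-allocation settings look roughly like this. Only the `shuffleTracking.timeout` value comes from our config above; the surrounding flags are the standard Spark 3.0 dynamic-allocation properties, and the executor counts are illustrative, chosen to match the two-executor start just described:

```python
# Spark 3.0 dynamic allocation on K8s, without an external shuffle service.
spark_conf = {
    "spark.dynamicAllocation.enabled": "true",
    # shuffleTracking replaces the external shuffle service on K8s (Spark 3.0+)
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    "spark.dynamicAllocation.shuffleTracking.timeout": "10s",
    # start every session with two executors, scale up from there
    "spark.dynamicAllocation.initialExecutors": "2",
    "spark.dynamicAllocation.minExecutors": "2",
    "spark.dynamicAllocation.maxExecutors": "20",  # illustrative upper cap
}
```

With Livy in the picture, a dict like this is what goes into the `conf` field of the session-creation request.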
Another thing that needed to be set up in this architecture is the ability to load JARs into the Spark classpath dynamically, without building a new image for every JAR needed. To do so we used a tool called gcsfuse, which gives us the ability to mount a Google Cloud Storage bucket into a pod and use it as a file system to load files into the Spark runtime. To install gcsfuse we added the following command to our custom Spark image:
# Note: curl and gnupg are prerequisites for adding the repo and key, and
# $GCSFUSE_REPO must be defined earlier in the Dockerfile (e.g. ENV GCSFUSE_REPO gcsfuse-buster).
RUN apt-get -y -o Acquire::Check-Valid-Until=false update \
    && apt-get install --yes --no-install-recommends curl gnupg ca-certificates \
    && echo "deb http://packages.cloud.google.com/apt $GCSFUSE_REPO main" \
       | tee /etc/apt/sources.list.d/gcsfuse.list \
    && curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add - \
    && apt-get -y -o Acquire::Check-Valid-Until=false update \
    && apt-get install --yes gcsfuse \
    && apt-get clean && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
After installing gcsfuse, you can simply run the mount command in the pod's postStart hook and the bucket will be mounted into the pod:
gcsfuse -o nonempty --key-file /path/to/service/account/secret.json [bucket name] /etc/google/
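In Kubernetes terms, the hook is declared in the container's `lifecycle` section. A sketch, with the bucket name kept as a placeholder as above (note that FUSE mounts inside a container also require the pod to run with the appropriate privileges, e.g. the SYS_ADMIN capability):

```yaml
lifecycle:
  postStart:
    exec:
      command:
        - "/bin/sh"
        - "-c"
        - "gcsfuse -o nonempty --key-file /path/to/service/account/secret.json [bucket name] /etc/google/"
```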
and configure Spark to add the JARs to its classpath:
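One way to do this, assuming the bucket is mounted at `/etc/google/` in both the driver and executor images (the `jars/` subfolder is a placeholder), is via the standard `extraClassPath` properties:

```
"spark.driver.extraClassPath": "/etc/google/jars/*",
"spark.executor.extraClassPath": "/etc/google/jars/*"
```

Because the mount is just a file system from Spark's point of view, dropping a new JAR into the bucket makes it available to the next session without rebuilding the image.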
Conclusions and what’s next:
After making this progress, we ended up with a platform that is cheaper (compared to the original 24/7 VM solution) and reliable: every component is isolated, so an issue in a user's pod can only affect that pod, with no shared-resources environment where everyone affects everyone. It also requires almost zero maintenance of system components — each time we need to upgrade the system, we use pre-production image tags to launch and test the new images. Since the images are immutable, nothing can affect the packages unless it's an external issue outside of our control.
The next things that need to be handled:
- The Spark image is huge — even when pulling entirely inside our GCP network, pulling the Spark image takes time. That issue doesn't break anything, but it causes our users to wait a while (2–3 minutes) when launching a Spark job. Today we handle it with a job that spins up a minimum number of nodes at 7am, and we also set up a pre-pull DaemonSet that pulls the Spark image onto every K8s node before any job launches. Using this technique we decreased the spin-up time from minutes to seconds:
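The pre-pull trick relies on a DaemonSet referencing the Spark image, so the kubelet caches it on every node. A sketch of the pattern — the image names are placeholders:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: spark-image-prepull
spec:
  selector:
    matchLabels:
      name: spark-image-prepull
  template:
    metadata:
      labels:
        name: spark-image-prepull
    spec:
      initContainers:
        - name: prepull
          image: gcr.io/my-project/spark:latest   # placeholder Spark image
          command: ["sh", "-c", "true"]           # exit immediately; the pull is the point
      containers:
        - name: pause
          image: gcr.io/google-containers/pause:3.2  # tiny long-running container
```

When a user's Spark pod is later scheduled on any node, the image is already in the local cache, so start-up time drops to the container boot time.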
- Spark shuffle tracking is an experimental feature — we took a risk here: the Spark documentation marks this flag as experimental, and it might cause problems in the future. But because full dynamicAllocation support on K8s is on the Spark roadmap, we can adapt our solution in upcoming releases. So far, running in production, we haven't seen any major effect on stability or performance.
- SparkMagic — good, but not enough — the Spark kernels still have a lot of room to improve in terms of UX and stability. Sometimes we run into problems like being unable to cancel the execution of a cell, or the kernel status showing as Unknown.
Follow us — in the next articles I will share our scheduling solution using Composer, and how we developed a full CI/CD flow for Jupyter notebooks, from commit to production.
Thanks for reading,