Recently, I got to work on building and running the web portals of a pan-India engineering entrance exam with about 150k candidates (I can't say which one, due to a non-disclosure agreement). Given the scale, and the system's history of failing under traffic surges, I decided to use Kubernetes on Google Cloud Platform (GKE). Despite some setbacks, I was able to pull it off well. More importantly, I learnt some valuable lessons that I think others will benefit from too. What I am sharing here is based on what I saw, read and hypothesized. I may be wrong; if so, please drop me an email.

Do you need Kubernetes?

The first question you should be asking when you want to try k8s for a project is whether you need it or not. Remember that, at the end of the day, everything is hosted on some machine somewhere in the world.

Kubernetes will not magically scale your system. Your application can still fail.

My reasons for using k8s in this project were threefold.

Even though I made some blunders in the first half of the job (I'll discuss them later), in the second half, the result portal (which had higher stakes) ran without a glitch. So, looking back, I think I made the right choice!

Keep an eye on the bills

What I felt while operating on GCP and AWS is that they love to hide your actual costs in tons of documentation. Even if you use the pricing calculator, it is far from accurate. The only way to not go bankrupt is to actively monitor your day-to-day costs and cut them down wherever you can. In this section, I'll put down some hard-earned chunks of wisdom.

Check what type of machine is set as the default. Generally you get dual-core machines with 4 GB RAM, which is enough for serving web applications unless you need more juice for compute-heavy tasks. To avoid failures, you'd want as many nodes as possible (horizontal scaling), or as powerful a node as possible (vertical scaling), or a bit of both. The tradeoff is that more and bigger machines mean higher costs.

If your cluster is set up in regional mode, rather than in zonal mode, the number of nodes you request will be multiplied by the number of zones in the cluster.

For example, the Mumbai region (asia-south1) has 3 zones (asia-south1-a, asia-south1-b and asia-south1-c). So if you request 5 nodes, you'll get 15 in total. This might come as a surprise the next day, when you're billed 3x what you expected.

The machines are not billed as a unit; rather, CPU, RAM and disk are billed separately.

However, if you have a poorly configured frontend, the real killer will be your Network Egress costs (the amount of data GCP sends from your clusters to your users).

Here's a story:

Our first service to go online was the information website for the exam. It was previously hosted on a single machine. As the exam days approached, more and more people hit the site every second and it became unresponsive.

I put the website into a cluster, spun up a sizeable number of pods behind a LoadBalancer service, and everything was back to being smooth. But the next day, the egress cost came out to a couple thousand rupees!

So, what should you do to reduce egress costs? Here are some pointers:

  1. Compress your responses (gzip or Brotli) so fewer bytes leave the cluster.
  2. Set long Cache-Control headers on static assets so browsers and CDNs don't re-download them on every visit.
  3. Serve heavy static content from a CDN instead of directly from your pods.
  4. Trim your frontend bundles; every unnecessary kilobyte is shipped to every single user.

Understand the networking

Understanding how networking is done among data centres was an eye-opener for me. Although I understand very little of it in depth, I'd still like to write something about it, mostly in layman's terms.

Let's consider that the machines owned by GCP form the "GCP world" and everything else (including your users) form the "outside world". Now, data can flow within and among these worlds in three ways:

  1. Egress: GCP world sends data to the outside world.
  2. Ingress: Outside world sends data to the GCP world.
  3. Data flows among the machines within GCP world.

Let's assume that your cluster is located in India and someone from Brazil wants to connect to your page. How will the data move? One way is that the machine sending the data would offload the packets to an external Service Provider as soon as possible and data reaches to the user via the public internet. This is called hot potato routing. The other way is that GCP uses its own network to send your cluster's data to a machine in Brazil. This machine then sends the data to the user over the public internet. This is called cold potato routing.

Yes, Google is an Autonomous System like other big internet companies and has its own global network.

GKE defaults to the Premium tier of networking, which uses cold potato routing. The Standard tier uses hot potato routing and is cheaper: the price difference is around 0.03 USD / GiB, so it isn't much.

Anyway here are some more references regarding routing and network tiers: [1] [2]

Add Resource Limits to your Deployment YAMLs

Anybody writing k8s YAMLs must have noticed sections like this:

        resources:
          requests:
            memory: "64Mi"
            cpu: "250m"
          limits:
            memory: "128Mi"
            cpu: "500m"

Don't leave out the limits section, for god's sake! If you omit it, the pods will spin up fine and show no problems at all during development. In production, however, you will see pods crash and restart even when there appears to be enough resource left on the node.

If you don't provide limits, Kubernetes only guarantees your request values and will happily overcommit the node. When the node runs short of memory, containers using more than their request are the first to be OOM-killed or evicted, so the pod crashes and restarts.

This thing really bites in production.

Read the docs to know more.

Horizontal Scaling is not a panacea

Time for another story:

The second service to go online was a document download portal. Now the job is simple: Get Login details → Authenticate against the database → Fetch / generate the document → Send it to the user.

Here I was dealing with a legacy PHP system that generated PDF documents on the fly. In previous years this never caused a problem. A critical difference was that previous years' teams used a few high-powered machines (no k8s), whereas we used a lot of cheap, low-powered machines (k8s). My thinking was that, given enough machines, we should not fail at scale. But since there were no clean APIs that we could load test properly (short of browser-level automation), we couldn't determine how many machines were enough.

From last year's statistics, we believed the database would be a pressure point (the document generation involved a triple-join query). So I didn't trust my k8s setup with the database and used GCP's Cloud SQL to host it.

The end result: within 45 minutes of launch, the system went down, and it took hours of overnight debugging to fix. I saw groups of pods oscillating between momentary uptime and long downtime. The database never received enough connections to push its CPU usage above 20%.

The solution was to use a very high-end machine to precompute all 150k documents, serve them via a CDN (backed by Google Cloud Storage), and use our portal only to authenticate users. I learnt a very valuable lesson that day:

What can be precomputed must be precomputed.

Push your data as far towards the edge of your network as possible. CDNs are a big part of why the internet hasn't collapsed under today's traffic volumes.
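The precompute job itself parallelises trivially on one big machine. Here's a rough sketch (the generate function is a stand-in for the real PDF-rendering step, and the IDs are made up) of a worker pool that churns through every candidate:

```go
package main

import (
	"fmt"
	"sync"
)

// generate stands in for the expensive PDF-rendering step; in a real
// pipeline its output would be uploaded to object storage behind a CDN.
func generate(candidateID int) string {
	return fmt.Sprintf("document-for-%d.pdf", candidateID)
}

// precomputeAll renders documents for all IDs using `workers` goroutines,
// keeping every core of the big machine busy.
func precomputeAll(ids []int, workers int) []string {
	jobs := make(chan int)
	out := make([]string, len(ids))
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				out[i] = generate(ids[i]) // each index written by exactly one worker
			}
		}()
	}
	for i := range ids {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return out
}

func main() {
	ids := []int{101, 102, 103} // hypothetical candidate IDs
	for _, doc := range precomputeAll(ids, 2) {
		fmt.Println(doc)
	}
}
```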

To this day, I am not 100% sure what caused the downtime. But I have a pretty good hypothesis:

The document generator consumes a lot of CPU and time to produce each PDF. Momentary 100% CPU usage is not a bad thing. The problem comes when you are serving 1500 requests per second: then the CPU stays at 100% persistently.

Pod allocation to nodes is nothing but a bin-packing problem. If the object to pack is larger than every bin, there is no way it can be packed. In the same way, the document generator would have needed 3 or more CPUs per pod to stay healthy. We used dual-core nodes, so that was impossible, and the pods failed. Once k8s restarted a pod, it remained healthy for a few seconds until traffic once again drove it to 100% CPU usage and it crashed.
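The bin-packing intuition boils down to a one-liner: if a pod's CPU requirement exceeds every node's allocatable CPU, no amount of horizontal scaling helps. A toy sketch (the 2000m figure for a dual-core node is approximate; the system actually reserves a slice of it):

```go
package main

import "fmt"

// schedulable reports whether a pod needing podMillis of CPU can fit on
// a node with nodeMillis of allocatable CPU. Adding more nodes of the
// same size never helps once podMillis exceeds every node's capacity:
// scheduling is bin-packing, and no bin is big enough.
func schedulable(podMillis, nodeMillis int) bool {
	return podMillis <= nodeMillis
}

func main() {
	dualCore := 2000 // ~2 vCPUs worth of millicores
	fmt.Println(schedulable(500, dualCore))  // true: a small web pod fits
	fmt.Println(schedulable(3000, dualCore)) // false: no number of 2-CPU nodes can host a 3-CPU pod
}
```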

What does this show? Unless resource disaggregation schemes (like the MIND project) become mainstream, you can't do better than a single node for CPU-intensive tasks.

Know your Breakpoints beforehand

Write your code in such a way that you can easily load test it. Load testing doesn't mean creating 1000 processes and bombarding your service with continuous requests; the requests should mimic human behaviour (a Poisson arrival pattern, for example). Apache JMeter is a battle-tested tool for this, although many cloud testing providers are available now. First, run your code locally and create a test plan by running JMeter locally. Then crank up the request rate to your heart's content and run JMeter in Kubernetes using something like this. A rule of thumb is to expect 10x the usual traffic and test up to 50x.

Here's the final story of this post:

To avoid repeating the blunder of the document portal, I rewrote the result portal following JAMstack principles: a clean REST API using Go's Gin framework that could be load tested, and a VueJS frontend that could be cached fully. The final load test showed that, with 15 pods, I could go as high as 7000 requests per second. I expected around 1500-2000 requests per second in real life; the rate never crossed the 800-900 mark. The result portal ran smoothly, with almost zero downtime.

A few words on Databases

The fact that Kubernetes can't scale your system by magic becomes apparent when you start dealing with databases.

The primary difference between a (stateless) application service and a database is that the application service doesn't have to worry about data consistency, whereas for databases you have to think in terms of the CAP theorem.

GKE does provide persistent disks, but two or more pods can't use the same persistent disk. Nor can you run some kind of consensus among the disks belonging to the same type of pod to get consistency automatically. Hence, the onus of data consistency falls on the application being run. If you develop your application on, say, SQLite, you can't expect that tomorrow, when you scale up, you'll get a consistent distributed SQLite database.

That being said, you should not overlook the power of high capacity single machine databases.

However, if I am in a scenario where I really need a distributed database, here's what I will consider:

  1. Is the data small and read-only? If the data is read-only and fits comfortably in the RAM allocated to my pod, I probably don't need a database at all; an in-memory key-value store does the job better (this is what we used in the result portal). Even if I did use a database, a memcached in front of it would end up caching all of this data anyway, so there's really no point.

  2. Is the data large, with a negligible write-to-read ratio? A master-slave architecture works best here: create a master database with read/write capability, then add read replicas. The replica machines need not be very powerful, since you can deploy a lot of them and, with so few writes, updates spread without much delay. Example: a MySQL StatefulSet as described here.

  3. Is the data large, with a high number of both reads and writes? Either use master-slave with a powerful master machine, or use a consensus mechanism (a State Machine Replication system).
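For case 2, the application still has to route queries to the right place: writes to the master, reads spread across the replicas. A bare-bones round-robin sketch (the host names are hypothetical StatefulSet pod addresses; real code would hold database/sql connection pools rather than strings):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// replicaSet routes writes to a single primary and spreads reads across
// replicas round-robin.
type replicaSet struct {
	primary  string
	replicas []string
	next     uint64
}

// writer always returns the primary, the only member that accepts writes.
func (s *replicaSet) writer() string { return s.primary }

// reader picks the next replica in round-robin order, falling back to
// the primary when no replicas exist.
func (s *replicaSet) reader() string {
	if len(s.replicas) == 0 {
		return s.primary
	}
	n := atomic.AddUint64(&s.next, 1)
	return s.replicas[(n-1)%uint64(len(s.replicas))]
}

func main() {
	s := &replicaSet{
		primary:  "mysql-0.mysql", // hypothetical StatefulSet pod DNS names
		replicas: []string{"mysql-read-1", "mysql-read-2"},
	}
	fmt.Println(s.writer()) // mysql-0.mysql
	fmt.Println(s.reader()) // mysql-read-1
	fmt.Println(s.reader()) // mysql-read-2
	fmt.Println(s.reader()) // mysql-read-1
}
```

The atomic counter keeps the selection safe when many request handlers read concurrently.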


I have tried to compile here all the lessons this experience with Kubernetes taught me, including the myths I believed that were broken over the course of running the system. I am sure I have missed something important or written something wrong; please feel free to correct me.

One last thing: you really don't sleep well when your system is out there running in the wild 😥.