Hi everyone. We'd been seeing random 502 errors for a long time but had trouble reproducing them. Last week, while load testing pieces of our API, the 502s started showing up reliably on POST requests. After a ton of reading and following a few GitHub issues, I came across two posts about GCP HTTP(S) load balancers and keepalives. Per the posts, we set our keepalive timeout to be greater than the load balancer's, and the problem is solved. I wanted to share this since many of you have helped me troubleshoot this and other issues at some point.
Specifically, this seems to be how GCP load balancers are configured. The two linked blog posts cover HTTP(S) load balancers, but we are using TCP load balancers, and increasing the keepalive timeout still worked for us, so the TCP load balancers appear to use the same keepalive as the HTTP(S) ones. Interesting things that happened because of this:
- No more random 502s
- API latencies are more consistent and far less spiky. We used to see spikes of 30s; now the largest is around 5s.
- CPU usage across all the pods has dropped and is less spiky
- Memory usage by the pods has been reduced by 50%
What we changed:
Changing the nginx-ingress keep-alive setting alone reduced the 502s but did not fix the issue. After applying the same keep-alive setting to our Node app, the problem was fully resolved.
In this screenshot, you can see the results of increasing the keepalives. The annotation on the success-rate and latency graphs marks when the keepalive fix was deployed.
The following blog posts are what finally pointed me in the right direction.