Some Intricacies of NGINX Caching
Contrary to intuition, NGINX apparently caches 502 errors.
Following the Guide to Caching with NGINX, I set up an NGINX cache in front of a WordPress installation. The default setup
http {
    proxy_cache_path /dev/shm/nginx_cache
                     levels=1:2
                     keys_zone=nginx_cache:10m
                     max_size=100m
                     inactive=1h
                     use_temp_path=off;

    server {
        listen 80;
        server_name domain.tld;
        allow all;

        location / {
            proxy_cache nginx_cache;

            # Override origin cache control
            #
            proxy_ignore_headers Cache-Control;
            proxy_cache_valid any 1h;

            # Also cache responses that set a cookie
            #
            proxy_ignore_headers Set-Cookie;

            add_header X-Cache-Status $upstream_cache_status;
            proxy_pass http://ORIGIN/;

            # Rewrite headers returned by the proxied resource
            #
            proxy_redirect default;
        }
    }
}
worked quite nicely, speeding up page loads by a factor of ten while displaying updates with a delay of at most an hour, and requiring no plugins or caching-related changes on the WordPress site.
Soon after that went live, visitors reported a "502 Bad Gateway" error on the homepage, and on the homepage only. When I investigated, WordPress itself was up and running. What had happened?
I experimented a little with the proxy and cache configuration, and with taking a proxied upstream resource offline. The behavior appears to be the following:
- With proxy, cache and upstream resource running, the setup works as expected: the proxy fetches objects from upstream the first time, serves them from cache for the "valid" duration, and re-fetches them after they have expired.
- If the upstream resource is down and there are valid objects in the cache, the proxy serves them from the cache.
- If the upstream resource is down and there are no valid objects in the cache, the proxy returns "502 Bad Gateway".
- However, if a previously unavailable upstream resource becomes available again, the proxy only attempts to re-fetch it after the "valid" duration has passed, and keeps returning "502 Bad Gateway" in the meantime, even though the upstream resource is actually alive and serving at the time of the request.
- For the site in question, this resulted in the strange impression that some pages caused a "502 Bad Gateway" (the uncached ones requested while WordPress was down) while others didn't (those that either still had a valid cache entry, or were first requested when WordPress was available again).
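In hindsight, this behavior follows from the proxy_cache_valid any 1h line in the configuration above: "any" covers error responses as well, so a 502 produced while WordPress was down is cached for the full hour like any other response. A possible mitigation — a sketch I have not run in production — is to list the cacheable status codes explicitly and keep everything else only briefly:

```nginx
# Cache successful and redirect responses for an hour, but keep
# all other responses (including 502 errors) for one minute at most.
proxy_cache_valid 200 301 302 1h;
proxy_cache_valid any 1m;
```

This does not make the proxy re-fetch immediately once the upstream recovers, but it shortens the window during which a cached error is served.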
Now, with the problem identified as an incorrect, sort-of-cached assumption by NGINX that an upstream resource was down, the solution would be to invalidate or purge the cache. That, however, turns out to be surprisingly difficult. Without NGINX Plus features (which are not available in the open source version), the only way to force NGINX to re-fetch objects from the upstream resource before they expire seems to be a full shutdown and restart. Reloading a changed config, or even wiping the cache directory while NGINX is running, does not suffice.
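For reference, the cache files themselves are easy to locate: NGINX names each file after the MD5 hash of its cache key (by default $scheme$proxy_host$request_uri), and with levels=1:2 nests it in subdirectories derived from the last characters of that hash. A small sketch, assuming the default cache key and the cache path from the config above:

```shell
# Locate the cache file for a given URL, assuming the default
# proxy_cache_key of $scheme$proxy_host$request_uri.
key="http://ORIGIN/"
hash=$(printf '%s' "$key" | md5sum | awk '{print $1}')

# With levels=1:2, the layout is <last char>/<preceding two chars>/<hash>.
last=$(printf '%s' "$hash" | cut -c32)
mid=$(printf '%s' "$hash" | cut -c30-31)
echo "/dev/shm/nginx_cache/$last/$mid/$hash"
```

As noted above, deleting such a file while NGINX is running did not force a re-fetch in my tests; the in-memory keys zone apparently still considers the entry valid.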
Intuitively, I had assumed that NGINX would re-fetch a failed object as soon as possible, rather than give the impression that an upstream resource was down when it actually is not. It does not seem to work that way.
In the setup in question, the issue went away when I decided to use NGINX not only as an accelerator but also as a failover cache, following the "Delivering Cached Content When the Origin is Down" section of the guide: I changed
inactive=365d
and adding
proxy_cache_use_stale error timeout http_500 http_502 http_503 http_504;
Now NGINX serves stale content from the cache whenever the upstream resource is down, making it unlikely that an end user ever hits a "502 Bad Gateway" error.
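Put together — and leaving the rest of the configuration above unchanged — the failover variant of the relevant parts looks like this:

```nginx
proxy_cache_path /dev/shm/nginx_cache
                 levels=1:2
                 keys_zone=nginx_cache:10m
                 max_size=100m
                 inactive=365d
                 use_temp_path=off;

location / {
    proxy_cache nginx_cache;
    proxy_ignore_headers Cache-Control;
    proxy_ignore_headers Set-Cookie;
    proxy_cache_valid any 1h;

    # Serve stale cached content when the upstream errors out or times out
    proxy_cache_use_stale error timeout http_500 http_502 http_503 http_504;

    add_header X-Cache-Status $upstream_cache_status;
    proxy_pass http://ORIGIN/;
    proxy_redirect default;
}
```

The long inactive period matters here: with the original inactive=1h, NGINX would evict untouched entries after an hour and have nothing stale left to serve during an outage.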
I'd still be interested in how to properly get the NGINX cache to update unavailable objects as soon as possible. Got a hint? Poke me on Twitter.