Attention Beige Party-goers!
In the past month there have been two incidents that I'm aware of where the web server was hanging and the site was unresponsive for several minutes. One of those seemed to resolve on its own eventually; the other I happened to notice as it was occurring, and it cleared up once I gave the web server a hard reboot. It's possible that this has happened more than twice and I just haven't noticed it.
Looking at the logs I'm not seeing anything specifically failing, so I think what's happening is that occasionally traffic spikes are overwhelming the system resources. If left alone, the server can eventually recover but it will remain unresponsive until it does.
We have grown a lot in the last two years so this isn't totally surprising. I have scaled up the web server from time to time in order to account for the processing demands of increased traffic, but I'm reaching the limits of what I can do with a single server.
So, what I'm planning to do is spin up an additional web server and put them both behind a load balancer. This has a few advantages. If one server is overtaxed then traffic can be shifted to the other. It also means that for simple server updates (ones that don't involve database schema changes), I can take one down to run the update while the other stays up. In other words, in most cases I won't have to take the entire site down to do server maintenance.
This all sounds great, but as usual I have no idea what I'm doing. So I'm going to move slowly to set things up in the background and make sure I'm 100% certain everything is going to work before I push it live. Until I have this new solution in place it's possible things could be a little bumpy. So far the server hanging has been an intermittent issue, which is why I want to take care of it now, before it becomes a bigger problem.
As always, thank you for bearing with me!
Beige-bless
I'm a Mastodon admin myself and am just planning another instance. Would you mind sharing your server specs and number of users?
@sebastian Currently I have a web server that's running Puma, Sidekiq, Redis, and Elasticsearch, and then a separate server running the database behind pgbouncer. The web server has 16 cores and 32 GB of RAM. I'm noticing that at times most of that RAM is getting used up. We've got about 650 users, with around 400 active within the last month. That's a lot for us, but not a huge amount in the grand scheme of things. I've decided that for the first step I'm going to move everything but the Puma processes to a different server, so that the web server will only be functioning as a web server. This might resolve the issue on its own, and if not it's still a good first step before introducing a load balancer. I'm also going to try tweaking my Puma process/thread counts to something that's not as taxing on memory, but still provides adequate response times.
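To make the process/thread trade-off a bit more concrete, here is a rough back-of-the-envelope sketch in Python. Only the 16 cores and 32 GB come from the specs above; the per-worker memory figure, the amount reserved for other services, and the thread count are placeholder assumptions, not measurements from this server. The function name `puma_budget` is made up for illustration. In Mastodon, the resulting numbers would typically be set through the `WEB_CONCURRENCY` (processes) and `MAX_THREADS` (threads per process) environment variables.

```python
def puma_budget(ram_gb, reserved_gb, per_worker_gb, threads_per_worker, cores):
    """Estimate how many Puma workers fit in memory, capped at the core count.

    Returns (workers, total_threads). All inputs are rough guesses that
    should be replaced with numbers measured on the real server.
    """
    usable_gb = ram_gb - reserved_gb              # RAM left over for Puma
    by_memory = int(usable_gb // per_worker_gb)   # workers that fit in that RAM
    workers = max(1, min(by_memory, cores))       # at least 1, at most one per core
    return workers, workers * threads_per_worker


# Hypothetical: Sidekiq, Redis, and Elasticsearch share the box and take ~20 GB.
print(puma_budget(32, 20, 1.0, 5, 16))  # (12, 60)

# Hypothetical: everything but Puma moved elsewhere, ~4 GB kept for the OS.
print(puma_budget(32, 4, 1.0, 5, 16))   # (16, 80)
```

With these made-up numbers, memory is the limiting factor while the other services share the machine, and the core count becomes the limit once they move off. The real per-worker footprint needs to be measured before trusting any of this.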