7 powerful VPS troubleshooting tips to minimize downtime
Maybe you happen to visit your website or service and find that nothing loads. Maybe you’ve gotten some angry emails from viewers or customers. Maybe your monitoring system (if you’re forward-thinking enough to have one) has fired off a text message with a dire warning.
Your server is down. However you learn about the downtime, and no matter what you’re doing with your virtual private server (VPS), your priority is clear: Get it back online. ASAP.
Think of downtime like a fire—it can happen to the most careful of us, and sometimes it happens for completely unexpected reasons. Remember fire drills from your middle school days? Having a plan outlined ahead of time minimizes panic and gets the job done fast.
In the same vein, you can develop a plan of attack for VPS troubleshooting. The better prepared you are in advance, the faster you’ll get back online, minimizing downtime. Here are just over a half-dozen ways back online.
Whether you want to know exactly when your server crashed or want a brief look at your system load for potential CPU overloading, uptime is a great place to start.
$ uptime
 14:35:45 up 1 day, 18:41, 1 user, load average: 0.04, 0.03, 0.05
The command’s output shows the system’s current time, how long it has been running, how many users are currently logged in, and what’s called the system load. The three numbers are the load averages for the previous 1, 5, and 15 minutes, respectively. On a single-core VPS, the closer these averages get to 1 (or the further they climb above it), the more likely some process is overloading the CPU. On a multi-core server, compare them against the number of cores instead.
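To make that comparison less of a judgment call, here is a minimal sketch that checks the 1-minute load average against the core count. It assumes a Linux system with /proc/loadavg and the nproc utility; the function names load_status and check_load are made up for this example, and treating "load at or above the core count" as the warning line is a rule of thumb, not a hard limit.

```shell
# load_status LOAD CORES
# Prints "high" when LOAD is at or above CORES, "ok" otherwise.
# The "at or above the core count" cutoff is an illustrative assumption.
load_status() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        if (load + 0 >= cores + 0) print "high"; else print "ok"
    }'
}

# check_load
# Compares the 1-minute load average with the number of cores.
# Reads /proc/loadavg (Linux only) and calls nproc (GNU coreutils).
check_load() {
    read -r one_min _ < /proc/loadavg
    cores=$(nproc)
    echo "1-min load $one_min on $cores core(s): $(load_status "$one_min" "$cores")"
}
```

After pasting the functions into your shell, run check_load for a one-line verdict, or call load_status directly with numbers copied from uptime.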
If you want to see a raw, chronological list of the commands that have been run most recently, use history. Sometimes, by examining what we’ve done in the past, we can better understand why something might not be working now. Can you correlate the downtime with a recent apt update? Is there something unexpected there?
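If the history is long, it can help to pull out only the package-manager commands. The following is a rough sketch, assuming Bash's default history file at ~/.bash_history; the function name recent_pkg_commands and the list of package managers it looks for are choices made for this example, not anything standard.

```shell
# recent_pkg_commands [HISTORY_FILE]
# Prints numbered history lines that run a package manager.
# Defaults to ~/.bash_history, which is where Bash saves history by default.
recent_pkg_commands() {
    histfile="${1:-$HOME/.bash_history}"
    grep -nE '(^|[[:space:]])(apt|apt-get|yum|dnf)[[:space:]]+(update|upgrade|install|remove)' "$histfile"
}
```

Keep in mind that Bash normally writes the history file when a session ends, so commands from a shell that is still open may not show up here yet.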
The w command tells you who is on the system and what they’re up to. This is perhaps most relevant if you have multiple people SSH-ing into your VPS, but it could also let you know if someone you don’t know is violating your space.
$ w
 14:34:59 up 1 day, 18:41,  1 user,  load average: 0.08, 0.03, 0.05
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
joel     pts/0    123.456.78.9     14:34    3.00s  0.01s  0.00s w
Here’s how to read the output:
USER – The user’s name.
TTY – The terminal the user is logged in on.
FROM – The IP or hostname from which the user accessed your VPS.
LOGIN@ – The time at which the user logged in.
IDLE – Their idle time.
JCPU – The time used by all the processes related to this terminal instance.
PCPU – The time used by only the current process, which is the one displayed in the WHAT column.
WHAT – The user’s current process from the command line.
You can also use last | head to see a list of the most recent logins.
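If more than a couple of people log in, spotting a stranger by eye gets harder. Here is a small sketch that compares logged-in users against a list you trust. It reads the output of w -h (the -h flag drops the header lines) and only looks at the first column; the function name unknown_users and the user name deploy in the usage line are hypothetical.

```shell
# unknown_users ALLOWED_USER...
# Reads `w -h` style lines on stdin (user name in the first column) and
# prints each logged-in user that is NOT in the allow list, once each.
unknown_users() {
    allowed=" $* "
    awk -v allowed="$allowed" '
        NF > 0 && index(allowed, " " $1 " ") == 0 && !seen[$1]++ { print $1 }
    '
}
```

For example, w -h | unknown_users joel deploy prints nothing when only joel and deploy are logged in, and prints any other user name it finds.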
Knowing exactly what your VPS is running (if anything) during this state of downtime can be useful in diagnosing issues. Run ps auxf to see all the current processes. Because the output displays both CPU and memory usage, this could be an easy way to pinpoint a process that’s “gone rogue.”
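On a busy server the process list can run to hundreds of lines, so a filter helps. This sketch keeps only the rows whose %CPU or %MEM column is at or above a threshold. It assumes the standard ps aux column order (USER, PID, %CPU, %MEM, ...); the function name busy_processes and the default threshold of 50 are arbitrary choices for illustration.

```shell
# busy_processes [CPU_THRESHOLD] [MEM_THRESHOLD]
# Reads `ps aux` output on stdin and prints PID, %CPU, %MEM and the command
# name for rows at or above either threshold. Defaults of 50 are arbitrary.
busy_processes() {
    awk -v cpu="${1:-50}" -v mem="${2:-50}" '
        NR > 1 && ($3 + 0 >= cpu + 0 || $4 + 0 >= mem + 0) {
            print $2, $3, $4, $11
        }
    '
}
```

Try it with ps aux | busy_processes 50 50. Plain ps aux works better here than ps auxf, because the tree drawing in the f output can end up in the command column.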
Famous last words, eh? We all tend to underestimate how much disk space we’ll use, and once you’ve lapped up the last megabytes, it’s not going to be pretty.
Use the df -h command to quickly see the amount of available space on every disk and partition. There are countless reasons why you might be low on space, from uploading too many GIFs to your WordPress blog to a massive apt cache, but this will give you a place to start.
Take this a step further by checking for available inodes with the df -hi variation.
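To turn either report into a quick pass/fail check, you could filter on the percentage column. The sketch below assumes the POSIX output format from df -P, where the fifth column is the percentage used and the sixth is the mount point. The function name full_filesystems and the 90 percent default are made up for this example, and mount points containing spaces are not handled.

```shell
# full_filesystems [THRESHOLD_PERCENT]
# Reads `df -P` (or `df -Pi`) output on stdin and prints the mount point
# and usage of every filesystem at or above the threshold (default 90).
full_filesystems() {
    awk -v limit="${1:-90}" '
        NR > 1 {
            pct = $5
            sub(/%$/, "", pct)
            # Skip rows where the percentage is "-" or otherwise non-numeric.
            if (pct ~ /^[0-9]+$/ && pct + 0 >= limit + 0) print $6, pct "%"
        }
    '
}
```

Run df -P | full_filesystems 90 for disk space and df -Pi | full_filesystems 90 for inodes. No output means nothing has crossed the threshold.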
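If you are running out of RAM, you might start seeing messages in /var/log/messages related to “killing” processes. (On Ubuntu and some other distributions, the same kernel messages typically land in /var/log/syslog or /var/log/kern.log instead.) Use grep to search for these messages, adding -i so that “Kill” and “Killed” match too:
$ sudo grep -i kill /var/log/messages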
If there’s any output, that’s one sign that your VPS is trying its very best to free up RAM by killing any processes it can.
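A plain search for "kill" can also match unrelated lines, so a slightly narrower pattern may be worth having on hand. This is a sketch only: the three phrases below are ones the Linux kernel commonly logs when its out-of-memory killer runs, but exact wording varies between kernel versions, and the function name oom_events is made up for this example.

```shell
# oom_events [LOG_FILE...]
# Prints log lines that look like out-of-memory killer activity.
# With no file arguments it reads stdin, so it can sit at the end of a pipe.
oom_events() {
    grep -hiE 'out of memory|oom-killer|killed process' "$@"
}
```

Because sudo cannot call a shell function directly, pipe the log in instead: sudo cat /var/log/messages | oom_events. The kernel ring buffer works too, with dmesg | oom_events.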
If you’re looking for a single command that can give you much of the same information you’d find in the many above, check out top and its variant htop. You’ll need to install the latter via your OS’s package manager, but top comes with almost all Linux installations.
$ top
top - 16:08:53 up 1 day, 20:15, 2 users, load average: 0.00, 0.01, 0.05
Tasks: 25 total, 1 running, 24 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 8388608 total, 8220476 free, 44288 used, 123844 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 3024374 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
    1 root       1 -19   42960   3316   2296 S  0.0  0.0   0:02.26 systemd
    2 root      20   0       0      0      0 S  0.0  0.0   0:00.00 kthreadd/210ddf
    3 root      20   0       0      0      0 S  0.0  0.0   0:00.00 khelper
   56 root       1 -19  223060  73908  73616 S  0.0  0.9   0:17.83 systemd-journal
htop offers another level of detail, and some little graphs that might be easier to understand than raw numbers. Either is useful for seeing tons of information about your VPS at a glance.
While all the above commands are useful in some way, you’re the best judge as to which are most relevant, and in which order you’d like to run them, based on your experience and particular application.
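Whichever order you settle on, it may be worth capturing the output in one file so you can review it after the server is back up. Here is one possible sketch: a small helper that runs each command you give it and appends the result under a labeled heading. The function name snapshot and the file path in the usage line are hypothetical, and the command list is only a suggestion.

```shell
# snapshot OUTFILE COMMAND...
# Runs each COMMAND string with sh -c and appends its output to OUTFILE
# under a "== command ==" heading. A failing or missing command is noted
# in the file rather than stopping the run.
snapshot() {
    outfile="$1"
    shift
    for cmd in "$@"; do
        {
            echo "== $cmd =="
            sh -c "$cmd" 2>&1 || echo "(command failed or is unavailable)"
            echo
        } >> "$outfile"
    done
}
```

A typical call might look like snapshot /tmp/vps-snapshot.txt uptime w 'last | head' 'ps auxf' 'df -h' 'df -hi' 'top -b -n 1 | head -n 20'. The -b and -n 1 flags tell top to print a single batch-mode report instead of opening its interactive screen.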
There are dozens of other similar commands, so think of this as a starting point. Once you’ve mastered these, you can start diving into sar and more. In fact, you can even see how Netflix investigates performance issues in their infrastructure for more ideas.
Of course, you can also get into monitoring systems so that you’ll know of downtime as soon as it happens, but that’s another post entirely. In the meantime, brush up on these 7 quick tips for understanding downtime and give yourself a leg up when the inevitable finally does happen again.