7 powerful VPS troubleshooting tips to minimize downtime

Maybe you happen to visit your website or service and find that nothing loads. Maybe you’ve gotten some angry emails from viewers or customers. Maybe your monitoring system (if you’re forward-thinking enough to have one) has fired off a text message with a dire warning.

Your server is down. However you learn about the downtime, and no matter what you’re doing with your virtual private server (VPS), your priority is clear: Get it back online. ASAP.

Think of downtime like a fire—it can happen to the most careful of us, and sometimes it happens for completely unexpected reasons. Remember fire drills from your middle school days? Having a plan outlined ahead of time minimizes panic and gets the job done fast.

In the same vein, you can develop a plan of attack for VPS troubleshooting. The better prepared you are in advance, the faster you’ll get back online. Here are seven commands to help you get there.

How long have you been ‘up’?

Whether you want to know exactly when your server crashed or want a brief look into your system load for potential CPU overloading, uptime is a great place to start.

$ uptime
 14:35:45 up 1 day, 18:41,  1 user,  load average: 0.04, 0.03, 0.05

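The command’s output tells you the system’s current time, how long it has been running, how many users are currently logged in, and what’s called the system load. The three numbers are the load averages for the previous 1, 5, and 15 minutes, respectively. Read them against your CPU core count: on a single-core VPS, a load average near or above 1 means processes are queuing for CPU time (or, on Linux, waiting on disk I/O), while on a four-core VPS the same pressure shows up around 4. The closer the load averages get to your core count, and especially once they climb past it, the more likely some process is overloading the CPU.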
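To make that comparison concrete, here is a minimal sketch that divides a load average by the number of cores. The helper name load_per_core is made up for this example, and the commented usage line assumes a Linux system where /proc/loadavg and nproc are available.

```shell
# Sketch: express a load average as a fraction of the available CPU cores.
# load_per_core is a hypothetical helper name, not a standard tool.
load_per_core() {
    # $1 = load average (e.g. 0.04), $2 = number of CPU cores
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Typical use on Linux: take the 1-minute figure from /proc/loadavg
# and the core count from nproc.
# load_per_core "$(cut -d" " -f1 /proc/loadavg)" "$(nproc)"
```

A result that stays near or above 1.00 suggests the CPU is saturated; a result well below that points you toward other culprits.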

Get the full (his)story

If you want to see a raw, chronological list of which commands have been run most recently, use history. Sometimes, by examining what we’ve done in the past, we can better understand why something might not be working now. Can you correlate downtime to a recent yum/apt update? Is there something unexpected there?
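If package updates are your main suspect, one way to narrow the list is to filter it. The sketch below is only an illustration: pkg_changes is a made-up helper name, and the pattern covers just a handful of common apt, yum, and dnf subcommands.

```shell
# Sketch: keep only the history lines that invoke a package manager.
# pkg_changes is a hypothetical helper; it reads history text on stdin.
pkg_changes() {
    grep -E '(^|[[:space:]])(apt|apt-get|yum|dnf)[[:space:]]+(install|remove|update|upgrade)([[:space:]]|$)'
}

# Typical interactive use. history is a shell builtin, so run this in the
# same shell session (and as the same user) you want to inspect:
# history | pkg_changes
```

Keep in mind that history is recorded per user, so commands run by another account or by an automated job will not show up here.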

Getting a handle on who

The w command tells you who is on the system and what they’re up to. This is perhaps most relevant if you have multiple people SSH-ing into your VPS, but could let you know if someone you don’t know is violating your space.

$ w
 14:34:59 up 1 day, 18:41,  1 user,  load average: 0.08, 0.03, 0.05
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
joel     pts/0    123.456.78.9     14:34    3.00s  0.01s  0.00s w

Here’s how to read the output:

  • USER – The user’s name.
  • TTY – The terminal the user is logged in on (for example, pts/0 for a remote session).
  • FROM – The IP or hostname from which the user accessed your VPS.
  • LOGIN@ – The time at which the user logged in.
  • IDLE – Their idle time.
  • JCPU – This refers to the time used by all the processes related to this terminal instance.
  • PCPU – This refers to the time used by only the current process that’s displayed in the WHAT field.
  • WHAT – The user’s current process from the command line.

You can also use last | head -n 10 to see the ten most recent logins; change the number to look further back.
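If you know which addresses you normally connect from, you can have the shell point out everyone else. This is a rough sketch, not a security tool: unexpected_sessions is a made-up helper name, 203.0.113.10 is a placeholder address, and it assumes the w -h column order (USER, TTY, FROM) shown above.

```shell
# Sketch: print sessions whose FROM address is not on an allow list.
# unexpected_sessions is a hypothetical helper; it reads `w -h` style lines
# (USER TTY FROM ...) on stdin and takes the allowed addresses as arguments.
unexpected_sessions() {
    allowed=" $* "
    while read -r user tty from rest; do
        [ -n "$user" ] || continue               # skip blank lines
        case "$allowed" in
            *" $from "*) ;;                      # known address, ignore
            *) printf '%s %s %s\n' "$user" "$tty" "$from" ;;
        esac
    done
}

# Typical use, with 203.0.113.10 standing in for your own address:
# w -h | unexpected_sessions 203.0.113.10
```

No output means every current session comes from an address you listed.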

Is your VPS running?

Knowing exactly what your VPS is running—if anything during this state of downtime—can be useful in diagnosing issues. Run ps auxf to see all the current processes. Because the output displays both CPU and memory, this could be an easy way to pinpoint a process that’s “gone rogue.”
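On a busy server the full process list can run to hundreds of lines, so it can help to sort it. Here is a small sketch; top_procs is a made-up helper name, and it assumes the usual ps aux column order, where column 3 is %CPU and column 4 is %MEM.

```shell
# Sketch: list the heaviest processes from `ps aux` output on stdin.
# top_procs is a hypothetical helper. $1 is the column to sort on
# (3 = %CPU, 4 = %MEM in `ps aux` output), $2 is how many rows to keep.
top_procs() {
    # Drop the header row, then sort numerically (descending) on that column.
    tail -n +2 | sort -k "$1,$1" -n -r | head -n "$2"
}

# ps aux | top_procs 4 5    # five biggest memory users
# ps aux | top_procs 3 5    # five biggest CPU users
```

Note that this uses ps aux rather than ps auxf, since the tree view from the f flag only makes sense in the original process order.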

“I’ll never use up all that disk space”

Famous last words, eh? We all tend to underestimate how much disk space we’ll use, and once you’ve lapped up the last megabytes, it’s not going to be pretty.

Use the df -h command to quickly visualize the amount of available space on every disk and partition. There are countless reasons why you might be low on space, from uploading too many gifs onto your WordPress blog to a massive apt cache, but this will give you a place to start.

Take this a step further by checking for available inodes with the df -hi variation.
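If you would rather not scan the table by eye, a short filter can flag only the filesystems that are nearly full. This is a sketch under a few assumptions: fs_over is a made-up helper name, the input comes from df -P (the POSIX output format, one filesystem per line), and mount points contain no spaces.

```shell
# Sketch: report filesystems at or above a usage threshold.
# fs_over is a hypothetical helper; it reads `df -P` output on stdin and
# takes the threshold percentage as $1.
fs_over() {
    awk -v limit="$1" 'NR > 1 {
        pct = $5; sub(/%/, "", pct)          # "87%" -> "87"
        if (pct + 0 >= limit + 0) print $6, $5
    }'
}

# df -P | fs_over 90     # disk space
# df -Pi | fs_over 90    # inodes (assumes GNU df, same column layout)
```

Each line of output is a mount point followed by its usage, which gives you a short list of places to start cleaning up.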

Killing is never good

If you are running out of RAM, you might start seeing messages in /var/log/messages about “killing” processes. (On Debian and Ubuntu, the equivalent file is usually /var/log/syslog.) Use grep to search for these messages, adding -i so that both “kill” and “Killed” match:

$ sudo grep -i kill /var/log/messages

If there’s any output, that’s one sign that your VPS is trying its very best to free up RAM by killing any processes it can.
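Once you know kills are happening, the next question is usually which program keeps getting hit. The sketch below tallies the victims; oom_victims is a made-up helper name, and it assumes the common kernel wording "Killed process <pid> (<name>)", which may differ on your kernel version.

```shell
# Sketch: tally which programs the kernel's OOM killer has terminated.
# oom_victims is a hypothetical helper; it reads log lines on stdin and
# assumes the kernel wording "Killed process <pid> (<name>)".
oom_victims() {
    sed -n 's/.*Killed process [0-9]* (\([^)]*\)).*/\1/p' | sort | uniq -c | sort -rn
}

# sudo grep -i 'killed process' /var/log/messages | oom_victims
```

The program at the top of the list is either the one leaking memory or simply the largest target, so treat it as a lead rather than a verdict.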

Looking down from on high

If you’re looking for a single command that gives you much of the same information as the ones above, check out top and its variant, htop. You’ll need to install the latter via your OS’s package manager, but top comes with almost all Linux installations.

$ top
top - 16:08:53 up 1 day, 20:15,  2 users,  load average: 0.00, 0.01, 0.05
Tasks:  25 total,   1 running,  24 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  8388608 total,  8220476 free,    44288 used,   123844 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  3024374 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    1 root       1 -19   42960   3316   2296 S   0.0  0.0   0:02.26 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kthreadd/210ddf
    3 root      20   0       0      0      0 S   0.0  0.0   0:00.00 khelper
   56 root       1 -19  223060  73908  73616 S   0.0  0.9   0:17.83 systemd-journal

htop offers another level of detail, and some little graphs that might be easier to understand than raw numbers. Either is useful for seeing tons of information about your VPS at a glance.
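top can also run non-interactively, which is handy when you want a number for a script or a log rather than a live screen. Here is a small sketch that extracts the idle-CPU figure; cpu_idle is a made-up helper name, and it assumes the "%Cpu(s)" summary layout shown in the sample output above, with "." as the decimal mark.

```shell
# Sketch: pull the idle-CPU percentage out of top's "%Cpu(s)" summary line.
# cpu_idle is a hypothetical helper; it reads top output on stdin.
cpu_idle() {
    sed -n '/^%Cpu/ s/.*[ ,]\([0-9.]*\) id.*/\1/p'
}

# top -b -n 1 | cpu_idle    # -b = batch mode, -n 1 = a single snapshot
```

An idle figure close to 100 alongside a slow site suggests the bottleneck is somewhere other than the CPU, such as memory, disk, or the network.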

Putting it all together for faster VPS troubleshooting

While all the above commands are useful in some way, you’re the best judge as to which are most relevant, and in which order you’d like to run them, based on your experience and particular application.
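One way to turn that judgment into a fire drill is to write your preferred order down as a script. The following is only a starting sketch: run_if_present and triage are made-up helper names, and the list of checks is just the commands from this article in the order they appeared.

```shell
# Sketch: run the checks from this article in one pass, skipping any
# tool that is not installed. run_if_present and triage are hypothetical
# helper names; reorder the list to suit your own application.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        printf '== %s ==\n' "$*"
        "$@"
    else
        printf 'skip: %s not installed\n' "$1"
    fi
}

triage() {
    run_if_present uptime
    run_if_present w
    run_if_present ps auxf
    run_if_present df -h
    run_if_present df -hi
}

# Save a timestamped copy while also printing to the terminal:
# triage 2>&1 | tee "triage-$(date +%Y%m%d-%H%M%S).log"
```

Keeping the output in a file means you can compare a bad day against a good one later, once the pressure is off.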

There are dozens of other similar commands, so think of this as a starting point. Once you’ve mastered these, you can start diving into dmesg, ss, sar and more. In fact, you can even see how Netflix investigates performance issues in their infrastructure for more ideas.

Of course, you can also get into monitoring systems so that you’ll know of downtime as soon as it happens, but that’s another post entirely. In the meantime, brush up on these 7 quick tips for understanding downtime and give yourself a leg up when the inevitable finally does happen again.