How to fix the disk usage warning when /home partition or /home/nutanix directory is full

September 7, 2020 at 12:11 pm

Source: https://portal.nutanix.com/page/documents/kbs/details?targetId=kA0600000008dpDCAQ

Summary:

This article describes ways to safely free up space if /home or /home/nutanix becomes full or does not contain enough space to facilitate an AOS upgrade or PCVM upgrade.

Versions affected:

ALL Prism Central Versions, ALL AOS Versions

Troubleshooting, Upgrade

Description:

WARNING: DO NOT treat the Nutanix CVM (Controller VM) or PCVM as a normal Linux machine. DO NOT perform “rm -rf /home” on any of the CVMs or PCVM. It could lead to data loss scenarios. Contact Nutanix Support in case you have any doubts.

This condition can be reported in two scenarios:

  • The NCC health check disk_usage_check reports that /home partition usage is above a certain threshold (75% by default).
  • The pre-upgrade check test_nutanix_partition_space verifies that every node has at least 5.6 GB of free space in the /home/nutanix directory before an upgrade is performed.
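
As a rough illustration of the two conditions above, the sketch below classifies a single "df -P -k" data row against the 75% usage threshold and the 5.6 GB free-space minimum. The function name check_home_space is hypothetical and is not part of NCC or the pre-upgrade checks, which may compute these values differently. It also assumes that "GB" means 1024^3 bytes.

```shell
#!/bin/sh
# Sketch only: thresholds are taken from the article (75% usage, 5.6 GB free).
# check_home_space is a hypothetical helper, not a Nutanix tool.
# Input on stdin: output of "df -P -k <path>", whose data row has the columns
#   Filesystem 1024-blocks Used Available Capacity Mounted-on
check_home_space() {
    awk -v max_pct=75 -v min_free_gb=5.6 '
        $5 ~ /%$/ {                          # data row only; the header has no "%"
            pct = $5; sub(/%$/, "", pct)
            free_gb = $4 / 1024 / 1024       # KB -> GB, assuming 1 GB = 1024^3 bytes
            usage_high = (pct + 0 > max_pct)
            free_low = (free_gb < min_free_gb)
            if (usage_high && free_low) status = "WARN_BOTH"
            else if (usage_high)        status = "WARN_USAGE"
            else if (free_low)          status = "WARN_FREE"
            else                        status = "OK"
            printf "%s used=%s%% free=%.1fGB\n", status, pct, free_gb
        }'
}
```

On a CVM, this could be fed with: df -P -k /home/nutanix | check_home_space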

The following error messages will be generated in Prism by the test_nutanix_partition_space pre-upgrade check:

Not enough space on /home/nutanix directory on Controller VM [ip]. Available = x GB : Expected = x GB
Failed to calculate minimum space required
Failed to get disk usage for cvm [ip], most likely because of failure to ssh into cvm
Unexpected output from df on Controller VM [ip]. Please refer to preupgrade.out for further information

Nutanix reserves space on the SSD-tier of each CVM for its infrastructure. These files and directories are located in the /home folder that you see when you log in to a CVM. The size of the /home folder is capped at 40 GB so that the majority of the space on SSD is available for user data.

Due to the limited size of the /home partition, it is possible for it to run low on free space and trigger Prism Alerts, NCC Health Check failures or warnings, or Pre-Upgrade Check failures. These guardrails exist to prevent /home from becoming completely full, as this causes data processing services like Stargate to become unresponsive. Clusters in which multiple CVMs have a 100% full /home partition will often experience downtime for user VMs.

The Scavenger service running on each CVM is responsible for the automated clean-up of old logs in /home and improvements to its scope were made in AOS 5.5.9, 5.10.1, and later releases. For customers running earlier AOS releases, or in special circumstances, it may be necessary to manually clean up files out of certain directories in order to bring space usage in /home down to a level that will allow future AOS upgrades.

When cleaning up unused binaries and old logs on a CVM, it is important to note that all the user data partitions on each drive associated with a given node are also mounted within /home. This is why we strongly advise against using undocumented commands like “rm -rf /home”, since this will also wipe the user data directories mounted within this path. The purpose of this article is to guide you through identifying the files that are causing the CVM to run low on free space and removing only those which can be safely deleted.

Solution:

WARNING: DO NOT treat the Nutanix CVM (Controller VM) as a normal Linux machine. DO NOT perform “rm -rf /home” on any of the CVMs. It could lead to data loss scenarios. Contact Nutanix Support in case you have any doubts.

Step 1: Parsing the space usage for “/home”.

Log in to a CVM, download KB-1540_clean_v7.sh to the /home/nutanix/tmp directory, make it executable, and run it.

KB-1540_clean_v7.sh runs some checks (MD5, compatibility, etc.) and then deploys the cleanup script (nutanix_home_clean.sh, used later in this step) accordingly.

nutanix@cvm:~$ cd ~/tmp
nutanix@cvm:~/tmp$ wget http://download.nutanix.com/kbattachments/1540/KB-1540_clean_v7.sh
nutanix@cvm:~/tmp$ mv KB-1540_clean_v7.sh KB-1540_clean.sh
nutanix@cvm:~/tmp$ chmod +x KB-1540_clean.sh
nutanix@cvm:~/tmp$ ./KB-1540_clean.sh

You can choose to deploy the script to the local CVM only or to all CVMs in the cluster.

========
Select package to deploy
     1 : Deploy the tool only to the local CVM
     2 : Deploy the tool to all of the CVMs in the cluster
    Selection (Cancel="c"):

Run the deployed script to get a clear breakdown of space usage in /home.

nutanix@cvm:~/tmp$ ./nutanix_home_clean.sh

Step 2: Check for files that can be deleted from within the list of approved directories.

PLEASE READ: The following are the ONLY directories within which it is safe to remove files. Take note of the specific guidance for removing files from each directory. Do not use any other commands or scripts to remove files. Do not use “rm -rf” under any circumstances.

  1. Removing Old Logs and Core Files

    Before removing old logs, check whether you have any open cases with pending RCAs (Root Cause Analysis). The existing logs might be needed to resolve those cases, so check with the case owner at Nutanix Support before cleaning up /home. Only delete the files inside these directories. Do not delete the directories themselves.
    • /home/nutanix/data/cores/
    • /home/nutanix/data/binary_logs/
    • /home/nutanix/data/ncc/installer/
    • /home/nutanix/data/log_collector/
    Use this syntax for deleting files within each of these directories:

    nutanix@cvm:~$ rm /home/nutanix/data/cores/*

  2. Removing Old ISOs and Software Binaries

    Begin by confirming the version of AOS that is currently installed on your cluster by running the command below. You will find it in the “Cluster Version” field of the output. Never remove any files that are associated with your current AOS version.

    nutanix@cvm:~$ ncli cluster info

    Example output:

    Cluster Name              : Axxxxa
    Cluster Version           : 5.10.2

    Only delete the files inside these directories. Do not delete the directories themselves.
    • /home/nutanix/software_uncompressed/ – Delete any old versions other than the version you are currently upgrading to. The software_uncompressed folder is only in use while the pre-upgrade is running and should be removed after a successful upgrade. If the cluster is running and no upgrade is in progress, it is safe to remove everything underneath software_uncompressed.
    • /home/nutanix/foundation/isos/ – Old ISOs of hypervisors or Phoenix.
    • /home/nutanix/foundation/tmp/ – Temporary files that can be deleted.
    Use this syntax for deleting files within each of these directories:

    nutanix@cvm:~$ rm /home/nutanix/foundation/isos/*

    If you see large files in the software_downloads directory that are not needed for any planned upgrades, do not remove them from the command line. Instead, use the Upgrade Software UI in Prism. The UI lists each downloaded version (for example, multiple versions of AOS, which consume around 5 GB each); click the ‘X’ next to a version to delete its files. Then check each of the other tabs, including File Server, Hypervisor, NCC, and Foundation, for further downloads you may not require.

    Also check whether Enable Automatic Download is selected on the AOS tab. Left unmonitored, the cluster will download multiple versions, consuming more space in the home directory.
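
Before deleting anything, it can help to see what is actually taking up space in the approved directories. The following is a read-only sketch that lists the entries in a directory, largest first. The function names list_entries_by_size and review_approved_dirs are hypothetical and are not part of the KB-1540 tool. The sketch deletes nothing; removal should still follow the documented syntax above.

```shell
#!/bin/sh
# Read-only sketch: nothing here deletes anything.
# Both function names are hypothetical helpers, not Nutanix tools.

# Print "<size-in-KB><TAB><path>" for each entry directly inside a directory,
# largest first. Uses the same "dir/*" glob as the documented rm syntax,
# so hidden files are not listed.
list_entries_by_size() {
    target_dir=$1
    [ -d "$target_dir" ] || { echo "skip: $target_dir is not a directory" >&2; return 0; }
    for entry in "$target_dir"/*; do
        [ -e "$entry" ] || continue      # empty directory: the glob did not match
        du -sk "$entry"
    done | sort -rn
}

# Walk the approved directories named in this article.
review_approved_dirs() {
    for approved_dir in \
        /home/nutanix/data/cores \
        /home/nutanix/data/binary_logs \
        /home/nutanix/data/ncc/installer \
        /home/nutanix/data/log_collector \
        /home/nutanix/software_uncompressed \
        /home/nutanix/foundation/isos \
        /home/nutanix/foundation/tmp
    do
        echo "== $approved_dir"
        list_entries_by_size "$approved_dir"
    done
}
```

Calling review_approved_dirs on a CVM prints one section per directory, which makes it easier to decide which files to discuss with Nutanix Support or remove with the documented commands.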

Step 3: Check space usage in /home to see that it is now below 70%.

You can use the “df -h” command to check on the amount of free space in /home. To accommodate a potential AOS upgrade, usage should ideally be below 70%.

nutanix@cvm:~$ allssh "df -h /home"

Example output:

================== x.x.x.x =================
/dev/md2         40G  8.4G   31G  22% /home
================== x.x.x.x =================
/dev/md2         40G  8.5G   31G  22% /home
================== x.x.x.x =================
/dev/md2         40G   19G   21G  49% /home
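
On larger clusters, reading this output by eye gets tedious. The sketch below reads output in the format shown above and prints one line per CVM, marking any whose /home usage is not below 70%. The function name flag_high_home_usage is hypothetical, and the parsing assumes the banner and df row layout of the example output.

```shell
#!/bin/sh
# Sketch: summarize `allssh "df -h /home"` output read from stdin.
# The 70% limit comes from Step 3; flag_high_home_usage is a hypothetical name.
flag_high_home_usage() {
    awk -v limit=70 '
        /^=+ .* =+$/ { host = $2; next }      # banner line: "==== x.x.x.x ===="
        $NF == "/home" && $(NF-1) ~ /%$/ {    # df data row for /home
            pct = $(NF-1); sub(/%$/, "", pct)
            printf "%s %s%% %s\n", host, pct, (pct + 0 < limit) ? "OK" : "HIGH"
        }'
}
```

On a CVM, this could be used as: allssh "df -h /home" | flag_high_home_usage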

Cleaned up files from the approved directories but still see high usage in /home?

Contact Nutanix Support and submit the script log bundle (/tmp/home_kb1540_<cvm_name>_<timestamp>.tar.gz). One of our Systems Reliability Engineers (SREs) will promptly assist you with identifying the source of and solution to the problem at hand. Under no circumstances should you remove files from any other directories aside from those found here as these may be critical to the CVM infrastructure or may contain user data.

If the /home partition exceeds its limit on a PCVM, refer to KB-8950 for troubleshooting.

Entry filed under: Nutanix.
