30 November 2024

A problem with one or more vFAT bootbank partitions was detected. (PSOD Alarm!)

By H. Cemre Günay

During a VMware vSphere 7 to 8 upgrade, more precisely right after the compliance check, I was confronted with the above error message. In this blog post I would like to show you how to solve the problem. Corrupt vFAT partitions were already an issue back in the vSphere 6.7 days, and unfortunately they still show up today with vSphere 8.

The problem? If you go through with the upgrade in such a case, you will be confronted with a Purple Screen of Death (PSOD) relatively quickly, which you obviously want to avoid.

There are two ways to fix this issue:

  • Repair the corrupted vFAT Partition (Recommended by VMware)
    or
  • Re-create the vFAT Partition

So let’s start: first we put the affected node into Maintenance Mode.
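The state can also be double-checked from the ESXi shell before touching any partition. This is a minimal sketch, assuming `esxcli system maintenanceMode get` prints `Enabled` or `Disabled`; the helper name `in_maintenance` and the `ESXCLI` variable are my own and not part of the KB:

```shell
# Assumption: ESXCLI points at the esxcli binary; it can be overridden for dry runs.
ESXCLI=${ESXCLI:-esxcli}

# Hypothetical helper: succeeds only if the host reports Maintenance Mode as enabled.
in_maintenance() {
    [ "$("$ESXCLI" system maintenanceMode get)" = "Enabled" ]
}

# Usage on the host (not executed here):
#   in_maintenance || echo "Host is NOT in Maintenance Mode, stop here"
```

The check is read-only, so it is safe to run at any point during the procedure.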

We then open an SSH session on this node and have a look at the available file systems by using the command:

esxcli storage filesystem list
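On hosts with many datastores the list gets long. A small filter can reduce it to the vFAT volumes. This is only a sketch, assuming the type shows up as a separate whitespace-delimited word `vfat` on each volume line; the helper name `only_vfat` is hypothetical:

```shell
# Hypothetical helper: keep only the lines that contain "vfat"
# (case-insensitive) as a whitespace-separated field.
only_vfat() {
    awk '{ for (i = 1; i <= NF; i++) if (tolower($i) == "vfat") { print; next } }'
}

# Usage on the host (not executed here):
#   esxcli storage filesystem list | only_vfat
```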

As you can see, both bootbank volumes are of type vfat. If further details are required, the following command displays the disk and partition behind a volume:

vmkfstools -P /vmfs/volumes/254051f4-xxxx-xxxxxxxxxxxxx

To make sure the following steps run smoothly and nothing writes to the disk in the meantime, it is advisable to check for open file handles and close them if necessary. First stop crond, which periodically runs backup.sh and thereby updates the active bootbank:

kill $(cat /var/run/crond.pid)
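If the pid file is missing or stale, the one-liner above fails with a rather unhelpful error. A slightly more defensive variant could look like this. It is a sketch only, and the helper name `stop_by_pidfile` is hypothetical:

```shell
# Hypothetical helper: terminate the process whose PID is stored in a pid file,
# but only if the file is readable and the process is still alive.
stop_by_pidfile() {
    pidfile=$1
    [ -r "$pidfile" ] || { echo "no pid file: $pidfile"; return 1; }
    pid=$(cat "$pidfile")
    # kill -0 sends no signal; it only tests whether the PID can be signalled.
    if kill -0 "$pid" 2>/dev/null; then
        kill "$pid"
    else
        echo "process $pid is not running"
        return 1
    fi
}

# Usage on the host (not executed here):
#   stop_by_pidfile /var/run/crond.pid
```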

Then stop vmsyslogd, which has open file handles on /scratch (log files):

/usr/lib/vmware/vmsyslog/bin/shutdown.sh

Now check whether any other daemons still have open file handles on the scratch partition, and stop them. In our case the command returned nothing, so I am showing some of the examples from the KB instead:

lsof |grep scratch

Example Output:
1001391762  vmfstracegd   FILE  4   /scratch/vmfstraces/vmfsGlobalTrace.trace.0.gz

Stop this daemon with:
/etc/init.d/vmfstraced stop

Output:
watchdog-vmfstracegd: Terminating watchdog process with PID 1001391748
vmfstracegd stopped

---

lsof |grep scratch

Note: 63fe3b74-########-####-#######46fa is the UUID of the scratch partition in this example, so the same search can be run against the UUID:

lsof |grep 63fe3b74-########-####-########46fa

Example Output:

1001391489  rhttpproxy            FILE                       18   /vmfs/volumes/63fe3b74-########-####-#######46fa/log/rhttpproxy-1001391489-000000db02450060-lo0-1.pcap
1001391489  rhttpproxy            FILE                       19   /vmfs/volumes/63fe3b74-########-####-#######46fa/log/rhttpproxy-1001391489-000000db024501a8-vmk0-1.pcap

Stop this daemon with:
/etc/init.d/rhttpproxy stop

---

lsof | grep var/run/log

Example Output:
2101088    python               FILE                       5  /var/run/log/vsandevicemonitord.log


Stop this daemon with:
/etc/init.d/vsandevicemonitord stop
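The three checks above follow the same pattern: filter the lsof output by a path and look at who holds the file. As a sketch, assuming the column layout from the example outputs (ID, process name, type, file descriptor, path), the process names can be collected like this; the helper name `holders` is hypothetical:

```shell
# Hypothetical helper: read lsof output on stdin and print the unique process
# names (2nd column) of all lines that contain the given fixed string.
holders() {
    pattern=$1
    grep -F -- "$pattern" | awk '{ print $2 }' | sort -u
}

# Usage on the host (not executed here):
#   lsof | holders scratch
#   lsof | holders /var/run/log
```

Keep in mind that the process name is not always the name of the init script, as the examples above show (vmfstracegd is stopped via /etc/init.d/vmfstraced, and the python process via /etc/init.d/vsandevicemonitord), so the mapping still has to be done by hand.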

As mentioned, there are two ways to recover a corrupted vFAT partition. I will show you the repair method here, not the re-create method. If the repair workaround does not work for you, you can find the re-create workaround here: https://knowledge.broadcom.com/external/article/345227/corrupted-vfat-partitions-from-esxi-6567.html

With the following command we check the partition. On a healthy partition it finishes without reporting any errors:

dosfsck -Vv /dev/disks/<disk and partition id>

In our case, however, the command reports an error for the file name “mfg_net”, so we repair (or delete) it with the following command:

dosfsck -a -w /dev/disks/<disk and partition id>

After a successful repair, we run the dosfsck -Vv command again to make sure that there are no corrupted files left:

dosfsck -Vv /dev/disks/<disk and partition id>
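The check, repair, re-check sequence can also be wrapped into one small function. This is a sketch under assumptions, not something taken from the KB: it assumes dosfsck exits with a non-zero status when it finds problems (as documented for the upstream dosfstools tool), and the names `check_and_repair` and `DOSFSCK` are my own:

```shell
# Assumption: DOSFSCK points at the dosfsck binary; it can be overridden for dry runs.
DOSFSCK=${DOSFSCK:-dosfsck}

# Hypothetical helper: verify a vFAT partition, repair it only if the check
# fails, then verify again. Returns 0 if the partition ends up clean.
check_and_repair() {
    dev=$1
    if "$DOSFSCK" -Vv "$dev"; then
        echo "clean: $dev"
        return 0
    fi
    echo "errors found, repairing: $dev"
    # The status of the repair pass is ignored on purpose;
    # the second verify pass below decides the result.
    "$DOSFSCK" -a -w "$dev" || true
    if "$DOSFSCK" -Vv "$dev"; then
        echo "repaired: $dev"
        return 0
    fi
    echo "still failing: $dev"
    return 1
}

# Usage on the host (not executed here):
#   check_and_repair "/dev/disks/<disk and partition id>"
```

A non-zero return from the function corresponds to the case where the re-create workaround from the KB article is needed.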

Now reboot the node and then take it out of Maintenance Mode. After that, you should no longer see any errors in the Compliance Check area. If the dosfsck command still reports failures, please use the second workaround from the KB article.

That’s it for this blog post. If you have any questions, please use the comment section below. 🙂