A problem with one or more vFAT bootbank partitions was detected. (PSOD Alarm!)
During a VMware vSphere 7 to 8 upgrade, more precisely after the compliance check, I ran into the error message above. In this blog post I would like to show you how to solve the problem. Corrupt vFAT partitions were already an issue back in the vSphere 6.7 days, and unfortunately they still show up with vSphere 8.
The problem? If you go ahead with the upgrade in this state, you will be confronted with a Purple Screen of Death (PSOD) relatively quickly, which you obviously want to avoid.
There are two ways to fix this issue:
- Repair the corrupted vFAT partition (recommended by VMware)
- Re-create the vFAT partition
So let’s start. First, we put the affected node into Maintenance Mode.
We then open an SSH session on this node and have a look at the available file systems by using the command:
esxcli storage filesystem list
In the output, both bootbank volumes are of type vfat. If you need further details, the following command displays the disk and partition behind a volume:
vmkfstools -P /vmfs/volumes/254051f4-xxxx-xxxxxxxxxxxxx
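On a host with many datastores the list can get long, so here is a minimal sketch for picking out only the vfat rows. The function name list_vfat_mounts is my own invention and not part of ESXi, and the sketch assumes awk is available in the ESXi shell and that the mount point is the first column of the output.

```shell
# list_vfat_mounts: read `esxcli storage filesystem list` output on stdin
# and print the mount point (first column) of every row whose type is vfat.
# Hypothetical helper, not part of ESXi.
list_vfat_mounts() {
    awk '{
        # Scan the columns after the mount point for the literal type "vfat".
        for (i = 2; i <= NF; i++) {
            if ($i == "vfat") { print $1; break }
        }
    }'
}

# Usage on the host:
#   esxcli storage filesystem list | list_vfat_mounts
```

Each printed mount point can then be passed to vmkfstools -P to look up the underlying disk and partition.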
Before touching the disk, check for open file handles and close them where necessary, so that the following steps run cleanly. First stop crond, which periodically runs backup.sh and thereby updates the active bootbank:
kill $(cat /var/run/crond.pid)
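If the PID file is missing, the kill command above fails with a usage error. A slightly more defensive variant could look like the sketch below. The helper name stop_from_pidfile is hypothetical, and it only wraps the same kill call with a check that the file exists and is not empty.

```shell
# stop_from_pidfile: send SIGTERM to the process whose PID is stored in a file.
# Hypothetical helper; returns 1 without killing anything if the file is
# missing or empty.
stop_from_pidfile() {
    pidfile="$1"
    if [ ! -s "$pidfile" ]; then
        echo "no PID file at $pidfile, nothing to stop" >&2
        return 1
    fi
    kill "$(cat "$pidfile")"
}

# Usage on the host:
#   stop_from_pidfile /var/run/crond.pid
```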
Then stop vmsyslogd, which has open file handles on /scratch (log files):
/usr/lib/vmware/vmsyslog/bin/shutdown.sh
Now check for further daemons that hold open file handles on the scratch partition and stop them. In our case the command returned nothing, so the examples below are taken from the KB article:
lsof |grep scratch
Example Output:
1001391762 vmfstracegd FILE 4 /scratch/vmfstraces/vmfsGlobalTrace.trace.0.gz
So stop this daemon with:
/etc/init.d/vmfstraced stop
Output:
watchdog-vmfstracegd: Terminating watchdog process with PID 1001391748
vmfstracegd stopped
---
lsof |grep scratch
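Note: 63fe3b74-########-####-#######46fa is the UUID of the scratch partition. Open files may be listed under the volume's UUID path instead of /scratch, so search for the UUID as well:
lsof |grep 63fe3b74-########-####-########46fa
Example Output: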
1001391489 rhttpproxy FILE 18 /vmfs/volumes/63fe3b74-########-####-#######46fa/log/rhttpproxy-1001391489-000000db02450060-lo0-1.pcap
1001391489 rhttpproxy FILE 19 /vmfs/volumes/63fe3b74-########-####-#######46fa/log/rhttpproxy-1001391489-000000db024501a8-vmk0-1.pcap
So stop this daemon with:
/etc/init.d/rhttpproxy stop
---
lsof | grep var/run/log
Example Output:
2101088 python FILE 5 /var/run/log/vsandevicemonitord.log
So stop this daemon with:
/etc/init.d/vsandevicemonitord stop
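The three lsof checks above all follow the same pattern: filter the output by a path fragment and read the process name from the second column. A minimal sketch of that pattern is shown below. The function name holders_of is made up for illustration, and the sketch assumes the column layout from the example outputs (ID, name, type, file descriptor, path).

```shell
# holders_of: read ESXi `lsof` output on stdin and print the unique process
# names (second column) of all lines that mention the given path fragment.
# Hypothetical helper; assumes the layout shown in the examples:
#   <id> <name> <type> <fd> <path>
holders_of() {
    # index() does a plain substring match, so slashes and dots in the
    # fragment are not treated as regular expression characters.
    awk -v frag="$1" 'index($0, frag) > 0 { print $2 }' | sort -u
}

# Usage on the host:
#   lsof | holders_of scratch
#   lsof | holders_of var/run/log
```

Keep in mind that the name is not always the daemon itself. In the last example the process is listed as python, and only the file path points to vsandevicemonitord, so it is still worth looking at the full line before stopping anything.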
As mentioned above, there are two ways to recover a corrupted vFAT partition. In this post I will show the repair method, not the re-create method. If the repair workaround does not work for you, you can find the re-create workaround here: https://knowledge.broadcom.com/external/article/345227/corrupted-vfat-partitions-from-esxi-6567.html
First, we check the partition with the following command. On a healthy partition it reports no errors:
dosfsck -Vv /dev/disks/<disk and partition id>
In our case, however, the check reports an error for the file name “mfg_net”, so we repair (or delete) the broken entry with the following command, where -a repairs automatically and -w writes the changes to disk immediately:
dosfsck -a -w /dev/disks/<disk and partition id>
After a successful repair, we run the dosfsck -Vv command again to make sure that no corrupted files remain:
dosfsck -Vv /dev/disks/<disk and partition id>
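The check, repair, re-check sequence above can be sketched as a small function. This is only an illustration under assumptions: the name check_and_repair is hypothetical, and the sketch assumes that dosfsck exits with a non-zero status when it finds problems, as the upstream dosfstools documentation describes. Please confirm that on your ESXi build before relying on it.

```shell
# check_and_repair: verify a vFAT partition, repair only if the check fails,
# then verify again. Hypothetical wrapper around the dosfsck calls above.
# Assumption: dosfsck exits non-zero when it finds problems.
check_and_repair() {
    dev="$1"
    if dosfsck -Vv "$dev"; then
        echo "clean: $dev"
        return 0
    fi
    echo "errors found, attempting repair: $dev"
    # The exit status of the repair run is ignored on purpose; the re-check
    # below decides whether the partition is healthy.
    dosfsck -a -w "$dev" || true
    if dosfsck -Vv "$dev"; then
        echo "repaired: $dev"
        return 0
    fi
    echo "still failing after repair: $dev" >&2
    return 1
}

# Usage on the host (fill in your own device path):
#   check_and_repair "/dev/disks/<disk and partition id>"
```

Letting the second verification decide the outcome keeps the logic independent of whatever status the repair run itself returns.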
Now we reboot the node and then take it out of Maintenance Mode. After that, you should no longer see any errors in the Compliance Check area. If the dosfsck command still reports errors, please use the second workaround from the KB article.
That's it for this blog post. If you have any questions, please use the comment section below. 🙂