[File] [PATCH] Expand VMDK hosted sparse, stream-optimized, and descriptor detection

Daniel Carmo Olops daniel at olops.eti.br
Wed May 29 00:17:37 UTC 2024


VMware virtual disk image (VMDK) extents in Hosted Sparse or
Stream-Optimized Compressed sub-formats contain information in their
header which may be useful for troubleshooting or other purposes.
Expand the magic for these two sub-formats to include their version,
sub-format name, virtual disk size, embedded descriptor offset &
length (when available), and whether a footer is present. Also flag
an unclean shutdown, and data corruption caused by transferring the
file over FTP in text mode. For Hosted Sparse, the magic below applies
only to files created by hosted products (e.g., VMware Fusion, VMware
Workstation). ESXi host sparse extents have a different format.
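
For readers who want to cross-check the magic against a real file, here
is a minimal Python sketch that reads the same header fields. The
offsets and flag bits are the ones used in the patch; the function name
parse_sparse_header and the returned dictionary keys are made up for
illustration. Unlike the magic, it tests the compression flag as a bit
rather than comparing the whole flags word, which is an assumption on
my part about the intent.

```python
import struct

SECTOR_SIZE = 512          # per the patch: a sector is always 512 bytes
VMDK_MAGIC = 0x564D444B    # stored little-endian, so the file starts with b"KDMV"
FLAG_COMPRESSED = 0x10000  # flags bit tested by the magic for compression
FLAG_MARKERS = 0x20000     # flags bit tested by the magic for markers


def parse_sparse_header(buf: bytes) -> dict:
    """Hypothetical helper: decode the header fields reported by the magic."""
    if len(buf) < 79:
        raise ValueError("need at least 79 bytes of header")
    magic, version, flags = struct.unpack_from("<III", buf, 0)
    if magic != VMDK_MAGIC:
        raise ValueError("not a hosted sparse or stream-optimized extent")
    (capacity,) = struct.unpack_from("<Q", buf, 12)
    desc_offset, desc_size = struct.unpack_from("<QQ", buf, 28)
    (compress_algorithm,) = struct.unpack_from("<H", buf, 77)

    compressed = bool(flags & FLAG_COMPRESSED)
    markers = bool(flags & FLAG_MARKERS)
    if not compressed and compress_algorithm == 0:
        subformat = "hosted sparse"
    elif compressed and markers and compress_algorithm == 1:
        subformat = "stream-optimized"
    else:
        subformat = "unknown"

    return {
        "version": version,
        "subformat": subformat,
        "capacity_sectors": capacity,
        "capacity_bytes": capacity * SECTOR_SIZE,
        "descriptor_offset": desc_offset,   # in sectors, 0 if not embedded
        "descriptor_size": desc_size,       # in sectors
        "unclean_shutdown": buf[72] == 1,
        # Bytes 73..76 must be "\n", " ", "\r", "\n"; anything else suggests
        # an FTP text-mode transfer rewrote the line endings.
        "ftp_corrupted": buf[73:77] != b"\n \r\n",
    }
```

Usage would be something like parse_sparse_header(open(path, "rb").read(512)).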

This patch also includes magic to detect descriptor files, which share
the .vmdk extension with VMDK extents.
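
The descriptor check amounts to matching a signature line at the start
of the file, with or without a space after the hash sign. A rough
Python equivalent, as a sketch only: the name looks_like_descriptor is
hypothetical, and limiting the check to the first 32 bytes is my loose
approximation of the regex/32 limit, not a statement about how file(1)
applies it.

```python
import re

# Signature as matched by the patch: "# Disk DescriptorFile", where the
# space after "#" is optional (Amazon EC2 VM Import/Export omits it).
DESCRIPTOR_RE = re.compile(rb"^# ?Disk DescriptorFile")


def looks_like_descriptor(data: bytes) -> bool:
    """Hypothetical helper: True if data starts with the descriptor signature."""
    return DESCRIPTOR_RE.match(data[:32]) is not None
```

The same function could be applied to the bytes found at the embedded
descriptor offset (offset in sectors times 512) of a sparse extent.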

Tested with file v5.45 and VMDK files in various sub-formats created
with qemu-img, Oracle VirtualBox, and Amazon EC2 VM Import/Export.

(This is my very first patch sent to an upstream project. Please let
me know if anything is amiss.)

---

diff --git a/magic/Magdir/virtual b/magic/Magdir/virtual
index 64cb2cf1..522e862c 100644
--- a/magic/Magdir/virtual
+++ b/magic/Magdir/virtual
@@ -210,8 +210,76 @@
 >4 byte 2 undoable disk image
 >>32 string >\0 (%s)

-0 string/b VMDK VMware4 disk image
-0 string/b KDMV VMware4 disk image
+# VMware virtual machine disk (VMDK) elements.
+# Based on VMware Technical Note "Virtual Disk Format 5.0",
+# available upon request (also found on third-party websites).
+# Version 3 mentioned in https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-vddk-programming-guide/GUID-0598E48A-66B7-426A-8279-A45F0CEDF15D.html
+
+# Updated by Daniel Carmo Olops (daniel at olops.eti.br)
+#
+# Descriptor files.
+# Although the technical note states that the descriptor is
+# case-insensitive, the string below is described there as-is, and it's the
+# de facto standard seen with VMDK images created by various tools.
+# With Amazon EC2 VM Import/Export, there's a minor difference: there's no
+# space after the hash sign.
+0       regex/32    \^\#\ \?Disk\ DescriptorFile       VMware virtual machine disk (VMDK) descriptor file
+#
+# Hosted sparse extents and stream-optimized compressed extents.
+# No differentiation possible between monolithicSparse and twoGbMaxExtentSparse
+# (that can be determined by checking the descriptor contents).
+#
+# magicNumber 0x564d444b translates to "VMDK". Since it's
+# stored as uint32 in little-endian, that becomes "KDMV".
+# VMDK version is part of the signature, as implied by the description of
+# 'version' field in p.7. Remaining header fields are only parsed when version
+# is a known one (1, 2, 3).
+0       ulelong     0x564d444b
+>4      clear       x
+>4      ulelong     1               VMware virtual machine disk (VMDK) extent version 1
+>>0     use         sparse_header
+>4      ulelong     2               VMware virtual machine disk (VMDK) extent version 2
+>>0     use         sparse_header
+>4      ulelong     3               VMware virtual machine disk (VMDK) extent version 3
+>>0     use         sparse_header
+# Default case for unknown versions
+>4      default     x
+>>4     ulelong     x               VMware virtual machine disk (VMDK) extent (unknown version %u)
+#
+# Sparse header parsing (structure is same for both Hosted Sparse and Stream-Optimized).
+0       name        sparse_header
+# Hosted sparse extents do not have compression
+>8      ulelong     !0x10000
+>>77    uleshort    0               \b, hosted sparse
+# Stream-optimized extents have compression and markers.
+# The VMware technical note also says that flag bit 1 is not set
+# (redundant grain table not used), and that gdOffset is set to
+# 0xffffffffffffffff (header only, proper value on footer),
+# but that's not the case with VMDK images created with qemu-img.
+>8      ulelong     &0x10000
+>>8     ulelong     &0x20000
+>>>77   uleshort    1               \b, stream-optimized
+# Virtual disk size (a sector is always 512 bytes)
+>12     ulequad     x               \b, disk size: %llu sectors
+# Embedded descriptor (potentially) present
+>28     ulequad     >0              \b, embedded descriptor
+# twoGbMaxExtentSparse images created with qemu-img may not actually have
+# the descriptor embedded into them, despite being flagged as such, but in
+# a separate file instead. Alert if that's the case.
+>>(28.q*512)        regex/32        !\^\#\ \?Disk\ DescriptorFile   \b not found at
+# Descriptor offset & length
+>>28    ulequad     x               \b offset/length: %llu
+>>36    ulequad     x               \b/%llu sectors
+# Boolean, expected to be set to 1 in case of unclean VM shutdown
+>72     byte        1               \b, unclean shutdown
+# These are one-byte text fields which must contain
+# \n, \ (space), \r, and \n, respectively. A mismatch suggests
+# that the file was corrupted as a result of transferring it
+# via FTP as text.
+>73     string/4b   !\n\ \r\n       \b, corrupted (transferred via FTP as text)
+# When a footer is present (streamOptimized only), it can be found at (EOF - 2 sectors).
+# It's largely the same as the header, except for gdOffset.
+>-1024  ulelong     0x564d444b      \b, has footer

 #--------------------------------------------------------------------
 # Qemu Emulator Images

