
Directory Trees to Document Trees

March 30, 2005

John E. Simpson

XML.com readers know well that XML shines as a transport medium between applications. Less obvious is that XML can serve as a good platform for getting data into and out of a single application. Whether for simple backup/archival purposes or for porting between program versions, having data in XML form assures its relatively simple availability to any program.

As an object lesson, consider the case of the Quicken financial package. Quicken's data had always been kept in binary files conforming to the rules of a Quicken-only universe. This was fine as long as Intuit, Quicken's parent, continued to support these so-called QIF files. If you haven't been following the still-unfolding saga, trust me: it wasn't pretty when Intuit announced they would no longer support QIF data. In other words, when you upgraded to the next Quicken version, you could not import the staggering amount of data you'd accumulated over the years — not individual transactions, not accounts, not reports. Granted, the first versions of Quicken debuted years before XML. Still, the unavailability of any even remotely portable data format such as XML, even on an interim basis, has Intuit's customers over a very splintery barrel.

So it's interesting, in any case, to see vendors taking advantage of XML's built-in permanence-through-portability advantages, even when exposing an application's internals like this might seem as though it's not in the vendors' best interests. In this month's "XML Tourist," we'll take a look at one of those vendors, and one of its products.

JAM Software's TreeSize Professional

Making the leap from a computer's file system tree — directories and files — to an XML structure isn't exactly making a leap of breathtaking proportions. Just to rattle off one possibility, you might start out with something like this:

<disk>
  <directory name="/">
    <directory name="user">
      <file name="file1.xml"/>
      . . .
    </directory>
  </directory>
</disk>

But much more could be said about directories and files than just their names and relationships to one another. You might want to know a file's size, for instance, or its read-write-execute attributes, or the date/time it was created, last modified, or last accessed. Ultimately, you might want not just to understand the directory structure; you might want to manipulate it, to use it more efficiently, to ensure (even in these days of inexpensive multi-gigabyte storage media) that you're making the most of your hardware investment.
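To make that concrete, here is a minimal Python sketch that walks a real directory and emits the structure above, with each file's size and last-modified time added. The size and modified attribute names, and both function names, are my own inventions for illustration; they have nothing to do with TreeSize's format.

```python
import os
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def directory_to_element(path):
    """Return a <directory> element describing `path` and everything below it."""
    # For a root such as "/" the basename is empty, so fall back to the path itself.
    name = os.path.basename(os.path.normpath(path)) or path
    node = ET.Element("directory", name=name)
    with os.scandir(path) as entries:
        for entry in sorted(entries, key=lambda e: e.name):
            if entry.is_dir(follow_symlinks=False):
                node.append(directory_to_element(entry.path))
            elif entry.is_file(follow_symlinks=False):
                info = entry.stat(follow_symlinks=False)
                modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
                ET.SubElement(
                    node,
                    "file",
                    name=entry.name,
                    size=str(info.st_size),  # hypothetical attribute
                    modified=modified.isoformat(timespec="seconds"),  # hypothetical attribute
                )
            # Symbolic links and special files are deliberately skipped.
    return node


def disk_document(path):
    """Wrap the directory tree rooted at `path` in a <disk> element."""
    disk = ET.Element("disk")
    disk.append(directory_to_element(path))
    return disk


# Example use: print(ET.tostring(disk_document("."), encoding="unicode"))
```

Even this toy version shows how quickly the document grows once you record anything beyond names.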

That's what JAM Software's TreeSize Professional aims to do. The program ($39.95 for a single-user license; free evaluation version available) runs only on Microsoft Windows 98+ machines; it's particularly useful for the most recent Windows versions, from NT through 2003, because it takes advantage of features of the more rigorous NT File System (NTFS) used by only those later versions.

Although billed as a "hard disk space manager," TreeSize doesn't actually manage a drive's space so much as analyze it. It can tell you, for example, which users of a network drive are hogging the most space; it can isolate problem areas, such as directories overloaded with multimedia files; it can indicate whether you should think about using some form of disk compression or defragmentation utility. It does, in fact, tell you just about everything you need to know in order to truly manage the drive through some other means. (The Tools menu does include hooks to the Windows "Add/Remove Software" and "Map Network Drive" functions, both arguably disk-management functions, I guess.) This amounts to a ton of data, even for fairly small directories.

Now, JAM certainly hopes to make money from sales of its product. So maybe, you'd think, it would want to hold all this data close to its vest, thereby preventing peeking by competitors (or by users, for that matter). Not so. In fact, certain features pretty much demand, for efficiency's sake, that the data be exposed externally to the program itself. For example, you can take a "before" snapshot of a directory tree, take steps to eliminate or at least ameliorate any problems uncovered, and then compare the tree's "after" state to this externally-saved snapshot. And here, of course, the data format of interest is XML.

Note: TreeSize has a flash-and-dazzle user interface whose primary appeal, at first, is visual. By default, for example, what you see when you first open the program is a multi-colored bar chart depicting the relative sizes of a selected drive or directory's contents. This interface can look spectacular in screen captures but — to my way of thinking — is fairly uninteresting, compared to the information behind it. I'm not going to be showing you much of what TreeSize looks like on the surface.

Trees In, Trees Out

While TreeSize doesn't automatically hold its data in any way external to the program — which may have a lot to do with the program's speed — it does offer several options for saving the data. You can export it to Microsoft Excel, for instance, or a tab-delimited text file. You can even generate an HTML (not XHTML) document, complete with an embedded JPG image of the current bar chart. While these options can be useful for sharing or presenting the data, they provide essentially static snapshots: if you want to use the saved data in any way, you must devise the usage on your own.

Figure 1: A portion of TreeSize's File menu, highlighting XML reporting

Not so with the export-to-XML feature. More precisely, TreeSize refers to this as "XML reporting" rather than "XML exporting," as you can see from the partial screen capture at the right (a portion of the program's File menu).

Also as you can see here, you can perform three XML-related functions with TreeSize:

  • Load XML Report: Brings the contents of a previously-saved TreeSize XML report into the program. This switches the user interface to view the imported directory tree, regardless of whether or not the imported tree actually exists on the current drive.
  • Save Report as XML: Performs the inverse of the "Load" function, dumping the data behind the current user interface to an XML document.
  • Compare With XML Report: Compares the directory tree currently viewed through the interface with the directory tree whose characteristics TreeSize has previously saved as an XML document. (This is the ultimate goal of a "find a disk management problem, fix it, display the results of the fix" procedure.)
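Because the report is plain XML, a rough version of that before-and-after comparison can also be done outside the program. The Python sketch below is not how TreeSize itself compares reports; it simply assumes the Folder element, fullpath attribute, and SizeData element that appear in the report format examined later in this column. The BEFORE sample borrows the PPDB folder's size from that report, while the AFTER figure is invented purely for illustration.

```python
import xml.etree.ElementTree as ET


def folder_sizes(report_xml):
    """Map each folder's full path to the Size recorded in its SizeData child."""
    root = ET.fromstring(report_xml)
    sizes = {}
    for folder in root.iter("Folder"):
        size_data = folder.find("SizeData")
        if size_data is not None:
            sizes[folder.get("fullpath")] = int(size_data.get("Size"))
    return sizes


def compare_reports(before_xml, after_xml):
    """Return {path: change in bytes} for every folder whose size differs."""
    before = folder_sizes(before_xml)
    after = folder_sizes(after_xml)
    changes = {}
    for path in sorted(set(before) | set(after)):
        # A folder missing from one report counts as zero bytes there.
        delta = after.get(path, 0) - before.get(path, 0)
        if delta:
            changes[path] = delta
    return changes


# Cut-down sample reports; the AFTER size is made up for this example.
BEFORE = r"""<Root><Folder fullpath="C:\PS\CACHE\PPDB\" IsFilesNode="0">
<SizeData Size="1932569" Files="28" /></Folder></Root>"""
AFTER = r"""<Root><Folder fullpath="C:\PS\CACHE\PPDB\" IsFilesNode="0">
<SizeData Size="1000000" Files="20" /></Folder></Root>"""

for path, delta in compare_reports(BEFORE, AFTER).items():
    print(f"{path}: {delta:+d} bytes")
```

A negative number means the folder shrank between the two snapshots, which is presumably what you were hoping for.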

A Closer Look

Even a simple directory tree displayed in TreeSize's user interface does not translate to a simple TreeSize XML report. That's because what gets dumped to XML is information about the entire hard drive. Here's a typical example: displayed on the screen at the time was a directory named PPDB, containing 28 files and no subdirectories:

<Root>
  <Version>3.22 (3.2.2.229)</Version>
  <Path>C:\</Path>
  <ExcludePatterns/>
  <Filter>
    <pattern>*</pattern>
  </Filter>
  <ArchiveBitFilesOnly>0</ArchiveBitFilesOnly>
  <CreatedPastDaysOnly>0</CreatedPastDaysOnly>
  <Filesystem>NTFS</Filesystem>
  <SectorsPerCluster>1</SectorsPerCluster>
  <BytesPerSector>512</BytesPerSector>
  <BytesPerCluster>512</BytesPerCluster>
  <Compressed>0</Compressed>
  <FileBasedCompression>-1</FileBasedCompression>
  <FoldersOccupySpace>0</FoldersOccupySpace>
  <IsCompared>0</IsCompared>
  <Title> Drive: Local Disk (C:)</Title>
  <UserDefinedClusterSize>0</UserDefinedClusterSize>
  <UsedBytesOnDrive>39991278592</UsedBytesOnDrive>
  <FreeBytesOnDrive>30498184192</FreeBytesOnDrive>
  . . .
</Root>

All of the Root element's children shown here, from Version through FreeBytesOnDrive, either contain general information about the program (e.g., Version) or the drive (e.g., BytesPerCluster), or contain flag values corresponding to options set (or defaulted) by the user at the time TreeSize dumped this report (e.g., ArchiveBitFilesOnly).
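These header values are easy to pull out with any XML parser. As a small sketch, the Python below reads a trimmed copy of the fragment above and derives a few figures from it. The drive_summary function is hypothetical, and treating used plus free bytes as the drive's total capacity is my assumption about what those two elements mean, not something the report states.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the report header shown above.
HEADER = r"""<Root>
<Path>C:\</Path>
<Filesystem>NTFS</Filesystem>
<SectorsPerCluster>1</SectorsPerCluster>
<BytesPerSector>512</BytesPerSector>
<BytesPerCluster>512</BytesPerCluster>
<UsedBytesOnDrive>39991278592</UsedBytesOnDrive>
<FreeBytesOnDrive>30498184192</FreeBytesOnDrive>
</Root>"""


def drive_summary(root):
    """Collect a few drive-level figures from a report's Root element."""

    def number(tag):
        return int(root.findtext(tag))

    used = number("UsedBytesOnDrive")
    free = number("FreeBytesOnDrive")
    return {
        "path": root.findtext("Path"),
        "filesystem": root.findtext("Filesystem"),
        # Assumption: used + free approximates the drive's capacity.
        "total_bytes": used + free,
        "used_fraction": used / (used + free),
        # Sanity check: sectors per cluster times bytes per sector
        # should equal the reported bytes per cluster.
        "cluster_consistent": number("SectorsPerCluster") * number("BytesPerSector")
        == number("BytesPerCluster"),
    }


summary = drive_summary(ET.fromstring(HEADER))
print(f"{summary['path']} ({summary['filesystem']}): "
      f"{summary['used_fraction']:.0%} of {summary['total_bytes']} bytes used")
```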

Of greater interest, probably, is what appears in place of the ellipsis (. . .) in the above fragment: details on every folder on the drive, in a set of nested folder elements which replicate the hierarchy of directories (folders) in an XML document subtree. Interestingly, this subtree says nothing at all about individual files, just sums them up by file type (that is, extension .doc for Word documents, .txt for text files, and so on). Here's what the information on my PPDB directory looks like at the moment:

<Folder fullpath="C:\PS\CACHE\PPDB\" IsFilesNode="0">
  <Name>PPDB</Name>
  <Attributes>16</Attributes>
  <LastAccessDate Low="1493083888" High="29700422" />
  <LastChangeDate Low="1493083888" High="29700422" />
  <CreationDate Low="3340077362" High="29674503" />
  <SizeData Size="1932569" Allocated="1937920"
            Wasted="5351" CDRom="1953792" Files="28"
            Folders="0" Compression="1" />
  <FilesSizeData Size="1932569" Allocated="1937920"
                 Wasted="5351" CDRom="1953792" Files="28"
                 Folders="0" Compression="1" />
  <TExtensionSizeArray>
    <Item name=".DAT" Size="1734081"
          Allocated="1737216" Wasted="3135"
          CDRom="1744896" Files="14" Folders="0"
          Compression="1" />
    <Item name=".KEY" Size="198488"
          Allocated="200704" Wasted="2216"
          CDRom="208896" Files="14" Folders="0"
          Compression="1" />
  </TExtensionSizeArray>
</Folder>

File types are summarized in the Item child elements of the TExtensionSizeArray element. As shown here, for instance, the PPDB folder contains 14 files whose names include an extension of .DAT, and 14 files whose names include an extension of .KEY.
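A few lines of Python are enough to read this folder record back and check that its numbers hang together. The sketch below uses a trimmed copy of the PPDB element. One part is a guess on my part: the Low and High attributes on the date elements look like the two 32-bit halves of a Windows FILETIME value (100-nanosecond ticks since January 1, 1601), and the helper decodes them on that assumption. TreeSize's documentation may say otherwise.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta

# A trimmed copy of the PPDB folder record shown above.
FOLDER = r"""<Folder fullpath="C:\PS\CACHE\PPDB\" IsFilesNode="0">
<LastChangeDate Low="1493083888" High="29700422" />
<SizeData Size="1932569" Allocated="1937920" Wasted="5351" Files="28" />
<TExtensionSizeArray>
<Item name=".DAT" Size="1734081" Allocated="1737216" Wasted="3135" Files="14" />
<Item name=".KEY" Size="198488" Allocated="200704" Wasted="2216" Files="14" />
</TExtensionSizeArray>
</Folder>"""


def extension_summary(folder):
    """Map each extension to (file count, total bytes) for one Folder element."""
    return {
        item.get("name"): (int(item.get("Files")), int(item.get("Size")))
        for item in folder.findall("TExtensionSizeArray/Item")
    }


def wasted_is_consistent(element):
    """Check that Wasted equals Allocated minus Size on a size-bearing element."""
    return int(element.get("Allocated")) - int(element.get("Size")) == int(
        element.get("Wasted")
    )


def filetime_to_datetime(element):
    """ASSUMPTION: Low/High are the halves of a Windows FILETIME value (UTC)."""
    ticks = (int(element.get("High")) << 32) | int(element.get("Low"))
    return datetime(1601, 1, 1) + timedelta(microseconds=ticks // 10)


folder = ET.fromstring(FOLDER)
by_extension = extension_summary(folder)
for name, (files, size) in by_extension.items():
    print(f"{name}: {files} files, {size} bytes")
print("Last change (if FILETIME):", filetime_to_datetime(folder.find("LastChangeDate")))
```

In this record, the per-extension sizes add up to the folder's overall Size, and each Wasted value equals Allocated minus Size.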

(Given that my hard drive currently contains thousands of folders, the absence of details about the files in those folders is probably a blessing. Otherwise, the TreeSize XML report itself could become a factor in my ability to manage space on the drive!)

What does JAM Software gain by saving all this data as XML? First, it simplifies getting the data back in, for comparisons across drives or, for a single drive, across some arbitrary time interval. But this capability doesn't automatically suggest XML; indeed, there might be many advantages to saving the data in some proprietary binary format.

What JAM Software (and its customers) gain with the XML format is two-fold: transparency and future-proofing.

  • Transparency: Not only is TreeSize data easy to read (assuming you've got some familiarity with what the data represents); it's easy to process — not just by TreeSize itself, but potentially by clever users or developers of add-on products. For instance, why limit yourself to the simple bar and pie charts which TreeSize offers you? Why not use XSLT to transform all or parts of a TreeSize XML report into an SVG representation of a scatter chart, a line graph, a bubble chart?
  • Future-proofing: It's certainly true that hard-drive technology changes over time, and so the exact structure of the TreeSize XML report will need to change as well. But JAM (and its customers) need never worry that a report created in TreeSize version N will be unusable in version N+1, N+2, or N+what-have-you.
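As a taste of the transparency argument, here is a rough sketch that turns a report's per-extension totals into a bare-bones SVG bar chart. It uses Python rather than the XSLT suggested above, purely to keep the example short; the same mapping could be written as a stylesheet. The function name, layout parameters, and chart design are all my own choices, not anything TreeSize provides.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Per-extension totals taken from the PPDB folder record (other attributes omitted).
ITEMS = r"""<TExtensionSizeArray>
<Item name=".DAT" Size="1734081" />
<Item name=".KEY" Size="198488" />
</TExtensionSizeArray>"""


def bar_chart_svg(array, width=400, bar_height=20, gap=5, label_width=60):
    """Return an SVG string with one horizontal bar per Item, scaled by Size."""
    items = [(item.get("name"), int(item.get("Size"))) for item in array.findall("Item")]
    # Avoid dividing by zero when there are no items or every size is zero.
    largest = max((size for _, size in items), default=1) or 1
    height = len(items) * (bar_height + gap)
    svg = ET.Element("svg", xmlns=SVG_NS,
                     width=str(label_width + width), height=str(height))
    for row, (name, size) in enumerate(items):
        y = row * (bar_height + gap)
        label = ET.SubElement(svg, "text", x="0", y=str(y + bar_height - 5))
        label.text = name
        ET.SubElement(svg, "rect", x=str(label_width), y=str(y),
                      width=str(round(width * size / largest)),
                      height=str(bar_height))
    return ET.tostring(svg, encoding="unicode")


chart = bar_chart_svg(ET.fromstring(ITEMS))
print(chart)
```

Save the output to a file with an .svg extension and any SVG-capable viewer should display it.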

In short, what JAM gains is the full benefit of XML's "keep it simple" promise. With luck, we'll continue to find vendors of all kinds of software taking the same tack: lucky for the vendors, and lucky for us.