Cleaning up iTunes plist XML

Last night, in the comments on my post about converting iTunes playlists to XML, I was a little more critical of Apple's plist XML than I should have been. It was late and I guess I'd already taken off my XML hat for the night. Putting it back on, we can solve the problem with the iTunes plist file by transforming it into something a little easier to deal with. To do this we'll just use a simple XSLT stylesheet.

The problem, of course, is that the plist format is cumbersome to process with my favorite tool, XPath. I suppose you can't blame Apple too much for that, as the plist format is just a serialization of in-memory structures and wasn't designed to be processed with XPath. But put something in XML and some fool like me will come along and try to run XPath against it.

Here's an example of the problem. The plist format is actually very simple, but it doesn't have a clean hierarchy: each value is tied to its key only by position, so a query has to rely on the ordering of sibling nodes. This query retrieves the names of all songs listed under the artist name "Alan Lomax".

//string[preceding-sibling::key[1]='Name' and following-sibling::string = 'Alan Lomax']

  

This query works for this particular case, but the match on "Alan Lomax" is not specific enough. It matches any following "string" element rather than the value of a particular key. So it matches if "Alan Lomax" shows up as the artist, but it also matches if he shows up in any other string field that comes after the "Name" field. Fortunately it's easy to solve the problem with an XSL transform, so that the same query can be expressed in a more straightforward manner.

//song[Artist = 'Alan Lomax']/Name

  

This query is much simpler and doesn't suffer from any ambiguity over what is being matched.
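
For comparison, the ambiguity can also be removed without changing the format, by tying each value to the key directly in front of it. The stylesheet below is only a sketch: it assumes every key is immediately followed by its value element, as in the sample entry shown in this post, and it simply prints one matching song name per line.

```xml
<xsl:stylesheet
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        version="1.0">

    <!-- Plain text output: one song name per line -->
    <xsl:output method="text"/>

    <xsl:template match="/">
        <!-- Select only dicts whose Artist key is directly followed by
             a value equal to 'Alan Lomax' -->
        <xsl:for-each select="//dict[key[. = 'Artist']/following-sibling::*[1] = 'Alan Lomax']">
            <!-- Emit the value that directly follows the Name key -->
            <xsl:value-of select="key[. = 'Name']/following-sibling::*[1]"/>
            <xsl:text>&#10;</xsl:text>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
```

It works, but the select expressions show how much positional bookkeeping the plist layout demands for even a simple lookup.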

The XSL I used turns stuff that looks like this:

<key>36</key>
<dict>
	<key>Track ID</key><integer>36</integer>
	<key>Name</key><string>It Makes A Long Time Man Feel Bad</string>
	<key>Artist</key><string>22 &#38; Group</string>
	<key>Album</key><string>Prison Songs Volume One - Murderous Home</string>
	<key>Genre</key><string>Folk</string>
	<key>Kind</key><string>MPEG audio file</string>
	<key>Size</key><integer>3267520</integer>
	<key>Total Time</key><integer>163239</integer>
	<key>Track Number</key><integer>9</integer>
	<key>Track Count</key><integer>17</integer>
	<key>Date Modified</key><date>2003-05-05T06:58:27Z</date>
	<key>Date Added</key><date>2003-03-16T02:25:15Z</date>
	<key>Bit Rate</key><integer>160</integer>
	<key>Sample Rate</key><integer>44100</integer>
	<key>Play Count</key><integer>1</integer>
	<key>Play Date</key><integer>-1154095246</integer>
	<key>Play Date UTC</key><date>2003-07-12T23:27:30Z</date>
	<key>File Type</key><integer>1297106739</integer>
	<key>File Creator</key><integer>1752133483</integer>
	<key>Location</key><string>...</string>
	<key>File Folder Count</key><integer>4</integer>
	<key>Library Folder Count</key><integer>1</integer>
</dict>


  

into this:

<song>
    <Track_ID>36</Track_ID>
    <Name>It Makes A Long Time Man Feel Bad</Name>
    <Artist>22 &amp; Group</Artist>
    <Album>Prison Songs Volume One - Murderous Home</Album>
    <Genre>Folk</Genre>
    <Kind>MPEG audio file</Kind>
    <Size>3267520</Size>
    <Total_Time>163239</Total_Time>
    <Track_Number>9</Track_Number>
    <Track_Count>17</Track_Count>
    <Date_Modified>2003-05-05T06:58:27Z</Date_Modified>
    <Date_Added>2003-03-16T02:25:15Z</Date_Added>
    <Bit_Rate>160</Bit_Rate>
    <Sample_Rate>44100</Sample_Rate>
    <Play_Count>1</Play_Count>
    <Play_Date>-1154095246</Play_Date>
    <Play_Date_UTC>2003-07-12T23:27:30Z</Play_Date_UTC>
    <File_Type>1297106739</File_Type>
    <File_Creator>1752133483</File_Creator>
    <Location>...</Location>
    <File_Folder_Count>4</File_Folder_Count>
    <Library_Folder_Count>1</Library_Folder_Count>
</song>

  

The change looks subtle, but it makes a huge difference for ease of processing as well as for speed of processing. The original query took 13.3 seconds to execute, while the second took 2.6 seconds. The file also takes less time to parse, going from 6.4 seconds to 2.3 seconds. That's a little misleading though, as the file also shrank from 40MB to 26MB. Even so, it's a 64% reduction in parsing time with only a 35% reduction in file size. The query time is also disproportionately reduced compared to the reduction in file size, although that can be largely attributed to the structure of the playlist entries in the original file. The new file is smaller because those entries are removed.

Here's the very simple XSL to make the transformation.

<xsl:stylesheet
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        version="1.0">
        
    <xsl:template match="/">
        <songlist>
            <xsl:apply-templates select="plist/dict/dict/dict"/>
        </songlist>
    </xsl:template>
    
    <xsl:template match="dict">
        <song>
            <xsl:apply-templates select="key"/>
        </song>
    </xsl:template>
    
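    <!-- Turn each key into an element named after it (spaces become
         underscores) holding the text of the value element that follows -->
    <xsl:template match="key">
        <xsl:element name="{translate(text(), ' ', '_')}">
            <xsl:value-of select="following-sibling::*[1]"/>
        </xsl:element>
    </xsl:template>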
</xsl:stylesheet>

  

The goal of this exercise wasn't to come up with a completely equivalent format, just to make it easier to deal with for the particular use I was exploring. My format does lose information from the original, in particular data types. Preserving them would have been easy, but I had no need for them. Preserving them would be important if you want to be able to get back to the original format.
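
As a rough illustration of how the data types could be kept, here is a variant of the stylesheet that records the name of each plist value element (string, integer, date and so on) in an attribute. The attribute name type is an assumption made for this sketch; it isn't part of the plist format or of the stylesheet above.

```xml
<xsl:stylesheet
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        version="1.0">

    <xsl:template match="/">
        <songlist>
            <xsl:apply-templates select="plist/dict/dict/dict"/>
        </songlist>
    </xsl:template>

    <xsl:template match="dict">
        <song>
            <xsl:apply-templates select="key"/>
        </song>
    </xsl:template>

    <xsl:template match="key">
        <!-- The value is the element directly after the key -->
        <xsl:variable name="value" select="following-sibling::*[1]"/>
        <xsl:element name="{translate(., ' ', '_')}">
            <!-- Hypothetical "type" attribute: the plist element name
                 of the value, e.g. string, integer or date -->
            <xsl:attribute name="type">
                <xsl:value-of select="name($value)"/>
            </xsl:attribute>
            <xsl:value-of select="$value"/>
        </xsl:element>
    </xsl:template>
</xsl:stylesheet>
```

With this variant the track ID from the sample comes out as <Track_ID type="integer">36</Track_ID>, and a query such as //song[Artist = 'Alan Lomax']/Name is unaffected, because an attribute doesn't change an element's string value.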

Sometimes in the pursuit of XML purity we also forget that the exact format of the file is far less important than the fact that it is XML at all. XML is XML, and XSLT is an enormously powerful tool to have in your arsenal. So the moral is: if you encounter XML that doesn't cooperate with the way you want to process it, just change it. XML everywhere is the first goal; the rest we can fix from there.

Posted by Kimbro Staken

Friday Dec 5, 2003 at 11:05 PM