<div>I am sorry if I suggested that you were making things up; that was not my intention.</div>
<div> </div>
<div>I didn't understand the reference to large numbers of threads in the context of XML parsing, and I was surprised by the poor performance of my own erlsom parser. The first was probably a misunderstanding of what you wrote, and the second was not directly related to the blog. Next time I will try to be more careful in my wording.
</div>
<div> </div>
<div>Anyway, my surprise at the poor performance of erlsom for files like this triggered me to have another look at this.</div>
<div> </div>
<div>I rewrote the SAX parser to work directly on (UTF-8 encoded) binaries. It now produces SAX events that carry binary data as 'payload' (or, more exactly, it passes them as arguments to a callback function). They look like this:
</div>
<div> </div>
<div>{processingInstruction,<<"xml">>,<<"xml version=\"1.0\" encoding=\"UTF-8\"">>}<br>{startElement,<<>>,<br> <<"plist">>,
<br> <<>>,<br> [{attribute,<<"version">>,<<"version">>,<<>>,<<"1.0">>}]}<br>{ignorableWhitespace,<<"\r\n">>}
<br>{startElement,<<>>,<<"dict">>,<<>>,[]}<br>{ignorableWhitespace,<<"\r\n ">>}<br>{startElement,<<>>,<<"key">>,<<>>,[]}
<br>{characters,<<"Major Version">>}<br>etcetera.</div>
<div> </div>
<div>I wrote a simple callback function that counts the number of songs (more precisely: the number of "key" elements with value "Track ID"). Processing the file took 22 seconds.</div>
<div> </div>
<div><br>callbackFun(Event, S = {State, Count}) -> <br> case State of<br> start -><br> case Event of<br> {startElement, _, <<"key">>, _, _} -><br> {key, Count};<br>
_ -> S<br> end;<br> key -><br> case Event of <br> {characters, <<"Track ID">>} -><br> {start, Count + 1};<br> _ -> <br> {start, Count}
<br> end<br> end.</div>
<div> </div>
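<div>To make the state machine easier to try out, here is a minimal sketch that runs the same logic over a hand-written list of events. It is only an illustration: the module name count_tracks, the functions count/1 and demo/0, and the sample event list are made up for this example, and the fold stands in for the parser calling the callback. The callback is written with function clauses instead of nested case expressions, but it should behave the same as the one above.</div>
<pre>
```erlang
-module(count_tracks).
-export([callback/2, count/1, demo/0]).

%% Same state machine as the callback above, written as function clauses.
%% The accumulator is {State, Count}, where State is start or key.
callback({startElement, _, <<"key">>, _, _}, {start, Count}) ->
    %% A "key" element opened: look at the next event.
    {key, Count};
callback(_Event, {start, Count}) ->
    {start, Count};
callback({characters, <<"Track ID">>}, {key, Count}) ->
    %% The key's text is "Track ID": count one song.
    {start, Count + 1};
callback(_Event, {key, Count}) ->
    {start, Count}.

%% Hypothetical driver: folds the callback over a list of events,
%% standing in for the parser (this is not part of erlsom).
count(Events) ->
    {_State, Count} = lists:foldl(fun callback/2, {start, 0}, Events),
    Count.

%% Made-up sample events in the shape shown earlier in this message.
demo() ->
    Events = [
        {startElement, <<>>, <<"dict">>, <<>>, []},
        {startElement, <<>>, <<"key">>, <<>>, []},
        {characters, <<"Track ID">>},
        {startElement, <<>>, <<"key">>, <<>>, []},
        {characters, <<"Name">>},
        {startElement, <<>>, <<"key">>, <<>>, []},
        {characters, <<"Track ID">>}
    ],
    count(Events).
```
</pre>
<div>After compiling the module, count_tracks:demo() folds the seven sample events starting from {start, 0}; two of the three "key" elements carry the text "Track ID".</div>
<div> </div>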
<div>I'll refine the parser a little bit, and I'll see if I can combine it with the rest of Erlsom. I'll include it in the next release.</div>
<div> </div>
<div> </div>
<div> </div>
<div><span class="gmail_quote">On 6/23/07, <b class="gmail_sendername">dda</b> <<a href="mailto:headspin@gmail.com">headspin@gmail.com</a>> wrote:</span>
<blockquote class="gmail_quote" style="PADDING-LEFT: 1ex; MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT: #ccc 1px solid">That would be me.<br><br>I was a little nonplussed by your doubts and your [implied] conclusion<br>that if you failed to do it, then it was impossible. What you only
<br>proved is that fiddling a couple of hours with XML parsers in Erlang<br>failed. Sure, been there done that, 6 months ago. Then I switched to<br>other options.<br><br>I haven't explained in depth how I did it, since this is going into a
<br>commercial application, and I am not at liberty to expose the innards<br>of the code, but I can tell you this much: I didn't use an xml parser<br>– existing or else – for this [neither did I in the version used<br>
currently by Dot-Tunes]. I wrote a specific parser for this file<br>format. Since the iTunes Library XML file is machine-produced, the<br>format is extremely regular, and an XML parser has way too much<br>overhead for this task, which is quite simple, really, albeit time
<br>consuming.<br><br>I was myself surprised not only by the speed improvements, but also by<br>the non-linearity of the performance. This is probably a sign that<br>there's room for improvement in my code, but my client deemed the
<br>perfs good enough for now on the 50,000-record file. And when the<br>client's happy, the coder's happy too.<br><br>Many months ago I had asked a question about Erlang and sqlite, and it<br>was related to this problem. Since sqlite is not suited to
<br>multi-threaded tasks, I had to split the process into producing first<br>the sql, and then dump it into an sqlite db [Dot-Tunes uses sqlite as<br>a backend, so I had no choice in the matter].<br><br>I wish I could show more, but then again I care more about my client's
<br>satisfaction than grumbles emitted on a mailing list.<br><br>Cheers.<br><br>--<br>dda aka Didier<br><br><br>On 6/22/07, Willem de Jong <<a href="mailto:w.a.de.jong@gmail.com">w.a.de.jong@gmail.com</a>> wrote:<br>
><br>><br>> It is a strange story. The author claims to have achieved very good results<br>> using Erlang to parse a very big (35Mbyte) XML file (an Itunes Music Library<br>> file). He suggests that he uses lots of processes to do this.
<br>><br>> It made me curious, and I decided to do some tests. I used my 1.7 GHz<br>> laptop with 1GB of memory, running Windows XP.<br>><br>> - Parsing an Itunes file of 4Mbyte takes about 4 seconds with the SAX parser
<br>> that is the basis of Erlsom (if you let the callback function do something<br>> trivial).<br>><br>> - Parsing the file with Erlsom (which validates it against an XSD and<br>> translates it to records) takes about 5 seconds.
<br>><br>> - Parsing the file with Xmerl takes about 8 seconds.<br>><br>> I found an article on parsing the Itunes library using mono<br>> (<a href="http://www.xml.com/pub/a/2004/11/03/itunes.html">http://www.xml.com/pub/a/2004/11/03/itunes.html
</a>). On an<br>> 800MHz powerbook parsing a 2.5Mbyte file apparently took 9 seconds, so I<br>> would say that Erlang doesn't look bad.<br>><br>> Surprisingly, loading the file into Microsoft Internet Explorer takes more
<br>> than a minute...<br>><br>> If things would scale linearly, parsing the 35Mbyte file should take about 40<br>> to 80 seconds, which is about twice as fast as what the author of the blog<br>> claims to have achieved (on another machine, obviously, so comparing these
<br>> figures may not make a lot of sense).<br>><br>> Unfortunately, these tests fail miserably - Erlang crashes. On my machine I<br>> cannot translate a file (binary) of this size to a list. I have to say that
<br>> I was a bit disappointed... Is there a way to fix this?<br>><br>> Willem.<br>><br>><br>> On 6/20/07, Brad Anderson <<a href="mailto:brad@sankatygroup.com">brad@sankatygroup.com</a>> wrote:<br>
> > I came across this blog today...<br>> ><br>> > <a href="http://www.sungnyemun.org/wordpress/?p=323">http://www.sungnyemun.org/wordpress/?p=323</a><br>> ><br>> > BA<br>> > _______________________________________________
<br>> > erlang-questions mailing list<br>> > <a href="mailto:erlang-questions@erlang.org">erlang-questions@erlang.org</a><br>> > <a href="http://www.erlang.org/mailman/listinfo/erlang-questions">http://www.erlang.org/mailman/listinfo/erlang-questions
</a><br>> ><br>><br>></blockquote></div>
<br>