[erlang-questions] Reading large (1GB+) XML files.

Thu Aug 16 19:25:52 CEST 2007

Well, I think I have a decent idea on where to go from here,
I'll report back this weekend with results.

Thanks everyone,

Patrik

On 8/16/07, Joe Armstrong <erlang@REDACTED> wrote:
> After thought - I *won't* be sending you any code :-(
>
> My XML stuff is in the middle of a big re-write - I'll do this first.
>
> I'm trying to make several inter-related XML processing things. I
> don't believe in the one-tool-suits-all approach for manipulating XML.
>
>
>       Parsing XML raises a number of tricky design issues.  What
>       do we want to do with the XML?  Do we have to handle
>       infinite (or at least very large) inputs?  Is all the input
>       available at the time of parsing, or does it arrive in
>       fragmented chunks from a stream?  If the data is streamed, do we
>       want to handle the chunks as they come in, with a re-entrant
>       parser, or do we want to wait until all the chunks have come and
>       then do the parsing?  In the latter case we'll have to pre-scan
>       the data so that we know when to do the parsing.
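
A minimal sketch of the "handle the chunks as they come" option, written in Python only as a neutral illustration (it is not code from this thread, and the function name events_from_chunks is made up here). It assumes the standard library's xml.etree.ElementTree.XMLPullParser, which keeps its state between feed() calls, so a chunk may end in the middle of a tag:

```python
import xml.etree.ElementTree as ET


def events_from_chunks(chunks):
    """Yield (event, tag) pairs from an iterable of XML text chunks.

    The parser keeps its own state between feed() calls, so a chunk may
    end in the middle of a tag or of character data.
    """
    parser = ET.XMLPullParser(events=("start", "end"))

    def drain():
        # Hand out whatever events the parser has completed so far.
        for event, elem in parser.read_events():
            tag = elem.tag
            if event == "end":
                elem.clear()  # drop text and children we no longer need
            yield event, tag

    for chunk in chunks:
        parser.feed(chunk)
        yield from drain()
    parser.close()        # signals end of input; raises on truncated XML
    yield from drain()    # pick up events the parser was still holding


# Chunk boundaries deliberately fall inside tags and text.
chunks = ["<log><en", "try id='1'>hel", "lo</entry></l", "og>"]
events = list(events_from_chunks(chunks))
```

No pre-scan is needed here because the caller never has to know where a record ends; the parser reports each element once it is complete. Cleared elements stay attached to their parent, so a real program would also have to detach them to keep memory flat.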
>
>
>     Given that we've got a tokenizer can we write a parser that
>     works with lists of tokens, or with streams, or do we have to
>     write a number of different parsers to handle the different
>     cases?
>
>     Do we want to write a validating parser, or a non-validating parser?
>     Should it be re-entrant or not?
>
>     Do we want to handle simple ASCII character sets or many different
>     character sets, and is the code very different in the two cases?
>
>     Do we want to *exactly reconstruct* the input or
>     should the parse tree represent the logical equivalent of the
>     input? For example, do we want to pass tag attributes in the
>     same order as they appear in the input? Do we want to exactly
>     retain white space and tabs in places where they are not
>     semantically important?
>
>     These are difficult design questions and it is difficult to
>     write the libraries in such a way that all of these things can be
>     done.  If we write a very general set of routines they will
>     probably not be very fast for a specific purpose. If we write fast
>     specialised routines, they will not be very general.
>
> A lot of XML processing can be done at a token level alone - there is no
> need to even have a well-formed document - here parsing and validating would
> be a waste of time.
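
To illustrate token-level processing, here is a rough Python sketch (an assumption for illustration, not anyone's library code) that counts tags in a document that is not well-formed. The regular expression and the tokenize function are hypothetical and deliberately ignore comments, CDATA sections, processing instructions, and ">" inside attribute values:

```python
import re
from collections import Counter

# One alternative per token kind. Comments, CDATA, processing instructions
# and ">" inside attribute values are NOT handled: this is only a sketch.
TOKEN = re.compile(
    r"<(?P<close>/)?(?P<name>[A-Za-z_][\w:.-]*)[^<>]*?(?P<empty>/)?>"
    r"|(?P<text>[^<]+)"
)


def tokenize(text):
    """Yield (kind, value) tokens without checking well-formedness."""
    for m in TOKEN.finditer(text):
        if m.group("text") is not None:
            yield ("text", m.group("text"))
        elif m.group("close"):
            yield ("close", m.group("name"))
        elif m.group("empty"):
            yield ("empty", m.group("name"))
        else:
            yield ("open", m.group("name"))


# The last <b> is never closed, and neither is <a>; a parser would reject
# this input, but counting tags only needs the tokens.
broken = "<a><b x='1'>hi</b><c/><b>unclosed"
tokens = list(tokenize(broken))
opened = Counter(name for kind, name in tokens if kind in ("open", "empty"))
```

Jobs like counting, renaming, or grepping for tags need nothing more than this, which is why parsing and validating would only add cost.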
>
> Then we have to decide on performance - a set of routines that work correctly
> on GByte files will also work on small files - but if we were only processing
> small files then more efficient algorithms would be possible. Do we have to
> write two sets of routines (for large and small files) and can they
> share common code?
>
> Anyway - I'm trying to make a toolkit that can allow you to manipulate
> a document either as a stream of tokens, or as a well-formed document,
> or as a validated document.
>
> Another question I have is:
>
>    What do you want to do with an infinite document?
>
>    (here infinite means "too big to keep the parse tree in memory in
>    an efficient manner")
>
>    Do you want to:
>
>     a) - produce another infinite document
>     b) - extract a sub-set according to some filter rules
>
>    If it's a), are the things in the output document in the same order
>    as the things in the input document? - I guess both a) and b) would
>    be candidates for some kinds of higher order functions that work on
>    XML parse trees
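
Case b) can be sketched as a higher-order function that takes the filter rule as an argument. This is a Python illustration under stated assumptions, not a proposed API: filter_records and its parameters are made-up names, records are assumed to be direct children of the root element, and the standard library's xml.etree.ElementTree.iterparse does the incremental parsing:

```python
import io
import xml.etree.ElementTree as ET


def filter_records(source, tag, keep):
    """Lazily yield (attributes, text) for each <tag> element that keep() accepts.

    Assumes the records are direct children of the root element, so the
    root can be emptied after every record to keep memory use flat.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            if keep(elem):
                yield dict(elem.attrib), elem.text
            root.clear()  # forget the records seen so far


doc = (
    b"<items>"
    b"<item id='1'>10</item>"
    b"<item id='2'>25</item>"
    b"<item id='3'>40</item>"
    b"</items>"
)
big = list(filter_records(io.BytesIO(doc), "item", lambda e: int(e.text) > 20))
```

Because the function is a generator, its output comes out in input order and only one record is held at a time. Case a) would have the same shape, with a transforming function in place of the predicate.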
>
>    Lots to think about
>
> /Joe
>
>
>
> On 8/15/07, Joe Armstrong <erlang@REDACTED> wrote:
> > Interesting - I've been writing some new XML libraries and handling
> > infinite streams (well, very large ones) is one of the problems I've
> > been thinking about.
> >
> > I'll poke around tomorrow and send you some code that might help
> >
> > /Joe Armstrong
> >
> > On 8/15/07, Patrik Husfloen <husfloen@REDACTED> wrote:
> > > I've been trying to learn erlang for a while, and I recently found
> > > what I thought to be an easy starter project. I currently have a
> > > simple application that reads data from a couple of XML files using
> > > SAX, and inserts it using an RPC over HTTP.
> > >
> > > I'm not sure about the terminology here, I've been stuck in OO land
> > > for so long that everything looks like an object, but here's what I'm
> > > thinking: One thread reading the xmls and piecing together the data,
> > > and then handing off each record to a pool of workers that issue the
> > > http requests, or, maybe the xml-reading part could just spawn a new
> > > thread for each record it reads, and ensure that only X are running at
> > > the most?
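
The "only X running at the most" idea can be sketched as follows. This is a Python illustration with threads; in Erlang the workers would be processes, and nothing here is Erlang API. The names process_bounded, handle, and fake_post are hypothetical, and fake_post merely stands in for the HTTP request:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def process_bounded(records, handle, max_in_flight=4):
    """Run handle(record) for every record, at most max_in_flight at a time.

    The reader blocks on the semaphore, so it never runs far ahead of
    the workers and the records never pile up in memory.
    """
    slots = threading.BoundedSemaphore(max_in_flight)
    lock = threading.Lock()
    results, errors = [], []

    def run(record):
        try:
            value = handle(record)
            with lock:
                results.append(value)
        except Exception as exc:
            with lock:
                errors.append((record, exc))
        finally:
            slots.release()  # free the slot whether or not handle() failed

    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        for record in records:
            slots.acquire()  # wait here while max_in_flight are running
            pool.submit(run, record)
    # Leaving the with-block waits for the remaining workers to finish.
    return results, errors


stats = {"active": 0, "peak": 0}
gate = threading.Lock()


def fake_post(record):
    """Stand-in for the HTTP request; record 5 fails on purpose."""
    with gate:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
    time.sleep(0.01)
    with gate:
        stats["active"] -= 1
    if record == 5:
        raise ValueError("simulated failure")
    return record * 2


results, errors = process_bounded(iter(range(10)), fake_post, max_in_flight=3)
```

Results arrive in completion order, not input order. The fixed pool plus semaphore covers both variants described above: the pool is the "pool of workers", and the semaphore is the cap on how many records are in flight.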
> > >
> > > The HTTP request was easy enough to get working, but I'm having
> > > trouble with reading the XML. I used xmerl_scan:file to parse the
> > > file, but that loads the file into memory before starting to process.
> > >
> > > I took a look at Erlsom and its SAX reader examples, but those read
> > > the entire file into a binary before passing it off to the XML reader.
> > >
> > >
> > > Thanks,
> > >
> > > Patrik