[erlang-questions] Reading large (1GB+) XML files.

Joe Armstrong erlang@REDACTED
Thu Aug 16 10:17:40 CEST 2007


On second thought - I *won't* be sending you any code :-(

My XML stuff is in the middle of a big re-write - I'll do this first.

I'm trying to make several inter-related XML processing tools. I don't
believe in the one-tool-suits-all approach to manipulating XML.


      Parsing XML raises a number of tricky design issues. What
      do we want to do with the XML? Do we have to handle
      infinite (or at least very large) inputs? Is all the input
      available at the time of parsing, or does it arrive in
      fragmented chunks from a stream? If the data is streamed, do
      we want to handle the chunks as they arrive, in a re-entrant
      parser, or do we want to wait until all the chunks have come
      and then do the parsing? In the latter case we'll have to
      pre-scan the data so that we know when to do the parsing.
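
To make this concrete, here is a minimal sketch of what a re-entrant
tokenizer interface could look like - the module name retok, the
function names and the {tag,_}/{text,_} token shapes are all invented
for illustration, not an existing API. The caller feeds chunks as they
arrive and gets back every complete token plus a continuation holding
the unconsumed bytes:

    -module(retok).
    -export([init/0, tokenise/2]).

    %% The continuation is just the bytes that did not yet
    %% form a complete token.
    init() -> <<>>.

    %% tokenise(Chunk, Cont) -> {Tokens, Cont1}
    tokenise(Chunk, Buffer) ->
        scan(<<Buffer/binary, Chunk/binary>>, []).

    scan(Bin, Acc) ->
        case complete_token(Bin) of
            {Tok, Rest} -> scan(Rest, [Tok | Acc]);
            incomplete  -> {lists:reverse(Acc), Bin}
        end.

    %% One complete token is either "<...>" or text up to the next '<'.
    complete_token(<<$<, _/binary>> = Bin) ->
        case binary:match(Bin, <<">">>) of
            {Pos, 1} ->
                Len = Pos + 1,
                <<Tag:Len/binary, Rest/binary>> = Bin,
                {{tag, Tag}, Rest};
            nomatch ->
                incomplete
        end;
    complete_token(<<>>) ->
        incomplete;
    complete_token(Bin) ->
        case binary:match(Bin, <<"<">>) of
            {Pos, 1} ->
                <<Text:Pos/binary, Rest/binary>> = Bin,
                {{text, Text}, Rest};
            nomatch ->
                incomplete    %% the text may continue in the next chunk
        end.

Each call is independent of where the chunks come from, so the same
code would serve a socket, a file-reading loop, or a one-shot binary.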


    Given that we've got a tokenizer, can we write a parser that
    works with lists of tokens or with streams, or do we have to
    write a number of different parsers to handle the different
    cases?
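
One way to avoid writing several parsers is to hide the token source
behind a single next/1 function. A sketch, again with invented names
and the token shapes from above - one source is a plain list, the
other a process delivering tokens as messages:

    -module(tsrc).
    -export([collect/1]).

    %% next(Source) -> {Token, Source1} | eof
    next({list, [T | Ts]}) -> {T, {list, Ts}};
    next({list, []})       -> eof;
    next({stream, Pid})    ->
        receive                              %% tokens arrive as messages
            {token, T} -> {T, {stream, Pid}};
            eof        -> eof
        end.

    %% collect/1 stands in for a real parser: anything written
    %% against next/1 neither knows nor cares where tokens come from.
    collect(Src) -> collect(Src, []).

    collect(Src, Acc) ->
        case next(Src) of
            eof         -> lists:reverse(Acc);
            {Tok, Src1} -> collect(Src1, [Tok | Acc])
        end.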

    Do we want to write a validating parser, or a non-validating parser?
    Should it be re-entrant or not?

    Do we want to handle simple ASCII character sets or many different
    character sets, and is the code very different in the two cases?

    Do we want to *exactly reconstruct* the input, or should the
    parse tree represent the logical equivalent of the input? For
    example, do we want to keep tag attributes in the same order
    as they appear in the input? Do we want to exactly retain
    white space and tabs in places where they are not semantically
    important?
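
If we decide on exact reconstruction, the tokens have to carry the
detail that a "logical" parser would throw away. A sketch of token
shapes that make unparsing lossless (names invented for illustration):
attributes stay in an ordered list, and semantically unimportant
whitespace gets its own token:

    -module(exact).
    -export([unparse/1]).

    %% Token shapes:
    %%   {open, Name, [{Key, Value}]}  - attribute order preserved
    %%   {close, Name}
    %%   {text, Bin}
    %%   {ws, Bin}                     - insignificant whitespace
    unparse(Tokens) ->
        iolist_to_binary([render(T) || T <- Tokens]).

    render({open, Name, Attrs}) ->
        [$<, Name, [[$\s, K, <<"=\"">>, V, $"] || {K, V} <- Attrs], $>];
    render({close, Name}) -> [<<"</">>, Name, $>];
    render({text, T})     -> T;
    render({ws, W})       -> W.

A tokenizer that sorted the attributes or normalised the whitespace
could still be logically correct, but unparse/1 would no longer give
back the original bytes.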

    These are difficult design questions and it is difficult to
    write the libraries in such a way that all of these things can be
    done.  If we write a very general set of routines they will
    probably not be very fast for a specific purpose. If we write fast
    specialised routines, they will not be very general.

A lot of XML processing can be done at the token level alone - there is
no need even to have a well-formed document - here parsing and
validating would be a waste of time.
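
For example, pulling the character data out of every <title> element
needs nothing but the token stream - no tree at all, and it still works
when the document as a whole is broken. (The element name is made up;
the token shapes are the invented ones from the sketches above.)

    -module(toklevel).
    -export([titles/1]).

    titles(Tokens) -> titles(Tokens, out, []).

    %% Flip between 'out' and 'in' as <title> opens and closes,
    %% collecting any text seen while inside.
    titles([], _, Acc)                              -> lists:reverse(Acc);
    titles([{open, <<"title">>, _} | Ts], out, Acc) -> titles(Ts, in, Acc);
    titles([{close, <<"title">>} | Ts], in, Acc)    -> titles(Ts, out, Acc);
    titles([{text, T} | Ts], in, Acc)               -> titles(Ts, in, [T | Acc]);
    titles([_ | Ts], Mode, Acc)                     -> titles(Ts, Mode, Acc).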

Then we have to decide on performance - a set of routines that works
correctly on GByte files will also work on small files - but if we were
only processing small files then more efficient algorithms would be
possible. Do we have to write two sets of routines (for large and small
files), and can they share common code?
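
They can share quite a lot. A sketch, reusing the invented retok module
from above - the token-level core is identical and only the drivers
differ; file:read_file/1 slurps a small file whole, while file:read/2
keeps memory bounded on GByte files:

    -module(drivers).
    -export([small_file/1, large_file/3]).

    %% Small files: read everything, tokenize once.
    small_file(Path) ->
        {ok, Bin} = file:read_file(Path),
        {Tokens, _Leftover} = retok:tokenise(Bin, retok:init()),
        Tokens.

    %% Large files: fold an event function over the tokens of each
    %% 64 KB chunk, so only the current chunk is ever in memory.
    large_file(Path, EventFun, State0) ->
        {ok, Fd} = file:open(Path, [read, raw, binary]),
        try chunk_loop(Fd, retok:init(), EventFun, State0)
        after file:close(Fd)
        end.

    chunk_loop(Fd, Cont, EventFun, State) ->
        case file:read(Fd, 65536) of
            {ok, Chunk} ->
                {Tokens, Cont1} = retok:tokenise(Chunk, Cont),
                chunk_loop(Fd, Cont1, EventFun,
                           lists:foldl(EventFun, State, Tokens));
            eof ->
                State
        end.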

Anyway - I'm trying to make a toolkit that allows you to manipulate a
document either as a stream of tokens, as a well-formed document, or as
a validated document.
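
The middle level might be no more than a well-formedness check layered
over the token stream - a sketch using the same invented token shapes
as above:

    -module(levels).
    -export([well_formed/1]).

    %% Match each {close, N} against the innermost open element.
    well_formed(Tokens) -> check(Tokens, []).

    check([], [])                          -> true;
    check([], [_ | _])                     -> false;  %% unclosed elements
    check([{open, N, _} | Ts], Stack)      -> check(Ts, [N | Stack]);
    check([{close, N} | Ts], [N | Stack])  -> check(Ts, Stack);
    check([{close, _} | _], _)             -> false;  %% mismatched close
    check([_ | Ts], Stack)                 -> check(Ts, Stack).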

Another question I have is:

   What do you want to do with an infinite document?

   (here infinite means "too big to keep the parse tree in memory
   in an efficient manner")

   Do you want to:

    a) - produce another infinite document
    b) - extract a sub-set according to some filter rules

   If it's a), are the things in the output document in the same order
   as the things in the input document? I guess both a) and b) would be
   candidates for some kinds of higher-order functions that work on XML
   parse trees.
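
For b) the higher-order function might be no more than a filter process
over the token stream - a sketch; the {token,_}/eof message protocol is
invented for illustration:

    -module(inf).
    -export([filter_stream/2]).

    %% Survivors are forwarded the moment they arrive, so neither the
    %% input nor the output document ever exists in memory as a whole,
    %% and the output order is the input order.
    filter_stream(Pred, Out) ->
        receive
            {token, T} ->
                case Pred(T) of
                    true  -> Out ! {token, T};
                    false -> ok
                end,
                filter_stream(Pred, Out);
            eof ->
                Out ! eof
        end.

Replacing the filter with a transformation gives case a), and because
tokens pass straight through, the answer to the ordering question is
"yes" for free.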

   Lots to think about

/Joe



On 8/15/07, Joe Armstrong <erlang@REDACTED> wrote:
> Interesting - I've been writing some new XML libraries, and handling
> infinite (well, very large) streams is one of the problems I've been
> thinking about.
>
> I'll poke around tomorrow and send you some code that might help
>
> /Joe Armstrong
>
> On 8/15/07, Patrik Husfloen <husfloen@REDACTED> wrote:
> > I've been trying to learn Erlang for a while, and I recently found
> > what I thought would be an easy starter project. I currently have a
> > simple application that reads data from a couple of XML files using
> > SAX, and inserts it using an RPC over HTTP.
> >
> > I'm not sure about the terminology here, I've been stuck in OO land
> > for so long that everything looks like an object, but here's what I'm
> > thinking: one thread reading the XML files and piecing together the
> > data, then handing off each record to a pool of workers that issue
> > the HTTP requests - or maybe the XML-reading part could just spawn a
> > new thread for each record it reads, and ensure that only X are
> > running at the most?
> >
> > The HTTP request was easy enough to get working, but I'm having
> > trouble with reading the XML. I used xmerl_scan:file to parse the
> > file, but that loads the whole file into memory before it starts
> > processing.
> >
> > I took a look at Erlsom and its SAX reader examples, but those read
> > the entire file into a binary before passing it to the XML reader.
> >
> >
> > Thanks,
> >
> > Patrik


