[erlang-questions] Reading large (1GB+) XML files.
Joe Armstrong
erlang@REDACTED
Thu Aug 16 10:17:40 CEST 2007
After thought - I *won't* be sending you any code :-(
My XML stuff is in the middle of a big re-write - I'll do this first.
I'm trying to make several inter-related XML processing tools. I don't
believe in a one-tool-suits-all approach to manipulating XML.
Parsing XML raises a number of tricky design issues. What
do we want to do with the XML? Do we have to handle
infinite (or at least very large) inputs? Is all the input
available at the time of parsing, or does it arrive in
fragmented chunks from a stream? If the data is streamed, do we
want to handle the chunks as they come, in a re-entrant parser,
or do we want to wait until all the chunks have arrived and then
do the parsing? In the latter case we'll have to pre-scan the data so
that we know when to start parsing.
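Roughly, the re-entrant style might look like this (everything here -
the module name, the token shapes, the deliberately naive scanning -
is invented for illustration, not code from any existing library):
feed a chunk, get back the complete tokens, plus a state that
remembers any token cut off at the chunk boundary.

-module(retok).
-export([new/0, tokenize/2]).

%% State: whatever bytes were left over from the previous chunk.
new() -> {partial, <<>>}.

%% Feed one chunk of input; get back the complete tokens it contained
%% and a new state holding any token cut off at the chunk edge.
tokenize(Chunk, {partial, Leftover}) ->
    scan(<<Leftover/binary, Chunk/binary>>, []).

%% Naive scanner: emits {tag, Bin} and {text, Bin} tokens, keeping an
%% incomplete trailing token as leftover for the next call.
scan(Bin, Acc) ->
    case binary:split(Bin, <<"<">>) of
        [Text, Rest] ->
            Acc1 = case Text of
                       <<>> -> Acc;
                       _    -> [{text, Text} | Acc]
                   end,
            case binary:split(Rest, <<">">>) of
                [Tag, Rest1] ->
                    scan(Rest1, [{tag, Tag} | Acc1]);
                [_Incomplete] ->
                    {lists:reverse(Acc1), {partial, <<"<", Rest/binary>>}}
            end;
        [Text] ->
            {lists:reverse(Acc), {partial, Text}}
    end.

Feeding <<"<a>hello<b">> and then <<"ar>bye">> gives {tag,<<"a">>} and
{text,<<"hello">>} from the first call and {tag,<<"bar">>} from the
second, with <<"bye">> held back in case more text follows.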
Given that we've got a tokenizer, can we write a parser that
works with lists of tokens or with streams, or do we have to
write a number of different parsers to handle the different
cases?
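One way to avoid writing a parser per input kind, again sketched with
made-up names and token shapes: hide the token source behind a fun, so
the same parser runs over a list today and a socket tomorrow.

-module(toksrc).
-export([from_list/1, parse/1]).

%% A token source is a fun: Next() -> {Token, NextSource} | eof.
from_list([])       -> fun() -> eof end;
from_list([T | Ts]) -> fun() -> {T, from_list(Ts)} end.

%% Build nested {Tag, Children} tuples from {open,Tag} / {close,Tag} /
%% {text,T} tokens, regardless of where the tokens come from.
parse(Next) ->
    {Nodes, _} = parse_nodes(Next, none),
    Nodes.

parse_nodes(Next, Stop) ->
    case Next() of
        eof ->
            {[], fun() -> eof end};
        {{close, Stop}, Next1} ->                 % end of current element
            {[], Next1};
        {{open, Tag}, Next1} ->
            {Children, Next2} = parse_nodes(Next1, Tag),
            {Rest, Next3} = parse_nodes(Next2, Stop),
            {[{Tag, Children} | Rest], Next3};
        {{text, T}, Next1} ->
            {Rest, Next2} = parse_nodes(Next1, Stop),
            {[{text, T} | Rest], Next2}
    end.

So toksrc:parse(toksrc:from_list([{open,a},{text,<<"hi">>},{close,a}]))
gives [{a,[{text,<<"hi">>}]}], and a from_socket/1 or from_file/1 could
plug into the same parse/1 unchanged.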
Do we want to write a validating parser, or a non-validating parser?
Should it be re-entrant or not?
Do we want to handle simple ASCII text only, or many different
character sets, and is the code very different in the two cases?
Do we want to *exactly reconstruct* the input, or
should the parse tree represent the logical equivalent of the
input? For example, do we want to preserve tag attributes in the
same order as they appear in the input? Do we want to exactly
retain white space and tabs in places where they are not
semantically important?
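To make that concrete, here are two made-up term shapes for the same
element, <p b="2"  a="1"> (nothing here comes from a real library):

%% Exact: remembers attribute order and the literal whitespace, so the
%% original bytes can be regenerated.
Exact   = {open, <<"p">>, [{<<"b">>,<<"2">>}, {<<"a">>,<<"1">>}], <<"  ">>}.
%% Logical: attributes sorted, insignificant whitespace dropped, so two
%% differently-written but equivalent inputs map to the same term.
Logical = {open, <<"p">>, [{<<"a">>,<<"1">>}, {<<"b">>,<<"2">>}]}.

Only the exact form can reproduce the input byte-for-byte; the logical
form makes equivalent inputs compare equal.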
These are hard design questions, and it is difficult to
write the libraries in such a way that all of these things can be
done. If we write a very general set of routines, they will
probably not be very fast for a specific purpose. If we write fast
specialised routines, they will not be very general.
A lot of XML processing can be done at the token level alone - there is no
need even to have a well-formed document - here parsing and validating would
be a waste of time.
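For example (same invented token shapes as above), pulling the text of
every <title> out of a token list needs no tree at all - a stray
unmatched tag elsewhere in the document doesn't bother us:

-module(toktitles).
-export([titles/1]).

%% Collect the text of every plain-text <title> element, in order,
%% without building or validating anything.
titles(Tokens) -> titles(Tokens, []).

titles([{open, <<"title">>}, {text, T}, {close, <<"title">>} | Rest], Acc) ->
    titles(Rest, [T | Acc]);
titles([_ | Rest], Acc) ->
    titles(Rest, Acc);
titles([], Acc) ->
    lists:reverse(Acc).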
Then we have to decide on performance - a set of routines that work correctly
on GByte files will also work on small files - but if we were only processing
small files then more efficient algorithms would be possible. Do we have to
write two sets of routines (for large and small files), and can they
share common code?
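A rough sketch of how the two cases might share code, reusing the
hypothetical re-entrant tokenizer from earlier: the small-file driver
reads everything at once, the large-file driver pushes fixed-size
chunks through the same interface.

-module(drivers).
-export([tokens_small/1, tokens_large/3]).

%% Small files: one read, one tokenize call, a list of tokens back.
tokens_small(File) ->
    {ok, Bin} = file:read_file(File),
    {Tokens, _State} = retok:tokenize(Bin, retok:new()),
    Tokens.

%% Large files: stream fixed-size chunks through the same tokenizer,
%% handing each token to Fun instead of accumulating them.
tokens_large(File, ChunkSize, Fun) ->
    {ok, Fd} = file:open(File, [read, raw, binary]),
    loop(Fd, ChunkSize, retok:new(), Fun),
    file:close(Fd).

loop(Fd, Size, State, Fun) ->
    case file:read(Fd, Size) of
        {ok, Chunk} ->
            {Tokens, State1} = retok:tokenize(Chunk, State),
            lists:foreach(Fun, Tokens),
            loop(Fd, Size, State1, Fun);
        eof ->
            ok
    end.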
Anyway - I'm trying to make a toolkit that allows you to manipulate a
document either as a stream of tokens, as a well-formed document, or
as a validated document.
Another question I have is:
What do you want to do with an infinite document?
(here infinite means "too big to keep the parse tree in memory in
an efficient manner")
Do you want to:
a) - produce another infinite document
b) - extract a sub-set according to some filter rules
If it's a), are the things in the output document in the same order
as the things in the input document? I guess both a) and b) would be
candidates for some kinds of higher-order functions that work on XML
parse trees.
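As a sketch of what those higher-order functions might look like over
the fun-based token source from earlier (invented names throughout):
(a) stays lazy, so the output can be as "infinite" as the input;
(b) as written collects a list, so it assumes the filtered subset
fits in memory even when the whole input does not.

-module(hof).
-export([map_stream/2, filter_stream/2]).

%% (a) Produce another "infinite" document: a lazy map over a token
%% source. Nothing is computed until somebody calls the returned fun.
map_stream(F, Next) ->
    fun() ->
            case Next() of
                eof          -> eof;
                {Tok, Next1} -> {F(Tok), map_stream(F, Next1)}
            end
    end.

%% (b) Extract a sub-set according to a filter rule, preserving the
%% input order.
filter_stream(Pred, Next) ->
    case Next() of
        eof -> [];
        {Tok, Next1} ->
            case Pred(Tok) of
                true  -> [Tok | filter_stream(Pred, Next1)];
                false -> filter_stream(Pred, Next1)
            end
    end.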
Lots to think about
/Joe
On 8/15/07, Joe Armstrong <erlang@REDACTED> wrote:
> Interesting - I've been writing some new XML libraries, and handling
> infinite streams (well, very large ones) is one of the problems I've
> been thinking about.
>
> I'll poke around tomorrow and send you some code that might help
>
> /Joe Armstrong
>
> On 8/15/07, Patrik Husfloen <husfloen@REDACTED> wrote:
> > I've been trying to learn Erlang for a while, and I recently found
> > what I thought would be an easy starter project. I currently have a
> > simple application that reads data from a couple of XML files using
> > SAX, and inserts the data using an RPC over HTTP.
> >
> > I'm not sure about the terminology here - I've been stuck in OO land
> > for so long that everything looks like an object - but here's what I'm
> > thinking: one thread reading the XML files and piecing together the
> > data, and then handing off each record to a pool of workers that issue
> > the HTTP requests; or maybe the XML-reading part could just spawn a
> > new thread for each record it reads, and ensure that at most X are
> > running at any time?
> >
> > The HTTP request was easy enough to get working, but I'm having
> > trouble with reading the XML. I used xmerl_scan:file to parse the
> > file, but that loads the whole file into memory before starting to
> > process it.
> >
> > I took a look at Erlsom and its SAX reader examples, but those read
> > the entire file into a binary before passing it off to the XML reader.
> >
> >
> > Thanks,
> >
> > Patrik
> > _______________________________________________
> > erlang-questions mailing list
> > erlang-questions@REDACTED
> > http://www.erlang.org/mailman/listinfo/erlang-questions
> >
>
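For what it's worth, the pooling half of Patrik's question is
independent of the XML library. A minimal sketch (the record source
and the HTTP-posting fun are assumed, not any real SAX API): a reader
applies a fun to each record in its own process, but never lets more
than Max run at once.

-module(pool).
-export([run/3]).

%% run(Records, Max, Fun): apply Fun to each record in its own
%% process, with at most Max workers alive at any time.
run(Records, Max, Fun) ->
    run(Records, Max, Fun, 0).

run([R | Rs], Max, Fun, Busy) when Busy < Max ->
    Parent = self(),
    spawn_link(fun() -> Fun(R), Parent ! done end),
    run(Rs, Max, Fun, Busy + 1);
run([_ | _] = Rs, Max, Fun, Busy) ->      % pool full: wait for a worker
    receive done -> run(Rs, Max, Fun, Busy - 1) end;
run([], _Max, _Fun, 0) ->
    ok;
run([], Max, Fun, Busy) ->                % drain the remaining workers
    receive done -> run([], Max, Fun, Busy - 1) end.

Something like pool:run(Records, 8, fun post_record/1), where
post_record/1 is a placeholder for the HTTP call, would keep eight
requests in flight at most.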