[erlang-questions] Reading large (1GB+) XML files.

Wed Aug 15 22:15:39 CEST 2007

On Aug 15, 2007, at 11:23 , Patrik Husfloen wrote:

> I'm not sure about the terminology here, I've been stuck in OO land
> for so long that everything looks like an object, but here's what I'm
> thinking: One thread reading the xmls and piecing together the data,
> and then handing off each record to a pool of workers that issue the
> http requests, or, maybe the xml-reading part could just spawn a new
> thread for each record it reads, and ensure that only X are running at
> the most?

	This sounds very similar to the design of my load replay tool.  I've  
got a tool that reads a pcap file and writes out a binary file that I  
suppose is conceptually similar to XML.  The playback tool reads that  
file and issues HTTP requests with the same types of payload (some  
contents rewritten for validity on playback) with the same timings  
(to whatever scale is desirable) and logs the results.  It works like  
this:

	1)  There's an overseer process that starts all of the other  
processes and facilitates communication among them.
	2)  One process is responsible for reading the file, sleeping as  
appropriate, and sending records up to the overseer.
	3)  Another process is responsible for performing HTTP requests.  It
receives messages from the overseer, issues an asynchronous HTTP
request through inets, and stores the pending request in a dict along
with its start time.  When a response comes back from inets, it looks
up the request and sends the timing, request, and result back up to
the overseer.
	4)  The logging process figures out what the request meant, on
whose behalf it was sent, and a few other details, and logs them.
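	To make #3 a bit more concrete, here is a minimal sketch of such a
requestor, not the actual code from my tool.  The module name, the
message shapes ({request, Url} in, {result, Url, Elapsed, Result} out)
and the use of plain GETs are all assumptions made for illustration.
It uses httpc from inets (called http in older OTP releases), which
delivers asynchronous responses as {http, {RequestId, Result}}
messages.

```erlang
-module(requestor).
-export([start/1]).

%% Hypothetical requestor process. Overseer is the pid that receives
%% {result, Url, ElapsedMs, Result} once a response (or error) arrives.
start(Overseer) ->
    spawn(fun() ->
        {ok, _} = application:ensure_all_started(inets),
        loop(Overseer, dict:new())
    end).

loop(Overseer, Pending) ->
    receive
        {request, Url} ->
            Start = erlang:monotonic_time(millisecond),
            %% {sync, false} returns immediately with a request id;
            %% the response is sent to this process as a message later.
            {ok, ReqId} = httpc:request(get, {Url, []}, [], [{sync, false}]),
            loop(Overseer, dict:store(ReqId, {Url, Start}, Pending));
        {http, {ReqId, Result}} ->
            case dict:find(ReqId, Pending) of
                {ok, {Url, Start}} ->
                    Elapsed = erlang:monotonic_time(millisecond) - Start,
                    Overseer ! {result, Url, Elapsed, Result},
                    loop(Overseer, dict:erase(ReqId, Pending));
                error ->
                    %% Unknown request id: ignore it and keep going.
                    loop(Overseer, Pending)
            end;
        stop ->
            ok
    end.
```

	Since the requests are asynchronous, a single requestor can have
many of them in flight at once; the dict is what ties each response
back to its request and start time.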

	On startup, I find all available nodes and run one of the requestor  
processes (#3) on each node.  The overseer has a queue of these  
processes and pops the next available requestor off the front, sends  
it a request, and adds it to the back of the queue again.
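	The rotation itself is only a few lines with the stdlib queue
module.  This is a sketch under assumptions: the module and function
names are made up here, and it expects a non-empty list of requestor
pids.

```erlang
-module(rr).
-export([new/1, dispatch/2]).

%% Build the rotation from a non-empty list of requestor pids.
new(Requestors) ->
    queue:from_list(Requestors).

%% Pop the requestor at the front, send it Msg, and put it at the
%% back. Returns the updated queue for the next call.
dispatch(Msg, Q0) ->
    {{value, Pid}, Q1} = queue:out(Q0),
    Pid ! Msg,
    queue:in(Pid, Q1).
```

	The overseer would keep the returned queue in its loop state and
call dispatch/2 once per record it gets from the reader.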

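	If you want to control how many concurrent requests you're
executing, you can issue the requests synchronously and use a process
queue like the one described above.  Each requestor then handles one
request at a time, so the number of requestors in the queue caps the
concurrency.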
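	A rough sketch of that idea follows.  It is a variation on the
rotation above rather than what my tool does: workers are only put
back on the queue once they report that they are done, so the
dispatcher waits whenever all N are busy.  The names (bounded, run/3)
and the message shapes are hypothetical, and Work stands for whatever
synchronous call you want to make per record.

```erlang
-module(bounded).
-export([run/3]).

%% Apply Work to every item with at most N calls running at once.
%% Results come back in completion order, not input order.
run(Work, Items, N) when N >= 1 ->
    Parent = self(),
    Workers = [spawn_link(fun() -> worker(Parent, Work) end)
               || _ <- lists:seq(1, N)],
    Results = dispatch(Items, queue:from_list(Workers), 0, []),
    [W ! stop || W <- Workers],
    Results.

%% Nothing left to hand out and nothing in flight: finished.
dispatch([], _Idle, 0, Acc) ->
    lists:reverse(Acc);
dispatch([Item | Rest], Idle, Busy, Acc) ->
    case queue:out(Idle) of
        {{value, W}, Idle1} ->
            W ! {work, Item},
            dispatch(Rest, Idle1, Busy + 1, Acc);
        {empty, _} ->
            %% Every worker is busy: wait for one to finish first.
            receive
                {done, W, Res} ->
                    dispatch([Item | Rest], queue:in(W, Idle),
                             Busy - 1, [Res | Acc])
            end
    end;
%% All items handed out: drain the remaining results.
dispatch([], Idle, Busy, Acc) ->
    receive
        {done, W, Res} ->
            dispatch([], queue:in(W, Idle), Busy - 1, [Res | Acc])
    end.

%% A worker runs one synchronous call at a time, then checks back in.
worker(Parent, Work) ->
    receive
        {work, Item} ->
            Parent ! {done, self(), Work(Item)},
            worker(Parent, Work);
        stop ->
            ok
    end.
```

	For HTTP, Work could be something like
fun(Url) -> httpc:request(Url) end, which blocks the worker until the
response arrives.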

-- 
Dustin Sallings