[erlang-questions] html parsing in erlang?
Wed Jan 20 23:53:11 CET 2010
On Wed, Jan 20, 2010 at 4:06 PM, Carlo Cabanilla wrote:
> On Wed, Jan 20, 2010 at 2:23 PM, Garrett Smith wrote:
>> On Wed, Jan 20, 2010 at 7:41 AM, Carlo Cabanilla wrote:
>> If your application can process the web content in batch, or using a
>> disk based queue/spool, you could use this:
>> - Grab + parse web content in Python
>> - Dump your output (presumably trees, maps, etc.) to an Erlang term
>> (see the erl_term module in py-interface,
>> http://www.lysator.liu.se/~tab/erlang/py_interface/) or BERT
>> - Read the terms on disk from Erlang
>> To avoid the intermediary phase of writing to disk, you could set up
>> your Python app as a port, which I've found to work very well.
> I was actually considering this design, have you implemented this before?
> What's the overhead for the serialization/deserialization over the wire?
It's an OS pipe, so there's no socket I/O. If you're encoding Erlang
terms, the overhead of serializing Python data should be relatively
low compared to JSON, XML, etc.
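
To give a feel for how little work the encoding is, here is a minimal sketch of a hand-rolled encoder for a small subset of the Erlang external term format. It is not the erl_term module from py-interface or BERT; the function names encode_term and _encode are made up for illustration, and it only covers small integers, binaries, tuples and proper lists. Python strings are sent as binaries, which is an assumption about what the Erlang side wants.

```python
import struct


def encode_term(value):
    """Encode a Python value in the Erlang external term format (subset)."""
    return b"\x83" + _encode(value)  # 131 is the format version byte


def _encode(value):
    # Hypothetical helper: maps a few Python types onto Erlang term tags.
    if isinstance(value, bool):
        raise TypeError("booleans are not handled in this sketch")
    if isinstance(value, int):
        if 0 <= value <= 255:
            return struct.pack(">BB", 97, value)   # SMALL_INTEGER_EXT
        if -2**31 <= value < 2**31:
            return struct.pack(">Bi", 98, value)   # INTEGER_EXT
        raise ValueError("big integers are not handled in this sketch")
    if isinstance(value, str):
        value = value.encode("utf-8")              # assumption: str -> binary
    if isinstance(value, bytes):
        return struct.pack(">BI", 109, len(value)) + value  # BINARY_EXT
    if isinstance(value, tuple):
        if len(value) > 255:
            raise ValueError("large tuples are not handled in this sketch")
        body = b"".join(_encode(item) for item in value)
        return struct.pack(">BB", 104, len(value)) + body   # SMALL_TUPLE_EXT
    if isinstance(value, list):
        if not value:
            return b"\x6a"                         # NIL_EXT (empty list)
        body = b"".join(_encode(item) for item in value)
        # LIST_EXT: element count, elements, then a NIL_EXT tail
        return struct.pack(">BI", 108, len(value)) + body + b"\x6a"
    raise TypeError("unsupported type: %r" % type(value))
```

On the Erlang side, bytes produced this way should be readable with binary_to_term/1. For anything beyond a toy, an existing library is the safer choice.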
It's tempting to over-architect for performance, but your app may be
very happy even with something that's not terribly efficient. If
you're crunching through a lot of web content, lxml is probably a
better option (performance-wise) than BeautifulSoup. If you can pare
your data down by pre-processing it in Python, you can minimize the
work in serializing/deserializing over the port.
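
As an illustration of paring the data down before it crosses the port, here is a rough sketch that keeps only the link targets from a page instead of shipping the whole tree. It uses the standard library's html.parser as a stand-in so the example has no dependencies; with lxml or BeautifulSoup the idea would be the same. The names LinkExtractor and extract_links are made up for this example.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags and ignores everything else."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def extract_links(html):
    """Return the href values found in the given HTML text, in order."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

The resulting list of strings is a much smaller payload to serialize than a full parse tree.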
Here's a step-by-step that uses Python to implement a simple Erlang
port friendly app:
I've warmed to the port architecture. While I haven't had a ton of VMs
crash from linked-in code (totally endemic in the Python world), the
idea of strict message passing over a pipe certainly feels nice :) To
answer your question, I'm using ports in Python, bash, and Java.
My only qualm about Erlang ports is that it's not hard to orphan your
external processes. You need to be careful to close your app once
stdin is closed, else the OS process will hang around until killed.
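
A minimal sketch of a port loop that avoids the orphan problem by returning as soon as stdin hits end-of-file. It assumes the port is opened on the Erlang side with the {packet, 4} option, so every message carries a 4-byte big-endian length prefix; the names read_exact, serve and handle are made up for illustration.

```python
import struct


def read_exact(stream, n):
    """Read exactly n bytes, or return None if the stream ends first."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:
            return None  # EOF: the other end of the pipe is gone
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def serve(inp, out, handle):
    """Answer length-prefixed requests until the input stream is closed."""
    while True:
        header = read_exact(inp, 4)
        if header is None:
            return  # stdin closed, so stop instead of lingering as an orphan
        (size,) = struct.unpack(">I", header)
        payload = read_exact(inp, size)
        if payload is None:
            return  # stream ended in the middle of a message
        reply = handle(payload)
        out.write(struct.pack(">I", len(reply)) + reply)
        out.flush()  # the pipe is buffered, so push each reply out right away
```

In a real port program this would be called as serve(sys.stdin.buffer, sys.stdout.buffer, handler), with the process exiting once serve returns. On the Erlang side the matching call would be along the lines of open_port({spawn, Cmd}, [binary, {packet, 4}]).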