big erlang web solution

Mon Jun 26 08:32:11 CEST 2006

On Jun 26, 2006, at 1:56 AM, Yariv Sadan wrote:

>> > 2. I will want to use a distributed filesystem instead of a
>> > database to access data. Any good recommendations in this context?
>> > A database will still probably be used to hold pointers to specific
>> > data, but the data will be in the filesystem.
>>
>> I have similar needs in the near future.  My desire is to use Mnesia
>> for my main OLTP data store and have a filesystem store for
>> documents.  I need full text search and easy admin for this file
>> system document store.  If you find a product that does this, please
>> let me know.
>>
>
> Hi,
>
> Just out of curiosity, what kind of characteristics are you looking
> for in a distributed file system?

My needs are a little less "distributed" and more full text search  
and splitting of data between "semantic+OLTP" and "doc search":
(1) store documents in HTML or other markup (Markdown) in utf-8.   
Documents will fall into different types or groups such as "service  
description", "service results", "account details", etc...  For  
example, a document of type "service description" might be a simple  
resume of a person and his skill set.    There is little or no  
semantic info in the docs.  i.e. the docs are to be referenced by the  
OLTP _and_ to be used in full text search.  The document types are  
simple and do not overlap.  i.e., You can think of each type as a  
directory containing all docs of that type.  No subtypes /  
subdirectories.  No doc belongs to more than one type; i.e. no need  
for file links (sym or hard).
(2) semantic docs - The docs in (1) do not have much semantics internal
to them.  Any semantics are either (a) stored in the OLTP store, which
references the docs, or (b) stored in special attribute areas of the docs,
to be used by tools other than the full text search described here.
(3) full text search and index maintenance - docs must be full text  
searchable by type (which would imply a file directory).  The search
index must update automatically as docs are added/removed/modified.
This part of the system must be zero admin (or close to it).  I don't  
care much about the size of index files; disk space is cheap.  I care  
more about fast performance and low memory footprint of queries,  
usefulness of query results, and low admin of entire search system.
(4) search results must be "google-like".  (a) Results contain enough
highlighted context around the original search terms to let the user know
which items in the results are worth digging into.  (b) Queries have
"continuations", meaning I don't have to retrieve all 100,000
matching results in one chunk just to show the user the first three
pages.  This aspect must have low memory consumption.
(5) distribution - all access to the docs and queries will be through
an erlang node, so distribution can be handled through erlang and we are
a bit free here.  Since full text search will be outside erlang, I expect
that a search will only use the file resources of its local disks
(direct attached or SAN).  No searches spanning slow internet
attached file systems.
(6) access and scaling -
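
To make (4) a bit more concrete, here is a rough sketch of what I mean by
highlighted context plus "continuations".  It is only an illustration in
Python, not a proposal for the real engine: the names search and next_page,
the "**" highlight markers, and the in-memory docs dict (standing in for
one type directory) are all made up for the example.  The point is that the
result set is produced lazily, so showing the first pages never
materializes every match.

```python
import re
from itertools import islice


def search(docs, term, width=30):
    """Lazily yield (doc_id, snippet) for each document containing term.

    docs is a hypothetical stand-in for one type directory:
    a dict mapping document id to its text.
    """
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    for doc_id, text in docs.items():
        match = pattern.search(text)
        if match is None:
            continue
        # Keep up to `width` characters of context on each side of the hit.
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        snippet = (
            text[start:match.start()]
            + "**" + match.group(0) + "**"
            + text[match.end():end]
        )
        yield doc_id, snippet


def next_page(results, page_size):
    """Pull only the next page from the result generator (the continuation)."""
    return list(islice(results, page_size))


docs = {
    "resume-001": "Senior Erlang developer with telecom experience.",
    "resume-002": "Java programmer, some Python.",
    "resume-003": "Built Yaws sites in Erlang for five years.",
    "resume-004": "Erlang and Mnesia administration.",
}

results = search(docs, "erlang")   # nothing is scanned yet
page1 = next_page(results, 2)      # scans only until two hits are found
page2 = next_page(results, 2)      # resumes where page1 stopped
```

A real engine would resume from an index cursor rather than a generator
held in memory, but the contract seen by the erlang side would be similar:
ask for the next N results, get back N snippets plus something to continue
from.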

To get a physical picture of what I plan to implement, imagine:
(A) 2 x erlang+yaws+mnesia servers each with power and disk (RAID 1)  
redundancy.  This is a load balanced and failover config...so each  
server has the full capabilities of the other with one being the lead  
server for certain shared info.
(B) 2 x file servers+full text search.  This can be either a shared  
RAID 1+0 file system or two separate RAID 0 file systems.  These  
servers are only queryable via A, the erlang servers.

> I'm in the initial stages of
> building a web application that needs to store files on disc (not huge
> files, probably only images), and my planned approach at the moment is
> to store file metadata and pointer to physical location in a
> distributed database -- either MySQL or Mnesia (I prefer Mnesia, but
> at the moment, it looks like I'm going to have to use MySQL to store
> at least some of my data, so I might end up using MySQL exclusively --

Any reasons why MySQL over PostgreSQL?  If it's due to erlang access,
I know that ejabberd accesses postgresql so there is a working native  
driver for it.

> I haven't made up my mind yet). What's the primary motivation for
> using a distributed file system over a database-backed approach?
>

Mainly the full text search and having no need to store docs or blobs  
in a DB.  Also, over time, I want to write other utils besides full
text search that utilize the docs and I think referencing them via a  
file system gives me more options than having docs stored as fields  
in a DB.
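
As a rough sketch of the split I have in mind (pointers in the DB, docs on
disk, one directory per type), something like the following.  It is
Python with sqlite3 purely as a stand-in for Mnesia or an SQL server, and
the table layout and the store_doc / load_doc names are assumptions made up
for the example.

```python
import sqlite3
import tempfile
from pathlib import Path


def store_doc(db, root, doc_type, name, text):
    """Write the document under <root>/<doc_type>/ and record its path."""
    type_dir = Path(root) / doc_type
    type_dir.mkdir(parents=True, exist_ok=True)
    path = type_dir / name
    path.write_text(text, encoding="utf-8")
    # The database holds only the pointer, never the document body.
    db.execute(
        "INSERT INTO docs (doc_type, name, path) VALUES (?, ?, ?)",
        (doc_type, name, str(path)),
    )
    db.commit()
    return path


def load_doc(db, doc_type, name):
    """Look up the pointer in the database, then read the file it names."""
    row = db.execute(
        "SELECT path FROM docs WHERE doc_type = ? AND name = ?",
        (doc_type, name),
    ).fetchone()
    if row is None:
        return None
    return Path(row[0]).read_text(encoding="utf-8")


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE docs ("
    "doc_type TEXT, name TEXT, path TEXT, "
    "PRIMARY KEY (doc_type, name))"
)
root = tempfile.mkdtemp()  # hypothetical document root for the example

store_doc(db, root, "service description", "resume-001.html",
          "<p>Erlang developer</p>")
text = load_doc(db, "service description", "resume-001.html")
```

Because the files are ordinary files in ordinary directories, a full text
indexer or any later utility can walk them directly without going through
the DB at all.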

Thanks for taking the time to brainstorm on this topic.  I need people
who can share their experiences with various tools.
ke han

> Best regards,
> Yariv