big erlang web solution
ke han
ke.han@REDACTED
Mon Jun 26 08:32:11 CEST 2006
On Jun 26, 2006, at 1:56 AM, Yariv Sadan wrote:
>> > 2. I will want to use a distributed filesystem instead of a
>> > database to access data. Any good recommendations in this context?
>> > A database will still probably be used to hold pointers to
>> > specific data, but the data will be in the filesystem.
>>
>> I have similar needs in the near future. My desire is to use Mnesia
>> for my main OLTP data store and have a filesystem store for
>> documents. I need full text search and easy admin for this file
>> system document store. If you find a product that does this, please
>> let me know.
>>
>
> Hi,
>
> Just out of curiosity, what kind of characteristics are you looking
> for in a distributed file system?
My needs are a little less "distributed" and more full text search
and splitting of data between "semantic+OLTP" and "doc search":
(1) store documents in HTML or other markup (Markdown) in utf-8.
Documents will fall into different types or groups such as "service
description", "service results", "account details", etc... For
example, a document of type "service description" might be a simple
resume of a person and his skill set. There is little or no
semantic info in the docs; i.e., the docs are to be referenced by the
OLTP _and_ to be used in full text search. The document types are
simple and do not overlap; i.e., you can think of each type as a
directory containing all docs of that type. No subtypes /
subdirectories. No doc belongs to more than one type, so there is no
need for file links (sym or hard).
(2) semantic docs - The docs in 1 do not have much semantics internal
to them. Any semantics are either (a) stored in the OLTP, which
references the docs, or (b) stored in special attribute areas of the
docs to be used by tools other than the full text search described here.
(3) full text search and index maintenance - docs must be full text
searchable by type (which would imply a file directory per type).
Indexing must be automatic as docs are added/removed/modified.
This part of the system must be zero admin (or close to it). I don't
care much about the size of index files; disk space is cheap. I care
more about fast performance and low memory footprint of queries,
usefulness of query results, and low admin of the entire search system.
(4) search results must be "google-like". (a) results contain enough
highlighted context around the search terms to let the user know which
items in the results are worth digging into. (b) queries have
"continuations", meaning I don't have to retrieve all 100,000
matching results in one chunk just to show the user the first three
pages. This aspect must have low memory consumption.
(5) distribution - all access to the docs and queries will be through
an erlang node, so distribution can be handled by erlang and we are a
bit free here. I expect that full text search will be outside erlang
and that a search will only use the file resources of its local
disks (direct attached or SAN). No searches spanning slow internet
attached file systems.
(6) access and scaling -
To get a physical picture of what I plan to implement, imagine:
(A) 2 x erlang+yaws+mnesia servers each with power and disk (RAID 1)
redundancy. This is a load balanced and failover config...so each
server has the full capabilities of the other with one being the lead
server for certain shared info.
(B) 2 x file servers+full text search. This can be either a shared
RAID 1+0 file system or two separate RAID 0 file systems. These
servers are only queryable via A, the erlang servers.
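
To make requirements (3) and (4) concrete, here is a minimal sketch in
Python of a per-type index with automatic maintenance, highlighted
snippets, and continuation-style paging. It is only an illustration under
stated assumptions, not a recommendation of a product: the names
TypeIndex, put, remove, search, and snippet are hypothetical, the index
lives in memory, and only single-word queries are handled.

```python
import bisect
import re
from collections import defaultdict

WORD = re.compile(r"\w+")


def terms(text):
    """Lowercased set of word tokens in a document."""
    return set(WORD.findall(text.lower()))


def snippet(text, term, width=30):
    """First whole-word match of term with context; the match is wrapped in **."""
    m = re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE)
    if m is None:
        return ""
    start = max(0, m.start() - width)
    end = min(len(text), m.end() + width)
    return text[start:m.start()] + "**" + m.group(0) + "**" + text[m.end():end]


class TypeIndex:
    """Toy inverted index; one instance per document type (hypothetical name)."""

    def __init__(self):
        self.docs = {}                    # doc_id -> text
        self.postings = defaultdict(set)  # term -> set of doc_ids

    def put(self, doc_id, text):
        """Add or replace a document; the index is updated in the same call."""
        self.remove(doc_id)
        self.docs[doc_id] = text
        for term in terms(text):
            self.postings[term].add(doc_id)

    def remove(self, doc_id):
        """Drop a document and its postings; unknown ids are ignored."""
        text = self.docs.pop(doc_id, None)
        if text is None:
            return
        for term in terms(text):
            self.postings[term].discard(doc_id)

    def search(self, term, page_size=10, cont=None):
        """Return (hits, continuation) for a single-word query.

        cont is the last doc_id of the previous page, or None for the first
        page. The returned continuation is None when no results remain.
        """
        term = term.lower()
        ids = sorted(self.postings.get(term, ()))
        start = 0 if cont is None else bisect.bisect_right(ids, cont)
        page = ids[start:start + page_size]
        more = start + page_size < len(ids)
        hits = [(doc_id, snippet(self.docs[doc_id], term)) for doc_id in page]
        return hits, (page[-1] if more else None)
```

One TypeIndex per document type (for example a dict keyed by "service
description", "account details", and so on) mirrors the one directory
per type layout. Note that this toy version sorts every matching id on
each call, so it does not meet the low memory goal for 100,000 hits; a
real engine would keep its postings ordered on disk and seek to the
continuation point.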
> I'm in the initial stages of
> building a web application that needs to store files on disc (not huge
> files, probably only images), and my planned approach at the moment is
> to store file metadata and pointer to physical location in a
> distributed database -- either MySQL or Mnesia (I prefer Mnesia, but
> at the moment, it looks like I'm going to have to use MySQL to store
> at least some of my data, so I might end up using MySQL exclusively --
Any reasons why MySQL over PostgreSQL? If it's due to erlang access,
I know that ejabberd accesses PostgreSQL, so there is a working native
driver for it.
> I haven't made up my mind yet). What's the primary motivation for
> using a distributed file system over a database-backed approach?
>
Mainly the full text search and having no need to store docs or blobs
in a DB. Also, over time, I want to write other utilities besides full
text search that use the docs, and I think referencing them via a
file system gives me more options than having docs stored as fields
in a DB.
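
As a rough sketch of that split (metadata and a pointer in the database,
the document itself on disk), assuming Python with sqlite3 standing in
for Mnesia or MySQL, and with hypothetical names open_store, save_doc,
and load_doc:

```python
import sqlite3
from pathlib import Path


def open_store(db_path, root):
    """Open the metadata DB and make sure the document root directory exists."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS docs ("
        "id TEXT PRIMARY KEY, doc_type TEXT NOT NULL, path TEXT NOT NULL)"
    )
    return db, root


def save_doc(db, root, doc_id, doc_type, text):
    """Write the document as UTF-8 under root/<doc_type>/ and record its path."""
    type_dir = root / doc_type
    type_dir.mkdir(exist_ok=True)
    path = type_dir / (doc_id + ".html")
    path.write_text(text, encoding="utf-8")
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT OR REPLACE INTO docs (id, doc_type, path) VALUES (?, ?, ?)",
            (doc_id, doc_type, str(path)),
        )
    return path


def load_doc(db, doc_id):
    """Follow the pointer stored in the DB and read the document from disk."""
    row = db.execute("SELECT path FROM docs WHERE id = ?", (doc_id,)).fetchone()
    if row is None:
        return None
    return Path(row[0]).read_text(encoding="utf-8")
```

Because the documents are ordinary files, any other tool (a full text
indexer, grep, a backup job) can work on them without going through the
database.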
Thanks for taking the time to brainstorm on this topic. I'm looking for
people who can share their experiences with various tools.
ke han
> Best regards,
> Yariv