[erlang-questions] High volume CDR analysis
Tue Jan 22 09:39:41 CET 2008
What you also need to tell us is how independent the data is. Do you
need access to all the data in the same name space in order to analyze
it?
Suppose, for example, I wanted to keep records of billing data, where
the key is a person's name.
This data is independent - i.e. Joe's bills don't depend upon Ukyo's
bills.
This kind of computation is easily distributed and scales nicely. If I
have N machines then I could keep my billing data on machine K, where
K = md5("joe armstrong") mod N. For fault tolerance I might keep a
replica on machine (K + N/2) mod N.
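
A minimal sketch of that placement rule, written in Python purely for
illustration. The function names primary_node and replica_node are
hypothetical, the MD5 digest is read as a big-endian integer, and N/2
is assumed to mean integer division:

```python
import hashlib


def primary_node(key: str, n: int) -> int:
    """Return K = md5(key) mod N, the machine that owns this key."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Interpret the 16-byte digest as one large unsigned integer.
    return int.from_bytes(digest, "big") % n


def replica_node(key: str, n: int) -> int:
    """Return (K + N/2) mod N, the machine holding the replica.

    Assumes integer division for N/2. With n == 1 the replica
    necessarily lands on the same machine as the primary.
    """
    return (primary_node(key, n) + n // 2) % n


if __name__ == "__main__":
    machines = 8
    k = primary_node("joe armstrong", machines)
    r = replica_node("joe armstrong", machines)
    print("primary:", k, "replica:", r)
```

With two or more machines the replica always lands on a different
machine from the primary, since the offset N/2 is between 1 and N-1.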
For partitionable data nothing fancy is required - distributed Erlang
on a cluster would be fine.
How well your solution scales depends upon how well you can partition
your problem into independent tasks that can be performed in parallel.
It's got nothing to do with programming language. It's *easier* to
implement a distributed system in Erlang (since you get all the
plumbing for free), but Erlang won't save you from a bad architecture.
"Put everything in a huge database" is often a very bad idea. Much
better is to think about what data you need and where you will place
it, and to try to make sure you can move the data and access it in a
reasonably efficient way.
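
To make the partitioning idea concrete, here is a small sketch in
Python (for illustration only, not a proposal for the real system). It
assumes a CDR can be reduced to a (caller, seconds) pair, which is a
made-up record shape, and it uses a thread pool as a stand-in for
separate machines. All function names are hypothetical:

```python
import hashlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition_of(key, n):
    """Pick a partition for a key using md5(key) mod n."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % n


def split_records(records, n):
    """Route each (caller, seconds) record to its caller's partition."""
    parts = [[] for _ in range(n)]
    for caller, seconds in records:
        parts[partition_of(caller, n)].append((caller, seconds))
    return parts


def total_seconds(part):
    """Independent task: sum call time per caller within one partition."""
    totals = defaultdict(int)
    for caller, seconds in part:
        totals[caller] += seconds
    return dict(totals)


def report(records, n=4):
    """Run one task per partition in parallel, then merge the results."""
    parts = split_records(records, n)
    with ThreadPoolExecutor(max_workers=n) as pool:
        partials = list(pool.map(total_seconds, parts))
    merged = {}
    for partial in partials:
        # A caller lives in exactly one partition, so keys never collide.
        merged.update(partial)
    return merged
```

Because every record for a given caller ends up in the same partition,
the per-partition tasks never need to talk to each other, and the final
merge is a plain union of disjoint results.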
On Jan 22, 2008 8:39 AM, Christian S <chsu79@REDACTED> wrote:
> On Jan 21, 2008 5:53 PM, Ukyo Virgden <listproc@REDACTED> wrote:
> > Hi Christian,
> > You're right. What I'm imagining is to collect call detail records
> > from several telco equipment and periodically create reports. At this
> > moment I'm not thinking about real-time (by realtime I mean as-it-
> > happens) reports.
> > So this basically means, collect input data in parallel, apply some
> > transformation, store in mnesia and run a job to create reports.
> > Therefore, the only data I need to store is for only one period,
> > which is 100-300 million records of input.
> > Any suggestions? I suppose there is a storage limit of 2gig for
> > mnesia per node right?
> There is a storage limit per disk table, i.e. you could create and
> direct logging to a new table before the current one has time to
> grow full.
> I believe you would be better off recording these CDRs to flat files,
> perhaps using the disk_log library, which can give you copies on
> multiple nodes.
> The latter would be a very primitive form of column-based table
> representation.