[erlang-questions] Twoorl: an open source Twitter clone

Thu Jun 5 00:03:35 CEST 2008

2008/6/5 Scott Lystig Fritchie <fritchie@REDACTED>:
> Steve Davis <steven.charles.davis@REDACTED> wrote:

> As for "message queueing", there may be a misunderstanding over how MQ
> systems typically work: they have producers *and* consumers, and (more
> importantly) consumers actually "consume".  Consuming a queue item
> usually means also deleting it from the queue.  A single Twitter user X
> can have thousands of consumers all trying to consume the same messages,
> but in a typical MQ system, all but the first consumer would find X's
> queue empty.
>
> For one example, see the RabbitMQ FAQ, "Q. How do I archive to RDBMS?".

In case anyone's losing track, I was the one who suggested keeping
tweets in queues essentially forever, and having users retrieve them
from queues without deleting the message from the queue.

I understand how MQ works in normal environments; what I'm suggesting
is that Twitter (and any clones) aren't "normal" once they start to
scale up to many millions of users.
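
To make the "retrieve without deleting" idea concrete, here is a minimal
in-memory sketch in Python.  It is only an illustration of the concept, not
how IBM's MQ, RabbitMQ or Twitter actually work: the RetainedQueue class, its
method names and the per-reader cursor are all assumptions made up for this
example.  Each reader remembers how far it has read, so thousands of readers
can all see the same messages and nobody finds the queue empty.

```python
from collections import defaultdict


class RetainedQueue:
    """Hypothetical append-only queue: reads never delete anything."""

    def __init__(self):
        self._messages = []               # kept indefinitely, oldest first
        self._cursors = defaultdict(int)  # reader name -> next index to read

    def publish(self, message):
        """Store a message at the end of the queue."""
        self._messages.append(message)

    def fetch_new(self, reader):
        """Return the messages this reader has not seen, then advance its cursor."""
        start = self._cursors[reader]
        unseen = self._messages[start:]
        self._cursors[reader] = len(self._messages)
        return unseen


q = RetainedQueue()
q.publish("tweet 1")
q.publish("tweet 2")
first = q.fetch_new("alice")   # alice sees both tweets
second = q.fetch_new("bob")    # bob still sees both: nothing was consumed
q.publish("tweet 3")
third = q.fetch_new("alice")   # alice only gets the tweet she has not seen
```

The point of the cursor is that "consuming" becomes bookkeeping on the
reader's side rather than a delete on the queue's side.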

The reasons I suggested storing messages in queues indefinitely are:
- experience says that queueing systems can scale very large, and that
it appears to be an "easier" problem to solve than scaling a database
very large.  I'll accept it if anyone complains about "gross
generalisation"...
- the APIs for storing messages to queues and then retrieving them are
designed to be very fast, and (again referencing IBM's MQ) we know
they can scale to queues holding very large numbers of messages

Storing messages in flat files seems to me to have a few limitations:
- if you're going to store 1 message per flat file, you need a
database (or database-like thing) to track those zillions of flat
files.  I figure that's going to put you back where you started in
terms of scalability
- assuming you're always appending messages to the end of flat files,
you'd have to assume that most requests will be for the most recent
message, i.e. the last message in the file.  Do you really want to be
seeking through to the last record of flat files all the time?  That
doesn't seem to be a scalable approach
- alternatively, if you always add the most recent message to the
*start* of a flat file, you'll constantly be rewriting the entire file
(at least, that's the case in any file system I can think of; there
might be an exception).  I suppose you could write your own file
system to optimise that...
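
The append-versus-prepend trade-off in the last two points can be sketched as
follows.  This is a rough Python illustration under my own assumptions (one
newline-terminated message per line, hypothetical helper names
append_message and prepend_message), and it counts bytes written by the
program rather than measuring what any particular file system does
underneath.

```python
import os
import tempfile


def append_message(path, message):
    """Add a message at the end of the file: only the new bytes are written."""
    with open(path, "a", encoding="utf-8", newline="") as f:
        f.write(message + "\n")
    return len(message.encode("utf-8")) + 1  # bytes written by this call


def prepend_message(path, message):
    """Add a message at the start: the old contents are read and written again."""
    old = ""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8", newline="") as f:
            old = f.read()
    data = message + "\n" + old
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(data)
    return len(data.encode("utf-8"))  # bytes written by this call


with tempfile.TemporaryDirectory() as d:
    append_path = os.path.join(d, "appended.txt")
    prepend_path = os.path.join(d, "prepended.txt")
    messages = ["tweet 0", "tweet 1", "tweet 2"]

    appended = [append_message(append_path, m) for m in messages]
    prepended = [prepend_message(prepend_path, m) for m in messages]

    with open(prepend_path, "r", encoding="utf-8", newline="") as f:
        newest_first = f.read().splitlines()

# Appending writes a constant amount per message; prepending writes the
# whole file each time, so the cost grows with the file's size.
print(appended, prepended, newest_first)
```

With prepending, the newest message is always at the front of the file, but
every write costs as much as the whole history.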

Please speak up if you've got any thoughts - I'm treating this like a
bunch of intellectuals throwing ideas around, rather than an argument
about right and wrong, and it seems that everyone else is too at this
stage.  Very happy to be convinced I'm wrong, in other words.

Regards

David Mitchell