[erlang-questions] Twoorl: an open source Twitter clone

Paul Stanley pstanley1970@REDACTED
Sun Jun 1 14:39:19 CEST 2008


On Sun, Jun 1, 2008 at 9:41 AM, David Mitchell <monch1962@REDACTED> wrote:

...
> Are you suggesting that Twoorl should be architected as follows:
> - when they register, every user gets assigned their own RabbitMQ
> incoming and outgoing queues
> - user adds a message via Web/Yaws interface (I know, this could be
> SMS or something else later...)
> - message goes to that user's RabbitMQ incoming queue
> - a backend reads messages from the user's incoming queue, looks up in
> e.g. a Mnesia table to see who should be receiving messages from that
> user and whether they're connected or not.  If "yes" to both, RabbitMQ
> then forwards the message to each of those users' outgoing queues
> - either the receiving users poll their outgoing queue for the
> forwarded message, or a COMET-type Yaws app springs to life and
> forwards the message to their browser (again, ignoring SMS)
>
> This seems like a reasonable approach; I'm just curious if that's what
> you're suggesting, or whether you've got something else in mind.

I think this is not quite right. As I understand it, Twitter messages
are retrieved by someone who is "following" another user when the
follower logs on. They don't have to be connected when the message is sent.

In other words, the server side has to maintain (or construct) an archive of
"received messages" for each user. Like email. But unlike email (which is
normally longish messages to few people), this system assumes short
messages which may well go to lots of people.

There seem to be two options.

The first (which is what Twitter originally used) is to store every outgoing
message once, and construct the archive on the fly. When a user comes
online and asks for messages, a server looks up who that user follows,
finds the messages from each followed person (which may involve checking
who is allowed to see them), arranges them and delivers them. The cost, in
effort and space, of storing a message is small. The cost of retrieving them
is high, and this (it seems) is where Twitter has been hitting problems. It is
difficult to cache efficiently (since most users follow a different set of
people, so queries are often unique) and they have had trouble scaling it.
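
A minimal sketch of this first option (store once, build the archive on
read), in Python purely for illustration. The class and method names are
hypothetical, everything lives in memory, and the permission check is left
out; it is not a description of Twitter's actual code.

```python
import heapq
from collections import defaultdict
from itertools import islice


class PullTimeline:
    """Store each message once; assemble a user's archive at read time."""

    def __init__(self):
        self.follows = defaultdict(set)   # follower -> authors they follow
        self.outbox = defaultdict(list)   # author -> [(seq, author, text)]
        self._seq = 0                     # global send order

    def follow(self, follower, author):
        self.follows[follower].add(author)

    def post(self, author, text):
        # Cheap write: a single append, however many followers there are.
        self._seq += 1
        self.outbox[author].append((self._seq, author, text))

    def timeline(self, user, limit=20):
        # Expensive read: one stream per followed author, merged newest
        # first. The result depends on the user's own follow set, which is
        # why it is hard to cache.
        streams = [reversed(self.outbox[a]) for a in self.follows[user]]
        merged = heapq.merge(*streams, key=lambda m: m[0], reverse=True)
        return list(islice(merged, limit))
```

The work in timeline() grows with the number of people followed, and it is
repeated on every request.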

The alternative is to process outgoing messages at once, delivering
copies to each "follower". It's tempting to assume this is the obviously
right solution, but bear in mind that it means more heavyweight
processing on send, and that it also means a great deal of redundant
storage. Even if each message is only 200 bytes long, including
metadata, if 10 people are following the sender that still means more
than 1K of "unnecessary" storage, and that's before you consider the
need for copies to ensure reliability. As I understand it, some users
are followed by thousands of people ... you can see where that is
going. Tempting as it is to say "throw more disks at it", I don't think
that's an altogether elegant answer: after all, even if storage doesn't
cost much it does cost something.
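
A rough sketch of the second option (copy to every follower on send),
together with the storage arithmetic. Python is used only for illustration;
the names are hypothetical, and the "replicas" parameter is my own stand-in
for the reliability copies mentioned above.

```python
from collections import defaultdict


class PushTimeline:
    """Copy each message into every follower's archive at send time."""

    def __init__(self):
        self.followers = defaultdict(set)  # author -> their followers
        self.inbox = defaultdict(list)     # user -> [(author, text)]

    def follow(self, follower, author):
        self.followers[author].add(follower)

    def post(self, author, text):
        # Expensive write: one stored copy per follower.
        for follower in self.followers[author]:
            self.inbox[follower].append((author, text))

    def timeline(self, user):
        # Cheap read: the archive already exists, newest last.
        return list(reversed(self.inbox[user]))


def fanout_bytes(message_bytes, followers, replicas=1):
    """Return (total bytes stored, bytes beyond a single copy)."""
    total = message_bytes * followers * replicas
    return total, total - message_bytes
```

With the figures above, fanout_bytes(200, 10) gives (2000, 1800): 1800
bytes beyond the single copy, which is the "more than 1K".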

One solution, which Yariv proposed above, is to limit the number of
messages in each queue, flushing old messages. But as I understand
it, there are many users who like to be able to go back more than 20
messages. So that solves the problem at the cost of desired function.
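
To make the trade-off concrete, here is a small sketch of a capped queue,
assuming a limit of 20 (the figure used above). The function name is
hypothetical and Python is used only for illustration.

```python
from collections import deque


def capped_inbox(messages, cap=20):
    """Keep only the newest `cap` messages; older ones are discarded."""
    inbox = deque(maxlen=cap)   # a full deque drops its oldest item on append
    for message in messages:
        inbox.append(message)
    return inbox
```

Feeding it messages 1 to 25 leaves only 6 to 25: storage per user is
bounded, but the first five are gone for good.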

Perhaps (probably?) the key is to find a way of storing very lightweight
pointers to messages, which can appear redundantly in many
queues/archives without too much wastage, but without arbitrary
limits on archive size.

The devil is probably in the detail with that, though. In particular, unless
you make them very lightweight you may still have a painfully, wastefully
bloated storage requirement, and you have to be happy that you have a
blazingly efficient way of retrieving the messages from these pointers.
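
One way the pointer idea might look, as a sketch only: each message body is
stored once under an integer id, and each archive is a packed array of ids.
The names are hypothetical, everything is in memory, and a real system would
need the id lookup to stay fast across machines, which is the hard part.

```python
from array import array
from collections import defaultdict


class PointerTimeline:
    """Store message bodies once; archives hold only integer ids."""

    def __init__(self):
        self.messages = {}                            # id -> (author, text)
        self.followers = defaultdict(set)             # author -> followers
        self.inbox = defaultdict(lambda: array("Q"))  # user -> packed ids
        self._next_id = 0

    def follow(self, follower, author):
        self.followers[author].add(follower)

    def post(self, author, text):
        self._next_id += 1
        self.messages[self._next_id] = (author, text)   # one body copy
        for follower in self.followers[author]:
            # A few bytes of id per follower instead of a whole message.
            self.inbox[follower].append(self._next_id)
        return self._next_id

    def timeline(self, user, limit=20):
        # Take the newest `limit` ids (limit must be positive), then
        # dereference each one; the archive itself is never truncated.
        ids = self.inbox[user][-limit:]
        return [self.messages[i] for i in reversed(ids)]
```

Each follower now costs one array slot per message rather than a full copy,
and nothing has to be flushed to keep the archive small.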

(I feel this discussion, fascinating as I find it, is not very Erlang-related
(at least not directly). I hope it's worth having though, because these
basic issues are very interesting ... at least I find them so!)

-- 
Paul Stanley


