[erlang-questions] Twoorl: an open source Twitter clone

Wed Jun 4 10:12:07 CEST 2008

Do you need a database at all for the individual messages?

I'm aware that you need to be able to access an archive of a user's
posts, but look at the Twitter use cases for *receiving* messages (and
I REALLY hope I've got these right, otherwise I'll look like a
klutz...):
- user receives all messages that they've "subscribed" to (i.e. from
users that they're following)
- user receives all messages directed to them specifically (i.e. stuff
directed to @me)
- user receives all messages directed to them as part of a group (I
assume this functionality is in Twitter somewhere...)
- user looks at the history of messages sent by a specific user

You could achieve ALL of these by forwarding messages to queues, and
not storing the messages themselves in a central database.  For the
4th use case above, you need to be able to retain messages in queues
indefinitely, but there are ways to achieve that without relying on a
single, central database.  Yep, you're going to burn through disc
space, but that's unavoidable unless you put in some mechanism for
ageing out old messages.
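
To make the forwarding idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: in-memory deques stand in for durable queues (RabbitMQ, SQS or MQ), every name in it is hypothetical, and addressing a group as "@groupname" is my guess rather than something I know Twitter does.

```python
from collections import defaultdict, deque
import re

# Hypothetical in-memory stand-ins for durable queues (RabbitMQ, SQS, MQ).
inbox = defaultdict(deque)    # per-user queue of received messages
outbox = defaultdict(list)    # per-user retained history of sent messages
followers = defaultdict(set)  # author -> users following that author
groups = defaultdict(set)     # group name -> member users


def follow(user, author):
    """Record that `user` wants to receive messages from `author`."""
    followers[author].add(user)


def post(author, text):
    """Forward one message to every queue that should hold a copy."""
    message = (author, text)
    outbox[author].append(message)        # use case 4: sender's history
    recipients = set(followers[author])   # use case 1: subscribers
    for name in re.findall(r"@(\w+)", text):
        if name in groups:
            recipients |= groups[name]    # use case 3: group members (assumed syntax)
        else:
            recipients.add(name)          # use case 2: direct @mention
    recipients.discard(author)            # do not deliver a message to its sender
    for user in recipients:
        inbox[user].append(message)
    return recipients
```

Nothing here touches a central message store: reading a user's timeline means draining inbox[user], and reading a user's history means reading outbox[user].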

Essentially your queues *become* the database, and managing a vast
number of queues (and the messages in those queues) is a different
challenge than managing a vast database.

If you assume that you can massively scale out the number of queues
being managed (and IBM does a pretty good job of that with MQ running
on a mainframe, so I assume it's at least feasible to do it with
RabbitMQ or SQS), and you've got the capacity to store a large number
of messages in those queues (a big assumption, but again IBM does it
with MQ), the two key items of data you need to manage are (a) the
mapping of user IDs to the queues for those users (i.e. queue name,
queue server name), and (b) the many-to-many relationships between users.
That's the bit that needs the rapid response, but you've reduced the
database scalability problem to only that.
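
As a rough sketch of those two items of data, again in Python: the server names are made up, and picking a queue server by hashing the user ID is purely my assumption (a plain lookup table would do the same job). The follow graph is indexed in both directions so that "who follows X" and "who does X follow" are both quick lookups.

```python
import hashlib

# Hypothetical queue server names; a real deployment would supply its own.
QUEUE_SERVERS = ["mq1.example", "mq2.example", "mq3.example"]


def queue_location(user_id):
    """(a) Map a user ID to (queue name, queue server name)."""
    digest = hashlib.sha1(user_id.encode("utf-8")).digest()
    # Assumption: choose the server from the first byte of a stable hash.
    server = QUEUE_SERVERS[digest[0] % len(QUEUE_SERVERS)]
    return ("inbox." + user_id, server)


class FollowGraph:
    """(b) Many-to-many relationships between users, indexed both ways."""

    def __init__(self):
        self.following = {}  # user -> set of authors that user follows
        self.followers = {}  # author -> set of users following that author

    def follow(self, user, author):
        self.following.setdefault(user, set()).add(author)
        self.followers.setdefault(author, set()).add(user)

    def delivery_targets(self, author):
        """Queue locations that should receive a message from `author`."""
        users = self.followers.get(author, ())
        return sorted(queue_location(u) for u in users)
```

Posting a message then costs one lookup in the follow graph plus one queue_location call per follower; everything after that is the queue servers' problem.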

Managing the scalability of the actual messages themselves then falls
within the realm of the queue servers & software, and IBM's experience
seems to be that you can go a very long way down the path of just
throwing hardware at that problem before you hit the limits of that
approach.

It's HIGHLY likely I'm missing something here, since I'm still a
novice as far as Twitter's functionality is concerned (and it's late,
and I'm tired...).  This thread continues to be a very interesting
discussion of scaling a big bit of infrastructure, and I'm learning a
lot from everyone participating.

Please jump in and tell me where I'm wrong, and please don't think I'm
questioning the decisions you've made.  Twoorl is a really impressive
bit of work, and a great demo of what can be achieved relatively
easily with Erlyweb.  For me at least, it's really opened my eyes to
what could be achieved using pure Erlang infrastructure.

Regards

David Mitchell

2008/6/4 Yariv Sadan <yarivsadan@REDACTED>:
> I considered using a reliable queuing mechanism such as RabbitMQ or
> Amazon SQS but I don't think it would make the architecture inherently
> more scalable (more reliable maybe, but not more scalable). I think a
> Twitter-like solution can be designed to scale using just Yaws, MySQL,
> and Mnesia or memcache (and maybe Ejabberd if you add an XMPP
> gateway). RabbitMQ or SQS would provide *reliable* asynchronous
> background processing, but if you don't need 100% reliability (Twoorl
> isn't a banking application after all), you can just spawn Erlang
> processes from Yaws to do background tasks after a user posts a
> message. Also, using persistent queues doesn't make the need for a
> database go away. When you pull a tweet from a queue you have to put
> it somewhere so it can be shown on rendered pages, and a database is
> the most reasonable place to put it. The main problem in scaling
> Twitter/Twoorl is how you architect your database backend --
> partitioning, denormalization, replication, load balancing, caching,
> etc, will probably make or break your ability to scale.
>
> Yariv
>
> On Sun, Jun 1, 2008 at 1:41 AM, David Mitchell <monch1962@REDACTED> wrote:
>> This is a REALLY interesting discussion, but at this point it's
>> becoming obvious that I don't know enough about Twitter...
>>
>> Are you suggesting that Twoorl should be architected as follows:
>> - when they register, every user gets assigned their own RabbitMQ
>> incoming and outgoing queues
>> - user adds a message via Web/Yaws interface (I know, this could be
>> SMS or something else later...)
>> - message goes to that user's RabbitMQ incoming queue
>> - a backend reads messages from the user's incoming queue, looks up in
>> e.g. a Mnesia table to see who should be receiving messages from that
>> user and whether they're connected or not.  If "yes" to both, RabbitMQ
>> then forwards the message to each of those users' outgoing queues
>> - either the receiving users poll their outgoing queue for the
>> forwarded message, or a COMET-type Yaws app springs to life and
>> forwards the message to their browser (again, ignoring SMS)
>>
>> This seems like a reasonable approach; I'm just curious if that's what
>> you're suggesting, or whether you've got something else in mind.
>>
>> Great thread, and thanks Yariv for getting this discussion going with Twoorl
>>
>> Regards
>>
>> Dave M.
>>
>> 2008/6/1 Steve <steven.charles.davis@REDACTED>:
>>>
>>> On May 31, 5:04 pm, "Yariv Sadan" <yarivsa...@REDACTED> wrote:
>>>> ...but it's the only way you can scale this kind of service when N is
>>>> big.
>>>
>>> Hmmm, Yariv, aren't you still thinking about this in the way that Dave
>>> Smith pointed to as the heart of the issue? i.e.
>>> Dave said: "My understanding is that the reason they have such poor
>>> uptime is due to the fact that they modeled the problem as a web-app
>>> instead of a messaging system."
>>>
>>> I'm aware that you are likely a good way away from hitting any
>>> scalability problems, but some kind of tiering would seem to be
>>> appropriate if twoorl is to be "twitter done right". Yaws at the front
>>> end, definitely - but rather /RabbitMQ/ at the back end. I believe
>>> that you'd then have the flexibility to distribute/cluster as
>>> necessary to scale to the required level (whatever that may be).
>>>
>>> For sure, Twoorl is a great demo of what can be done with Erlang in an
>>> incredibly short time. I'm a relative noob to Erlang, and have learned
>>> a great deal from your blog/code/examples.
>>>
>>> Steve
>>> _______________________________________________
>>> erlang-questions mailing list
>>> erlang-questions@REDACTED
>>> http://www.erlang.org/mailman/listinfo/erlang-questions