[erlang-questions] Framework / library for at-least-once execution?

Wed Apr 11 20:35:30 CEST 2018

On 04/08/2018 07:38 AM, Petri Pellinen wrote:
> Hello everybody,
>
> I'm a new subscriber to the list and new to the Erlang world. I have read a couple of books on Erlang and OTP and, as a long-time Java programmer, am very excited by all the concurrency and high availability features that come out of the box.
>
> Tried searching the archives for answers but am not sure if I came up with the correct search terms and ended up empty-handed.
>
> I'm curious if there is an existing library or framework that would let me submit a "job" and the framework makes sure that the job is run *at least once* to completion in an OTP cluster even if the machine running the submitted job dies during execution. If a machine/node dies during execution of the job then another node should restart the job as soon as possible.
>
> So, from a client perspective, if I get an acknowledgment that a job was successfully received then I can rest assured that the job runs to completion.
>
> If any of you are familiar with Spring Batch in the Java world then this is something similar but not really ETL orientated or for heavy batches - when I say "job" here I really mean any piece of code, even a very lightweight function. I'm trying to come up with an extremely reliable backend solution for delivering and processing messages between parties.
>
> Any information or pointers to relevant existing solutions would be greatly appreciated.
>
> Thanks in advance for any help you may be able to provide!
>
> Kind regards,
> Petri
>

For real-time at-least-once processing you have two basic high-level choices:
1) Treat a job as a piece of data that you put in a queue to be processed at least once, with distributed consensus to make the queue fault-tolerant
2) Treat a job as source code in a service that receives task messages, so the concept of a job is abstract, allowing the algorithm and the input/output of the job to change separately
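
To make the queueing idea in #1 concrete, here is a minimal sketch in Python (chosen only for brevity; nothing here is Erlang- or CloudI-specific). It assumes a single in-memory queue and a caller-supplied clock, and every name in it (AtLeastOnceQueue, lease, ack, lease_seconds) is hypothetical. A job is only forgotten once a worker acknowledges it, so a worker that dies mid-job causes a redelivery after its lease expires. The distributed-consensus part, i.e. replicating this state across machines, is deliberately left out.

```python
from collections import deque


class AtLeastOnceQueue:
    """In-memory sketch: the queue owns a job until a worker acknowledges it."""

    def __init__(self, lease_seconds):
        self.lease_seconds = lease_seconds
        self.pending = deque()  # job ids waiting for a worker
        self.leased = {}        # job id -> time at which the lease expires
        self.jobs = {}          # job id -> payload, kept until acknowledged
        self.next_id = 0

    def submit(self, payload):
        """Store the job; returning the id is the 'received' acknowledgment."""
        job_id = self.next_id
        self.next_id += 1
        self.jobs[job_id] = payload
        self.pending.append(job_id)
        return job_id

    def lease(self, now):
        """Hand the next job to a worker, or return None if nothing is available."""
        self._requeue_expired(now)
        while self.pending:
            job_id = self.pending.popleft()
            if job_id in self.jobs:  # skip jobs acknowledged after a requeue
                self.leased[job_id] = now + self.lease_seconds
                return job_id, self.jobs[job_id]
        return None

    def ack(self, job_id):
        """Only an acknowledgment removes a job; a crashed worker never sends one."""
        self.leased.pop(job_id, None)
        self.jobs.pop(job_id, None)

    def _requeue_expired(self, now):
        """Make jobs whose lease ran out available to other workers again."""
        for job_id, expiry in list(self.leased.items()):
            if expiry <= now:
                del self.leased[job_id]
                self.pending.append(job_id)


queue = AtLeastOnceQueue(lease_seconds=30)
job = queue.submit("deliver message 42")
first = queue.lease(now=0)    # worker A takes the job and then dies
second = queue.lease(now=10)  # lease still held, nothing to hand out
third = queue.lease(now=31)   # lease expired, worker B gets the same job
queue.ack(job)                # worker B finished, the job is dropped
```

Because a slow (not dead) worker can also lose its lease, the same job may run twice, which is why the job itself has to tolerate repeated execution.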

#2 is similar to #1 because the tasks are still queued to be handled by the service. However, #2 differs by allowing hot-code upgrades/downgrades without extra complexity: task messages use a clearly defined protocol, and you don't need to be concerned about data structures changing, since they are isolated in the source code as a separate entity.

CloudI (https://cloudi.org) provides the #2 approach, which is a natural way to approach the problem in Erlang, though you may be expecting to see the #1 approach. If you were using CloudI to solve this problem, you could use the cloudi_service_quorum source code as a proxy to achieve consensus among different "job" services on separate machines that process the same task message concurrently. Something similar could be done with the CloudI API function mcast_async.
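
The quorum idea can be sketched independently of CloudI. The Python below is not the cloudi_service_quorum implementation and does not use the CloudI API; the function names (quorum_result, run_on_replicas) are hypothetical, the "replicas" are plain callables standing in for job services on separate machines, and they are called one after another rather than concurrently to keep the example short. It also assumes replies are hashable values.

```python
from collections import Counter


def quorum_result(replies, required):
    """Return the reply that at least `required` replicas agree on, else None.

    `replies` holds one entry per replica; None marks a replica that failed.
    """
    counts = Counter(reply for reply in replies if reply is not None)
    if not counts:
        return None
    value, votes = counts.most_common(1)[0]
    return value if votes >= required else None


def run_on_replicas(task, replicas, required):
    """Send the same task to every replica and accept only a quorum reply."""
    replies = []
    for replica in replicas:
        try:
            replies.append(replica(task))
        except Exception:
            replies.append(None)  # a dead or failing replica contributes no vote
    return quorum_result(replies, required)


def healthy(task):
    return task.upper()


def crashed(task):
    raise RuntimeError("node down")


# Two of three replicas answer identically, so a quorum of 2 is reached.
accepted = run_on_replicas("hello", [healthy, crashed, healthy], required=2)
# Requiring all three replies fails because one replica is down.
rejected = run_on_replicas("hello", [healthy, crashed, healthy], required=3)
```

With a quorum smaller than the number of replicas, the task still completes when a machine dies during execution, at the cost of running the same work on several machines.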

Best Regards,
Michael