[erlang-questions] List Question

Wed Aug 9 01:27:24 CEST 2017

From what I've been able to glean about the HL7 message
format, there are two aspects:
* the basic syntax is the multi-level delimited thingy but
* there is also *semantics*, predefined message types,
  an assortment of data types, rules for mapping data
  types to trees and so on.

From a quick look at the Elixir HL7 parser, they have
taken steps to handle some (but perhaps not all) of the
*semantics* of HL7: it gives you more structured data
rather than just a tree of strings.

Parsing the multi-level delimited syntax is trivial.
Dealing with the semantics is not.
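
To make the "trivial" half concrete, here is a minimal sketch of the purely syntactic split. It is written in Python only to keep it short; the same nesting could be written in Erlang with list comprehensions over binary:split/3. It assumes the default HL7 v2 delimiters (carriage return between segments, | between fields, ^ between components, & between subcomponents). It deliberately ignores escape sequences, repetition with ~, and the MSH header segment, which declares the delimiters and needs special handling. The function name and the sample message are made up for illustration.

```python
# Sketch only: assumes HL7 v2 default delimiters and ignores escape
# sequences, field repetition (~), and the special layout of the MSH segment.
SEGMENT_SEP = "\r"
FIELD_SEP = "|"
COMPONENT_SEP = "^"
SUBCOMPONENT_SEP = "&"


def parse_syntax(message):
    """Split a message into segments -> fields -> components -> subcomponents."""
    tree = []
    for segment in message.split(SEGMENT_SEP):
        if not segment:
            continue  # skip the empty string left by a trailing terminator
        fields = []
        for field in segment.split(FIELD_SEP):
            components = [comp.split(SUBCOMPONENT_SEP)
                          for comp in field.split(COMPONENT_SEP)]
            fields.append(components)
        tree.append(fields)
    return tree


# Hypothetical two-segment fragment (no MSH header), for illustration only.
sample = "PID|1||12345^^^HOSP||DOE^JOHN\rOBX|1|NM|GLU||5.4|mmol/L\r"
tree = parse_syntax(sample)
```

The result is nothing more than a tree of strings: tree[0][5] is [["DOE"], ["JOHN"]], and nothing in it knows that this field is a person's name. That knowledge is the semantic half.
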
I think that in figuring out for yourself how to work
with HL7 messages in Erlang, the starting point would
be to ask:
 - what message types do you want to handle?
 - what kinds of data occur in them?
 - how do you want to represent those kinds of
   data in Erlang?
 - do you actually want to represent *every* field
   at all?  Some might not be relevant to you.
 - would you be streaming messages through a
   system (like some sort of pub/sub queueing
   middleware), summarising messages, storing
   them, or what?
 - what does a type declaration for a message type
   look like in HL7?  Is there some way to automatically
   derive parsing code from that?

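What I'm getting at with the last point is that there
is ASN.1 support for Erlang.  Give it an ASN.1
definition, and you get Erlang code out the other end.
For formats that are not ASN.1, I am thinking in particular
of the PADS project, which generates parsers from
declarative descriptions of ad hoc data formats: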

PADS: Processing Arbitrary Data Streams

Kathleen Fisher and Bob Gruber, AT&T Labs

Slides in ppt http://homepages.inf.ed.ac.uk/wadler/xmlbinding/

Transactional data streams, such as sequences of stock-market buy/sell orders, credit-card purchase records, web server entries, and electronic fund transfer orders, can be mined very profitably. As an example, researchers at AT&T have built customer profiles from streams of call-detail records to significant financial effect.

Often such streams are high-volume: AT&T's call-detail stream contains roughly 300 million calls per day requiring approximately 7GBs of storage space. Typically, such stream data arrives "as is" in ad hoc formats with poor documentation. In addition, the data frequently contains errors. The appropriate response to such errors is application-specific. Some applications can simply discard unexpected or erroneous values and continue processing. For other applications, however, errors in the data can be the most interesting part of the data.

Understanding a new data source and producing a suitable parser are crucial first steps in any use of such data. Unfortunately, writing parsers for this kind of data is a difficult task, both tedious and error-prone. It is complicated by lack of documentation, convoluted encodings designed to save space, the need to handle errors robustly, and the need to produce efficient code to cope with the scale of the stream. Often, the hard-won understanding of the data ends up embedded in parsing code, making long-term maintenance difficult for the original writer and sharing the knowledge with others nearly impossible.

The goal of the PADS project is to provide languages and tools for simplifying data processing. We have a preliminary design of a declarative data-description language, PADSL, expressive enough to describe the data feeds we see at AT&T in practice, including ASCII, binary, EBCDIC, Cobol, and mixed data formats. From PADSL we generate a tunable C library with functions for parsing, manipulating, and summarizing the data. In joint work with Mary Fernandez and Ricardo Medel, we are working to integrate PADS and XQuery to support declarative querying of data sources with PADS descriptions.

--------------------
The PADS project moved from AT&T to
http://pads.cs.tufts.edu/doc.html

I say this in all seriousness: if I needed to process
gigabytes of HL7 data, I would start by seeing whether PADS
was adequate to describe it, and if so I'd write an HL7->Erlang
data translator in C or ML (as PADS has C and ML versions).
If not, I'd see what ideas I could steal from PADS.

Using a declarative data language to describe the message types
I was interested in would be an up-front cost, but it would
hugely simplify later maintenance.
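
As a rough illustration of what "declarative" buys you, here is a minimal sketch in which the description of a segment is plain data and one small generic function interprets it. This is not PADS syntax and not an official HL7 definition: the converter names, the SEGMENT_SPECS table and the sample segment are all hypothetical, and the field positions used for the PID segment (5 for name, 7 for date of birth, 8 for sex) should be checked against the HL7 specification before anyone relies on them. Escapes and repetitions are again ignored.

```python
from datetime import datetime


# Converters for a few kinds of data; hypothetical names, not HL7 type codes.
def as_text(raw):
    return raw


def as_date(raw):
    # Assumes an 8-digit YYYYMMDD value.
    return datetime.strptime(raw, "%Y%m%d").date()


def as_name(raw):
    parts = raw.split("^")
    return {"family": parts[0], "given": parts[1] if len(parts) > 1 else ""}


# Declarative description: segment id -> (field position, key, converter).
# Only the fields of interest are listed; everything else is ignored.
SEGMENT_SPECS = {
    "PID": [
        (5, "name", as_name),
        (7, "birth_date", as_date),
        (8, "sex", as_text),
    ],
}


def decode_segment(segment):
    """Decode one segment according to SEGMENT_SPECS, or return None."""
    fields = segment.split("|")
    spec = SEGMENT_SPECS.get(fields[0])
    if spec is None:
        return None  # no description for this segment type
    record = {"segment": fields[0]}
    for position, key, convert in spec:
        raw = fields[position] if position < len(fields) else ""
        record[key] = convert(raw) if raw else None  # empty field -> None
    return record


# Hypothetical sample segment, for illustration only.
pid = decode_segment("PID|1||12345^^^HOSP||DOE^JOHN||19800102|M")
```

Supporting another segment or field then means adding a row to the table rather than writing more parsing code, which is where the maintenance saving would come from.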