[erlang-questions] Erlang re:run regular exp, match problem

Mathias mathiasstalas@REDACTED
Sun Oct 31 17:41:31 CET 2010


Hi,

That expression was actually the first one I tried, the only
difference being that I ran it from within my application. This was before I knew
that file:read_file and UTF-8 don't blend well. I used file:read_file to
read in my UTF-8 encoded file... I only tried dlfen's expression from the
CLI, and there it succeeded. By the time I posted I had tried so many
different solutions that I was tired and just wanted my rather simple
laboratory application to work. Later I found out that neither his sample
nor the one you proposed works from within my program.

Further investigation has led me to believe that re:run/3 has an
issue with strings that lack linefeeds.
Putting some non-valid XML in a file and running my rather simple program
always yields [[{0,162}]], i.e. a single match running from the first char to
the closing '>' of the last point (see attached doc). It doesn't split the
input up per entity as expected. Either that is the intended module
behaviour, which I find a bit odd, or (god forbid) a programming fault on
my part. Here is the code I use:

-module(mock).
-export([start/0, read_file/1, decode_data/1, find_pattern/3]).

start() ->
        Bin = read_file("point.xml"),
        UnicodeString = decode_data(Bin),
        NodeList = find_pattern(UnicodeString, "<point.*\/>",
                                [unicode, global]),
        NodeList.

read_file(File) ->
        case file:read_file(File) of
                {ok, Bin} -> Bin;
                _ -> []
        end.

decode_data(Data) ->
        case unicode:characters_to_list(Data, utf8) of
                {error, Encoded, Rest} ->
                        %% io:format/2 takes its arguments as a single list.
                        io:format("Caught Error ~w ~w~n", [Encoded, Rest]),
                        [];
                List ->
                        List
        end.

find_pattern(Str, Pattern, Options) ->
        case re:run(Str, Pattern, Options) of
                {match, Part} ->
                        io:format("find_pattern: ~w~n", [Part]),
                        Part;
                nomatch -> []
        end.

However, adding a linefeed '\n' after each entity in the doc will give the
expected result:
[[{0,27}],[{28,27}],[{56,27}],[{84,27}],[{112,27}],[{140,27}]]
which to me looks strange. Haven't read up on the re module that much but
this is my experience.
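
A note on the behaviour described above: it is consistent with how ".*" works
in re (PCRE). The quantifier is greedy, and "." does not match a newline
unless the dotall option is given, so a linefeed after each entity happens to
stop each match at the end of its line. The sketch below is only an
illustration; the module name point_match, its function names, and the short
sample strings are assumptions, not part of the original program.

```erlang
-module(point_match).
-export([greedy/1, lazy/1, lazy_dotall/1]).

%% Greedy: ".*" consumes as much as it can, so several elements on one
%% line collapse into a single match ending at the last "/>".
greedy(Str) ->
    re:run(Str, "<point.*/>", [unicode, global]).

%% Lazy: ".*?" stops at the first "/>", giving one match per element.
lazy(Str) ->
    re:run(Str, "<point.*?/>", [unicode, global]).

%% Lazy plus dotall: "." may also match a newline, so an element whose
%% attributes are spread over several lines is still matched.
lazy_dotall(Str) ->
    re:run(Str, "<point.*?/>", [unicode, global, dotall]).
```

With two 14-character elements on one line,
point_match:greedy("<point x=\"1\"/><point x=\"2\"/>") returns
{match,[[{0,28}]]}, while point_match:lazy/1 on the same string returns
{match,[[{0,14}],[{14,14}]]}. Putting "\n" between the two elements makes
greedy/1 return {match,[[{0,14}],[{15,14}]]}, the same pattern of results
seen in the post.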

I have resigned myself to using xmerl_xpath, which seems to do the job. I
guess that, coming from the Java world, I am a bit spoiled by strong support
for string manipulation; doing the above there would have taken me less than
10 min.
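
For reference, a minimal sketch of the xmerl_xpath route. It assumes the
point elements are wrapped in a single root element so that the input is
well-formed XML; the module name point_xpath and the function points/1 are
made up for illustration.

```erlang
-module(point_xpath).
-export([points/1]).

-include_lib("xmerl/include/xmerl.hrl").

%% Parse an XML string and return, for every <point> element, its
%% attributes as a list of {Name, Value} pairs (Name is an atom,
%% Value is a string).
points(XmlString) ->
    {Doc, _Rest} = xmerl_scan:string(XmlString),
    Nodes = xmerl_xpath:string("//point", Doc),
    [[{Name, Value} || #xmlAttribute{name = Name, value = Value} <- Attrs]
     || #xmlElement{attributes = Attrs} <- Nodes].
```

Calling point_xpath:points("<points><point x=\"12\" y=\"2\" z=\"4\"/><point
x=\"4\" y=\"2\" z=\"14\"/></points>") gives one attribute list per point
element, which avoids the greedy-match pitfall entirely.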

Anyway, thank you both for the effort.

BR,
Mathias Stalås



On Sat, Oct 30, 2010 at 8:47 PM, Hynek Vychodil <hynek@REDACTED> wrote:

> I would not give thanks if I were you. It doesn't do what you want; it
> works only by accident in this particular example. [^<point]* means
> any run of chars other than '<', 'p', 'o', 'i', 'n', 't'. [^<]* would work
> the same way in this particular example.
>
> This would work much more generally
>
> re:run("<point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\"
> y=\"2\"z=\"14\"/>", "<point.*?\/>",[global])
>
> but anyway you should use an XML parser for XML parsing, because XML
> cannot be described by a regular grammar, so a regular expression is not
> the proper tool for it. You will end up with an error-prone solution.
>
> On Sat, Oct 30, 2010 at 1:34 PM, Mathias <mathiasstalas@REDACTED> wrote:
> > Works like a charm!
> >
> > Many thanks dlfen!
> >
> > BR,
> > Mathias
> >
> > On Sat, Oct 30, 2010 at 1:03 PM, dlfen <erlangdlf@REDACTED> wrote:
> >
> >> try this.
> >> re:run("<point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\" y=\"2\"
> >> z=\"14\"/>", "<point[^<point]*\/>",[global]).
> >>
> >>
> >> On 2010-10-30, at 6:44 PM, Mathias wrote:
> >>
> >> > Hi there,
> >> >
> >> > I'm trying to figure out how Erlang's re:run function works.
> >> >
> >> > When executing this:
> >> > 1> re:run("<point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\" y=\"2\"
> >> > z=\"14\"/>", "<point(?:\s|.)*\/>").
> >> > {match,[{0,54}]}
> >> >
> >> > I can see that it gives me a match on the complete XML representation
> >> > {match,[{0,54}]}.
> >> >
> >> > But what I really would like is for it to give me a separate match
> >> > for each entity, similar to {match,[{0,26},{27, 26}]}.
> >> >
> >> > so the output would yield something like this:
> >> > 0-26 gives the first xml entity complete with its attributes
> >> > <point x="12" y="2" z="4"/> and
> >> > match 27,26 gives the remaining entity.
> >> >
> >> > If anyone can spot why my regexp <point(?:\s|.)*\/> is failing and
> >> > guide me in the right direction, closer to the solution, it will be
> >> > greatly appreciated.
> >> >
> >> > I know about xmerl but for my trivial case it seems like overkill.
> >> >
> >> > Thx in advance.
> >> >
> >> > BR,
> >> > Mathias Stalås
> >>
> >>
> >
>
>
>
> --
> --Hynek (Pichi) Vychodil
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: point.xml
Type: text/xml
Size: 163 bytes
Desc: not available
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20101031/9ffa62cb/attachment.xml>

