[erlang-questions] Erlang re:run regular exp, match problem

Jesse Gumm <>
Sun Oct 31 17:58:04 CET 2010


The problem in that last example is that * is greedy by default and . does
not match a linefeed (which is why putting \n at the end of each element
worked).
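
A minimal sketch of the linefeed effect, assuming a two-line input with one
element per line. The module name, function names, and input string are
made up for illustration; dotall is the standard re option that lets . match
a linefeed as well.

```erlang
-module(dot_newline_demo).
-export([line_by_line/0, across_lines/0]).

%% Hypothetical input: two elements, each followed by a linefeed.
input() -> "<point x=\"1\"/>\n<point x=\"2\"/>\n".

%% Default options: . stops at the linefeed, so even the greedy .* cannot
%% leave the current line and each element is matched on its own.
line_by_line() ->
    {match, Spans} = re:run(input(), "<point.*/>", [global]),
    Spans.                                   % [[{0,14}],[{15,14}]]

%% With dotall, . also matches the linefeed, so the greedy .* runs on to
%% the last "/>" in the whole string and everything becomes one match.
across_lines() ->
    {match, Spans} = re:run(input(), "<point.*/>", [global, dotall]),
    Spans.                                   % [[{0,29}]]
```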

"<point.*\/>" will match from the first instance of "<point" to the very
last "/>"

Changing it to:

"<point.*?\/>" will make * "ungreedy" (lazy), so each match ends at the
first "/>" it finds instead of the last one.

Alternatively, you could use

"<point [^>]*\/>", and then you don't have to worry about greediness at all,
because [^>]* can never run past the closing ">".
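
For comparison, here is a small sketch that runs the three patterns against
the two-element string from earlier in the thread. The module and function
names are hypothetical; only re:run/3 and its global option come from the
standard library.

```erlang
-module(point_regex_demo).
-export([greedy/0, lazy/0, negated_class/0]).

%% The sample string used earlier in the thread (54 characters, no linefeed).
input() ->
    "<point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\" y=\"2\" z=\"14\"/>".

%% Return the {Offset, Length} spans of every match, or [] if none.
spans(Pattern) ->
    case re:run(input(), Pattern, [global]) of
        {match, Spans} -> Spans;
        nomatch -> []
    end.

%% Greedy: runs from the first "<point" to the very last "/>".
greedy() -> spans("<point.*/>").              % [[{0,54}]]

%% Lazy: each match stops at the first "/>" after "<point".
lazy() -> spans("<point.*?/>").               % [[{0,27}],[{27,27}]]

%% Negated class: [^>]* cannot cross a ">", so greediness is irrelevant.
negated_class() -> spans("<point [^>]*/>").   % [[{0,27}],[{27,27}]]
```

Adding {capture, first, list} to the option list should return the matched
text itself rather than offsets, which may be easier to work with.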

-Jesse


On Sun, Oct 31, 2010 at 11:41 AM, Mathias <> wrote:

> Hi,
>
> That expression was actually the first one I tried out, the only
> difference being that I ran it from within my application. This was before
> I knew that file:read_file and UTF-8 don't blend well. I used
> file:read_file to read in my UTF-8 encoded file... I only tried dlfen's
> expression from the CLI, and there it succeeded. At the point of posting I
> had tried so many different solutions that I was tired and just wanted my
> rather simple laboratory application to work. Later I found out that
> neither the sample he gave nor the one you proposed worked from within my
> program.
>
> Further investigation has led me to believe that re:run/3 has an issue
> with strings that lack linefeeds.
> Putting some non-valid XML in a file and using my rather simple program
> always yields [[{0,162}]], i.e. a single match running from the first
> character to the closing '>' of the last point (see attached doc). It
> doesn't split them up as expected, which is either the intended module
> behaviour, which I find a bit odd, or (god forbid) a programming fault on
> my part. Here is the code I use:
>
> -module(mock).
> -export([start/0, read_file/1, decode_data/1, find_pattern/3]).
>
> start() ->
>         Bin = read_file("point.xml"),
>         UnicodeString = decode_data(Bin),
>         NodeList = find_pattern(UnicodeString, "<point.*\/>", [unicode,
> global]),
>         NodeList.
>
> read_file(File) ->
>         case file:read_file(File) of
>                 {ok, Bin} -> Bin;
>                 _ -> []
> end.
>
> decode_data(Data) ->
>         case unicode:characters_to_list(Data, utf8) of
>                 {error, Encoded, Rest} ->
>                         %% io:format/2 takes its arguments as a single list
>                         io:format("Caught error: ~w ~w~n", [Encoded, Rest]),
>                         [];
>                 List ->
>                         List
>         end.
>
> find_pattern(Str, Pattern, Options) ->
>         case re:run(Str, Pattern, Options) of
>                 {match, Part} ->
>                         io:format("find_pattern: ~w~n", [Part]),
>                         Part;
>                 nomatch -> []
> end.
>
> However, adding a linefeed '\n' after each entity in the doc will give the
> expected result:
> [[{0,27}],[{28,27}],[{56,27}],[{84,27}],[{112,27}],[{140,27}]]
> which to me looks strange. Haven't read up on the re module that much but
> this is my experience.
>
> I have resigned myself to using xmerl_xpath, which seems to do the job. I
> guess that, coming from the Java world, I am a bit spoiled by strong
> support for string manipulation; there, doing the above would have taken
> me less than 10 minutes.
>
> Anyway Thank you both for the effort.
>
> BR,
> Mathias Stalås
>
>
>
>
> On Sat, Oct 30, 2010 at 8:47 PM, Hynek Vychodil <> wrote:
>
>> I would not be so thankful if I were you. It doesn't do what you want
>> and works only by accident in this particular example. [^<point]* means
>> any run of characters other than '<', 'p', 'o', 'i', 'n', 't'. [^<]*
>> would work the same way in this particular example.
>>
>> This would work much more generally
>>
>> re:run("<point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\"
>> y=\"2\"z=\"14\"/>", "<point.*?\/>",[global])
>>
>> but in any case you should use an XML parser for XML parsing, because
>> XML cannot be described by a regular grammar, so a regular expression is
>> not the proper tool for it. You will end up with an error-prone solution.
>>
>> On Sat, Oct 30, 2010 at 1:34 PM, Mathias <> wrote:
>> > Works like a charm!
>> >
>> > Many thanks dlfen!
>> >
>> > BR,
>> > Mathias
>> >
>> > On Sat, Oct 30, 2010 at 1:03 PM, dlfen <> wrote:
>> >
>> >> try this.
>> >> re:run("<point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\" y=\"2\"
>> >> z=\"14\"/>", "<point[^<point]*\/>",[global]).
>> >>
>> >>
>> >> On 2010-10-30, at 6:44 PM, Mathias wrote:
>> >>
>> >> > Hi there,
>> >> >
>> >> > I'm trying to figure out how Erlang's re:run function works.
>> >> >
>> >> > When executing this:
>> >> > 1> re:run("<point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\" y=\"2\"
>> >> > z=\"14\"/>", "<point(?:\s|.)*\/>").
>> >> > {match,[{0,54}]}
>> >> >
>> >> > I can see that it gives me a match on the complete XML representation
>> >> > {match,[{0,54}]}.
>> >> >
>> >> > But what I really would like is for it to give me a separate match
>> >> > for each entity, similar to {match,[{0,26},{27, 26}]}.
>> >> >
>> >> > so the output would yield something like this:
>> >> > 0-26 gives the first XML entity complete with its attributes
>> >> > <point x="12" y="2" z="4"/> and
>> >> > match 27,26 gives the remaining entity.
>> >> >
>> >> > If anyone can spot why my regexp <point(?:\s|.)*\/> is failing and
>> >> > guide me in the right direction towards a solution, it will be
>> >> > greatly appreciated.
>> >> >
>> >> > I know about xmerl but for my trivial case it seems like overkill.
>> >> >
>> >> > Thx in advance.
>> >> >
>> >> > BR,
>> >> > Mathias Stalås
>> >>
>> >>
>> >
>>
>>
>>
>> --
>> --Hynek (Pichi) Vychodil



-- 
Jesse Gumm
Sigma Star Systems
414.940.4866

http://www.sigma-star.com

