[erlang-questions] Erlang re:run regular exp, match problem

Mathias mathiasstalas@REDACTED
Sun Oct 31 18:09:25 CET 2010


Thank you for the clarification Jesse.
Nicely explained. I have tried it and it worked.
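
For anyone finding this thread later, the three patterns discussed below can be compared side by side. This is only a minimal sketch: the module name point_match and the function demo/0 are made up for illustration, and the subject string is the two-element example from the original question.

```erlang
-module(point_match).
-export([demo/0]).

%% Two adjacent <point .../> elements on one line, as in the thread.
subject() ->
    "<point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\" y=\"2\" z=\"14\"/>".

demo() ->
    S = subject(),
    %% Greedy: .* runs on to the last "/>", so both elements form one match.
    {match, Greedy} = re:run(S, "<point.*/>", [global]),
    %% Ungreedy: .*? stops at the first "/>" after each "<point".
    {match, Lazy} = re:run(S, "<point.*?/>", [global]),
    %% Negated class: [^>]* cannot cross a '>', so greediness is harmless.
    {match, Class} = re:run(S, "<point [^>]*/>", [global]),
    {Greedy, Lazy, Class}.
```

After compiling, point_match:demo() should return one {0,54} match for the greedy pattern and two 27-character matches, {0,27} and {27,27}, for each of the other two.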

BR,
Mathias Stalås

On Sun, Oct 31, 2010 at 5:58 PM, Jesse Gumm <sigmastar@REDACTED> wrote:

> The problem in that last example is that by default * is greedy and .
> doesn't match a linefeed (which is why putting \n at the end of each
> element worked).
>
> "<point.*\/>" will match from the first instance of "<point" to the very
> last "/>"
>
> Changing it to:
>
> "<point.*?\/>" will make * act "ungreedy", so each match ends at the
> first "/>" it finds.
>
> Alternatively, you could use
>
> "<point [^>]*\/>" and then you don't have to worry about greediness at
> all.
>
> -Jesse
>
>
> On Sun, Oct 31, 2010 at 11:41 AM, Mathias <mathiasstalas@REDACTED> wrote:
>
>> Hi,
>>
>> That expression was actually the first one I tried, the only difference
>> being that I ran it from within my application. This was before I knew
>> that file:read_file and UTF-8 don't blend well. I used file:read_file to
>> read in my UTF-8 encoded file... I had only tried dlfen's expression from
>> the CLI, and there it succeeded. At the point of posting I had tried so
>> many different solutions that I was tired and just wanted my rather simple
>> laboratory application to work. Later I found out that neither the sample
>> he gave nor the one proposed by you worked from within my application.
>>
>> Further investigation has led me to believe that re:run/3 has an issue
>> with strings lacking linefeeds.
>> Putting some non-valid XML in a file and running my rather simple program
>> always yields [[{0,162}]], i.e. a single match from the first char to the
>> closing '>' of the last point (see attached doc). It doesn't split them up
>> as expected, which is either the intended module behaviour, which I find a
>> bit odd, or (god forbid) a programming fault on my part. Here is the code
>> I use:
>>
>> -module(mock).
>> -export([start/0, read_file/1, decode_data/1, find_pattern/3]).
>>
>> start() ->
>>         Bin = read_file("point.xml"),
>>         UnicodeString = decode_data(Bin),
>>         NodeList = find_pattern(UnicodeString, "<point.*\/>",
>>                                 [unicode, global]),
>>         NodeList.
>>
>> read_file(File) ->
>>         case file:read_file(File) of
>>                 {ok, Bin} -> Bin;
>>                 _ -> []
>>         end.
>>
>> decode_data(Data) ->
>>          case unicode:characters_to_list(Data, utf8) of
>>                  {error, Encoded, Rest} ->
>>                         %% io:format/2 takes its arguments as a list
>>                         io:format("Caught Error ~w ~w~n", [Encoded, Rest]),
>>                         [];
>>                  List ->
>>                         List
>>          end.
>>
>> find_pattern(Str, Pattern, Options) ->
>>         case re:run(Str, Pattern, Options) of
>>                 {match, Part} ->
>>                         io:format("find_pattern: ~w~n", [Part]),
>>                         Part;
>>                 nomatch -> []
>>         end.
>>
>> However, adding a linefeed '\n' after each entity in the doc will give the
>> expected result:
>> [[{0,27}],[{28,27}],[{56,27}],[{84,27}],[{112,27}],[{140,27}]]
>> which to me looks strange. Haven't read up on the re module that much but
>> this is my experience.
>>
>> I have resigned myself to using xmerl_xpath, which seems to do the job. I
>> guess that, coming from the Java world, I am a bit spoiled by strong
>> support for string manipulation; doing the above would have taken me less
>> than 10 min.
>>
>> Anyway, thank you both for the effort.
>>
>> BR,
>> Mathias Stalås
>>
>>
>>
>>
>> On Sat, Oct 30, 2010 at 8:47 PM, Hynek Vychodil <hynek@REDACTED> wrote:
>>
>>> I would not thank him if I were you. It doesn't do what you want and
>>> works only by accident in this particular example. [^<point]* means
>>> any run of chars other than '<', 'p', 'o', 'i', 'n', 't'. [^<]* would
>>> work the same way in this particular example.
>>>
>>> This would work much more generally
>>>
>>> re:run("<point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\" y=\"2\" z=\"14\"/>",
>>>        "<point.*?\/>", [global])
>>>
>>> but anyway you should use an XML parser for XML parsing, because XML
>>> cannot be described by a regular grammar, so a regular expression is not
>>> the proper tool for it. You will end up with an error-prone solution.
>>>
>>> On Sat, Oct 30, 2010 at 1:34 PM, Mathias <mathiasstalas@REDACTED>
>>> wrote:
>>> > Works like a charm!
>>> >
>>> > Many thanks dlfen!
>>> >
>>> > BR,
>>> > Mathias
>>> >
>>> > On Sat, Oct 30, 2010 at 1:03 PM, dlfen <erlangdlf@REDACTED> wrote:
>>> >
>>> >> try this.
>>> >> re:run("<point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\" y=\"2\" z=\"14\"/>",
>>> >>        "<point[^<point]*\/>", [global]).
>>> >>
>>> >>
>>> >> On 2010-10-30, at 6:44 PM, Mathias wrote:
>>> >>
>>> >> > Hi there,
>>> >> >
>>> >> > I'm trying to figure out how Erlang's re:run function works.
>>> >> >
>>> >> > When executing this:
>>> >> > 1> re:run("<point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\" y=\"2\" z=\"14\"/>",
>>> >> >           "<point(?:\s|.)*\/>").
>>> >> > {match,[{0,54}]}
>>> >> >
>>> >> > I can see that it gives me a match on the complete XML
>>> >> > representation {match,[{0,54}]}.
>>> >> >
>>> >> > But what I would really like is for it to give me a separate match
>>> >> > for each entity, similar to {match,[{0,26},{27, 26}]}.
>>> >> >
>>> >> > so the output would yield something like this:
>>> >> > 0-26 gives the first XML entity complete with its attributes,
>>> >> > <point x="12" y="2" z="4"/>, and
>>> >> > match 27,26 gives the remaining entity.
>>> >> >
>>> >> > If anyone can spot why my regexp <point(?:\s|.)*\/> is failing and
>>> >> > guide me in the right direction towards a solution, it will be
>>> >> > greatly appreciated.
>>> >> >
>>> >> > I know about xmerl but for my trivial case it seems like overkill.
>>> >> >
>>> >> > Thx in advance.
>>> >> >
>>> >> > BR,
>>> >> > Mathias Stalås
>>> >>
>>> >>
>>> >
>>>
>>>
>>>
>>> --
>>> --Hynek (Pichi) Vychodil
>>>
>>>
>>
>>
>>
>> ________________________________________________________________
>> erlang-questions (at) erlang.org mailing list.
>> See http://www.erlang.org/faq.html
>> To unsubscribe; mailto:erlang-questions-unsubscribe@REDACTED
>>
>
>
>
> --
> Jesse Gumm
> Sigma Star Systems
> 414.940.4866
> gumm@REDACTED
> http://www.sigma-star.com
>

