[erlang-questions] Erlang re:run regular exp, match problem

Mathias mathiasstalas@REDACTED
Sun Oct 31 23:05:31 CET 2010


I'm just goofing around at the moment, so no worries.
I have found xmerl_xpath to be a good friend.
It's a pity (IMO) that the documentation on how to use these libraries is,
to put it gently, sparse.
On the other hand, I'm glad that they exist, and other open source projects
contain good examples from which one can pick up bits and pieces and figure
things out.
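
For anyone who ends up here looking for the xmerl_xpath route, below is a
minimal sketch of how it can be used. It assumes the points are wrapped in a
single root element (the name "points" is made up here, since an XML document
needs exactly one root), and the module and function names are hypothetical.

```erlang
-module(points_xpath).
-export([points/1]).

%% Record definitions for #xmlElement{} and #xmlAttribute{}.
-include_lib("xmerl/include/xmerl.hrl").

%% Parse an XML string and return, for every <point> element,
%% its attributes as a list of {Name, Value} tuples.
points(Xml) ->
    {Doc, _Rest} = xmerl_scan:string(Xml),
    Nodes = xmerl_xpath:string("//point", Doc),
    [[{Name, Value} || #xmlAttribute{name = Name, value = Value} <- Attrs]
     || #xmlElement{attributes = Attrs} <- Nodes].
```

Calling points_xpath:points/1 with
"<points><point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\" y=\"2\" z=\"14\"/></points>"
gives one attribute list per point, with attribute names as atoms and values
as strings.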

BR,
Mathias Stalås



On Sun, Oct 31, 2010 at 10:08 PM, Morten Krogh <mk@REDACTED> wrote:

> Hi
>
> Take care, '>' is an allowed character in attribute values, e.g.,
>
> <point id="/>"/>
>
> is valid xml.
>
> If you control the input yourself it is no problem, of course.
>
> Morten.
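
To make the caveat above concrete, here is a small sketch that runs the
negated-class pattern "<point [^>]*/>" (suggested further down in this thread)
against that very element. The module and function names are hypothetical;
only re:run/3 is a real library call.

```erlang
-module(attr_gt).
-export([first_match/1]).

%% Return the text of the first match of the negated-class pattern,
%% or nomatch if the pattern does not match at all.
first_match(Xml) ->
    case re:run(Xml, "<point [^>]*/>", [{capture, first, list}]) of
        {match, [Matched]} -> Matched;
        nomatch -> nomatch
    end.
```

attr_gt:first_match("<point id=\"/>\"/>") returns "<point id=\"/>", i.e. the
match stops at the "/>" inside the attribute value and the real end of the
element is lost.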
>
>
>
> On 10/31/10 5:58 PM, Jesse Gumm wrote:
>
>> The problem in that last example is that by default * is greedy and .
>> doesn't match the linefeed (which is why putting \n at the end of each
>> element worked).
>>
>> "<point.*\/>" will match from the first instance of "<point" to the very
>> last "/>".
>>
>> Changing it to:
>>
>> "<point.*?\/>" will make * "ungreedy", so each match ends at the first
>> "/>" it finds.
>>
>> Alternatively, you could use
>>
>> "<point [^>]*\/>"; then you don't really have to worry about greediness
>> at all.
>>
>> -Jesse
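
The three patterns above can be compared side by side with a sketch like the
following. It uses the two-point sample string from the original question;
the module and function names are hypothetical.

```erlang
-module(greedy_demo).
-export([spans/1]).

%% The sample input from the original question (two 27-character elements).
input() ->
    "<point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\" y=\"2\" z=\"14\"/>".

%% Return the {Offset, Length} of every match of Pattern in the sample input.
spans(Pattern) ->
    case re:run(input(), Pattern, [global]) of
        {match, Matches} -> [Span || [Span] <- Matches];
        nomatch -> []
    end.
```

greedy_demo:spans("<point.*/>") returns [{0,54}], while both
greedy_demo:spans("<point.*?/>") and greedy_demo:spans("<point [^>]*/>")
return [{0,27},{27,27}].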
>>
>>
>> On Sun, Oct 31, 2010 at 11:41 AM, Mathias <mathiasstalas@REDACTED> wrote:
>>
>>> Hi,
>>>
>>> That expression was actually the first one I tried, the only difference
>>> being that I ran it from within my application. This was before I knew
>>> that file:read_file and UTF-8 don't blend well; I used file:read_file to
>>> read my UTF-8 encoded file. I only tried dlfen's expression from the
>>> CLI, and there it succeeded. By the time I posted I had tried so many
>>> different solutions that I was tired and just wanted my rather simple
>>> laboratory application to work. Later I found out that neither the sample
>>> he gave nor the one you proposed works from within my program.
>>>
>>> Further investigation has led me to believe that re:run/3 has an issue
>>> with strings that lack linefeeds.
>>> Putting some non-valid XML in a file and running my rather simple program
>>> always yields [[{0,162}]], a single match running from the first character
>>> to the closing '>' of the last point (see attached doc). It doesn't split
>>> the input into one match per point as I expected. Either this is the
>>> intended behaviour of the module, which I find a bit odd, or (god forbid)
>>> a programming fault on my part. Here is the code I use:
>>>
>>> -module(mock).
>>> -export([start/0, read_file/1, decode_data/1, find_pattern/3]).
>>>
>>> start() ->
>>>         Bin = read_file("point.xml"),
>>>         UnicodeString = decode_data(Bin),
>>>         NodeList = find_pattern(UnicodeString, "<point.*\/>",
>>>                                 [unicode, global]),
>>>         NodeList.
>>>
>>> %% Returns the file contents as a binary, or [] if it cannot be read.
>>> read_file(File) ->
>>>         case file:read_file(File) of
>>>                 {ok, Bin} ->  Bin;
>>>                 _ ->  []
>>>         end.
>>>
>>> %% Converts UTF-8 data to a list of Unicode code points.
>>> decode_data(Data) ->
>>>         case unicode:characters_to_list(Data, utf8) of
>>>                 {Tag, Encoded, Rest} when Tag =:= error;
>>>                                           Tag =:= incomplete ->
>>>                         %% io:format/2 takes its arguments as one list
>>>                         io:format("Caught ~w: ~w ~w~n",
>>>                                   [Tag, Encoded, Rest]),
>>>                         [];
>>>                 List ->
>>>                         List
>>>         end.
>>>
>>> %% Returns the list of matches, or [] when nothing matches.
>>> find_pattern(Str, Pattern, Options) ->
>>>         case re:run(Str, Pattern, Options) of
>>>                 {match, Part} ->
>>>                         io:format("find_pattern: ~w~n", [Part]),
>>>                         Part;
>>>                 nomatch ->  []
>>>         end.
>>>
>>> However, adding a linefeed '\n' after each entity in the doc gives the
>>> expected result:
>>> [[{0,27}],[{28,27}],[{56,27}],[{84,27}],[{112,27}],[{140,27}]]
>>> which looks strange to me. I haven't read up on the re module that much,
>>> but this is my experience.
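
The linefeed effect described above can be reproduced in isolation. The sketch
below is only an illustration: the two-element input is made up, and the
module and function names are hypothetical. By default "." does not match a
newline, so the greedy ".*" cannot run past the end of a line; the dotall
option of re:run/3 removes that limit.

```erlang
-module(newline_demo).
-export([spans/1]).

%% Run the greedy pattern over two elements separated by a linefeed and
%% return the {Offset, Length} of every match. Options are extra re options.
spans(Options) ->
    Input = "<point x=\"1\"/>\n<point x=\"2\"/>",
    case re:run(Input, "<point.*/>", [global | Options]) of
        {match, Matches} -> [Span || [Span] <- Matches];
        nomatch -> []
    end.
```

newline_demo:spans([]) returns [{0,14},{15,14}], one match per line, while
newline_demo:spans([dotall]) returns [{0,29}], a single match spanning both
elements just as in the input without linefeeds.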
>>>
>>> I have resigned myself to using xmerl_xpath, which seems to do the job. I
>>> guess that, coming from the Java world, I am a bit spoiled by strong
>>> support for string manipulation; doing the above there would have taken
>>> me less than 10 minutes.
>>>
>>> Anyway, thank you both for the effort.
>>>
>>> BR,
>>> Mathias Stalås
>>>
>>>
>>>
>>>
>>> On Sat, Oct 30, 2010 at 8:47 PM, Hynek Vychodil <hynek@REDACTED> wrote:
>>>
>>>> I would not be so thankful if I were you. It doesn't do what you want; it
>>>> works only by accident in this particular example. [^<point]* means
>>>> any run of characters other than '<', 'p', 'o', 'i', 'n', 't'. [^<]* would
>>>> work the same way in this particular example.
>>>>
>>>> This would work much more generally:
>>>>
>>>> re:run("<point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\" y=\"2\" z=\"14\"/>",
>>>>        "<point.*?\/>", [global])
>>>>
>>>> But anyway, you should use an XML parser for XML parsing: XML cannot be
>>>> described by a regular grammar, so a regular expression is not the proper
>>>> tool for it. You will end up with an error-prone solution.
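
The point about the character class can be checked with an element whose
attribute name contains one of the excluded letters. The sketch below is an
illustration only; the input with an "id" attribute is made up, and the module
and function names are hypothetical.

```erlang
-module(class_demo).
-export([matches/2]).

%% Return the matched substrings of Pattern in Input, or [] if none.
matches(Input, Pattern) ->
    case re:run(Input, Pattern, [global, {capture, first, list}]) of
        {match, Matches} -> [M || [M] <- Matches];
        nomatch -> []
    end.
```

class_demo:matches("<point id=\"a\"/>", "<point[^<point]*/>") returns [],
because the 'i' of "id" is excluded by the class, whereas
class_demo:matches("<point id=\"a\"/>", "<point.*?/>") returns
["<point id=\"a\"/>"].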
>>>>
>>>> On Sat, Oct 30, 2010 at 1:34 PM, Mathias <mathiasstalas@REDACTED> wrote:
>>>>
>>>>> Works like a charm!
>>>>>
>>>>> Many thanks dlfen!
>>>>>
>>>>> BR,
>>>>> Mathias
>>>>>
>>>>> On Sat, Oct 30, 2010 at 1:03 PM, dlfen <erlangdlf@REDACTED> wrote:
>>>>>
>>>>>> Try this:
>>>>>> re:run("<point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\" y=\"2\" z=\"14\"/>",
>>>>>>        "<point[^<point]*\/>", [global]).
>>>>>>
>>>>>>
>>>>>> On 2010-10-30, at 6:44 PM, Mathias wrote:
>>>>>>
>>>>>>> Hi there,
>>>>>>>
>>>>>>> I'm trying to figure out how Erlang's re:run function works.
>>>>>>>
>>>>>>> When executing this:
>>>>>>> 1> re:run("<point x=\"12\" y=\"2\" z=\"4\"/><point x=\"4\" y=\"2\" z=\"14\"/>",
>>>>>>>           "<point(?:\s|.)*\/>").
>>>>>>> {match,[{0,54}]}
>>>>>>>
>>>>>>> I can see that it gives me a single match covering the complete XML
>>>>>>> representation, {match,[{0,54}]}.
>>>>>>>
>>>>>>> But what I would really like is for it to give me one match per
>>>>>>> entity, similar to {match,[{0,26},{27,26}]},
>>>>>>>
>>>>>>> so that the output would be something like this:
>>>>>>> 0-26 gives the first XML entity complete with its attributes,
>>>>>>> <point x="12" y="2" z="4"/>, and
>>>>>>> match 27,26 gives the remaining entity.
>>>>>>>
>>>>>>> If anyone can spot why my regexp <point(?:\s|.)*\/> is failing and
>>>>>>> guide me in the right direction towards a solution, it will be greatly
>>>>>>> appreciated.
>>>>>>>
>>>>>>> I know about xmerl, but for my trivial case it seems like overkill.
>>>>>>>
>>>>>>> Thx in advance.
>>>>>>>
>>>>>>> BR,
>>>>>>> Mathias Stalås
>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>> --
>>>> --Hynek (Pichi) Vychodil
>>>>
>>>>
>>>>
>>>
>>> ________________________________________________________________
>>> erlang-questions (at) erlang.org mailing list.
>>> See http://www.erlang.org/faq.html
>>> To unsubscribe; mailto:erlang-questions-unsubscribe@REDACTED
>>>
>>>
>>
>>
>
>
>

