[erlang-questions] wikipedia rendering engine

Joe Armstrong erlang@REDACTED
Mon Jun 30 14:18:11 CEST 2008


On Mon, Jun 30, 2008 at 1:58 PM, Alain O'Dea <alain.odea@REDACTED> wrote:
> On Mon, Jun 30, 2008 at 9:10 AM, Joe Armstrong <erlang@REDACTED> wrote:
>> On Mon, Jun 30, 2008 at 1:36 PM, Jan Lehnardt <jan@REDACTED> wrote:
>>> On Jun 30, 2008, at 13:23, Joe Armstrong wrote:
>>>>
>>>> Is there a REST interface so that I can retrieve the latest version of
>>>> the MediaWiki markup for a specific page with, for example,
>>>> a wget command?
>>>
>>> You can get bulk dumps
>>> http://en.wikipedia.org/wiki/Wikipedia:Database_download#Where_do_I_get...
>>>
>>> Why would you do individual scraping? In order to keep up to date with
>>> changes that happened between the last dump and now()?
>>>
>>
>> To get a few test cases to test my parser on *before* I download the
>> entire thing.
>>
>> Also, I suspect the dumps are in MySQL format with XML junk, so it might
>> not be a trivial job to extract the raw data. I (presumably) will have
>> to install MySQL and turn some XML stuff into the raw data (just
>> guessing here) - thought that could be a job for a volunteer :-)
>>
>> /Joe
>>
>>
>>> Cheers
>>> Jan
>>> --
>>>
>>>> Has anybody made an Erlang interface to scrape individual pages from
>>>> the wikipedia - or to bulk convert the entire wikipedia to Erlang
>>>> terms :-)
>>>>
>>>> /Joe
>>>>
>>>>
>>>>
>>>> On Mon, Jun 30, 2008 at 11:39 AM, Joe Armstrong <erlang@REDACTED> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I was at the erlang exchange and heard the *magnificent* talk
>>>>>
>>>>> "Building a transactional distributed data store with Erlang", by
>>>>> Alexander Reinefeld.
>>>>>
>>>>> I'll be blogging this as soon as I have the URL of the video of the talk.
>>>>>
>>>>> (In advance of this there was a talk at the Google conference on
>>>>> scalability:
>>>>>
>>>>> http://video.google.com/videoplay?docid=-6526287646296437003&q=erlang+scalable&ei=cZ9oSLiDNIiCiwLL9fGwCA&hl=en
>>>>>
>>>>> Oh, and they also seem to have won the SCALE 2008 prize at the
>>>>> CCGrid conference in Lyon, but there is zero publicity about this
>>>>> AFAICS.)
>>>>>
>>>>> We (collectively) promised to help Alexander - I promised to provide
>>>>> him with a rendering engine (in Erlang) for the wikipedia markup
>>>>> language.
>>>>>
>>>>> Before I start hacking, has anybody done this before?
>>>>>
>>>>> /Joe Armstrong
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> fra@REDACTED; ingvar.akesson@REDACTED
>>>>
>>>> [Kopia av detta meddelande skickas till FRA för övervakningsändamål.
>>>> De vill ju ändå läsa min e-post.]
>>>>
>>>> [A copy of this mail has been sent to
>>>> FRA for monitoring purposes. FRA wants to read all my e-mail and has
>>>> been allowed to do so by the Swedish parliament - in violation of article
>>>> 12 of the UN Universal Declaration of Human Rights]
>>>> _______________________________________________
>>>> erlang-questions mailing list
>>>> erlang-questions@REDACTED
>>>> http://www.erlang.org/mailman/listinfo/erlang-questions
>>>>
>>>
>>>
>>
>
> There is a REST interface, but it is not exactly machine-friendly. If
> you request
> http://en.wikipedia.org/w/index.php?title=<TOPIC NAME>&action=edit
> with a topic name filled in, you will get an editor page. For example,
> http://en.wikipedia.org/w/index.php?title=Erlang%20(programming%20language)&action=edit
> brings up the editor page for the Erlang programming language.
>
> The raw MediaWiki markup is in a textarea with id "wpTextbox1", but
> unfortunately I have been unable to get xmerl to extract it because
> the page is HTML and not well-formed XML.
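
For anyone who wants to script this, here is a rough sketch of building
that editor-page URL for an arbitrary topic. It is in Python rather than
Erlang purely for brevity; the function name edit_url is made up for
illustration, and the only thing taken from the message above is the
index.php?title=...&action=edit pattern.

```python
from urllib.parse import quote, urlencode

# Base address taken from the example URL in the thread.
BASE = "http://en.wikipedia.org/w/index.php"


def edit_url(topic: str) -> str:
    """Build the editor-page URL for a topic (hypothetical helper)."""
    # quote_via=quote encodes spaces as %20 instead of '+';
    # safe="()" leaves parentheses unescaped, as in the example URL.
    query = urlencode({"title": topic, "action": "edit"},
                      safe="()", quote_via=quote)
    return BASE + "?" + query


print(edit_url("Erlang (programming language)"))
```

The resulting string can be handed to wget or any HTTP client.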

Seems to work - it should be easy to extract the content.

Why bother with xmerl? Just scan the text for a constant string ...

<textarea tabindex='1' accesskey="," name="wpTextbox1" id="wpTextbox1"

This is really easy.
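
A minimal sketch of that scan, assuming the page has already been
fetched into a string. It is written in Python only to keep it short
(the same three searches translate directly to Erlang), and the name
extract_markup is hypothetical. I am also assuming that the editor page
HTML-escapes the markup inside the textarea, so the sketch undoes that.

```python
import html

MARKER = 'id="wpTextbox1"'  # constant string from the textarea tag
CLOSE = "</textarea"


def extract_markup(page: str):
    """Return the text inside the wpTextbox1 textarea, or None."""
    at = page.find(MARKER)
    if at < 0:
        return None
    # The first '>' after the marker ends the opening <textarea ...> tag
    # (assumes no '>' appears inside a later attribute value).
    start = page.find(">", at)
    if start < 0:
        return None
    end = page.find(CLOSE, start)
    if end < 0:
        return None
    # Assumption: the content is HTML-escaped (&lt;, &amp;, ...).
    return html.unescape(page[start + 1:end])
```

No HTML or XML parser is involved, so it does not matter that the page
is not well-formed XML.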

I wonder if this is what is in the database, or has it been generated
from something else?

/Joe




>
> I imagine a simple parser which looks for '<textarea', then
> 'id="wpTextbox1"', then '>', then gathers text until '</textarea'
> would work pretty well. I'll take a look at this when I get home this
> evening.
>





