[erlang-questions] wikipedia rendering engine

Thorsten Schuett <>
Mon Jun 30 14:13:36 CEST 2008


Hi all,

as I am partially to blame for the noise around the wikirenderer, I will add 
my two cents.

For our experiments, we used the XML dumps available at 
http://download.wikimedia.org. We have a small Java program that converts 
an XML dump to Erlang terms (http://www.zib.de/schuett/dumpreader.tgz). For 
example, to convert the Bavarian dump:
java -jar dumpreader.jar /home/schuett/barwiki-20080225-pages-meta-history.xml
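
A minimal sketch of this kind of conversion follows. It is not the dumpreader
source: the class name DumpToTerms, its methods, and the output term shape
{page, Title, [RevisionText, ...]} are assumptions made for illustration. It
streams over the dump with the JDK's StAX parser and emits one Erlang term per
<page> element.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical converter: MediaWiki XML dump -> Erlang terms (one per page).
final class DumpToTerms {

    // Quote a Java string as an Erlang string literal.
    // Non-ASCII and control characters are written as \x{HEX} escapes.
    static String erlString(String s) {
        StringBuilder sb = new StringBuilder("\"");
        s.codePoints().forEach(cp -> {
            if (cp == '"' || cp == '\\') {
                sb.append('\\').append((char) cp);
            } else if (cp == '\n') {
                sb.append("\\n");
            } else if (cp < 0x20 || cp > 0x7e) {
                sb.append(String.format("\\x{%X}", cp));
            } else {
                sb.append((char) cp);
            }
        });
        return sb.append('"').toString();
    }

    // Read a dump and return terms of the assumed form
    //   {page, "Title", ["text of revision 1", ...]}.
    // A converter for full history dumps would write each term out as soon
    // as it is complete instead of collecting them in a list.
    static List<String> convert(Reader in) {
        List<String> terms = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader r = factory.createXMLStreamReader(in);
            String title = "";
            List<String> revisions = new ArrayList<>();
            while (r.hasNext()) {
                int event = r.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = r.getLocalName();
                    if (name.equals("page")) {
                        title = "";
                        revisions.clear();
                    } else if (name.equals("title")) {
                        title = r.getElementText();
                    } else if (name.equals("text")) {
                        revisions.add(erlString(r.getElementText()));
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && r.getLocalName().equals("page")) {
                    terms.add("{page, " + erlString(title) + ", ["
                            + String.join(", ", revisions) + "]}.");
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("malformed dump", e);
        }
        return terms;
    }
}
```

Calling DumpToTerms.convert(new FileReader(path)) and writing the returned
lines to a file should give something that file:consult/1 can read back on the
Erlang side.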

But you still have to parse the MediaWiki text and convert it to HTML.
For this last step we currently have two solutions:

1. Early experiments used flexbisonparse 
(http://svn.wikimedia.org/viewvc/mediawiki/trunk/flexbisonparse/) to convert 
the mediawiki text to XML and XSLT to convert the XML to HTML.

2. The current code is based on plog4u/bliki (see 
http://matheclipse.org/en/Java_Wikipedia_API).

Thorsten

On Monday 30 June 2008, Joe Armstrong wrote:
> On Mon, Jun 30, 2008 at 1:36 PM, Jan Lehnardt <> wrote:
> > On Jun 30, 2008, at 13:23, Joe Armstrong wrote:
> >> Is there a REST interface so that I can retrieve the latest version of
> >> the MediaWiki markup for a specific page with, for example,
> >> a wget command?
> >
> > You can get bulk dumps
> > http://en.wikipedia.org/wiki/Wikipedia:Database_download#Where_do_I_get...
> >
> > Why would you do individual scraping? In order to keep up to date with
> > changes that happened between the last dump and now()?
>
> To get a few test cases to test my parser on *before* downloading the
> entire thing.
>
> Also I suspect the dumps are in MySQL format with XML junk, so it might
> not be a trivial job to extract the raw data. I (presumably) will have to
> install MySQL and turn some XML stuff into the raw data (just guessing
> here). I thought that could be a job for a volunteer :-)
>
> /Joe
>
> > Cheers
> > Jan
> > --
> >
> >> Has anybody made an erlang interface to scrape individual pages from
> >> the wikipedia - or to bulk convert the entire
> >> wikipedia to erlang terms :-)
> >>
> >> /Joe
> >>
> >> On Mon, Jun 30, 2008 at 11:39 AM, Joe Armstrong <> wrote:
> >>> Hi,
> >>>
> >>> I was at the Erlang eXchange and heard the *magnificent* talk
> >>>
> >>> "Building a transactional distributed data store with Erlang", by
> >>> Alexander Reinefeld.
> >>>
> >>> I'll be blogging this as soon as I have the URL of the video of the
> >>> talk.
> >>>
> >>> (In advance of this there was a talk at the Google conference on
> >>> scalability:
> >>>
> >>> http://video.google.com/videoplay?docid=-6526287646296437003&q=erlang+scalable&ei=cZ9oSLiDNIiCiwLL9fGwCA&hl=en
> >>>
> >>> Oh, and they also seem to have won the SCALE 2008 prize at the
> >>> CCGrid conference in Lyon, but there is zero publicity about this
> >>> AFAICS.)
> >>>
> >>> We (collectively) promised to help Alexander - I promised to provide
> >>> him with a
> >>> rendering engine (in Erlang) for the wikipedia markup language.
> >>>
> >>> Before I start hacking has anybody done this before?
> >>>
> >>> /Joe Armstrong
> >>
> >> --
> >> ; 
> >>
> >> [Kopia av detta meddelande skickas till FRA för övervakningsändamål.
> >> De vill ju ändå läsa min e-post.]
> >>
> >> [A copy of this mail has been sent to
> >> FRA for monitoring purposes. FRA wants to read all my e-mail and has
> >> been allowed to do so by the Swedish parliament - in violation of article
> >> 12 of the UN Universal Declaration of Human Rights]
> >> _______________________________________________
> >> erlang-questions mailing list
> >> 
> >> http://www.erlang.org/mailman/listinfo/erlang-questions




