De siesta en siesta

hugo’s blog

Category Archives: Development

String Manipulation example in IMPURE

Pre-processing a text file in Impure

Here is a simple way to pre-process a text file in Impure using the ReplaceSubstrings control.

I wanted to analyse the book “Alice in Wonderland” using Impure’s linguistic analysis tools. The text of the book is available from the Gutenberg project in .txt format at

Loading this file into Impure and visualising it is very easy with a FileLoader and a SimpleStringVisualizator: [Impure screenshot]

The SimpleStringVisualizator goes a little mad when the file to be visualized is too large, so for debugging purposes I decided to work with only a small part of the text, which I obtained using the getSubstring control:

Ok, now we can see the file. But if you look closely at the lines, you will notice that paragraphs are broken by line breaks. More specifically, in this file lines are separated by line breaks (actually \r\n), as in the original printed book. Paragraphs are separated by double line breaks (\r\n\r\n). Impure does a good job of converting all kinds of line breaks into Impure line breaks, so we don’t need to worry about the line break encodings. However, line breaks are problematic because one sentence can be broken into two or more lines, which may confuse the linguistic analyser.

What we need to do is to reconnect the sentences. This could be done simply by replacing line breaks with spaces using a replaceSubstring control. However, this would eliminate paragraphs. If we want to preserve paragraph information, we need to:

  1. replace paragraph breaks (double line breaks) with a temporary marker,
  2. replace single line breaks with spaces (this puts sentences back together),
  3. replace the temporary marker with a single line break, recovering the paragraphs.

These three operations can be done using three replaceSubstring controls, or with a single replaceSubstrings control, as done here:

We now have a string where each line break indicates a paragraph. It is now very easy to pick a few paragraphs for linguistic processing. To select a single paragraph, apply a splitString by line break and then use getElementFromList to pick the desired paragraph:
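For readers who prefer text to Impure’s visual controls, here is a rough Python sketch of the same idea. It is not Impure code: the function names and the temporary marker are made up for illustration, and the marker is assumed never to occur in the book. It mirrors the three replacements and then the split-and-pick step.

```python
# Hypothetical temporary marker; assumed never to appear in the text itself.
PARAGRAPH_MARKER = "\x00PARAGRAPH\x00"


def rejoin_sentences(text: str) -> str:
    """Turn single line breaks into spaces while keeping paragraph breaks."""
    # Normalise \r\n and \r to \n first (Impure does this on its own).
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # 1. paragraph breaks (double line breaks) -> temporary marker
    text = text.replace("\n\n", PARAGRAPH_MARKER)
    # 2. remaining single line breaks -> spaces
    text = text.replace("\n", " ")
    # 3. temporary marker -> single line break
    return text.replace(PARAGRAPH_MARKER, "\n")


def get_paragraph(text: str, index: int) -> str:
    """Split the cleaned text by line break and return one paragraph."""
    return rejoin_sentences(text).split("\n")[index]


sample = "Alice was beginning\r\nto get very tired.\r\n\r\nSo she was\r\nconsidering."
print(get_paragraph(sample, 1))
```

The order of the replacements matters: if single line breaks were replaced first, the paragraph breaks would be lost along with them.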

Vodafone customer service: incompetent and liars.

Here is a summary of my adventure trying to get Vodafone to replace my passport number with my DNI (national ID) number in my contract:

Shop visit 1: you have to call customer service.

Customer service call 1: you have to go to the shop.

Shop visit 2: you have to call customer service. Well then, you call them, because they don’t listen to me.

Customer service call 2 (from the shop): we call, and this time they make the change! Perfect! You will receive an SMS within 48 hours to confirm.

(48 hours pass)

Customer service call 3: Has the change been made? No.

Customer service call 4: Has the change been made? No.

Customer service call 5: Has the change been made? “But did you send the fax?” What?! Yes, you have to send a fax. Excuse me?!

Shop visit 3: They know nothing about it; we have to call.

Customer service call 6 (from the shop): We call and they give us the details that have to be sent by fax, and the number.

Fax 1 sent. And just in case, the shop decides to send it a second time:

Fax 2 sent. The change will be done in 4-7 days!

(7 days pass)

Customer service call 7: Has the change been made? No, you must wait 1-2 weeks.

(5 days pass)

Customer service call 8: Has the change been made? No, but the system says it is in progress. Good!

(2 days pass)

Customer service call 9: Has the change been made? No. But is it in progress? No. But did you receive the fax? It seems not. You have to send another fax, the first one “did not go in”. But I have the receipt showing that you received the fax. One moment please… Yes, it may have been received, but it “did not go into the system”. Well, put it into the system then, no? No, you have to send a fax. Look, miss, I have been at this for 2 weeks, with I don’t know how many calls and visits, so find my fax and put it into the system. You have to send another fax. Well, I am not going to send it. Do you have any other questions? Before you hang up we would like to tell you about our new offers blah…

GOOD BYE VODAFONE! Welcome Yoigo!

CommonTag is out

Bibtex Crossrefs in Lyx

When several cited entries share the same crossref, BibTeX pulls the cross-referenced entry into the bibliography as an additional reference.

To stop this in LaTeX you need to raise the -min-crossrefs parameter when you run bibtex (the default is 2).

To stop this in Lyx you need to set Tools/Preferences/Output/LaTeX/BibTeX command to: bibtex -min-crossrefs=900

Correlator, our new demo.



Correlator is now live at the Yahoo! Sandbox.

Correlator showcases our ability to locate and classify entities in text, and to sort them by relevance with respect to a query. You can learn more about it in different places:

Writing a SIGIR (sig-alternate) paper on Lyx

Lyx is wonderful… when it works :)

Here is what I had to do to write my SIGIR paper on Lyx 1.6.1 (I recommend updating to this before continuing…).

First, get the class file that you need (in my case, for SIGIR, the sig-alternate class) and put it somewhere your LaTeX can find it. (For Windows see below; for Unix you can find instructions here.)

For Windows using MiKTeX this means putting the file somewhere under C:\Program Files\MiKTeX 2.7\tex\latex and running MiKTeX/Settings/Refresh FNDB and MiKTeX/Settings/Update Formats.

Now get Lyx to use this class. For this you can follow instructions in the Lyx Customization manual or many web pages on the topic (e.g., here).

Great, almost there. You can start your paper and all works fine… except citations. I got the following error when adding citations:

Use of \@citex doesn't match its definition
… then you must always put `1' after `\a', since control sequence names are
made up of letters only. The macro here has not been
followed by the required stuff, so I'm ignoring it

Not so useful :( The problem seems to be that citation in sig-alternate clashes with the babel package. You need to turn babel off in your Lyx. You can do this as follows:

  1. In Tools/Preferences/LanguageSettings you need to untick “Use babel” and “Global”.
  2. You need to clear the “Command start” field (this should not be necessary, probably a Lyx bug?). (You can find the command start later by searching for “babel” in the User Guide.)

This is enough to make it work on a new document. However, if you are halfway through writing your document, some of the language options stick to it. I got the message:

Package babel Error: You haven’t loaded the option english yet.

\select@language{english} You may proceed …

To fix this you need to create a new document and cut and paste the contents of your old document into this new one. This worked for me!

Closer to the brick wall to knowledge

I wrote a simple page that produces a picture of a thing:

It uses DBpedia to find the infobox of a Wikipedia entity and queries for its picture.

It works very nicely in some examples. Try: wine, peach, “Karl Marx”, woman, dog, fear, swimming…

It also fails on many others. Try: green, war…

And sometimes the picture is not what you expect. Try: notebook, apple…
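As a rough illustration of the kind of lookup involved, here is a minimal Python sketch that builds a SPARQL query for the picture of an entity. Everything in it is an assumption rather than the code behind the page: the function names are made up, the endpoint is taken to be DBpedia’s public one, and foaf:depiction is only one of the properties that might hold the picture. It builds the query and the request URL but does not send anything.

```python
from urllib.parse import quote, urlencode

# Assumed: DBpedia's public SPARQL endpoint.
SPARQL_ENDPOINT = "http://dbpedia.org/sparql"


def depiction_query(entity: str) -> str:
    """Build a SPARQL query asking for one picture of a DBpedia resource."""
    # Assumed naming rule: resource names follow Wikipedia titles, so spaces
    # become underscores and the first letter is upper case.
    name = entity.strip().replace(" ", "_")
    name = name[:1].upper() + name[1:]
    resource = quote(name)
    return (
        "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
        f"SELECT ?picture WHERE {{ <http://dbpedia.org/resource/{resource}> "
        "foaf:depiction ?picture } LIMIT 1"
    )


def query_url(entity: str) -> str:
    """Return the GET URL that would submit the query to the endpoint."""
    return SPARQL_ENDPOINT + "?" + urlencode({"query": depiction_query(entity)})


print(depiction_query("Karl Marx"))
print(query_url("wine"))
```

A lookup like this only succeeds when the word maps directly onto a resource name and that resource carries a picture, which is one plausible reason some words return nothing or something unexpected.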

In my opinion having services such as these is very important, not because they bring us closer to AI or knowledge, but because they bring us closer towards the many brick walls separating us from them!


XEmacs for Windows

Installing Packages

To install packages in XEmacs for Windows I had to:

1) Tools-Packages-SetDownloadSite-OfficialReleases-US (Main XEmacs site). (I tried some of the official European sites but the packages I needed were missing.)

2) Tools-Packages-ListAndInstall

3) Go to the package you need (with the keyboard cursor! The mouse gets confused here) and right-click on “Toggle Install”, then on “Add Required”, and finally on “Install/Remove Selected”.

Say yes to whatever questions it asks about creating new directories.

Installing MG4J search engine on Cygwin and Linux

Spent some time installing MG4J on both Windows and Linux. Here are my findings (thanks Vigna and Boldi for all the help):

Eclipse – Cygwin Installation for Windows:


  • optional: install rlwrap for cygwin


  • Create an Eclipse project from “existing Ant BuildFile” and choose the build.xml file in the sources-folder
  • Add to Project’s build-path all jars in the dependencies-folder
  • Eclipse project should build now…

Indexing and Querying from Cygwin’s bash

You can now index and query through Eclipse using the Run interface.

However, if you would like to do this from the command line, you need to work a little more.

I am using Cygwin’s bash. The biggest problem I had was setting the path. You need to set it from the Cygwin shell (bash in my case) but in WINDOWS format :( I wrote a little Perl script to do this:

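#!/usr/bin/perl -w
# Hugo Zaragoza, 2008.
# BUILDS Windows java classpath line for cygwin bash
use strict;
# Accept one or more files (a shell glob expands to many arguments).
die ("ARGUMENTS: files\n") unless @ARGV >= 1;
my $line = "";
foreach my $d (@ARGV){
    $d =~ s/\/cygdrive\/c/c:/;   # /cygdrive/c/... becomes c:/...
    $d =~ s/\//\\/g;             # forward slashes become backslashes
    $line .= ";$d";              # Windows separates classpath entries with ;
}
# Print the whole line once, after all entries have been added.
print $line;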

Then you can set the path in bash with (substituting the paths of your dependency and source folders):
export CLASSPATH=`perl ./ /cygdrive/c/dependency-folder/*jar /cygdrive/c/source-folder/build`
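If you do not have Perl at hand, the same conversion can be sketched in Python. This is a loose equivalent rather than a port of the script above: the function names are invented for illustration, and unlike the Perl version it handles any single-letter drive, not just c.

```python
def to_windows_path(path: str) -> str:
    """Convert /cygdrive/<drive>/... into <drive>:\\... with backslashes."""
    prefix = "/cygdrive/"
    if path.startswith(prefix):
        drive, _, rest = path[len(prefix):].partition("/")
        # Only rewrite when what follows /cygdrive/ is a single drive letter.
        if len(drive) == 1 and drive.isalpha():
            path = f"{drive}:/{rest}"
    return path.replace("/", "\\")


def windows_classpath(paths) -> str:
    """Join the converted paths with ';', the Windows classpath separator."""
    return ";".join(to_windows_path(p) for p in paths)


print(windows_classpath([
    "/cygdrive/c/dependency-folder/example.jar",  # hypothetical jar name
    "/cygdrive/c/source-folder/build",
]))
```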

Now you can build your indices and query the engine… for example (remember to use Windows paths):

java -Xmx512M it.unimi.dsi.mg4j.document.TRECDocumentCollection trec.collection c:/HUGO/DATA/trecFiles*
java -Xmx512M it.unimi.dsi.mg4j.tool.IndexBuilder -Itoken -S trec.collection trec

and you are ready to query the engine:

rlwrap -H ~/.Query-history java -Xmx512M it.unimi.dsi.mg4j.query.Query -i GenericItem trec-token -h -c trec.collection -v

Now you can use the command-line interface or check http://localhost:4242/Query

Unix Installation

No: you are NOT stupid. It took me 4 days of emails with my system administrator and Vigna and Boldi (the authors of MG4J!)

First, root needed to install these in our system:

  • You need a recent Ant with some extras:
    • install ant version >= 1.7.0 (otherwise it does NOT work, believe me… :(
    • install xml-commons-api
    • reinstall java (in my case xml-commons-api removes it!)
    • install ant-nodeps (in .rpm) or ant-ant-optional (in .deb)
  • (optional) install “rlwrap” to have a nicer command line access to mg4j


  • download & untar dependencies ( in my case) into a <dependency-directory>
  • download & untar source ( in my case) into a <source-directory>
  • Go to the top directory in your <source-directory>, you should see a file.
  • modify second line: jar.base=<dependency-directory>
  • set your classpath to point to dependency-directory (ant does not need it but javacc does). Here is a nice trick to do this: export CLASSPATH=$(ls -1 <dependency-dir>/*jar | paste -s -d:)
  • “ant jar” in source-directory, where the build.xml file is.
  • I get the following warnings, but it builds!!!
    • [taskdef] Could not load definitions from resource It could not be found.
      [taskdef] Could not load definitions from resource It could not be found.
      [taskdef] build.xml:165: Warning: taskdef class edu.umd.cs.findbugs.anttask.FindBugsTask cannot be found

Indexing and Querying:

Ok you are almost there :)

With the above you should be able to build an index and run queries on the command line (as described in the Windows section above), if you set your classpath correctly:

export CLASSPATH=$(ls -1 <dependency-dir>/*jar <source-dir>/build | paste -s -d:)
java -Xmx512M it.unimi.dsi.mg4j.document.TRECDocumentCollection trec.collection trecFiles*
java -Xmx512M it.unimi.dsi.mg4j.tool.IndexBuilder -Itoken -S trec.collection trec
rlwrap -H ~/.Query-history java -Xmx512M it.unimi.dsi.mg4j.query.Query -i GenericItem trec-token -h -c trec.collection -v
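The classpath line can also be assembled from a script. The Python sketch below is only an illustration with a made-up function name: it collects the jars in the dependency directory, appends the build directory itself, and joins everything with the platform’s path separator (":" on Unix).

```python
import glob
import os


def build_classpath(dependency_dir: str, build_dir: str) -> str:
    """Join every jar in dependency_dir, plus build_dir, into one classpath."""
    # Same pattern as the shell glob <dependency-dir>/*jar, sorted for stability.
    jars = sorted(glob.glob(os.path.join(dependency_dir, "*jar")))
    # os.pathsep is ':' on Unix and ';' on Windows.
    return os.pathsep.join(jars + [build_dir])


# Hypothetical directory names; substitute your own.
print(build_classpath("dependency-dir", "source-dir/build"))
```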

However, if you try to access the search engine through HTTP it will probably break with an error about query.velocity not being found. The fix I found is to pass the path to the query.velocity file, relative to a directory in the classpath, as -Dit.unimi.dsi.mg4j.query.QueryServlet.template= (held in the shell variable $velocity in the command below):

export CLASSPATH=$(ls -1 <dependency-dir>/*jar <source-dir>/build | paste -s -d:)

rlwrap -H ~/.Query-history java -Xmx512M -Dit.unimi.dsi.mg4j.query.QueryServlet.template=$velocity it.unimi.dsi.mg4j.query.Query -i GenericItem trec-token -h -c trec.collection -v

Installing PerlDL for Windows Cygwin

Wanted to play with PerlDL as an alternative to numpy…

I thought it would be easier to install under cygwin with CPAN but I kept getting the error:

Running make test
Can’t test without successful make
Running make install
make had returned bad status, install seems impossible

After looking around and wasting a lot of time I found out that:

  1. I needed to install gcc on cygwin using its setup.exe
  2. Perl’s IO::Tty would not install