De siesta en siesta

hugo’s blog

MG4J Adventures

[ NOTE: This is now old and not maintained nor looked at. If you have questions regarding MG4J you should go to https://groups.google.com/forum/#!forum/mg4j ]

 

Some of my findings in the MG4J world, mostly guided by Paolo Boldi and Sebastiano Vigna, who are too wise and busy to write such silly things down…

  1. Installation Adventures
  2. Indexing Adventures
  3. Coding Adventures

1. Installation Adventures:

Spent some time installing MG4J on both Windows and Linux. Here are my findings (thanks Vigna and Boldi for all the help):

Eclipse – Cygwin Installation for Windows:

Prerequisites:

  • optional: install rlwrap for cygwin

Installation

  • Create an Eclipse project from “existing Ant BuildFile” and choose the build.xml file in the sources-folder
  • Add to Project’s build-path all jars in the dependencies-folder
  • Eclipse project should build now…

Indexing and Querying from Cygwin’s bash

You can now index and query through Eclipse using the Run interface.

However, if you would like to do this from the command line, you need to work a little more.

I am using Cygwin’s bash. The biggest problem I had was setting the classpath. You need to set it from the Cygwin shell (bash in my case) but in WINDOWS format :( I wrote a little Perl script to do this:

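#!/usr/bin/perl
# BUILD Windows java classpath line for cygwin bash
use strict;
use warnings;
die("ARGUMENTS: files\n") unless @ARGV;
my @l;
foreach my $d (@ARGV) {
    $d =~ s{^/cygdrive/([a-zA-Z])}{$1:};  # /cygdrive/c/... becomes c:/...
    $d =~ s{/}{\\}g;                      # forward slashes become backslashes
    push @l, $d;
}
print join(";", @l);                      # Windows uses ";" between classpath entries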

Then you can set the path in the cygwin bash with:
export CLASSPATH=`perl ./winCLASSPATH.pl /cygdrive/c/your-dependency-folder/*jar /cygdrive/c/your-source-folder/build`

Now you can build your indices and query the engine… for example (remember to use Windows paths):

# BUILD INDEX
java -Xmx512M it.unimi.dsi.mg4j.document.TRECDocumentCollection trec.collection c:/HUGO/DATA/trecFiles*
java -Xmx512M it.unimi.dsi.mg4j.tool.IndexBuilder -Itoken -S trec.collection trec
...

and you are ready to query the engine:

# START ENGINE
rlwrap -H ~/.Query-history java -Xmx512M it.unimi.dsi.mg4j.query.Query -i GenericItem trec-token -h -c trec.collection -v

Now you can use the command line interface or check http://localhost:4242/Query

Unix Installation

No: you are NOT stupid. It took me 4 days of emails with my system administrator and with Vigna and Boldi (the authors of MG4J!) to get this working.

Prerequisites

First, root needed to install these in our system:

  • You need a recent Ant with some extras:
    • install ant version >= 1.7.0 (otherwise it does NOT work, believe me… :(
    • install xml-commons-api
    • reinstall java (in my case xml-commons-api removes it!)
    • install ant-nodeps (in .rpm) or ant-optional (in .deb)
  • (optional) install “rlwrap” to have a nicer command line access to mg4j

Installation:

  • download & untar dependencies (http://mg4j.dsi.unimi.it/mg4j-deps.tar.gz in my case) into a <dependency-directory>
  • download & untar source (http://mg4j.dsi.unimi.it/mg4j-2.1.1-src.tar.gz in my case) into a <source-directory>
  • Go to the top directory in your <source-directory>, you should see a build.properties file.
  • modify build.properties second line: jar.base=<dependency-directory>
  • set your classpath to point to <dependency-directory> (ant does not need it, but javacc does). Here is a nice trick to do this: export CLASSPATH=$(ls -1 <dependency-dir>/*jar | paste -s -d:)
  • run “ant jar” in <source-directory>, where the build.xml file is.
  • I get the following warnings, but it builds!!!
    • [taskdef] Could not load definitions from resource emma_ant.properties. It could not be found.
      [taskdef] Could not load definitions from resource checkstyletask.properties. It could not be found.
      [taskdef] build.xml:165: Warning: taskdef class edu.umd.cs.findbugs.anttask.FindBugsTask cannot be found

Indexing and Querying:

Ok you are almost there :)

With the above you should be able to build an index and run queries on the command line (as described in the Windows section above), if you set your classpath correctly:

export CLASSPATH=$(ls -1d <dependency-dir>/*jar <source-dir>/build | paste -s -d:)
java -Xmx512M it.unimi.dsi.mg4j.document.TRECDocumentCollection trec.collection trecFiles*
java -Xmx512M it.unimi.dsi.mg4j.tool.IndexBuilder -Itoken -S trec.collection trec
rlwrap -H ~/.Query-history java -Xmx512M it.unimi.dsi.mg4j.query.Query -i GenericItem trec-token -h -c trec.collection -v

However, if you try to access the search engine through HTTP, it will probably break with an error about query.velocity not being found. The fix I found is to pass the path to the query.velocity file with -Dit.unimi.dsi.mg4j.query.QueryServlet.template= (the path is relative to a directory in the classpath):

export CLASSPATH=$(ls -1d <dependency-dir>/*jar <source-dir>/build | paste -s -d:)

velocity='../it/unimi/dsi/mg4j/query/query.velocity'
rlwrap -H ~/.Query-history java -Xmx512M -Dit.unimi.dsi.mg4j.query.QueryServlet.template=$velocity it.unimi.dsi.mg4j.query.Query -i GenericItem trec-token -h -c trec.collection -v

2. Indexing Adventures:

Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -1

Most likely the format of your documents does not correspond to that of the DocumentCollection document reader. In my case I was using a TRECDocumentCollection which was missing a tag.
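Before indexing, a quick sanity check of the input files can save a failed run. The sketch below is not MG4J code: it only counts opening and closing tags in a file and reports mismatches. The tag list (DOC, DOCNO) and the class name TrecTagCheck are assumptions made for illustration, so adjust the tags to whatever your collection reader expects. It reads each file fully into memory, so it is only suitable for files of moderate size.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class TrecTagCheck {
    // Assumed tag set (hypothetical): adjust to your document format.
    static final String[] TAGS = { "DOC", "DOCNO" };

    /** Counts non-overlapping occurrences of needle in text. */
    static int count(String text, String needle) {
        int n = 0;
        int i = text.indexOf(needle);
        while (i >= 0) {
            n++;
            i = text.indexOf(needle, i + needle.length());
        }
        return n;
    }

    /** Returns one message per tag whose opening and closing counts differ. */
    static List<String> problems(String text) {
        List<String> out = new ArrayList<>();
        for (String tag : TAGS) {
            int open = count(text, "<" + tag + ">");
            int close = count(text, "</" + tag + ">");
            if (open != close) {
                out.add(tag + ": " + open + " opening vs " + close + " closing");
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        for (String file : args) {
            // ISO-8859-1 maps every byte to a character, so decoding never fails.
            String text = new String(Files.readAllBytes(Paths.get(file)), StandardCharsets.ISO_8859_1);
            System.out.println(file + ": " + problems(text));
        }
    }
}
```

Run it with the same file list you pass to TRECDocumentCollection; an empty list next to a file name means the counted tags are balanced in that file.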

Exception in thread "main" java.util.NoSuchElementException: The key ENCODING cannot be resolved

Add "-p encoding=ISO-8859" (or whatever encoding you prefer) to your DocumentCollection creation line. In my case:


java -Xmx512M it.unimi.dsi.mg4j.document.TRECDocumentCollection -p encoding=ISO-8859 $col.collection $path
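If you are unsure whether an encoding name is one your JVM knows, you can check it before building the collection. This is a minimal sketch that assumes MG4J resolves the name through the standard java.nio.charset machinery, which I have not verified. The class name CharsetCheck is made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

class CharsetCheck {
    /** Returns true if this JVM can resolve the given charset name or alias. */
    static boolean isKnown(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            return false; // the name contains characters that are not legal in a charset name
        }
    }

    public static void main(String[] args) {
        for (String name : args) {
            // Print the canonical name when the JVM recognizes the argument.
            String result = isKnown(name) ? Charset.forName(name).name() : "not supported";
            System.out.println(name + " -> " + result);
        }
    }
}
```

For example, "java CharsetCheck ISO-8859-1 UTF-8" prints the canonical name of each charset the JVM recognizes.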

3. Coding Adventures:

Q: How do I load indices in memory?

(From Roi Blanco :)

Even though there are a few classes with appealing names like “InMemoryIndex” or “MemoryMappedIndex” with getInstance() methods, this is only a disguise – of course, they wouldn’t make it that easy, what were you expecting?

Indeed, you have to call DiskBasedIndex with something like this:

EnumMap map = new EnumMap(UriKeys.class);
// pick one of these two
map.put(UriKeys.INMEMORY, "true");
map.put(UriKeys.MAPPED, "true");
// the next booleans in the method call stand for: randomAccess? loadDocumentSizes? loadTermMapsIntoMemory?
my_index = (BitStreamIndex) DiskBasedIndex.getInstance(new File(pathIndex+"-token").getPath(), true, true, true, map);

Yes, you read that right: if you call InMemoryIndex.getInstance() you will get a disk-based index, while if you call DiskBasedIndex.getInstance() with those parameters (and the EnumMap correctly set) you can end up with an in-memory or memory-mapped index. And no, the memory index classes don’t implement this particular getInstance() method, only DiskBasedIndex does. Hah!

Remarks: it doesn’t really matter what value you assign in the EnumMap (they just check whether UriKeys.INMEMORY has *any* value, not which one), so be careful.
Also, for some reason, when loading quite large indexes into memory (UriKeys.INMEMORY="blah") the performance blows up big time (from 8 ms to 29 s per query) but MG4J goes on anyway. This is probably because of disk swapping, but I’m not 100% sure, so I found it safer to use memory-mapped indexes.
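The "any value counts" pitfall is easy to reproduce with a plain EnumMap. The sketch below does not use MG4J at all: the enum Key is a hypothetical stand-in for UriKeys, and isEnabled only imitates a presence-only check as described above, so treat it as an illustration of the behaviour rather than MG4J's actual code.

```java
import java.util.EnumMap;

class PresenceCheckDemo {
    // Hypothetical stand-in for MG4J's UriKeys, with only the two names used here.
    enum Key { INMEMORY, MAPPED }

    /** Presence-only check: the stored value is never inspected. */
    static boolean isEnabled(EnumMap<Key, String> map, Key key) {
        return map.containsKey(key);
    }

    public static void main(String[] args) {
        EnumMap<Key, String> map = new EnumMap<>(Key.class);
        map.put(Key.INMEMORY, "false"); // looks like "off", but the key is present
        System.out.println(isEnabled(map, Key.INMEMORY)); // prints true
        System.out.println(isEnabled(map, Key.MAPPED));   // prints false
    }
}
```

So to switch an option off, leave the key out of the map entirely instead of setting it to "false".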

Q: How do I get the content of a field as a String?


DocumentCollection documentCollection = (DocumentCollection)BinIO.loadObject("sentences.collection");
Document d = documentCollection.document( documentID ); // or whatever you need to get the document
int fieldIndex = documentCollection.factory().fieldIndex( index.field );

final Reader reader = (Reader)d.content( fieldIndex );
String content = org.apache.commons.io.IOUtils.toString( reader );


Q: How do I use a WordReader on a document?


DocumentCollection documentCollection = (DocumentCollection)BinIO.loadObject("sentences.collection");
Document d = documentCollection.document( documentID ); // or whatever you need to get the document
int fieldIndex = documentCollection.factory().fieldIndex( index.field );
final Reader reader = (Reader)d.content( fieldIndex );
WordReader wr = d.wordReader( fieldIndex ).setReader( reader ); // this first gets the field's WordReader, then applies it to the content reader
MutableString tmp1 = new MutableString(), tmp2 = new MutableString(); // tmp1 receives the word, tmp2 the non-word that follows it
MutableString doc = new MutableString();
while(wr.next(tmp1,tmp2)) {
    doc.append(tmp1);
    doc.append(tmp2);
}
String docString = doc.toString();

Q: How do I get a term string from the term Integer index?

BitStreamIndex bindx = ((BitStreamIndex)index);
ObjectList dictionary = bindx.prefixMap.list();
String termString = dictionary.get(termInteger).toString(); // get() on a raw ObjectList returns Object, hence toString()

Q: How do I escape characters in the query which are confused with MG4J operators?

Use a backslash. In Java I use the group "[\\^|\\/'*()~+-]" to find them, but it may be missing something.
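A minimal sketch of how that group can be used to escape a query string before handing it to MG4J. The class name QueryEscaper is made up for this example, and the character class is copied from the group quoted above with the curly apostrophe read as a plain single quote, which is an assumption. As noted, the list of operator characters may be incomplete.

```java
import java.util.regex.Pattern;

class QueryEscaper {
    // Characters matched: ^ | / ' * ( ) ~ + -
    // Assumption: this set comes from the post and may not cover every MG4J operator.
    private static final Pattern SPECIAL = Pattern.compile("[\\^|\\/'*()~+-]");

    /** Prefixes every matched character with a backslash. */
    static String escape(String query) {
        // In the replacement string, "\\\\" is one literal backslash and "$0" is the matched character.
        return SPECIAL.matcher(query).replaceAll("\\\\$0");
    }

    public static void main(String[] args) {
        System.out.println(escape("(x|y)* +z"));
    }
}
```

Strings without any of these characters pass through unchanged, so it is safe to apply the method to every user-supplied term.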


9 responses to “MG4J Adventures”

  1. iameleone July 13, 2009 at 6:38 am

    Hi I tried building mg4j on netbeans(on windows) and failed, but I hope I can use some of your pointers for eclipse. One clarification though. Does one need cygwin installation for mapping the windows paths into unix paths in order for the ant script to feed on ?

    In your case you used Eclipse+cygwin. The question is do you at all need cygwin in order to build the sources. Great post, keep it up as there are not many mg4j resources.

    • Hugo July 17, 2009 at 5:58 am

      You do not need cygwin at all, just ignore it and use Eclipse. To run the programs, you can run them from Eclipse or any shell, including the Windows command prompt. I just happen to use cygwin as my (bash) shell.

  2. Keshav August 28, 2009 at 3:30 am

    mg4j dependencies has javacc-4.0.jar that needs to be renamed to javacc.jar.

  3. Hugo October 14, 2009 at 7:28 am

    right, thanks Keshav

  4. quark June 16, 2010 at 1:17 pm

    Can u send me some sample programs. I want to create a index on document but not getting how to add fields to the document.

  5. jucimar January 6, 2011 at 2:20 pm

    How do I get all anchor text from one document in MG4J? For example: for docId = 1 I’d like to know the text from each href that exists in the document. Another question is: Is there a way to give the docId and get back all the anchor text from this document? Thanks a lot!

  6. Zhirong Yang January 19, 2012 at 1:42 pm

    I have a document-term matrix. I want to calculate kNN of each document using BM25. How can I do this by using MG4J?
