De siesta en siesta
hugo’s blog
MG4J Adventures
Some of my findings in the MG4J world, mostly guided by Paolo Boldi and Sebastiano Vigna, who are too wise and busy to write such silly things down…
- Installation Adventures
- Indexing Adventures
- Coding Adventures
1. Installation Adventures:
Spent some time installing MG4J on both Windows and Linux. Here are my findings (thanks Vigna and Boldi for all the help):
Eclipse – Cygwin Installation for Windows:
Prerequisites:
- optional: install rlwrap for cygwin
- Download dependencies http://mg4j.dsi.unimi.it/mg4j-deps.tar.gz and untar to a dependencies-folder
- Download source http://mg4j.dsi.unimi.it/mg4j-2.1.1-src.tar.gz and untar to sources-folder
Installation
- Create an Eclipse project from “existing Ant BuildFile” and choose the build.xml file in the sources-folder
- Add to Project’s build-path all jars in the dependencies-folder
- Eclipse project should build now…
Indexing and Querying from Cygwin’s bash
You can now index and query through Eclipse using the Run interface.
However, if you would like to do this from the command line, you need to work a little more.
I am using Cygwin’s bash. The biggest problem I had was setting the classpath. You need to set it from the cygwin shell (bash in my case) but in WINDOWS format :( I wrote a little perl script to do this:
# BUILD Windows java classpath line for cygwin bash
die ("ARGUMENTS: files") unless (scalar @ARGV>0);
@l=();
foreach $d (@ARGV){
$d =~ s/\/cygdrive\/c/c:/;
$d =~ s/\//\\/g;
push(@l,$d);
}
print(join(";",@l));
Then you can set the path in the cygwin bash with:
export CLASSPATH=`perl ./winCLASSPATH.pl /cygdrive/c/your-dependency-folder/*jar /cygdrive/c/your-source-folder/build`
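If you have no perl at hand, the same conversion can be sketched in Java. This is only a rough equivalent of the script above, not something that ships with MG4J: the class name WinClasspath and the method toWindowsClasspath are made up for this example, and unlike the perl version it accepts any single drive letter, not just c.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of MG4J): builds a Windows-style classpath
// from Cygwin-style paths, mirroring the perl script above.
final class WinClasspath {

    static String toWindowsClasspath(String... cygwinPaths) {
        List<String> converted = new ArrayList<>();
        for (String p : cygwinPaths) {
            // /cygdrive/c/foo -> c:/foo (assumption: any single drive letter is accepted)
            String w = p.replaceFirst("^/cygdrive/([A-Za-z])(?=/|$)", "$1:");
            // forward slashes -> backslashes
            converted.add(w.replace('/', '\\'));
        }
        // Windows separates classpath entries with ';'
        return String.join(";", converted);
    }

    public static void main(String[] args) {
        System.out.println(toWindowsClasspath(args));
    }
}
```

It is called the same way as the perl script, with the jar files and the build directory as arguments, and the printed line goes into CLASSPATH.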
Now you can build your indices and query the engine… for example (remember to use Windows paths):
# BUILD INDEX
java -Xmx512M it.unimi.dsi.mg4j.document.TRECDocumentCollection trec.collection c:/HUGO/DATA/trecFiles*
java -Xmx512M it.unimi.dsi.mg4j.tool.IndexBuilder -Itoken -S trec.collection trec
...
and you are ready to query the engine:
# START ENGINE
rlwrap -H ~/.Query-history java -Xmx512M it.unimi.dsi.mg4j.query.Query -i GenericItem trec-token -h -c trec.collection -v
Now you can use the command line interface or check http://localhost:4242/Query
Unix Installation
No: you are NOT stupid. It took me 4 days of emails with my system administrator and with Vigna and Boldi (the authors of MG4J!) to get this working.
Prerequisites
First, root needed to install these in our system:
- You need a recent Ant with some extras:
- install ant version >= 1.7.0 (otherwise it does NOT work, believe me… :(
- install xml-commons-api
- reinstall java (in my case xml-commons-api removes it!)
- install ant-nodeps (in .rpm) or ant-ant-optional (in .deb)
- (optional) install “rlwrap” to have a nicer command line access to mg4j
Installation:
- Download & untar dependencies (http://mg4j.dsi.unimi.it/mg4j-deps.tar.gz in my case) into a <dependency-directory>
- Download & untar source (http://mg4j.dsi.unimi.it/mg4j-2.1.1-src.tar.gz in my case) into a <source-directory>
- Go to the top directory in your <source-directory>; you should see a build.properties file.
- Modify the second line of build.properties: jar.base=<dependency-directory>
- Set your classpath to point to the dependency directory (ant does not need it, but javacc does). Here is a nice trick to do this: export CLASSPATH=$(ls -1 <dependency-dir>/*jar | paste -s -d:)
- Run “ant jar” in the source directory, where the build.xml file is.
- I get the following warnings, but it builds!!!
[taskdef] Could not load definitions from resource emma_ant.properties. It could not be found.
[taskdef] Could not load definitions from resource checkstyletask.properties. It could not be found.
[taskdef] build.xml:165: Warning: taskdef class edu.umd.cs.findbugs.anttask.FindBugsTask cannot be found
Indexing and Querying:
Ok you are almost there :)
With the above you should be able to build an index and run queries on the command line (as described in the Windows section above), provided you set your classpath correctly:
# -d makes ls print the build directory itself instead of listing its contents
export CLASSPATH=$(ls -1d <dependency-dir>/*jar <source-dir>/build | paste -s -d:)
java -Xmx512M it.unimi.dsi.mg4j.document.TRECDocumentCollection trec.collection trecFiles*
java -Xmx512M it.unimi.dsi.mg4j.tool.IndexBuilder -Itoken -S trec.collection trec
rlwrap -H ~/.Query-history java -Xmx512M it.unimi.dsi.mg4j.query.Query -i GenericItem trec-token -h -c trec.collection -v
However, if you try to access the search engine through HTTP, it will probably break with an error about query.velocity not being found. The fix I found is to pass the path to the query.velocity file with -Dit.unimi.dsi.mg4j.query.QueryServlet.template= (the path is relative to a directory in the classpath):
export CLASSPATH=$(ls -1d <dependency-dir>/*jar <source-dir>/build | paste -s -d:)
velocity='../it/unimi/dsi/mg4j/query/query.velocity'
rlwrap -H ~/.Query-history java -Xmx512M -Dit.unimi.dsi.mg4j.query.QueryServlet.template=$velocity it.unimi.dsi.mg4j.query.Query -i GenericItem trec-token -h -c trec.collection -v
2. Indexing Adventures:
Exception in thread “main” java.lang.StringIndexOutOfBoundsException: String index out of range: -1
Most likely the format of your documents does not correspond to that of the DocumentCollection document reader. In my case I was using a TRECDocumentCollection which was missing a tag.
Exception in thread “main” java.util.NoSuchElementException: The key ENCODING cannot be resolved
Add “-p encoding=ISO-8859” (or whatever encoding you prefer; note that the standard Java name for Latin-1 is ISO-8859-1) to your document collection creation line. In my case:
java -Xmx512M it.unimi.dsi.mg4j.document.TRECDocumentCollection -p encoding=ISO-8859 $col.collection $path
3. Coding Adventures:
Q: How do I load indices in memory?
(From Roi Blanco :)
Even though there are a few classes with appealing names like “InMemoryIndex” or “MemoryMappedIndex” with getInstance() methods, this is only a disguise – of course, they wouldn’t make it that easy, what were you expecting?
Indeed, you have to call DiskBasedIndex with something like this:
EnumMap<UriKeys, String> map = new EnumMap<UriKeys, String>(UriKeys.class);
// pick one of these two
map.put(UriKeys.INMEMORY, "true");
map.put(UriKeys.MAPPED, "true");
// the next booleans in the method call stand for: randomAccess? loadDocumentSizes? loadTermMapsIntoMemory?
my_index = (BitStreamIndex) DiskBasedIndex.getInstance(new File(pathIndex+"-token").getPath(), true, true, true, map);
Yes, you read it right: if you call InMemoryIndex.getInstance() you will get a disk-based index, and if you call DiskBasedIndex with those parameters (and the EnumMap correctly set) you can end up with an in-memory or memory-mapped index. And no, the memory indexes don’t implement this particular getInstance() method, only DiskBasedIndex. hah!
Remarks: it doesn’t really matter what value you assign in the EnumMap (they just check whether UriKeys.INMEMORY has *any* value, not which), so be careful.
For some reason, when loading quite large indexes into memory (UriKeys.INMEMORY="blah") the performance blows up big time (from 8 ms to 29 secs per query) but MG4J goes on anyway. This is probably because of disk swapping, but I’m not 100% sure, so I found it safer to use memory-mapped indexes.
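To make the “any value counts” remark concrete, here is a tiny self-contained sketch. It does not use MG4J at all: the enum Key is a made-up stand-in for UriKeys, and isSet is only my guess at the kind of presence check described above, not code taken from the library.

```java
import java.util.EnumMap;

// Illustration only: Key is a hypothetical stand-in for MG4J's UriKeys,
// and isSet() mimics a presence-only check as described in the remark above.
final class EnumMapPresenceDemo {

    enum Key { INMEMORY, MAPPED }

    // Looks only at whether the key is present, never at the value stored under it.
    static boolean isSet(EnumMap<Key, String> map, Key key) {
        return map.containsKey(key);
    }

    public static void main(String[] args) {
        EnumMap<Key, String> map = new EnumMap<>(Key.class);
        map.put(Key.INMEMORY, "false"); // the value says "false", but the key is present
        System.out.println("INMEMORY set? " + isSet(map, Key.INMEMORY));
        System.out.println("MAPPED set?   " + isSet(map, Key.MAPPED));
    }
}
```

If the library really checks this way, the practical consequence is that you switch an option off by leaving its key out of the map, not by storing "false" under it.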
Q: How do I get the content of a field as a String?
DocumentCollection documentCollection = (DocumentCollection)BinIO.loadObject("sentences.collection");
Document d = documentCollection.document( documentID ); // or whatever you need to get the document
int fieldIndex = documentCollection.factory().fieldIndex( index.field );
final Reader reader = (Reader)d.content( fieldIndex );
String content = org.apache.commons.io.IOUtils.toString( reader );
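The last line above relies on Apache commons-io. If that jar is not on your classpath, draining the Reader by hand is easy enough. A minimal sketch using only the JDK follows; the class name ReaderToString and the method readAll are invented for this example, and the StringReader in main only stands in for the Reader you would get from d.content( fieldIndex ).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper: reads everything from a Reader into a String
// without needing commons-io.
final class ReaderToString {

    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        try {
            int n;
            // read() returns -1 once the end of the stream is reached
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            // rethrow unchecked so callers need no throws clause
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the Reader returned by the document's content
        System.out.println(readAll(new StringReader("some field content")));
    }
}
```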
Q: How do I use a WordReader on a document?
DocumentCollection documentCollection = (DocumentCollection)BinIO.loadObject("sentences.collection");
Document d = documentCollection.document( documentID ); // or whatever you need to get the document
final Reader reader = (Reader)d.content( fieldIndex );
WordReader wr = d.wordReader( fieldIndex ).setReader( reader ); // this first gets a WordReader class, then applies it to the content reader.
MutableString tmp1 = new MutableString(), tmp2= new MutableString();
MutableString doc = new MutableString();
while(wr.next(tmp1,tmp2)) {
doc.append(tmp1);
doc.append(tmp2);
}
String docString=doc.toString();
Q: How do I get a term string from the term Integer index?
BitStreamIndex bindx = ((BitStreamIndex)index);
ObjectList dictionary = bindx.prefixMap.list();
String termString = dictionary.get(termInteger).toString(); // get() on the raw list returns an Object, so convert explicitly
Q: How do I escape characters in the query which are confused with MG4J operators?
Use a backslash. In Java I use the character class “[\\^|\\/'*()~+-]” to find them, but it may be missing something.
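Here is roughly how that can look in code. It is a sketch, not a complete escaper: the class name QueryEscaper and the method escape are invented for this example, the character class is exactly the one quoted above (so it inherits whatever that one misses), and a backslash already present in the input is left alone.

```java
// Hypothetical helper: puts a backslash in front of every character
// matched by the character class quoted in the post.
final class QueryEscaper {

    // Same character class as above; it may not cover every MG4J operator.
    private static final String SPECIAL = "[\\^|\\/'*()~+-]";

    static String escape(String rawQuery) {
        // The replacement "\\\\$0" means: one literal backslash, then the matched character.
        return rawQuery.replaceAll(SPECIAL, "\\\\$0");
    }

    public static void main(String[] args) {
        System.out.println(escape("(foo|bar) + baz"));
    }
}
```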
Hi, I tried building MG4J on NetBeans (on Windows) and failed, but I hope I can use some of your pointers for Eclipse. One clarification though: does one need a cygwin installation to map the Windows paths into Unix paths for the ant script to feed on?
In your case you used Eclipse+cygwin. The question is whether you need cygwin at all in order to build the sources. Great post, keep it up as there are not many MG4J resources.
You do not need cygwin at all, just ignore it and use Eclipse. To run the programs, you can run them from Eclipse or any shell, including the Windows command prompt. I just happen to use cygwin as my (bash) shell.
mg4j dependencies has javacc-4.0.jar that needs to be renamed to javacc.jar.
right, thanks Keshav
Can you send me some sample programs? I want to create an index on a document but I don’t understand how to add fields to the document.
Have a look at the mg4j “book”: http://mg4j.dsi.unimi.it/man/manual.pdf
How do I get all the anchor text from one document in MG4J? For example, for docId = 1 I’d like to know the text of each href that exists in the document. Another question: is there a way to give the docId and get back all the anchor text from this document? Thanks a lot!
I have a document-term matrix. I want to calculate the kNN of each document using BM25. How can I do this using MG4J?