Monthly Archives: November 2010
Pre-processing a text file in Impure
Here is a simple way to pre-process a text file in Impure using the ReplaceSubstrings control.
I wanted to analyse the book “Alice in Wonderland” using Impure’s linguistic analysis tools. The text of the book is available from Project Gutenberg in .txt format at http://www.gutenberg.org/files/11/11.txt
Loading this file into Impure and visualising it is very easy with a FileLoader and a SimpleStringVisualizator:
The SimpleStringVisualizator goes a little mad when the file to be visualized is too large, so for debugging purposes I decided to work with only a small part of the text, which I obtained using the getSubstring control:
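For readers following along outside Impure, the same "load the file, keep only a small excerpt" step can be sketched in Python. This is only an illustration, not what Impure does internally: the function name `load_excerpt`, its parameters, and the UTF-8 encoding are my own assumptions.

```python
from pathlib import Path


def load_excerpt(path, start=0, length=2000, encoding="utf-8"):
    """Read a text file and return a small slice of it for debugging.

    Hypothetical helper: the name, defaults and encoding are assumptions.
    Note that Python's text mode converts \r\n to \n while reading,
    so positions are counted after that conversion.
    """
    text = Path(path).read_text(encoding=encoding)
    return text[start:start + length]
```

Calling `load_excerpt("11.txt", 0, 3000)` on a local copy of the file would return roughly the first 3000 characters, which is enough to check the pre-processing steps without overwhelming a viewer.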
OK, now we can see the file. But if you look closely at the lines, you will notice that paragraphs are broken by line breaks. More specifically, in this file lines are separated by line breaks (actually \r\n), as in the original printed book, and paragraphs are separated by double line breaks (\r\n\r\n). Impure does a good job of converting all kinds of line breaks into Impure line breaks, so we don't need to worry about the line break encodings. However, line breaks are problematic because one sentence can be broken across two or more lines, which may confuse the linguistic analyser.
What we need to do is to reconnect the sentences. This could be done simply by replacing line breaks with spaces using a replaceSubstring control. However, this would eliminate paragraphs. If we want to preserve paragraph information, we need to:
- replace paragraph breaks (double line breaks) with a temporary marker,
- replace single line breaks with spaces (this puts sentences back together),
- replace the temporary marker with a single line break, recovering the paragraphs.
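The three steps above translate directly into ordinary string replacements. Here is a minimal Python sketch of the same idea, not the Impure patch itself: the function name `rejoin_paragraphs` and the marker string are hypothetical, and the marker just needs to be something that never occurs in the text.

```python
def rejoin_paragraphs(text, marker="\x00PARA\x00"):
    """Join hard-wrapped lines into one line per paragraph.

    Hypothetical helper; the marker value is an arbitrary assumption.
    """
    # Normalise \r\n and lone \r to \n (Impure does this on its own).
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    if marker in text:
        raise ValueError("marker already occurs in the text")
    # Step 1: protect paragraph breaks with a temporary marker.
    text = text.replace("\n\n", marker)
    # Step 2: remaining line breaks sit inside paragraphs; turn them into spaces.
    text = text.replace("\n", " ")
    # Step 3: turn the marker back into a single line break.
    return text.replace(marker, "\n")


sample = (
    "Alice was beginning to get very tired\r\n"
    "of sitting by her sister on the bank.\r\n"
    "\r\n"
    "So she was considering in her own mind,\r\n"
    "as well as she could."
)
result = rejoin_paragraphs(sample)
```

The order matters: if single line breaks were replaced first, the double line breaks would turn into two spaces and the paragraph boundaries would be lost.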
We now have a string where each line break indicates a paragraph, so it is very easy to pick a few paragraphs for linguistic processing. To select a single paragraph, use splitString to split the string by line break and getElementFromList to pick the desired paragraph:
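In Python terms, this selection amounts to a split followed by an index lookup. The sketch below assumes the text already has one paragraph per line; `get_paragraph` is a hypothetical name, and skipping empty entries is my own choice rather than something the Impure controls are known to do.

```python
def get_paragraph(text, index):
    """Return the paragraph at position `index` (0-based).

    Hypothetical helper: assumes one paragraph per line in `text`.
    """
    # Split on line breaks and drop entries that are empty or only spaces.
    paragraphs = [p for p in text.split("\n") if p.strip()]
    return paragraphs[index]


cleaned = "First paragraph.\nSecond paragraph.\nThird paragraph."
second = get_paragraph(cleaned, 1)
```

Picking several paragraphs at once is then just a slice of the same list instead of a single index.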