Indri Indexer Help
Getting Started
IndexUI is an application that simplifies the process
of indexing a collection of data files for use with the RetUI retrieval
application. You specify an index name and at
least one data file to
index in the appropriate fields of the IndexUI's main panel. For all
other parameters, the defaults supplied here should be appropriate for
most applications. Review these settings (described below) and click
the
Build Index button to build your index.
Menus
IndexUI has two menus, described below.
File Menu
- Quit
- Exits the Indri IndexUI. This item is disabled
when an index is being built. It is still possible to close the window
with the title bar control. If you do close the window while an index
is being built, the build will fail and you will have to manually remove
the partially built index files.
Help Menu
- Help
- Opens a window to view this help file. Any
hyperlinks in this file will be opened in a separate, second
window.
- About
- Displays the About Dialog.
Indexing Tab
On this tab you select the values necessary to construct your
index. You are required to select a name for your index, and at least
one
input data file or directory. All other parameters have default
values. Each is described below.
- Index Name: base name for
the
indri repository directory. This can be typed in directly (as an
absolute or relative pathname), or you can use the Browse button to
navigate to the directory where you want your index, and type the
basename into the file chooser dialog. This directory will be created
if it does not exist.
- Data File(s): Use the Browse
button
to navigate to a directory and select files or directories. To remove
unwanted items from the list, highlight the item (right click with the
mouse) and use the Remove button to remove the entry. Multiple entries
can be selected by using either the Shift key (to select a contiguous
range) or the Control key (to select discontiguous elements) while
right
clicking with the mouse.
- Document Format:
- trecweb
- For TREC web formatted documents. This is the default
document format.
- trectext
- For standard TREC formatted documents.
- txt
- For plain text.
- html
- For web pages.
- doc
- For Microsoft Word documents (Windows only).
- ppt
- For Microsoft PowerPoint documents (Windows only).
- pdf
- For Adobe PDF documents.
- Filename filter: If a filter is supplied, it is used
for
the file selection dialog box. Additionally, if the filter is supplied
and an entry is a directory, all files in that directory that match the
filter are added to the list of data files. The filter should be a
command line filename wildcard pattern, e.g. *.sgml or data?.txt. In
the
pattern, '*' expands into 0 or more characters (including .) and '?'
expands into 0 or 1 character (including .).
- Recurse into subdirectories: if selected and an
entry in
the data files list is a directory, we recurse into each subdirectory,
again applying any supplied filename filter. If the filter is empty,
all
files in a directory are added to the list of data files for
indexing.
- Collection Fields: Enter a comma delimited list of
field names to index as metadata, without spaces between the
field names (for example, docno,title). Fields
that are not indexed as metadata can still be retrieved (e.g. for title
display in the retrieval UI), but indexing them as metadata speeds up
that retrieval.
- Indexed Fields: Enter a comma delimited list of field
names, without
spaces between the field names, to index as data, such as title,
heading (h1, h2, h3, and h4 html tags), headline (newswire head,
hl, headline), etc. for use in queries. Fields that are not
indexed will not be available for use in queries.
- Memory Limit: Select how much memory to use while
indexing. A rule of thumb is no more than 3/4 of your physical memory
(default 128MB).
- Stopword list: Enter the name of the file containing the
stopword list you wish to use or use the Browse button to navigate to
the file. The file should contain the list of stopwords that you wish to
use, one word per line.
- Stem Collection: Select this check box to enable
stemming
(conflation of morphological variants) for the index (default on).
- Stemmer:
- krovetz
- Krovetz stemmer (default).
- porter
- Porter stemmer.
- Build Index: Enabled when an index
name and at least one data file
have been
entered. Builds the index, displaying status messages in the Status Messages tab, which is
brought to the front
when indexing is started. You should not close the application while an
index is being built. If you do, partially built index files will be
left behind.
- Quit: Exits the IndexUI. This item is disabled
when an index is being built. It is still possible to close the window
with the title bar control. If you do close the window while an index
is
being built, the build will fail and you will have to manually remove
the partially built index files.
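The way the Data File(s), Filename filter, and Recurse into subdirectories
settings combine can be illustrated with a short Python sketch. This is not
the IndexUI implementation; it only follows the behavior described above.
The function names filter_to_regex and expand_entries are hypothetical, and
the assumption that explicitly selected files bypass the filter is ours.

```python
import os
import re


def filter_to_regex(pattern):
    """Translate a filename filter into a compiled regular expression.

    Follows the semantics described in this help file: '*' stands for
    0 or more characters and '?' for 0 or 1 character (both may match '.').
    Every other character is matched literally.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".?")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def expand_entries(entries, pattern="", recurse=False):
    """Hypothetical helper: expand files and directories into a flat,
    sorted list of data files. An empty pattern accepts every file."""
    regex = filter_to_regex(pattern) if pattern else None
    selected = []
    for entry in entries:
        if not os.path.isdir(entry):
            # Assumption: explicitly selected files are kept as-is.
            selected.append(entry)
            continue
        for root, dirs, files in os.walk(entry):
            if not recurse:
                dirs[:] = []  # stop os.walk from descending further
            for name in files:
                if regex is None or regex.fullmatch(name):
                    selected.append(os.path.join(root, name))
    return sorted(selected)
```

For example, expand_entries(["corpus"], "*.sgml", recurse=True), where
"corpus" is a placeholder directory name, would collect every file ending
in .sgml under that directory and its subdirectories. Note that the '?'
handling here follows this help file (0 or 1 character), which differs from
most shells, where '?' matches exactly one character.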
Status Messages Tab
This tab provides status messages from the indexing process. It
is
updated as the process runs and should scroll to (or near) the bottom
as
messages are added.