Building and Maintaining Text Search Indexes
Version 0.6
June 8, 2006

I. Text Search Command Line Build Utility

The Text Search Command Line Build Utility performs rudimentary Text Search Index maintenance functions. It is called DocUtil.exe. The first command line parameter specifies the function to be performed. This parameter is followed by additional parameters as required by the specific function.

The five function parameters are:

    PARSE   - parse a specified bucket file
    STRUCT  - build specified index structures
    FRESH   - reset one or more specified index structures
    UPDATE  - modify an index structure
    UPDATEF - modify an index structure from file

The first three functions, PARSE/STRUCT/FRESH, are robust enough for production mode use. The UPDATE function is designed primarily for testing; its use in production mode would be extremely limited.

The specific syntax for each function is:

PARSE []

    -WN         - base word with no score pooling
    -WS         - base word with WordNet Stem score pooling
    -WY         - base word with WordNet Synonym score pooling
    -DS         - derivative word from WordNet Stem
    -DY         - derivative word from WordNet Synonym
    -DX         - derivative word from Soundex
    -DM         - derivative word from MetaPhone
    -PMn        - emit prox pairs from MetaPhone derivatives; n = {1, 2, 3, 4, 5}
    -PRn        - emit prox pairs from raw words
    -PPn        - emit prox pairs from Porter Stem derivatives
    -PPMn       - emit prox pairs from Porter + MetaPhone derivatives
    -PSn        - emit prox pairs from WordNet Stem derivatives
    -Udbname    - sets the WriteUrlFlag; ref TextSearchAPI.txt DocParseSetWriteUrlFlag() for further information
    -Ttablename - table name for WriteUrlFlag; required with -U option

    [For the current project we will use -WN -DS -DY -DM -PPM2.]
    - full path-name of input bucket file
    - full path-name of output meta file; this file is input to the STRUCT function
    - full path-name of the output info file; this file contains updated passthru tags with additional meta information; optionally, this parameter can be "null" to suppress output

    There can be multiple options in a command. However, there can be only one -PXn option.

    For all options except WN, the option token is used to modify the respective output file names. For example, for mode DY, output file "OutMeta.dat" becomes "OutMeta.DY.dat".

STRUCT

    - full path-name of the target database
    - name of the input meta file [meta output from PARSE]; there can be multiple input meta files concatenated with '+' signs, but there can be no extraneous space characters before or after the '+' signs
    RAW - RANK - PROXn; n = {1, 2, 3, 4, 5}
    - name of the target index to be created

FRESH ...

    - full path-name of the target database
    - name of the target index to be reset

UPDATE [...]
UPDATEF [...]

    - full path-name of the target database
    - the table containing the object indexes
    - the record whose indexes are being updated

    -PMn  - update prox pairs from MetaPhone derivatives; n = {1, 2, 3, 4, 5}
    -PRn  - update prox pairs from raw words
    -PPn  - update prox pairs from Porter Stem derivatives
    -PPMn - update prox pairs from Porter + MetaPhone derivatives
    -PSn  - update prox pairs from WordNet Stem derivatives

    - WNraw  - Pool None Raw
      WNrank - Pool None Rank
      WSraw  - Pool Stem Raw
      WSrank - Pool Stem Rank
      WYraw  - Pool Synonym Raw
      WYrank - Pool Synonym Rank
      DSraw  - Derived Stem Raw
      DSrank - Derived Stem Rank
      DYraw  - Derived Synonym Raw
      DYrank - Derived Synonym Rank
      DXraw  - Derived Soundex Raw
      DXrank - Derived Soundex Rank
      DMraw  - Derived MetaPhone Raw
      DMrank - Derived MetaPhone Rank
      PROXn; n = {1, 2, 3, 4, 5}
    - the name of the object index [There can be multiple pairs]
    - old value(s) enclosed in "s
    - new value(s) enclosed in "s
    - file containing tag & document for old value
    - file containing tag & document for new value

    [For compound indexes, the old and new values must include the respective values of all columns that contribute to the compound index. For example, if you have a compound index called "PersonalInfo" that consists of columns Name, Address and City, and Address changes, the old and new values must contain not just the Address info but also the Name and City info, even if they have not changed.]

    [If one or more PROXn modes are present, there must be a -PXn option present.]

For review, the possible types of Text Indexes are:

    . Simple Text - a single database column
    . Compound Text - multiple database columns mapped to a single index
    . DB Object [Memo field, etc] - a single DB object
    . Compound DB Object - multiple DB Objects mapped to a single index
    . DB doc - a DB record pointing to Doc
    . Compound DB doc - multiple DB docs mapped to a single index
    . Free standing doc - a document for which we supply a DB table
    . URL

II. Creating a Text Search Database

Text Search Indexes can co-exist with other database data types. However, for this discussion, we focus on Text Search Indexes.

In a Thunderbolt database schema, a Text Search Index is declared as a keyed field with a special qualifier. In particular:

    X(16) KEY VIRTUAL

or,

    X(16) KEY EXTERNAL

At present, the field length 16 is a constant in the DocUtil utility and cannot be changed. For an Index that includes ranking information, the pre-defined length is 17.

III. Building Text Search Indexes

These are the steps required to build a Text Search Index:

A. Extract the object data to a bucket file. Each "document" in the bucket file must be prefaced by a passthru tag. At present, the only essential information for the passthru tag is the "dbinfo" attribute. The tricky part is that the "recno" field within the "dbinfo" attribute must contain the TB recno where this item is going to reside.

B. For each completed bucket file, the next step is to perform the "DocUtil PARSE" function.
   At present, the command will be:

       DocUtil PARSE -WN -DM -DS -DY -PPM2 NULL

   For each bucket file, this will produce five meta files:

       For -WN,
       For -DM,
       For -DS,
       For -DY,
       For -PPM,

   These files are used in the next step. At present, there is no requirement to preserve these files permanently.

C. For each parsed bucket file, the next step is to build the initial structure. This presupposes that the target TB database exists and that the requisite records have been created.

   For each parsed bucket file, the commands for building the RAW indexes will be:

       DocUtil STRUCT RAW
       DocUtil STRUCT RAW
       DocUtil STRUCT RAW
       DocUtil STRUCT RAW

   The commands for building the RANK indexes will be:

       DocUtil STRUCT RANK
       DocUtil STRUCT RANK
       DocUtil STRUCT RANK
       DocUtil STRUCT RANK

   The commands for building the PROX indexes will be:

       DocUtil STRUCT PROX1
       DocUtil STRUCT PROX2

IV. Updating Text Search Indexes

These are the steps required to update a Text Search Index:

A. Index update consists of two steps: 1) deleting the existing index values for the item; 2) inserting the new index values for the item. For simple indexes, this is a straightforward operation. For Text Indexes, it is a bit more complicated, because a Text Index can contain multiple values for a single item. For efficiency, instead of deleting all existing values and inserting all new values for an item, we perform a logical calculation to determine the net change.

B. For a compound index, the update process requires the old and new values of all constituents of the index, whether they have changed or not. The Text Search Engine is designed to allow these values to be supplied either as memory image text objects, or as memory image or file image meta data objects. However, at this time only the first of these interfaces (i.e., memory image text) exists.

C. The DocUtil UPDATE command is intended primarily for testing. Its use in a production mode environment is not recommended.

V. Registry Settings

This application uses the WordNet subsystem and the WordNet Dictionary.

Subsequent to 2005.02.12, the folder containing the WordNet Dictionary files is specified by the registry key:

    HKEY_LOCAL_MACHINE\Software\WhamTech\TextSearch\WordNet

Prior to 2005.02.12, the folder containing the WordNet Dictionary files was specified by the environment variable:

    WNSEARCHDIR
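
To check which WordNet Dictionary folder a machine is configured with, a small script can look in both places. The following is a minimal sketch in Python, not part of DocUtil. It rests on two assumptions that this document does not confirm: that the folder is stored in the default (unnamed) value of the registry key, and that falling back to WNSEARCHDIR is acceptable when the key is absent. The function name find_wordnet_dir and its value_name parameter are hypothetical.

```python
import os

try:
    import winreg  # standard library, available on Windows only
except ImportError:
    winreg = None

WORDNET_KEY = r"Software\WhamTech\TextSearch\WordNet"
LEGACY_ENV_VAR = "WNSEARCHDIR"


def find_wordnet_dir(value_name=""):
    """Return the configured WordNet Dictionary folder, or None.

    value_name is an assumption: "" reads the key's default (unnamed)
    value. Pass a different name if the folder is stored in a named value.
    """
    if winreg is not None:
        try:
            with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, WORDNET_KEY) as key:
                folder, _value_type = winreg.QueryValueEx(key, value_name)
                if folder:
                    return str(folder)
        except OSError:
            pass  # key or value not present; try the older setting
    # Older installations (prior to 2005.02.12) used the environment variable.
    # Using it as a fallback is a choice of this sketch, not DocUtil behavior.
    return os.environ.get(LEGACY_ENV_VAR) or None


if __name__ == "__main__":
    print(find_wordnet_dir())
```

On a machine without the registry key, the function returns the value of WNSEARCHDIR if it is set, and None otherwise.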