RealTime Indexes
   Preliminary Description
       August 29, 2001
 
I. Terminology
 
   A data base "Index" is a data structure that is used to isolate 
   the occurrence of one or more "Values" in the data base. For example,
   an index for the data base field STATE would consist of all possible
   values for STATE; for example, "ALABAMA", "ALASKA", "ARKANSAS", etc
   Associated with each value is a "List" of records (or record numbers)
   that contain that value.
 
   The lists might look like this:
      Value      List
      -----      ---
      ALABAMA:   5, 12, 74
      ALASKA:    7
      ARKANSAS:  6, 8, 22, 47, ...
      etc
 
   If a value has exactly one list item, it is called a "unique value";
   e.g., ALASKA has exactly one list item.
 
   Note that a list may be represented either as an array of numbers,
   or as an array of bits.
 
   The number of values contained in an index is called its "uniqueness".
   For example, we would expect the uniqueness of STATE to be 50, assuming
   no mis-spellings.
 
   We also refer to the uniqueness of an index as its "cardinality".
   In some cases, we are particularly interested in indexes with 
   "low-cardinality" (we will define "low" later).
 
   We refer to the relative list size of a value as its "density".
   We refer to the absolute list size of a value as its "population".
   Note that a given index may contain both low-density and high-density
   values.
 
   Note also that we often use the terms density and population interchangably;
   however, this is valid only for large data bases. For small data bases,
   high density does not equate to high population.
 
II. Background
 
   The RealTime Index feature was created to overcome inherent problems
   with the existing indexing methodology. In particular, the simplistic
   two level list structure required frequent and lengthy "spool" operations
   when data was being created or modified. This made the existing 
   methodology unsuitable for real time data acquisition and massaging.
 
III. Description
 
   The RealTime Index feature created two new data structures:
   . RealTime Lists (RTLs) are lists similar to the existing lists 
      superimposed onto a tree structure. This allows fast index creation
      for all density/population configurations; and it allows fast
      index retrieval (i.e., querying) for low to medium density/population
      configurations.
   . RealTime Bitmaps (RTBs) are bitmaps that allow more efficient
      index retrieval for high density/population configurations.
 
   The user must declare in the data base schema which index(s) are to
   be RealTime. The user can optionally specify additional parameters
   to tune the RTI performance.
 
   For each RealTime Index, two additional files are created when the
   data base is created. The file names are the concatenation of the
   data base name and the index name, plus extensions ".rtl" and ".rtb".
 
IV. DBD Schema Statements
 
   A. REALTIME qualifier
        The "REALTIME" qualifier can be added to the "KEY" declaration
        to cause the relevant index to be a RealTime Index.
        For example:
           TD Customers 1000000 records
              CustNo      x(6)  KEY REALTIME
              ...
           TD Sales    10000000 records
              CustNo      x(6)  KEY 
              ...

        Output from the INFO command shows a qualifier "(r)" to
        indicate a RealTime Index.
 
   B. UNIQUE statement
        The UNIQUE statement can and should be used for RealTime Indexes
        in the same way it is used for conventional indexes to optimize
        performance and storage requirements.
 
   C. UNIQUERTL statement
        The UNIQUERTL statement is a new statement that allow the user
        optionally to specify RTL parameters. The syntax is:
           UNIQUERTL  
         specifies the number of empty nodes to be
           created when the data base is created; this is useful primarily
           for calculating disk storage requirements;
         is a number from the set {128, 256, 512, 1024, 2048, 4096}
           that specifies the RTL node size; the default is 4096.
           The high number should be used whenever the average population
           per value is greater than 1000. When the average population
           per value is less than 1000, one of the smaller 
           values should be used.
        For example,
           UNIQUERTL LastName 0 0 256
        uses 256 byte RTL blocks; this is optimum if the average population
        per value is about 64.
 
   D. UNIQUERTB statement
        The UNIQUERTB statement is a new statement that allow the user
        optionally to specify RTB parameters. The syntax is:
           UNIQUERTB    
         specifies the number of empty nodes to be
           created when the data base is created; this is useful primarily
           for calculating disk storage requirements;
         specifies the threshhold, as a percent of the
           total number of records, where an RTL is to be converted
           to an RTB; this feature is not yet implemented;
         specifies which values are to be RTBs; this list
           may be the literall ALL, to specify all values are to be RTBs;
           or it may be a list in the form
              {Value1, Value2, ...}
           The values in the list may be enclosed in single or double
           quotes if required.
        For example,
           UNIQUERTB state 0 0 {TX, NY, CA}
        treats TX, NY and CA as RTBs; the others are treated as RTLs.
           UNIQUERTB state 0 0 ALL
        treats all states as RTBs.
 
V. Commands
 
   A. FIND and MATCH
        The FIND and MATCH commands are syntactically unchanged.
        Functionally they will use the new index structures as needed.
 
   B. STRUCTURE
        The STRUCTURE command is syntactically unchanged; but one
        restriction has been removed. The STRUCTURE/Cn command will
        work on partially structured RealTime Indexes. In addition, 
        the STRUCTURE/P command is much faster.
 
   C. LOAD
        A new form of the LOAD command has been created to allow
        data to be imported directly from memory. The prototype is:
           void ddlLoadMem(CONTROL *ctl,
                    char *OptnString,
                    char *ArgString,
                    char *DataArray,
                    int DataItemSize,
                    long DataItemCount);
        ddlLoadMem() should always be used with the /B option,
        indicating fixed length records with no CRLF.
 
VI. Remaining Features
 
   As of this writing the RTL (Realtime List) and RTB (Realtime Bitmap) 
   capabilities are operational and are producing spectacular results.
 
   However, several planned features are incomplete. They are:
      . Realtime Indexes in ATTACHed data bases;
      . Locating RTL and/or RTB files on different logical units;
 
VII. Optimization
 
   The highest rates for "load & structure" are achieved by batching
   requests and using commands and command options that lend themselves
   to maximum thruput. As would be expected, the overall thruput is 
   directly related to the size of the batch; as the batch size increases
   past a certain threshhold, the rate of improvement decreases and
   approaches a steady state. In addition, the higher the cardinality
   of the data, the higher the steady state batch size becomes.
 
   This is especially true for high values of "Low Cardinality" 
   STRUCTURE/C. For example, STRUCTURE/C of 100,000 records with
   cardinality 10,000 would be, relatively speaking, very slow.
   In this case, it would be better to use STRUCTURE/P.
   We can generalize this concept by saying that when the ratio
       / 
   is less than 100, it is better to use STRUCTURE/P rather than
   STRUCTURE/C.
 
   In general, the technique for optimizing "load and structure" 
   thruput is:
      1) Batch load the data [with ddlLoadMem(), for example];
      2) Batch structure the incremental data;
 
   In particular, consider a data base with three low cardinality keys 
   and one high cardinality key. Consider further that we wish to
   batch process 100,000 records at a time. We could do the following:
      . Create a temporary file containing the 100,000 records;
      . ddlLoad(ctl,"/","TransActions TransActionsRD TransActionsFile.dat");
      . ddlStructure(ctl,"/C100","Field1 Field2 Field3");
      . ddlStructure(ctl,"/","Field4");
 
   Another way to approach the same example is:
      . Batch 100,000 records into a memory array
      . ddlLoadMem(ctl,"/B","TransActions TransActionsRD DummyFileName",...);
      . ddlStructure(ctl,"/C100","Field1 Field2 Field3");
      . ddlStructure(ctl,"/","Field4");
 
VII. Benchmarks
 
   A. Unbatched Insert and Structure in 80,000,000 record data base
      Record size 24 characters (All RTL)
      Three low cardinality keys (10, 10, 1200) and one high cardinality
      (10000) key
      . 100,000 records / 197 sec == 507 records per second
 
   B. Batched Insert and Structure in 80,000,000 record data base
      Record size 24 characters (All RTL)
      Three low cardinality keys (10, 10, 1200) and one high cardinality
      (10000) key; separate STRUCTURE/P for high cardinality key
      . 100,000 records / 11 sec == 9090 records per second
      . 1,000,000 records / 85 sec == 11764 records per second
      . 1,000,000 records / 84 sec == 11904 records per second
 
   C. Batched Insert and Structure in 80,000,000 record data base
      Record size 24 characters (All RTL)
      Three low cardinality keys (10, 10, 1200) and one high cardinality
      (10000) key; single STRUCTURE/C for all 4 keys
      . Load & Structure 1,000,000 records & 4 fields
                                          Load    Structure
         2001.05.23                       00:13   00:51
         2001.05.23                       00:12   00:50
      . Load & Structure 100,000 records & 4 fields
                                          Load    Structure
         2001.05.23                       00:01   00:33
 
   D. Batched Insert and Structure in 80,000,000 record data base
      Record size 24 characters (All RTL)
      Three low cardinality keys (10, 10, 1200) 
      . Load & Structure 100,000 records & 3 fields
                                          Load    Structure
         2001.05.23                       00:01   00:02
         2001.07.16                       00:02   00:03
 
   E. Batched Insert and Structure in 80,000,000 record data base (ships)
      Record size 105 characters (All RTL)
      One low cardinality key (3 chars, cardinality 100)
      . 1,000 records / 4 sec (1,3) == 250   
      . 1,000 records / 1 sec (0,1) == 1000   
      . 10,000 records / 2 sec (1,1) == 5000   
      . 10,000 records / 1 sec (0,1) == 10000   
      . 100,000 records / 7 sec (5,2) == 14285
      . 100,000 records / 5 sec (3,2) == 20000
      . 1,000,000 records / 51 sec (44,7) == 19607
      . 1,000,000 records / 42 sec (36,6) == 23809
      One low cardinality key (6 chars, cardinality 100)
      . 1,000 records / 1 sec (0,1) == 1000   
      . 1,000 records / 2 sec (0,2) == 500   
      . 10,000 records / 1 sec (0,1) == 10000   
      . 10,000 records / 1 sec (1,0) == 10000   
      . 100,000 records / 6 sec (4,2) == 16666
      . 100,000 records / 5 sec (4,1) == 20000
      . 1,000,000 records / 50 sec (43,7) == 20000
      . 1,000,000 records / 42 sec (36,6) == 23809
      One low cardinality key (10 chars, cardinality 100)
      . 1,000 records / 2 sec (0,2) == 500   
      . 1,000 records / 2 sec (0,2) == 500   
      . 10,000 records / 1 sec (0,1) == 10000   
      . 10,000 records / 1 sec (1,0) == 10000   
      . 100,000 records / 6 sec (5,1) == 16666
      . 100,000 records / 5 sec (4,1) == 20000
      . 1,000,000 records / 50 sec (44,6) == 20000
      . 1,000,000 records / 41 sec (35,6) == 24390
      One low cardinality key (15 chars, cardinality 100)
      . 1,000 records / 2 sec (0,2) == 500   
      . 1,000 records / 2 sec (0,2) == 500   
      . 10,000 records / 3 sec (1,2) == 3333   
      . 10,000 records / 2 sec (0,2) == 5000   
      . 100,000 records / 5 sec (4,1) == 20000
      . 100,000 records / 5 sec (3,2) == 20000
      . 1,000,000 records / 50 sec (44,6) == 20000
      . 1,000,000 records / 42 sec (36,6) == 23809
      Load 1,000,000 records with * 35/34
 
   F. Structure 80,000,000 record data base
      Record size 24 characters (Mixed RTL & RTB)
      One low cardinality key (2 chars, cardinality 10)
      . STRUCTURE/C 80,000,000 record RTB   07:17
      . STRUCTURE/C 80,000,000 record RTL   07:47
      . STRUCTURE/C 80,000,000 record RTB   07:26
      . STRUCTURE/C 80,000,000 record RTL   07:38
      . STRUCTURE/C 80,000,000 record RTB   06:52 (MODE TEMP C:\TEMP)
      . STRUCTURE/C 80,000,000 record RTL   07:27 (MODE TEMP C:\TEMP)
 
   G. Find 8,000,000 records in 80,000,000 record data base
      Record size 24 characters (Mixed RTL & RTB)
      One low cardinality key (2 chars, cardinality 10)
      . FIND single RTB                     00:01
      . FIND single RTL                     00:03
      . FIND single RTB                     00:00
      . FIND single RTL                     00:03
      . FIND single RTB                     00:01
      . FIND single RTL                     00:03
      . FIND double (03,05) RTB             00:01
      . FIND double (03,05) RTL             00:06
      . FIND double (03,05) RTB             00:01
      . FIND double (03,05) RTL             00:07
      . FIND double (03 OR 05) RTB          00:01
      . FIND double (03 OR 05) RTL          00:07
      . FIND double (03 OR 05) RTB          00:03
      . FIND double (03 OR 05) RTL          00:08
      . FIND double (03 OR 05) RTB          00:03
      . FIND double (03 OR 05) RTL          00:07
 
   H. Batched Insert and Structure in 80,000,000 record data base
      Record size 24 characters (One RTL, One RTB)
      . Load & Structure 100,000 records & 2 fields
                                          Load    Struct RTB  Struct RTL
         2001.08.01                 (4)   00:01   00:01       00:02
         2001.08.01                 (4)   00:01   00:01       00:02
      . Load & Structure 1,000,000 records & 2 fields
                                          Load    Struct RTB  Struct RTL
         2001.08.01                 (22)  00:13   00:05       00:04
         2001.08.01                 (21)  00:13   00:04       00:04
         2001.08.01                 (21)  00:12   00:04       00:05