RealTime Indexes Preliminary Description August 29, 2001 I. Terminology A data base "Index" is a data structure that is used to isolate the occurrence of one or more "Values" in the data base. For example, an index for the data base field STATE would consist of all possible values for STATE; for example, "ALABAMA", "ALASKA", "ARKANSAS", etc Associated with each value is a "List" of records (or record numbers) that contain that value. The lists might look like this: Value List ----- --- ALABAMA: 5, 12, 74 ALASKA: 7 ARKANSAS: 6, 8, 22, 47, ... etc If a value has exactly one list item, it is called a "unique value"; e.g., ALASKA has exactly one list item. Note that a list may be represented either as an array of numbers, or as an array of bits. The number of values contained in an index is called its "uniqueness". For example, we would expect the uniqueness of STATE to be 50, assuming no mis-spellings. We also refer to the uniqueness of an index as its "cardinality". In some cases, we are particularly interested in indexes with "low-cardinality" (we will define "low" later). We refer to the relative list size of a value as its "density". We refer to the absolute list size of a value as its "population". Note that a given index may contain both low-density and high-density values. Note also that we often use the terms density and population interchangably; however, this is valid only for large data bases. For small data bases, high density does not equate to high population. II. Background The RealTime Index feature was created to overcome inherent problems with the existing indexing methodology. In particular, the simplistic two level list structure required frequent and lengthy "spool" operations when data was being created or modified. This made the existing methodology unsuitable for real time data acquisition and massaging. III. Description The RealTime Index feature created two new data structures: . RealTime Lists (RTLs) are lists similar to the existing lists superimposed onto a tree structure. This allows fast index creation for all density/population configurations; and it allows fast index retrieval (i.e., querying) for low to medium density/population configurations. . RealTime Bitmaps (RTBs) are bitmaps that allow more efficient index retrieval for high density/population configurations. The user must declare in the data base schema which index(s) are to be RealTime. The user can optionally specify additional parameters to tune the RTI performance. For each RealTime Index, two additional files are created when the data base is created. The file names are the concatenation of the data base name and the index name, plus extensions ".rtl" and ".rtb". IV. DBD Schema Statements A. REALTIME qualifier The "REALTIME" qualifier can be added to the "KEY" declaration to cause the relevant index to be a RealTime Index. For example: TD Customers 1000000 records CustNo x(6) KEY REALTIME ... TD Sales 10000000 records CustNo x(6) KEY ... Output from the INFO command shows a qualifier "(r)" to indicate a RealTime Index. B. UNIQUE statement The UNIQUE statement can and should be used for RealTime Indexes in the same way it is used for conventional indexes to optimize performance and storage requirements. C. UNIQUERTL statement The UNIQUERTL statement is a new statement that allow the user optionally to specify RTL parameters. The syntax is: UNIQUERTLspecifies the number of empty nodes to be created when the data base is created; this is useful primarily for calculating disk storage requirements; is a number from the set {128, 256, 512, 1024, 2048, 4096} that specifies the RTL node size; the default is 4096. The high number should be used whenever the average population per value is greater than 1000. When the average population per value is less than 1000, one of the smaller values should be used. For example, UNIQUERTL LastName 0 0 256 uses 256 byte RTL blocks; this is optimum if the average population per value is about 64. D. UNIQUERTB statement The UNIQUERTB statement is a new statement that allow the user optionally to specify RTB parameters. The syntax is: UNIQUERTB specifies the number of empty nodes to be created when the data base is created; this is useful primarily for calculating disk storage requirements; specifies the threshhold, as a percent of the total number of records, where an RTL is to be converted to an RTB; this feature is not yet implemented; specifies which values are to be RTBs; this list may be the literall ALL, to specify all values are to be RTBs; or it may be a list in the form {Value1, Value2, ...} The values in the list may be enclosed in single or double quotes if required. For example, UNIQUERTB state 0 0 {TX, NY, CA} treats TX, NY and CA as RTBs; the others are treated as RTLs. UNIQUERTB state 0 0 ALL treats all states as RTBs. V. Commands A. FIND and MATCH The FIND and MATCH commands are syntactically unchanged. Functionally they will use the new index structures as needed. B. STRUCTURE The STRUCTURE command is syntactically unchanged; but one restriction has been removed. The STRUCTURE/Cn command will work on partially structured RealTime Indexes. In addition, the STRUCTURE/P command is much faster. C. LOAD A new form of the LOAD command has been created to allow data to be imported directly from memory. The prototype is: void ddlLoadMem(CONTROL *ctl, char *OptnString, char *ArgString, char *DataArray, int DataItemSize, long DataItemCount); ddlLoadMem() should always be used with the /B option, indicating fixed length records with no CRLF. VI. Remaining Features As of this writing the RTL (Realtime List) and RTB (Realtime Bitmap) capabilities are operational and are producing spectacular results. However, several planned features are incomplete. They are: . Realtime Indexes in ATTACHed data bases; . Locating RTL and/or RTB files on different logical units; VII. Optimization The highest rates for "load & structure" are achieved by batching requests and using commands and command options that lend themselves to maximum thruput. As would be expected, the overall thruput is directly related to the size of the batch; as the batch size increases past a certain threshhold, the rate of improvement decreases and approaches a steady state. In addition, the higher the cardinality of the data, the higher the steady state batch size becomes. This is especially true for high values of "Low Cardinality" STRUCTURE/C. For example, STRUCTURE/C of 100,000 records with cardinality 10,000 would be, relatively speaking, very slow. In this case, it would be better to use STRUCTURE/P. We can generalize this concept by saying that when the ratio / is less than 100, it is better to use STRUCTURE/P rather than STRUCTURE/C. In general, the technique for optimizing "load and structure" thruput is: 1) Batch load the data [with ddlLoadMem(), for example]; 2) Batch structure the incremental data; In particular, consider a data base with three low cardinality keys and one high cardinality key. Consider further that we wish to batch process 100,000 records at a time. We could do the following: . Create a temporary file containing the 100,000 records; . ddlLoad(ctl,"/","TransActions TransActionsRD TransActionsFile.dat"); . ddlStructure(ctl,"/C100","Field1 Field2 Field3"); . ddlStructure(ctl,"/","Field4"); Another way to approach the same example is: . Batch 100,000 records into a memory array . ddlLoadMem(ctl,"/B","TransActions TransActionsRD DummyFileName",...); . ddlStructure(ctl,"/C100","Field1 Field2 Field3"); . ddlStructure(ctl,"/","Field4"); VII. Benchmarks A. Unbatched Insert and Structure in 80,000,000 record data base Record size 24 characters (All RTL) Three low cardinality keys (10, 10, 1200) and one high cardinality (10000) key . 100,000 records / 197 sec == 507 records per second B. Batched Insert and Structure in 80,000,000 record data base Record size 24 characters (All RTL) Three low cardinality keys (10, 10, 1200) and one high cardinality (10000) key; separate STRUCTURE/P for high cardinality key . 100,000 records / 11 sec == 9090 records per second . 1,000,000 records / 85 sec == 11764 records per second . 1,000,000 records / 84 sec == 11904 records per second C. Batched Insert and Structure in 80,000,000 record data base Record size 24 characters (All RTL) Three low cardinality keys (10, 10, 1200) and one high cardinality (10000) key; single STRUCTURE/C for all 4 keys . Load & Structure 1,000,000 records & 4 fields Load Structure 2001.05.23 00:13 00:51 2001.05.23 00:12 00:50 . Load & Structure 100,000 records & 4 fields Load Structure 2001.05.23 00:01 00:33 D. Batched Insert and Structure in 80,000,000 record data base Record size 24 characters (All RTL) Three low cardinality keys (10, 10, 1200) . Load & Structure 100,000 records & 3 fields Load Structure 2001.05.23 00:01 00:02 2001.07.16 00:02 00:03 E. Batched Insert and Structure in 80,000,000 record data base (ships) Record size 105 characters (All RTL) One low cardinality key (3 chars, cardinality 100) . 1,000 records / 4 sec (1,3) == 250 . 1,000 records / 1 sec (0,1) == 1000 . 10,000 records / 2 sec (1,1) == 5000 . 10,000 records / 1 sec (0,1) == 10000 . 100,000 records / 7 sec (5,2) == 14285 . 100,000 records / 5 sec (3,2) == 20000 . 1,000,000 records / 51 sec (44,7) == 19607 . 1,000,000 records / 42 sec (36,6) == 23809 One low cardinality key (6 chars, cardinality 100) . 1,000 records / 1 sec (0,1) == 1000 . 1,000 records / 2 sec (0,2) == 500 . 10,000 records / 1 sec (0,1) == 10000 . 10,000 records / 1 sec (1,0) == 10000 . 100,000 records / 6 sec (4,2) == 16666 . 100,000 records / 5 sec (4,1) == 20000 . 1,000,000 records / 50 sec (43,7) == 20000 . 1,000,000 records / 42 sec (36,6) == 23809 One low cardinality key (10 chars, cardinality 100) . 1,000 records / 2 sec (0,2) == 500 . 1,000 records / 2 sec (0,2) == 500 . 10,000 records / 1 sec (0,1) == 10000 . 10,000 records / 1 sec (1,0) == 10000 . 100,000 records / 6 sec (5,1) == 16666 . 100,000 records / 5 sec (4,1) == 20000 . 1,000,000 records / 50 sec (44,6) == 20000 . 1,000,000 records / 41 sec (35,6) == 24390 One low cardinality key (15 chars, cardinality 100) . 1,000 records / 2 sec (0,2) == 500 . 1,000 records / 2 sec (0,2) == 500 . 10,000 records / 3 sec (1,2) == 3333 . 10,000 records / 2 sec (0,2) == 5000 . 100,000 records / 5 sec (4,1) == 20000 . 100,000 records / 5 sec (3,2) == 20000 . 1,000,000 records / 50 sec (44,6) == 20000 . 1,000,000 records / 42 sec (36,6) == 23809 Load 1,000,000 records with * 35/34 F. Structure 80,000,000 record data base Record size 24 characters (Mixed RTL & RTB) One low cardinality key (2 chars, cardinality 10) . STRUCTURE/C 80,000,000 record RTB 07:17 . STRUCTURE/C 80,000,000 record RTL 07:47 . STRUCTURE/C 80,000,000 record RTB 07:26 . STRUCTURE/C 80,000,000 record RTL 07:38 . STRUCTURE/C 80,000,000 record RTB 06:52 (MODE TEMP C:\TEMP) . STRUCTURE/C 80,000,000 record RTL 07:27 (MODE TEMP C:\TEMP) G. Find 8,000,000 records in 80,000,000 record data base Record size 24 characters (Mixed RTL & RTB) One low cardinality key (2 chars, cardinality 10) . FIND single RTB 00:01 . FIND single RTL 00:03 . FIND single RTB 00:00 . FIND single RTL 00:03 . FIND single RTB 00:01 . FIND single RTL 00:03 . FIND double (03,05) RTB 00:01 . FIND double (03,05) RTL 00:06 . FIND double (03,05) RTB 00:01 . FIND double (03,05) RTL 00:07 . FIND double (03 OR 05) RTB 00:01 . FIND double (03 OR 05) RTL 00:07 . FIND double (03 OR 05) RTB 00:03 . FIND double (03 OR 05) RTL 00:08 . FIND double (03 OR 05) RTB 00:03 . FIND double (03 OR 05) RTL 00:07 H. Batched Insert and Structure in 80,000,000 record data base Record size 24 characters (One RTL, One RTB) . Load & Structure 100,000 records & 2 fields Load Struct RTB Struct RTL 2001.08.01 (4) 00:01 00:01 00:02 2001.08.01 (4) 00:01 00:01 00:02 . Load & Structure 1,000,000 records & 2 fields Load Struct RTB Struct RTL 2001.08.01 (22) 00:13 00:05 00:04 2001.08.01 (21) 00:13 00:04 00:04 2001.08.01 (21) 00:12 00:04 00:05
Copyright © 2019 , WhamTech, Inc. All rights reserved. This
document is provided for information purposes only and the contents hereof are
subject to change without notice. Names may be
trademarks of their respective owners.