An indexed compressed flat file (ICFF) can be thought of as a special kind of lookup file that stores large volumes of data without compromising quick access to individual records.
Common lookup files limit the amount of data you can store; ICFFs do not share that limitation. Other important features, as gathered from the help documentation:
ICFFs present advantages in a number of categories:
* Much less disk storage: as the name suggests, ICFFs store compressed data in flat files without the overhead associated with a DBMS, so they require much less disk storage capacity than databases, on the order of 10 times less.
* Much less memory at any one time: because ICFFs organize data in discrete blocks, only a small portion of the data needs to be loaded into memory at any one time.
* Much faster updates: ICFFs allow you to create successive generations of updated information without any pause in processing, which significantly reduces the time between a transaction taking place and the results of that transaction becoming accessible.
* Greater query performance: making large numbers of queries against database tables that are continually being updated can slow down a DBMS. In such applications, ICFFs outperform databases.
* Very large data volumes: ICFFs can easily accommodate very large amounts of data, so large, in fact, that it can be feasible to take hundreds of terabytes of data from archive tapes, convert it into ICFFs, and make it available for online access and processing.
An ICFF application
A typical application for ICFFs is one that involves:
* Large amounts of static data
* Frequent addition of new data
* Very short response times
A good example of such an application is one that updates and accesses a repository of credit card transactions — an archive of information about every purchase made by a company’s customers. A real credit card transaction repository includes copious information about each purchase, but for simplicity this example considers only the customer name, account number, purchase amount, and a timestamp.
Once stored, the data for a particular transaction seldom changes. But with millions of purchases made per hour, the volume of data grows rapidly. On average, several megabytes of data may get added to the repository every few minutes. A year’s worth of raw transactions can require hundreds of terabytes of storage. Most importantly, customer service personnel around the world often need to be able to retrieve information about any purchase within five minutes of its completion.
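For concreteness, here is a minimal sketch of the simplified transaction record described above, written in Python purely for illustration; the field names and types are assumptions of this example, not a prescribed Ab Initio record format.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Transaction:
    """Simplified credit card transaction record used in this example."""
    account_number: str     # key used to look up a customer's purchases
    customer_name: str
    purchase_amount: float
    timestamp: datetime     # when the purchase was made
```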
Separate Ab Initio graphs manage the input data processing and customer service queries. The input processing could be handled by running a mini-batch graph periodically, as in this example. But if the volume of data is high enough, it might be preferable to use a continuous graph, thereby avoiding the overhead of starting up the input processing graph repeatedly.
One could use several approaches to query the data in the repository, including MQ or JMS queues. For this example, suppose there is a Web-based interface that generates Simple Object Access Protocol (SOAP) requests. These SOAP requests are processed by a Web services graph.
But how should the transaction repository be implemented? A DBMS could certainly house all the data, but it would be an expensive solution requiring many terabytes of disk. Further, the DBMS would have trouble coping with the fact that the repository is being updated and queried simultaneously. Such a solution would not scale well as the frequency of updates and the number of queries increase, and at some point it would fail to turn around transaction data within five minutes as required.
By contrast, implementing the repository as an ICFF using Ab Initio technology solves the problem in an integrated, scalable way.
How indexed compressed flat files work
To create an ICFF, you start with pre-sorted data. Your graph, using the WRITE BLOCK-COMPRESSED LOOKUP component, compresses and chunks the data into blocks of more or less equal size. This continues until the graph reaches the end of the input data — or, in the case of a continuous graph, a checkpoint or compute-point.
The graph then stores the set of compressed blocks in a data file, each file being associated with a separately stored index that contains pointers back to the individual data blocks. Together, the data file and its index form a single ICFF.
A crucial feature is that, during a lookup operation, most of the compressed lookup data remains on disk — the graph loads only the relatively tiny index file into memory.
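As a rough conceptual sketch, assuming pre-sorted (key, payload) records and an arbitrary target block size, the write side could look like the following Python. This is only an illustration of the block-plus-index idea, not the actual file format produced by the WRITE BLOCK-COMPRESSED LOOKUP component.

```python
import zlib

BLOCK_TARGET_SIZE = 64 * 1024   # assumed target block size, purely illustrative

def write_icff(sorted_records, data_path, index_path):
    """Compress pre-sorted records into blocks and build a (first_key, offset) index."""
    index = []                       # one (first key in block, byte offset) entry per block
    buffer, first_key = [], None
    with open(data_path, "wb") as data_file:
        def flush():
            nonlocal buffer, first_key
            if not buffer:
                return
            block = zlib.compress(b"".join(buffer))
            index.append((first_key, data_file.tell()))          # remember where the block starts
            data_file.write(len(block).to_bytes(4, "big") + block)
            buffer, first_key = [], None

        for key, payload in sorted_records:      # records must already be sorted by key
            if first_key is None:
                first_key = key
            buffer.append(payload)
            if sum(len(p) for p in buffer) >= BLOCK_TARGET_SIZE:
                flush()                           # block reached its target size
        flush()                                   # final partial block

    with open(index_path, "w") as index_file:     # the tiny index: one line per block
        for key, offset in index:
            index_file.write(f"{key}\t{offset}\n")
    return index
```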
Surrogate keys in ICFFs
You can avoid the requirement for the input data to be pre-sorted by having your graph generate a sequence of surrogate keys. Depending on the application, this can improve performance and reduce resource use. But if you plan to implement a very large dataset that will be searched by multiple keys, use “Direct-addressed ICFFs”.
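As an illustration of the idea only (not Ab Initio's actual surrogate-key mechanism), generating a monotonically increasing key as records arrive yields a stream that is already sorted on that key, so it can be written to the ICFF without a separate sort step:

```python
def assign_surrogate_keys(records, start=1):
    """Pair each incoming record with an ascending integer surrogate key.

    Because the keys are generated in increasing order, the output stream is
    already sorted on the surrogate key and can be written straight to an ICFF.
    """
    return enumerate(records, start)
```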
Inside the index file
The index file for an ICFF is small, because it contains only a few vital pieces of information:
* The disk offset needed to access each data block
* The first key value stored in each data block
Notice that you can use the information in the index to figure out exactly which data block you need to search.
* NOTE: The index file also contains header information and an optional screening bitmap.
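Conceptually, the index behaves like a sorted list of (first key, offset) pairs, and a binary search over the first-key values picks out the single block that could contain a given key. The sketch below assumes that in-memory representation and is not the real index file layout:

```python
from bisect import bisect_right

def find_block(index, search_key):
    """Return the byte offset of the one block whose key range could cover search_key.

    index is a list of (first_key, offset) pairs sorted by first_key.
    """
    first_keys = [entry[0] for entry in index]
    position = bisect_right(first_keys, search_key) - 1
    if position < 0:
        return None          # key is smaller than the first key in the file
    return index[position][1]
```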
Accessing data stored in an ICFF
When your graph needs to access some of the lookup data stored in an ICFF, the graph uses the index to retrieve the relevant data block or blocks. Even though the total amount of lookup data on disk may be huge, the graph needs to load and uncompress only a small amount of it at any given time.
Having read a data block into memory and uncompressed it, a graph finds the needed record by scanning linearly through the block.
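Continuing the same hypothetical sketch, a lookup seeks to the block offset returned by the find_block() helper above, decompresses just that one block, and scans it linearly for the matching key; parse_record is an assumed helper that turns the raw block bytes back into (key, record) pairs:

```python
import zlib

def lookup(data_path, index, search_key, parse_record):
    """Conceptual lookup: load and uncompress only the one relevant block, then scan it."""
    offset = find_block(index, search_key)        # helper from the index sketch above
    if offset is None:
        return None
    with open(data_path, "rb") as data_file:
        data_file.seek(offset)
        length = int.from_bytes(data_file.read(4), "big")
        block = zlib.decompress(data_file.read(length))
    for key, record in parse_record(block):       # linear scan within the single block
        if key == search_key:
            return record
    return None
```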