Indexed compressed flat file or ICFF

An indexed compressed flat file (ICFF) can be thought of as a special kind of lookup file: one that stores very large volumes of data without giving up quick access to individual records.

Ordinary lookup files limit the amount of data you can store; ICFFs have no such limit. The Ab Initio help highlights several other important features.

ICFFs present advantages in a number of categories:

* Requires much less disk storage: as the name suggests, ICFFs store compressed data in flat files without the overhead associated with a DBMS, so they need far less disk capacity than databases, on the order of 10 times less.
* Requires much less memory at any one time: because ICFFs organize data in discrete blocks, only a small portion of the data needs to be loaded into memory at any given moment.
* Makes new data available faster: ICFFs let you create successive generations of updated information without any pause in processing, which significantly reduces the time between a transaction taking place and the results of that transaction being accessible.
* Greater query performance: making large numbers of queries against database tables that are continually being updated can slow down a DBMS. In such applications, ICFFs outperform databases.
* Accommodates very large data volumes: ICFFs can handle so much data that it can be feasible to take hundreds of terabytes from archive tapes, convert it into ICFFs, and make it available for online access and processing.

An ICFF application

A typical application for ICFFs is one that involves:
* Large amounts of static data
* Frequent addition of new data
* Very short response times

A good example of such an application is one that updates and accesses a repository of credit card transactions — an archive of information about every purchase made by a company’s customers. A real credit card transaction repository includes copious information about each purchase, but for simplicity this example considers only the customer name, account number, purchase amount, and a timestamp.

Once stored, the data for a particular transaction seldom changes. But with millions of purchases made per hour, the volume of data grows rapidly: tens of megabytes of new data every minute, and tens of terabytes of raw transactions over a year. Most importantly, customer service personnel around the world need to be able to retrieve information about any purchase within five minutes of its completion.
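
The arithmetic behind those volumes is easy to sketch. The purchase rate echoes the figure above; the average record size is purely an assumption for illustration:

```python
# Back-of-envelope sizing; the per-record size is an assumed figure.
purchases_per_hour = 5_000_000   # "millions of purchases made per hour"
bytes_per_record = 500           # assumed average stored size of one transaction

per_minute = purchases_per_hour * bytes_per_record / 60
per_year = purchases_per_hour * bytes_per_record * 24 * 365

print(f"~{per_minute / 1e6:.0f} MB/minute, ~{per_year / 1e12:.0f} TB/year")
# -> ~42 MB/minute, ~22 TB/year of raw transaction data
```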

Separate Ab Initio graphs manage the input data processing and customer service queries. The input processing could be handled by running a mini-batch graph periodically, as in this example. But if the volume of data is high enough, it might be preferable to use a continuous graph, thereby avoiding the overhead of starting up the input processing graph repeatedly.
One could use several approaches to query the data in the repository, including MQ or JMS queues. For this example, suppose there is a Web-based interface that generates Simple Object Access Protocol (SOAP) requests. These SOAP requests are processed by a Web services graph.

But how should the transaction repository be implemented? A DBMS could certainly house all the data, but it would be an expensive solution requiring many terabytes of disk. Further, the DBMS would struggle with the repository being updated and queried simultaneously. Such a solution would not scale well as the frequency of updates and the number of queries increase, and at some point it would fail to turn around transaction data within the required five minutes.

By contrast, implementing the repository as an ICFF using Ab Initio technology solves the problem in an integrated, scalable way.

How indexed compressed flat files work

To create an ICFF, you start with pre-sorted data. Using the WRITE BLOCK-COMPRESSED LOOKUP component, your graph compresses the data and chunks it into blocks of roughly equal size. This continues until the graph reaches the end of the input data or, in the case of a continuous graph, a checkpoint or compute point.

The graph then stores the set of compressed blocks in a data file, each file being associated with a separately stored index that contains pointers back to the individual data blocks. Together, the data file and its index form a single ICFF.
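
Ab Initio's actual on-disk format is proprietary, but the mechanism is easy to model. Here is a minimal Python sketch, assuming zlib for the compression, tab-separated records (so keys and values must contain no tabs or newlines), and a JSON list of (first_key, offset) pairs standing in for the index file; none of these choices reflect the real format:

```python
import json
import zlib

BLOCK_SIZE = 64 * 1024  # target uncompressed block size (an assumed tuning knob)

def write_icff(records, data_path, index_path):
    """Write pre-sorted (key, value) records as compressed blocks plus an index.

    A conceptual stand-in for the WRITE BLOCK-COMPRESSED LOOKUP component,
    not Ab Initio's actual on-disk format.
    """
    index = []                       # one (first_key, byte_offset) entry per block
    buf, first_key, size = [], None, 0

    with open(data_path, "wb") as data:

        def flush():
            nonlocal buf, first_key, size
            if buf:
                index.append((first_key, data.tell()))
                data.write(zlib.compress("\n".join(buf).encode()))
                buf, first_key, size = [], None, 0

        for key, value in records:   # records must already be sorted by key
            if first_key is None:
                first_key = key
            line = f"{key}\t{value}"
            buf.append(line)
            size += len(line)
            if size >= BLOCK_SIZE:   # chunk into blocks of roughly equal size
                flush()
        flush()                      # final, possibly partial block

    with open(index_path, "w") as idx:
        json.dump(index, idx)        # the small, separately stored index
```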

A crucial feature is that, during a lookup operation, most of the compressed lookup data remains on disk — the graph loads only the relatively tiny index file into memory.

Surrogate keys in ICFFs

You can avoid the requirement for the input data to be pre-sorted by having your graph generate a sequence of surrogate keys. Depending on the application, this can improve performance and reduce resource use. But if you plan to implement a very large dataset that will be searched by multiple keys, use “Direct-addressed ICFFs”.
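
A minimal sketch of the surrogate-key idea, using a hypothetical helper that could feed the writer sketch above; a real application would also need to persist the mapping from natural keys to surrogates so that queries can find records:

```python
from itertools import count

def with_surrogate_keys(records):
    """Pair each arriving record with the next value in a sequence.

    Hypothetical helper: because the generated keys only ever increase,
    the output stream is sorted by key by construction and can feed
    write_icff() directly, with no pre-sort.
    """
    for surrogate, record in zip(count(1), records):
        yield surrogate, record
```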

Inside the index file

The index file for an ICFF is small, because it contains only a few vital pieces of information:
* The disk offset needed to access each data block
* The first key value stored in each data block

Notice that you can use the information in the index to figure out exactly which data block you need to search.

* NOTE: The index file also contains header information and an optional screening bitmap.
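
Because the first keys in the index are in ascending order, a binary search pins down the single block that could hold a given key. A sketch against the toy index layout from the writer above:

```python
from bisect import bisect_right

def find_block(index, key):
    """Binary-search the in-memory index for the single candidate block.

    `index` is the ascending list of (first_key, offset) pairs produced by
    write_icff(). Returns the position of the last block whose first key
    is <= key, or None if key sorts before every block.
    """
    first_keys = [entry[0] for entry in index]  # a real reader would build this once
    pos = bisect_right(first_keys, key) - 1
    return pos if pos >= 0 else None
```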

Accessing data stored in an ICFF

When your graph needs to access some of the lookup data stored in an ICFF, the graph uses the index to retrieve the relevant data block or blocks. Even though the total amount of lookup data on disk may be huge, the graph needs to load and uncompress only a small amount of it at any given time.

Having read a data block into memory and uncompressed it, a graph finds the needed record by scanning linearly through the block.
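
Continuing the toy layout from the sketches above, a lookup reads just the bytes of one block, inflates them, and scans. The block's extent is recovered from the next index entry's offset (or from the file size, for the last block):

```python
import os
import zlib

def lookup(data_path, index, key):
    """Fetch one record from the toy ICFF written by write_icff() above.

    Exactly one block is read from disk, decompressed, and scanned;
    everything else stays on disk.
    """
    pos = find_block(index, key)
    if pos is None:
        return None                  # key sorts before the first block
    start = index[pos][1]
    # A block ends where the next one begins, or at the end of the file.
    end = index[pos + 1][1] if pos + 1 < len(index) else os.path.getsize(data_path)
    with open(data_path, "rb") as data:
        data.seek(start)
        block = zlib.decompress(data.read(end - start)).decode()
    for line in block.splitlines():  # linear scan, but only within this block
        k, _, value = line.partition("\t")
        if k == str(key):
            return value
    return None                      # key falls in this block's range but is absent
```

The point to notice is that the index holds one entry per block, not per record, which is what keeps it small enough to live entirely in memory while the compressed data stays on disk.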
