Informix V12 - Fragmentation Vs. Sharding

Abstract

Fragmentation has been available in Informix since V7.00 which was released in the mid-nineties. It allows you to group data rows and/or indexes for a table according to a user-defined distribution scheme and physically place these fragments in separate dbspaces and on separate physical disks. This can bring benefits to query performance by spreading the I/O over multiple devices, by elimating the need to read fragments not relevant to a particular query or even scanning multiple fragments in parallel.


As data volumes grow, the ability to fragment large tables across multiple dbspaces can also reduce the requirement to create dbspaces with larger page sizes and the additional buffer pools required for them.


But, in today’s Big Data era, as data storage requirements grow at an ever increasing pace, what if the performance and capacity of a single server can no longer meet these demands ?


One possible answer could be Sharding.


Sharding was introduced at V12, it allows you to group data rows and index keys for a table according to a user-defined distribution scheme and physically place these fragments on separate servers, locally or remotely. This allows the resources of some or all of these servers to be used when processing queries.


As your database grows, rather than scaling up by adding more processors and/or RAM to an existing server, you can scale out by adding more servers. Also, as Sharding makes use of Informix Enterprise Replication, there is no requirement for the server hardware and operating systems to be the same.

Content

As an example of Fragmentation, the following SQL will create the lightning table, fragmented by expression using the country column and create a Primary Key Constraint in the IDXDBS1 dbspace :-

CREATE TABLE LIGHTNING (
     country INT NOT NULL,
     station INT NOT NULL,
     observed DATETIME YEAR TO SECOND NOT NULL,
     type SMALLINT NOT NULL)
FRAGMENT BY EXPRESSION
     country < 100 IN DATADBS1,
     country >= 100 AND country < 200 IN DATADBS2,
     country >= 200 AND country < 300 IN DATADBS3,
     country >= 300 AND country < 400 IN DATADBS4,
     REMAINDER IN DATADBS5);

CREATE UNIQUE INDEX lightning_pk ON lightning (observed, country, station) IN IDXDBS1;

ALTER TABLE lightning ADD CONSTRAINT PRIMARY KEY
(observed, country, station) CONSTRAINT lightning_pk;

Any row inserted with the country column < 100 will go into the datadbs1 dbspace, those with the country column between 100 and 199 will go into datadbs2 etc., any rows where the country column does not match one of the conditions, the remainder, will go into datadbs5.


One benefit of this distribution scheme may be the elimination of certain fragments when executing a query that includes a range of the column country, eg:-

SELECT   country,
         station,
         observed,
         type
FROM     lightning
WHERE    country > 120
AND      country < 150;

Another may be reading each fragment in parallel using the observed column, eg:-

SELECT   country,
         station,
         observed,
         type
FROM     lightning
WHERE    EXTEND (observed, YEAR TO DAY) = TODAY;

We can extend the principal of the fragmented lightning table further by creating a Sharded table. This will use five separate database servers, shard_serv_1 to shard_serv_5, as opposed to the five dbspaces of the fragmented table, DATADBS1 to DATADBS5. The examples below assume all of the required database servers are already built, On-Line and they belong to the same Enterprise Replication Domain.


Each Shard Server must have a unique SHARD_ID within the Shard Cluster, this ensures Sharded Queries run in parallel on all servers, this is an onconfig parameter and if not already present, it can be set with the following :-

onmode –wf SHARD_ID=<unique positive integer>

If it is already set, any change will require the editing of the onconfig file and a restart of the database server.


Each Shard Server must also have an identical database and table created, eg.

CREATE TABLE LIGHTNING (
     country INT NOT NULL,
     station INT NOT NULL,
     observed DATETIME YEAR TO SECOND NOT NULL,
     type SMALLINT NOT NULL) IN DATADBS1;

CREATE UNIQUE INDEX lightning_pk ON lightning(observed, country, station) IN IDXDBS1;

ALTER TABLE lightning ADD CONSTRAINT PRIMARY KEY
     (observed, country, station) CONSTRAINT lightning_pk;

Once all the tables are created, we can define the Sharded table using cdr define shardCollection :-

cdr define shardCollection lightning_shard test:informix.lightning \
        --type=delete --key=country --strategy=expression --versionCol=observed \
       g_shard_serv_1 “between 0 and 99” \
       g_shard_serv_2 “between 100 and 199” \
       g_shard_serv_3 “between 200 and 299” \
       g_shard_serv_4 “between 300 and 399” \
       g_shard_serv_5 “remainder”

(-- type delete is the default behaviour. This allows the insert of data to the table on any of the database servers in the Shard Cluster. The rows are replicated to the appropriate Shard Server and then deleted from the original server. This is also the same behaviour for sharding defined using MongoDB commands. For the other options please see https://www.ibm.com/support/knowledgecenter/SSGU8G_12.1.0/com.ibm.erep.d...)


As with the earlier Fragmentation example, there must be a remainder Shard Server to catch any data that does not meet any earlier condition when sharding using an expression.


When querying, updating or deleting data from a Sharded table, operations are performed only on the local Shard of the table by default, to work with the complete Sharded table, you must set the USE_SHARDING environment variable from within your session, ie.

SET ENVIRONMENT USE_SHARDING “ON”;
SELECT * FROM LIGHTNING;

Finally, while Fragmentation is not available with the Informix Work Group Edition, interestingly, Sharding is, both in a purely Work Group Edition estate or mixed with Enterprise Edition instances.

Caveats

You cannot run a statement that contains an update to a Shard key that would require a row to move from one Shard server to another, the row must be deleted then inserted with the new values.

Conclusions

Fragmentation is a way to partition horizontally a single table across multiple dbspaces on a single server. Sharding extends this capability to allow the partitioning of a single table across multiple database servers in a shard cluster. You query both a fragmented table and a sharded table in the same way. You do not need to know where the data is. Queries that are performed on one shard server retrieve the relevant data from other servers in a shard cluster. You can add shard servers to the shard cluster as your data grows.


For Work Group Edition users, Sharding possibly provides a more cost-effective solution when catering for database growth than an upgrade to the Enterprise Edition. It allows for unlimited Enterprise Replication Nodes that tables could be sharded across.

Disclaimer

The above is provided "as is" without warranty of any kind, either express or implied, including without limitation any implied warranties of condition, uninterrupted use, merchantability, fitness for a particular purpose, or non-infringement.