Configuring Transparent Application Failover (TAF) in Oracle RAC

Transparent Application Failover: Overview

TAF is a runtime feature of the OCI driver. It enables your application to automatically reconnect to the service if the initial connection fails. During the reconnection, although your active transactions are rolled back, TAF can optionally resume the execution of a SELECT statement that was in progress. TAF supports two failover methods:

  • With the BASIC method, the reconnection is established at failover time. After the service has been started on the nodes (1), the initial connection (2) is made. The listener establishes the connection (3), and your application accesses the database (4) until the connection fails (5) for any reason. Your application then receives an error the next time it tries to access the database (6). Then, the OCI driver reconnects to the same service (7), and the next time your application tries to access the database, it transparently uses the newly created connection (8). TAF can be enabled to receive FAN events for faster detection of down events and faster failover.
  • The PRECONNECT method is similar to the BASIC method, except that a shadow connection is also created during the initial connection in anticipation of the failover. TAF guarantees that the shadow connection is always created on the available instances of your service by using an automatically created and maintained shadow service.

TAF Basic Configuration on the Server Side: Example

Before using TAF, it is recommended that you create and start a service that is to be used when establishing connections. By doing so, you benefit from the integration of TAF and services. When you wish to use BASIC TAF with a service, you should use the -failovermethod BASIC option when creating the service (TAF failover method is used for backward compatibility only). You can define the TAF policy by setting the number of times that a failed session attempts to reconnect to the service and how long it should wait between reconnection attempts using the -failoverretry and -failoverdelay parameters, respectively. After the service is created, you simply start it on your database.

$ srvctl add service -db RACDB -service APSVC -failovermethod BASIC -failovertype SELECT -failoverretry 10 -failoverdelay 30 -serverpool sp1
$ srvctl start service -db RACDB -service APSVC
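
To make the retry and delay settings more concrete, the following shell sketch mimics the policy: it retries an arbitrary command a fixed number of times and pauses between attempts. It is only an illustration of the idea, not how the OCI driver implements TAF. The names retry_with_delay and fake_connect are hypothetical, and the delay is set to 0 so that the demonstration finishes immediately.

```shell
#!/bin/sh
# Hypothetical helper: run a command up to <retries> times,
# sleeping <delay> seconds between failed attempts.
retry_with_delay() {
  retries=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$retries" ]; do
    if "$@"; then
      echo "connected on attempt $attempt"
      return 0
    fi
    attempt=$((attempt + 1))
    # Only pause if another attempt is still allowed.
    [ "$attempt" -le "$retries" ] && sleep "$delay"
  done
  echo "gave up after $retries attempts"
  return 1
}

# Stand-in for a connection attempt (an assumption for this demo):
# it fails on the first two calls and succeeds on the third.
calls=0
fake_connect() {
  calls=$((calls + 1))
  [ "$calls" -ge 3 ]
}

# 10 retries as in the srvctl example; delay 0 instead of 30 for the demo.
retry_with_delay 10 0 fake_connect
```

With the values from the srvctl example above (10 retries, 30 seconds apart), the pauses alone could add up to roughly five minutes before the reconnection is abandoned.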

TAF can be configured on the client side in tnsnames.ora or on the server side using the srvctl utility, as shown above. Configuring it on the server is preferred because the configuration then lives in a single place.

apsvc =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = cluster01-scan)(PORT = 1521))
    (CONNECT_DATA =
      (SERVICE_NAME = apsvc)))

Your application needs to connect to the service by using a connection descriptor similar to the one shown above. Notice that the cluster SCAN is used in the descriptor. Once connected, the GV$SESSION view shows that the connection is TAF-enabled: the FAILOVER_METHOD and FAILOVER_TYPE columns reflect the service settings and confirm that the TAF configuration is correct.

$ sqlplus AP/[email protected]
SQL> SELECT inst_id, username, service_name, failover_type,
            failover_method
     FROM gv$session WHERE username='AP';

INST_ID  USERNAME  SERVICE_NAME  FAILOVER_TYPE  FAILOVER_METHOD
-------  --------  ------------  -------------  ---------------
      1  AP        apsvc         SELECT         BASIC

TAF Basic Configuration on the Client Side: Example

To use client-side TAF, create and start your service using SRVCTL, and then configure TAF by defining a TNS entry for it in your tnsnames.ora file as shown below.

$ srvctl add service -db RACDB -service AP -serverpool sp1
$ srvctl start service -db RACDB -service AP

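AP =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = cluster01-scan)(PORT = 1521))
    (CONNECT_DATA =
      (SERVICE_NAME = AP)
      (FAILOVER_MODE = (TYPE = select)(METHOD = basic)(RETRIES = 20)(DELAY = 15))))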

The FAILOVER_MODE parameter must be included in the CONNECT_DATA section of a connect descriptor. In the example above, if the instance fails after the connection, then the TAF application fails over to the other node’s listener, preserving any SELECT statements in progress. If the failover connection fails, then Oracle Net waits 15 seconds before trying to reconnect. Oracle Net attempts to reconnect up to 20 times.

TAF Preconnect Configuration: Example

To use PRECONNECT TAF, you must create a service with preferred and available instances. Also, for the shadow service to be created and managed automatically by Oracle Clusterware, you must define the service with the -tafpolicy PRECONNECT option. TAF policy specification is for administrator-managed databases only. The shadow service is always named using the format [service_name]_PRECONNECT.

$ srvctl add service -db RACDB -service ERP -preferred I1 -available I2 -tafpolicy PRECONNECT
$ srvctl start service -db RACDB -service ERP

When the TAF service settings are defined on the client side only, you need to configure a special connection descriptor in your tnsnames.ora file to use the PRECONNECT method. One such connection descriptor is shown below.



The main differences with the previous example are that METHOD is set to PRECONNECT and an additional parameter is added. This parameter is called BACKUP and must be set to another entry in your tnsnames.ora file that points to the shadow service.
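
As a rough illustration of the two entries just described, the shell sketch below prints a possible pair of tnsnames.ora entries and checks that their parentheses balance. It rests on several assumptions: the host cluster01-scan and port 1521 are borrowed from the earlier server-side example, TYPE is set to session to match the FAILOVER_TYPE reported for the ERP service in the verification output, and print_preconnect_entries is a hypothetical helper name. Adapt the addresses to your own cluster before using anything like this.

```shell
#!/bin/sh
# Sketch: print tnsnames.ora entries for client-side PRECONNECT TAF.
# Host, port, and TYPE are assumptions taken from other examples in this
# document, not values confirmed for this particular descriptor.
print_preconnect_entries() {
  cat <<'EOF'
ERP =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = cluster01-scan)(PORT = 1521))
    (CONNECT_DATA =
      (SERVICE_NAME = ERP)
      (FAILOVER_MODE = (BACKUP = ERP_PRECONNECT)(TYPE = session)(METHOD = preconnect))))

ERP_PRECONNECT =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = cluster01-scan)(PORT = 1521))
    (CONNECT_DATA =
      (SERVICE_NAME = ERP_PRECONNECT)))
EOF
}

# Basic sanity check: the number of "(" must equal the number of ")".
opens=$(print_preconnect_entries | tr -cd '(' | wc -c)
closes=$(print_preconnect_entries | tr -cd ')' | wc -c)
if [ "$((opens))" -eq "$((closes))" ]; then
  echo "parentheses balanced"
else
  echo "parentheses NOT balanced"
fi
```

The BACKUP value names the second entry, which in turn points at the shadow service that Oracle Clusterware maintains for ERP.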

TAF Verification

To determine whether TAF is correctly configured and whether connections are associated with a failover option, examine the V$SESSION view. The FAILOVER_TYPE, FAILOVER_METHOD, FAILED_OVER, and SERVICE_NAME columns show the connected clients and their TAF status. The query below is one that you could execute to verify that you have correctly configured TAF. This example is based on the previously configured AP and ERP services and their corresponding connection descriptors.

SELECT   machine, failover_method, failover_type,
         failed_over, service_name, COUNT(*)
FROM     v$session
GROUP BY machine, failover_method, failover_type,
         failed_over, service_name;

First node

MACHINE  FAILOVER_METHOD  FAILOVER_TYPE  FAILED_OVER  SERVICE_NAME  COUNT(*)
-------  ---------------  -------------  -----------  ------------  --------
node1    BASIC            SESSION        NO           AP                   1
node1    PRECONNECT       SESSION        NO           ERP                  1

Second node

MACHINE  FAILOVER_METHOD  FAILOVER_TYPE  FAILED_OVER  SERVICE_NAME  COUNT(*)
-------  ---------------  -------------  -----------  ------------  --------
node2    NONE             NONE           NO           ERP_PRECO            1

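Second node (after failure of the first instance)

MACHINE  FAILOVER_METHOD  FAILOVER_TYPE  FAILED_OVER  SERVICE_NAME  COUNT(*)
-------  ---------------  -------------  -----------  ------------  --------
node2    BASIC            SESSION        YES          AP                   1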

The first output above is the result of running the query on the first node after two SQL*Plus sessions from the first node have connected to the AP and ERP services, respectively. The output shows that the AP connection ended up on the first instance. Because of the load-balancing algorithm, it could equally have ended up on the second instance. In contrast, the ERP connection must end up on the first instance because it is the only preferred one.

The second output is the result of running the query on the second node before any connection failure. Note that there is currently one unused connection established under the ERP_PRECONNECT service, which is automatically started on the available instance of the ERP service.

The third output is the one corresponding to the execution of the query on the second node after the failure of the first instance. A second connection has been created automatically for the AP service connection, and the original ERP connection now uses the preconnected connection.
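
When many sessions are involved, it can help to filter the query output rather than read it by eye. The sketch below does this with awk on rows copied from the three outputs above. It assumes the output has been saved as whitespace-separated rows without headings, in the column order of the query (machine, failover_method, failover_type, failed_over, service_name, count). The function names are hypothetical.

```shell
#!/bin/sh
# Sample rows copied from the outputs above, in query column order:
# machine failover_method failover_type failed_over service_name count
sample_rows() {
  cat <<'EOF'
node1 BASIC SESSION NO AP 1
node1 PRECONNECT SESSION NO ERP 1
node2 NONE NONE NO ERP_PRECO 1
node2 BASIC SESSION YES AP 1
EOF
}

# List sessions that have already failed over (FAILED_OVER = YES).
failed_over_sessions() {
  awk '$4 == "YES" { print $5 " on " $1 }'
}

# List sessions that have no TAF method set (FAILOVER_METHOD = NONE).
sessions_without_taf() {
  awk '$2 == "NONE" { print $5 " on " $1 }'
}

sample_rows | failed_over_sessions
sample_rows | sessions_without_taf
```

On the sample rows, the first filter reports the AP session that moved to node2, and the second reports the shadow connection, which is expected to show NONE because it is only a placeholder until a failover occurs.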

FAN Connection Pools and TAF Considerations

Because connection load balancing is a listener functionality, both Fast Connection Failover (FCF) and TAF automatically benefit from connection load balancing for services. When you use FCF, there is no need to use TAF. For example, you do not need to preconnect if you use FAN in conjunction with connection pools, because the connection pool is always preconnected.

With both techniques, you automatically benefit from VIPs at connection time. This means that your application does not rely on lengthy operating system connection timeouts at connect time or when issuing a SQL statement. However, when the application is already in the SQL stack and blocked on a read/write call, it needs to be integrated with FAN in order to receive an interrupt if a node goes down. In the same situation, TAF may have to rely on OS timeouts to detect the failure, so failing over the connection takes much longer than when using FAN.