Golden Gate Process ABENDS

1) OGG-01031 Read/Write from trail file

To start off, confirm that the extract from source to target really won't start:

$ cd $GG_HOME
$ ggsci
GGSCI (test) 1> start extract SR3_DP
EXTRACT SR3_DP starting
GGSCI (test) 2> info all
Program     Status      Group       Lag at Chkpt  Time Since Chkpt
MANAGER     RUNNING
JAGENT      STOPPED
EXTRACT     RUNNING     SR1_DP      00:00:00      00:00:01
EXTRACT     RUNNING     SR1_EXT     00:00:00      00:00:05
EXTRACT     RUNNING     SR2_DP      00:00:00      00:00:08
EXTRACT     RUNNING     SR2_EXT     00:00:00      00:00:03
EXTRACT     ABENDED     SR3_DP      00:00:00      00:00:26
REPLICAT    RUNNING     TR1_REP     00:00:00      00:00:03
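
When there are many groups, the abended ones can be picked out of the 'info all' output with a small filter. This is only a sketch: the function name abended_groups is made up for illustration, and it assumes the Program / Status / Group column order shown above.

```shell
# List the groups that 'info all' reports as ABENDED.
# Reads GGSCI output on stdin and assumes the columns are
# Program, Status, Group (as in the listing above).
abended_groups() {
    awk '($1 == "EXTRACT" || $1 == "REPLICAT") && $2 == "ABENDED" { print $3 }'
}

# Possible usage from $GG_HOME (assumes ggsci accepts commands on stdin):
#   echo "info all" | ./ggsci | abended_groups
```

Against the listing above this would print SR3_DP only.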

Check the ggserr.log file for information on the failure:

2013-10-21 13:44:41 INFO OGG-00993 Oracle GoldenGate Capture for Oracle, SR3_DP.prm: EXTRACT SR3_DP started.
2013-10-21 13:44:46 INFO OGG-01226 Oracle GoldenGate Capture for Oracle, SR3_DP.prm: Socket buffer size set to 27985 (flush size 27985).
2013-10-21 13:44:46 ERROR OGG-01031 Oracle GoldenGate Capture for Oracle, SR3_DP.prm: There is a problem in network communication, a remote
file problem, encryption keys for target and source do not match (if using ENCRYPT) or an unknown error. (Reply received is Expected 4 bytes,
but got 0 bytes, in trail /app/oracle/goldengate/dirdat/bd000755, seqno 755, reading record trailer token at RBA 75816525).
2013-10-21 13:44:46 ERROR OGG-01668 Oracle GoldenGate Capture for Oracle, SR3_DP.prm: PROCESS ABENDING.
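
On a busy system ggserr.log is large, so it can help to filter it down to the ERROR lines for one group. The helper below is a sketch under assumptions: the name group_errors is hypothetical, and it relies on the log layout seen above (date, time, severity, message code, then the parameter file name).

```shell
# Print the ERROR lines that ggserr.log holds for one group.
# $1 = path to ggserr.log, $2 = group name (for example SR3_DP).
# Matches on "<group>.prm" case-insensitively, because the log may
# show the parameter file name in lower case.
group_errors() {
    grep -i "${2}\.prm" "$1" | awk '$3 == "ERROR"'
}

# Possible usage:
#   group_errors "$GG_HOME/ggserr.log" SR3_DP | tail -5
```

Wrapped continuation lines of a long message are not matched, so only the first line of each error is shown.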

The message is telling us that the process can't read data from, or write data to, a trail file. The fix used here is to roll over to the next file.

So we need to find which files, and which positions within those files, are in use:

GGSCI (test) 1> send SR3_DP status
Sending STATUS request to EXTRACT SR3_DP ...
EXTRACT SR3_DP (PID 8040)
Current status: Recovery complete: At EOF
Current read position:
Sequence #: 762
RBA: 24099513
Timestamp: 2013-10-21 13:44:14.000000
Extract Trail: /app/oracle/goldengate/dirdat/ba
Current write position:
Sequence #: 755
RBA: 29275759
Timestamp: 2013-10-21 13:44:54.591200
Extract Trail: /app/oracle/goldengate/dirdat/ba

The status output above shows that the extract from source to target has stalled at sequence file 755. The local file at the source end (763) is being written to by the bd_ext process, which is still running, so that file is OK. Trail file 755 should be transferring to the target, and it is that process that has abended, so the files at each end are unusable at the moment.

THE FIX

Stop the Replicat process at the target end. The extract at the source end has already abended, so there is nothing to stop there.

Roll over each end to the next sequence as follows:-

At the extract end (source):-

ALTER EXTRACT SR3_DP, ETROLLOVER 
START EXTRACT SR3_DP

At the replicat end (target)

ALTER REPLICAT TR3_REP, EXTSEQNO 756, EXTRBA 0
START REPLICAT TR3_REP

This rolls the replicat over to file 756, the next in the sequence identified earlier, and sets the offset within that file to the start (0).
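
To double-check which file the replicat will read after the rollover, the next trail file name can be worked out from the stalled sequence number. The sketch below assumes the naming seen in this document (a two-character prefix followed by a six-digit sequence number); next_trail is a hypothetical helper name.

```shell
# Print the trail file name that follows a given sequence number.
# $1 = two-character trail prefix (for example ba)
# $2 = current sequence number, with or without leading zeros
# Assumes the prefix-plus-six-digits naming used in this document.
next_trail() {
    awk -v p="$1" -v s="$2" 'BEGIN { printf "%s%06d\n", p, s + 1 }'
}

# Possible usage: check that the next file exists at the target
#   ls -l /app/oracle/goldengate/dirdat/"$(next_trail ba 755)"
```

For the stalled sequence 755 this gives ba000756, which matches the EXTSEQNO 756 used above.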

Do an 'info all' at each end to make sure things have returned to normal.

 

2) OGG-01031 Unable to open file

The process SR3_DP on the target system kept abending. Error message in the log file:

2014-06-04 14:06:15  ERROR   OGG-01031  Oracle GoldenGate Capture for Oracle, sr3_dp.prm:  There is a problem in network communication, a remote file problem, encryption keys for target 
and source do not match (if using ENCRYPT) or an unknown error. (Reply received is Unable to open file "/app/oracle/product/10.2.0/goldengate/dirdat/ba000920" (error 11, Resource temporarily unavailable)).

Investigation showed that there was an O/S Golden Gate process on the source:

oracle  8213     1   0   Apr 02 ?          10:56 ./server -w 1000 -p 7820-7899 -m 7809 -k -l /app/oracle/product/10.2.0/goldenga

The source database server was locking the file that SR3_DP was trying to access.

THE FIX

The process was killed on the OS:

test> kill -9 8213

test> ps -eaf | grep 8213

 oracle  6303  4350   0 14:21:35 pts/1       0:00 grep 8213

Start the SR3_DP process again
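
Before killing anything, it is worth confirming the PID from the process list rather than typing it by hand. The filter below is a sketch: server_pids is a made-up name, and it assumes a 'ps -ef' style listing in which the PID is the second column and the command starts with ./server, as in the line shown earlier.

```shell
# Print the PIDs of Golden Gate "./server" processes.
# Reads 'ps -ef' style output on stdin; the PID is taken
# from the second column.
server_pids() {
    awk 'index($0, "./server ") > 0 { print $2 }'
}

# Possible usage:
#   ps -ef | server_pids
```

A plain kill (SIGTERM) can be tried first, keeping kill -9 for a process that does not exit.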


3) OGG-01921 Missing GETBEFORECOLS

The error "Missing GETBEFORECOLS with conflict detection enabled in target table TEST.TESTA" was in the ggserr.log for the source replicat process. Golden Gate loses track of where it is between the previous (before) and current (after) values of a record during an outage.

THE FIX

In the replicat report file TR1_REP.rpt, search for the ERROR message:

Source Context :
SourceModule : [er.cdr]
SourceID : [/scratch/pradshar/view_storage/pradshar_sparc9_fbos/oggcore/OpenSys/src/app/er/cdr.cpp]
SourceFunction : [cdr_merge_fmask]
SourceLine : [1746]

2014-06-17 13:06:32  ERROR   OGG-01921  Missing GETBEFORECOLS with conflict detection enabled in target table TEST.TESTA.

***********************************************************************
* ** Run Time Statistics ** *
***********************************************************************

Reading /app/oracle/product/10.2.0/goldengate/dirdat/ba001144, current RBA 50525963, 0 records

Make a note of the ba file name (ba001144) and the current RBA (50525963). This is the record GG has an issue with.
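
The sequence number and RBA can also be pulled out of the report file with a short filter instead of being copied by eye. This is a sketch that assumes the "Reading <trail file>, current RBA <n>," line format shown above and the two-character-prefix plus six-digit trail naming; trail_position is a hypothetical name.

```shell
# Print "<seqno> <rba>" from the "Reading ..., current RBA ..." line
# of a replicat report. Reads the report text on stdin.
# The sequence number is printed without its leading zeros, which is
# the form EXTSEQNO takes elsewhere in this document.
trail_position() {
    sed -n 's|^Reading .*/[A-Za-z0-9]\{2\}\([0-9]\{6\}\), current RBA \([0-9][0-9]*\),.*$|\1 \2|p' |
        awk '{ printf "%d %s\n", $1 + 0, $2 }'
}

# Possible usage:
#   trail_position < dirrpt/TR1_REP.rpt
```

For the report line above this yields 1144 and 50525963.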

We now use logdump to view this record.

From your Golden Gate home, type logdump, then:

Logdump 2 >open /app/oracle/goldengate/dirdat/ba001144

Logdump 2642 >pos 50525963

Reading forward from RBA 50525963

Logdump 2643 >n

 

2014/06/13 08:36:21.057.303 FieldComp            Len   114 RBA 50525963
Name: TEST.TESTA
After  Image:                                             Partition 4   G  m
0000 000a 0000 0000 0000 0000 0001 0001 0005 0000 | ....................
4150 4e00 0200 0a00 0000 0000 0000 6fec 6200 0300 | APN...........o.b...
1f00 0032 3031 332d 3131 2d31 343a 3030 3a35 353a | ...2013-11-14:00:55:
3537 2e39 3137 3539 3730 3030 0004 0003 0000 4400 | 57.917597000......D.
0600 1f00 0032 3031 342d 3036 2d31 323a 3136 3a32 | .....2014-06-12:16:2
393a 3534 2e33 3030 3738 3430 3030                | 9:54.300784000

 

Logdump 2644 >pos rev

Reading in reverse from RBA 50525963

Logdump 2645 >n

 

2014/06/13 08:36:21.057.303 FieldComp            Len   114 RBA 50525737
Name: TEST.TESTA
Before Image:                                             Partition 4   G  m
0000 000a 0000 0000 0000 0000 0001 0001 0005 0000 | ....................
4150 4e00 0200 0a00 0000 0000 0000 6fec 6200 0300 | APN...........o.b...
1f00 0032 3031 332d 3131 2d31 343a 3030 3a35 353a | ...2013-11-14:00:55:
3537 2e39 3137 3539 3730 3030 0004 0003 0000 5900 | 57.917597000......Y.
0600 1f00 0032 3031 332d 3131 2d31 343a 3030 3a35 | .....2013-11-14:00:5
363a 3138 2e35 3938 3231 3230 3030                | 6:18.598212000


So from the above we can find the RBA of the previous record, i.e. 50525737.

We then start the REPLICAT from this position:

GGSCI (test) 11> alter replicat tr1_rep, extseqno 1144 extrba 50525737

GGSCI (test) 11> start replicat tr1_rep