Applies to: Oracle GoldenGate - Version: 10.4.0.0 and later [Release: 10.4.0 and later]
Information in this document applies to any platform.
Goal
Replicating a high volume of change data in a single data stream (one extract, one pump, one replicat, for example) often does not provide enough throughput to maintain latency requirements for real-time data feeds between source and target systems. Understanding how to split the data into evenly distributed data streams is fundamental to increasing throughput.
After a reasonable sample of change data has been captured and written to a trail, the trail data will be used as the primary input to methodically spread the tables across multiple replication processes for increased throughput. This is done by extracting change data over a representative period of time that might reflect either a specific load test or a typical production transaction mix.
Once the trail(s) exist, a feature of the logdump utility is used to log table details (bytes, inserts, updates, deletes, etc.) to a loosely structured text file. Before this data is used in a spreadsheet, the logdump output is first reformatted into horizontal records with one row per table and values separated by commas. Reformatting is done using standard UNIX commands strung together at the command line or in shell scripts.
Finally, the formatted data is imported into MS Excel (or any spreadsheet application) and percentages are calculated based on change data bytes. The resulting spreadsheet then provides the insight needed to evenly distribute the data load across multiple processes. The spreadsheet may also double as a summary report to customers.
Undoubtedly some tables will account for a greater percentage of the change data than others. In instances when, for example, one table accounts for 90% of all the change data bytes, the load balancing method described here will highlight that fact, but balancing the load across multiple processes will typically require additional methods such as the use of the "range" function in table or map statements. Ranges are not covered in this document.
Solution
Logdump
Using logdump to generate a detail report on trail data is straightforward, but preparing the data for use in Excel requires reformatting, which is done here using UNIX shell commands and scripts. First we'll look at generating the detail report with logdump manually and then doing the same via a shell script.
Running Logdump Manually
Like all GoldenGate core product executables starting with version 9.5, logdump must be run directly from the installation directory or the associated libraries will not be found, regardless of environment library path settings.
1. Change directories to the GoldenGate home (here identified by the environment variable GGS_HOME but this can be substituted by the actual path name).
cd $GGS_HOME
or
cd <full path of GG installation>
2. Start logdump
./logdump
3. Log the output to a text file
Logdump > log logdump_output01.txt
--- Session log logdump_output01.txt opened 2009/05/27 11:28:29.540.030 ---
4. Issue a detailed count on one or more trail files. (Wild cards and some regular expressions may be used.)
Logdump > detail on
Logdump > count ./dirdat/1a00*
or
Logdump > count detail ./dirdat/1a00*
Output will be both echoed to the screen and saved in the file indicated by the log command. If the file does not exist, logdump will create it. If the file already exists, logdump will append output to it, so be careful to rename or remove the file prior to subsequent logdump runs (an example is shown after step 5) or the file will need to be edited manually to remove redundant data.
5. Exit logdump
Logdump > exit
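As noted in step 4, the session log should be renamed or removed before a subsequent run, for example (the file name is illustrative and should match whatever name is passed to the log command):
mv logdump_output01.txt logdump_output01.txt.bak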
Running Logdump via a Script
The steps in the preceding section can also be scripted as shown below (See Attachment)
#!/bin/sh
# Run logdump non-interactively and capture a "count detail" report for the trail(s).
# The output file name includes the current process ID ($$) so each run is unique.
LOGDUMP_OUTPUT_FILE=logdump_output_$$.txt
GGS_HOME=/home/ggconsul/ggs/ggs103
TRAIL_NAME=$GGS_HOME/dirdat/1a*
cd $GGS_HOME
# Feed logdump its commands from a here-document and redirect all output to the log file
./logdump << EOF > $LOGDUMP_OUTPUT_FILE
detail on
count $TRAIL_NAME
exit
EOF
echo "Logdump output can be found in the file: $LOGDUMP_OUTPUT_FILE"
The output log file will look similar to:
Logdump 2243 >count dirdat/e1a/*00*
Current LogTrail is /apps/ggsadm/ggsm1/dirdat/e1a/1a000000
LogTrail /apps/ggsadm/ggsm1/dirdat/e1a/1a000000 has 249072 records
LogTrail /apps/ggsadm/ggsm1/dirdat/e1a/1a000000 closed
. . .
LogTrail /apps/ggsadm/ggsm1/dirdat/e1a/*00* has 1997810 records
Total Data Bytes 2944052112
Avg Bytes/Record 1473
Delete 3811
Insert 1972364
FieldComp 20873
GGSPKUpdate 754
Others 8
Before Images 3811
After Images 1993991
Average of 12330 Transactions
Bytes/Trans ..... 7777
Records/Trans ... 162
Files/Trans ..... 0
*FileHeader* Partition 0
Total Data Bytes 9321
Avg Bytes/Record 1165
Others 8
XXCCS_GGT_O.GGS_HEARTBEAT Partition 4
Total Data Bytes 117001
Avg Bytes/Record 154
FieldComp 5
GGSPKUpdate 754
After Images 759
XXCCS_GGT_O.ST_KS_CSI_ITEM_INSTANCES_H Partition 4
Total Data Bytes 1504432167
Avg Bytes/Record 1349
Delete 3811
Insert 1090023
FieldComp 20868
Before Images 3811
After Images 1110891
Formatting with UNIX
Reformatting Logdump Data: Quick Minimalist Method
This section covers how to get the minimum amount of data needed to load balance: the combination of table owner and table name, plus the associated bytes. Load balancing is always done based on bytes.
Thus we will focus on two key lines in the logdump report.
1. The line that contains the string "Partition 4", which originates from NSK (NonStop Kernel) but is consistently found on, and only on, the line that contains the owner and table names. This eliminates unwanted data such as the summary data for all tables and, in GG 10+, the trail file header data.
2. The line immediately following "Partition 4", which always begins with "Total Data Bytes". If the table shows up in the trail then it is implicit that there will be associated bytes. DDL bytes associated with a table are not included in the report.
Quickest with Linux
The simplest way to do this is on Linux, whose version of "grep" has an extended option that allows a line to be pulled (or "grepped") along with any number of lines above or below it. If the system where logdump runs is not Linux, the text file can still be copied to a Linux machine or VM (virtual machine) where quick formatting can occur.
Here we will grep the line with "Partition 4" and the immediately following line by using the "-A" option for grep on Linux followed by the number of subsequent lines (1):
grep -A 1 "Partition 4" logdump_output_1234.txt
This will produce output such as:
XXCCS_GGT_O.GGS_HEARTBEAT Partition 4
Total Data Bytes 117001
--
XXCCS_GGT_O.ST_KS_CSI_ITEM_INSTANCES_H Partition 4
Total Data Bytes 1504432167
--
XXCCS_GGT_O.ST_KS_XXCSS_QOT_LINES_ALL Partition 4
Total Data Bytes 1439493623
A consequence of using the "-A" option is that it puts a line containing "--" between each set of lines pulled out (the line grepped plus the one after it). These lines will be deleted using the UNIX sed (stream editor) command because that facilitates joining every other line when we next use the UNIX paste command.
Once lines are joined, only the first column (owner.tablename) and seventh column (total bytes for that table) will be printed, separated by a comma. Finally, the output will be redirected to a new file that can be imported into Excel or viewed for a quick eyeball comparison (see the section "I Don't Care About Excel; I Just Want a Quick, Sorted Look" below for how to sort the output by bytes).
All this can be done on one line by UNIX's ability to "pipe" (|) the output of one command into the input of the next command allowing several commands to be strung together. The following should be typed on one line followed by a return.
grep -h -A 1 "Partition 4" logdump_output_1234.txt sed '/^\-\-/d' paste - - awk '{print $1 "," $7}' > logdump_output_simple.csv
The output values will be separated by commas and can be imported into Excel as a CSV (comma separated values) file, which will look something like:
XXCCS_GGT_O.GGS_HEARTBEAT,117001
XXCCS_GGT_O.ST_KS_CSI_ITEM_INSTANCES_H,1504432167
XXCCS_GGT_O.ST_KS_XXCSS_QOT_LINES_ALL,1439493623
Second Quickest with all UNIX-like Systems
If not using Linux (or even if using Linux), the next simplest method to pull out the owner names, table names and bytes is to use the extended grep (egrep) to grab lines matching one of multiple expressions and redirect the output to a temporary file. Next, the file will be manually edited as instructed below. Then the UNIX paste and awk commands are used to print only the needed data from the temporary file, which is redirected to a CSV file that can be imported into a spreadsheet. The following should be typed on one line followed by a return.
egrep "Partition 4|^Total Data Bytes" logdump_output_1234.txt > logdump_output_simple.tmp
Sample output is:
Total Data Bytes 2944052112
Total Data Bytes 9321
XXCCS_GGT_O.GGS_HEARTBEAT Partition 4
Total Data Bytes 117001
XXCCS_GGT_O.ST_KS_CSI_ITEM_INSTANCES_H Partition 4
Total Data Bytes 1504432167
XXCCS_GGT_O.ST_KS_XXCSS_QOT_LINES_ALL Partition 4
Total Data Bytes 1439493623
Notice that there are more lines starting with "Total Data Bytes" than there are table names listed. Delete all lines beginning with "Total Data Bytes" that do not immediately follow a table name (denoted by "Partition 4"), as these hold summary data for all tables and, in GG 10+, trail header data. (A scripted alternative to this manual edit is sketched below.)
The manually altered file will look like:
XXCCS_GGT_O.GGS_HEARTBEAT Partition 4
Total Data Bytes 117001
XXCCS_GGT_O.ST_KS_CSI_ITEM_INSTANCES_H Partition 4
Total Data Bytes 1504432167
XXCCS_GGT_O.ST_KS_XXCSS_QOT_LINES_ALL Partition 4
Total Data Bytes 1439493623
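As an alternative to editing the file by hand, a small awk sketch along these lines can produce the same result (the output file name is illustrative): it prints a "Total Data Bytes" line only when the previous line contained "Partition 4". The following should be typed on one line followed by a return.
awk '/Partition 4/ {print; keep=1; next} /^Total Data Bytes/ && keep {print} {keep=0}' logdump_output_simple.tmp > logdump_output_simple.clean
The resulting file can then be used in place of the manually edited file in the step below.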
In the final step to prepare the text file for input into Excel, join every other line with the UNIX paste command and then print out (awk) only the first and seventh columns and separate values by commas. The following should be typed on one line followed by a return.
cat logdump_output_simple.tmp | paste - - | awk '{print $1 "," $7}' > logdump_output_simple.csv
The output will be:
XXCCS_GGT_O.GGS_HEARTBEAT,117001
XXCCS_GGT_O.ST_KS_CSI_ITEM_INSTANCES_H,1504432167
XXCCS_GGT_O.ST_KS_XXCSS_QOT_LINES_ALL,1439493623
I Don't Care About Excel; I Just Want a Quick, Sorted Look
If a quick look is all that is needed to identify the heaviest tables then the last part of the last example (after the manual file edit) could be rewritten to sort in descending order based on bytes as follows (the following should be typed on one line followed by a return):
cat logdump_output_simple.tmp | paste - - | awk '{print $7 "\t" $1}' | sort -nr
With output as:
1504432167 XXCCS_GGT_O.ST_KS_CSI_ITEM_INSTANCES_H
1439493623 XXCCS_GGT_O.ST_KS_XXCSS_QOT_LINES_ALL
117001 XXCCS_GGT_O.GGS_HEARTBEAT
In the above example the seventh column containing the bytes is written first, followed by a tab ("\t") for readability, then the owner.table name. The -n option tells sort to sort numerically instead of the default alphabetical sort, while the -r option reverses the order (descending) so that the heaviest tables appear highest on the list.
Reformatting Logdump Data: Complete Method
The previous section covered only that which is needed to load balance, namely the owner.table names and associated number of bytes. This section is more script intensive as it will cover formatting data for all reported details for each table. These details can be used to generate a more complete spreadsheet report for the customer and/or consultant. The possible details for each table include the following (always in this order):
Total Data Bytes - the number of data bytes the table took up in the trail (apparently not including header and metadata)
Insert - number of inserts
Delete - number of deletes
FieldComp - number of non-primary key updates
GGSPKUpdate - number of primary key updates
Before - number of before images
After - number of after images
Looking again at a portion of the sample logdump output we see:
XXCCS_GGT_O.GGS_HEARTBEAT Partition 4
Total Data Bytes 117001
Avg Bytes/Record 154
FieldComp 5
GGSPKUpdate 754
After Images 759
XXCCS_GGT_O.ST_KS_CSI_ITEM_INSTANCES_H Partition 4
Total Data Bytes 1504432167
Avg Bytes/Record 1349
Delete 3811
Insert 1090023
FieldComp 20868
Before Images 3811
After Images 1110891
This data is parsed using the script parse_count_detail_vXY.sh, which is provided in Appendix A. However, the pseudo logic and usage are briefly described below.
1. Print column header information
2. Read each line of input and put each whitespace-separated element into an array
3. If the second array element is "Partition" and the third is "4", flag a new table record as found and split the first element on the period (.) into the owner name and table name
4. Read each line of the record and capture the value for each of the following:
Insert
Delete
FieldComp
GGSPKUpdate
Before
After
5. If the first element of the line is "After", print all the saved table data to the screen and to a file, reset the counters, and unset the found flag
6. Return to step 3 until end of file
7. At end of file, print to the screen the name of the file containing the formatted output and exit
Basic usage of the parse_count_detail_vXY.sh script consists of passing in the file name that contains the logdump count detail output.
./parse_count_detail_v01.sh logdump_output_1234.txt
Output is:
Ownername,Tablename,Bytes,Insert,Delete,Update,PKUpdate,After,Before,%ofTotal,RunTotal
XXCCS_GGT_O,GGS_HEARTBEAT,117001,0,0,5,754,759,0
XXCCS_GGT_O,ST_KS_CSI_ITEM_INSTANCES_H,1504432167,1090023,3811,20868,0,1110891,3811
XXCCS_GGT_O,ST_KS_XXCSS_QOT_LINES_ALL,1439493623,882341,0,0,0,882341,0
*****************************************
Excel friendly, comma delimited output can be found in: count_detail_output_21768.csv
*****************************************
The output file contains all but the last three lines of what was printed to the screen and can be imported to Excel directly without modification.
Using Formatted Data in Excel
The following can be done with any spreadsheet application but Excel is used throughout this section because of its ubiquity.
Importing Formatted Data into Excel
Once the output has been formatted, saved and brought to a machine with Excel installed, it can be imported as a comma delimited (CSV) file.
Appendix A
Complete Script: parse_count_detail_v01.sh (See Attachment)
The following korn shell (ksh) script will parse the output from a count detail done with the logdump utility and produce comma separated value (CSV) records in a file. This file will contain all details for each table, with each table name and its associated data on a separate line.
The text of this script is further below.
If there is trouble running this script it is probably due to one of two reasons.
1. The script needs to be made executable. To do this issue the following command:
chmod 754 parse_count_detail_v01.sh
2. The script has hidden DOS carriage returns (they look like a ^M at the end of each line). First check to see if there are ^M characters with:
cat -tv parse_count_detail_v01.sh
To get rid of these hidden characters issue either of the following commands:
dos2unix parse_count_detail_v01.sh parse_count_detail_v01.sh
or, if dos2unix is not installed, strip the last character of each line with sed and redirect the result to a new file:
sed 's/.$//' parse_count_detail_v01.sh > parse_count_detail_v01.fixed.sh
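Another portable option (a sketch; the temporary file name is illustrative) is to delete the carriage return characters with tr and then replace the original file:
tr -d '\r' < parse_count_detail_v01.sh > parse_count_detail_v01.tmp && mv parse_count_detail_v01.tmp parse_count_detail_v01.sh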
________________________________________________________________________
#!/bin/ksh
# ********************
# Created by: debuzna, GoldenGate Software 2009
# Usage: <this_file> <output from logdump "count detail">
# Description: This script will parse the unstructured "count detail" data
# from logdump and output each table's data as a single row in a comma
# delimited format.
# For more information see the GoldenGate Best Practice document:
# Using the Goldengate Logdump Utility to Manually Load Balance
# ********************
# ********************
# Function to check if file exists and is readable
# ********************
function checkfile {
#Debugging: echo "Input file in checkfile is: $1"
if [ ! -r $1 ]
then
echo "Oops... $1 cannot be opened for reading!"
exit 1
fi
}
# ********************
# Check for input parameter
# ********************
if [ $# -ne 1 ]
then
echo "Input file from logdump "count detail" is required..."
echo "Usage: $0 <logdump_count_detail_file>"
exit 1
fi
# Input file to process
INPUT_FROM_LOGDUMP=$1
# Call function to check if input file is readable
checkfile $INPUT_FROM_LOGDUMP
#Debugging: echo "Input is: $INPUT_FROM_LOGDUMP"
# This is the file to where output will be saved. $$ is current process ID
OUTPUT="count_detail_output_$$.csv"
# Comma delimited header for $csvstmt data
OUTPUT_HEADER="Ownername,Tablename,Bytes,Insert,Delete,Update,PKUpdate,After,Before,%ofTotal,RunTotal"
# First run, echo output header
echo $OUTPUT_HEADER
# Use redirect to overwrite output file if it happens to exist
echo $OUTPUT_HEADER > $OUTPUT
# Time tracking variables for reporting
starttime=`date`
#Debugging: echo "Starting at: $starttime"
# ********************
# Set defaults
# ********************
# firstrun will ensure the header is printed only at the beginning of output (not currently used)
firstrun=1
# Zero out table stats because not all tables take all operations
COUNT_INSERT=0
COUNT_DELETE=0
COUNT_UPDATE=0
COUNT_PKUPDATE=0
COUNT_BYTES=0
COUNT_BEFORE=0
COUNT_AFTER=0
TOTAL_BYTES=0
# ********************
# Parse the logdump count detail data and output to stdout & output file
# ********************
isfound=0
isdone=0
# Parse each line from $INPUT_FROM_LOGDUMP (below)
while read someline; do
# In sh, the ()'s should convert someline into an array but this seems unreliable
# token=(${someline})
# So instead we'll use the ksh set -A trick to break the string into an array
# (the ZERO placeholder fills index 0 so the real tokens start at index 1)
set -A DATA ZERO $someline
# The string "Partition 4" indicates the line starting with owner.table name
if [ "${DATA[2]}" = "Partition" -a "${DATA[3]}" = "4" -a $isfound -eq 0 ] ; then
# echo "Found partition and isfound=0"
# Debugging: echo "************************* found partition"
isfound=1
isdone=0
# Use another ksh-only trick to split first array element on the "."
OWNERNAME=${DATA[1]%.*}
TABLENAME=${DATA[1]#*.}
#Debugging: echo "OWNERNAME=$OWNERNAME and TABLENAME=$TABLENAME"
elif [ $isfound -eq 1 ] ; then
# We have found a table record so parse it
#Debugging: echo "isfound=1"
# Record based on relevant keywords
case ${DATA[1]} in
Insert)
COUNT_INSERT=${DATA[2]}
#Debugging: echo Found insert
;;
Delete)
COUNT_DELETE=${DATA[2]}
;;
FieldComp)
COUNT_UPDATE=${DATA[2]}
;;
GGSPKUpdate)
COUNT_PKUPDATE=${DATA[2]}
;;
Total)
COUNT_BYTES=${DATA[4]}
TOTAL_BYTES=$(($TOTAL_BYTES + COUNT_BYTES))
;;
Before)
COUNT_BEFORE=${DATA[3]}
;;
After)
# Finding "After" means we are at the end of the record. Grab the after count
# create the
COUNT_AFTER=${DATA[3]}
csvstmt="$OWNERNAME,\
$TABLENAME,\
$COUNT_BYTES,\
$COUNT_INSERT,\
$COUNT_DELETE,\
$COUNT_UPDATE,\
$COUNT_PKUPDATE,\
$COUNT_AFTER,\
$COUNT_BEFORE"
# Send table data to output file
echo $csvstmt >> $OUTPUT
# Send table data to standard output
echo $csvstmt
# Reset counter values
isfound=0
isdone=1
COUNT_INSERT=0
COUNT_DELETE=0
COUNT_UPDATE=0
COUNT_PKUPDATE=0
COUNT_BYTES=0
COUNT_BEFORE=0
COUNT_AFTER=0
;;
*)
# Do nothing
#Debugging: echo "Non-canidate row"
;;
esac
fi
if [ $isdone -eq 1 ]; then
isdone=0
fi
# End while loop that flips through each line in input
done < $INPUT_FROM_LOGDUMP
echo "*****************************************"
echo "Output is also in Excel friendly, CSV file: $OUTPUT"
echo "*****************************************"
# endtime=`date`
# echo "Total bytes: $TOTAL_BYTES"
# echo "Start logdump scripts : $starttime"
# echo "End logdump scripts : $endtime"
# echo "*****************************************"
exit 0
________________________________________________________________________
Example output:
Ownername,Tablename,Bytes,Insert,Delete,Update,PKUpdate,After,Before,%ofTotal,RunTotal
XXCCS_GGT_O,GGS_HEARTBEAT,117001,0,0,5,754,759,0
XXCCS_GGT_O,ST_KS_CSI_ITEM_INSTANCES_H,1504432167,1090023,3811,20868,0,1110891,3811
XXCCS_GGT_O,ST_KS_XXCSS_QOT_LINES_ALL,1439493623,882341,0,0,0,882341,0
The last two column headers have no corresponding data as they will be populated using Excel functions once imported into a spreadsheet.
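For reference, the same two columns can also be computed without a spreadsheet. The following is a minimal awk sketch (the input and output file names are illustrative) that appends %ofTotal and a running byte total to each data row; it reads the CSV twice, first to sum the bytes and then to print each row. The following should be typed on one line followed by a return.
awk -F, 'NR==FNR {if (FNR>1) total+=$3; next} FNR==1 {print; next} {run+=$3; printf "%s,%.2f,%d\n", $0, ($3/total)*100, run}' count_detail_output_21768.csv count_detail_output_21768.csv > count_detail_with_totals.csv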