External table to load compressed data residing on remote locations
Posted by fsengonul on April 8, 2011
The PREPROCESSOR option in the external tables has provided the DWH systems a lot of opportunities. Below you may find Allen Brumm’s and Mehant Baid’s notes on how to load remote files via external tables and my minor changes on the script in order to add gunzip support and flexibility for different proxy and target file names.
In our environment , in order to load a single table, 16 sqlldr jobs start from 8 ETL servers. (parallel & direct path) . The input files are on the ETL servers in gzip compressed format. The ETL tool , on the fly, unzips the files and feeds sqlldr. The target tables has always 16 hash partitions. As a result of this design for each table , I have 256 temp segments created on the database side. The difference of the hashing algorithm adds more spice to the situation. And also the network cards on both db nodes and ETL servers are 1 GbE. (do not ask why they’re not 10GbE. They will be.)
The best part of this method is the flexibility. You may choose to run the gunzip command on the target or on the source.In my case running the gunzip on the oracle machine decreases the data flow on the network and also decrease the cpu usage on the etl machine. I’m still searching for an empty period on both sides in order to collect cpu and network usage statisticks on both cases.
so the steps:
The first thing : passwordless ssh connectivity has to be established between the remote locations and database servers.
Then I’ve changed the script to add support for different proxy and file names. On my case the input file names are all the same but the locations and directory names are different.
The create script for external table:
CREATE TABLE INVOICE.EXT_ABC( INVxxx NUMBER(16) , ...))
ORGANIZATION EXTERNAL(TYPE oracle_loader DEFAULT DIRECTORY dp_data_dir
ACCESS PARAMETERS( RECORDS DELIMITED BY NEWLINE
PREPROCESSOR dp_exec_dir:'ora_xtra_pp.sh'
FIELDS TERMINATED BY ':'
MISSING FIELD VALUES ARE NULL
REJECT ROWS WITH ALL NULL FIELDS ( FIELDS.... ))
LOCATION ('file01.gz','file02.gz','file03.gz','file04.gz','file05.gz','file06.gz','file07.gz','file08.gz', 'file09.gz','file10.gz','file11.gz','file12.gz','file13.gz','file14.gz','file15.gz','file16.gz' ))
parallel 16;
Mehant Baid’s pdf : ExternalTablesRemoteDataAccess
The mapping file: ora_xtra_map.txt
a.gz:etl_user:10.XXX.XXX.11:/mfs_16way_000/Applications/Datamart/INV_DM/main/a.gz:cat:YES a:etl_user:10.XXX.XXX.11:/mfs_16way_000/Applications/Datamart/INV_DM/main/a:cat:NO file01.gz:etl_user:10.XXX.XXX.11:/DIR1/mfs_16way_000/Applications/Datamart/INV_DM/main/input.gz:cat:YES file02.gz:etl_user:10.XXX.XXX.12:/DIR2/mfs_16way_001/Applications/Datamart/INV_DM/main/input.gz:cat:YES file03.gz:etl_user:10.XXX.XXX.13:/DIR3/mfs_16way_002/Applications/Datamart/INV_DM/main/input.gz:cat:YES file04.gz:etl_user:10.XXX.XXX.14:/DIR4/mfs_16way_003/Applications/Datamart/INV_DM/main/input.gz:cat:YES file05.gz:etl_user:10.XXX.XXX.15:/DIR5/mfs_16way_004/Applications/Datamart/INV_DM/main/input.gz:cat:YES file06.gz:etl_user:10.XXX.XXX.16:/DIR6/mfs_16way_005/Applications/Datamart/INV_DM/main/input.gz:cat:YES file07.gz:etl_user:10.XXX.XXX.17:/DIR7/mfs_16way_006/Applications/Datamart/INV_DM/main/input.gz:cat:YES file08.gz:etl_user:10.XXX.XXX.18:/DIR8/mfs_16way_007/Applications/Datamart/INV_DM/main/input.gz:cat:YES file09.gz:etl_user:10.XXX.XXX.11:/DIR1/mfs_16way_008/Applications/Datamart/INV_DM/main/input.gz:cat:YES file10.gz:etl_user:10.XXX.XXX.12:/DIR2/mfs_16way_009/Applications/Datamart/INV_DM/main/input.gz:cat:YES file11.gz:etl_user:10.XXX.XXX.13:/DIR3/mfs_16way_010/Applications/Datamart/INV_DM/main/input.gz:cat:YES file12.gz:etl_user:10.XXX.XXX.14:/DIR4/mfs_16way_011/Applications/Datamart/INV_DM/main/input.gz:cat:YES file13.gz:etl_user:10.XXX.XXX.15:/DIR5/mfs_16way_012/Applications/Datamart/INV_DM/main/input.gz:cat:YES file14.gz:etl_user:10.XXX.XXX.16:/DIR6/mfs_16way_013/Applications/Datamart/INV_DM/main/input.gz:cat:YES file15.gz:etl_user:10.XXX.XXX.17:/DIR7/mfs_16way_014/Applications/Datamart/INV_DM/main/input.gz:cat:YES file16.gz:etl_user:10.XXX.XXX.18:/DIR8/mfs_16way_015/Applications/Datamart/INV_DM/main/input.gz:cat:YES
The modified ora_xtra_pp.sh
#! /bin/sh
PATH=/bin:/usr/bin
export PATH
#set -x
# ora_xtra_pp: ORAcle eXternal Table Remote file Access Pre- Processor
# fs version
#Format for the Map file
#Consists of five fields seperated using a :
#Syntax
#file:rusr:rhost:rdir_file_name:rcmd1:comp
#comp is used for gzipped input files.YES/NO
#
#Example
#foo.dat:abrumm:abrumm-dev:/home/abrumm/xt_Data/foo.dat:cat:NO
#f1.dat:fsengonul:fsengonul-dev:/home/fsengonul/xt_Data/foo.dat:cat:NO
#f1.gz:fsengonul:fsengonul-dev:/home/fsengonul/xt_Data/foo.gz:cat:YES
#this gives the freedom to use same named files on different locations
#
#get filename component of LOCATION, the access driver
#provides the LOCATION as the first argument to the preprocessor
proxy_file_name=`basename $1`
data_dir_name=`dirname $1`
#Flag is set if the file name in the Map file matches the proxy file name
#where our data file is stored
flag_dirf=0
#loops through the map file and fetches details for ssh
#username,hostname and remote directory
file_not_found_err='ora_xtra_pp:
Map file missing.Needed ora_xtra_map.txt in data directory'
if [ -e $data_dir_name/ora_xtra_map.txt ]
then
while read line
do
map_file_name=`echo $line | cut -d: -f1`
if [ $map_file_name = $proxy_file_name ]
then
rdir_file_name=`echo $line | cut -d: -f4`
rusr=`echo $line | cut -d: -f2`
rhost=`echo $line | cut -d: -f3`
rcmd1=`echo $line | cut -d: -f5`
comp=`echo $line | cut -d: -f6`
flag_dirf=1
break
fi
done < $data_dir_name/ora_xtra_map.txt
else
echo $file_not_found_err 1>&2
echo $data_dir_name 1>&2
exit 1
fi
if [ $flag_dirf = 1 ]
then
if [ $comp = 'NO' ]
then
ssh -q $rusr@$rhost $rcmd1 $rdir_file_name
fi
if [ $comp = 'YES' ]
then
ssh -q $rusr@$rhost $rcmd1 $rdir_file_name | gunzip -c
fi
fi

yasinsaygili said
congrats ferhat brother,your blog on oracle blog
Henryk Gerlach said
Another solution to Mehant Baid’s problem of not being able to install NFS on the remote host may be to mount a remote directory locally using SSHFS.