Monday, April 14, 2008

Informatica ETL PowerCenter 8

Extract, Transform and Load
The Informatica ETL product is called PowerCenter. It was one of the earliest ETL tools on the market and uses a user-friendly GUI designer interface to build jobs that run on one or more transformation servers. PowerCenter 8 adds pushdown optimization, which executes transformation steps on a database server. This gives it an ETLT capability: extract, transform, load and transform.
PowerCenter 8 comes with options for expanding the capabilities of the ETL tool. These are a selection that affect how ETL jobs execute:
Partitioning Option: for partitioning data to process it in parallel across multiple CPUs or servers.
Real Time Option: for turning PowerCenter jobs into real time services.
Enterprise Grid Option: for automating the deployment and load balancing of PowerCenter jobs on a grid.
Metadata Management
Informatica has some technical metadata functions and reports in the PowerCenter product to support ETL development. In addition Informatica has the Metadata Manager and the Metadata Exchange Option.
In 2003 Informatica launched the SuperGlue metadata management solution for additional metadata capabilities. It is a web-based solution for assembling, visualizing and analyzing metadata from various enterprise systems. In 2005 Informatica launched PowerCenter Advanced Edition, which came bundled with SuperGlue. With the release of PowerCenter 8 Advanced Edition the SuperGlue name is gone; the product is now known as Metadata Manager and is only available in this product bundle.
Informatica offers a PowerCenter option called Metadata Exchange that uses a wizard-driven interface to import source definitions into PowerCenter. It provides more descriptive metadata and some metadata synchronization functions for when source metadata definitions change. It offers a combination of bi-directional and uni-directional exchange of metadata with products such as ERwin, Oracle Designer, Cognos Impromptu and Business Objects Designer.
Data Quality
In January 2006 Informatica acquired Similarity Systems to considerably enhance its data quality and data profiling functionality. The Similarity ATHANOR product has been renamed Informatica Data Quality. It offers data cleansing, data matching, and data quality reporting and monitoring. Informatica Data Quality processes can be added to a PowerCenter job within the PowerCenter Mapping Designer by selecting a data quality building block.
Informatica relies on additional third party vendors to deliver address validation capabilities to Data Quality. These products are licensed separately through Informatica.
There is also a PowerCenter Data Cleanse and Match option that was built by Informatica prior to the Similarity acquisition. It offers data cleansing, parsing, and data matching, and is fully integrated with PowerCenter to seamlessly use the Data Partitioning and Enterprise Grid capabilities for scalable processing.
Data Profiling
In 2005 Similarity Systems acquired Evoke Software to add the Evoke Axio profiling tool to the data quality suite and gain access to a wider customer base in the US and Asia. Similarity was in turn acquired by Informatica in January 2006. Informatica renamed Evoke Axio to Informatica Data Explorer and offers it as a stand-alone product or as a companion to PowerCenter.
Data Explorer provides data profiling functions to discover and define the content, structure and quality of data. It provides data mapping capabilities for source to target mapping that can be accessed from within PowerCenter.
PowerCenter also has a Mapping Generation Option that uses the free Informatica Data Stencil design tool for Microsoft Visio to generate PowerCenter mappings from Visio designs, or to reverse-engineer a PowerCenter mapping into a Visio mapping template.

Sunday, April 13, 2008

Unix Scripts and Commonly Used Commands

FTP Shell script

A Bourne shell script that connects to a given FTP server, executes a few commands (e.g. download and delete a file), and disconnects. This script also assumes that you have the necessary permissions.

#!/bin/sh
#These can be easily converted to be read from input
#or as command line params
FTP_SRVR=myftp #or ip
FTP_USR=my_login
FTP_PWD=my_passwd #not secure (but use a tied down login)

#change this to generate whatever filename you actually need
#this is just something out of the blue
TODAYS_FILE=`date | cut -c12-19`

ftp -n <<EOF
open $FTP_SRVR
user $FTP_USR $FTP_PWD
binary
get $TODAYS_FILE
delete $TODAYS_FILE
bye
EOF
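The date | cut slice above depends on your locale's default date format. A sturdier variant builds the name from an explicit format string; the data_ prefix and .csv suffix here are made-up examples, not anything the original script requires:

```shell
#!/bin/sh
# Generate a dated filename with an explicit format string instead of
# slicing date's default (locale-dependent) output.
# The "data_" prefix and ".csv" suffix are illustrative assumptions.
TODAYS_FILE="data_`date +%Y%m%d`.csv"
echo "$TODAYS_FILE"
```

On April 14, 2008 this would produce data_20080414.csv, which sorts correctly and never shifts with locale settings.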

Tips on Handling Log File

1. How to get alerts by reading log files
Many people write shell scripts to read and parse server log files, such as Oracle alert.log, Apache error_log..., and send email when certain strings in the log are found. The scripts are usually run as a cron job. That may incur a lot of disk I/O, especially when the file is already large, and depending on how often the cron job runs, it may not send notification as quickly as you would like.
These shortcomings can be easily overcome. Try tail -f on the log file and pipe the result to an action command. tail -f reads the tail of the file once a second and sleeps in between. That's close to an "immediate" response and causes much less I/O and CPU usage. Instead of a cron job, all you need to do is type a command like this (replace mailx with mail as needed):
#Serious Oracle database error

nohup tail -f alert_ORCL.log | nohup perl -nle 'system("mailx -s \"ORCL error\" youremail < /dev/null") if /^ORA-00600/' &

But it may make more sense to put the command sequence in a script and worry about nohup at the time you launch it. So create a file myscript which has

#Apache sends status 200 to user but not followed by data (i.e. "-" at line end)
tail -f access_log | perl -nle 'system("mailx -s ChkApache you\@somecompany.com < /dev/null") if / 200 -$/'

and type the command nohup myscript &. (This is better than putting nohup tail... | nohup perl... inside myscript and typing myscript &.)
You can also let it email you the matching line in the log instead of a blank message:
tail -f access_log | perl -nle 'system("echo \"$_\" | mailx -s ChkApache you\@somecompany.com him\@somecompany.com") if / 200 -$/'

A more sophisticated example that skips some errors you don't care about:
#!/bin/ksh
#Usage: $0 OracleSID
export SID=$1
tail -f alert_$SID.log | perl -nle '$SID=$ENV{SID}; system("echo \"$_\" | mailx -s \"$SID Database Error\" you\@somecompany.com") if (/^ORA-/ and !/^ORA-07445/)'

Instead of a Perl one-liner, your myscript can call a full-fledged script whattodo.pl with fairly complex programming logic in it:
tail -f /var/adm/messages | perl -nl whattodo.pl
If you want to avoid creating nohup.out and getting the "send output to nohup.out" message, run the command in csh, redirect stdout to /dev/null, or run it with the command echo 'nohup myscript &' | csh -s. If you're more comfortable with awk, replace perl with nawk or gawk (but not awk, which has a broken system command):
tail -f alert_ORCL.log | nawk '/^ORA-00600/ {system("mailx -s \"ORCL error\" youremail < /dev/null")}'
A cron job is preserved across reboot. But your tail... perl... command is not. So you may have to start it from one of your rc scripts.
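An rc wrapper for restarting the watcher might look like the sketch below. The /usr/local/bin/myscript path and the use of pkill to stop it are assumptions for illustration; adjust both to your system:

```shell
#!/bin/sh
# Illustrative rc-style wrapper so the tail|perl watcher survives a reboot.
# DAEMON is an assumed path to the myscript file described above.
DAEMON=/usr/local/bin/myscript

case "$1" in
  start) nohup "$DAEMON" >/dev/null 2>&1 & ;;
  stop)  pkill -f "$DAEMON" ;;
  *)     echo "Usage: $0 start|stop" ;;
esac
```

Link or copy it into your rc directory so it runs with the start argument at boot.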
The disadvantage of using this method to send alerts is that it processes one line at a time; if you have 100 lines that match the pattern, you'll get 100 emails. While the traditional scan-the-whole-logfile approach easily groups the lines into one alert, here you have to do a little programming to achieve the same goal, probably by writing matches to a temporary file and waiting a few seconds, or by setting a variable right in your code and waiting some time before sending email.
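One way to do that batching is to buffer matching lines and send them as a single message. A minimal sketch of the idea, where printf stands in for the real mailx call (an illustrative substitution, not the author's code):

```shell
#!/bin/sh
# Buffer ORA- lines and emit them as one message instead of one per line.
batch_alerts() {
    buf=""
    while IFS= read -r line; do
        case "$line" in
            ORA-*) buf="${buf}${line}
" ;;
        esac
    done
    if [ -n "$buf" ]; then
        # real use: printf '%s' "$buf" | mailx -s "DB Errors" you@somecompany.com
        printf '%s' "$buf"
    fi
}

# demo: two matching lines become a single message
printf 'ORA-00600 internal error\nok line\nORA-01555 snapshot too old\n' | batch_alerts
```

Note this flushes on end-of-input; with tail -f there is no EOF, so in practice you would flush on a timer instead (e.g. a read -t timeout in ksh93 or bash).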
By the way, on Solaris, you can check whether the job is indeed ignoring SIGHUP with command
/usr/proc/bin/psig pid | head -2, where pid is the process ID of the submitted background job.

Microsoft's Windows Services for UNIX: Heavy-duty. Comprehensive emulation of a UNIX environment including the /proc filesystem and the truss utility. Runs entirely in the Windows POSIX subsystem. Does not install on XP Home Edition.
Red Hat's Cygwin: Very popular.
MKS Software's MKS Toolkit
Hamilton Laboratories' Hamilton C Shell
2. How to read a large file
If you have a several-hundred-MB log file, how do you quickly find the portion of it you're interested in? Your vi may not be able to open the file, complaining that it is too large, or may take too long and use too much memory. Here's what I usually do.
Let's say you want Mon Jul 15's data from July's access_log,
here's the command:

sed -n '/^Mon Jul 15/p; /^Tue Jul 16/q' access_log > /tmp/qq

Without the quit command, sed would scan the file to the end or till you press ^C.
Note the undocumented semicolon that allows you to put two sed commands on one line. (By the way, on Solaris creating a file under /tmp may be faster because tmpfs should be memory-based, unless you're short on RAM. But remember to delete big files under /tmp when you're done because they reduce available swap space.)

If you know the data you want is not too far from the end:

tail -1000 access_log | head -1

If you still see lines too late, increase 1000 to 5000, 10000 and so on. But it should not be too close to the head of the file. You should always have an idea of when the file starts and when it stops. If you know it only contains records for July, then it's of course July 1 to July 31. But even then you have to take into account very uneven data distribution.
If you don't know the start and end, do run head -1 access_log and tail -1 access_log first.

Unless you have to, do not start with head as in head -1000 access_log | tail. That would actually push 1000 lines into the pipe when the reader only cares about the last 10. Count the total lines (wc -l access_log) only if you have to, and generally do not split the file into chunks (split -b 10000000 access_log).

Needless to say, pattern matching should use the most restrictive regular expression when dealing with large files. This means that if you search Apache's access_log for all entries from the client 172.168.123.45 and you know the IP is at the beginning of the line,
you should use /^172\.168\.123\.45/ instead of /172.168.123.45/.
egrep may be faster than grep, which is in turn faster than sed. Therefore either of these two commands is OK:
egrep '^172\.168\.123\.45' access_log
sed -n '/^172\.168\.123\.45/p' access_log