Synchronize Tables from AS400 to SQL Server on Schedule

Posted on 2011-05-04
Last Modified: 2012-05-11
Hi Experts,

The title basically says it all.  We have an AS/400 located at a remote site, so I would like to sync a few of the tables to our SQL Server to improve the speed of queries.  I am really not sure where to start on this one, but any help would be greatly appreciated.

The tables can be huge (hundreds of thousands of rows), so it would be best to only update the rows that are new since the previous sync.

Question by:FamousMortimer
    LVL 3

    Assisted Solution

    There are a bunch of ways to do this.
    We currently use 2 methods.

    1. Write an SSIS package or program to pull the data.  This works fine if you're pulling the whole file, or if there is some sort of timestamp in the file for "new" or "changed" records.  I do this with some files that do not need to be moved in real time.

    For files that need to be moved in "real time":

    2. 3rd-party software.  One we use is called "Integrator"; it basically creates triggers on the AS/400 files and can do different things on Add/Change/Delete.

    The Integrator software is at its end of life.  (Of course it is, because it's one of those products that you set up and then forget about because it just works.)  It was purchased by Vision Solutions, and they have their own.
    It is being replaced by software called "Double Take".  Converting our process over to the new software is on my list of things to do, but I have not done it yet, so I cannot automatically recommend it.  (Integrator I would.)

    Hope this helps.
    LVL 3

    Expert Comment

    Oh, the SSIS option is for MS SQL Server 2005 and above.  If you're using SQL 2000, the equivalent tool is called DTS.
    LVL 34

    Expert Comment

    by:Gary Patterson
    Lots of solutions, of varying complexity, cost, and required technical expertise.

    You can buy packaged software that handles these types of ETL chores for you.
    You can build your own Windows-based "pull" solution using BIDS, SSIS, etc.
    You can build your own AS/400-based "push" solution using RPG, COBOL, SQL, Java, triggers, and/or journaling.
    You can build a Java solution using JDBC.

    The AS/400's DB2 database supports access, import, and export via a lot of interfaces:

    You can export data to flat files or delimited files (like CSVs) using the CPYTOIMPF command, and then download the exported stream file using FTP or other tools (a short example follows this list).
    You can access the database remotely using ODBC, JDBC, OLE DB, ADO.NET, DRDA, or DDM.  IBM provides drivers for all of these interfaces.  All of these interface choices mean that you can write programs in your Windows language of choice, or use just about any commercial or open-source ETL tool you choose.
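
    As a rough illustration of the export route, here is a minimal Java sketch using the Toolbox's CommandCall to run CPYTOIMPF and IFSFileInputStream to download the stream file (no separate FTP step needed).  The host, credentials, library, file, and paths are all hypothetical, and exact CPYTOIMPF parameters vary by OS release:

    import java.io.FileOutputStream;
    import com.ibm.as400.access.AS400;
    import com.ibm.as400.access.AS400Message;
    import com.ibm.as400.access.CommandCall;
    import com.ibm.as400.access.IFSFileInputStream;

    public class ExportAndDownload {
        public static void main(String[] args) throws Exception {
            AS400 sys = new AS400("AS400HOST", "USER", "PASSWORD");  // hypothetical

            // Export the table to a delimited stream file in the IFS.
            CommandCall cc = new CommandCall(sys);
            String cmd = "CPYTOIMPF FROMFILE(MYLIB/ORDERS) "
                       + "TOSTMF('/home/export/orders.csv') "
                       + "MBROPT(*REPLACE) RCDDLM(*CRLF) FLDDLM(',')";
            if (!cc.run(cmd)) {
                for (AS400Message m : cc.getMessageList())
                    System.err.println(m.getID() + ": " + m.getText());
                return;
            }

            // Pull the stream file down over the same Toolbox connection.
            IFSFileInputStream in = new IFSFileInputStream(sys, "/home/export/orders.csv");
            FileOutputStream out = new FileOutputStream("C:\\staging\\orders.csv");
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) out.write(buf, 0, n);
            out.close();
            in.close();
            sys.disconnectAllServices();
        }
    }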

    I can help you narrow it down, but you'll need to give me more to go on.  

    I work for a consulting company, and I've done this type of project dozens of times now across numerous platforms, and for various needs.  

    Things you'll want to consider:

    Build or buy?  Lots of commercial products in this space, but they can be expensive.
    Timeline to implement?
    Volume of data for initial load, and how much is the daily delta?
    If you plan to build, what technologies are the best fit for your shop?  Do you understand the complexities of building applications that have to maintain synchronized data sets?  What happens when the sets get out of sync?  How will you know?  How will you correct it?
    How will you deal with data type differences between the two databases?
    How do you plan to identify all of the changed rows?  Adds are usually easy, assuming there is a timestamp or sequential key, but updates and deletes can be trickier, if that is an issue for you.  Comparing two tables on two different systems is often very tricky, especially if you haven't done this sort of thing before.
    How critical is the data?  Is there any tolerance for error between the two systems?
    Which system is considered the "master"?
    Large-volume batch data transfers can be taxing on both the source and target systems.  Do you have a low-usage window adequate for the daily data transfer?
    How are the two systems physically connected?  You said "remote".  Do you have adequate bandwidth to transfer the volume of data in the window you have available?
    Would a near-real-time or real-time data replication solution be more appropriate than a batch-oriented ETL process?

    Anyway, post back if you want to discuss further.


    - Gary Patterson
    LVL 10

    Author Comment

    Wow, lots said, and lots to answer.

    I would prefer to build a solution if possible.  The data is real-time (it is production data from our manufacturing plant, i.e. what the machines are doing).  The reason to sync the data is mainly that our manufacturing monitors query large amounts of data every minute or two, and the queries can be lengthy because of latency.  If the data ever goes out of sync, it would not be a huge ordeal, although the more accurate, the better.  For most applications we are concerned with the most recent data (from 'today').

    We would have to use a Windows-based 'pull' solution, although if a 'push' solution is recommended, I may be able to have our 400 admin develop something for me.  We are using SQL 2008 Standard.  There is no specific timeline to implement.  The data type differences are handled through the manufacturing monitor program.  The AS/400 is the master system.  Securing a time window and bandwidth is not a problem.  The closer to real time, the better.

    I think I answered most of the questions.  With that info, do either of you have any specific recommendations or examples?
    LVL 34

    Accepted Solution

    Well, it is hard to be specific without doing some data analysis.  


    "Pull" works best when you very clearly know exactly what you want, and when your underlying AS/400 database is designed with ETL in mind.  Here are some guidelines for data that works well in a "pull" environment.

    Host data adds and updates are easy to identify, either by a reliable "created"/"updated" timestamp in the record, or a sequentially numbered entry in a transaction file, etc.
    "Pull" doesn't handle deleted records well, unless records are "soft-deleted" first (a status field in the record is updated to indicate the record is to be deleted, and the physical delete handled later).  Again, if these are transaction-log type files, probably not an issue.
    No exclusive table or record locking performed by source database applications (AS/400 applications in this case).

    I'm not going to spend a lot of time on "pull".  With "pull", you write programs that query the remote database and add/update/delete as appropriate (a minimal sketch follows).  With large data sets, deletes in particular can be tough to handle efficiently.
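
    To make that concrete, here is a minimal Java/JDBC sketch of a timestamp-driven pull, assuming the source file carries a reliable "last updated" timestamp as described in the guidelines above.  The hosts, tables, and columns (ORDKEY, STATUS, QTY, UPDTS) are hypothetical:

    import java.sql.*;

    public class IncrementalPull {
        public static void main(String[] args) throws Exception {
            Class.forName("com.ibm.as400.access.AS400JDBCDriver");
            Connection src = DriverManager.getConnection(
                    "jdbc:as400://AS400HOST", "USER", "PASSWORD");
            Connection tgt = DriverManager.getConnection(
                    "jdbc:sqlserver://SQLHOST;databaseName=Prod", "user", "pw");

            // High-water mark: the newest row already on the SQL Server side.
            Statement hw = tgt.createStatement();
            ResultSet r = hw.executeQuery("SELECT MAX(UPDTS) FROM dbo.ORDERS");
            r.next();
            Timestamp last = r.getTimestamp(1);
            if (last == null) last = new Timestamp(0L);  // empty target: pull everything

            PreparedStatement pull = src.prepareStatement(
                    "SELECT ORDKEY, STATUS, QTY, UPDTS FROM MYLIB.ORDERS WHERE UPDTS > ?");
            pull.setTimestamp(1, last);

            // MERGE (SQL Server 2008+) gives upsert semantics in one statement.
            PreparedStatement apply = tgt.prepareStatement(
                    "MERGE dbo.ORDERS AS t USING (SELECT ? AS ORDKEY) AS s ON t.ORDKEY = s.ORDKEY "
                  + "WHEN MATCHED THEN UPDATE SET STATUS = ?, QTY = ?, UPDTS = ? "
                  + "WHEN NOT MATCHED THEN INSERT (ORDKEY, STATUS, QTY, UPDTS) VALUES (?, ?, ?, ?);");

            ResultSet rows = pull.executeQuery();
            while (rows.next()) {
                // A soft-delete status flag could be checked here to issue a DELETE instead.
                apply.setString(1, rows.getString("ORDKEY"));
                apply.setString(2, rows.getString("STATUS"));
                apply.setInt(3, rows.getInt("QTY"));
                apply.setTimestamp(4, rows.getTimestamp("UPDTS"));
                apply.setString(5, rows.getString("ORDKEY"));
                apply.setString(6, rows.getString("STATUS"));
                apply.setInt(7, rows.getInt("QTY"));
                apply.setTimestamp(8, rows.getTimestamp("UPDTS"));
                apply.executeUpdate();
            }
            src.close();
            tgt.close();
        }
    }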

    I've torn out a LOT of poor-performing pull processes and replaced them with much more efficient push processes.

    Pull is often performed using SSIS/BIDS in a SQL Server environment, since those are the native Microsoft tools.  MSDN has comprehensive documentation on them.


    Push works in just about any environment, especially if it is AS/400 trigger- or journal-initiated.  Push can also be handled in near real time, which has lots of benefits.

    Push is initiated from the AS/400 side, of course.  Both approaches are similar.  If you are already journaling the tables in question, however, I'd go that route.  I prefer journal-based replication anyway, unless you are severely disk-constrained on the system.  Push is nice because you can also choose to push changed data in real time, or in small batches, throughout the day.  This means that your target stays up to date, and you can avoid big, time- and bandwidth-consuming batch transfers.  Journal-driven push works like this (this is the same basic approach used in AS/400 data replication software):

    1) Tables that you wish to replicate are journaled (create journal receiver, create journal, start journaling).  Multiple tables can be journaled to one journal.  A setup sketch follows.
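
    The setup commands can even be driven from the Windows side with the Toolbox's CommandCall class; here's a rough sketch (library, journal, and file names are hypothetical, and your admin may prefer to run the CL commands directly):

    import com.ibm.as400.access.AS400;
    import com.ibm.as400.access.AS400Message;
    import com.ibm.as400.access.CommandCall;

    public class StartJournaling {
        public static void main(String[] args) throws Exception {
            AS400 sys = new AS400("AS400HOST", "USER", "PASSWORD");  // hypothetical
            CommandCall cc = new CommandCall(sys);
            String[] cmds = {
                "CRTJRNRCV JRNRCV(MYLIB/REPLRCV01)",                            // receiver first
                "CRTJRN JRN(MYLIB/REPLJRN) JRNRCV(MYLIB/REPLRCV01)",            // then the journal
                "STRJRNPF FILE(MYLIB/ORDERS) JRN(MYLIB/REPLJRN) IMAGES(*BOTH)"  // start journaling
            };
            for (String c : cmds) {
                if (!cc.run(c)) {
                    for (AS400Message m : cc.getMessageList())
                        System.err.println(m.getID() + ": " + m.getText());
                }
            }
            sys.disconnectAllServices();
        }
    }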

    2) To do real-time (or near-real-time) replication, you develop a program that monitors the journal for new entries, extracts each entry, and sends it to a partner program on the PC that applies the add/update/delete to the remote database and then acknowledges the successful apply operation.  IBM supplies journal APIs to access the journal entries, as well as the CL command RCVJRNE (Receive Journal Entry).  The hardest part is the inter-system communication between the AS/400 and Windows.  I've used a lot of different techniques over the years.  My favorite (because it is fast and easy) is to use native AS/400 data queues.  You can use any transfer technique you like; IBM MQ Series, Java Message Service (JMS), and TCP sockets are all viable alternatives.  Regardless, the process is similar.  I'll use native AS/400 data queues for my example:

    a) Create a data queue long enough to hold your longest record. (CRTDTAQ)

    b) Create a program that uses the RCVJRNE command in a loop to process each new journal entry as it comes in.  When an entry comes in that needs to be replicated to the SQL Server DB, place the entry on the data queue using the QSNDDTAQ API.  You can do this in CL, RPG, Java, etc.

    c) Create a program that runs on your Windows server that receives the data queue entries and applies the appropriate change to your SQL Server DB (a sketch follows these steps).  IBM's Java Toolbox for IBM i (or the parallel open-source JTOpen project) includes classes that allow you to easily connect to the AS/400 and receive data queue entries.  Of course, your Java program can also use JDBC to access your SQL Server database.  We're not limited to Java here.  If you're a .NET programmer, IBM offers programming tools as part of IBM i Access (us old-timers call it Client Access) that include AS/400 data queue APIs and connection tools.
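
    Here's a minimal sketch of that Windows-side consumer, assuming the host program in step (b) sends a simple pipe-delimited payload like "U|ORDKEY|STATUS|QTY" (the queue path, payload layout, and table names are all hypothetical):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import com.ibm.as400.access.AS400;
    import com.ibm.as400.access.DataQueue;
    import com.ibm.as400.access.DataQueueEntry;

    public class QueueApply {
        public static void main(String[] args) throws Exception {
            AS400 sys = new AS400("AS400HOST", "USER", "PASSWORD");  // hypothetical
            DataQueue dq = new DataQueue(sys, "/QSYS.LIB/MYLIB.LIB/REPLQ.DTAQ");
            Connection tgt = DriverManager.getConnection(
                    "jdbc:sqlserver://SQLHOST;databaseName=Prod", "user", "pw");

            PreparedStatement del = tgt.prepareStatement(
                    "DELETE FROM dbo.ORDERS WHERE ORDKEY = ?");
            PreparedStatement upsert = tgt.prepareStatement(
                    "MERGE dbo.ORDERS AS t USING (SELECT ? AS ORDKEY) AS s ON t.ORDKEY = s.ORDKEY "
                  + "WHEN MATCHED THEN UPDATE SET STATUS = ?, QTY = ? "
                  + "WHEN NOT MATCHED THEN INSERT (ORDKEY, STATUS, QTY) VALUES (?, ?, ?);");

            while (true) {
                DataQueueEntry entry = dq.read(-1);  // -1 = block until an entry arrives
                String[] f = entry.getString().trim().split("\\|");
                if ("D".equals(f[0])) {              // delete
                    del.setString(1, f[1]);
                    del.executeUpdate();
                } else {                             // "I" or "U": upsert
                    int qty = Integer.parseInt(f[3]);
                    upsert.setString(1, f[1]);
                    upsert.setString(2, f[2]);
                    upsert.setInt(3, qty);
                    upsert.setString(4, f[1]);
                    upsert.setString(5, f[2]);
                    upsert.setInt(6, qty);
                    upsert.executeUpdate();
                }
            }
        }
    }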

    Trigger-driven push works in a similar fashion, except instead of relying on journaling and the RCVJRNE command (or related APIs), you create DB2 triggers and code trigger programs that push the added/updated/deleted records onto the queue.

    I know it sounds a bit complicated, but I've turned around simple table replication projects in just a day or two in the past using similar techniques.  (The first one took a lot longer, but I didn't have any guidance on "how" the first time I tried it.  We also have a lot more tools at our disposal now.)

    Unfortunately, with either approach, there are a number of "gotchas" that can crop up, dependent on too many variables to discuss them all in a forum like this:

    Initially populating large replica tables where "over the line" transmission is impractical.
    Validating data synchronization between dissimilar systems
    Repairing out-of-sync data sets
    Optimizing "pull" query performance and bandwidth utilization
    Optimizing "push" performance and bandwidth utilization
    Journal receiver management
    ETL process interaction/interference with principal AS/400 applications
    Managing database schema changes
    Data type issues (date handling is often an issue, especially with legacy AS/400 databases that use numeric fields to store dates; a small conversion sketch follows this list).
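
    As one example of the legacy-date gotcha in that last item, here's a small helper for the common 7-digit CYYMMDD convention (C is a century digit: 0 = 19xx, 1 = 20xx).  Formats vary by shop, so treat this as a sketch:

    import java.sql.Date;

    public class LegacyDates {
        // Convert a CYYMMDD numeric date, e.g. 1110504 -> 2011-05-04.
        public static Date fromCyymmdd(int cyymmdd) {
            if (cyymmdd == 0) return null;   // zero commonly means "no date"
            int c  = cyymmdd / 1000000;      // century digit
            int yy = (cyymmdd / 10000) % 100;
            int mm = (cyymmdd / 100) % 100;
            int dd = cyymmdd % 100;
            int year = 1900 + 100 * c + yy;
            return Date.valueOf(String.format("%04d-%02d-%02d", year, mm, dd));
        }

        public static void main(String[] args) {
            System.out.println(fromCyymmdd(1110504));  // prints 2011-05-04
        }
    }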

    I (and other experts) can point you to lots of resources out there, but just be aware that first-timers sometimes struggle with these applications, especially if they don't have strong expertise on both platforms.  It can be very difficult to diagnose problems.

    That said, once you've selected a method and toolset, post back and I'll point you to some code samples (though you can Google most of the items I've mentioned and find many of them yourself, and I encourage you to do that before posting back, so we can focus on the hard-to-find items).

    Finally, remember that there are numerous third-party tools on the market that can do some or all of this process for you.  Many shops attempt to "grow their own" solutions for replication or ETL and eventually scrap them in favor of a robust commercial solution after years of struggling with reliability and synchronization issues.

    - Gary Patterson
    LVL 34

    Expert Comment

    by:Gary Patterson

    Note, too, that you can use a hybrid approach.  For example, you can use a journal- or trigger-based process on the AS/400 to build a table (or tables) that contains only the keys of the records that need to be updated, or copies of the records themselves.

    Then, when your target-system "pull" process runs, it doesn't have to touch the underlying "live" database, or at least it doesn't have to run massive "search for added records, search for changed records, search for deleted records" query batches.  Instead, it can just use the "staging" table to determine what records need to be added/updated/deleted (a sketch follows).  This helps eliminate the "deleted records problem" inherent in the "pull" architecture.
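
    A rough sketch of that staging-table drain, assuming the AS/400 side (trigger or journal monitor) populates a hypothetical MYLIB.ORDERS_STG file with the key, a change type, and a change timestamp:

    import java.sql.*;

    public class StagingPull {
        public static void main(String[] args) throws Exception {
            Class.forName("com.ibm.as400.access.AS400JDBCDriver");
            Connection src = DriverManager.getConnection(
                    "jdbc:as400://AS400HOST", "USER", "PASSWORD");
            Connection tgt = DriverManager.getConnection(
                    "jdbc:sqlserver://SQLHOST;databaseName=Prod", "user", "pw");

            // Capture a cutoff so rows staged while we run are left for the next pass.
            Statement s = src.createStatement();
            ResultSet cut = s.executeQuery("SELECT CURRENT TIMESTAMP FROM SYSIBM.SYSDUMMY1");
            cut.next();
            Timestamp cutoff = cut.getTimestamp(1);

            PreparedStatement staged = src.prepareStatement(
                    "SELECT ORDKEY, CHGTYPE FROM MYLIB.ORDERS_STG WHERE CHGTS <= ? ORDER BY CHGTS");
            staged.setTimestamp(1, cutoff);
            PreparedStatement fetch = src.prepareStatement(
                    "SELECT STATUS, QTY FROM MYLIB.ORDERS WHERE ORDKEY = ?");
            PreparedStatement del = tgt.prepareStatement(
                    "DELETE FROM dbo.ORDERS WHERE ORDKEY = ?");
            PreparedStatement upsert = tgt.prepareStatement(
                    "MERGE dbo.ORDERS AS t USING (SELECT ? AS ORDKEY) AS s ON t.ORDKEY = s.ORDKEY "
                  + "WHEN MATCHED THEN UPDATE SET STATUS = ?, QTY = ? "
                  + "WHEN NOT MATCHED THEN INSERT (ORDKEY, STATUS, QTY) VALUES (?, ?, ?);");

            ResultSet rs = staged.executeQuery();
            while (rs.next()) {
                String key = rs.getString("ORDKEY");
                if ("D".equals(rs.getString("CHGTYPE"))) {
                    del.setString(1, key);
                    del.executeUpdate();
                } else {
                    // Fetch the current live row; repeat changes for a key coalesce naturally.
                    fetch.setString(1, key);
                    ResultSet row = fetch.executeQuery();
                    if (row.next()) {
                        upsert.setString(1, key);
                        upsert.setString(2, row.getString("STATUS"));
                        upsert.setInt(3, row.getInt("QTY"));
                        upsert.setString(4, key);
                        upsert.setString(5, row.getString("STATUS"));
                        upsert.setInt(6, row.getInt("QTY"));
                        upsert.executeUpdate();
                    }
                    row.close();
                }
            }

            // Clear only what we processed.
            PreparedStatement purge = src.prepareStatement(
                    "DELETE FROM MYLIB.ORDERS_STG WHERE CHGTS <= ?");
            purge.setTimestamp(1, cutoff);
            purge.executeUpdate();
            src.close();
            tgt.close();
        }
    }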

    - Gary Patterson
    LVL 10

    Author Closing Comment

    Thanks a lot for all of the info.  This gives me a good step in the right direction.  I am still not sure exactly how we are going to go about it, but our AS/400 admin and I are going to review all of this info and discuss our options.
