staging model: How do I retain integrity and assure uniqueness?

Paula DiTallo
Paula DiTallo used Ask the Experts™
on
Techies--
 
I need to model a staging database that is used to process incoming xml files from one vendor and api data from another.  The incoming records per load can contain all unique header/detail_line/sub_detail_line data, the same record again with updates to elements, or the same record with an added/removed detail_line/sub_detail_line. I do not, or will not know which.

How do model this? Do I use something like an added_row_timedate column to capture the uniqueness of an instance of a record/data object along with the obvious keys?  Do I create concatenated primary keys that include the added_row_timedate column? Do I not do that and generate synthetic/sequential identifier keys? Do I bother with keys in the staging environment at all, or do I enforce integrity through unique constraints?

The code snippet has a sample XML stream and the obvious portion of the model. Please advise me.


--<AllData>
--<DataObj>
--  <HeaderObj>
--     <hdr_id>1000</hdr_id>
--     <hdr_elem_2>my header element</hdr_elem_2>
--     <DetailObj>
--        <dtl_id>10</dtl_id>
--        <dtl_elem_2>my detail line element</dtl_elem_2>
--        <SubDetailObj>
--          <sub_id>1</sub_id>
--          <sub_elem_2> my subdetail line element</sub_elem_2>
--        </SubDetailObj>
--     </DetailObj>
--  </HeaderObj>
--</DataObj>
--<DataObj>
--  <HeaderObj>
--     <hdr_id>1010</hdr_id>
--     <hdr_elem_2>my other header element</hdr_elem_2>
--     <DetailObj>
--        <dtl_id>10</dtl_id>
--        <dtl_elem_2>my other detail line element</dtl_elem_2>
--        <SubDetailObj>
--          <sub_id>1</sub_id>
--          <sub_elem_2> my other subdetail line element</sub_elem_2>
--        </SubDetailObj>
--     </DetailObj>
--  </HeaderObj>
--</DataObj>
--<DataObj>
--  <HeaderObj>
--     <hdr_id>1000</hdr_id>
--     <hdr_elem_2>my header element</hdr_elem_2>
--     <DetailObj>
--        <dtl_id>10</dtl_id>
--        <dtl_elem_2>my detail line element</dtl_elem_2>
--        <SubDetailObj>
--          <sub_id>1</sub_id>
--          <sub_elem_2> my subdetail line element</sub_elem_2>
--        </SubDetailObj>
--     </DetailObj>
--        <dtl_id>20</dtl_id>
--        <dtl_elem_2>my new detail line element</dtl_elem_2>
--     </DetailObj>
--         <SubDetailObj>
--         </SubDetailObj>
--  </HeaderObj>
--</DataObj>
--</AllData>

create table [academic].[load_xaction_data]
( LoadId int not null,          -- sequential identifier/synthetic key
  ImportSourceId int not null,  -- coming from another table that tracks whether source is api or xml files
  ImportDate datetime not null,
  LoadBeginTime datetime not null,
  LoadEndTime datetime not null);
  
create table [academic].header_obj
 ( hdr_id int not null,                  -- retained exisiting data object key
   hdr_element_2 varchar(50) not null ); -- other header element
   
create table [academic].detail_obj
( hdr_id int not null,					-- retained existing data object key from header
  dtl_id int not null,					-- retained existing data object key from detail
  dtl_elem_2 varchar(50) not null);		-- other detail header element
  
create table [academic].sub_obj
( hdr_id int not null,				    -- same pattern
  dtl_id int not null,
  sub_id int not null,
  sub_elem_2 varchar(50) not null);

Open in new window

Comment
Watch Question

Do more with

Expert Office
EXPERT OFFICE® is a registered trademark of EXPERTS EXCHANGE®
Senior Software Analyst
Commented:
Well, here's some questions I have that would affect the model I'd choose (please excuse the redundancy of the questions, I'm just not making any assumptions about your understanding of any particular phrase I use):

1.  Do you need to archive every input instance? That is, do you need to be able to report everything you received, whether or not you processed it any further than simply receiving it?  Copies of the files themselves might not be sufficient - I have a calendar-event system I'm working on where I process the files as records then store the records in the database so I can compare any instance of a calendar with any requested change in the past.

2.  What is the priority of your inputs?  Which should be applied first, or is it always first-come-first-served?

3.  Is the input from the API in XML format, and if so is the grammar from all XML inputs identical to the grammar of the API?  I'm asking "to the best of your current knowledge".

For occurrence inputs, where inputs are ordered by the time they occur, you have to define when that time is; by example, does sunrise begin when the sun is first visible, when it is halfway up, or only when it has completely cleared the horizon?  So, is the time an input is marked the time it was detected and processing began, or is it the time that the complete transaction was recorded to the database?

Personally, I like time of transaction completion, as it is the first time that everyone reading the database can see the data.  That said, the start of the transaction is available during the transaction itself, while the end time of the transaction is only available after the transaction is complete, thus another transaction is required to post this information.

So, let's take the easy route and mark each transaction according to the start time of the transaction.  What will compose a transaction?  Do we require that an entire XML requiest be completed before any part of another XML request is complete? If not, do we require each DataObj in an XML request to be processed completely before another DataObj is processed (assuming they affect identical DataObj instances)?  and so forth down the line until you idenify the minimum transaction.

Think about this.  Tell me what you think.  The best model will emerge from this discussion; the most significant pieces will have occurred to you as you were reading these paragraphs.  I look forward to your response.
Paula DiTalloIntegration developer

Author

Commented:
cpkilekofp,
Thanks for responding. I appreciate that very much. I've engaged the Business Analyst (BA) on this project do basically reveal what the workflow expectation can be regarding what the order (aka: priority) of the dataobjects should be once stored. Answer to your first question, yes--the data elements for each data object will need to be stored whether from the consumed data of the web service in the case of one vendor, or from the xml files in the case of the other vendor.

As a first draft, I've decided to assign a synthetic key to incoming dataobjects, while retaining the associated incoming dataobject key--since this is unique, but repeatable. I carry this synthetic key with the vendor key so that I can uniquely identify each transaction as it arrives in any order, with any priority.

I'd like to keep this thread open so that as more info filters in from the BA, futher discussion on the model is possible. I will leave closure up to you.
Christopher KileSenior Software Analyst

Commented:
Don't worry about me.  When you have more info, let me know.  If no one posts to this question in 20 days or so, the moderators will ask us if it's ok to delete it.

Do more with

Expert Office
Submit tech questions to Ask the Experts™ at any time to receive solutions, advice, and new ideas from leading industry professionals.

Start 7-Day Free Trial