Parwise comparison in MS-Access

Experts:

I require some assistance/guidance with a general data management scenario. Allow me to provide some background first.

Background:
• I have a spreadsheet that may include thousands of records across approximately 25+ columns.   Columns can be of types [date], [number], [text].   Attached XLS is a very simplified version that only contains only 14 records across four columns.
• My actual worksheet, however, may contain 2,000 rows/records (+ header row).   And, I will have duplicate records (for a reason that doesn’t need to be covered in this thread) for each single record.  
• So, essentially, in the actual data set, I may have only 1,000 unique records (based on 3 or 4 key fields).   In the example data set, however, I only included 14 records where seven records are unique records (not across all fields though).
• Again, there are “slight differences” in each of the seven (7) paired records.   Please refer to additional notes in cell range B17:D31 that specifically explain details on the record variances.

What I Would Like to Achieve:
• In essence, I want to review the 14 records and remove their associate 2nd duplicate record and ultimately end up with only 7 records.
• The trick is though, there’s no “general algorithm” that states that I will always keep the 1st record and trash the 2nd one (or vice versa).   Instead, I want to **merge** two records into one but let the user decide that, e.g., for the record dated “07/15”, I will use [Submission ID] = 142 and overwrite ID 163.   Alternatively, for the record dated 10/02, [Submission ID] = 189 should be kept by overwriting ID 190.
• Similarly, the same holds true for the field “Example Summary”, the user will pick what he/she believes depicts a better summary.   Please keep in mind, for the [Example Summary] field, a longer summary does not mean more accurate summary.   Again, it’s the user’s decision to pick one of the two.

What Makes This Process Tricky:
• Now, for another reason which doesn’t require to be explained in detail here, the user prefers using an **Access database** over and **MS-Excel spreadsheet**.  
• If Excel were the tool of choice, a simple “drag ‘n drop” or “copying/pasting” one cell onto another cell would allow to accomplish the task somewhat easy (but it could be very tedious).  
• Please keep in mind though, the actual data set will have thousands of records and dozens of columns.   And the user does not want to add “helper columns” with formulae equal to “=If(B2=B3, “Ok”, “Error”.  
• So, again, this is where the user prefers using an Access database to automate or at least semi-automated the record update process.
• Thus, finally, I have attached the same XLS data structure in MS-Access as well.  
 
My questions:
• Based on the background & information provided in this thread (combined with the notes in the XLS), do you have a specific recommendation for developing, e.g., an Access query or module that executes a “pair-wise comparison” (again, in this example, only based on [Date] field)?  
• In other words, in the actual dataset, I want to shrink my, e.g., 2,000 records (or 1,000) pairs down to only 300 records (and 150 pairs) to be reviewed.    Then, depending on the field’s content, the user can either pick value from record #1 and overwrite value for record #2 or vice versa.  
• If so, how could this update process be accomplished (e.g., using a form) in Access?

Thank you for your assistance or advice in advance.

Cheers,
EEH
Information-about-Access-Database.xlsx
Database.accdb
ExpExchHelpAnalystAsked:
Who is Participating?

[Product update] Infrastructure Analysis Tool is now available with Business Accounts.Learn More

x
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

Gustav BrockCIOCommented:
Add a boolean field to the table, Exclude, as a checkbox.
Let the user, in a form bound to the table, mark those records that are to be excluded.

When finished perusing the table, delete those records marked as excluded, or copy the not excluded to a new table.
0
ExpExchHelpAnalystAuthor Commented:
Gustav:

Thank you for the feedback.   I am not sure if I accurately described the problem.

There will be content from, e.g., paired record #1 that I need to use.   At the same time, there will be content from record #2 that also needs to be used.   Somehow I need to "copy paste" (in Access via an update query) the selected values (up to 25 fields) from one record to another.  

After the update, I can see the boolean field coming into play... that is, every record that was flagged can be deleted.   But again, prior to the deletion I need to figure out a way how to review/update the records so that a "master record" will be created.

Thanks,
EEH
0
Gustav BrockCIOCommented:
There is no easy or simple way to do this - indeed not if the selection of fields to be copied varies from record pairs to record pairs.
It can be done, but will require a lot of work, especially regarding the user interface. Not to mention: How to know which record pairs have been investigated and which have not.

To me, this seems like a piece of contract work, not large, but still. Or leave it in Excel.
0
The Five Tenets of the Most Secure Backup

Data loss can hit a business in any number of ways. In reality, companies should expect to lose data at some point. The challenge is having a plan to recover from such an event.

aikimarkCommented:
Are you allowed to have a primary key (autonumber) on your table?
0
ExpExchHelpAnalystAuthor Commented:
aikimark - thank you for chiming in.   Yes, as part of the data import from Excel to Access, I certainly can add a primary key.
0
Dale FyeOwner, Developing Solutions LLCCommented:
I would start by importing the Excel spreadsheet into Access and let access add a PK field to the table during the insert.

I would do this with a form.  At the top of the form I would create a subform which contains only the 4 fields which you would use to define each of the unique pairs.  The query for that would look something like:
SELECT Field1, Field2, Field3, Field4 FROM yourTable
GROUP BY Field1, Field2, Field3, Field4 
HAVING Count(Field1) = 2
ORDER BY Field1, Field2, Field3, Field4 

Open in new window

This would give you only those records which have duplicates, but don't have a third row.

In the Current event of the subform descibed above, I would define a query which filters on the row selected and would then display that info in a second subform.  The query for that subform would look something like:
SELECT *
FROM yourTable
WHERE Field1 = Forms!yourFormName!subform1!Form.Field1
AND Field2 = Forms!yourFormName!subform1!Form.Field2
AND Field3 = Forms!yourFormName!subform1!Form.Field3
AND Field4 = Forms!yourFormName!subform1!Form.Field4
ORDER BY ID

Open in new window

In this subform (probably a datasheet because this would allow you to widen columns or change the height of rows at runtim), I would hide fields 1-4 but keep them and set their default values to the values selected in the upper subform.

I would set the AllowAdditions property of this subform to Yes so that you could then add a new record to the 2nd subform.  You could then select a record in the first subform, move to the second subform, add a new record and move from one column to the next selecting and copying the "best" value from each of those into this new record.

I would include a BeforeUpdate event on this 2nd subform which would prevent the user from saving the record until all of the columns had been entered.

Once you have done this, you would simply create a query like the following to get the "added" record for each combination:
SELECT Field1, Field2, Field3, Field4, Max(ID)
FROM yourTable
Group By Field1, Field2, Field3, Field4
HAVING Count(Field1) = 3

Open in new window

Note that this would use the Count feature to only select those records for which you have added a row, which would eliminate the need to track which records have already been selected.  You might even change the recordsource of the first form to look like:
SELECT Field1, Field2, Field3, Field4, Count(Field1) as RecCount
FROM yourTable
GROUP BY Field1, Field2, Field3, Field4 
HAVING Count(Field1) >= 2
ORDER BY Field1, Field2, Field3, Field4

Open in new window

which would allow you to filter on the RecCount field to display those which only have two record, or those which have 3 (including the added record).
0

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
aikimarkCommented:
This might get you a candidate list of rows.  I created an autonumber ID column on Table1
SELECT Table1.ID, Table1_1.ID, Table1.[Example Date], Table1.Submitted, Table1_1.Submitted
FROM Table1 INNER JOIN Table1 AS Table1_1 ON Table1.[Example Date] = Table1_1.[Example Date]
WHERE (((Table1.ID)<[table1_1].[id]) 
AND (([Table1].[Submission ID]=[Table1_1].[Submission ID] 
And [Table1].[Example Summary]=[Table1_1].[Example Summary])=False));

Open in new window


You could potentially check for the identical row pairs by changing the "=False" to "=True".  To me, these are two different sets of rows that need manipulating.  For identical row pairs, you probably only need a single drop-down control to let the user determine the final disposition.

The non-identical row pairs will require the extra processing.
0
ExpExchHelpAnalystAuthor Commented:
Dale -- given the fieldnames are very generic in your proposed solution, I am not entirely certain how to interpret your solution.   Do you happen to have an example available?

aikimark -- I ran the query you proposed.   Attached JPG shows a screenshot of the output.   When running it on field [Submitted], I want to display only those records marked in "pink" (see spreadsheet example).   That is, those particular records differ and the user needs to make a decision which ones  need to change from either "No" to "Yes" or vice versa.    Again, for field [Submitted], only the 07/12 and 10/02 records are ok since both values are equal.

The same principle applies to the other two fields.   That is, only those record pairs where at least record is shown in pink require an update.   I hope this makes sense.

EEH
Query1.JPG
0
aikimarkCommented:
SELECT Table1.ID, Table1_1.ID, Table1.[Example Date], Table1.Submitted, Table1_1.Submitted
FROM Table1 INNER JOIN Table1 AS Table1_1 ON Table1.[Example Date] = Table1_1.[Example Date]
WHERE (((Table1.ID)<>[table1_1].[id]) 
AND ((Table1.Submitted)<>[Table1_1].[Submitted]) 
AND (([Table1].[Submission ID]=[Table1_1].[Submission ID] And [Table1].[Example Summary]=[Table1_1].[Example Summary])=False));

Open in new window

0
Dale FyeOwner, Developing Solutions LLCCommented:
Unfortunately, i dont have power at the moment, so provided my response from my iPad, without access to Access.  

My generic solution is for your more complex dataset, not the simplified dataset you provided.  Won't be able to help any more until i get my iPad charged.
0
ExpExchHelpAnalystAuthor Commented:
aikimark - not sure if I'm fully tracking on your proposed solution.   Looking at the XLS example -- field "Submitted" only for now -- I would expect the query to records 10 records (5 pairs).   Both records for:  01/14;, 05/22; 07/15; 07/20; and 08/15.   At this time, however, only 2 pairs (4 records) are returned in the query.

Dale -- copy, I understand.   Appreciate any help you may be able to offer later on.   Thank you!
0
aikimarkCommented:
You get two ID values for the 'difference' row pairs
0
ExpExchHelpAnalystAuthor Commented:
aikimark:

Thank you - based on [Submitted] I alone, I would expect to see the following ID pairs:
1, 2
3, 4
7, 8
9, 10
11, 12

Instead, however, the query produces:
4, 3
3, 4
8, 7
7, 8

In this case, each record is show twice (e.g., "4, 3" and "3,4").  Also, not all of the "pink" records for [SUBMITTED] are shown.   What am I missing here?

Thanks,
EEH
0
aikimarkCommented:
this version of the query:
SELECT Table1.ID, Table1_1.ID, Table1.[Example Date], Table1.Submitted, Table1_1.Submitted
FROM Table1 INNER JOIN Table1 AS Table1_1 ON Table1.[Example Date] = Table1_1.[Example Date]
WHERE (((Table1.ID)<table1_1.id) And ((Table1.Submitted)<>Table1_1.Submitted) And ((Table1.[Submission ID]=Table1_1.[Submission ID] And Table1.[Example Summary]=Table1_1.[Example Summary])=true));

Open in new window


produces
(1,2)
(9,10)
(11,12)
rows
0
ExpExchHelpAnalystAuthor Commented:
Good morning - w/o using a union query, is there a way I get the combined results?

Query 1:
(3,4)                      
(7,8)

+ Query 2:
(1,2)
 (9,10)
 (11,12)

with final query results of:
(1,2)
(3,4)
(7,8)
(9,10)
(11,12)
0
ExpExchHelpAnalystAuthor Commented:
Thank you for your assistance, Dale.
0
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Databases

From novice to tech pro — start learning today.