asked on

Parwise comparison in MS-Access

Experts:

I require some assistance/guidance with a general data management scenario. Allow me to provide some background first.

Background:
• I have a spreadsheet that may include thousands of records across approximately 25+ columns. Columns can be of types [date], [number], [text]. Attached XLS is a very simplified version that only contains only 14 records across four columns.
• My actual worksheet, however, may contain 2,000 rows/records (+ header row). And, I will have duplicate records (for a reason that doesn’t need to be covered in this thread) for each single record.
• So, essentially, in the actual data set, I may have only 1,000 unique records (based on 3 or 4 key fields). In the example data set, however, I only included 14 records where seven records are unique records (not across all fields though).
• Again, there are “slight differences” in each of the seven (7) paired records. Please refer to additional notes in cell range B17:D31 that specifically explain details on the record variances.

What I Would Like to Achieve:
• In essence, I want to review the 14 records and remove their associate 2nd duplicate record and ultimately end up with only 7 records.
• The trick is though, there’s no “general algorithm” that states that I will always keep the 1st record and trash the 2nd one (or vice versa). Instead, I want to **merge** two records into one but let the user decide that, e.g., for the record dated “07/15”, I will use [Submission ID] = 142 and overwrite ID 163. Alternatively, for the record dated 10/02, [Submission ID] = 189 should be kept by overwriting ID 190.
• Similarly, the same holds true for the field “Example Summary”, the user will pick what he/she believes depicts a better summary. Please keep in mind, for the [Example Summary] field, a longer summary does not mean more accurate summary. Again, it’s the user’s decision to pick one of the two.

What Makes This Process Tricky:
• Now, for another reason which doesn’t require to be explained in detail here, the user prefers using an **Access database** over and **MS-Excel spreadsheet**.
• If Excel were the tool of choice, a simple “drag ‘n drop” or “copying/pasting” one cell onto another cell would allow to accomplish the task somewhat easy (but it could be very tedious).
• Please keep in mind though, the actual data set will have thousands of records and dozens of columns. And the user does not want to add “helper columns” with formulae equal to “=If(B2=B3, “Ok”, “Error”.
• So, again, this is where the user prefers using an Access database to automate or at least semi-automated the record update process.
• Thus, finally, I have attached the same XLS data structure in MS-Access as well.

My questions:
• Based on the background & information provided in this thread (combined with the notes in the XLS), do you have a specific recommendation for developing, e.g., an Access query or module that executes a “pair-wise comparison” (again, in this example, only based on [Date] field)?
• In other words, in the actual dataset, I want to shrink my, e.g., 2,000 records (or 1,000) pairs down to only 300 records (and 150 pairs) to be reviewed. Then, depending on the field’s content, the user can either pick value from record #1 and overwrite value for record #2 or vice versa.
• If so, how could this update process be accomplished (e.g., using a form) in Access?

Thank you for your assistance or advice in advance.

Cheers,
EEH
Information-about-Access-Database.xlsx
Database.accdb

Gustav Brock

Add a boolean field to the table, Exclude, as a checkbox.
Let the user, in a form bound to the table, mark those records that are to be excluded.

When finished perusing the table, delete those records marked as excluded, or copy the not excluded to a new table.

ExpExchHelp

ASKER

Gustav:

Thank you for the feedback. I am not sure if I accurately described the problem.

There will be content from, e.g., paired record #1 that I need to use. At the same time, there will be content from record #2 that also needs to be used. Somehow I need to "copy paste" (in Access via an update query) the selected values (up to 25 fields) from one record to another.

After the update, I can see the boolean field coming into play... that is, every record that was flagged can be deleted. But again, prior to the deletion I need to figure out a way how to review/update the records so that a "master record" will be created.

Thanks,
EEH

Gustav Brock

There is no easy or simple way to do this - indeed not if the selection of fields to be copied varies from record pairs to record pairs.
It can be done, but will require a lot of work, especially regarding the user interface. Not to mention: How to know which record pairs have been investigated and which have not.

To me, this seems like a piece of contract work, not large, but still. Or leave it in Excel.

aikimark

Are you allowed to have a primary key (autonumber) on your table?

ExpExchHelp

ASKER

aikimark - thank you for chiming in. Yes, as part of the data import from Excel to Access, I certainly can add a primary key.

ASKER CERTIFIED SOLUTION

Dale Fye

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

aikimark

This might get you a candidate list of rows. I created an autonumber ID column on Table1

SELECT Table1.ID, Table1_1.ID, Table1.[Example Date], Table1.Submitted, Table1_1.Submitted
FROM Table1 INNER JOIN Table1 AS Table1_1 ON Table1.[Example Date] = Table1_1.[Example Date]
WHERE (((Table1.ID)<[table1_1].[id]) 
AND (([Table1].[Submission ID]=[Table1_1].[Submission ID] 
And [Table1].[Example Summary]=[Table1_1].[Example Summary])=False));

Open in new window

You could potentially check for the identical row pairs by changing the "=False" to "=True". To me, these are two different sets of rows that need manipulating. For identical row pairs, you probably only need a single drop-down control to let the user determine the final disposition.

The non-identical row pairs will require the extra processing.

ExpExchHelp

ASKER

Dale -- given the fieldnames are very generic in your proposed solution, I am not entirely certain how to interpret your solution. Do you happen to have an example available?

aikimark -- I ran the query you proposed. Attached JPG shows a screenshot of the output. When running it on field [Submitted], I want to display only those records marked in "pink" (see spreadsheet example). That is, those particular records differ and the user needs to make a decision which ones need to change from either "No" to "Yes" or vice versa. Again, for field [Submitted], only the 07/12 and 10/02 records are ok since both values are equal.

The same principle applies to the other two fields. That is, only those record pairs where at least record is shown in pink require an update. I hope this makes sense.

EEH
Query1.JPG

aikimark

SELECT Table1.ID, Table1_1.ID, Table1.[Example Date], Table1.Submitted, Table1_1.Submitted
FROM Table1 INNER JOIN Table1 AS Table1_1 ON Table1.[Example Date] = Table1_1.[Example Date]
WHERE (((Table1.ID)<>[table1_1].[id]) 
AND ((Table1.Submitted)<>[Table1_1].[Submitted]) 
AND (([Table1].[Submission ID]=[Table1_1].[Submission ID] And [Table1].[Example Summary]=[Table1_1].[Example Summary])=False));

Open in new window

Dale Fye

Unfortunately, i dont have power at the moment, so provided my response from my iPad, without access to Access.

My generic solution is for your more complex dataset, not the simplified dataset you provided. Won't be able to help any more until i get my iPad charged.

ExpExchHelp

ASKER

aikimark - not sure if I'm fully tracking on your proposed solution. Looking at the XLS example -- field "Submitted" only for now -- I would expect the query to records 10 records (5 pairs). Both records for: 01/14;, 05/22; 07/15; 07/20; and 08/15. At this time, however, only 2 pairs (4 records) are returned in the query.

Dale -- copy, I understand. Appreciate any help you may be able to offer later on. Thank you!

aikimark

You get two ID values for the 'difference' row pairs

ExpExchHelp

ASKER

aikimark:

Thank you - based on [Submitted] I alone, I would expect to see the following ID pairs:
1, 2
3, 4
7, 8
9, 10
11, 12

Instead, however, the query produces:
4, 3
3, 4
8, 7
7, 8

In this case, each record is show twice (e.g., "4, 3" and "3,4"). Also, not all of the "pink" records for [SUBMITTED] are shown. What am I missing here?

Thanks,
EEH

aikimark

this version of the query:

SELECT Table1.ID, Table1_1.ID, Table1.[Example Date], Table1.Submitted, Table1_1.Submitted
FROM Table1 INNER JOIN Table1 AS Table1_1 ON Table1.[Example Date] = Table1_1.[Example Date]
WHERE (((Table1.ID)<table1_1.id) And ((Table1.Submitted)<>Table1_1.Submitted) And ((Table1.[Submission ID]=Table1_1.[Submission ID] And Table1.[Example Summary]=Table1_1.[Example Summary])=true));

Open in new window

produces
(1,2)
(9,10)
(11,12)
rows

ExpExchHelp

ASKER

Good morning - w/o using a union query, is there a way I get the combined results?

Query 1:
(3,4)
(7,8)

+ Query 2:
(1,2)
(9,10)
(11,12)

with final query results of:
(1,2)
(3,4)
(7,8)
(9,10)
(11,12)

ExpExchHelp

ASKER

Thank you for your assistance, Dale.