De-duplication query on Access (SQL) database
Posted on 2013-01-17
I have a Access database that I need some help with (my Access and SQL skills are pretty basic). I have a single table containing information about documents. Including which other documents in the table are duplicates of it.
My objective is to create a list of unique documents, ie. All the documents that have no duplicates (truly unique) along with only one document for each set (2 or more) duplicates. This is further complicated by the marking of duplicates not including all (in direct) duplicates.
The marking of duplicates does not extend to every duplicate, eg.
Document A Document B
Document B Document A
Document B Document C
Document C Document B
Document A has been marked as a duplicate of Document B. Document B is marked as a duplicate of Document C. So Document A and C and also duplicates, but are not directly marked as duplicates.
Currently my information is stored in a table with this layout:
Filename Duplicate1 Duplicate2 Duplicate3 Duplicate4 Duplicate5
DocumentA DocumentB DocumentH
Where no duplicate could be identified the respective Duplicate 1-5 fields are blank.
This may not be the best way to store this information and I welcome any suggestions to restructure the table or implement additional tables.
The output I require is a list of unique documents, ie. No document in the list will have a duplicate (direct or indirect) in the list.
Many thanks in advance,