Solved

Duplicate ID values in CSV file result in error on import

Posted on 2015-02-04
Medium Priority
116 Views
Last Modified: 2015-02-05
I am having problems with a commercial product I purchased, and I'm having trouble getting support from the company. I want to make sure I'm not doing something wrong, so I'm checking here.

The file I purchased is a large CSV containing postal codes. The first step in the process is to create a table like this:

CREATE TABLE GeoPC_Places (
  ISO VARCHAR(2) NOT NULL,
  Country VARCHAR(50) NOT NULL,
  Language VARCHAR(2) NOT NULL,
  ID BIGINT NOT NULL,
  Region1 VARCHAR(80),
  Region2 VARCHAR(80),
  Region3 VARCHAR(80),
  Region4 VARCHAR(80),
  Locality VARCHAR(80),
  Postcode VARCHAR(15),
  Suburb VARCHAR(80),
  Latitude DECIMAL(10,6),
  Longitude DECIMAL(10,6),
  Elevation INT,
  ISO2 VARCHAR(10),
  FIPS VARCHAR(10),
  NUTS VARCHAR(12),
  HASC VARCHAR(12),
  STAT VARCHAR(20),
  Timezone VARCHAR(30),
  UTC VARCHAR(10),
  DST VARCHAR(10),
  CONSTRAINT GeoPC_Places_pkey PRIMARY KEY CLUSTERED (Language ASC, ID ASC) ON [PRIMARY]
);

I am doing a bulk insert query like this:

BULK INSERT GeoPC_Places
FROM 'C:\Users\bbdesign\documents\GeoPC_EU_Places1.csv'
WITH (FIRSTROW = 2, FIELDTERMINATOR = ';', ROWTERMINATOR = '\n', DATAFILETYPE = 'widechar')

I get this error message:

Msg 2627, Level 14, State 1, Line 1
Violation of PRIMARY KEY constraint 'GeoPC_Places_pkey'. Cannot insert duplicate key in object 'dbo.GeoPC_Places'. The duplicate key value is (EN, 2018188927).
The statement has been terminated.

When I check the CSV file, there are indeed two records with 2018188927 in the ID column. Upon further checking, I see many records that have non-distinct ID values.

My thinking is that there is something wrong with this file, but since this comes from a seemingly well-established company (with a slow-to-respond support department), I want to make sure the error is not mine.

I am also not sure I actually need the ID column to be distinct. I mostly care about the postal code, latitude, and longitude values. So if I cannot find any other workaround, maybe I could leave the ID column out of the primary key. But would removing the CONSTRAINT from the CREATE TABLE statement make this very large table perform worse?

Any advice?
Thanks!
Question by:bbdesign
6 Comments
 
LVL 66

Expert Comment

by:Jim Horn
ID: 40589818
Can't speak to the company you bought the file from, but a common scenario is to load a file into a temporary 'staging' table without any constraints.  Some developers even go as far as having all varchar columns.

Then once it's loaded, perform whatever T-SQL validations are necessary to make sure ID rows are unique, dates are dates, numbers are numbers, US states are valid US states, etc.  

Then deal with the offenders in T-SQL, then insert from the 'staging' table into your final production table.
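
A minimal sketch of what that could look like (the staging table name and the all-VARCHAR column types here are just illustrative, not from the vendor's documentation):

-- Staging table: no keys or constraints and generous VARCHARs, so the bulk
-- load always succeeds and the problem rows can be inspected afterwards.
CREATE TABLE GeoPC_Places_Staging (
  ISO VARCHAR(10), Country VARCHAR(100), Language VARCHAR(10), ID VARCHAR(30),
  Region1 VARCHAR(100), Region2 VARCHAR(100), Region3 VARCHAR(100), Region4 VARCHAR(100),
  Locality VARCHAR(100), Postcode VARCHAR(30), Suburb VARCHAR(100),
  Latitude VARCHAR(30), Longitude VARCHAR(30), Elevation VARCHAR(30),
  ISO2 VARCHAR(30), FIPS VARCHAR(30), NUTS VARCHAR(30), HASC VARCHAR(30),
  STAT VARCHAR(30), Timezone VARCHAR(50), UTC VARCHAR(20), DST VARCHAR(20)
);

-- Same bulk load as yours, just pointed at the unconstrained table.
BULK INSERT GeoPC_Places_Staging
FROM 'C:\Users\bbdesign\documents\GeoPC_EU_Places1.csv'
WITH (FIRSTROW = 2, FIELDTERMINATOR = ';', ROWTERMINATOR = '\n', DATAFILETYPE = 'widechar');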
 

Author Comment

by:bbdesign
ID: 40589825
Thanks. Yes, I could definitely do that.

My question is: I'm not doing something stupid here, am I? If their query has that constraint on the table, they shouldn't be providing me with a CSV that contains dozens (if not hundreds) of duplicate ID values, correct?
 
LVL 66

Expert Comment

by:Jim Horn
ID: 40589840
>If their query has that constraint on the table
I see a table definition, not a query.

>they shouldn't be providing me with a CSV that contains dozens (if not hundreds) of duplicate ID values, correct?
Perhaps, but from doing ETL work I've learned not to trust source data unless there's a written contract, signed in the vendor's blood with a first-born child as collateral, stating that the data is as described by that vendor. So having to validate in this manner is pretty normal.

If this duplication is actually correct data, then they have to back that up, and of course there goes your constraint, with some possible downstream impacts.

>But removing some of the CONSTRAINT part of the "create table" query could make this very large table not perform quite as well?
Correct, but based on my first comment, I'd scrub the data so that it conforms to the constraint, then insert it into the table with the constraint.
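
For example, spotting the duplicates in the staging table could look like this (just a sketch, using the example staging table from my earlier comment; your primary key is (Language, ID), so group on both):

-- Which (Language, ID) pairs occur more than once, and how often?
SELECT Language, ID, COUNT(*) AS DupCount
FROM GeoPC_Places_Staging
GROUP BY Language, ID
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC;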

Author Comment

by:bbdesign
ID: 40589872
I imported the CSV into a table with no constraints. I found there are 90,972 duplicate ID values. I guess I may need to wait for another contact from their support people to see what that is all about (which could be days, if it comes at all). Here is an example (both have ID=2018188927):

GB;United Kingdom;EN;2018188927;England;North West England;Merseyside;Liverpool;Liverpool;L16 3NF;;53.405617;-2.881958;51;GB-LIV;UKH8;UKD52;GB.LV;E08000012;Europe/London;+00:00;+01:00

GB;United Kingdom;EN;2018188927;England;North West England;Merseyside;Liverpool;Liverpool;L16 3NF;;53.405617;-2.881958;22;GB-LIV;UKH8;UKD52;GB.LV;E08000012;Europe/London;+00:00;+01:00

The column header looks like this:

iso;country;language;id;region1;region2;region3;region4;locality;postcode;suburb;latitude;longitude;elevation;iso2;fips;nuts;hasc;stat;timezone;utc;dst

The latitude and longitude values look the same. The only field that differs is the "elevation" value, which I don't care about. So I suppose I could get rid of the duplicates. I just hate having to modify a supposedly professionally prepared file like this. But as you suggested, perhaps professionally prepared doesn't mean it's going to be right, even if I paid for it.
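
A check along these lines would confirm it (just a sketch; "GeoPC_Places_Import" is a placeholder for whatever the unconstrained table is called):

-- For each duplicated (Language, ID), count the distinct values per column;
-- anything above 1 in a column other than Elevation would mean the duplicate
-- rows differ in more than just the elevation.
SELECT Language, ID,
       COUNT(*) AS RowCnt,
       COUNT(DISTINCT Postcode) AS Postcodes,
       COUNT(DISTINCT Latitude) AS Latitudes,
       COUNT(DISTINCT Longitude) AS Longitudes,
       COUNT(DISTINCT Elevation) AS Elevations
FROM GeoPC_Places_Import
GROUP BY Language, ID
HAVING COUNT(*) > 1;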
 
LVL 69

Accepted Solution

by:
Scott Pletcher earned 2000 total points
ID: 40589926
I strongly agree with using a staging table.  

Load the raw file into a staging table, like "GeoPC_Places__Staging", then use a query to load the final "GeoPC_Places" table.

Drop the Elevation column from the GeoPC_Places table, since you don't need it anyway, then try loading it this way and see if you still get duplicates:

INSERT INTO GeoPC_Places
SELECT DISTINCT
ISO,
Country,
Language,
ID,
Region1,
Region2,
Region3,
Region4,
Locality,
Postcode,
Suburb,
Latitude,
Longitude,
ISO2,
FIPS,
NUTS,
HASC,
STAT,
Timezone,
UTC,
DST
FROM GeoPC_Places__Staging
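
If you've already created GeoPC_Places with the Elevation column in it, you could just drop the column first so the column lists line up (a sketch; assumes nothing else references that column):

-- Run this before the INSERT ... SELECT DISTINCT above.
ALTER TABLE GeoPC_Places DROP COLUMN Elevation;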
 

Author Comment

by:bbdesign
ID: 40591180
Thanks for your advice!
