Duplicate ID values in CSV file result in error on import

Brad Bansner asked:

I am having problems with a commercial product I purchased, and I'm having trouble getting support from the company. I want to make sure I'm not doing something wrong, so I'm checking here.

The file I purchased is a large CSV containing postal codes. The first step in the process is to create a table like this:

CREATE TABLE GeoPC_Places (
  ISO VARCHAR(2) NOT NULL,
  Country VARCHAR(50) NOT NULL,
  Language VARCHAR(2) NOT NULL,
  ID BIGINT NOT NULL,
  Region1 VARCHAR(80),
  Region2 VARCHAR(80),
  Region3 VARCHAR(80),
  Region4 VARCHAR(80),
  Locality VARCHAR(80),
  Postcode VARCHAR(15),
  Suburb VARCHAR(80),
  Latitude DECIMAL(10,6),
  Longitude DECIMAL(10,6),
  Elevation INT,
  ISO2 VARCHAR(10),
  FIPS VARCHAR(10),
  NUTS VARCHAR(12),
  HASC VARCHAR(12),
  STAT VARCHAR(20),
  Timezone VARCHAR(30),
  UTC VARCHAR(10),
  DST VARCHAR(10),
  CONSTRAINT GeoPC_Places_pkey PRIMARY KEY CLUSTERED (Language ASC, ID ASC) ON [PRIMARY]
);

I am doing a bulk insert query like this:

BULK INSERT GeoPC_Places
FROM 'C:\Users\bbdesign\documents\GeoPC_EU_Places1.csv'
WITH (
    FIRSTROW = 2,
    FIELDTERMINATOR = ';',
    ROWTERMINATOR = '\n',
    DATAFILETYPE = 'widechar'
);

I get this error message:

Msg 2627, Level 14, State 1, Line 1
Violation of PRIMARY KEY constraint 'GeoPC_Places_pkey'. Cannot insert duplicate key in object 'dbo.GeoPC_Places'. The duplicate key value is (EN, 2018188927).
The statement has been terminated.

When I check the CSV file, there are indeed two records with 2018188927 in the ID column. Upon further checking, I see many records that have non-distinct ID values.

My thinking is that there is something wrong with this file, but since this comes from a seemingly well-established company (with a slow-to-respond support department), I want to make sure the error is not mine.

I am also not sure I actually need the ID column to be distinct. I mostly care about using this file for the postal code, latitude, and longitude values. So if I cannot find any other workaround, maybe I could simply leave ID out of the primary key. But would removing the CONSTRAINT part of the CREATE TABLE statement make this very large table perform worse?

Any advice?
Thanks!
Jim Horn:

I can't speak to the company you bought the file from, but a common approach is to load the file into a temporary 'staging' table without any constraints. Some developers even go so far as to make every column varchar.

Then, once it's loaded, perform whatever T-SQL validations are necessary to make sure ID values are unique, dates are dates, numbers are numbers, US states are valid US states, etc.

Then deal with the offenders in T-SQL, and finally insert from the 'staging' table into your final production table.
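
For example, a minimal sketch of that pattern might look like the following. The staging-table name is something I made up for illustration; the column list mirrors your CREATE TABLE minus the primary key, and the BULK INSERT options are the ones from your question:

-- Staging table: same columns as the target, but no PRIMARY KEY,
-- so duplicate (Language, ID) pairs load without error.
-- (GeoPC_Places_Staging is a hypothetical name.)
CREATE TABLE GeoPC_Places_Staging (
  ISO VARCHAR(2),
  Country VARCHAR(50),
  Language VARCHAR(2),
  ID BIGINT,
  Region1 VARCHAR(80),
  Region2 VARCHAR(80),
  Region3 VARCHAR(80),
  Region4 VARCHAR(80),
  Locality VARCHAR(80),
  Postcode VARCHAR(15),
  Suburb VARCHAR(80),
  Latitude DECIMAL(10,6),
  Longitude DECIMAL(10,6),
  Elevation INT,
  ISO2 VARCHAR(10),
  FIPS VARCHAR(10),
  NUTS VARCHAR(12),
  HASC VARCHAR(12),
  STAT VARCHAR(20),
  Timezone VARCHAR(30),
  UTC VARCHAR(10),
  DST VARCHAR(10)
);

-- Load the raw file; duplicates are tolerated at this stage.
BULK INSERT GeoPC_Places_Staging
FROM 'C:\Users\bbdesign\documents\GeoPC_EU_Places1.csv'
WITH (
    FIRSTROW = 2,
    FIELDTERMINATOR = ';',
    ROWTERMINATOR = '\n',
    DATAFILETYPE = 'widechar'
);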
Brad Bansner (Asker):

Thanks. Yes, I could definitely do that.

My question is: I'm not doing something stupid here, am I? If their query has that constraint on the table, they shouldn't be providing me with a CSV that contains dozens (if not hundreds) of duplicate ID values, correct?
Jim Horn:

>If their query has that constraint on the table
I see a table definition, not a query.

>they shouldn't be providing me with a CSV that contains dozens (if not hundreds) of duplicate ID values, correct?
Perhaps, but doing ETL work I've learned not to trust source data unless there's a written contract, signed in the vendor's blood with a firstborn child as collateral, that the data is as described by that vendor. So having to validate in this manner is pretty normal.

If this duplication is actually correct data, then they have to back that up, and of course there goes your constraint, with some possible downstream impacts.

>But would removing the CONSTRAINT part of the CREATE TABLE statement make this very large table perform worse?
Correct, but based on my first comment, I'd scrub the data so that it conforms to the constraint, then insert it into the table with the constraint.
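
As a sketch of that validation step, assuming the raw rows are sitting in the hypothetical GeoPC_Places_Staging table from my earlier comment, the duplicate check can be as simple as:

-- List every (Language, ID) pair that appears more than once,
-- i.e. every key that would violate the primary key constraint.
SELECT Language, ID, COUNT(*) AS RowCnt
FROM GeoPC_Places_Staging
GROUP BY Language, ID
HAVING COUNT(*) > 1
ORDER BY RowCnt DESC, Language, ID;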
Brad Bansner (Asker):

I imported the CSV into a table with no constraints and found there are 90,972 duplicate ID values. I guess I may need to wait for another contact from their support people to see what that is all about (which could be days, if it comes at all). Here is an example (both rows have ID=2018188927):

GB;United Kingdom;EN;2018188927;England;North West England;Merseyside;Liverpool;Liverpool;L16 3NF;;53.405617;-2.881958;51;GB-LIV;UKH8;UKD52;GB.LV;E08000012;Europe/London;+00:00;+01:00

GB;United Kingdom;EN;2018188927;England;North West England;Merseyside;Liverpool;Liverpool;L16 3NF;;53.405617;-2.881958;22;GB-LIV;UKH8;UKD52;GB.LV;E08000012;Europe/London;+00:00;+01:00

The column header looks like this:

iso;country;language;id;region1;region2;region3;region4;locality;postcode;suburb;latitude;longitude;elevation;iso2;fips;nuts;hasc;stat;timezone;utc;dst

The latitude and longitude values look the same. The only field that differs is "elevation", and I don't care about elevation. So I suppose I could get rid of the duplicates. I just hate having to modify a supposedly professionally prepared file like this. But as you suggested, perhaps professionally prepared doesn't mean it's going to be right, even if I paid for it.
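
For what it's worth, here is the kind of scrub I have in mind, just a sketch against the hypothetical staging table from above. Keeping the row with the higher elevation is an arbitrary tie-breaker on my part, since I don't use elevation anyway:

-- Keep exactly one row per (Language, ID) and insert the survivors
-- into the constrained production table.
;WITH Ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY Language, ID
           ORDER BY Elevation DESC  -- arbitrary choice of which duplicate wins
         ) AS rn
  FROM GeoPC_Places_Staging
)
INSERT INTO GeoPC_Places
  (ISO, Country, Language, ID, Region1, Region2, Region3, Region4,
   Locality, Postcode, Suburb, Latitude, Longitude, Elevation,
   ISO2, FIPS, NUTS, HASC, STAT, Timezone, UTC, DST)
SELECT ISO, Country, Language, ID, Region1, Region2, Region3, Region4,
       Locality, Postcode, Suburb, Latitude, Longitude, Elevation,
       ISO2, FIPS, NUTS, HASC, STAT, Timezone, UTC, DST
FROM Ranked
WHERE rn = 1;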
ASKER CERTIFIED SOLUTION

Scott Pletcher:

This solution is only available to Experts Exchange members.
Brad Bansner (Asker):

Thanks for your advice!