
Solved

Duplicate ID values in CSV file result in error on import

Posted on 2015-02-04
Medium Priority
117 Views
Last Modified: 2015-02-05
I am having problems with a commercial product I purchased and am having trouble getting support from the company. I want to make sure I'm not doing something wrong, so I'm checking here.

The file I purchased is a large CSV containing postal codes. The first step in the process is to create a table like this:

CREATE TABLE GeoPC_Places (
  ISO VARCHAR(2) NOT NULL,
  Country VARCHAR(50) NOT NULL,
  Language VARCHAR(2) NOT NULL,
  ID BIGINT NOT NULL,
  Region1 VARCHAR(80),
  Region2 VARCHAR(80),
  Region3 VARCHAR(80),
  Region4 VARCHAR(80),
  Locality VARCHAR(80),
  Postcode VARCHAR(15),
  Suburb VARCHAR(80),
  Latitude DECIMAL(10,6),
  Longitude DECIMAL(10,6),
  Elevation INT,
  ISO2 VARCHAR(10),
  FIPS VARCHAR(10),
  NUTS VARCHAR(12),
  HASC VARCHAR(12),
  STAT VARCHAR(20),
  Timezone VARCHAR(30),
  UTC VARCHAR(10),
  DST VARCHAR(10),
  CONSTRAINT GeoPC_Places_pkey PRIMARY KEY CLUSTERED (Language ASC, ID ASC) ON [PRIMARY]
);


I am doing a bulk insert query like this:

BULK INSERT GeoPC_Places
FROM 'C:\Users\bbdesign\documents\GeoPC_EU_Places1.csv'
WITH (
  FIRSTROW = 2,
  FIELDTERMINATOR = ';',
  ROWTERMINATOR = '\n',
  DATAFILETYPE = 'widechar'
);


I get this error message:

Msg 2627, Level 14, State 1, Line 1
Violation of PRIMARY KEY constraint 'GeoPC_Places_pkey'. Cannot insert duplicate key in object 'dbo.GeoPC_Places'. The duplicate key value is (EN, 2018188927).
The statement has been terminated.


When I check the CSV file, there are indeed two records with 2018188927 in the ID column. Upon further checking, I see many records that have non-distinct ID values.

My thinking is that there is something wrong with this file, but since this comes from a seemingly well-established company (with a slow-to-respond support department), I want to make sure the error is not mine.

I am also not sure I actually need the ID column to be distinct. I am mostly concerned with using this file for the postal codes and the latitude and longitude values. So if I cannot find any other workaround, maybe I could just not make the ID column part of the primary key. But would removing the CONSTRAINT part of the "create table" query make this very large table perform worse?

Any advice?
Thanks!
Question by:bbdesign
6 Comments
 
LVL 66

Expert Comment

by:Jim Horn
ID: 40589818
Can't speak to the company you bought the file from, but a common scenario is to load a file into a temporary 'staging' table without any constraints.  Some developers even go as far as having all varchar columns.

Then, once it's loaded, perform whatever T-SQL validations are necessary to make sure ID values are unique, dates are dates, numbers are numbers, US states are valid US states, etc.

Then deal with the offenders in T-SQL, then insert from the 'staging' table into your final production table.
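
A minimal sketch of that approach (just an illustration; the staging table name GeoPC_Places__Staging and the loose column definitions are assumptions, not something the vendor prescribes):

-- Staging table: same columns as GeoPC_Places, but no PRIMARY KEY or NOT NULL
-- constraints, so the raw file loads even if it contains duplicates or bad rows.
CREATE TABLE GeoPC_Places__Staging (
  ISO VARCHAR(2),
  Country VARCHAR(50),
  Language VARCHAR(2),
  ID BIGINT,
  Region1 VARCHAR(80),
  Region2 VARCHAR(80),
  Region3 VARCHAR(80),
  Region4 VARCHAR(80),
  Locality VARCHAR(80),
  Postcode VARCHAR(15),
  Suburb VARCHAR(80),
  Latitude DECIMAL(10,6),
  Longitude DECIMAL(10,6),
  Elevation INT,
  ISO2 VARCHAR(10),
  FIPS VARCHAR(10),
  NUTS VARCHAR(12),
  HASC VARCHAR(12),
  STAT VARCHAR(20),
  Timezone VARCHAR(30),
  UTC VARCHAR(10),
  DST VARCHAR(10)
);

-- Load the raw file into the unconstrained staging table.
BULK INSERT GeoPC_Places__Staging
FROM 'C:\Users\bbdesign\documents\GeoPC_EU_Places1.csv'
WITH (FIRSTROW = 2, FIELDTERMINATOR = ';', ROWTERMINATOR = '\n', DATAFILETYPE = 'widechar');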
 

Author Comment

by:bbdesign
ID: 40589825
Thanks. Yes, I could definitely do that.

My question is: I'm not doing something stupid here, am I? If their query has that constraint on the table, they shouldn't be providing me with a CSV that contains dozens (if not hundreds) of duplicate ID values, correct?
 
LVL 66

Expert Comment

by:Jim Horn
ID: 40589840
>If their query has that constraint on the table
I see a table definition, not a query.

>they shouldn't be providing me with a CSV that contains dozens (if not hundreds) of duplicate ID values, correct?
Perhaps, but doing ETL work I've learned not to trust source data unless there's a written contract, signed in the vendor's blood with a firstborn child as collateral, stating that the data is as described by that vendor. So having to validate in this manner is pretty normal.

If this duplication is actually correct data, then they have to back that up, and of course there goes your constraint, with some possible downstream impacts.

>But removing some of the CONSTRAINT part of the "create table" query could make this very large table not perform quite as well?
Correct, but based on my first comment, I'd scrub the data so that it conforms to the constraint, then insert it into the table with the constraint.
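
For example, a duplicate check against the staging table (a sketch, assuming the key really is meant to be Language plus ID, as in the posted CREATE TABLE) could be as simple as:

-- Rows that would violate the GeoPC_Places primary key.
SELECT Language, ID, COUNT(*) AS RowCnt
FROM GeoPC_Places__Staging
GROUP BY Language, ID
HAVING COUNT(*) > 1
ORDER BY RowCnt DESC, Language, ID;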

Author Comment

by:bbdesign
ID: 40589872
I imported the CSV into a table with no constraints. I found there are 90,972 duplicate ID values. I guess I may need to wait for another contact from their support people to see what that is all about (which could be days, if it comes at all). Here is an example (both have ID=2018188927):

GB;United Kingdom;EN;2018188927;England;North West England;Merseyside;Liverpool;Liverpool;L16 3NF;;53.405617;-2.881958;51;GB-LIV;UKH8;UKD52;GB.LV;E08000012;Europe/London;+00:00;+01:00

GB;United Kingdom;EN;2018188927;England;North West England;Merseyside;Liverpool;Liverpool;L16 3NF;;53.405617;-2.881958;22;GB-LIV;UKH8;UKD52;GB.LV;E08000012;Europe/London;+00:00;+01:00

The column header looks like this:

iso;country;language;id;region1;region2;region3;region4;locality;postcode;suburb;latitude;longitude;elevation;iso2;fips;nuts;hasc;stat;timezone;utc;dst

The latitude and longitude values look the same. The only thing that is not a duplicate is the "elevation" value. I don't care about the elevation value. So I suppose I could get rid of the duplicates. I just hate having to modify a supposedly professionally-prepared file like this. But as you suggested, perhaps professionally-prepared doesn't mean it's going to be right, even if I paid for it.
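
If I do end up removing the duplicates myself, I suppose something like this would do it (just a sketch; it keeps one arbitrary row per language + id, here the one with the higher elevation):

-- Keep one row per (Language, ID) in the staging table; which Elevation
-- survives is arbitrary, since I don't use that column anyway.
;WITH Ranked AS (
  SELECT Language, ID, Elevation,
         ROW_NUMBER() OVER (PARTITION BY Language, ID ORDER BY Elevation DESC) AS rn
  FROM GeoPC_Places__Staging
)
DELETE FROM Ranked
WHERE rn > 1;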
 
LVL 70

Accepted Solution

by:
Scott Pletcher earned 2000 total points
ID: 40589926
I strongly agree with using a staging table.  

Load the raw file into a staging table, like "GeoPC_Places__Staging", then use a query to load the final "GeoPC_Places" table.

Drop the Elevation from the GeoPC_Places table, since you don't need it anyway, and then try loading it this way and see if you still get dups:

INSERT INTO GeoPC_Places
SELECT DISTINCT
ISO,
Country,
Language,
ID,
Region1,
Region2,
Region3,
Region4,
Locality,
Postcode,
Suburb,
Latitude,
Longitude,
ISO2,
FIPS,
NUTS,
HASC,
STAT,
Timezone,
UTC,
DST
FROM GeoPC_Places__Staging
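
Note that DISTINCT only collapses these particular duplicates because Elevation, the one column that differs between them, has been dropped; if you ever need to keep Elevation, you'd have to pick one row per (Language, ID) some other way. As a quick sanity check after the load (the difference between the two counts is the number of duplicate rows that DISTINCT collapsed):

-- Compare staging row count to final row count after the deduplicating insert.
SELECT
  (SELECT COUNT(*) FROM GeoPC_Places__Staging) AS StagingRows,
  (SELECT COUNT(*) FROM GeoPC_Places)          AS FinalRows;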
 

Author Comment

by:bbdesign
ID: 40591180
Thanks for your advice!
