PostgreSQL: Copy from File missing data error

I'm trying to build a PostgreSQL/PostGIS database using Entire country files dataset but I'm getting missing data error

I'm wondering if the copy command is affected by diacritics or I've not set the database up properly
Created a new database with UTF8 encoding

I built the table schema based on the given format (but using type text for NM_MODIFY_DATE not varchar,  having these as dates didn't make a diferance)
PostgreSQL docs state fields can be NULL by default  so

create table my_table(
id SERIAL PRIMARY KEY,
test1 text;  -- can be null
test2 text NOT NULL;  -- Has to have a value
}

Open in new window


In my  theory every column apart from id doesn't have to contain a value
 
I used large text viewer to open the uncompressed countries.txt file and copied the top 5 rows into a test file

Using   PostgreSQL Copy this test file imported correctly so I know my schema is correct
copy my_table(List of columns ) from '\\Path\\To\\TestFile.txt' WITH delimiter E'\t' csv HEADER;

Open in new window



When tried to ingest the larger  countries.txt (2.9GB) file I get an error "missing data" for column xyz at line 12345 (Last column with  data in that row, NM_MODIFY_DATE)

Using large text viewer again I located this line with proceeding and following lines into my test file and tried the copy again but get the same error

I opened the test file in excel to  see if there is columns  missing although not every row has data in each column they do  all match
 
the problem row has UFI -3373955 & UNI 6329083  

I can't edit the large text file to remove the problem row (Software not available on system) also the file is UTF8 and I don't want to corrupt the character set


I contemplated writing a perl script to parse the file replacing the tab with a '|' pipe (and removing the problem row,  although perl nativity reads / writes in UTF8 I don't want to possibly corrupt the file especially the diacritics

I'm going to ask perl experts

PostGreSQL doesn't have the ability to skip problem rows if there is a problem the whole file fails to copy so there could be other rows that errors

Any one got any ideas?
LVL 1
trevor1940Asked:
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

Martyn SpencerManaging DirectorCommented:
I am going to suggest that you ask this on the Postgresql IRC channel. There are a few guys there who are incredibly knowledgeable about this type of question and would probably have an answer for you quite quickly.

Edit: It may be helpful if you want some more answers here to include a file that contains a row that works and one where it does not. I imagine it would also help if you go down the IRC channel route.
0
trevor1940Author Commented:
Postgresql IRC channel

Do you mean here?

I don't know if this is relevant but looking at the database properties the 'collection' &  'Character type' are both set as "English_United Kingdom, 1252 " I  didn't set this and creating a  new DB the options are "C",  "English_United Kingdom, 1252 " or "POSIX"

BTW I'm on PostgreSQL 9.0
0
Martyn SpencerManaging DirectorCommented:
Yes, the IRC channel is on Freenode. The people there are generally very helpful and also quick to respond. I believe that many of the devs can be found there regularly.

It may well be that if you are using a certain encoding that it would cause problems. This would more likely be the case if you are using a single byte encoding scheme and are attempting to store UTF characters. I have not tested this theory for myself and a quick scan of the docs did not indicate whether or not it would be a problem.

I generally create my databases using something like the following:

CREATE DATABASE "database_name"
  WITH OWNER = owner_name
       ENCODING = 'UTF8'
       TABLESPACE = pg_default
       LC_COLLATE = 'en_GB.UTF-8'
       LC_CTYPE = 'en_GB.UTF-8'
       CONNECTION LIMIT = -1;

Open in new window

Naturally, the values of each parameter may need changing to suit your environment, but the ENCODING, LC_COLLATE and LC_CTYPE are relevant to you.

The Postgresql help pages do advise which character sets are supported: https://www.postgresql.org/docs/9.0/static/multibyte.html
0
Big Business Goals? Which KPIs Will Help You

The most successful MSPs rely on metrics – known as key performance indicators (KPIs) – for making informed decisions that help their businesses thrive, rather than just survive. This eBook provides an overview of the most important KPIs used by top MSPs.

trevor1940Author Commented:
Hi
Had to do this from home
clicked on the link pop up box says mibbit
Enter nickname
irc.freenode.net: Terminated

I wasn't able to get the irc chanel to work in any browser so I joined and sent email to the general mailing list
0
trevor1940Author Commented:
I tried the following

CREATE DATABASE "database_name"
  WITH OWNER = owner_name
       ENCODING = 'UTF8'
       TABLESPACE = pg_default
       LC_COLLATE = 'en_GB.UTF-8'
       LC_CTYPE = 'en_GB.UTF-8'
       CONNECTION LIMIT = -1;
-- AND THIS

CREATE DATABASE "database_name"
  WITH OWNER = owner_name
       ENCODING = 'UTF8'
       TABLESPACE = pg_default
       LC_COLLATE = 'en_GB.UTF8'
       LC_CTYPE = 'en_GB.UTF8'
       CONNECTION LIMIT = -1;

Open in new window


both give "ERROR: invalid local name en_GB.UTF8  /  en_GB.UTF-8"

The doc you linked to suggests  "On Windows, however, UTF-8 encoding can be used with any locale."

This Page it isn't possible to set local to  en_GB / en_US

If I'm understanding correctly you don't have to because  UTF-8 encoding is valid with English_United Kingdom, 1252

the general mailing list hasn't yet helped with suggestions of upgrading & to use the unstable pg_bukloader and isn't supported under windows

However 1 suggestion is an special character not being escaped properly

I assumed the copy command would  handle this??

eg

\tO'Tool\t

-- doesn't need to be escaped to

\tO\'Tool\t

Open in new window

0
Martyn SpencerManaging DirectorCommented:
The script that I use works with en_GB.UTF-8, since I am using Linux and it is a valid choice. The LC_COLLATE and LC_CTYPE are probably a default and tied to the client OS you are using. You could omit these and you will find that Postgresql will usually infer them. I would have expected that the file you are importing would be escaped appropriately. So long as the database encoding is set to UTF8 I would expect that it would handle any UTF characters you wish to import.

Do you have the script that you used to generate the table? Do you mind providing it. I will take a look at importing the file myself and report back my findings.
0
trevor1940Author Commented:
Do you have the script that you used to generate the table? Do you mind providing it. I will take a look at importing the file myself and report back my findings.

That isn't simple to do it's on a closed system

I can retype it here if that will help?
0
Martyn SpencerManaging DirectorCommented:
Yes, please. I would like to create the same table in one of my databases and import the same file to see first-hand what is happening, if that's OK?
0
trevor1940Author Commented:
CREATE DATABASE geonames
  WITH OWNER = postgres
       ENCODING = 'UTF8'
       TABLESPACE = pg_default
      TEMPLATE = postgis
       LC_COLLATE = 'English_United Kingdom, 1252'
       LC_CTYPE = 'English_United Kingdom, 1252'
       CONNECTION LIMIT = -1;


CREATE TABLE  features(
  id SERIAL PRIMARY KEY,
rc INTEGER,
ufi INTEGER,
uni INTEGER,
lat REAL,
long REAL,
dms_lat text,
dms_long text,
mgrs CHARACTER(15); 
jog CHARACTER(7);
fc CHARACTER(1);
dsg CHARACTER(5);
pc INTEGER,
ccl CHARACTER(255);
adm1 CHARACTER(2);
pop INTERGER,
elev REAL,
cc2 CHARACTER(255);
nt CHARACTER(2);
lc CHARACTER(3);
short_form CHARACTER(128);
generic CHARACTER(28);
short_name_ro  CHARACTER(255);
full_name_ro  CHARACTER(255);
full_name_nd_ro  CHARACTER(255);
short_name_rg  CHARACTER(255);
full_name_rg  CHARACTER(255);
full_name_nd_rg  CHARACTER(255);
note TEXT,
mod_date TEXT;
name_rank  TEXT;
name_link  TEXT;
transl_cd  TEXT;
nm_modify_date  TEXT;
f_effective_dt  TEXT;
f_term_dt TEXT


);

Open in new window


the above has been typed in so may contain typo's it was ordinarily taken from the header row of the countrynames.txt
the column types from given format

I then used the copy command as in opening remarks

hope that helps
0
Martyn SpencerManaging DirectorCommented:
Thanks. Taking a look now. I have replaced the semicolons with commas in your script. I would also (for your own reference) change your longer "character" columns to "character varying" since the Postgresql docs imply that the extra storage required by "character" can impose a small performance penalty. This will not be related to the error you are experiencing.
0
trevor1940Author Commented:
I have replaced the semicolons with commas in your script

Ooops sorry I did say there may be typo's

I down loaded the separate country files and found the problem row

looking at it I'm wondering if I've been barking up the wrong tree and nothing to with UTF-8 but more to do with not escaping the quotes

4	-3373955	6329083	32.739348	64.994051	32:44:22N	64:59:39E	41SPS8683824150	NI41-16	P	PPL		AF	10				V	rus			DEKHYAK	Dekh”yak	Dekh"yak	DEKHYAK	Dekh”yak	Dekh"yak		2010-05-03	1,2,3,4,5				2005-06-01

Open in new window


It may not be obvious from this it looks as if there are 2 different double quotes not 2 singles as I thought I've uploaded a snip which is clearer GeoN.JPG
If this is the issue I'm guessing the only way to resolve this will via a perl script
0
Martyn SpencerManaging DirectorCommented:
It looks as though when the file was generated it used tab delimiters but they have also duplicated the quotes. With tab delimiters I would not have expected it to be necessary to do this and I would not have thought it necessary to escape them either, since there is no conflict with the delimiter chosen. What I expect, and what is, may not necessarily prove to be the case. If the usage of quotes is completely consistent, you could more than likely use sed to modify the file (or a perl script if you prefer). I will still have a go at importing the file shortly and see if there is a way of handling the file as-is, rather than tweaking it. This is for personal interest now :)
0
trevor1940Author Commented:
I agree with you I'd rater use data in it's raw form 1st so my actions don't corrupt the data, I'm confidant they won't but it's a possibility especially when dealing with diacritics and large data sets 2nd We may wish to update this periodicaly

"sed" Is that a Linux  command?
0
Martyn SpencerManaging DirectorCommented:
sed is a Linux command, yes. But there is a Windows alternative here (probably others exist as well) and if you look at Cygwin, you would definitely be able to use it. Cygwin is basically a command interpreter environment that lets you run Linux commands on Windows. I used it a lot before i transitioned to Linux for my main environment.
0
Martyn SpencerManaging DirectorCommented:
OK. Looking at the file, there are " characters. A workaround for me was to use the QUOTE parameter and set it to the ~ character, since it did not exist in the file. Naturally, the columns will still have the " character in there, but that may have been intentional. You could remove this character and replace it with something else.

I also noticed that you were missing a column. My definition that worked was:

CREATE TABLE countries
(
  rc integer,
  ufi integer,
  uni integer,
  lat real,
  "long" real,
  dms_lat text,
  dms_long text,
  mgrs character varying(15),
  jog character varying(7),
  fc character varying(1),
  dsg character varying(5),
  pc integer,
  ccl character varying(255),
  adm1 character varying(2),
  pop integer,
  elev real,
  cc2 character varying(255),
  nt character varying(2),
  lc character varying(3),
  short_form character varying(128),
  generic character varying(128),
  short_name_ro character varying(255),
  full_name_ro character varying(255),
  full_name_nd_ro character varying(255),
  short_name_rg character varying(255),
  full_name_rg character varying(255),
  full_name_nd_rg character varying(255),
  note text,
  modify_date text,
  display text, 
  name_rank text,
  name_link text,
  transl_cd text,
  nm_modify_date text,
  f_effective_dt text,
  f_term_dt text
)
WITH (
  OIDS=FALSE
);

Open in new window


And the import command I used was:
copy countries from '/tmp/Countries.txt' with delimiter E'\t' csv HEADER QUOTE '~';

Open in new window


If you want the ID column, you can add it and then specify the columns as you did.
0

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
trevor1940Author Commented:
WOW!

Thanx very much

That appears to be working I've left it running
I did use an id field primarily because it's a requirement of QGIS and simple row location
0
Martyn SpencerManaging DirectorCommented:
Excellent. Glad to hear that it's working for you :)
0
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
PostgreSQL

From novice to tech pro — start learning today.