I am importing a table with the geographic coordinates of about 5.5 million places on the globe. The table is rather large: 5.5 million records and about 30 data fields. The file is a tab-delimited text file. I have two questions regarding this data:

IMPORT
How can I best import this data into my SQL Server 2000 database? What I did was upload it to my server and import it as a new table. The problem is that while I can easily drop the columns I do not need, all the remaining fields are set to CHAR(8000), which is much too big. Changing these columns to, for instance, CHAR(200) takes a very long time... in fact it seems my server never finishes changing even one field! Would it be a better idea to create an empty table first with the specs that I need and then import into it?
Also: what is the SQL Server data type for decimal numbers? I need a data type that will support values like 51.4166667, so at least 6-8 decimals.

CALCULATIONS
Once I have this table inserted, I want to use it to look up the coordinates of locations on the globe. For instance, if I have something that is linked to Washington, then I know the coordinates of Washington. Now I want to be able to search our database for things that are within a 10-mile radius of Washington. For this I need to be able to query a table including a calculation. Is this possible? Perhaps in a stored procedure?

An example of those calculations:

Simplest Calculation:
Distance = sqrt(x * x + y * y)
where x = 69.1 * (lat2 - lat1)
and y = 53.0 * (lon2 - lon1)

More Precise Calculation:
Distance = sqrt(x * x + y * y)
where x = 69.1 * (lat2 - lat1)
and y = 69.1 * (lon2 - lon1) * cos(lat1/57.3)

Most Precise Calculation:
Distance = 3963.0 * arctan[sqrt(1 - x^2) / x]
where x = [sin(lat1/57.2958) * sin(lat2/57.2958)] + [cos(lat1/57.2958) * cos(lat2/57.2958) * cos(lon2/57.2958 - lon1/57.2958)]

I am fine if only the simplest calculation is feasible. But how would I put this in a query? Let's say that Washington is at latitude = x and longitude = y; how would I query for places within 10 miles of there?

SELECT * FROM tblX WHERE sqrt(69.1 * (LATITUDE - x) * 69.1 * (LATITUDE - x) + 53.0 * (LONGITUDE - y) * 53.0 * (LONGITUDE - y)) < 10

Can you query like this? If so, does SQL Server support sin, cos, tan, arcsin, arccos, and arctan? What would be the expected impact of such a query? I suppose it would be a very heavy query to run, right?

Regarding the import: definitely create a table with the correct columns up front, not strings. A subsequent ALTER TABLE on 5.5 million rows will take a long time.
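A minimal sketch of pre-creating the target table before the import; the table and column names and sizes here are made up for illustration, so adjust them to your actual 30 fields:

```sql
-- Hypothetical target table; names and sizes are assumptions, not from the source file.
CREATE TABLE tblPlaces (
    PlaceName   VARCHAR(200)  NOT NULL,  -- VARCHAR avoids the CHAR(8000) padding problem
    CountryCode CHAR(2)       NULL,
    Latitude    DECIMAL(10,7) NOT NULL,  -- 51.4166667 fits: up to 3 integer digits + 7 decimals
    Longitude   DECIMAL(10,7) NOT NULL
)
```

Importing into this table directly means the types are right from the start and no 5.5-million-row ALTER is ever needed.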
Regarding decimals: in banking we use NUMERIC because it allows you to specify the precision (unlike FLOAT), e.g. NUMERIC(20,12).

Yes, this query should work if the math is right. 5 million rows really isn't all that much, but I'm not going to wager a guess as to how long that will run for.

If this is something you are going to run regularly, I would consider adding extra columns with precomputed distances from major cities (assuming those would make up the majority of the queries). You could have a dozen or so extra columns such as DIST_FROM_NYC and DIST_FROM_PHILA. This would cut the calculation time for those queries to seconds.
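A sketch of that precomputation, assuming the hypothetical tblPlaces table from above; the one-time UPDATE over 5.5 million rows is slow, but every later lookup only filters an indexed column:

```sql
-- Hypothetical column and table names; 40.7128/-74.0060 are NYC's coordinates.
ALTER TABLE tblPlaces ADD Dist_From_NYC FLOAT NULL

-- One-time population using the "simplest" flat-earth formula from the question.
UPDATE tblPlaces
SET Dist_From_NYC = SQRT(SQUARE(69.1 * (Latitude - 40.7128))
                       + SQUARE(53.0 * (Longitude - (-74.0060))))

CREATE INDEX IX_Places_DistNYC ON tblPlaces (Dist_From_NYC)
```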

Regarding locations: I'm no cartographer, but if you're storing coordinates of fixed places, then using your formulae above you can find all points in your table within a certain tolerance (say 10%) of 10 miles.


The best way to import the file would be to use DTS (Data Transformation Services).
You probably should create your target table yourself because, as you found, SQL Server cannot detect the correct data types automatically from a text file.

The data type you want is probably FLOAT, but you could possibly use DECIMAL(10,8)

You would be best off creating a function to do your trig - and you can do the most precise calculation quite easily:

CREATE FUNCTION dbo.calc_distance (@lon1 float, @lat1 float, @lon2 float, @lat2 float)
RETURNS float
AS
BEGIN
    DECLARE @x float
    SET @x =
        (sin(@lat1/57.2958) * sin(@lat2/57.2958)) + (cos(@lat1/57.2958) * cos(@lat2/57.2958) * cos(@lon2/57.2958 - @lon1/57.2958))
    IF @x > 1.0 SET @x = 1.0  -- guard against rounding error for (near-)identical points
    -- acos(@x) = arctan(sqrt(1 - @x^2) / @x), so this is the "most precise" formula in miles
    RETURN 3963.0 * acos(@x)
END

then you can query your table as:
SELECT * FROM MYTABLE
WHERE dbo.calc_distance(longitude, latitude, X, Y) < 10.0

substituting X,Y with the actual values for Washington or wherever you want....

If your data is very big, this will take a long time.
One possible way to speed it up would be to add some extra WHERE clauses, e.g.
AND longitude BETWEEN X - 10 AND X + 10
AND latitude BETWEEN Y - 10 AND Y + 10

or some such thing, i.e. without any calculations or functions.
Then you could put an index on these columns, so the query can narrow down to this bounding square very quickly and apply the accurate calculation only to the small number of remaining candidates.
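A sketch of that combined approach, with made-up table and parameter names; the 0.2-degree box half-width is an illustrative over-estimate of 10 miles, not a computed value:

```sql
-- Hypothetical names; the index lets SQL Server seek on the bounding box first.
CREATE INDEX IX_Places_LatLon ON tblPlaces (Latitude, Longitude)

-- @lat/@lon: search centre; @margin: box half-width in degrees, assumed
-- safely larger than 10 miles (0.2 degrees of latitude is roughly 14 miles).
DECLARE @lat FLOAT, @lon FLOAT, @margin FLOAT
SELECT @lat = 38.9, @lon = -77.0, @margin = 0.2   -- roughly Washington, DC

SELECT *
FROM tblPlaces
WHERE Latitude  BETWEEN @lat - @margin AND @lat + @margin
  AND Longitude BETWEEN @lon - @margin AND @lon + @margin
  AND dbo.calc_distance(Longitude, Latitude, @lon, @lat) < 10.0
```

The two BETWEEN predicates are sargable, so the index does the heavy filtering; the expensive function runs only on the handful of rows inside the square.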

>> How can I best import this data into my SQL Server 2000 database? What I did now was to import it by uploading it to my server and importing it as a new table. <<

Use BULK INSERT since that is the fastest way to import data.
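A sketch of the BULK INSERT syntax for a tab-delimited file; the file path and table name are placeholders to adjust for your environment:

```sql
-- Hypothetical path and target table.
BULK INSERT tblPlaces
FROM 'C:\import\places.txt'
WITH (
    FIELDTERMINATOR = '\t',   -- tab-delimited fields
    ROWTERMINATOR   = '\n',
    TABLOCK                   -- reduces logging overhead on large loads
)
```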

Agree with others on data typing: *unless there is non-numeric data* in the input, use a numeric data type if the data is numeric. And I too recommend using DECIMAL(nn, n) rather than float because of the exactness.

You should start with a select statement to find the points in a square around the search point. If you place an index on the LONGITUDE and LATITUDE columns, this should go reasonably fast. After that, you can select a calculated field which contains the distance, and perform a check on that field.

In a stored procedure, you could get something like:

SELECT *
FROM (SELECT *,
             sqrt((LATITUDE - @X) * (LATITUDE - @X) + (LONGITUDE - @Y) * (LONGITUDE - @Y)) AS DISTANCE
      FROM tblCoordinates
      WHERE LATITUDE BETWEEN @X - 10 AND @X + 10
        AND LONGITUDE BETWEEN @Y - 10 AND @Y + 10) t1
WHERE DISTANCE <= 10


pgkooijmanAuthor Commented:

Wow, lot of good ideas. Will let you know how that works out.....

First things first... I imported the text file again and all went very well. The table I imported contains place names and their coordinates, so the queries run against it will mostly be: SELECT Lat,Lon FROM tblX WHERE PlaceName LIKE '%placename%' or SELECT Lat,Lon FROM tblX WHERE PlaceName='placename'

How can I speed these queries up? I tried making PlaceName a unique key, but that took forever and then gave a duplicate key error. What are other tricks? I am just not great at SQL Server... I suppose placing an index would do the trick, but how do I do that best? How do I do it at all?

It's a good idea to use a unique key; sounds like a composite key of the coordinates fits the bill here, since no two locations can have the same coordinates (I don't know the implementation, but I assume as much).

You can then create a separate nonunique (nonclustered) index on PlaceName
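A sketch of both suggestions, using hypothetical table and column names:

```sql
-- Composite unique key on the coordinates
-- (assumption: no two rows share a coordinate pair).
ALTER TABLE tblPlaces ADD CONSTRAINT UQ_Places_LatLon UNIQUE (Latitude, Longitude)

-- Nonclustered, non-unique index to speed up PlaceName = 'placename' lookups.
CREATE NONCLUSTERED INDEX IX_Places_Name ON tblPlaces (PlaceName)
```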

Guy Hengel [angelIII / a3]Billing EngineerCommented:

>> WHERE PlaceName LIKE '%placename%' <<
you cannot speed that query up unless you remove the leading %.
Indexes are not of any help here, unless you start using full-text indexing.

>> I tried making PlaceName a unique key but
Of course, don't use a UNIQUE key, but a simple key (without the unique property). To locate (after the index creation) which entries are duplicated:
select PlaceName, count(*)
from YourTable
group by PlaceName
having count(*)>1

Cheers


pgkooijmanAuthor Commented:

I have divided the points according to the answers that gave the solutions I needed. I will start a new question with the remaining open issues; this thread is getting a bit chaotic. Thanks for all the help!

