Link to home
Start Free TrialLog in
Avatar of trevor1940
trevor1940

asked on

PostgreSQL / PostGIS: Joining two tables using nearest neighbor

I have two tables with point geometry both with SRID 4326  (WGS84)

How do refine the query bellow to loop through every row / point in the pointonsurface  table locating it's  nearest neighbor  in the geonames_org table?

Given this is WGS84 distance is in degrees which is fine
CREATE TABLE pointonsurface_link_geonames(
   id SERIAL PRIMARY KEY,
   pos_id int,
   geoname_id int,
   distance int
   ); 

-- run in psgl
INSERT into pointonsurface_link_geonames (pos_id,geoname_id,distance)
SELECT p.id as pos_id, g.geoname_id as gn_id, p.geometry <-> g.geometry AS dist 
FROM pointonsurface p, geonames_org g
WHERE g.fcode LIKE 'ADM%'
ORDER BY dist LIMIT 1;

Open in new window


UPDATE:

I think this can be done using LATERAL JOIN
Something like this

SELECT id FROM pointonsurface AS pid LIMIT 5 -- LIMIT set just to test
 CROSS JOIN LATERAL (
    INSERT into pointonsurface_link_geonames (pos_id,geoname_id,distance)
      SELECT p.id as pos_id, g.geoname_id as gn_id, p.geometry <-> g.geometry AS dist 
      FROM pointonsurface p, geonames_org g
      WHERE g.fcode LIKE 'ADM%' AND p.id = pid
      ORDER BY dist LIMIT 1;
) g
GROUP BY pid
ORDER BY pid;

Open in new window

Gives:

ERROR:  syntax error at or near "CROSS"
Avatar of slightwv (䄆 Netminder)
slightwv (䄆 Netminder)

Cannot help at all with the GIS aspect.

Check the docs for the syntax on LIMIT:
https://www.postgresql.org/docs/12/queries-limit.html

It goes at the end...

You have it right in the inner query.

Also looks like your insert is in the wrong position.

So try:
INSERT into pointonsurface_link_geonames (pos_id,geoname_id,distance)
SELECT id FROM pointonsurface AS pid
 CROSS JOIN LATERAL (
         SELECT p.id as pos_id, g.geoname_id as gn_id, p.geometry <-> g.geometry AS dist 
      FROM pointonsurface p, geonames_org g
      WHERE g.fcode LIKE 'ADM%' AND p.id = pid
      ORDER BY dist LIMIT 1;
) g
GROUP BY pid
ORDER BY pid
 LIMIT 5 -- LIMIT set just to test;

Open in new window

Avatar of trevor1940

ASKER

Hi
I'm getting the following error

ERROR:  operator does not exist: integer = pointonsurface
LINE 6:       WHERE g.fcode LIKE 'ADM%' AND p.id = pid
                                                 ^
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.
SQL state: 42883
Character: 305

Open in new window

I tried to CAST but the pointonsurface id is already an integer
INSERT into pointonsurface_link_geonames (pos_id,geoname_id,distance)
SELECT CAST (id AS INTEGER) FROM pointonsurface AS pid
 CROSS JOIN LATERAL (
         SELECT p.id as pos_id, g.geoname_id as gn_id, p.geometry <-> g.geometry AS dist 
      FROM pointonsurface p, geonames_org g
      WHERE g.fcode LIKE 'ADM%' AND p.id = pid
      ORDER BY dist LIMIT 1
) g
GROUP BY pid
ORDER BY pid
 LIMIT 5 -- LIMIT set just to test;

Open in new window

>>I tried to CAST but the pointonsurface id is already an integer

I already mentioned I cannot help with the GIS aspects at all.  As far as the error, looks like you might need to rethink your design so you have compatible data types.

BUT, you no longer the the original error...
The only GIS part is "p.geometry <-> g.geometry" replacing it with  a number gives the same error
However replacing p.id = pid with p.id = 123 gives the error bellow

INSERT into pointonsurface_link_geonames (pos_id,geoname_id,distance)
SELECT id  FROM pointonsurface AS pid
 CROSS JOIN LATERAL (
         SELECT p.id as pos_id, g.geoname_id as gn_id, 10 AS dist 
      FROM pointonsurface p, geonames_org g
      WHERE g.fcode LIKE 'ADM%' AND p.id = 123
      ORDER BY dist LIMIT 1
) g
GROUP BY pid
ORDER BY pid
 LIMIT 5 -- LIMIT set just to test;

Open in new window


Error
ERROR:  column "pid.id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 3: SELECT id FROM pointonsurface AS pid
               ^
SQL state: 42803
Character: 79

Open in new window

Error is pretty straight forward.

Why do you have a "group by"?  If you have it, ALL non-aggregated columns MUST be in the group by.  I don't see any aggregate functions so not sure why you have the group by.
OK removing the GROUP BY and ignoring the insert for now
I think I need to  work out how to pass the id from the outer query into the inner query returning the 3 values back to the outer query
Any idea how I do that? Google hasn't helped
Not sure.   You really doesn't use values between things with inline views.  What does your query look like now?
From this page I've got this

SELECT * FROM (
   SELECT id AS pid FROM pointonsurface 
   ) pos
 LEFT JOIN LATERAL (
         SELECT p.id, g.geoname_id, 10 AS dist 
      FROM pointonsurface p, geonames_org g
      WHERE g.fcode LIKE 'ADM%' AND p.id =pos.pid
      ORDER BY dist LIMIT 1
) 

ORDER BY pos.pid
 LIMIT 5 -- LIMIT set just to test;

Open in new window


How would you of done this is Oracle?
I'm not all that familiar with the LATERAL join in Postgres so not really sure how to port it to Oracle.

I understand you want the 5 closest but not sure you have to use a lateral join to do so.

If you can provide a simplified test case, with data, and expected results, I can try to provide the SQL on how I would solve it.  If you can simplify it down to the point that doesn't involve GIS points, I would appreciate it.
OK
Will try and knock something up in the morning
OK this is a bit contrived but hopefully  it shows what I'm trying to do

If I was doing it in an external script like perl or C# I'd do something like this pseudo code
my res = SELECT pos_id FROM loop_tbl;

foreach(my pid in res){
   num1 = SELECT num FROM loop_tbl WHERE pos_id = pid;
  (geoname_id, num2) = SELECT geoname_id, num FROM data_tbl;
   dist = num1 / num2; 
   INSERT INTO link_tbl(pos_id,geoname_id,distance)
   VALUES (pid,geoname_id,dist);
}

Open in new window




SQL

CREATE TABLE loop_tbl(
   pos_id SERIAL PRIMARY KEY,
   num int
);

CREATE TABLE data_tbl(
   geoname_id SERIAL PRIMARY KEY,
   num int
);

CREATE TABLE link_tbl(
   id SERIAL PRIMARY KEY,
   pos_id int,
   geoname_id int,
   distance double precision
);


INSERT INTO loop_tbl(num) VALUES(5);
INSERT INTO loop_tbl(num) VALUES(6);
INSERT INTO loop_tbl(num) VALUES(7);

INSERT INTO data_tbl(num) VALUES(2);

--SELECT round( CAST(float8 '3.1415927' as numeric), 2);

select * FROM loop_tbl;
select * FROM data_tbl;

-- Expected in link_tbl
-- 1,1,2.5
-- 2,1,3
-- 3,1,3.5

-- FROM previous efforts

INSERT INTO link_tbl(pos_id,geoname_id,distance)
SELECT * FROM (
   SELECT pos_id AS pid FROM loop_tbl
   ) pos
 LEFT JOIN LATERAL (
          SELECT p.pos_id, g.geoname_id, p.num / g.num AS dist 
      FROM lopo_tbl p, data_tbl g
      p.pos_id = pos.pid
      ORDER BY dist LIMIT 1 
) 

ORDER BY pos.pid
 LIMIT 5 -- LIMIT set just to test;

Open in new window



NOTE: the math is just to replace the GIS function which finds the distance between points by putting ORDER BY dist LIMIT 1 this finds the closest point


I'm wondering if the inner query can or should be done as a function that takes the id from the first table, performs the computations and returns the 3 required values?  
Since the inner query uses the same table in the outer query, not seeing the need for the outer query.  Creating a function will likely make it more complex than it needs to be.

I saw the test case you posted this morning.  I haven't had time to get around to looking at it yet.
ASKER CERTIFIED SOLUTION
Avatar of slightwv (䄆 Netminder)
slightwv (䄆 Netminder)

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Thanx
This works with the mock data unsure how to scale it up for the real data maybe I either write external script or create a  new question
You still probably don't need an external script or function or anything else other than straight SQL.

We can continue to talk about scaling it up here as long as it doesn't deviate away from the original question.

What don't you think will scale?
I need to be able to loop through through the "loop_tbl" so I can use a WHERE clause inside the inner query
...... from opening remarks
-- DO GIS Query 
      WHERE g.fcode LIKE 'ADM%' AND p.id = pid
       ORDER BY dist LIMIT 1; 

Open in new window


The GIS & above basically means the closest point to a given point otherwise it  will  find the distance to every point

if the loop_tbl has 10 points and the data has 100
I'm expecting a 1 to 1 link so the link_tbl will have 10 rows not 1,000

To simulate add another row to the data_tbl

Hope that makes sense 
You don't want the closest distance to ALL values in the data table?

Even if you don't, just create a CTE to do them 1 by 1 and cross join that?
insert into link_tbl(pos_id,geoname_id, distance)
with single_data_row as (
   select * from data_tbl limit 1
)
select l.pos_id,d.geoname_id, l.num/d.num::decimal
from loop_tbl l cross join single_data_row d;

Open in new window



Fiddle here:
https://dbfiddle.uk/?rdbms=postgres_12&fiddle=132882393a60fb813dc5ce7fb7b3b171
>>  You don't want the closest distance to ALL values in the data table?
no
point 1 in loop table find the closest point in the data table The GIS does this without the LIMIT it will return the distance to every point and crash the database

Put another way if  the loop table were a list of bust stops and  the data table a list of pubs with open or closed bool
I'm  at 1 bus stop  find me the nearest open pub I don't need every pub in the data table

Then find the nearest pub to each bus stop


Your query gives

idpos_idgeoname_iddistance
1112.5
2213
3313.5
If
INSERT INTO data_tbl(num) VALUES(999);
INSERT INTO data_tbl(num) VALUES(123);

Open in new window

Assuming 3 row in each table I would expect something like this:  but like I said the data table has a lot more points than the loop table

idpos_idgeoname_iddistance
1112.5
222166.5
33317.57

If my math is correct

In theory 1 point /pub in the data table could be closest  to the same point / bus stop in the loop table but we'll skip that for now 
>>I'm  at 1 bus stop  find me the nearest open pub I don't need every pub in the data table

But to find the nearest you have to look at ALL.

>>Then find the nearest pub to each bus stop

Doesn't that contradict the "no" response to my question "You don't want the closest distance to ALL values in the data table"?


>>Your query gives

It grabs a random single row from data_tbl.  Could be the first, could be the last, could be any one somewhere in the middle.

>>I would expect something like this

Not sure how you get that and I don't have to.  Take the SQL I provided and apply your logic to it?
I think we maybe we have our wires crossed

The GIS query uses the GIST index to find the distance from a bus stop to every pub so you might in assumption 
Then use ORDER BY dist & LIMIT 1 to return the closest

I run your fiddle again I got the same results

Yes, we have definitely crossed wires.  You understand your exact requirements.  Many times trying to explain them to others in a Q&A site like this can be a problem.  This is why we ask for sample data and the results from that data.  It helps us visualize the requirements you are trying to explain.

You can use ROW_NUMBER to get the "first" of things.  I don't understand the need to use LIMIT and ORDER by in some inner query.

I can only go by the data you provide and base SQL on what I believe your requirements are.

>>Assuming 3 row in each table I would expect something like this:

I would need to see more data in the test case and the expected results from the data you provide.
>> I would need to see more data in the test case and the expected results from the data you provide.

Very hard to provide mock geo data when the question isn't directly related and convert it to a non geo example

Taking my bus route scenario
The route through MyTown has 15 stops 8 going north & 7 going south some are open & some are closed
MyTown has 57 pubs

At each stop I need to find 1 open pub with a garden (limit & distance) I have time to have 1 drink then catch the next bus

I'm at stop 1 The Angel is 50m away open  & the Ferret is  5km and closed  The Pie & Eel is 3km but only 600m from stop 3  & opens in 1 hour
At stop 1 I'm not concerned with the Ferret or The Pie & Eel are because they are closed I'm can buy a Drink in The Angel
At stop 2 The Green Man is Open I don't care about the Black Cat it is closed
by the time i get to stop 3 The Pie & Eel  is now open

At each bus stop you effectively search all 57 pubs returning 1 that is open because the pub status changes in  relation to each stop

Now do you see why a new search needs to be made at each stop?
Unsure how I can convert that into an database with out adding more complexity to change the status of The Pie & Eel assuming it happens
https://dbfiddle.uk/?rdbms=postgres_12&fiddle=a3c8f18a4c07df2a6b047035c3d826b6

I'd expect (ignore distance my math not that good)
idpos_idgeoname_idpub_namedistance
111The Angel15724.03777240155
225
The Green Man
141501.61141908905
334
 The Pie & Eel  
298679.3506199334


>> You can use ROW_NUMBER to get the "first" of things.  
Will that not always return the lowest value




SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Took me a while to work out where in the previous SQL you put the insert statement but got there 

with all_open as (
   select l.pos_id,d.geoname_id, pub_name,l.geog <-> d.geog  AS distance
   from loop_tbl l cross join data_tbl d
   where open
),
ranked as (
   select pos_id, geoname_id, pub_name, distance,
      row_number() over(partition by pos_id order by distance asc) rn
   from all_open
)
INSERT INTO link_tbl(pos_id,geoname_id, pub_name,distance) 
select pos_id, geoname_id, pub_name,  round(CAST(distance  As numeric),2) 
from ranked
where rn=1

Open in new window


BTW: I hope you realize the Bus route pub crawl is just my attempt at trying to explain my logic  

I really appreciate your help on this