trevor1940
asked on
PostgreSQL / PostGIS: Joining two tables using nearest neighbor
I have two tables with point geometry both with SRID 4326 (WGS84)
How do refine the query bellow to loop through every row / point in the pointonsurface table locating it's nearest neighbor in the geonames_org table?
Given this is WGS84 distance is in degrees which is fine
UPDATE:
I think this can be done using LATERAL JOIN
Something like this
ERROR: syntax error at or near "CROSS"
How do refine the query bellow to loop through every row / point in the pointonsurface table locating it's nearest neighbor in the geonames_org table?
Given this is WGS84 distance is in degrees which is fine
CREATE TABLE pointonsurface_link_geonames(
id SERIAL PRIMARY KEY,
pos_id int,
geoname_id int,
distance int
);
-- run in psgl
INSERT into pointonsurface_link_geonames (pos_id,geoname_id,distance)
SELECT p.id as pos_id, g.geoname_id as gn_id, p.geometry <-> g.geometry AS dist
FROM pointonsurface p, geonames_org g
WHERE g.fcode LIKE 'ADM%'
ORDER BY dist LIMIT 1;
UPDATE:
I think this can be done using LATERAL JOIN
Something like this
SELECT id FROM pointonsurface AS pid LIMIT 5 -- LIMIT set just to test
CROSS JOIN LATERAL (
INSERT into pointonsurface_link_geonames (pos_id,geoname_id,distance)
SELECT p.id as pos_id, g.geoname_id as gn_id, p.geometry <-> g.geometry AS dist
FROM pointonsurface p, geonames_org g
WHERE g.fcode LIKE 'ADM%' AND p.id = pid
ORDER BY dist LIMIT 1;
) g
GROUP BY pid
ORDER BY pid;
Gives:ERROR: syntax error at or near "CROSS"
ASKER
Hi
I'm getting the following error
I'm getting the following error
ERROR: operator does not exist: integer = pointonsurface
LINE 6: WHERE g.fcode LIKE 'ADM%' AND p.id = pid
^
HINT: No operator matches the given name and argument types. You might need to add explicit type casts.
SQL state: 42883
Character: 305
I tried to CAST but the pointonsurface id is already an integerINSERT into pointonsurface_link_geonames (pos_id,geoname_id,distance)
SELECT CAST (id AS INTEGER) FROM pointonsurface AS pid
CROSS JOIN LATERAL (
SELECT p.id as pos_id, g.geoname_id as gn_id, p.geometry <-> g.geometry AS dist
FROM pointonsurface p, geonames_org g
WHERE g.fcode LIKE 'ADM%' AND p.id = pid
ORDER BY dist LIMIT 1
) g
GROUP BY pid
ORDER BY pid
LIMIT 5 -- LIMIT set just to test;
>>I tried to CAST but the pointonsurface id is already an integer
I already mentioned I cannot help with the GIS aspects at all. As far as the error, looks like you might need to rethink your design so you have compatible data types.
BUT, you no longer the the original error...
I already mentioned I cannot help with the GIS aspects at all. As far as the error, looks like you might need to rethink your design so you have compatible data types.
BUT, you no longer the the original error...
ASKER
The only GIS part is "p.geometry <-> g.geometry" replacing it with a number gives the same error
However replacing p.id = pid with p.id = 123 gives the error bellow
Error
However replacing p.id = pid with p.id = 123 gives the error bellow
INSERT into pointonsurface_link_geonames (pos_id,geoname_id,distance)
SELECT id FROM pointonsurface AS pid
CROSS JOIN LATERAL (
SELECT p.id as pos_id, g.geoname_id as gn_id, 10 AS dist
FROM pointonsurface p, geonames_org g
WHERE g.fcode LIKE 'ADM%' AND p.id = 123
ORDER BY dist LIMIT 1
) g
GROUP BY pid
ORDER BY pid
LIMIT 5 -- LIMIT set just to test;
Error
ERROR: column "pid.id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 3: SELECT id FROM pointonsurface AS pid
^
SQL state: 42803
Character: 79
Error is pretty straight forward.
Why do you have a "group by"? If you have it, ALL non-aggregated columns MUST be in the group by. I don't see any aggregate functions so not sure why you have the group by.
Why do you have a "group by"? If you have it, ALL non-aggregated columns MUST be in the group by. I don't see any aggregate functions so not sure why you have the group by.
ASKER
OK removing the GROUP BY and ignoring the insert for now
I think I need to work out how to pass the id from the outer query into the inner query returning the 3 values back to the outer query
Any idea how I do that? Google hasn't helped
I think I need to work out how to pass the id from the outer query into the inner query returning the 3 values back to the outer query
Any idea how I do that? Google hasn't helped
Not sure. You really doesn't use values between things with inline views. What does your query look like now?
ASKER
From this page I've got this
How would you of done this is Oracle?
SELECT * FROM (
SELECT id AS pid FROM pointonsurface
) pos
LEFT JOIN LATERAL (
SELECT p.id, g.geoname_id, 10 AS dist
FROM pointonsurface p, geonames_org g
WHERE g.fcode LIKE 'ADM%' AND p.id =pos.pid
ORDER BY dist LIMIT 1
)
ORDER BY pos.pid
LIMIT 5 -- LIMIT set just to test;
How would you of done this is Oracle?
I'm not all that familiar with the LATERAL join in Postgres so not really sure how to port it to Oracle.
I understand you want the 5 closest but not sure you have to use a lateral join to do so.
If you can provide a simplified test case, with data, and expected results, I can try to provide the SQL on how I would solve it. If you can simplify it down to the point that doesn't involve GIS points, I would appreciate it.
I understand you want the 5 closest but not sure you have to use a lateral join to do so.
If you can provide a simplified test case, with data, and expected results, I can try to provide the SQL on how I would solve it. If you can simplify it down to the point that doesn't involve GIS points, I would appreciate it.
ASKER
OK
Will try and knock something up in the morning
Will try and knock something up in the morning
ASKER
OK this is a bit contrived but hopefully it shows what I'm trying to do
If I was doing it in an external script like perl or C# I'd do something like this pseudo code
SQL
NOTE: the math is just to replace the GIS function which finds the distance between points by putting ORDER BY dist LIMIT 1 this finds the closest point
If I was doing it in an external script like perl or C# I'd do something like this pseudo code
my res = SELECT pos_id FROM loop_tbl;
foreach(my pid in res){
num1 = SELECT num FROM loop_tbl WHERE pos_id = pid;
(geoname_id, num2) = SELECT geoname_id, num FROM data_tbl;
dist = num1 / num2;
INSERT INTO link_tbl(pos_id,geoname_id,distance)
VALUES (pid,geoname_id,dist);
}
SQL
CREATE TABLE loop_tbl(
pos_id SERIAL PRIMARY KEY,
num int
);
CREATE TABLE data_tbl(
geoname_id SERIAL PRIMARY KEY,
num int
);
CREATE TABLE link_tbl(
id SERIAL PRIMARY KEY,
pos_id int,
geoname_id int,
distance double precision
);
INSERT INTO loop_tbl(num) VALUES(5);
INSERT INTO loop_tbl(num) VALUES(6);
INSERT INTO loop_tbl(num) VALUES(7);
INSERT INTO data_tbl(num) VALUES(2);
--SELECT round( CAST(float8 '3.1415927' as numeric), 2);
select * FROM loop_tbl;
select * FROM data_tbl;
-- Expected in link_tbl
-- 1,1,2.5
-- 2,1,3
-- 3,1,3.5
-- FROM previous efforts
INSERT INTO link_tbl(pos_id,geoname_id,distance)
SELECT * FROM (
SELECT pos_id AS pid FROM loop_tbl
) pos
LEFT JOIN LATERAL (
SELECT p.pos_id, g.geoname_id, p.num / g.num AS dist
FROM lopo_tbl p, data_tbl g
p.pos_id = pos.pid
ORDER BY dist LIMIT 1
)
ORDER BY pos.pid
LIMIT 5 -- LIMIT set just to test;
NOTE: the math is just to replace the GIS function which finds the distance between points by putting ORDER BY dist LIMIT 1 this finds the closest point
ASKER
I'm wondering if the inner query can or should be done as a function that takes the id from the first table, performs the computations and returns the 3 required values?
Since the inner query uses the same table in the outer query, not seeing the need for the outer query. Creating a function will likely make it more complex than it needs to be.
I saw the test case you posted this morning. I haven't had time to get around to looking at it yet.
I saw the test case you posted this morning. I haven't had time to get around to looking at it yet.
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
Thanx
This works with the mock data unsure how to scale it up for the real data maybe I either write external script or create a new question
This works with the mock data unsure how to scale it up for the real data maybe I either write external script or create a new question
You still probably don't need an external script or function or anything else other than straight SQL.
We can continue to talk about scaling it up here as long as it doesn't deviate away from the original question.
What don't you think will scale?
We can continue to talk about scaling it up here as long as it doesn't deviate away from the original question.
What don't you think will scale?
ASKER
I need to be able to loop through through the "loop_tbl" so I can use a WHERE clause inside the inner query
...... from opening remarks
The GIS & above basically means the closest point to a given point otherwise it will find the distance to every point
if the loop_tbl has 10 points and the data has 100
I'm expecting a 1 to 1 link so the link_tbl will have 10 rows not 1,000
To simulate add another row to the data_tbl
Hope that makes sense
...... from opening remarks
-- DO GIS Query
WHERE g.fcode LIKE 'ADM%' AND p.id = pid
ORDER BY dist LIMIT 1;
The GIS & above basically means the closest point to a given point otherwise it will find the distance to every point
if the loop_tbl has 10 points and the data has 100
I'm expecting a 1 to 1 link so the link_tbl will have 10 rows not 1,000
To simulate add another row to the data_tbl
Hope that makes sense
You don't want the closest distance to ALL values in the data table?
Even if you don't, just create a CTE to do them 1 by 1 and cross join that?
Fiddle here:
https://dbfiddle.uk/?rdbms=postgres_12&fiddle=132882393a60fb813dc5ce7fb7b3b171
Even if you don't, just create a CTE to do them 1 by 1 and cross join that?
insert into link_tbl(pos_id,geoname_id, distance)
with single_data_row as (
select * from data_tbl limit 1
)
select l.pos_id,d.geoname_id, l.num/d.num::decimal
from loop_tbl l cross join single_data_row d;
Fiddle here:
https://dbfiddle.uk/?rdbms=postgres_12&fiddle=132882393a60fb813dc5ce7fb7b3b171
ASKER
>> You don't want the closest distance to ALL values in the data table?
no
point 1 in loop table find the closest point in the data table The GIS does this without the LIMIT it will return the distance to every point and crash the database
Put another way if the loop table were a list of bust stops and the data table a list of pubs with open or closed bool
I'm at 1 bus stop find me the nearest open pub I don't need every pub in the data table
Then find the nearest pub to each bus stop
Your query gives
If
If my math is correct
In theory 1 point /pub in the data table could be closest to the same point / bus stop in the loop table but we'll skip that for now
no
point 1 in loop table find the closest point in the data table The GIS does this without the LIMIT it will return the distance to every point and crash the database
Put another way if the loop table were a list of bust stops and the data table a list of pubs with open or closed bool
I'm at 1 bus stop find me the nearest open pub I don't need every pub in the data table
Then find the nearest pub to each bus stop
Your query gives
id | pos_id | geoname_id | distance |
---|---|---|---|
1 | 1 | 1 | 2.5 |
2 | 2 | 1 | 3 |
3 | 3 | 1 | 3.5 |
INSERT INTO data_tbl(num) VALUES(999);
INSERT INTO data_tbl(num) VALUES(123);
Assuming 3 row in each table I would expect something like this: but like I said the data table has a lot more points than the loop tableid | pos_id | geoname_id | distance |
---|---|---|---|
1 | 1 | 1 | 2.5 |
2 | 2 | 2 | 166.5 |
3 | 3 | 3 | 17.57 |
If my math is correct
In theory 1 point /pub in the data table could be closest to the same point / bus stop in the loop table but we'll skip that for now
>>I'm at 1 bus stop find me the nearest open pub I don't need every pub in the data table
But to find the nearest you have to look at ALL.
>>Then find the nearest pub to each bus stop
Doesn't that contradict the "no" response to my question "You don't want the closest distance to ALL values in the data table"?
>>Your query gives
It grabs a random single row from data_tbl. Could be the first, could be the last, could be any one somewhere in the middle.
>>I would expect something like this
Not sure how you get that and I don't have to. Take the SQL I provided and apply your logic to it?
But to find the nearest you have to look at ALL.
>>Then find the nearest pub to each bus stop
Doesn't that contradict the "no" response to my question "You don't want the closest distance to ALL values in the data table"?
>>Your query gives
It grabs a random single row from data_tbl. Could be the first, could be the last, could be any one somewhere in the middle.
>>I would expect something like this
Not sure how you get that and I don't have to. Take the SQL I provided and apply your logic to it?
ASKER
I think we maybe we have our wires crossed
The GIS query uses the GIST index to find the distance from a bus stop to every pub so you might in assumption
Then use ORDER BY dist & LIMIT 1 to return the closest
I run your fiddle again I got the same results
The GIS query uses the GIST index to find the distance from a bus stop to every pub so you might in assumption
Then use ORDER BY dist & LIMIT 1 to return the closest
I run your fiddle again I got the same results
Yes, we have definitely crossed wires. You understand your exact requirements. Many times trying to explain them to others in a Q&A site like this can be a problem. This is why we ask for sample data and the results from that data. It helps us visualize the requirements you are trying to explain.
You can use ROW_NUMBER to get the "first" of things. I don't understand the need to use LIMIT and ORDER by in some inner query.
I can only go by the data you provide and base SQL on what I believe your requirements are.
>>Assuming 3 row in each table I would expect something like this:
I would need to see more data in the test case and the expected results from the data you provide.
You can use ROW_NUMBER to get the "first" of things. I don't understand the need to use LIMIT and ORDER by in some inner query.
I can only go by the data you provide and base SQL on what I believe your requirements are.
>>Assuming 3 row in each table I would expect something like this:
I would need to see more data in the test case and the expected results from the data you provide.
ASKER
>> I would need to see more data in the test case and the expected results from the data you provide.
Very hard to provide mock geo data when the question isn't directly related and convert it to a non geo example
Taking my bus route scenario
The route through MyTown has 15 stops 8 going north & 7 going south some are open & some are closed
MyTown has 57 pubs
At each stop I need to find 1 open pub with a garden (limit & distance) I have time to have 1 drink then catch the next bus
I'm at stop 1 The Angel is 50m away open & the Ferret is 5km and closed The Pie & Eel is 3km but only 600m from stop 3 & opens in 1 hour
At stop 1 I'm not concerned with the Ferret or The Pie & Eel are because they are closed I'm can buy a Drink in The Angel
At stop 2 The Green Man is Open I don't care about the Black Cat it is closed
by the time i get to stop 3 The Pie & Eel is now open
At each bus stop you effectively search all 57 pubs returning 1 that is open because the pub status changes in relation to each stop
Now do you see why a new search needs to be made at each stop?
Unsure how I can convert that into an database with out adding more complexity to change the status of The Pie & Eel assuming it happens
https://dbfiddle.uk/?rdbms=postgres_12&fiddle=a3c8f18a4c07df2a6b047035c3d826b6
I'd expect (ignore distance my math not that good)
>> You can use ROW_NUMBER to get the "first" of things.
Will that not always return the lowest value
Very hard to provide mock geo data when the question isn't directly related and convert it to a non geo example
Taking my bus route scenario
The route through MyTown has 15 stops 8 going north & 7 going south some are open & some are closed
MyTown has 57 pubs
At each stop I need to find 1 open pub with a garden (limit & distance) I have time to have 1 drink then catch the next bus
I'm at stop 1 The Angel is 50m away open & the Ferret is 5km and closed The Pie & Eel is 3km but only 600m from stop 3 & opens in 1 hour
At stop 1 I'm not concerned with the Ferret or The Pie & Eel are because they are closed I'm can buy a Drink in The Angel
At stop 2 The Green Man is Open I don't care about the Black Cat it is closed
by the time i get to stop 3 The Pie & Eel is now open
At each bus stop you effectively search all 57 pubs returning 1 that is open because the pub status changes in relation to each stop
Now do you see why a new search needs to be made at each stop?
Unsure how I can convert that into an database with out adding more complexity to change the status of The Pie & Eel assuming it happens
https://dbfiddle.uk/?rdbms=postgres_12&fiddle=a3c8f18a4c07df2a6b047035c3d826b6
I'd expect (ignore distance my math not that good)
id | pos_id | geoname_id | pub_name | distance |
---|---|---|---|---|
1 | 1 | 1 | The Angel | 15724.03777240155 |
2 | 2 | 5 | The Green Man | 141501.61141908905 |
3 | 3 | 4 | The Pie & Eel | 298679.3506199334 |
>> You can use ROW_NUMBER to get the "first" of things.
Will that not always return the lowest value
SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
Took me a while to work out where in the previous SQL you put the insert statement but got there
BTW: I hope you realize the Bus route pub crawl is just my attempt at trying to explain my logic
I really appreciate your help on this
with all_open as (
select l.pos_id,d.geoname_id, pub_name,l.geog <-> d.geog AS distance
from loop_tbl l cross join data_tbl d
where open
),
ranked as (
select pos_id, geoname_id, pub_name, distance,
row_number() over(partition by pos_id order by distance asc) rn
from all_open
)
INSERT INTO link_tbl(pos_id,geoname_id, pub_name,distance)
select pos_id, geoname_id, pub_name, round(CAST(distance As numeric),2)
from ranked
where rn=1
BTW: I hope you realize the Bus route pub crawl is just my attempt at trying to explain my logic
I really appreciate your help on this
Check the docs for the syntax on LIMIT:
https://www.postgresql.org/docs/12/queries-limit.html
It goes at the end...
You have it right in the inner query.
Also looks like your insert is in the wrong position.
So try:
Open in new window