MFredin
asked on
Removing duplicate cities in huge geo database using cfloop. Is there a faster way to do this?
I've got a huge database of all city names, long, lat, etc for every zip code in North America. Many cities have multiple zip codes and therefore many entries. What I'm trying to do is make my database smaller by removing all the duplicate rows based on state and city name. This way, I will only have 1 record for each city.
Here is what I have now. The problem is it is deathly slow. Is there a better way I could be doing this without using Coldfusion to loop over each record? MySQL 5, Coldfusion 9
Here is what I have now. The problem is it is deathly slow. Is there a better way I could be doing this without using Coldfusion to loop over each record? MySQL 5, Coldfusion 9
<cfquery name="states">
SELECT states
FROM states
</cfquery>
<cfloop query="states">
<cfquery name="cities">
SELECT DISTINCT(city)as city
FROM cities
WHERE region1 = '#states.state#'
ORDER BY city
</cfquery>
<cfoutput query="cities">
<cfquery name="info">
SELECT *
FROM cities
WHERE region1 = '#state#'
AND city = '#cities.city#'
LIMIT 1
</cfquery>
<cfquery>
INSERT INTO cities_new
(country, language, regISO2, Region1, Region2, Region3, Region4, zip, city, area1, area2, lat, lng, tz, utc, dst)
VALUES
('#UCASE(info.country)#', '#UCASE(info.language)#', '#UCASE(info.regiso2)#', '#UCASE(info.region1)#', '#UCASE(info.region2)#', '#UCASE(info.region3)#', '#UCASE(info.region4)#', '#UCASE(info.zip)#', '#UCASE(info.city)#', '#UCASE(info.area1)#', '#UCASE(info.area2)#', #info.lat#, #info.lng#, '#UCASE(info.tz)#', '#UCASE(info.utc)#', '#UCASE(info.dst)#')
</cfquery>
</cfoutput>
</cfloop>
DELETE FROM your_table WHERE id NOT IN (SELECT id FROM your_table GROUP BY Region1, city)
ASKER
I tried this... but getting the following error: You can't specify target table 'cities' for update in FROM clause
<cfquery>
DELETE
FROM cities
WHERE country = 'US'
AND id NOT IN (SELECT id FROM cities WHERE country = 'US' GROUP BY Region1, city)
</cfquery>
<cfquery>
DELETE
FROM cities
WHERE country = 'US'
AND id NOT IN (SELECT id FROM cities WHERE country = 'US' GROUP BY Region1, city)
</cfquery>
> DELETE FROM your_table WHERE id NOT IN (SELECT id FROM your_table GROUP BY Region1, city)
seems to me this will not delete any records, the inner SELECT will fetch every single ID so there are no IDs not included by the inner select... therefore, no records will be removed
how about this code...
I hope you've made a backup or are testing in a copy of the table or something.. !
seems to me this will not delete any records, the inner SELECT will fetch every single ID so there are no IDs not included by the inner select... therefore, no records will be removed
how about this code...
I hope you've made a backup or are testing in a copy of the table or something.. !
delete cities
WHERE country = 'US'
and ID In (select min(id)
from cities
group by city, region
having count(*) > 1)
oh, I forgot to mention. That will deleted one record of the duplicate. You keep running it until it returns 0 results. Some cities may have 3 or 4 duplicates, if they do, just run the script until no records are returned. Of course not the best solution for automation, but an easy way if you're doing it manually just one time.
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
wow, that's a nasty restriction for a database
select distinct... can avoid that, see my previous post
select distinct... can avoid that, see my previous post
ASKER
Thanks! Inserting into a new table was the best solution for me. Fast too! Thanks!