KarlisB
asked on
optimize query
Greetings!
I have the following query (posted below).

As you can see, I first join an "item" id table with an auto-generated table of dates. As a result I get a table where every item_id is assigned roughly the last 365 dates. After that I join it to a massive table which contains daily statistics for the items. In the end the query returns the item_ids and dates where data is missing, i.e. not present in the statistics table.

I would like to know: is there anything I can do to optimize the query? Or is there another way to get the ids of items that are missing on certain dates?

The item table has about 12,000 records (after filtering); the statistics table has about 50,000,000 records. The first call of the query takes 160 seconds; after caching, 20 seconds. I am using a multicolumn index over all three fields in the WHERE clause on the statistics table.
query plan:
"Aggregate (cost=375007.39..375007.40 rows=1 width=0)"
" -> Nested Loop Anti Join (cost=0.01..375007.39 rows=1 width=0)"
" -> Nested Loop (cost=0.00..3130.29 rows=30000 width=8)"
" -> Seq Scan on preces (cost=0.00..2455.29 rows=30 width=4)"
" Filter: (sort_a AND (NOT stop_prece) AND (id = id))"
" -> Function Scan on generate_series a (cost=0.00..12.50 rows=1000 width=4)"
" -> Index Scan using realizacija_pkey on realizacija (cost=0.01..12.90 rows=1 width=8)"
" Index Cond: ((realizacija.veikala_id = 1) AND (preces.id = realizacija.prece_id) AND (realizacija.datums = (('now'::text)::date - a.a)))"
database type: Postgres
select count(*)
from (
    select *
    from (
        -- last ~368 days, starting the day before yesterday
        select current_date - a as dat
        from generate_series(2, 370) as a
    ) q1
    cross join (
        -- the original "on q2.id = q2.id" condition is always true,
        -- so this is effectively a cross join of items x dates
        select id, artikuls
        from preces
        where sort_a = true and stop_prece = false
    ) q2
    left join (
        select prece_id, datums
        from realizacija
        where veikala_id = 1
    ) q3 on q3.datums = q1.dat and q3.prece_id = q2.id
) as q
where datums is null
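For comparison, the same "missing rows" count can also be expressed with NOT EXISTS instead of a LEFT JOIN ... IS NULL. This is only a sketch using the table and column names from the thread; whether the planner produces a better plan for it would need to be checked with EXPLAIN:

```sql
-- sketch of an anti-join via NOT EXISTS (same semantics as the query above)
SELECT count(*)
FROM preces p
CROSS JOIN generate_series(2, 370) AS a
WHERE p.sort_a AND NOT p.stop_prece
  AND NOT EXISTS (
      SELECT 1
      FROM realizacija r
      WHERE r.veikala_id = 1
        AND r.prece_id  = p.id
        AND r.datums    = current_date - a
  );
```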
SOLUTION
I am far from being a PostgreSQL specialist; I only posted because nobody else had so far.

Do you get the same execution plan?
Do you have an index on realizacija covering both fields datums and prece_id (one single index with both fields)?
If not: create one.
If yes: in which order are the two fields in the index, compared to the selectivity of the values in those fields?
To explain: if a single value of datums alone makes fewer rows be returned from that table than a single value of prece_id, then datums should come first in the index (for this query); otherwise prece_id should come first, followed by datums.
ASKER
The realizacija table has a multicolumn index in the following order:
veikala_id integer (about 20 unique values)
prece_id integer (about 8,000-60,000 unique values)
datums date (about 400 unique dates)
Yes, three fields in one index; this way SELECT queries are fastest.
Can you add another index with only:
prece_id integer (about 8,000-60,000 unique values)
datums date (about 400 unique dates)
and possibly add veikala_id at the end? This query does not really use veikala_id as a selective criterion.
Also, please compare the explain plans of your syntax and mine, for the two queries and the two additional index scenarios.
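A minimal sketch of the suggested index, using the column names from this thread; the index name here is made up:

```sql
-- hypothetical index name; columns ordered as suggested above,
-- with veikala_id appended at the end
CREATE INDEX realizacija_prece_datums_idx
    ON realizacija (prece_id, datums, veikala_id);
```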
ASKER
With the field order, it looks like the query runs up to 3 seconds faster.
"because this query does not use veikala_id as selective criteria, really."
Can you explain? I always thought that fields with fewer unique values should come first as the leading fields of an index.
ASKER
Doh, forgot to post the order ...
prece_id integer (about 8,000-60,000 unique values)
datums date (about 400 unique dates)
veikala_id integer (about 20 unique values)
ASKER
Just noticed something: after re-indexing the index I used at first, the query planner is using it again.
The speed is the same as yours ...
It looks like the field order does not matter ...
Or I might be mistaken about something...
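For reference, the rebuild step mentioned above can be done with REINDEX; the index name here is taken from the posted execution plan, and it is an assumption that this was the index rebuilt:

```sql
-- rebuild the original three-column index (name taken from the posted plan)
REINDEX INDEX realizacija_pkey;
```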
SOLUTION
ASKER
realizacija is.
ASKER
It might be that the query speed is limited by disk read speed, since the table itself is 2-3 GB and the index takes up 700-1000 MB, although I have a RAID 5 set up with three HDDs.
> the first call of the query takes 160 seconds; after caching, 20 seconds
I fear you won't get any better, unless you plan to run this as a scheduled job with the results stored in a table, so that the user(s) just query that table ...
Or are the parameters for generate_series(2,370) "dynamic", i.e. user input?
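A minimal sketch of the scheduled-job idea, assuming the query and table names from this thread; the result table name is made up, and the job would be re-run once per day:

```sql
-- hypothetical result table, refreshed by a daily job
CREATE TABLE missing_stats AS
SELECT q2.id AS prece_id, q1.dat AS datums
FROM (SELECT current_date - a AS dat FROM generate_series(2, 370) AS a) q1
CROSS JOIN (SELECT id FROM preces WHERE sort_a AND NOT stop_prece) q2
LEFT JOIN realizacija r
       ON r.veikala_id = 1 AND r.prece_id = q2.id AND r.datums = q1.dat
WHERE r.datums IS NULL;
```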
ASKER
Sad, but it's true:
current_date - generate_series(2,370)
is dynamically generated; there is no other way...
ASKER
select current_date - a as dat from generate_series(2,370) as a
generates a temporary table with all dates in a 368-day range ...
>select current_date - a as dat from generate_series(2,370) as a
You could store that result in a table and update it once per day...
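A sketch of that suggestion, with an assumed table name; the refresh would run once per day:

```sql
-- hypothetical helper table holding the 368-day date series
CREATE TABLE date_series (dat date PRIMARY KEY);

-- daily refresh
TRUNCATE date_series;
INSERT INTO date_series
SELECT current_date - a FROM generate_series(2, 370) AS a;
```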
ASKER CERTIFIED SOLUTION