performance - Slow regex query on 80M records in PostgreSQL

I have a read-only table with 80 million rows:
   Column    |         Type          | Modifiers | Storage  | Stats target | Description
-------------+-----------------------+-----------+----------+--------------+-------------
 id          | character(11)         | not null  | extended |              |
 gender      | character(1)          |           | extended |              |
 postal_code | character varying(10) |           | extended |              |
 operator    | character varying(5)  |           | extended |              |
Indexes:
    "categorised_phones_pkey" PRIMARY KEY, btree (id)
    "operator_idx" btree (operator)
    "postal_code_trgm_idx" gin (postal_code gin_trgm_ops)
id is the primary key and contains unique mobile numbers. The table rows look like this:

      id       | gender | postal_code | operator
---------------+--------+-------------+----------
 09567849087   | m      | 7414776788  | mtn
 09565649846   | f      | 1268398732  | mci
 09568831245   | f      | 7412556443  | mtn
 09469774390   | m      | 5488312790  | mci
This query takes ~65 seconds the first time and ~8 seconds on subsequent runs:

SELECT operator, count(*)
FROM categorised_phones
WHERE postal_code LIKE '1%'
GROUP BY operator;
And the output looks like this:

 operator |  count
----------+---------
 mci      | 4050314
 mtn      | 6235778
And the output of EXPLAIN ANALYZE:

HashAggregate  (cost=1364980.61..1364980.63 rows=2 width=10) (actual time=8257.026..8257.026 rows=2 loops=1)
  Group Key: operator
  ->  Bitmap Heap Scan on categorised_phones  (cost=95969.17..1312915.34 rows=10413054 width=2) (actual time=1140.803..6332.534 rows=10286092 loops=1)
        Recheck Cond: ((postal_code)::text ~~ '1%'::text)
        Rows Removed by Index Recheck: 25105697
        Heap Blocks: exact=50449 lossy=237243
        ->  Bitmap Index Scan on postal_code_trgm_idx  (cost=0.00..93365.90 rows=10413054 width=0) (actual time=1129.270..1129.270 rows=10287127 loops=1)
              Index Cond: ((postal_code)::text ~~ '1%'::text)
Planning time: 0.540 ms
Execution time: 8257.392 ms
How can I make this query faster? Any ideas are greatly appreciated.

P.S.: I'm using PostgreSQL 9.6.1.
Update

I updated the question: I disabled parallel queries and the results changed.
For queries that involve comparisons of the form LIKE 'start%', and following PostgreSQL's own advice, you can use the following index:

CREATE INDEX postal_code_idx ON categorised_phones (postal_code varchar_pattern_ops);
With this index in place, and with simulated data, the execution plan looks like this:

HashAggregate  (cost=2368.65..2368.67 rows=2 width=12) (actual time=18.093..18.094 rows=2 loops=1)
  Group Key: operator
  ->  Bitmap Heap Scan on categorised_phones  (cost=536.79..2265.83 rows=20564 width=4) (actual time=2.564..12.061 rows=22171 loops=1)
        Filter: ((postal_code)::text ~~ '1%'::text)
        Heap Blocks: exact=1455
        ->  Bitmap Index Scan on postal_code_idx  (cost=0.00..531.65 rows=21923 width=0) (actual time=2.386..2.386 rows=22171 loops=1)
              Index Cond: (((postal_code)::text ~>=~ '1'::text) AND ((postal_code)::text ~<~ '2'::text))
Planning time: 0.119 ms
Execution time: 18.122 ms
You can check it at dbfiddle here.
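If you want to experiment locally, here is a minimal reproduction sketch. The table name and index come from the question; the data generator is an assumption for illustration, not the original dataset:

```sql
-- Build a small simulated version of the table from the question.
CREATE TABLE categorised_phones (
    id          character(11) PRIMARY KEY,
    gender      character(1),
    postal_code character varying(10),
    operator    character varying(5)
);

-- Generate synthetic rows: sequential ids, alternating gender/operator,
-- and random 10-digit postal codes.
INSERT INTO categorised_phones
SELECT lpad((9000000000 + g)::text, 11, '0'),
       (ARRAY['m','f'])[1 + g % 2],
       trunc(random() * 1e10)::bigint::text,
       (ARRAY['mtn','mci'])[1 + g % 2]
FROM generate_series(1, 100000) AS g;

CREATE INDEX postal_code_idx ON categorised_phones (postal_code varchar_pattern_ops);
ANALYZE categorised_phones;
```

With this in place you can run EXPLAIN ANALYZE on the query above and compare plans, keeping in mind that row counts and timings will differ from the 80M-row table.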
If you have both queries with LIKE 'start%' and queries with LIKE '%middle%', you should add the new index but keep the trigram one in place: trigram indexes might prove useful for the second kind of match.
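For the infix case, a sketch of the kind of query the existing postal_code_trgm_idx can serve (the '%4477%' pattern is illustrative; pg_trgm is evidently already installed, since the trigram index exists):

```sql
-- Infix patterns cannot use a B-tree index, but the GIN trigram index can:
EXPLAIN
SELECT operator, count(*)
FROM categorised_phones
WHERE postal_code LIKE '%4477%'
GROUP BY operator;
```

Whether the planner actually picks the trigram index depends on the estimated selectivity of the pattern, so check the plan rather than assuming it.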
Why?

From the PostgreSQL documentation on operator classes:
The operator classes text_pattern_ops, varchar_pattern_ops, and bpchar_pattern_ops support B-tree indexes on the types text, varchar, and char respectively. The difference from the default operator classes is that the values are compared strictly character by character rather than according to the locale-specific collation rules. This makes these operator classes suitable for use by queries involving pattern matching expressions (LIKE or POSIX regular expressions) when the database does not use the standard "C" locale.
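A quick diagnostic, in case you are unsure whether this applies to your database:

```sql
-- If this returns "C" (or "POSIX"), the default operator class already
-- supports prefix pattern matching and varchar_pattern_ops is unnecessary:
SHOW lc_collate;
```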
From the PostgreSQL documentation on index types:
The optimizer can also use a B-tree index for queries involving the pattern matching operators LIKE and ~ if the pattern is a constant and is anchored to the beginning of the string — for example, col LIKE 'foo%' or col ~ '^foo', but not col LIKE '%bar'. However, if your database does not use the C locale you will need to create the index with a special operator class to support indexing of pattern-matching queries; see Section 11.9 below. It is also possible to use B-tree indexes for ILIKE and ~*, but only if the pattern starts with non-alphabetic characters, i.e., characters that are not affected by upper/lower case conversion.
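As the quoted passage notes, an anchored regular expression can use the same B-tree index as a left-anchored LIKE. A minimal sketch against the table above, assuming postal_code_idx from earlier is in place:

```sql
-- An anchored regex such as '^1' can be served by the same
-- varchar_pattern_ops B-tree index as LIKE '1%':
EXPLAIN
SELECT operator, count(*)
FROM categorised_phones
WHERE postal_code ~ '^1'
GROUP BY operator;
```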
update
If the queries you perform involve a fixed (and relatively small) number of LIKE 'x%' expressions, you could consider using partial indexes.

For instance, for LIKE '1%', you'd have the following index and the following query plan (it shows a 3x improvement):
CREATE INDEX idx_1 ON categorised_phones (operator) WHERE postal_code LIKE '1%';
VACUUM categorised_phones;

GroupAggregate  (cost=0.29..658.74 rows=3 width=12) (actual time=3.235..6.493 rows=2 loops=1)
  Group Key: operator
  ->  Index Only Scan using idx_1 on categorised_phones  (cost=0.29..554.10 rows=20921 width=4) (actual time=0.028..3.266 rows=22290 loops=1)
        Heap Fetches: 0
Planning time: 0.293 ms
Execution time: 6.517 ms
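If only a handful of leading digits are ever queried, the same idea extends to one partial index per prefix. The index names and prefixes below are illustrative, not from the original setup:

```sql
-- One small partial index per LIKE prefix actually used by the workload:
CREATE INDEX idx_0 ON categorised_phones (operator) WHERE postal_code LIKE '0%';
CREATE INDEX idx_2 ON categorised_phones (operator) WHERE postal_code LIKE '2%';
-- ...repeat for the remaining prefixes your queries use, then VACUUM
-- (or rely on autovacuum) so the visibility map stays fresh and
-- index-only scans remain cheap.
```

The trade-off is extra write overhead and maintenance per index, so this only pays off when the set of prefixes is genuinely small and stable.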