performance - Slow regex query on 80M records in PostgreSQL

I have a read-only table with 80 million rows:
   Column    |         Type          | Modifiers | Storage  | Stats target | Description
-------------+-----------------------+-----------+----------+--------------+-------------
 id          | character(11)         | not null  | extended |              |
 gender      | character(1)          |           | extended |              |
 postal_code | character varying(10) |           | extended |              |
 operator    | character varying(5)  |           | extended |              |
Indexes:
    "categorised_phones_pkey" PRIMARY KEY, btree (id)
    "operator_idx" btree (operator)
    "postal_code_trgm_idx" gin (postal_code gin_trgm_ops)
id is the primary key and contains unique mobile numbers. The table rows look like this:

      id       | gender | postal_code | operator
---------------+--------+-------------+----------
 09567849087   | m      | 7414776788  | mtn
 09565649846   | f      | 1268398732  | mci
 09568831245   | f      | 7412556443  | mtn
 09469774390   | m      | 5488312790  | mci
This query takes ~65 seconds the first time and ~8 seconds on subsequent runs:

SELECT operator, count(*)
FROM categorised_phones
WHERE postal_code LIKE '1%'
GROUP BY operator;
And the output looks like this:

 operator |  count
----------+---------
 mci      | 4050314
 mtn      | 6235778
And the output of EXPLAIN ANALYZE:

HashAggregate  (cost=1364980.61..1364980.63 rows=2 width=10) (actual time=8257.026..8257.026 rows=2 loops=1)
  Group Key: operator
  ->  Bitmap Heap Scan on categorised_phones  (cost=95969.17..1312915.34 rows=10413054 width=2) (actual time=1140.803..6332.534 rows=10286092 loops=1)
        Recheck Cond: ((postal_code)::text ~~ '1%'::text)
        Rows Removed by Index Recheck: 25105697
        Heap Blocks: exact=50449 lossy=237243
        ->  Bitmap Index Scan on postal_code_trgm_idx  (cost=0.00..93365.90 rows=10413054 width=0) (actual time=1129.270..1129.270 rows=10287127 loops=1)
              Index Cond: ((postal_code)::text ~~ '1%'::text)
Planning time: 0.540 ms
Execution time: 8257.392 ms
How can I make this query faster? Any ideas are greatly appreciated.

P.S.: I'm using PostgreSQL 9.6.1.
Update

I updated the question: I disabled parallel queries and the results changed.
For queries that involve comparisons of the form LIKE 'start%', and following PostgreSQL's own advice, you can use the following index:

CREATE INDEX postal_code_idx ON categorised_phones (postal_code varchar_pattern_ops);
With this index in place, and with simulated data, the execution plan looks like this:

HashAggregate  (cost=2368.65..2368.67 rows=2 width=12) (actual time=18.093..18.094 rows=2 loops=1)
  Group Key: operator
  ->  Bitmap Heap Scan on categorised_phones  (cost=536.79..2265.83 rows=20564 width=4) (actual time=2.564..12.061 rows=22171 loops=1)
        Filter: ((postal_code)::text ~~ '1%'::text)
        Heap Blocks: exact=1455
        ->  Bitmap Index Scan on postal_code_idx  (cost=0.00..531.65 rows=21923 width=0) (actual time=2.386..2.386 rows=22171 loops=1)
              Index Cond: (((postal_code)::text ~>=~ '1'::text) AND ((postal_code)::text ~<~ '2'::text))
Planning time: 0.119 ms
Execution time: 18.122 ms
You can check it at dbfiddle here.
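If you want to experiment locally, here is a minimal reproduction sketch. The table name and index come from the question; the data generator is an assumption for illustration, not the original dataset:

```sql
-- Build a small simulated version of the table from the question.
CREATE TABLE categorised_phones (
    id          character(11) PRIMARY KEY,
    gender      character(1),
    postal_code character varying(10),
    operator    character varying(5)
);

-- Generate synthetic rows: sequential ids, alternating gender/operator,
-- and random 10-digit postal codes.
INSERT INTO categorised_phones
SELECT lpad((9000000000 + g)::text, 11, '0'),
       (ARRAY['m','f'])[1 + g % 2],
       trunc(random() * 1e10)::bigint::text,
       (ARRAY['mtn','mci'])[1 + g % 2]
FROM generate_series(1, 100000) AS g;

CREATE INDEX postal_code_idx ON categorised_phones (postal_code varchar_pattern_ops);
ANALYZE categorised_phones;
```

With this in place you can run EXPLAIN ANALYZE on the query above and compare plans, keeping in mind that row counts and timings will differ from the 80M-row table.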
If you have both queries with LIKE 'start%' and queries with LIKE '%middle%', you should add the new index but keep the trigram one in place: trigram indexes might prove useful for the second kind of match.
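For the infix case, a sketch of the kind of query the existing postal_code_trgm_idx can serve (the '%4477%' pattern is illustrative; pg_trgm is evidently already installed, since the trigram index exists):

```sql
-- Infix patterns cannot use a B-tree index, but the GIN trigram index can:
EXPLAIN
SELECT operator, count(*)
FROM categorised_phones
WHERE postal_code LIKE '%4477%'
GROUP BY operator;
```

Whether the planner actually picks the trigram index depends on the estimated selectivity of the pattern, so check the plan rather than assuming it.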
Why?

From the PostgreSQL documentation on operator classes:
The operator classes text_pattern_ops, varchar_pattern_ops, and bpchar_pattern_ops support B-tree indexes on the types text, varchar, and char respectively. The difference from the default operator classes is that the values are compared strictly character by character rather than according to the locale-specific collation rules. This makes these operator classes suitable for use by queries involving pattern matching expressions (LIKE or POSIX regular expressions) when the database does not use the standard "C" locale.
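A quick diagnostic, in case you are unsure whether this applies to your database:

```sql
-- If this returns "C" (or "POSIX"), the default operator class already
-- supports prefix pattern matching and varchar_pattern_ops is unnecessary:
SHOW lc_collate;
```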
From the PostgreSQL documentation on index types:
The optimizer can also use a B-tree index for queries involving the pattern matching operators LIKE and ~ if the pattern is a constant and is anchored to the beginning of the string — for example, col LIKE 'foo%' or col ~ '^foo', but not col LIKE '%bar'. However, if your database does not use the C locale you will need to create the index with a special operator class to support indexing of pattern-matching queries; see Section 11.9 below. It is also possible to use B-tree indexes for ILIKE and ~*, but only if the pattern starts with non-alphabetic characters, i.e., characters that are not affected by upper/lower case conversion.
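As the quoted passage notes, an anchored regular expression can use the same B-tree index as a left-anchored LIKE. A minimal sketch against the table above, assuming postal_code_idx from earlier is in place:

```sql
-- An anchored regex such as '^1' can be served by the same
-- varchar_pattern_ops B-tree index as LIKE '1%':
EXPLAIN
SELECT operator, count(*)
FROM categorised_phones
WHERE postal_code ~ '^1'
GROUP BY operator;
```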
update
If the queries you perform involve a fixed (and relatively small) number of LIKE 'x%' expressions, you could consider using partial indexes.

For instance, for LIKE '1%', you'd have the following index and the following query plan (it shows a 3x improvement):
CREATE INDEX idx_1 ON categorised_phones (operator) WHERE postal_code LIKE '1%';
VACUUM categorised_phones;

GroupAggregate  (cost=0.29..658.74 rows=3 width=12) (actual time=3.235..6.493 rows=2 loops=1)
  Group Key: operator
  ->  Index Only Scan using idx_1 on categorised_phones  (cost=0.29..554.10 rows=20921 width=4) (actual time=0.028..3.266 rows=22290 loops=1)
        Heap Fetches: 0
Planning time: 0.293 ms
Execution time: 6.517 ms
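If only a handful of leading digits are ever queried, the same idea extends to one partial index per prefix. The index names and prefixes below are illustrative, not from the original setup:

```sql
-- One small partial index per LIKE prefix actually used by the workload:
CREATE INDEX idx_0 ON categorised_phones (operator) WHERE postal_code LIKE '0%';
CREATE INDEX idx_2 ON categorised_phones (operator) WHERE postal_code LIKE '2%';
-- ...repeat for the remaining prefixes your queries use, then VACUUM
-- (or rely on autovacuum) so the visibility map stays fresh and
-- index-only scans remain cheap.
```

The trade-off is extra write overhead and maintenance per index, so this only pays off when the set of prefixes is genuinely small and stable.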