gin_post file in a single part abnormally large #101

@ardenwick


Describe the unexpected behaviour
The gin_post file in a single data part can grow extremely large.

How to reproduce

  1. Create the table http_log_local and insert sample data.
  2. Add the gin index.
  3. Materialize the index.
  4. Wait for the mutation activity to finish.
  5. Inspect the total size of the data parts on the filesystem:
clickhouse@node-1-0-0:~/data/business/http_log_local$ du -sch * | sort -k 1 -h
4.0K    detached
4.0K    format_version.txt
76K     e7efb3c67698badd53ab74b731772f76_10_10_0_21
112K    2ba8abac480f6c1100bffd944c4ecc48_42_65_12_77
128K    03a57d5e8fe0ec0fab9e1c68dc210ef0_0_24_7_35
132K    78e93e4cc3886fa2c9f2ef23967198e7_0_36_10_48
1.9M    26efcc325bb68f310ed15a6a9b2eff95_255_255_0_266
4.1M    1e11287921c2d5c214973c711c58a36f_10609_10609_0
3.5M    983a1a1dfef4543e6d83515d25983333_533_533_0_544
6.1M    tmp_mut_300bbf56ad64c576b4d79a9172fcfe31_0_2401_24_5776
59M     0c924ec6272d27d57f728aaf94071b39_4245_4295_3_4306
172M    0c924ec6272d27d57f728aaf94071b39_490_4244_27_4306
206M    0c924ec6272d27d57f728aaf94071b39_194_489_8_4306
868M    983a1a1dfef4543e6d83515d25983333_287_532_10_544
1.2G    300bbf56ad64c576b4d79a9172fcfe31_5738_5765_2_5776
1.5G    983a1a1dfef4543e6d83515d25983333_5_286_6_544
3.6G    26efcc325bb68f310ed15a6a9b2eff95_44_254_4_266
5.2G    300bbf56ad64c576b4d79a9172fcfe31_5579_5737_5_5776
24G     300bbf56ad64c576b4d79a9172fcfe31_2402_5578_22_5776
157G    300bbf56ad64c576b4d79a9172fcfe31_0_2401_24_5775
190G    total

Notice the abnormally large part 300bbf56ad64c576b4d79a9172fcfe31_0_2401_24_5775 (157G). Its contents:

clickhouse@node-1-0-0:~/data/business/http_log_local/300bbf56ad64c576b4d79a9172fcfe31_0_2401_24_5775$ du -sch *
8.0K    checksums.txt
4.0K    columns.txt
4.0K    count.txt
4.0K    partition.dat
56K     primary.idx
268K    recordTimestamp.bin
24K     recordTimestamp.mrk2
6.3M    requestBody.bin
24K     requestBody.mrk2
538M    requestHead.bin
24K     requestHead.mrk2
13M     responseBody.bin
24K     responseBody.mrk2
17M     responseHead.bin
24K     responseHead.mrk2
226M    skp_idx_ginIndex.gin_dict
158G    skp_idx_ginIndex.gin_post
24K     skp_idx_ginIndex.gin_seg
4.0K    skp_idx_ginIndex.gin_sid
12K     skp_idx_ginIndex.idx
28K     skp_idx_ginIndex.mrk2
40K     tenant.bin
4.0K    tenant.dict.bin
24K     tenant.dict.mrk2
24K     tenant.mrk2
270M    uuId.bin
24K     uuId.mrk2
160G    total

The index file skp_idx_ginIndex.gin_post is far too large.

The data files the index is built from (the .bin files of the indexed columns) total only about 571MB, yet the index file is 158GB, roughly 280 times larger.
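To make this comparison easy to repeat on other parts, here is a minimal sketch that sums the data files of the four columns concatenated into rowLog and compares them with the posting-list file. The function name index_ratio is hypothetical, and the sketch assumes the part directory layout shown in the listing above.

```shell
# index_ratio PART_DIR
# Hypothetical helper: prints the combined byte size of the source column
# files, the byte size of the gin_post file, and their integer ratio.
index_ratio() {
    part_dir=$1
    src_bytes=0
    # The four columns that feed rowLog in the CREATE TABLE statement.
    for col in requestHead responseHead requestBody responseBody; do
        src_bytes=$(( src_bytes + $(wc -c < "$part_dir/$col.bin") ))
    done
    # "+ 0" normalizes any padding that wc may print around the number.
    post_bytes=$(( $(wc -c < "$part_dir/skp_idx_ginIndex.gin_post") + 0 ))
    if [ "$src_bytes" -eq 0 ]; then
        echo "no source data in $part_dir" >&2
        return 1
    fi
    echo "source_bytes=$src_bytes gin_post_bytes=$post_bytes ratio=$(( post_bytes / src_bytes ))"
}

# Example call (path taken from the listing above):
# index_ratio "$HOME/data/business/http_log_local/300bbf56ad64c576b4d79a9172fcfe31_0_2401_24_5775"
```

The sizes are on-disk (compressed) bytes, so the ratio only shows how the index file compares with the stored column data, not with the raw text.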

  • Which ClickHouse server version to use
    custom built ftsearch branch

  • CREATE TABLE statements for all tables involved

CREATE TABLE http_log_local
(
    -- irrelevant columns removed
    
    `tenant` LowCardinality(String),
    `recordTimestamp` Int64,
    `uuId` String,
    `requestHead` String,
    `responseHead` String,
    `requestBody` String,
    `responseBody` String,
    `rowLog` String DEFAULT concat(requestHead, '--', responseHead, '--', requestBody, '--', responseBody),
    INDEX ginIndex rowLog TYPE gin(3) GRANULARITY 1
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/http_log_local', '{replica}')
PARTITION BY (tenant, toYYYYMMDD(toDate(recordTimestamp)))
PRIMARY KEY uuId
ORDER BY (uuId, recordTimestamp)
SETTINGS index_granularity = 8192
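
For reference, the steps of adding the index, materializing it and waiting for the mutation could be scripted roughly as below. This is a sketch, not the exact commands that were run: the helper names and the CH_CLIENT variable are hypothetical, the ADD INDEX statement only applies if the table was created without the INDEX clause shown above, and the gin index type is specific to the ftsearch branch.

```shell
# Client command; override CH_CLIENT to add host, user or database options.
CH_CLIENT=${CH_CLIENT:-clickhouse-client}

# Add the gin index to an existing table, then build it for existing parts.
add_and_materialize_index() {
    table=$1
    $CH_CLIENT --query "ALTER TABLE $table ADD INDEX ginIndex rowLog TYPE gin(3) GRANULARITY 1"
    $CH_CLIENT --query "ALTER TABLE $table MATERIALIZE INDEX ginIndex"
}

# Print the number of unfinished mutations for the table.
# Assumption: the table name is unique across databases on this server.
pending_mutations() {
    $CH_CLIENT --query "SELECT count() FROM system.mutations WHERE table = '$1' AND is_done = 0"
}

# Poll until no mutation is pending.
wait_for_mutations() {
    while [ "$(pending_mutations "$1")" != "0" ]; do
        sleep 5
    done
}

# Example:
# add_and_materialize_index http_log_local && wait_for_mutations http_log_local
```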
