Cassandra DB user data size estimation (per table)


This question is about estimating the disk usage of a table in Cassandra 3.0.
I am estimating the size of a partition using the following two formulas.

Nv = Nr × (Nc - Npk - Ns) + Ns
(Image: the partition size formula)

Since these formulas work per partition, I think a per-table estimate has to be
the sum of the sizes of all partitions in the table.
The primary key of the table I need to estimate consists only of the partition key, so
I used the total number of rows in the table as Nr to get a table-level estimate.

However, this value is several times the disk size of the table after simply loading the data.
Am I correct in understanding that the estimate allows for the disk capacity needed by each
operation (especially one that can create snapshots, such as TRUNCATE) and is the size before compression?

I think so, but I do not have much Cassandra expertise, and
someone other than the user is asking me to provide a basis for this.

If anyone knows, could you please let me know?

Thank you for your cooperation.


2022-09-30 14:15

1 Answer

The formula you cite estimates only the size of a single partition.
In the formula, Nr represents the number of CQL rows stored in the partition.

Within the partition identified by a partition key, Apache Cassandra can store multiple rows, separated by the values of the clustering column. For example, in the following table definition,

CREATE TABLE events (
  event_id text,
  event_time int,
  value text,
  PRIMARY KEY ((event_id), event_time)
);

The first column in the PRIMARY KEY definition, event_id, is the partition key, and event_time is the clustering column.
If you INSERT the following two rows into this table,

INSERT INTO events (event_id, event_time, value) VALUES('event1', 1, 'aaa');
INSERT INTO events (event_id, event_time, value) VALUES('event1', 2, 'bbb');

Roughly speaking, they will be stored in Cassandra as follows:

| 'event1' | 1 | 'aaa' | 2 | 'bbb' |

If you apply this to the sizing formula, the terms are:

  • Ck (partition key) = 'event1'
  • Cs (static columns) = none
  • Nr (CQL rows) = 2
  • Cr (regular columns) = 'aaa' and 'bbb'
  • Cc (clustering columns) = 1 and 2

(You can ignore the Nv... part, because it accounts for the timestamps that Cassandra keeps internally.)
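To make the mapping concrete, here is a minimal Python sketch that plugs the example into the sizing formula. It rests on several assumptions that are not stated above: that the partition size formula has the form commonly given for Cassandra 3.0 (partition key sizes + static column sizes + Nr × (regular column sizes + clustering column sizes) + Nv × average cell metadata), that Npk counts all primary key columns (partition key plus clustering columns), and that cell metadata averages 8 bytes. The function names and byte sizes are illustrative, not measured values.

```python
def value_count(n_rows, n_cols, n_pk_cols, n_static_cols):
    """Nv = Nr * (Nc - Npk - Ns) + Ns"""
    return n_rows * (n_cols - n_pk_cols - n_static_cols) + n_static_cols


def partition_size(pk_sizes, static_sizes, regular_sizes, clustering_sizes,
                   n_rows, n_values, cell_metadata_bytes=8):
    """Assumed form of the partition size formula (all sizes in bytes)."""
    return (sum(pk_sizes)
            + sum(static_sizes)
            + n_rows * (sum(regular_sizes) + sum(clustering_sizes))
            + n_values * cell_metadata_bytes)


# events table: event_id text (partition key), event_time int (clustering
# column), value text (regular column). Two rows in partition 'event1'.
nv = value_count(n_rows=2, n_cols=3, n_pk_cols=2, n_static_cols=0)

size = partition_size(
    pk_sizes=[len("event1")],   # 6 bytes of text
    static_sizes=[],            # no static columns
    regular_sizes=[len("aaa")], # 3 bytes of text per row
    clustering_sizes=[4],       # int is 4 bytes
    n_rows=2,
    n_values=nv,
)

print(nv, size)
```

Setting cell_metadata_bytes=0 corresponds to ignoring the Nv term, which leaves only the key and column data.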

If there is no clustering column, Nr is always 1, because only one row is stored in a partition.
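For a table whose primary key is only the partition key, this means the table-level estimate is the per-partition size with Nr = 1, summed over all partitions. The Python sketch below illustrates that under the same kind of assumptions as the sizing formula (8 bytes of metadata per cell is an assumed average, and the table shape and row count are hypothetical):

```python
def single_row_partition_size(pk_bytes, regular_bytes, n_regular_cols,
                              cell_metadata_bytes=8):
    """Partition size when Nr = 1, with no clustering or static columns.

    With Nr = 1 and Ns = 0, Nv equals the number of regular columns.
    """
    return pk_bytes + regular_bytes + n_regular_cols * cell_metadata_bytes


def table_size(n_partitions, avg_partition_bytes):
    """Sum of partition sizes, approximated as count * average size."""
    return n_partitions * avg_partition_bytes


# Hypothetical table: 16-byte partition key, two regular columns
# totalling 40 bytes, one million rows (one row per partition).
per_partition = single_row_partition_size(pk_bytes=16, regular_bytes=40,
                                          n_regular_cols=2)
total = table_size(n_partitions=1_000_000, avg_partition_bytes=per_partition)

print(per_partition, total)
```

The result is an uncompressed, single-replica figure, so it is not directly comparable to what is observed on disk.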

Apart from the above, the replication factor and compaction overhead are important when sizing.
The replication factor is set when defining a keyspace and determines how many copies of the data are kept. If the replication factor is 3, the cluster keeps three copies of each piece of data, so each node holds three times as much data as it would with a replication factor of 1.

Additionally, compaction overhead arises because the data files Apache Cassandra writes to disk are immutable, so free space is required to perform the merge operation. In the worst-case scenario, you need the same amount of free space as the data you have.
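As a rough illustration of how these two factors scale an estimate, here is a small Python sketch. The function name, the even distribution of data across nodes, and the example figures are all assumptions made for illustration; compaction_headroom=1.0 models the worst case of needing as much free space as data.

```python
def cluster_disk_estimate(raw_bytes, replication_factor, node_count,
                          compaction_headroom=1.0):
    """Scale a single-copy size estimate by replication and compaction.

    Assumes data is spread evenly across nodes. Returns a tuple of
    (total replicated data, data per node, disk to provision per node).
    """
    replicated = raw_bytes * replication_factor
    per_node_data = replicated / node_count
    per_node_disk = per_node_data * (1 + compaction_headroom)
    return replicated, per_node_data, per_node_disk


# Hypothetical example: 100 GB of single-copy data, RF = 3, 6 nodes.
replicated, per_node_data, per_node_disk = cluster_disk_estimate(
    raw_bytes=100, replication_factor=3, node_count=6)

print(replicated, per_node_data, per_node_disk)
```

Snapshots and compression are not modelled here; they would move the figure up and down respectively.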

2022-09-30 14:15


© 2022 OneMinuteCode. All rights reserved.