Growing blockchain storage requirements
At the core of a blockchain is a ledger that records all transactions that occur on the network. This ledger is distributed across nodes that participate in consensus and transaction validation. As the blockchain processes more transactions over time, the ledger grows continuously.
For example, the Bitcoin blockchain adds a block of transaction data roughly every 10 minutes. Assuming each block is about 1MB in size, 144 blocks are added each day for a total of about 144MB per day. At this rate, the Bitcoin ledger grows by more than 50GB per year. With more complex data structures and smart contract capabilities, the Ethereum blockchain generates data even faster, producing a block roughly every 15 seconds. This causes the size of the Ethereum ledger to grow rapidly, nearly doubling every year.
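The Bitcoin figures above can be reproduced with a few lines of arithmetic. The sketch below is only a back-of-the-envelope estimate: the function name is made up for illustration, blocks are assumed to be a fixed size, and 1GB is taken as 1000MB.

```python
# Back-of-the-envelope ledger growth, using the figures quoted in the text.
MINUTES_PER_DAY = 24 * 60


def yearly_growth_gb(block_interval_min: float, block_size_mb: float) -> float:
    """Estimate ledger growth per year in GB (assuming 1 GB = 1000 MB)."""
    blocks_per_day = MINUTES_PER_DAY / block_interval_min
    mb_per_day = blocks_per_day * block_size_mb
    return mb_per_day * 365 / 1000


# Bitcoin: one block about every 10 minutes, about 1 MB each (assumed constant).
bitcoin_blocks_per_day = MINUTES_PER_DAY / 10
bitcoin_gb_per_year = yearly_growth_gb(block_interval_min=10, block_size_mb=1.0)

print(f"{bitcoin_blocks_per_day:.0f} blocks/day, {bitcoin_gb_per_year:.2f} GB/year")
```

Real blocks vary in size, so the result is an order-of-magnitude guide rather than a measurement.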
All this data needs to be stored by every full node, which holds a complete copy of the blockchain. By mid-2023, the Bitcoin ledger was approaching 500GB, while the Ethereum ledger was already over 1TB. Storing such large amounts of data can be prohibitively expensive and technically challenging for individual node operators. As a result, it becomes more difficult to run a fully decentralized and independent node infrastructure.
StateDB’s role in blockchain storage
In many blockchain platforms, such as Ethereum and Klaytn, a key data structure called the State Database (StateDB) takes up a large amount of storage space. StateDB stores important information such as account balances, contract data, and blockchain state needed to validate transactions.
As transactions occur, StateDB is continually updated to reflect the latest state. This means frequent writes, such as updates to account balances or revisions to contract data.
StateDB uses a Merkle Patricia Trie (MPT) structure to efficiently store state data. The drawback is that even a small change ripples through the trie: updating one value changes the hash of every node on the path from that value up to the root, so each of those nodes is written again as a new entry. Research shows that when one node is updated, more than 10 new nodes can be added to the MPT. This write amplification causes StateDB to grow in size quickly. The large capacity required is difficult to manage for validators and miners running full nodes.
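To make this growth pattern concrete, here is a minimal sketch of a content-addressed trie. It is a toy, not a real MPT: it has no extension or leaf node compaction, it uses SHA-256 over JSON as a stand-in for the real encoding and hash, and the `ToyTrie` class and its method names are invented for illustration.

```python
import hashlib
import json


def node_hash(node: dict) -> str:
    """Hash a node by its serialized content (SHA-256 used as a stand-in)."""
    return hashlib.sha256(json.dumps(node, sort_keys=True).encode()).hexdigest()


class ToyTrie:
    """Content-addressed trie keyed by hex nibbles. Old nodes are never deleted."""

    def __init__(self):
        self.db = {}               # hash -> node, like a node database
        self.root = self._put({})  # start from an empty root node

    def _put(self, node: dict) -> str:
        h = node_hash(node)
        self.db[h] = node
        return h

    def set(self, key: str, value: str) -> int:
        """Store value under key and return how many new nodes were written."""
        before = len(self.db)
        self.root = self._set(self.root, key, value)
        return len(self.db) - before

    def _set(self, h, key: str, value: str) -> str:
        # Copy the existing node (or start a new one) so old versions stay intact.
        node = dict(self.db[h]) if h is not None else {}
        if not key:
            node["value"] = value
        else:
            # The child hash changes, so this node's own hash changes as well.
            node[key[0]] = self._set(node.get(key[0]), key[1:], value)
        return self._put(node)


trie = ToyTrie()
first_write = trie.set("a1b2c3d4", "100")   # creates the whole path
update_write = trie.set("a1b2c3d4", "101")  # rewrites the whole path again
print(first_write, update_write, len(trie.db))
```

In this toy, changing a single value under an 8-nibble key writes a fresh copy of every node on the path, while the previous copies stay in the database as stale entries. The exact counts are specific to this simplified structure.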
Why existing pruning methods have limitations
To accommodate StateDB’s ever-expanding storage needs, blockchain developers have been working on pruning techniques to securely remove old and unnecessary data.
Pruning methods such as StateDB offline pruning require node servers to be taken offline, data migrated to new infrastructure, and the blockchain resynchronized. This results in significant node downtime, ranging from hours to days, depending on the size of the blockchain. Such complex adjustments also limit the feasibility of frequent pruning operations.
The main limitation of pruning is that data structures like MPT use nested data that is shared between nodes. This means node A may refer to the same data as node B. If the data behind node B is simply deleted, irrecoverable data loss or corruption may occur for node A. This challenge of managing multi-referenced data has traditionally hindered efficient online pruning.
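The hazard can be illustrated with a tiny content-addressed store. This is a simplified sketch, not how any particular client stores its nodes: SHA-256 stands in for the real hash function, and the `store`, `put`, and `ref_*` names are made up for the example.

```python
import hashlib

store = {}  # hash -> data, as in a content-addressed node database


def put(data: bytes) -> bytes:
    """Store data under its plain hash and return that hash as the reference."""
    key = hashlib.sha256(data).digest()
    store[key] = data
    return key


ref_a = put(b"Balance: 100")  # reference held by node A
ref_b = put(b"Balance: 100")  # reference held by node B

same_entry = ref_a == ref_b   # identical data collapses into one stored entry

del store[ref_b]              # naive pruning of node B's data

a_is_dangling = ref_a not in store  # node A now points at nothing
print(same_entry, a_is_dangling)
```

Because both references resolve to the same key, the store cannot tell whether anyone else still needs the entry without extra bookkeeping such as reference counting or a full trie traversal.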
Introducing Exthash to enable live pruning
To overcome the problem of multiple references, Klaytn developers introduced a modified hash function called Exthash. Exthash works by adding a 7-byte serial number to the regular 32-byte hash. This serial number acts as a unique identifier and prevents the same data from having the same hash.
Duplicate references can be eliminated by replacing regular hashes with Exthash throughout StateDB. If node A and node B previously linked to the same data through the same hash, each link now carries its own unique Exthash identifier. Later, when node B's entry is removed during pruning, node A's reference is unaffected because the two use separate Exthash values.
To implement Exthash, the existing Merkle Patricia Trie structure was modified to use the new extended hash. Each node is stored as a key-value pair, where the key is the node's hash and the value holds the actual data. Generating keys with Exthash instead of the regular hash ensures that no two references share the same stored entry.
The 7-byte serial number added by Exthash creates a unique hash even for identical data. For example, if two nodes contain “Balance: 100”, their Exthash values differ because their serial numbers differ. Node A and node B can now each hold their own entry without conflict. If B's entry is later removed during pruning, A's link remains intact, pointing to its own separate Exthash.
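The idea can be sketched in a few lines. This is an illustration under stated assumptions, not Klaytn's implementation: SHA-256 stands in for the real hash function, the serial number comes from a simple in-process counter (how Klaytn actually generates it is not described here), and `ext_hash` and `store` are hypothetical names.

```python
import hashlib
import itertools

HASH_LEN = 32    # regular hash length in bytes
SERIAL_LEN = 7   # serial number appended by the extended hash

_serial = itertools.count(1)  # assumption: a monotonically increasing counter


def ext_hash(data: bytes) -> bytes:
    """Return a 39-byte key: 32-byte content hash followed by a 7-byte serial."""
    digest = hashlib.sha256(data).digest()
    serial = next(_serial).to_bytes(SERIAL_LEN, "big")
    return digest + serial


store = {}

ref_a = ext_hash(b"Balance: 100")  # key used by node A
store[ref_a] = b"Balance: 100"
ref_b = ext_hash(b"Balance: 100")  # key used by node B, same data
store[ref_b] = b"Balance: 100"

del store[ref_b]                   # prune node B's entry

print(len(ref_a), ref_a != ref_b, ref_a in store)
```

The first 32 bytes of both keys are identical, so the content hash is still recoverable, while the trailing 7 bytes keep the two entries separate in the store. The cost is that identical data is now stored once per reference.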
Exthash enables secure deletion of shared data, so you can now prune StateDB live while the blockchain is fully operational. This eliminates the downtime and complex resynchronization challenges caused by traditional offline pruning methods.
Keep StateDB lean with live pruning
Klaytn implemented StateDB Live Pruning in version 1.11, allowing automatic deletion of old StateDB data. By default, only the last two days of StateDB data are retained and old information is periodically deleted.
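A retention window of this kind might be sketched as follows. The `PruningQueue` class, its methods, and the block-based retention value are assumptions made for illustration (two days expressed as 172,800 blocks assumes one block per second). The sketch also relies on each key being referenced from exactly one place, which is the property Exthash provides.

```python
from collections import defaultdict

RETENTION_BLOCKS = 172_800  # assumption: 2 days at one block per second


class PruningQueue:
    """Tracks keys that became obsolete at a given block and deletes them later."""

    def __init__(self, store: dict, retention: int = RETENTION_BLOCKS):
        self.store = store
        self.retention = retention
        self.marks = defaultdict(list)  # block number -> keys made obsolete there

    def mark(self, block: int, key) -> None:
        """Record that `key` was replaced by newer state at `block`."""
        self.marks[block].append(key)

    def prune(self, current_block: int) -> int:
        """Delete keys marked at or before the retention cutoff; return the count."""
        cutoff = current_block - self.retention
        removed = 0
        for block in sorted(b for b in self.marks if b <= cutoff):
            for key in self.marks.pop(block):
                if self.store.pop(key, None) is not None:
                    removed += 1
        return removed


store = {"old": b"x", "new": b"y"}
queue = PruningQueue(store, retention=10)
queue.mark(5, "old")    # obsolete long ago
queue.mark(95, "new")   # obsolete recently, still inside the window
removed = queue.prune(current_block=100)
print(removed, store)
```

Because pruning only needs the list of marked keys, it can run periodically alongside normal block processing rather than requiring the node to stop.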
This keeps the active StateDB at a size well suited to I/O performance, typically in the 150 GB to 200 GB range. With significantly reduced storage requirements compared to full archives, full nodes can run smoothly without getting bogged down by data bloat. A smaller StateDB also benefits more from caching, since a larger share of it fits in memory, which speeds up various blockchain operations.
The introduction of Exthash adds some overhead associated with generating and storing 7-byte extensions. However, benchmarks revealed that the storage and cache improvements made possible by live pruning increased overall system efficiency by more than 20%.
In addition to storage space, StateDB Live Pruning also saves bandwidth for node operators. When a node synchronizes for the first time, a pruned StateDB covering only the past few days reduces the amount of data that needs to be downloaded. This makes it easier for participants around the world to run nodes.
Ongoing analysis helps determine optimal pruning frequency and duration. Factors such as disk performance, network speed, and node type affect the ideal pruning configuration. Pruning parameters can be adjusted over time as the characteristics of the blockchain data evolve.
Continuous optimization for different node types
Live StateDB pruning provides the most value for consensus and validation-only nodes that require fast access to the latest state data. Historical data, which is kept primarily for analytical purposes, offers little benefit to such nodes.
Future work will include efficiently separating recent hot data and older cold data across different storage systems. For example, a cheaper, slower hard disk can store historical data for analysis, while a faster SSD processes the most recent information for time-sensitive processing.
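One way to picture such a split is a two-tier store that demotes entries once they fall outside a recent window. This is purely a conceptual sketch of the idea, not a planned design: the `TieredStore` class, its methods, and the block-based window are hypothetical, and two in-memory dictionaries stand in for the SSD and hard disk tiers.

```python
class TieredStore:
    """Keeps recent entries in a hot tier and demotes older ones to a cold tier."""

    def __init__(self, hot_window: int):
        self.hot = {}    # key -> (block, value); stands in for fast SSD storage
        self.cold = {}   # key -> (block, value); stands in for cheaper HDD storage
        self.hot_window = hot_window

    def put(self, key, value, block: int) -> None:
        """Write fresh data to the hot tier, replacing any cold copy."""
        self.hot[key] = (block, value)
        self.cold.pop(key, None)

    def get(self, key):
        """Look in the hot tier first, then fall back to the cold tier."""
        entry = self.hot.get(key) or self.cold.get(key)
        return entry[1] if entry else None

    def demote(self, current_block: int) -> int:
        """Move entries older than the hot window to the cold tier."""
        cutoff = current_block - self.hot_window
        stale = [k for k, (block, _) in self.hot.items() if block < cutoff]
        for key in stale:
            self.cold[key] = self.hot.pop(key)
        return len(stale)


tiers = TieredStore(hot_window=100)
tiers.put("acct1", "100", block=10)
tiers.put("acct2", "7", block=950)
moved = tiers.demote(current_block=1000)
print(moved, sorted(tiers.hot), sorted(tiers.cold))
```

Reads stay transparent to the caller, which only pays the slower lookup when it asks for data that has aged out of the hot tier.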
Further optimizations are also being considered, such as extracting transaction data from recent blocks. Transactions can be moved to separate storage while only state changes remain in the hot StateDB. Now that Live Pruning has proven effective, Klaytn developers can focus on further enhancements for different node types.
Conclusion
Blockchain data grows relentlessly over time, so efficient storage and pruning are essential for nodes to run smoothly. By using Exthash to give every reference its own entry, Klaytn removes the shared-reference problem and enables continuous live StateDB pruning. This keeps nodes lightweight, as they only keep the latest blockchain data needed for validation.
Ongoing research aims to further improve pruning by separating data across hot and cold storage based on utility. Innovative solutions like Live Pruning allow blockchains to scale sustainably despite the huge amount of data they generate.