Generalized State Compression
How state compression - the tech behind compressed NFTs - works, and how to implement it in your own Nexis Native Chain programs.
Summary
- State Compression on Nexis Native Chain is most commonly used for compressed NFTs, but it’s possible to use it for arbitrary data
- State Compression lowers the amount of data you have to store onchain by leveraging Merkle trees.
- Merkle trees store a single hash that represents an entire binary tree of hashes. Each leaf on a Merkle tree is a hash of that leaf’s data.
- Concurrent Merkle trees are a specialized version of Merkle trees that allow concurrent updates.
- Because data in a state-compressed program is not stored onchain, you have to user indexers to keep an off-chain cache of the data and then verify that data against the onchain Merkle tree.
Lesson
Previously, we discussed state compression in the context of compressed NFTs. At the time of writing, compressed NFTs represent the most common use case for state compression, but it’s possible to use state compression within any program. In this lesson, we’ll discuss state compression in more generalized terms so that you can apply it to any of your programs.
A theoretical overview of state compression
In traditional programs, data is serialized (typically using borsh) and then stored directly in an account. This allows the data to be easily read and written through Nexis Native Chain programs. You can “trust” the data stored in the accounts because it can’t be modified except through the mechanisms surfaced by the program.
State compression effectively asserts that the most important piece of this equation is how “trustworthy” the data is. If all we care about is the ability to trust that data is what it claims to be, then we can actually get away with not storing the data in an account onchain. Instead, we can store hashes of the data where the hashes can be used to prove or verify the data. The data hash takes up significantly less storage space than the data itself. We can then store the actual data somewhere much cheaper and worry about verifying it against the onchain hash when the data is accessed.
The specific data structure used by the Nexis Native Chain State Compression program is a special binary tree structure known as a concurrent Merkle tree. This tree structure hashes pieces of data together in a deterministic way to compute a single, final hash that gets stored onchain. This final hash is significantly smaller in size than all the original data combined, hence the “compression.” The steps to this process are:
- Take any piece of data
- Create a hash of this data
- Store this hash as a “leaf” at the bottom of the tree
- Each leaf pair is then hashed together, creating a “branch”
- Each branch is then hashed together
- Continually climb the tree and hash adjacent branches together
- Once at the top of the tree, a final ”root hash” is produced
- Store the root hash onchain as verifiable proof of the data within each leaf
- Anyone wanting to verify that the data they have matches the “source of truth” can go through the same process and compare the final hash without having to store all the data onchain
This involves a few rather serious development tradeoffs:
- Since the data is no longer stored in an account onchain, it is more difficult to access.
- Once the data has been accessed, developers must decide how often their applications will verify the data against the onchain hash.
- Any changes to the data will require sending the entirety of the previously hashed data and the new data into an instruction. Developer may also have to provide additional data relevant to the proofs required to verify the original data against the hash.
Each of these will be a consideration when determining if, when, and how to implement state compression for your program.
Concurrent Merkle trees
A Merkle tree is a binary tree structure represented by a single hash. Every leaf node in the structure is a hash of its inner data while every branch is a hash of its child leaf hashes. In turn, branches are also hashed together until, eventually, one final root hash remains.
Since the Merkle tree is represented as a single hash, any modification to leaf data changes the root hash. This causes an issue when multiple transactions in the same slot are attempting to modify leaf data. Since these transactions must execute in series, all but the first will fail since the root hash and proof passed in will have been invalidated by the first transaction to be executed. In other words, a standard Merkle tree can only modify a single leaf per slot. In a hypothetical state-compressed program that relies on a single Merkle tree for its state, this severely limits throughput.
This can be solved with a concurrent Merkle tree. A concurrent Merkle tree is a Merkle tree that stores a secure changelog of the most recent changes along with their root hash and the proof to derive it. When multiple transactions in the same slot try to modify leaf data, the changelog can be used as a source of truth to allow for concurrent changes to be made to the tree.
In other words, while an account storing a Merkle tree would have only the root hash, a concurrent Merkle tree will also contain additional data that allows subsequent writes to successfully occur. This includes:
- The root hash - The same root hash that a standard Merkle tree has.
- A changelog buffer - This buffer contains proof data pertinent to recent root hash changes so that subsequent writes in the same slot can still be successful.
- A canopy - When performing an update action on any given leaf, you need the entire proof path from that leaf to the root hash. The canopy stores intermediate proof nodes along that path so they don’t all have to be passed into the program from the client.
As a program architect, you control three values directly related to these three items. Your choice determines the size of the tree, the cost to create the tree, and the number of concurrent changes that can be made to the tree:
- Max depth
- Max buffer size
- Canopy depth
The max depth is the maximum number of hops to get from any leaf to the root
of the tree. Since Merkle trees are binary trees, every leaf is connected only
to one other leaf. Max depth can then logically be used to calculate the number
of nodes for the tree with 2 ^ maxDepth
.
The max buffer size is effectively the maximum number of concurrent changes that you can make to a tree within a single slot with the root hash still being valid. When multiple transactions are submitted in the same slot, each of which is competing to update leafs on a standard Merkle tree, only the first to run will be valid. This is because that “write” operation will modify the hash stored in the account. Subsequent transactions in the same slot will be trying to validate their data against a now-outdated hash. A concurrent Merkle tree has a buffer so that the buffer can keep a running log of these modifications. This allows the State Compression Program to validate multiple data writes in the same slot because it can look up what the previous hashes were in the buffer and compare against the appropriate hash.
The canopy depth is the number of proof nodes that are stored onchain for any given proof path. Verifying any leaf requires the complete proof path for the tree. The complete proof path is made up of one proof node for every “layer” of the tree, i.e. a max depth of 14 means there are 14 proof nodes. Every proof node passed into the program adds 32 bytes to a transaction, so large trees would quickly exceed the maximum transaction size limit. Caching proof nodes onchain in the canopy helps improve program composability.
Each of these three values, max depth, max buffer size, and canopy depth, comes with a tradeoff. Increasing the value of any of these values increases the size of the account used to store the tree, thus increasing the cost of creating the tree.
Choosing the max depth is fairly straightforward as it directly relates to the
number of leafs and therefore the amount of data you can store. If you need 1
million cNFTs on a single tree where each cNFT is a leaf of the tree, find the
max depth that makes the following expression true: 2^maxDepth > 1 million
.
The answer is 20.
Choosing a max buffer size is effectively a question of throughput: how many concurrent writes do you need? The larger the buffer, the higher the throughput.
Lastly, the canopy depth will determine your program’s composability. State compression pioneers have made it clear that omitting a canopy is a bad idea. Program A can’t call your state-compressed program B if doing so maxes out the transaction size limits. Remember, program A also has required accounts and data in addition to required proof paths, each of which take up transaction space.
Data access on a state-compressed program
A state-compressed account doesn’t store the data itself. Rather, it stores the concurrent Merkle tree structure discussed above. The raw data itself lives only in the blockchain’s cheaper ledger state. This makes data access somewhat more difficult, but not impossible.
The Nexis Native Chain ledger is a list of entries containing signed transactions. In theory, this can be traced back to the genesis block. This effectively means any data that has ever been put into a transaction exists in the ledger.
Since the state compression hashing process occurs onchain, all the data exists in the ledger state and could theoretically be retrieved from the original transaction by replaying the entire chain state from the beginning. However, it’s much more straightforward (though still complicated) to have an indexer track and index this data as the transactions occur. This ensures there is an off-chain “cache” of the data that anyone can access and subsequently verify against the onchain root hash.
This process is complex, but it will make sense after some practice.
State compression tooling
The theory described above is essential to properly understanding state compression. But you don’t have to implement any of it from scratch. Brilliant engineers have laid most of the groundwork for you in the form of the SPL State Compression Program and the Noop Program.
SPL State Compression and Noop Programs
The SPL State Compression Program exists to make the process of creating and updating concurrent Merkle trees repeatable and composable throughout the Nexis Native Chain ecosystem. It provides instructions for initializing Merkle trees, managing tree leafs (i.e. add, update, remove data), and verifying leaf data.
The State Compression Program also leverages a separate “no op” program whose primary purpose is to make leaf data easier to index by logging it to the ledger state. When you want to store compressed data, you pass it to the State Compression program where it gets hashed and emitted as an “event” to the Noop program. The hash gets stored in the corresponding concurrent Merkle tree, but the raw data remains accessible through the Noop program’s transaction logs.
Index data for easy lookup
Under normal conditions, you would typically access onchain data by fetching the appropriate account. When using state compression, however, it’s not so straightforward.
As mentioned above, the data now exists in the ledger state rather than in an account. The easiest place to find the full data is in the logs of the Noop instruction. Unfortunately, while this data will in a sense exist in the ledger state forever, it will likely be inaccessible through validators after a certain period of time.
To save space and be more performant, validators don’t retain every transaction back to the genesis block. The specific amount of time you’ll be able to access the Noop instruction logs related to your data will vary based on the validator. Eventually, you’ll lose access to it if you’re relying directly on instruction logs.
Technically, you can replay the transaction state back to the genesis block but the average team isn’t going to do that, and it certainly won’t be performant. The Digital Asset Standard (DAS) has been adopted by many RPC providers to enable efficient queries of compressed NFTs and other assets. However, at the time of writing, it doesn’t support arbitrary state compression. Instead, you have two primary options:
- Use an indexing provider that will build a custom indexing solution for your program that observes the events sent to the Noop program and stores the relevant data off-chain.
- Create your own pseudo-indexing solution that stores transaction data off-chain.
For many dApps, option 2 makes plenty of sense. Larger-scale applications may need to rely on infrastructure providers to handle their indexing.
State compression development process
Create Rust types
As with a typical Anchor program, one of the first things you should do is define your program’s Rust types. However, Rust types in a traditional Anchor program often represent accounts. In a state-compressed program, your account state will only store the Merkle tree. The more “usable” data schema will just be serialized and logged to the Noop program.
This type should include all the data stored in the leaf node and any contextual
information needed to make sense of the data. For example, if you were to create
a simple messaging program, your Message
struct might look as follows:
To be abundantly clear, this is not an account that you will be able to read from. Your program will be creating an instance of this type from instruction inputs, not constructing an instance of this type from account data that it reads. We’ll discuss how to read data in a later section.
Initialize a new tree
Clients will create and initialize the Merkle tree account in two separate instructions. The first is simply allocating the account by calling System Program. The second will be an instruction that you create on a custom program that initializes the new account. This initialization is effectively just recording what the max depth and buffer size for the Merkle tree should be.
All this instruction needs to do is build a CPI to invoke the
init_empty_merkle_tree
instruction on the State Compression Program. Since
this requires the max depth and max buffer size, these will need to be passed in
as arguments to the instruction.
Remember, the max depth refers to the maximum number of hops to get from any leaf to the root of the tree. Max buffer size refers to the amount of space reserved for storing a changelog of tree updates. This changelog is used to ensure that your tree can support concurrent updates within the same block.
For example, if we were initializing a tree for storing messages between users, the instruction might look like this:
Add hashes to the tree
With an initialized Merkle tree, it’s possible to start adding data hashes. This
involves passing the uncompressed data to an instruction on your program that
will hash the data, log it to the Noop program, and use the State Compression
Program’s append
instruction to add the hash to the tree. The following
discuss what your instruction needs to do in depth:
- Use the
hashv
function from thekeccak
crate to hash the data. In most cases, you’ll want to also hash the owner or authority of the data as well to ensure that it can only be modified by the proper authority. - Create a log object representing the data you wish to log to the Noop
Program, then call
wrap_application_data_v1
to issue a CPI to the Noop program with this object. This ensures that the uncompressed data is readily available to any client looking for it. For broad use cases like cNFTs, that would be indexers. You might also create your own observing client to simulate what indexers are doing but specific to your application. - Build and issue a CPI to the State Compression Program’s
append
instruction. This takes the hash computed in step 1 and adds it to the next available leaf on your Merkle tree. Just as before, this requires the Merkle tree address and the tree authority bump as signature seeds.
When all this is put together using the messaging example, it looks something like this:
Update hashes
To update data, you need to create a new hash to replace the hash at the relevant leaf on the Merkle tree. To do this, your program needs access to four things:
- The index of the leaf to update
- The root hash of the Merkle tree
- The original data you wish to modify
- The updated data
Given access to this data, a program instruction can follow very similar steps as those used to append the initial data to the tree:
- Verify update authority - The first step is new. In most cases, you want
to verify update authority. This typically involves proving that the signer
of the
update
transaction is the true owner or authority of the leaf at the given index. Since the data is compressed as a hash on the leaf, we can’t simply compare theauthority
public key to a stored value. Instead, we need to compute the previous hash using the old data and theauthority
listed in the account validation struct. We then build and issue a CPI to the State Compression Program’sverify_leaf
instruction using our computed hash. - Hash the new data - This step is the same as the first step from
appending initial data. Use the
hashv
function from thekeccak
crate to hash the new data and the update authority, each as their corresponding byte representation. - Log the new data - This step is the same as the second step from
appending initial data. Create an instance of the log struct and call
wrap_application_data_v1
to issue a CPI to the Noop program. - Replace the existing leaf hash - This step is slightly different than the
last step of appending initial data. Build and issue a CPI to the State
Compression Program’s
replace_leaf
instruction. This uses the old hash, the new hash, and the leaf index to replace the data of the leaf at the given index with the new hash. Just as before, this requires the Merkle tree address and the tree authority bump as signature seeds.
Combined into a single instruction, this process looks as follows:
Delete hashes
At the time of writing, the State Compression Program doesn’t provide an
explicit delete
instruction. Instead, you’ll want to update leaf data with
data that indicates the data as “deleted.” The specific data will depend on your
use case and security concerns. Some may opt to set all data to 0, whereas
others might store a static string that all “deleted” items will have in common.
Access data from a client
The discussion so far has covered 3 of the 4 standard CRUD procedures: Create, Update, and Delete. What’s left is one of the more difficult concepts in state compression: reading data.
Accessing data from a client is tricky primarily because the data isn’t stored in a format that is easy to access. The data hashes stored in the Merkle tree account can’t be used to reconstruct the initial data, and the data logged to the Noop program isn’t available indefinitely.
Your best bet is one of two options:
- Work with an indexing provider to create a custom indexing solution for your program, then write client-side code based on how the indexer gives you access to the data.
- Create your own pseudo-indexer as a lighter-weight solution.
If your project is truly decentralized such that many participants will interact with your program through means other than your own frontend, then option 2 might not be sufficient. However, depending on the scale of the project or whether or not you’ll have control over most program access, it can be a viable approach.
There is no “right” way to do this. Two potential approaches are:
- Store the raw data in a database at the same time as sending it to the program, along with the leaf that the data is hashed and stored to.
- Create a server that observes your program’s transactions, looks up the associated Noop logs, decodes the logs, and stores them.
We’ll do a little bit of both when writing tests in this lesson’s lab (though we won’t persist data in a db - it will only live in memory for the duration of the tests).
The setup for this is somewhat tedious. Given a particular transaction, you can
fetch the transaction from the RPC provider, get the inner instructions
associated with the Noop program, use the deserializeApplicationDataEvent
function from the @nexis-network/spl-account-compression
JS package to get the logs,
then deserialize them using Borsh. Below is an example based on the messaging
program used above.
Conclusion
Generalized state compression can be difficult but is absolutely possible to implement with the available tools. Additionally, the tools and programs will only get better over time. If you come up with solutions that improve your development experience, please share with the community!
Lab
Let’s practice generalized state compression by creating a new Anchor program. This program will use custom state compression to power a simple note-taking app.
1. Project setup
Start by initializing an Anchor program:
We’ll be using the spl-account-compression
crate with the cpi
feature
enabled. Let’s add it as a dependency in programs/compressed-notes/Cargo.toml
.
We’ll be testing locally but we need both the Compression program and the Noop
program from Mainnet. We’ll need to add these to the Anchor.toml
in the root
directory so they get cloned to our local cluster.
Lastly, let’s prepare the lib.rs
file for the rest of the Demo. Remove the
initialize
instruction and the Initialize
accounts struct, then add the
imports shown in the code snippet below (be sure to put in your program
id):
For the rest of this Demo, we’ll be making updates to the program code directly
in the lib.rs
file. This simplifies the explanations a bit. You’re welcome to
modify the structure as you will.
Feel free to build before continuing. This ensures your environment is working properly and shortens future build times.
2. Define Note
schema
Next, we’re going to define what a note looks like within our program. Notes should have the following properties:
leaf_node
- this should be a 32-byte array representing the hash stored on the leaf nodeowner
- the public key of the note ownernote
- the string representation of the note
In a traditional Anchor program, this would be an account struct, but since
we’re using state compression, our accounts won’t be mirroring our native
structures. Since we don’t need all the functionality of an account, we can just
use the AnchorSerialize
derive macro rather than the account
macro.
3. Define input accounts and constraints
As luck would have it, every one of our instructions will be using the same
accounts. We’ll create a single NoteAccounts
struct for our account
validation. It’ll need the following accounts:
owner
- this is the creator and owner of the note; should be a signer on the transactiontree_authority
- the authority for the Merkle tree; used for signing compression-related CPIsmerkle_tree
- the address of the Merkle tree used to store the note hashes; will be unchecked since it is validated by the State Compression Programlog_wrapper
- the address of the Noop Programcompression_program
- the address of the State Compression Program
4. Create create_note_tree
instruction
Next, let’s create our create_note_tree
instruction. Remember, clients will
have already allocated the Merkle tree account but will use this instruction to
initialize it.
All this instruction needs to do is build a CPI to invoke the
init_empty_merkle_tree
instruction on the State Compression Program. To do
this, it needs the accounts listed in the NoteAccounts
account validation
struct. It also needs two additional arguments:
max_depth
- the max depth of the Merkle treemax_buffer_size
- the max buffer size of the Merkle tree
These values are required for initializing the data on the Merkle tree account. Remember, the max depth refers to the maximum number of hops to get from any leaf to the root of the tree. Max buffer size refers to the amount of space reserved for storing a changelog of tree updates. This changelog is used to ensure that your tree can support concurrent updates within the same block.
Ensure that your signer seeds on the CPI include both the Merkle tree address and the tree authority bump.
5. Create append_note
instruction
Now, let’s create our append_note
instruction. This instruction needs to take
the raw note as a String and compress it into a hash that we’ll store on the
Merkle tree. We’ll also log the note to the Noop program so the entirety of the
data exists within the chain’s state.
The steps here are as follows:
- Use the
hashv
function from thekeccak
crate to hash the note and owner, each as their corresponding byte representation. It’s crucial that you hash the owner as well as the note. This is how we’ll verify note ownership before updates in the update instruction. - Create an instance of the
NoteLog
struct using the hash from step 1, the owner’s public key, and the raw note as a String. Then callwrap_application_data_v1
to issue a CPI to the Noop program, passing the instance ofNoteLog
. This ensures the entirety of the note (not just the hash) is readily available to any client looking for it. For broad use cases like cNFTs, that would be indexers. You might create your observing client to simulate what indexers are doing but for your own application. - Build and issue a CPI to the State Compression Program’s
append
instruction. This takes the hash computed in step 1 and adds it to the next available leaf on your Merkle tree. Just as before, this requires the Merkle tree address and the tree authority bump as signature seeds.
6. Create update_note
instruction
The last instruction we’ll make is the update_note
instruction. This should
replace an existing leaf with a new hash representing the new updated note data.
For this to work, we’ll need the following parameters:
index
- the index of the leaf we are going to updateroot
- the root hash of the Merkle treeold_note
- the string representation of the old note we’re updatingnew_note
- the string representation of the new note we want to update to
Remember, the steps here are similar to append_note
, but with some minor
additions and modifications:
- The first step is new. We need to first prove that the
owner
calling this function is the true owner of the leaf at the given index. Since the data is compressed as a hash on the leaf, we can’t simply compare theowner
public key to a stored value. Instead, we need to compute the previous hash using the old note data and theowner
listed in the account validation struct. We then build and issue a CPI to the State Compression Program’sverify_leaf
instruction using our computed hash. - This step is the same as the first step from creating the
append_note
instruction. Use thehashv
function from thekeccak
crate to hash the new note and its owner, each as their corresponding byte representation. - This step is the same as the second step from creating the
append_note
instruction. Create an instance of theNoteLog
struct using the hash from step 2, the owner’s public key, and the new note as a string. Then callwrap_application_data_v1
to issue a CPI to the Noop program, passing the instance ofNoteLog
- This step is slightly different than the last step from creating the
append_note
instruction. Build and issue a CPI to the State Compression Program’sreplace_leaf
instruction. This uses the old hash, the new hash, and the leaf index to replace the data of the leaf at the given index with the new hash. Just as before, this requires the Merkle tree address and the tree authority bump as signature seeds.
7. Client test setup
We’re going to write a few tests to ensure that our program works as expected. First, let’s do some setup.
We’ll be using the @nexis-network/spl-account-compression
package. Go ahead and
install it:
Next, we’re going to give you the contents of a utility file we’ve created to
make testing easier. Create a utils.ts
file in the tests
directory, add in
the below, then we’ll explain it.
There are 3 main things in the above file:
NoteLog
- a class representing the note log we’ll find in the Noop program logs. We’ve also added the borsh schema asNoteLogBorshSchema
for deserialization.getHash
- a function that creates a hash of the note and note owner so we can compare it to what we find on the Merkle treegetNoteLog
- a function that looks through the provided transaction’s logs, finds the Noop program logs, then deserializes and returns the corresponding Note log.
8. Write client tests
Now that we’ve got our packages installed and utility file ready, let’s dig into the tests themselves. We’re going to create four of them:
- Create Note Tree - this will create the Merkle tree we’ll be using to store note hashes
- Add Note - this will call our
append_note
instruction - Add Max Size Note - this will call our
append_note
instruction with a note that maxes out the 1232 bytes allowed in a single transaction - Update First Note - this will call our
update_note
instruction to modify the first note we added
The first test is mostly just for setup. In the last three tests, we’ll be asserting each time that the note hash on the tree matches what we would expect given the note text and signer.
Let’s start with our imports. There are quite a few from Anchor,
@nexis-network/web3.js
, @nexis-network/spl-account-compression
, and our own utils file.
Next, we’ll want to set up the state variables we’ll be using throughout our tests. This includes the default Anchor setup as well as generating a Merkle tree keypair, the tree authority, and some notes.
Finally, let’s start with the tests themselves. First the Create Note Tree
test. This test will do two things:
- Allocate a new account for the Merkle tree with a max depth of 3, max buffer size of 8, and canopy depth of 0
- Initialize this new account using our program’s
createNoteTree
instruction
Next, we’ll create the Add Note
test. It should call append_note
with
firstNote
, then check that the onchain hash matches our computed hash and that
the note log matches the text of the note we passed into the instruction.
Next, we’ll create the Add Max Size Note
test. It is the same as the previous
test, but with the second note.
Lastly, we’ll create the Update First Note
test. This is slightly more complex
than adding a note. We’ll do the following:
- Get the Merkle tree root as it’s required by the instruction.
- Call the
update_note
instruction of our program, passing in the index 0 (for the first note), the Merkle tree root, the first note, and the updated data. Remember, it needs the first note and the root because the program must verify the entire proof path for the note’s leaf before it can be updated.
That’s it, congrats! Go ahead and run anchor test
and you should get four
passing tests.
If you’re running into issues, feel free to go back through some of the demo or look at the full solution code in the Compressed Notes repository.
Challenge
Now that you’ve practiced the basics of state compression, add a new instruction to the Compressed Notes program. This new instruction should allow users to delete an existing note. keep in mind that you can’t remove a leaf from the tree, so you’ll need to decide what “deleted” looks like for your program. Good luck!
If you’d like a very simple example of a delete function, check out the
solution
branch on GitHub.