Introduce backwards/forwards compatibility for bdk_file_store. #1527

Open

wants to merge 4 commits into master from I1499/introduce-tlv-for-bdk-file-store

Conversation

nymius
Contributor

@nymius nymius commented Jul 28, 2024

Description

Addresses issue #1499.

Notes to the reviewers

The changes have been rebased over #1495, because it will be merged sooner and there is a great number of conflicts that will arise later with master if not done.

I created this draft to gather feedback.
I don't fully understand the issue yet and cannot visualize how the end user will use this feature.

I noticed that the expected variant index error is raised only when the decoded version is newer than the one implemented in the code. There is no need to check the indices in that case, as the error already signals that the encoded variant index is higher than any existing in the enum.
If the decoded variant is within the range of the enum variants but decoding of the inner struct still fails, the error lies somewhere else.
Also, the decoded variant can't be below the range, as that's not possible, at least with the default variant representation.

Another point to keep in mind: I expected the above situation to raise InvalidTagEncoding, but it didn't happen. I'm still tracking down what triggers this error, which seems better suited to describe the situation we want to handle, but I haven't found anything yet.

All this means I could refactor the decoding entry_iter code and make it shorter.

FAQ:

What's the idea behind versioned changesets?

To enable backward and forward compatibility @LLFourn crafted versioned changesets with the following properties:

  1. Each new version contains the variants of the previous version and at least one extra variant. I.e.: $\forall\, i, j \in \{1, 2, \ldots, n\}$, if $i > j$ then $V_i \supset V_j$

E.g.:

enum Versioned1 {
	V1(StructA),
	V2(StructB),
	V3(StructC)
}

// Next code version
enum Versioned2 {
	V1(StructA),
	V2(StructB),
	V3(StructC),
	V4(StructD),
	V5(StructE)
}
  2. Each variant contains the fields of the previous variant in the same order and at least one extra field. I.e.: $\forall\, r \in \{1, 2, \ldots, k-1\}$, if $c_r, c_{r+1} \in V_i$ then $c_{r+1} \supset c_r$
    E.g.:
enum Versioned1 {
	V1(StructA),
	V2(StructB),
	V3(StructC)
}

// Next code version
enum Versioned2 {
	V1(StructA),
	V2(StructB),
	V3(StructC),
	V4(StructD),
	V5(StructE)
}

struct StructA {
	field_a: u64
}

struct StructB {
	field_a: u64,
	field_b: u64
}

struct StructC {
	field_a: u64,
	field_b: u64,
	field_c: u64
}

struct StructD {
	field_a: u64,
	field_b: u64,
	field_c: u64,
	field_d: u64,
	field_e: u64,
	field_f: u64
}

struct StructE {
	field_a: u64,
	field_b: u64,
	field_c: u64,
	field_d: u64,
	field_e: u64,
	field_f: u64,
	field_g: u64
}
How do versioned changesets solve backward compatibility? Considering the previous definition of versioned changesets, backward compatibility means:

Decode $V_j$ using $V_i$ when $i > j$

This is trivial, as $V_i$ already includes all the variants in $V_j$, so it is just a matter of letting bincode select the variant of $V_i$ based on the encoded variant index.
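
For instance, a minimal sketch of this case (assuming a bincode_options() helper built on bincode's DefaultOptions with varint encoding, as used elsewhere in this PR; the enums are hypothetical stand-ins):

use bincode::Options;
use serde::{Deserialize, Serialize};

fn bincode_options() -> impl bincode::Options {
    bincode::DefaultOptions::new().with_varint_encoding()
}

// The enum the old software wrote with...
#[derive(Serialize, Deserialize)]
enum VersionedOld {
    V1(u64),
}

// ...and the enum the new software reads with.
#[derive(Serialize, Deserialize)]
enum VersionedNew {
    V1(u64),
    V2(u64, u64),
}

fn main() -> Result<(), bincode::Error> {
    // Data written by the old software...
    let bytes = bincode_options().serialize(&VersionedOld::V1(42))?;
    // ...decodes cleanly with the new enum: bincode reads the variant
    // index prefix and selects VersionedNew::V1.
    let decoded: VersionedNew = bincode_options().deserialize(&bytes)?;
    assert!(matches!(decoded, VersionedNew::V1(42)));
    Ok(())
}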

How do versioned changesets solve forward compatibility? Defining forward compatibility in terms of the versioned changeset definition:

Decode changeset $c_w$ from $V_i$ using code that only knows about $V_j$, when $i > j$ and $c_w$ is not included in $V_j$.

Because of properties 1 and 2 we know the max variant in $V_j$, let's call it $c_r$, fulfills property 2: $c_r \subset c_w$. So we can try to decode $c_r$ from the encoded data, ignoring the remaining fields of $c_w$.

What does the encoding format look like? The format is:
[L][V]

where [L] is the length of the encoded data, and [V] is the encoding of a variant of the enum representing the versioned changeset. [V] is therefore always prefixed with the variant index, let's call it i, followed by the actual data, let's call it C. The final encoding is then:

[L][(i)(C)]
Why do we need to prepend a length to every encoded versioned changeset? Typically the encoding length of a versioned changeset $V_i$ would be determined by its type definition; however, a newer version $V_j$ with $j > i$ will have new variants, and the encoding length of those variants cannot be determined if the code only knows about $V_i$.
How do we realize we are decoding a newer encoded version than we actually have? As bincode encodes each enum variant prefixed with its variant index, if the versioned changeset $V_i$ we are currently working with doesn't contain the variant index, we can be sure the data is of a newer type. On the flip side, if the data is of a variant index that exists in $V_i$ but still fails to decode, we can be sure the data is corrupted.
How does encoding work? (A sketch follows the list.)
  1. Transform the changeset into a versioned changeset first.
  2. Get the serialized size of the versioned changeset: [L].
  3. Encode [L] as u64 and write it to the file.
  4. Encode [V] and write it to the file. Remember bincode encodes it prefixed with the variant index: [(i)(C)]
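
A minimal sketch of those four steps (bincode_options(), VersionedChangeSet, and append_changeset are hypothetical stand-ins assuming bincode's Options API, not this PR's actual code):

use std::io::Write;

use bincode::Options;
use serde::{Deserialize, Serialize};

fn bincode_options() -> impl bincode::Options {
    bincode::DefaultOptions::new().with_varint_encoding()
}

#[derive(Serialize, Deserialize)]
enum VersionedChangeSet {
    V1(u64), // stand-in body for the real changeset C
}

fn append_changeset(file: &mut impl Write, changeset: u64) -> Result<(), bincode::Error> {
    // 1. Wrap the plain changeset in the versioned enum.
    let versioned = VersionedChangeSet::V1(changeset);
    // 2. [L]: the serialized size of the versioned changeset.
    let len: u64 = bincode_options().serialized_size(&versioned)?;
    // 3. Encode [L] as u64 and write it to the file.
    bincode_options().serialize_into(&mut *file, &len)?;
    // 4. Encode [V]; bincode prepends the variant index: [(i)(C)].
    bincode_options().serialize_into(&mut *file, &versioned)?;
    Ok(())
}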
How does decoding work? (A sketch follows the list.)
  1. Decode [L] as u64 from the file.
  2. Attempt to decode the first [L] bytes as [V].
    a. On success: extract C from [V] ([(i)(C)]) and return it.
    b. On failure: decode the first bytes of [V] as u64 to extract the enum variant index. If it is too high, trim the variant index from the data and attempt to decode C, the written changeset $c_w$, into the max variant $c_r$ of the version in use $V_i$, the reading changeset.
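
And a matching decoding sketch under the same assumptions (the variant-index width computation here is a simplification of what the next questions describe):

use std::io::Read;

use bincode::Options;
use serde::{Deserialize, Serialize};

fn bincode_options() -> impl bincode::Options {
    bincode::DefaultOptions::new().with_varint_encoding()
}

// Reader-side versioned changeset: one version behind the writer.
#[derive(Serialize, Deserialize)]
enum VersionedChangeSet {
    V1(OldChangeSet),
}

#[derive(Serialize, Deserialize)]
struct OldChangeSet {
    field_a: u64,
}

fn read_changeset(file: &mut impl Read) -> Result<OldChangeSet, bincode::Error> {
    // 1. Decode [L] as u64 from the file.
    let len: u64 = bincode_options().deserialize_from(&mut *file)?;
    // 2. Read exactly [L] bytes and attempt to decode them as [V].
    let mut buf = vec![0u8; len as usize];
    file.read_exact(&mut buf).map_err(bincode::ErrorKind::Io)?;
    match bincode_options().deserialize::<VersionedChangeSet>(&buf) {
        // 2a. On success: extract C from [V] ([(i)(C)]) and return it.
        Ok(VersionedChangeSet::V1(changeset)) => Ok(changeset),
        // 2b. On failure: extract the variant index (i), trim it, and
        //     attempt to decode the known leading fields as c_r.
        Err(_) => {
            let index: u64 = bincode_options().allow_trailing_bytes().deserialize(&buf)?;
            // The varint encoding of the same value has the same width,
            // so this tells us how many bytes to trim.
            let index_len = bincode_options().serialized_size(&index)? as usize;
            bincode_options()
                .allow_trailing_bytes()
                .deserialize::<OldChangeSet>(&buf[index_len..])
        }
    }
}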
How do we determine if a variant is too high?

The code available to extract the number of variants of an enum is either unsafe or nightly-only, so we force the versioned changeset to implement the Default trait, expecting it to always return the max variant (latest changeset version) of the versioned changeset. We can then encode it using bincode and decode the prefixed variant index as u64, ignoring the remaining encoded bytes.

let serialized_max_variant = bincode_options().serialize(&T1::default())?;
// allow trailing bytes because the buffer holds a whole serialized variant, not just
// a u64, and we only want to extract the prefixed variant index
let max_variant_index = bincode_options()
    .allow_trailing_bytes()
    .deserialize::<u64>(&serialized_max_variant)?;
Is it necessary to check if the encoded variant index is higher than the max variant index of the versioned changeset in use?

I think it's not. While testing the implementation, the code panicked with a particular bincode::ErrorKind::Custom error announcing the decoded variant index wasn't expected because it exceeded the available variant indexes. So it seems just handling this error is enough to detect newer encoded versions.

I haven't removed the check because I want someone else to confirm this.

How do we decode the higher-version encoded changeset $c_w$ with the lower-version decoding changeset $c_r$?

As I said, the code decodes the first bytes of [V] as u64 to extract the enum variant index and, if it is too high, trims the variant index from the data and attempts to decode C, the written changeset $c_w$, into the max variant $c_r$ of the version in use $V_i$, the reading changeset.
As this step decodes less data than is available in the buffer, we risk leaving the extra data in the buffer, losing "data alignment" and ending with a bincode SizeLimit error produced by falling short while decoding the last portion of the encoded data.
To avoid this, we need to keep track of the encoded versioned changeset size [L] and, after decoding $c_r$, subtract the serialized size of $c_r$ (which includes the variant index) from the decoded [L], and use the result as an offset to advance the reader position by that number of bytes.

To summarize:

  1. Keep track of the encoded size of the changeset.
  2. Get the size of the decoded $c_r$.
  3. Get the remaining data to read by subtracting 2 from 1.
  4. Skip the number of bytes obtained in 3 (the extra fields from $c_w$).
  5. Keep decoding the remaining data, going back to 1 if needed. (A sketch follows the example below.)

To illustrate this, look at the following example:
$c_w$ (greater encoded changeset version)
$c_r$ (decoding changeset version with fewer fields)
$c_w$ = [0][0][0][0]
$c_r$ = [x][x]
[S] means the byte has been skipped

For the sake of the example, ignore the length prefix and the variant index, and grant me the license to use a delimiter that bincode doesn't have: the pipe |.

initial encoded buffer = [0][0][0]|[0] | [0][0][0][0] | [0][0][0]|[0] | [0][0][0][0]
first decode = [X][X][S]|[S] | [0][0][0][0] | [0][0][0]|[0] | [0][0][0][0]
second decode = [X][X][S]|[S] | [X][X][S][S] | [0][0][0]|[0] | [0][0][0][0]
third decode = [X][X][S]|[S] | [X][X][S][S] | [X][X][S]|[S] | [0][0][0][0]
fourth decode = [X][X][S]|[S] | [X][X][S][S] | [X][X][S]|[S] | [X][X][S][S]
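
A sketch of the realignment step under the same assumptions, for a seekable reader (names hypothetical, not this PR's EntryIter code):

use std::io::{Seek, SeekFrom};

use bincode::Options;
use serde::Serialize;

fn bincode_options() -> impl bincode::Options {
    bincode::DefaultOptions::new().with_varint_encoding()
}

// After decoding c_r out of a longer c_w record, move the reader past the
// unknown trailing fields so the next record starts aligned.
fn realign<R: Seek, T: Serialize>(
    file: &mut R,
    encoded_len: u64, // 1. [L], tracked when the record header was read
    decoded: &T,      // the c_r we managed to decode
    index_len: u64,   // width of the trimmed variant index (i)
) -> Result<(), bincode::Error> {
    // 2. Size of what was actually consumed: (i) plus c_r.
    let consumed = index_len + bincode_options().serialized_size(decoded)?;
    // 3./4. Skip the remaining bytes (the extra fields from c_w).
    file.seek(SeekFrom::Current((encoded_len - consumed) as i64))
        .map_err(bincode::ErrorKind::Io)?;
    // 5. The caller keeps decoding the next record from here.
    Ok(())
}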

Which problems may this realignment of data cause?

If not used carefully, a version $V_i$ and a higher $V_j$ may have the same variant index with different types, and the change above will decode it without returning any error. This obviously goes against the definition of versioned changesets, but without the realignment code this error was easily spotted.
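
A minimal sketch of that failure mode (hypothetical types, same assumed bincode_options() helper): two code versions reuse variant index 0 with different payload types, and decoding succeeds silently with the wrong type.

use bincode::Options;
use serde::{Deserialize, Serialize};

fn bincode_options() -> impl bincode::Options {
    bincode::DefaultOptions::new().with_varint_encoding()
}

#[derive(Serialize, Deserialize)]
enum WriterVersion {
    V1(u64), // the writer's V1 holds a u64...
}

#[derive(Serialize, Deserialize, Debug)]
enum ReaderVersion {
    V1(u32), // ...but the reader's V1 holds a u32
}

fn main() -> Result<(), bincode::Error> {
    let bytes = bincode_options().serialize(&WriterVersion::V1(5))?;
    // No error: the varint bytes of 5u64 also parse as a valid u32.
    let decoded: ReaderVersion = bincode_options().deserialize(&bytes)?;
    println!("{decoded:?}"); // ReaderVersion::V1(5), silently mistyped
    Ok(())
}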

Do all users need to implement versioned changesets to work with the `file_store` crate?

No, the current code is completely compatible with a simple changeset that doesn't use any versioned changeset enum or variant index prefix.

Why is there a pattern matching arm handling `bincode::ErrorKind::InvalidTagEncoding`?

While implementing the changes and investigating bincode, I stumbled over this error, which at first seemed the most suitable to handle a variant index out of bounds, but I never got it to reproduce.

I think it is removable, but I would like another eye there before doing it.

What tests have been added?

As the backward compatibility case is trivial (it just depends on bincode logic), the only added test involves the hard forward compatibility case: when changeset $c_w$ has been encoded with a versioned changeset $V_j$ and is being decoded with a lesser versioned changeset $V_i$ with $i < j$, and the maximum available variant is $c_r$.
For this test, multiple changesets have been appended to replicate the alignment issue explained above (which may not arise if only one changeset was encoded).
Also, the Option type has been avoided because there was a misleading notion of how bincode encodes and decodes these fields that might lead users to think any additional fields added in higher versions should be of this type.

What would I like feedback on?

For the parts of the code where I would like feedback, check out:

  • The only implemented test
  • Question Is it necessary to check if the encoded variant index is higher than the max variant index of the versioned changeset in use? of this FAQ.
  • Question Why there is an arm handling bincode::ErrorKind::InvalidTagEncoding? of this FAQ.
  • Question Which problems this realignment of data may have? of this FAQ.

Changelog notice

Checklists

All Submissions:

  • I've signed all my commits
  • I followed the contribution guidelines
  • I ran cargo fmt and cargo clippy before committing

New Features:

  • I've added tests for the new feature
  • I've added docs for the new feature

@@ -118,7 +126,7 @@ where
 /// **WARNING**: This method changes the write position in the underlying file. You should
 /// always iterate over all entries until `None` is returned if you want your next write to go
 /// at the end; otherwise, you will write over existing entries.
-pub fn iter_changesets(&mut self) -> EntryIter<C> {
+pub fn iter_changesets(&mut self) -> EntryIter<V> {
Member

I think it will be helpful to have two versions of this method. One that iterates over versioned changesets; iter_versioned_changesets(&mut self) -> EntryIter<V> and one that iterates over changesets; iter_changesets(&mut self) -> EntryIter<C>.

Contributor Author

Will do, thanks!

@nymius nymius force-pushed the I1499/introduce-tlv-for-bdk-file-store branch 2 times, most recently from fd43805 to 4891b28 Compare August 2, 2024 23:33
@evanlinjin
Member

evanlinjin commented Aug 7, 2024

There are too many changes in the examples. I think there is a bit of confusion about the goal of the ticket. We don't need to make our new code understand data persisted with our old code. We want data persisted with the new code to be backwards compatible in the future.

Edit: Okay I see why the examples are changed. It's because we can't impl Into and From on tuples that don't contain a locally defined type...

At the same time, changing the example changesets to store descriptors is out of scope for this PR.

@evanlinjin
Member

For the bdk_file_store::Store methods that require transforming C to V (or vice versa), I think we should have 2 versions of each. One for when the C: Into<V> and V: Into<C> bounds are satisfied, and one for when they are not. For the unsatisfied case, we take in closures that can handle such transformations.

This way, we can keep the examples mostly the same, and just call the methods with the transform closures as input.
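
A sketch of what those two flavours might look like (the Store body, method names, and error type are hypothetical stand-ins, not this PR's code):

use std::io;
use std::marker::PhantomData;

struct Store<C, V> {
    _marker: PhantomData<(C, V)>,
}

impl<C, V> Store<C, V> {
    // Flavour 1: available when the C: Into<V> bound is satisfied.
    fn append_changeset(&mut self, changeset: C) -> io::Result<()>
    where
        C: Into<V>,
    {
        self.append_changeset_with(changeset, Into::into)
    }

    // Flavour 2: the caller supplies the C -> V transform as a closure.
    fn append_changeset_with(
        &mut self,
        changeset: C,
        to_versioned: impl FnOnce(C) -> V,
    ) -> io::Result<()> {
        let _versioned: V = to_versioned(changeset);
        // ...encode [L][V] and write it to the underlying file...
        Ok(())
    }
}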

@evanlinjin
Member

I don't fully understand the issue yet and cannot visualize how the end user will use this feature.

We want to be able to persist changesets in bdk_file_store that are compatible with newer versions of the software. The newer version of the software may have new or modified fields in the changeset.

Instead of storing changesets directly, we wrap the changesets in an enum before storing. This acts as some sort of versioning.

Comment on lines 129 to 118
pub fn iter_versioned_changesets(&mut self) -> EntryIter<V> {
EntryIter::new(self.magic_len as u64, &mut self.db_file)
}

Member

iter_changesets should be built from iter_versioned_changesets. We only want to read versioned changesets directly from the file.

Member

    pub fn iter_changesets(&mut self) -> impl Iterator<Item = Result<C, IterError>> + '_ {
        self.iter_versioned_changesets().map(|r| r.map(Into::into))
    }

Comment on lines 25 to 36
C: Into<V>
+ Merge
+ Clone
+ Default
+ serde::de::DeserializeOwned
+ core::marker::Send
+ core::marker::Sync,
V: Into<C>
+ Default
+ serde::Serialize
+ serde::de::DeserializeOwned
+ core::marker::Send
Member

Suggested change
-C: Into<V>
-    + Merge
-    + Clone
-    + Default
-    + serde::de::DeserializeOwned
-    + core::marker::Send
-    + core::marker::Sync,
-V: Into<C>
-    + Default
-    + serde::Serialize
-    + serde::de::DeserializeOwned
-    + core::marker::Send
+C: Into<V> + Merge + Clone + Default,
+V: Into<C>
+    + Default
+    + serde::Serialize
+    + serde::de::DeserializeOwned,

There are a lot of unnecessary bounds here.

Comment on lines 15 to 16
where
C: Sync + Send,
Member

Suggested change
-where
-    C: Sync + Send,

No longer needed.

Contributor Author

@nymius nymius Aug 13, 2024

Why were they here in the first place? Can't changesets be modified by multiple threads?

Member

In the past, we used a trait object kept internally in Wallet for the PersistenceBackend... However, I'm not sure; maybe it was just a mistake. Those bounds aren't needed for the methods we implement below.

Comment on lines 158 to 161
 pub fn aggregate_changesets(&mut self) -> Result<Option<C>, AggregateChangesetsError<C>> {
-    let mut changeset = Option::<C>::None;
-    for next_changeset in self.iter_changesets() {
-        let next_changeset = match next_changeset {
-            Ok(next_changeset) => next_changeset,
-            Err(iter_error) => {
-                return Err(AggregateChangesetsError {
-                    changeset,
-                    iter_error,
-                })
-            }
-        };
+    let changeset_aggregator =
+        |mut aggregated_changeset: Option<C>, next_changeset| match next_changeset {
+            Ok(next_changeset) => {
+                match &mut aggregated_changeset {
+                    Some(aggregated_changeset) => aggregated_changeset.merge(next_changeset),
+                    aggregated_changeset => *aggregated_changeset = Some(next_changeset),
+                }
+                Ok(aggregated_changeset)
+            }
+            Err(iter_error) => Err(AggregateChangesetsError {
+                changeset: aggregated_changeset,
+                iter_error,
+            }),
+        };
-        match &mut changeset {
-            Some(changeset) => changeset.merge(next_changeset),
-            changeset => *changeset = Some(next_changeset),
-        }
-    }
-    Ok(changeset)
+
+    // Prepare the unversioned C changeset deserializer
+    let unversioned_parsed_changeset = self
+        .iter_changesets()
+        .try_fold(Option::<C>::None, changeset_aggregator);
+
+    // But first try the versioned V changeset deserializer and, if it fails, proceed with the
+    // unversioned C deserializer
+    self.iter_versioned_changesets()
+        .try_fold(Option::<C>::None, |acc, x| {
+            changeset_aggregator(acc, x.map(|v| v.into()))
+        })
+        .or(unversioned_parsed_changeset)
 }
Member

I think there is a bit of a misunderstanding of what @LLFourn is suggesting here and here.

Forward compatibility here means that the encoded V is of a newer version than what our software has, but our software needs to support deserializing it.

To have forward compatibility, we need to prepend a length to every encoded V.

Why? Typically, the encoding length can be determined by the type definition of V. However, a newer version of V will have new variants, and the encoding lengths of those variants cannot be determined (because we don't have them).

When we fail to deserialize into a variable of type V, we need to check the "variant index" encoding first.

Why? If the variant index does not exist in our type V (which is an enum), then we are sure the data is of a newer type. Since the data is of a newer type, we can apply the forward-compatibility logic. On the flip side, if the data is of a variant index that exists in V, then we are sure that the data is corrupted.

What is "variant index"? bincode encodes enums with the variant index first. The variant index is encoded with VarintEncoding by default. Refer to how bincode encodes integers here, and enums here.

The Encoding Format

Each changeset should be encoded as follows:

[L][V]
  • [L] is a bincode encoded u64. Use serialized_size on V to figure out the value of L.
  • [V] is a bincode encoded enum type that represents a versioned changeset. Because it is an enum, it is guaranteed that the encoded data begins with a VariantEncoding.

When encoding a changeset

  1. Transform the changeset into a versioned-changeset first.
  2. Determine the serialized_size of the versioned-changeset. The serialized_size of V is L.
  3. Encode L: u64 and write to file.
  4. Encode V and write to file.

When decoding a changeset

  1. Decode L: u64 from file.
  2. Read L bytes from file and attempt to decode as V.
    • On success: Great! Extract C from V and return it.
    • On failure: Decode V as u64 (to extract the enum-variant-varint-index). If variant index is too high, trim the variant-index from the data, and attempt to decode into C (changeset) directly.

How do we determine if a variant-index is too high?

Ideally, we can use https://doc.rust-lang.org/std/mem/fn.variant_count.html on our definition of V. However, this is a nightly-only API.

Instead, let's use the hack that @LLFourn has suggested: Encode V::default() (which should always be the latest-version variant), and then decode the result as u64 to extract the variant-index.

Member

I also propose to update this PR title to: Introduce backwards/forwards compatibility for bdk_file_store.

Member

@evanlinjin evanlinjin Aug 13, 2024

Btw, to have forward compatibility, we are assuming that the new changeset C only adds new fields at the end as Options (when diffed with the old changeset C).

Contributor Author

Thanks for the thorough review and write up!

Contributor Author

When decoding a changeset, in the On failure case of step 2, do we hope to succeed when C is of a lower version than the encoded one (just without the extra, probably Option, fields), or should the code fail? Or do we expect C to be of the newer version while V doesn't account for it yet? (Like in the test I added.)

cc @evanlinjin

Member

@evanlinjin evanlinjin Sep 2, 2024

@nymius

Just to be clear about what we are talking about: V and C can either be type definitions defined by the software version we use to read the data, or by the software version which encoded and wrote the data to the file we are reading.

  • For type definitions defined by the software that reads, let's use Vr and Cr (Where Cr is the body of Vr's latest enum variant).
  • For type definitions defined by the software that writes, let's use Vw and Cw (where Cw is the body of Vw's latest enum variant).

When decoding the data as Vr fails, it either means:

  1. Vw's enum variant index does not exist in Vr.
  2. The data to decode is invalid. This is either data corruption, or Vr has a variant with a different C definition to that of Vw.

For 1., we still try to decode the data (excluding the enum variant index) to Cr. This will work if Cw only adds new fields on top of Cr (which still exists in Vw as a lower enum variant).

For 2., we can just error out. It's either a programming error or data corruption.

@nymius nymius changed the title Introduce versioning for changeset backward and forward compatibility Introduce backwards/forwards compatibility for bdk_file_store. Aug 13, 2024
@nymius nymius force-pushed the I1499/introduce-tlv-for-bdk-file-store branch 6 times, most recently from 3158574 to ef70dad Compare August 24, 2024 14:33
@nymius nymius marked this pull request as ready for review August 24, 2024 14:47
@notmandatory notmandatory added this to the 1.0.0-beta milestone Aug 24, 2024
@nymius nymius force-pushed the I1499/introduce-tlv-for-bdk-file-store branch from ef70dad to 45f8285 Compare August 25, 2024 21:47
Forward compatibility means the encoded V is of a newer version than
what our software has, but our software needs to support deserializing
it.
To have forward compatibility, we need a length prefix for every encoded
V.

Why? Typically, the encoding length can be determined by the type
definition of V. However, a newer version of V will have new variants,
and the encoding lengths of those variants cannot be determined (because
we don't have them).
When we fail to deserialize into a variable of type V, we need to check
the "variant index" encoding first.

Why? If the variant index does not exist in our type V (which is an
enum), then we are sure the data is of a newer type. Since the data is
of a newer type, we can apply the forward-compatibility logic. On the
flip side, if the data is of a variant index that exists in V, then we
are sure that the data is corrupted.

What is "variant index"? bincode encodes enums with the variant index
first. The variant index is encoded with VarintEncoding by default.

ENCODING FORMAT

Each changeset should be encoded as follows:

[L][V]

- [L] is a bincode encoded u64. Use serialized_size on V to figure out
  the value of L.
- [V] is a bincode encoded enum type that represents a versioned
  changeset. Because it is an enum, it is guaranteed that the encoded
  data begins with a VariantEncoding.

Encoding:

1. Transform the changeset into a versioned-changeset first.
2. Determine the serialized_size of the versioned-changeset. The
   serialized_size of V is L.
3. Encode L: u64 and write to file.
4. Encode V and write to file.

Decoding:

1. Decode L: u64 from file.
2. Read L bytes from file and attempt to decode as V.
  a. On success: Great! Extract C from V and return it.
  b. On failure: Decode V as u64 (to extract the
  enum-variant-varint-index). If variant index is too high, trim the
  variant-index from the data, and attempt to decode into C (changeset)
  directly.

How do we determine if a variant-index is too high?

Encode V::default() (which should always be the latest-version variant),
and then decode the result as u64 to extract the variant-index.
Serialize a newer versioned changeset and deserialize it into an
alternative non-versioned changeset.
@nymius nymius force-pushed the I1499/introduce-tlv-for-bdk-file-store branch from 45f8285 to 07c2857 Compare September 16, 2024 01:16
The V generic is intended for cases when the user keeps their changesets
versioned and with a strict policy of only adding fields, not modifying
older ones nor removing them.

In the case the data has been encoded with a newer version but is
decoded with an older one, the code should attempt to parse only the
fields it already knows of, ignoring the rest.

This code implements that logic but also opens the door for parsing
fields of different types as the wrong type.
@nymius nymius force-pushed the I1499/introduce-tlv-for-bdk-file-store branch from 07c2857 to 3731a1c Compare September 17, 2024 11:58
@notmandatory notmandatory removed this from the 1.0.0-beta milestone Sep 19, 2024