engineering.helium.com devblog migration to docs (helium#1067)
* adding blog post history

* enable blog

* blog images and links fixed

* set blog title and description config

* adding truncation lines

* replace relative links with hardlinks, docusaurus limitation

* blog moved under devblog

* navbar cleanup

* font color fix

* remove announcement banner from navbar

* hide blog post table of contents

* remove sidebar

* added blog post authors

* added blog post authors

* leading whitespace on headers

* added authors

* header whitespace padding

* typo

* blog list styling

Co-authored-by: Joey Hiller <[email protected]>
samgutentag and jthiller committed Jan 23, 2023
1 parent 07bc153 commit 28f0fd7
Showing 419 changed files with 18,221 additions and 38 deletions.
38 changes: 38 additions & 0 deletions devblog/2019-10-18-beta-deploy.mdx
@@ -0,0 +1,38 @@
---
layout: post
title: Blockchain Beta Deploy
date: 2019-10-18 23:41 -0700
hide_table_of_contents: true
authors: [madninja]
---

As part of improving network behavior and testing it in a real environment we use a number of opt-in
beta hotspots.

<!--truncate-->

There are some ground rules for beta deployments that we strictly follow:

- The change can _not_ affect blockchain consensus behavior. Any behavior change that affects chain
consensus rules _must_ be gated on a chain variable.

- The change can improve peripheral elements of the software which _may_ indirectly affect the
behavior of the blockchain. Changes are usually related to performance or stability improvements
that make it more likely that a hotspot can talk to more hotspots or recover from error conditions
better than before the update.
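
To illustrate the first rule, here is a minimal Python sketch of gating a behavior change on a chain variable. The variable name `poc_path_limit` and the helper `trim_path` are hypothetical and chosen only for illustration; this is not the actual miner code.

```python
from typing import Mapping, Optional, Sequence


def trim_path(path: Sequence[str], chain_vars: Mapping[str, int]) -> list:
    """Apply the new rule only when the (hypothetical) chain variable is set.

    Every node reads the same variable from the same ledger, so all nodes
    switch behavior at the same block instead of when their firmware updates.
    """
    limit: Optional[int] = chain_vars.get("poc_path_limit")
    if limit is None:
        # Variable not set on chain: keep the old behavior untouched.
        return list(path)
    # Variable set on chain: the new behavior applies everywhere at once.
    return list(path)[:limit]
```

With this shape, new code can ship disabled in a firmware update and be activated later by a single chain variable transaction.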

## Content

The beta group was updated with firmware `2019.10.18.1` which includes:

- Improving some error resilience in the network relay service
- Only blacklisting relays that are explicitly known to not be connected by the intermediate host
- Speed improvements in NAT detection
- Using more available peer targets when syncing the blockchain
- Fix ledger disagreement over ordering of each hotspot's geographic neighbor list

## Deployment Plan

Given the small set of patches this beta includes and the expected stabilizing effect on block sync
and PoC receipt delivery issues, we plan to let this beta run for 24 hours and then deploy it to the
network at large.
154 changes: 154 additions & 0 deletions devblog/2019-10-18-incident-report.mdx
@@ -0,0 +1,154 @@
---
layout: post
title: Incident Report - Oct 10 - 17, 2019
date: 2019-10-18 22:36 -0700
hide_table_of_contents: true
authors: [madninja]
---

## Overview

The Helium blockchain team has been working for several weeks on some incremental upgrades to our
Proof of Coverage (PoC) and election systems, to make them more scalable and fair. Changes here
include:

<!--truncate-->

- adjustments to make consensus groups more geographically diverse.
- changes to limit the length of PoC paths, which grow quadratically more expensive to validate
(`O(n²)` for the computer scientists in the audience) as they grow longer.
- changes to PoC challenge generation to make it impossible for an unsynced hotspot to participate
in a PoC challenge.

## What Happened

### PoCv2 Issues

On October 10th around noon PDT, we submitted the chain variable transaction to activate these
changes, which had gone, disabled, into production over the preceding few weeks. Shortly
thereafter, things began to go badly wrong. We were immediately aware of the impact and began
analysis of the situation. While we were able to determine quickly that block gossip had stopped
functioning effectively, it was unclear why. Our public-facing API server, which allows our mobile
apps to communicate with the hotspot blockchain network, was also not functioning correctly.

First, we discovered that the code version on our cloud-based full nodes was too old, and they had
stopped syncing the chain and participating in block gossip, slowing things down quite a bit. This
was remedied by 3pm, but did not help as much as expected, so we continued to look at hotspots to
figure out what the true root of the issue was. It was clear that validating PoC receipt
transactions was a major part of the problem, but it was unclear how they had become problematic.

By 8PM PDT the ledger on the API had been reset and it resumed normal service, and the team was deep
into discussions of how to resolve the issue. At 11PM PDT a transaction was prepared that rolled
back part of the configuration change, and this seemed, initially, to be effective, so the team went
to sleep. Overnight, at approximately 1:30AM, the chain halted again, for similar reasons.

The next morning was also dedicated to analysis of the problem. We prepared and partially rolled out
a patch by 3pm that set a deadline on how long a transaction could take to validate,
preventing extremely long block times. It was declared GA at 4:20PM PDT and seemed to help somewhat,
but not enough. By 6PM PDT, it was decided that this was not helping enough and the full suite of
PoC changes should be removed via another chain variable change transaction. This had the desired
effect of restabilizing the chain.

### Ledger Fork

However, just after 8PM PDT, the chain experienced another halt. Investigation revealed that a large
number of hotspots considered the block invalid and could not continue syncing the chain. While we
initially feared that this was related to the earlier problems, it turned out to be unrelated. We
have been dealing with long-term determinism issues surrounding retained floating point values and
list ordering diverging over time for a while now, and this was another in the same category.
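
As a generic illustration of this class of problem (not the actual ledger code), the Python sketch below shows how the same floating point values summed in a different list order can produce different results, and how integer fixed-point values avoid it. The function names and the scale factor are hypothetical.

```python
def naive_total(scores):
    """Sum floats in the order given; the result depends on that order."""
    total = 0.0
    for score in scores:
        total += score
    return total


SCALE = 10**8  # hypothetical fixed-point scale: store values as integers


def canonical_total(scaled_scores):
    """Sum integers; integer addition is exact, so order does not matter."""
    return sum(scaled_scores)


# Same values, different order, different float result.
forward = naive_total([0.1, 0.2, 0.3])
backward = naive_total([0.3, 0.2, 0.1])
order_matters = forward != backward

# The fixed-point version agrees regardless of order.
scaled = [10_000_000, 20_000_000, 30_000_000]
order_free = canonical_total(scaled) == canonical_total(list(reversed(scaled)))
```

Two nodes that retain such floats and iterate over a list in different orders can end up with ledgers that disagree in the last bit, which is enough for one of them to reject a block.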

Our initial response was to skip past the invalid block that had been generated, but the block that
the consensus group subsequently generated proved impossible (for the aforementioned determinism
issues) for a large percentage of the hotspot fleet to progress past. However, the rest of the fleet
managed to continue making blocks and holding elections. Over the weekend, we spent time refining
the process for safely bringing hotspots back into line with the group. This involved reconstructing
the ledger data from blocks on disk, which was moderately time-consuming and not particularly safe.
By Monday, we had improved the existing code around this enough to safely fix the remaining
hotspots.

### API Recovery

The API server essentially is a blockchain consumer and was affected by the ledger fork as well. A
conflicting view of the ledger wouldn't allow the API server to ingest blocks and update the
necessary data in the back-end database required by the app to render different screens.

Before we realized that the root issue was the ledger fork, we tried a few cleanup methods on the
API server:

- Rebooted the API server on all the AWS instances, between 1–2 PM PDT Thursday, Oct 10
- Rebuilt the API server from scratch with updated blockchain dependencies, between 2–5 PM PDT
Thursday, Oct 10

None of which actually resolved the issue. However, once we knew that we had a conflicting ledger,
it was a simple matter of:

- Pausing the API server's sync process
- Resetting the API server's copy of the ledger
- Resuming the blockchain sync process

These steps worked as intended and the API server was back up and running around 8PM PDT, Thursday,
Oct 10.

### Continued Struggles with Pathing

Through the weekend and for most of the following week we struggled with block production and a
phenomenon where PoC rate would fall terribly low, causing a large number of issues as many active
hotspots were assumed down due to how long it had been since they'd issued a PoC challenge.

We eventually realized that the issue had nothing, strictly, to do with the PoC code. In our more
crowded cities, a change had shortened how long a hotspot would be judged a poor network
participant for failing PoC challenges, so many more hotspots were in the neighborhood all at
once. This led to many new and exceptionally long paths.

Since paths grow quadratically more expensive with their length, this would lead to situations where
our PoC challenge generation state machine would be stuck so long generating challenges that no
challenge that it generated could ever succeed, as their validity is height-limited.

Since challenge generation took longer than the rate at which new challenges would be issued, this
state machine entered a spiral from which it would never recover, until either restarted or fixed by
a code change, which was made on the 17th.
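
The spiral can be shown with a toy model. This Python sketch is an assumption-laden simplification (one worker, constant generation time, times measured in blocks), and all names are hypothetical.

```python
def is_stale(generation_blocks: int, validity_window_blocks: int) -> bool:
    """A challenge that took longer to build than its validity window
    can never succeed, because validity is height-limited."""
    return generation_blocks > validity_window_blocks


def backlog_after(rounds: int, generation_blocks: int, issue_interval_blocks: int) -> int:
    """Challenges still pending after `rounds` issue intervals."""
    elapsed = rounds * issue_interval_blocks
    issued = rounds
    completed = min(issued, elapsed // generation_blocks)
    return issued - completed
```

When `generation_blocks` is below the issue interval, the backlog stays at zero. Once it exceeds the interval, the backlog grows with every round and never drains without a restart or a fix.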

## Resolution

There were a number of operational and code fixes undertaken over the course of the week. We added
time budgets to all of the places where a PoC challenge could be generated or validated, and added
new path limitation code to trim paths as soon as they grow too long, rather than computing them in
full before truncating them. The new PoC code was re-enabled on Thursday, October 17th, along with
these changes.
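
A minimal Python sketch of the two ideas, under stated assumptions: the neighbor map, the `build_path` helper, and the "take the first unseen neighbor" rule are hypothetical stand-ins, not the real path construction logic. The point is only that the length limit and the time budget are checked while the path is built, not afterwards.

```python
import time


def build_path(start, neighbors, max_len, budget_s, clock=time.monotonic):
    """Grow a path hop by hop, stopping at max_len or when the budget runs out."""
    deadline = clock() + budget_s
    path = [start]
    seen = {start}
    while len(path) < max_len:  # trim as we go, never build the full path first
        if clock() > deadline:
            raise TimeoutError("path construction exceeded its time budget")
        candidates = [n for n in neighbors.get(path[-1], []) if n not in seen]
        if not candidates:
            break
        nxt = candidates[0]  # placeholder selection rule for the sketch
        path.append(nxt)
        seen.add(nxt)
    return path
```

Passing `clock` as a parameter keeps the budget testable without real waiting.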

Operationally, our team worked to reset all of the nodes that had been affected by the ledger fork
and return them to proper working order.

## Impact

Over the course of the incident, block production was massively slowed down, and with it,
elections, so token production was slowed down. We were regularly seeing block times of well over
half an hour, many of which took manual intervention to resolve.

A number of hotspots had their PoC challenge code fall behind such that they could no longer issue
challenges, impacting their ability to make tokens.

Due to the ledger fork, 65 hotspots ceased to stay synchronized with the chain, making it so that
they couldn't, temporarily, participate in PoC processes or elections, making them effectively
unable to earn tokens.

These changes also altered the topology of the network, leading to unstable block and election
times.

## Next steps

We've taken action to continually measure determinism drift on the ledger, and when we detect any,
we'll work to reduce it until there is none.

Work on restabilizing block and election times is ongoing. Since our consensus groups run on
appliances using consumer internet, this will be an ongoing challenge.

We have put in place procedures for slower and more careful roll out of new chain variables, and
re-prioritized the reconstruction of our separated test cluster.

This blog is a first step in clearly documenting both incidents and our work to remediate them, as
well as the place to announce planned changes, deployments of those changes, and any maintenance of
the app and associated services.

We apologize for the instability as we work to improve everyone's ability to participate in the
growth of the network.
32 changes: 32 additions & 0 deletions devblog/2019-10-21-beta-deploy.mdx
@@ -0,0 +1,32 @@
---
layout: post
title: Blockchain Beta Deploy
date: 2019-10-22 08:56 -0700
hide_table_of_contents: true
authors: [madninja]
---

We rolled out the [Beta update](https://docs.helium.com/blog/2019/10/10/beta-deploy) over the
weekend and it appears to have helped some, but a number of issues were discovered that caused
unneeded network load, slowing down blocks and elections.

<!--truncate-->

## Content

- An optimization was added to ignore transactions already in the honey badger buffer.

- Elections that failed were not being cleaned up. Now they are.

- A recent blockchain refactor was reverted because it was inappropriately dropping transactions.

- The transaction manager now dumps a transaction if it sees f+1 rejections from the consensus
group.

- More informative logging messages added around transaction validation and speculative absorption.
This will help diagnose lingering issues.
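
The f+1 rule can be sketched as follows, assuming the usual Byzantine fault tolerance bound of N ≥ 3f + 1 for a group of N members. The group size of 16 in the usage note and the function names are assumptions for illustration only.

```python
def max_faulty(group_size: int) -> int:
    """Largest f such that group_size >= 3f + 1."""
    return (group_size - 1) // 3


def should_drop(rejections: int, group_size: int) -> bool:
    """With f + 1 rejections, at least one comes from a non-faulty member,
    so the transaction is treated as genuinely invalid and dropped."""
    return rejections >= max_faulty(group_size) + 1
```

For a hypothetical group of 16 members, `max_faulty(16)` is 5, so a transaction is dropped once 6 members have rejected it.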

## Deployment Plan

The patches are directly related to a focused set of issues that should improve chain behavior. We
plan to let this beta run for 24 hours and then deploy it to the network at large.
36 changes: 36 additions & 0 deletions devblog/2019-10-22-wallet-update.mdx
@@ -0,0 +1,36 @@
---
layout: post
title: Wallet Status Update
date: 2019-10-22 09:00 -0700
hide_table_of_contents: true
authors: [madninja]
---

As the number of hotspots grows with the national launch, a number of issues were reported by some
users and the engineering team:

<!--truncate-->

- Sometimes pending transactions are cached incorrectly in the app, which can cause the app to show
the blockchain as being down. This has affected a small number of customers over the past weekend.

The workaround is to delete the app cache data (Android) or uninstall and re-install the app
(iOS). We'll be pushing out a short term fix while we work out the root cause of the issue.

- The mobile app reports that a hotspot needs attention while it is syncing the blockchain. While
your hotspot is syncing, the "needs attention" warning can be ignored.

Contact support if your hotspot states that it needs attention while the sync is at 100% or if the
sync percentage is stuck at the same number for more than an hour.

- The app reports that it is "unable to pair because it's not \<your hotspot name\>". This usually
happens when the hotspot is syncing, which causes high CPU load.

We're working on fixing this. A workaround is to try to pair again a few times to get past it.

- The app reports that there is a fee associated with adding a hotspot. All Helium hotspots have
these fees waived, and a hotspot update fixes this issue.

If you see this error, please close the application and wait for about 10-15 minutes for the
hotspot to update itself. You should see the light turn off as the hotspot restarts to apply the
update.
26 changes: 26 additions & 0 deletions devblog/2019-10-23-beta-deploy.mdx
@@ -0,0 +1,26 @@
---
layout: post
title: Blockchain Beta Deploy
date: 2019-10-23
hide_table_of_contents: true
authors: [fvasquez]
---

In preparation for release of the LongFi SDK we are testing the latest version of the LongFi
protocol on a select few San Francisco hotspots that are within radio range of each other.

<!--truncate-->

## Content

This newly selected beta group was updated with Hotspot firmware version `2019.10.23.0` which
includes:

- A new version of concentrate, which is the reference implementation of LongFi that runs on all
Helium hotspots
- A new version of miner that integrates with this new version of concentrate for Proof of Coverage.

## Deployment Plan

We plan to let this beta run overnight to see how Proof of Coverage is impacted then deploy it to
the network at large to use the LongFi protocol.
35 changes: 35 additions & 0 deletions devblog/2019-10-25-beta-deploy-2.mdx
@@ -0,0 +1,35 @@
---
layout: post
title: Beta Deploy 2
date: 2019-10-25 09:47 -0700
hide_table_of_contents: true
authors: [madninja, vagabond]
---

As more and more hotspots ship out, we are encountering some blockchain syncing problems. A number
of the shipped hotspots were assembled over a month ago but had older firmware that did not include
some important bugfixes.

<!--truncate-->

In some cases the older firmware absorbs blocks into its ledger that it doesn't properly understand,
causing ledger corruption issues that prevent further syncing.

To mitigate this problem we have decided to deploy an intentionally backwards incompatible change to
the blockchain sync protocol. The goal here is twofold: to correct some deficiencies in the
original sync protocol and to force hotspots running an older firmware to upgrade before starting to
sync the chain.
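
As a rough sketch of why rolling the protocol version forces an upgrade, consider two peers that can only sync if they share a protocol identifier. The identifiers `sync/1.0.0` and `sync/1.1.0` and the `negotiate` helper below are hypothetical, not the real protocol strings.

```python
from typing import Optional, Sequence


def negotiate(local: Sequence[str], remote: Sequence[str]) -> Optional[str]:
    """Return the first locally preferred protocol the remote also supports."""
    remote_set = set(remote)
    for proto in local:
        if proto in remote_set:
            return proto
    return None  # no shared protocol: the peers cannot sync with each other


OLD_FIRMWARE = ["sync/1.0.0"]  # hypothetical identifier before the roll
NEW_FIRMWARE = ["sync/1.1.0"]  # hypothetical identifier after the roll
```

Because the new firmware no longer advertises the old identifier, an old hotspot finds no common protocol and cannot absorb blocks until it upgrades.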

In addition, a small change to peerbook metadata has been made to allow hotspots to advertise the
last time they wrote a block to their local blockchain. This is intended to help spot and
diagnose hotspots experiencing syncing problems.

## Content

- Roll the blockchain sync protocol version

## Deployment Plan

As this is a network-wide breaking change we plan to roll it out as quickly as possible to everyone
to avoid partitioning the network. A small controlled smoke test OTA will be done followed shortly
by a release to general availability.
46 changes: 46 additions & 0 deletions devblog/2019-10-25-beta-deploy.mdx
@@ -0,0 +1,46 @@
---
layout: post
title: Blockchain Beta Deploy
date: 2019-10-25
hide_table_of_contents: true
authors: [fvasquez]
---

The [last beta deploy](https://docs.helium.com/blog/2019/20/23/beta-deploy) did not go well. Parsing
of LongFi packets was broken in concentrate so the selected hotspots were not able to engage in
Proof of Coverage with each other as planned.

<!--truncate-->

Rolling back the hotspot firmware update resulted in a hard reboot of the beta hotspots. This hard
reboot resulted in minor file system corruption which triggered reformatting of the partition
mounted on `/var` because of an incorrect return code check from fsck.

WiFi credentials were lost so hotspots that were connected to the internet over WiFi fell offline.
Older pre-production units also lost their `swarm_keys` which reside on the SD card rather than the
hardware security module in production units.

Since the `/var` partition was lost on the affected beta hotspots, the blockchain cache was also
lost resulting in any production hotspots in the beta group having to resync the blockchain over
about 24 hours.

To remedy the inadvertent reformatting of the hotspots' persistent file system we have fast-tracked a PR
to correct the behavior of the file system check and repair script that runs on start up.
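
For context, the fsck(8) manual page defines the exit status as a sum of bit flags, so a non-zero code can still mean the file system was repaired successfully. The Python sketch below shows one way to tell those cases apart; the `classify` helper and its policy are hypothetical and are not the actual start-up script.

```python
# Exit-status bits as documented in the fsck(8) manual page.
ERRORS_CORRECTED = 1
REBOOT_REQUIRED = 2
ERRORS_UNCORRECTED = 4
OPERATIONAL_ERROR = 8
USAGE_ERROR = 16
CANCELED = 32
LIBRARY_ERROR = 128


def classify(code: int) -> str:
    """Map an fsck exit status to a coarse outcome (hypothetical policy)."""
    if code == 0:
        return "clean"
    if code & (OPERATIONAL_ERROR | USAGE_ERROR | CANCELED | LIBRARY_ERROR):
        return "failed"  # fsck itself did not run properly
    if code & ERRORS_UNCORRECTED:
        return "unrecoverable"  # only here would a reformat be considered
    if code & REBOOT_REQUIRED:
        return "reboot"  # repaired, but a reboot is requested
    return "repaired"  # errors were found and corrected
```

A check that treats every non-zero status as fatal would put a status of 1 (errors corrected) in the same bucket as 4 (errors left uncorrected).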

To minimize risk all the recent LongFi related commits were reverted from the master branch of the
hotspot firmware so that shut down behavior remains unchanged.

## Content

The beta group will be updated with Hotspot firmware version `2019.10.25.0` which includes:

- Revert concentrate upgrade for LongFi (not in `2019.10.21.0` GA release)
- Do not treat all non-zero fsck return codes as unrecoverable errors
- Revert miner upgrade for LongFi (not in `2019.10.21.0` GA release)
- Revert CMake upgrade (not in `2019.10.21.0` GA release)

## Deployment Plan

We plan to let this beta run for at least a couple of hours. If that proves to be stable we will tag
the branch as `2019.10.25.0` and confirm that the GA release OTA updates successfully before making
it available to all hotspots.