`If a device list update goes missing, the server resyncs on the next
one` was failing because a previous test would receive a `waitTime` of
1h, resulting in the test timing out.
This now tries to handle the returned errors differently, e.g. by using
the default `waitTime` of 2s. Also doesn't try further users in the
list, if one of the errors would cause a longer `waitTime`.
This ensures that if the device list updater is already backing off a node, we don't try to call processServer again anyway for server just because the server name arrived in the channel. Otherwise we can keep trying to hit a remote server that is offline or not behaving every second and that spams the logs too.
* Generic-based internal HTTP API (tested out on a few endpoints in the federation API)
* Add `PerformInvite`
* More tweaks
* Fix metric name
* Fix LookupStateIDs
* Lots of changes to clients
* Some serverside stuff
* Some error handling
* Use paths as metric names
* Revert "Use paths as metric names"
This reverts commit a9323a6a343f5ce6461a2e5bd570fe06465f1b15.
* Namespace metric names
* Remove duplicate entry
* Remove another duplicate entry
* Tweak error handling
* Some more tweaks
* Update error behaviour
* Some more error tweaking
* Fix API path for `PerformDeleteKeys`
* Fix another path
* Tweak federation client proxying
* Fix another path
* Don't return typed nils
* Some more tweaks, not that it makes any difference
* Tweak federation client proxying
* Maybe fix the key backup test
* Fix flakey sytest 'Local device key changes get to remote servers'
* Debug logs
* Remove internal/test and use /test only
Remove a lot of ancient code too.
* Use FederationRoomserverAPI in more places
* Use more interfaces in federationapi; begin adding regression test
* Linting
* Add regression test
* Unbreak tests
* ALL THE LOGS
* Fix a race condition which could cause events to not be sent to servers
If a new room event which rewrites state arrives, we remove all joined hosts
then re-calculate them. This wasn't done in a transaction so for a brief period
we would have no joined hosts. During this interim, key change events which arrive
would not be sent to destination servers. This would sporadically fail on sytest.
* Unbreak new tests
* Linting
* Send device_list update to satisfy sytest
* Fix build issue from merged in change
Co-authored-by: Neil Alexander <neilalexander@users.noreply.github.com>
* Initial federation sender -> federation API refactoring
* Move base into own package, avoids import cycle
* Fix build errors
* Fix tests
* Add signing key server tables
* Try to fold signing key server into federation API
* Fix dendritejs builds
* Update embedded interfaces
* Fix panic, fix lint error
* Update configs, docker
* Rename some things
* Reuse same keyring on the implementing side
* Fix federation tests, `NewBaseDendrite` can accept freeform options
* Fix build
* Update create_db, configs
* Name tables back
* Don't rename federationsender consumer for now
* Cross-signing groundwork
* Update to matrix-org/gomatrixserverlib#274
* Fix gobind builds, which stops unit tests in CI from yelling
* Some changes from review comments
* Fix build by passing in UIA
* Update to matrix-org/gomatrixserverlib@bec8d22
* Process master/self-signing keys from devices call
* nolint
* Enum-ify the key type in the database
* Process self-signing key too
* Fix sanity check in device list updater
* Fix check
* Fix sytest, hopefully
* Fix build
Fix#1511
On 32-bits systems, int(hash.Sum32()) can be negative.
This makes the computation of array indices using modulo invalid, crashing dendrite.
Signed-off-by: Loïck Bonniot <git@lesterpig.com>
* Add FederationClient interface to federationsender
- Use a shim struct in HTTP mode to keep the same API as `FederationClient`.
- Use `federationsender` instead of `FederationClient` in `keyserver`.
* Pointers not values
* Review comments
* Fix unit tests
* Rejig backoff
* Unbreak test
* Remove debug logs
* Review comments and linting
We did this already for local `/keys/upload` but didn't for
remote `/users/devices`. This meant any resyncs would spam produce
events, hammering disk i/o and spamming the logs.
- As a last resort, query the DB when exhausting all possible remote query
endpoints, but keep the field in `failures` so clients can detect that this
is stale data.
- Unblock `DeviceListUpdater.Update` on failures rather than timing out.
- Use a mutex when writing directly to `res`, not just for failures.
* WIP: Eagerly sync device lists on /user/keys/query requests
Also notify servers when a user's device display name changes. Few
caveats:
- sytest `Device deletion propagates over federation` fails
- `populateResponseWithDeviceKeysFromDatabase` is called from multiple
goroutines and hence is unsafe.
* Handle deleted devices correctly over federation
* Add sync mechanism to block when updating device lists
With a timeout, mainly for sytest to fix the test
"Server correctly handles incoming m.device_list_update"
which is flakey because it assumes that when `/send` 200 OKs
that the server has updated the device lists in prep for
`/keys/query` which is not always true when using workers.
* Fix UT
* Add new working test
* Add tests for device list updates
* Add stale_device_lists table and use db before asking remote for device keys
* Fetch remote keys if all devices are requested
* Add display_name col to store remote device names
Few other tweaks to make `Server correctly handles incoming m.device_list_update`
pass.
* Fix sqlite otk bug
* Unbuffered channel to block /send causing sytest to not race anymore
* Linting and fix bug whereby we didn't send updated dl tokens to the client causing a tightloop on /sync sometimes
* No longer assert staleness as Update blocks on workers now
* Back out tweaks
* Bugfixes
* Add device list updater which manages updating remote device lists
- Doesn't persist stale lists to the database yet
- Doesn't have tests yet
* Mark device lists as fresh when we persist