This PR changes a few things:
- It pulls out the creation of several NIDs from the `StoreEvent`
function to make the functions more reusable
- Uses more caching when using those NIDs to avoid DB round trips
* Try more servers when calling `/state_ids`
* More logging
* Maybe fix concurrent map write
* Revert "Maybe fix concurrent map write"
This reverts commit da0dbb836207a911afe77e6f6d63c4809669693c.
* Enforce a limit of 20s per server, 5 mins total
* Topologically sort with `SendEventWithState`, so that earlier events should satisfy auth for later ones
* Revert "Topologically sort with `SendEventWithState`, so that earlier events should satisfy auth for later ones"
This reverts commit b0cd706012b4c9b6724b11e16f19c4cb732ab286.
* Update to matrix-org/gomatrixserverlib#293
* `Events` no longer returns an error, other tweaks
* Make sure `Events` is sorted for `parsedRespState` too
* Ensure the input API only uses a single transaction
* Remove more of the dead query API call
* Tidy up
* Fix tests hopefully
* Don't do unnecessary work for rooms that don't exist
* Improve error, fix another case where transaction wasn't used properly
* Add a unit test for checking single transaction on RS input API
* Fix logic oops when deciding whether to use a transaction in storeEvent
* Check that we have a populated state snapshot when determining if we closed the gap
* Do the same in the query API
* Use HasState more opportunistically
* Try to avoid falling down the hole of using a trustworthy but empty state snapshot for non-create events
* Refactor missing state and make sure that we really solve the problem for the new event
* Comments
* Review comments
* Tweak that check again
* Tidy up that create check further
* Fix build hopefully
* Update sendOutliers to use OrderAuthAndStateEvents
* Don't go out of bounds on missingEvents
* Use new event json types in gmsl
* Fix EventJSON to actually unmarshal events
* Update GMSL
* Bump GMSL and improve error messages
* Send back the correct RespState
* Update GMSL
* Don't flake so badly for rejected events
* Moar
* Fix panic
* Don't count rejected events as missing
* Don't treat rejected events without state as missing
* Revert "Don't count rejected events as missing"
This reverts commit 4b6139b62eb91ba059b47415b0275964b37d9b43.
* Missing events should be KindOld
* If we have state, use it, regardless of memberships which could be stale now
* Fetch missing state for KindOld too
* Tweak the condition again
* Clean up a bit
* Use room updater to get latest events in a race-free way
* Return the correct error
* Improve errors
It isn't really clear that the deadlines actually help in any way. Currently we can use up our 2 minutes doing something, run out of context time and then return an error which causes the transaction to rollback and forgetting everything we've done. If the message came to us from NATS then we probably will end up retrying just to be in the same situation. We'd be really a lot better if we just spent the time reconciling the problem in the first place, and then we're much less likely to need to fetch those missing auth or prev events in the future.
Also includes matrix-org/gomatrixserverlib#287 so we don't wait so long for servers that are obviously dead.
* Add transaction to all database tables in roomserver, rename latest events updater to room updater, use room updater for all RS input
* Better transaction management
* Tweak order
* Handle cases where the room does not exist
* Other fixes
* More tweaks
* Fill some gaps
* Fill in the gaps
* good lord it gets worse
* Don't roll back transactions when events rejected
* Pass through errors properly
* Fix bugs
* Fix incorrect error check
* Don't panic on nil txns
* Tweaks
* Hopefully fix panics for good in SQLite this time
* Fix rollback
* Minor bug fixes with latest event updater
* Some review comments
* Revert "Some review comments"
This reverts commit 0caf8cf53e62c33f7b83c52e9df1d963871f751e.
* Fix a couple of bugs
* Clearer commit and rollback results
* Remove unnecessary prepares
* Improve server selection somewhat
* Remove things from the map when we're done
* Be less panicky about auth event signatures in case they are not fatal after all
* Accept HasState in all cases
* Send join asynchronously
* Revert "Send join asynchronously"
This reverts commit 5b685bfcd0b1150a66c7b1e70fb3a3eda509efd1.
* Joins and leaves use background context
* Put federation client functions into their own file
* Look for missing auth events in RS input
* Remove retrieveMissingAuthEvents from federation API
* Logging
* Sorta transplanted the code over
* Use event origin failing all else
* Don't get stuck on mutexes:
* Add verifier
* Don't mark state events with zero snapshot NID as not existing
* Check missing state if not an outlier before storing the event
* Reject instead of soft-fail, don't copy roominfo so much
* Use synchronous contexts, limit time to fetch missing events
* Clean up some commented out bits
* Simplify `/send` endpoint significantly
* Submit async
* Report errors on sending to RS input
* Set max payload in NATS to 16MB
* Tweak metrics
* Add `workerForRoom` for tidiness
* Try skipping unmarshalling errors for RespMissingEvents
* Track missing prev events separately to avoid calculating state when not possible
* Tweak logic around checking missing state
* Care about state when checking missing prev events
* Don't check missing state for create events
* Try that again
* Handle create events better
* Send create room events as new
* Use given event kind when sending auth/state events
* Revert "Use given event kind when sending auth/state events"
This reverts commit 089d64d271b5fca8c104e1554711187420dbebca.
* Only search for missing prev events or state for new events
* Tweaks
* We only have missing prev if we don't supply state
* Room version tweaks
* Allow async inputs again
* Apply backpressure to consumers/synchronous requests to hopefully stop things being overwhelmed
* Set timeouts on roomserver input tasks (need to decide what timeout makes sense)
* Use work queue policy, deliver all on restart
* Reduce chance of duplicates being sent by NATS
* Limit the number of servers we attempt to reduce backpressure
* Some review comment fixes
* Tidy up a couple things
* Don't limit servers, randomise order using map
* Some context refactoring
* Update gmsl
* Don't resend create events
* Set stateIDs length correctly or else the roomserver thinks there are missing events when there aren't
* Exclude our own servername
* Try backing off servers
* Make excluding self behaviour optional
* Exclude self from g_m_e
* Update sytest-whitelist
* Update consumers for the roomserver output stream
* Remember to send outliers for state returned from /gme
* Make full HTTP tests less upsetti
* Remove 'If a device list update goes missing, the server resyncs on the next one' from the sytest blacklist
* Remove debugging test
* Fix blacklist again, remove unnecessary duplicate context
* Clearer contexts, don't use background in case there's something happening there
* Don't queue up events more than once in memory
* Correctly identify create events when checking for state
* Fill in gaps again in /gme code
* Remove `AuthEventIDs` from `InputRoomEvent`
* Remove stray field
Co-authored-by: Kegan Dougal <kegan@matrix.org>