Discovered while running
https://gitlab.futo.org/load-testing/matrix-goose.
Dendrite locks up and runs into `context cancelled`, so the error is not
`sql.ErrNoRows` nor "default" (and definitely shouldn't return that the
account exists in this case)
Fixes#2504
A few issues with the previous iteration:
- We never returned `inaccessible_children`, which (if I read the code
correctly), made Synapse raise an error and thus not returning the
requested rooms
- For restricted rooms, we didn't return the list of allowed rooms
Based on #3340
This adds a `/_synapse/admin/v1/event_reports` endpoint, the same
Synapse has. This way existing tools also work with Dendrite.
Given this is already getting huge (even though many test lines),
splitting this into two PRs. (The next adds "getting one report" and
"deleting reports")
[skip ci]
Part of #3216 and #3226
There will be a follow up PR which is going to add the same admin
endpoints Synapse has, so existing tools also work for Dendrite.
This now should actually speed up startup times.
This is because _many_ rooms (like DMs) don't have room ACLs, this means
that we had around 95% pointless DB queries. (as queried on d.m.org)
Since #3334 didn't change much on d.m.org, this is another attempt to
speed up startup.
Given moderation bots like Mjolnir/Draupnir are in many rooms with quite
often the same or similar ACLs, caching the compiled regexes _should_
reduce the startup time.
Using a pointer to the `*regexp.Regex` ensures we only store _one_
instance of a regex in memory, instead of potentially storing it hundred
of times. This should reduce memory consumption on servers with many
rooms with ACLs drastically. (5.1MB vs 1.7MB with this change on my
server with 8 ACL'd rooms [3 using the same ACLs])
[skip ci]
This hopefully reduces the garbage we currently produce.
(Using [GlitchTip](https://glitchtip.com/) on my personal instance, this
seems to look better)
Fixes https://github.com/matrix-org/dendrite/issues/3240 and potentially
a root cause for state resets.
While testing, I've had added some more debug logging:
```
time="2023-12-16T18:13:11.319458084Z" level=warning msg="already processed event" event_id="$qFYMl_F2vb1N0yxmvlFAMhqhGhLKq4kA-o_YCQKH7tQ" kind=KindNew times=2
time="2023-12-16T18:13:14.537389126Z" level=warning msg="already processed event" event_id="$EU-LTsKErT6Mt1k12-p_3xOHfiLaK6gtwVDlZ35lSuo" kind=KindNew times=5
time="2023-12-16T18:13:16.789551206Z" level=warning msg="already processed event" event_id="$dIPuAfTL5x0VyG873LKPslQeljCSxFT1WKxUtjIMUGE" kind=KindNew times=5
time="2023-12-16T18:13:17.383838767Z" level=warning msg="already processed event" event_id="$7noSZiCkzerpkz_UBO3iatpRnaOiPx-3IXc0GPDQVGE" kind=KindNew times=2
time="2023-12-16T18:13:22.091946597Z" level=warning msg="already processed event" event_id="$3Lvo3Wbi2ol9-nNbQ93N-E2MuGQCJZo5397KkFH-W6E" kind=KindNew times=1
time="2023-12-16T18:13:23.026417446Z" level=warning msg="already processed event" event_id="$lj1xS46zsLBCChhKOLJEG-bu7z-_pq9i_Y2DUIjzGy4" kind=KindNew times=4
```
So we did receive the same event over and over again. Given they are
`KindNew`, we don't short circuit if we already processed them, which
potentially caused the state to be calculated with a now wrong state
snapshot.
Also fixes the back pressure metric. We now correctly increment the
counter once we sent the message to NATS and decrement it once we
actually processed an event.
This should fix#3004 by making sure we also update our in-memory ACLs
after joining a new room.
Also makes use of more caching in `GetStateEvent`
Bonus: Adds some tests, as I was about to use `GetBulkStateContent`, but
turns out that `GetStateEvent` is basically doing the same, just that it
only gets the `eventTypeNID`/`eventStateKeyNID` once and not for every
call.
Use `IsBlacklistedOrBackingOff` from the federation API to check if we
should fetch devices.
To reduce back pressure, we now only queue retrying servers if there's
space in the channel.
This makes the following changes:
- Adds two new metrics observing the usage of the `DeviceListUpdater`
workers
- Makes the number of workers configurable
- Adds a 30s timeout for DB requests when receiving a device list update
over federation
Fixes include:
- Translating state keys that contain user IDs to their respective room
keys for both querying and sending state events
- **NOTE**: there may be design discussion needed on what should happen
when sender keys cannot be found for users
- A simple fix for kicking guests from rooms properly
- Logic for boundary history visibilities was slightly off (I'm
surprised this only manifested in pseudo ID room versions)
Signed-off-by: `Sam Wedgwood <sam@wedgwood.dev>`
This PR adds a config key `room_server.default_config_key` to set the
default room version for the room server.
Signed-off-by: `Sam Wedgwood <sam@wedgwood.dev>`
There are cases where a dendrite instance is unaware of a pseudo ID for
a user, the user is not a member of that room. To represent this case,
we currently use the 'zero' value, which is often not checked and so
causes errors later down the line. To make this case more explict, and
to be consistent with `QueryUserIDForSender`, this PR changes this to
use a pointer (and `nil` to mean no sender ID).
Signed-off-by: `Sam Wedgwood <sam@wedgwood.dev>`
The previous version was getting **ALL** membership events (as
`ClientEvents`, so going through `NewEventFromTrustedJSONWithID`) for a
given room.
Now we are querying only locally joined users as `ClientEvents`, which
should **significantly** reduce allocations.
Take for example a large room with 2k membership events, but only 1
local user - avoiding 1999 `NewEventFromTrustedJSONWithID` calls just to
calculate the `roomSize` which we can also query by other means.
This is also getting called for every `OutputRoomEvent` in the userAPI.
Benchmark with 1 local user and 100 remote users.
```
pkg: github.com/matrix-org/dendrite/userapi/consumers
cpu: 12th Gen Intel(R) Core(TM) i5-12500H
│ old.txt │ new.txt │
│ sec/op │ sec/op vs base │
LocalRoomMembers-16 375.9µ ± 7% 327.6µ ± 6% -12.85% (p=0.000 n=10)
│ old.txt │ new.txt │
│ B/op │ B/op vs base │
LocalRoomMembers-16 79.426Ki ± 0% 8.507Ki ± 0% -89.29% (p=0.000 n=10)
│ old.txt │ new.txt │
│ allocs/op │ allocs/op vs base │
LocalRoomMembers-16 1015.0 ± 0% 277.0 ± 0% -72.71% (p=0.000 n=10)
```
When we're adding state to the database, we check which eventNIDs are
already in a block, if we already have that eventNID, we remove it from
the list. In its current form we would skip over eventNIDs in the case
we already found a match (we're decrementing `i` twice)
My theory is, that when we later get the state blocks, we are receiving
"too many" eventNIDs (well, yea, we stored too many), which may or may
not can result in state resets when comparing different state snapshots.
(e.g. when adding state we stored a eventNID by accident because we
skipped it, later we add more state and are not adding it because we
don't skip it)
This should fix two issues with backfilling:
1. right after creating and joining a room over federation, we are doing
a `/backfill` request, which would return redacted events, because the
`authEvents` are empty. Even though the spec states that, in the absence
of a history visibility event, it should be handled as `shared`.
2. `gomatrixserverlib: unsupported room version ''` - because, well, we
were never setting the `roomInfo` field..