Something to keep an eye out for
jducoeur wrote in querki_project
A general request for folks to keep a watchful eye open: I spent the morning in a worrying fire drill, because the Issues Space had gotten into a Weird State. The behavior strongly suggests that we somehow wound up with two copies of that Space alive at the same time, which *should* be impossible. It's a matter of real concern -- while Querki is pretty resistant to data corruption (and so far, the problem doesn't seem to cause data loss, just immense confusion), the notion that there can only be one copy of a Space alive at a time is one of the system's key invariants, and we have to be absolutely intolerant of any violation of it.

The problem is, I have absolutely no idea how to recreate this situation, since I have no idea how it happened, or even exactly what "it" is; my attempts to repro it have simply failed. So I'm asking you to keep an eye out for it.

The behavior will probably show up as a Space having short-term "memory" problems. Most typical is that you create a Thing, and then find that it's not there. More precisely, if you reload several times, you find it *sometimes* there and *sometimes* not, inconsistently. (Which is why I believe the problem has to do with duplicate copies of the Space, where some requests are routing to the copy that created the Thing, and some to the other.) This may manifest in the middle of creating a Thing -- you'll see an error saying something like "_edit didn't receive a Thing". (Which is because the Client made a request to create the Thing, succeeded, but then the request to *edit* that Thing went to the other copy.) It also results in a *lot* of inconsistent errors when trying to edit this Thing (since sometimes it is trying to save the change to a Space that doesn't know this Thing exists).
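
To make the suspected mechanism concrete, here is a toy simulation of it. This is only a sketch under assumptions, not Querki code: `SpaceCopy`, `simulate`, and the round-robin routing are all made up for illustration, and the real routing (if this theory is even right) is surely less regular. It creates a Thing on one copy of a Space and then "reloads" several times, with requests alternating between whatever copies are alive.

```python
import itertools


class SpaceCopy:
    """Hypothetical stand-in for one live in-memory copy of a Space."""

    def __init__(self, name):
        self.name = name
        self.things = set()

    def create(self, thing):
        self.things.add(thing)

    def has(self, thing):
        return thing in self.things


def simulate(copies, reloads=6):
    """Create one Thing, then reload `reloads` times.

    Requests are handed to the live copies in turn. That round-robin is an
    assumption standing in for whatever inconsistent routing is happening.
    """
    router = itertools.cycle(copies)
    next(router).create("My Thing")  # the create lands on exactly one copy
    return [next(router).has("My Thing") for _ in range(reloads)]


# The invariant holds: one copy alive, so every reload sees the Thing.
healthy = simulate([SpaceCopy("A")])

# The invariant is broken: two copies alive, so reloads disagree.
broken = simulate([SpaceCopy("A"), SpaceCopy("B")])

print("one copy:  ", healthy)
print("two copies:", broken)
```

With a single copy every reload finds the Thing; with two, the results flip between found and not found, which matches the "sometimes there, sometimes not" behavior described above.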

I've put in some logging spewage that *may* help understand the problem. So if you see this happening, please tell me ASAP, so I can go look into the logs and see if they say anything helpful. I want to get this problem diagnosed and fixed as soon as possible -- I consider data integrity to be the very highest priority for Querki, and any problem that threatens that even slightly is automatically critical.

Sorry about the inconvenience. We tested as much as we could before switching to the new cluster, but it was inevitable that some bugs would slip through the cracks.

Speaking of which, a bug that you're less likely to hit but should be aware of: displaying a *very* large page is currently likely to fail. This one is completely understood -- the rendered page is simply too large to send around the network, it turns out. The problem will only happen if the rendered QText for the page is over 128k, and the only pages I know that evince this problem are the ones in the Issues Space that try to display *all* of the Issues. (So don't try those pages right at the moment.) I'm working on the problem, but it may be a couple of weeks before I can push out a good fix for it -- doing this right is going to require some protocol changes, which will be a bit of a headache, but it's worth taking the time to solve this properly and permanently.
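
For a rough sense of what the limit means in practice, the sketch below checks whether a rendered page is over the threshold and shows one possible way to break it into pieces that fit. Everything here is an assumption for illustration: the post says only that the fix needs protocol changes, not that it will use chunking, and "128k" is taken to mean 128 KiB of UTF-8. The names `MAX_RENDERED_BYTES`, `too_large`, `split_into_chunks`, and `reassemble` are hypothetical.

```python
# Assumption: "128k" means 128 KiB, measured on the UTF-8 encoding of the page.
MAX_RENDERED_BYTES = 128 * 1024


def too_large(rendered, limit=MAX_RENDERED_BYTES):
    """Return True if the rendered text would exceed the size limit."""
    return len(rendered.encode("utf-8")) > limit


def split_into_chunks(rendered, limit=MAX_RENDERED_BYTES):
    """Split the encoded page into byte chunks of at most `limit` bytes each."""
    data = rendered.encode("utf-8")
    return [data[i:i + limit] for i in range(0, len(data), limit)]


def reassemble(chunks):
    """Join the chunks back together and decode.

    Decoding happens only after joining, so a multi-byte character that was
    split across two chunks is restored correctly.
    """
    return b"".join(chunks).decode("utf-8")


# A made-up page of about 300 KiB, standing in for an "all Issues" page.
page = "x" * (300 * 1024)
chunks = split_into_chunks(page)

print(too_large(page), len(chunks))
```

The receiving side would have to know how to collect and reassemble the pieces, which is the kind of protocol change that makes this more than a one-line fix.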

Eeyuck! Any sense that this is an artifact of moving to IaaS and having additional levels/instances of front end/back end/caching?

It's possible, but I think it's most likely an artifact of moving to a cluster. Remember that, while Querki has been *designed* for clustering from the beginning, and coded for it for a year or so now, we're only now moving day-to-day operations off of running on a single machine. That's the really big recent change, and the one likeliest to cause this sort of problem. In principle, Akka is location transparent, so it shouldn't matter whether you're on one machine or many; in practice, there are a lot of little details that it affects.

If I had to guess, I'd say it's some sort of effect of errors propagating through the system -- something failed, which caused something else to fail in such a way that a second instance was allowed to boot up. But that's purely a guess. There's also a modest likelihood that it's a bug in the underlying Akka Sharding library (which is the tech that I am trusting to *make* this impossible): we're using a moderately old version, and it's quite possible that I hit an edge-case bug that has since been fixed. So the non-trivial task of updating Querki to the current versions of its core libraries gets a priority boost...
