Okay, I feel a bit better now
jducoeur wrote in querki_project
[Mostly a note to myself, but I'm always interested in folks' opinions about how things work, and any insights about Bot Defense...]

On Friday morning, an NPR segment happened to mention that as much as 2/3 of Web traffic at this point is probably not from real people but from 'bots, wandering around and scanning for various reasons. On the one hand, that isn't actually surprising: a single aggressive web-spider accounts for a *lot* of traffic, and there are many reasons for spiders. (Some entirely benign, some entirely nefarious, some in-between.) OTOH, it suddenly occurred to me that this *totally* hoses Querki's economics.

The thing is, Querki assumes that traffic in a given Space is "bunchy" -- that we get occasional user sessions in a Space, so we read it into memory, handle those requests, and then let the Space go back to sleep until the next session. This implies that the vast majority of Spaces are dormant at any given time.

But spiders don't respect these assumptions. Depending on how they are written, they can keep hitting a given Space over and over. Worse, they are entirely likely to go wandering into Spaces that are *entirely* dormant -- Spaces that the Members haven't touched in months, but which keep getting woken up by spiders that are doing nothing but poking around.

So I've been doing a bit of re-architecting over the past few days. I've made the page in my Design Notes Space visible, if you care about details, but the high concept is that Querki is going to need to make a clearer distinction between Members of the Space and non-Members, and we're going to need to aggressively cache pages as they look to non-Members. The notion is that, since non-Members cannot, in principle, change the Space, we can cache what each page looks like to them (probably in a high-efficiency NoSQL DB, or even in a proper edge cache eventually), and serve those pages instead of going through the expensive process of waking up the Space and building the page by hand...
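To make the idea concrete, here is a minimal Python sketch of that non-Member render cache (all names here are hypothetical illustrations, not Querki's actual code): since non-Members cannot change a Space, every non-Member sees the same rendered page, so we can key a cache on (space, page) and invalidate whenever a Member edits the Space.

```python
import time

class NonMemberPageCache:
    """Sketch: cache of pages as rendered for non-Members.

    Because non-Members cannot modify a Space, the rendered page is
    identical for all of them, so one cache entry per (space, page)
    suffices. Invalidate a Space's entries whenever a Member edits it.
    """

    def __init__(self, ttl_seconds=3600):
        self._cache = {}          # (space_id, page_id) -> (html, cached_at)
        self._ttl = ttl_seconds

    def get(self, space_id, page_id):
        entry = self._cache.get((space_id, page_id))
        if entry is None:
            return None
        html, cached_at = entry
        if time.time() - cached_at > self._ttl:
            del self._cache[(space_id, page_id)]
            return None
        return html

    def put(self, space_id, page_id, html):
        self._cache[(space_id, page_id)] = (html, time.time())

    def invalidate_space(self, space_id):
        # A Member edited the Space: drop every cached page for it.
        for key in [k for k in self._cache if k[0] == space_id]:
            del self._cache[key]


def serve_page(cache, space_id, page_id, is_member, render):
    """Serve non-Members from the cache; always render live for Members."""
    if is_member:
        return render(space_id, page_id)   # wakes the Space, as today
    html = cache.get(space_id, page_id)
    if html is None:
        html = render(space_id, page_id)   # cache miss: wake once, then cache
        cache.put(space_id, page_id, html)
    return html
```

The point of the sketch is the asymmetry: a spider hammering a dormant Space only forces one real render per page per TTL, instead of waking the Space for every request.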

FWIW, I know of one site that outsources this sort of distinction to CloudFlare. Anonymous users, most of whom are assumed to be 'bots, are served the cached version from CloudFlare, while logged-in users pass through to the source site. The site admins feel this greatly reduces the load on their servers. This is a Drupal-based site, though, so it's rather simpler than the many potential Spaces you'd have on Querki.

Yaas. I've thought about that -- unlike most of Querki, these sorts of users are actually fairly well-suited to a CDN or something like that. But that would be a big project unto itself, and not plausible until I have an income stream, so it's a ways off...

Varnish? Set up a separate Varnish server and bounce non-Members there by default. A separate server can have different virtual hardware, live on a different hosting provider or datacenter, and so on.
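For flavor, the Varnish approach might look something like the following VCL fragment. This is a hedged sketch only: it assumes a hypothetical "querkiSession" login cookie (the real cookie name would come from Querki), and a real deployment would need more care around cookie handling and TTLs.

```vcl
# Hypothetical Varnish (4.x-style) VCL sketch, not a tested config.

sub vcl_recv {
    if (req.http.Cookie ~ "querkiSession=") {
        # Logged in (possibly a Member): always go to the origin.
        return (pass);
    }
    # Anonymous traffic (including most 'bots): strip cookies so all
    # such requests hash to a single cached object per URL.
    unset req.http.Cookie;
    return (hash);
}

sub vcl_backend_response {
    if (bereq.http.Cookie !~ "querkiSession=") {
        set beresp.ttl = 1h;   # cache anonymous pages for an hour
    }
}
```

The design question raised below still applies: whether this beats serving the same requests through Querki's own in-memory caching would need measurement.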

Interesting -- I haven't worked with Varnish, so it's hard to say, but it's probably worth taking a look at. The really interesting question is whether it is any better than simply running those requests through Querki itself. (Which is also essentially a massively multi-threaded cache, just a more intelligent one.) Might need to do some performance testing to see...

