scaling alone: a backend dev’s analytics war story
There are two kinds of tasks that land on your desk.
The neat little features you plan, groom, and break into clean stories.
And then there's the other kind.
Like when your PM casually drops:
“Hey, I need to see how many users Raj added below him in the hierarchy, like, everyone under him. Also, how many registrations they all did. Daily. And oh, it should be filterable by date, role, and should work in realtime... it’s urgent.”
And you blink twice, hoping they’re joking.
But they're not. They follow up with:
“We need org-wide analytics with daily updates, full hierarchy traversal, role-level breakdowns, registration metrics… and it has to scale for 50k users. Oh, and you're the only backend dev.”
Great. :)
This story is about that second kind.
The one that doesn't come with clean specs,
only chaos, coffee, and a very real chance you'll end up writing a recursive CTE that haunts your dreams.
🛠️ The Setup
We had:
- ~50,000 users.
- Each user could perform anywhere from 500 to 1,000 registrations per day.
- 15 levels of roles - with org-specific semantics like Volunteer, LPR, DPOC, DPR, CPOC, RM and a bunch more. :/
- A single user could have multiple supervisors, and supervisors could have thousands of descendants.
- Users were constantly creating new users, so the hierarchy was dynamic and unbounded.
- We had to show:
Self + descendants’ metrics, per day, per user, with filters by:
- Date
- Role
- Address
- Registration type
- ...and a dozen more
🔥 Jumping into the Fire
My first thought:
“Okay, just a few joins of the users table with other tables using `created_by`. Boom. Done.”
How cute.
Here’s what broke:
Joins Exploded: Because every user could spawn more users who spawned more users who... you get the point. The tree kept going. So did the joins. Until Postgres wheezed.
The N+1 Apocalypse: Every user’s data needed recursive fetches. Every recursive fetch meant a new query. Multiply that by thousands. Multiply again for each metric.
Filters From Hell: A simple date filter became a nightmare once descendants got involved. “Registrations by anyone under Raj between March and April but only if they were Volunteers from Bihar” = 💥
Real-Time Expectations + No Precomputed Tables. Everything had to be calculated on the fly. Because caching was a “next sprint” idea.
Can you imagine hitting all of this, plus a bunch more problems, in just a single day?
And to top it all off?
We had an urgent demo in 48 hours. Classic.
The Schema That Saved Me (and My Sanity)
After crying internally and externally, I realized I needed to model the hierarchy traversal cleanly - without nuking the database every time someone asked for "all users under Raj from the past month."
So I took a breath and went back to first principles.
We needed a way to:
- Traverse user hierarchy (even with many-to-many relationships)
- Get all descendants for a given user
- Aggregate metrics for any node in the tree
…and not kill the DB.
The solution?
A recursive Common Table Expression (CTE) - the SQL wizardry that lets you walk a tree structure using a parent-child mapping.
We already had a table that looked like this:
user_hierarchy (
parent_id UUID,
child_id UUID
)
No fancy closure table or nested sets - just plain old parent-child pairs. Why? Because we had no fixed depth. The hierarchy could go 3 levels deep… or 30. It was unpredictable and dynamic. I didn’t want to maintain precomputed paths. And frankly, I didn’t have the luxury of time or a data engineer.
So I did what any desperate backend dev would do:
WITH RECURSIVE descendants AS (
SELECT child_id
FROM user_hierarchy
WHERE parent_id = :given_user_id -- Starting from Raj (for example)
UNION ALL
SELECT uh.child_id
FROM user_hierarchy uh
INNER JOIN descendants d ON d.child_id = uh.parent_id
)
SELECT * FROM descendants;
This gave me all the users under a given user, no matter how deep the rabbit hole went. Whether it was 3 levels or 13, Postgres handled it like a champ (until it didn’t 😅).
Sounds inefficient? It was. But it worked. Barely. Now I could:
- Grab all descendants of a user,
- Then join them with the registrations table,
- And group/filter the hell out of it.
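If you want to sanity-check the traversal pattern outside Postgres, SQLite also supports recursive CTEs, so a throwaway script like this (table contents and IDs invented purely for illustration) lets you verify the walk terminates and returns every level:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_hierarchy (parent_id TEXT, child_id TEXT)")
# raj -> a, raj -> b, a -> c, c -> d  (a 3-level chain plus a sibling)
conn.executemany(
    "INSERT INTO user_hierarchy VALUES (?, ?)",
    [("raj", "a"), ("raj", "b"), ("a", "c"), ("c", "d")],
)

rows = conn.execute(
    """
    WITH RECURSIVE descendants AS (
        SELECT child_id FROM user_hierarchy WHERE parent_id = ?
        UNION ALL
        SELECT uh.child_id
        FROM user_hierarchy uh
        JOIN descendants d ON d.child_id = uh.parent_id
    )
    SELECT child_id FROM descendants
    """,
    ("raj",),
).fetchall()

print(sorted(r[0] for r in rows))  # -> ['a', 'b', 'c', 'd']
```

Same shape as the Postgres query, just with `?` placeholders instead of `:given_user_id`.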
But the fun was just beginning.
Now Come the Filters (the Real Pain)
Okay, cool. I had a working recursive structure.
But then product asked for filters.
Simple ones at first: date ranges and user roles.
No big deal.
Until we hit 30,000 users.
That’s when the queries started to die.
I’m talking 20-second response times, CPU spikes, and a very visible fear of Postgres in `htop`.
The filtering logic got wild:
- Date ranges? Easy.
- Roles? Sure.
- Districts, blocks, villages, panchayats, registration type, channel, mode, active, installed the app or not … oh god.
Now imagine:
“Give me how many registrations were done by all users under Raj between Jan 1 and Feb 15, but only the ones who are DPOCs from East Champaran and have installed the app.”
AND do this live, because “real-time analytics” was apparently a requirement.
(Yes, we hadn’t even built caching yet.)
Instead of smashing all metrics into one mega-query, I broke them down:
- One query fetched all descendants using the recursive CTE.
- Then, for each metric - say, registrations - I wrote specific, indexed, filter-aware queries.
- Filters got pre-validated and normalized in the backend.
- Results were then aggregated manually in memory or via smaller grouped SQL calls.
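The “pre-validated and normalized” step is just defensive plumbing before any SQL gets built. A minimal sketch of the idea (the whitelist and field names here are hypothetical, not the real config):

```python
from datetime import date

# Hypothetical role whitelist -- the real one lived in app config
ALLOWED_ROLES = {"VOLUNTEER", "LPR", "DPR", "DPOC", "CPOC", "RM"}

def normalize_filters(raw: dict) -> dict:
    """Drop unknown values and coerce types so queries only ever see clean input."""
    filters = {}

    # Keep only roles we recognize, normalized to upper case
    roles = [r.upper() for r in raw.get("roles", []) if r.upper() in ALLOWED_ROLES]
    if roles:
        filters["roles"] = roles

    # Parse dates and silently swap inverted ranges instead of erroring
    start, end = raw.get("start_date"), raw.get("end_date")
    if start and end:
        start, end = date.fromisoformat(start), date.fromisoformat(end)
        if start > end:
            start, end = end, start
        filters["start_date"], filters["end_date"] = start, end

    return filters

print(normalize_filters(
    {"roles": ["dpoc", "hacker"], "start_date": "2024-02-15", "end_date": "2024-01-01"}
))
```

Unknown roles get dropped, dates get parsed, and an inverted range comes back the right way around, so the SQL layer never has to care.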
Was it pretty? No.
Was it clean? Also no.
Was it battle-tested and got the job done for the demo? Hell yes.
We weren’t just filtering users.
We were filtering:
- The descendant tree of a given user (using the recursive CTE),
- On attributes of each descendant (`role`, `status`, `region`, etc.),
- While also filtering the registration data they had generated (`registration_type`, `created_at`, etc.).
That meant:
WITH RECURSIVE user_tree AS (
SELECT child_id
FROM user_hierarchy
WHERE parent_id = :given_user_id
UNION ALL
SELECT uh.child_id
FROM user_hierarchy uh
INNER JOIN user_tree ut ON ut.child_id = uh.parent_id
),
filtered_users AS (
SELECT u.id
FROM users u
INNER JOIN user_tree ut ON u.id = ut.child_id
WHERE u.role = ANY(:roles)
AND u.village = ANY(:villages)
AND u.status = 'active'
),
registration_stats AS (
SELECT
r.created_by,
DATE(r.created_at) AS reg_date,
COUNT(*) AS total_regs
FROM registrations r
WHERE r.created_by IN (SELECT id FROM filtered_users)
AND r.registration_type = ANY(:types)
AND r.created_at BETWEEN :start_date AND :end_date
GROUP BY r.created_by, reg_date
)
SELECT * FROM registration_stats;
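Here’s a runnable stand-in for that layered query using SQLite (the data is invented; Postgres’s `ANY(:roles)` becomes an expanded `IN (...)` list of placeholders, which is the portable way to bind a variable-length list safely):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_hierarchy (parent_id TEXT, child_id TEXT);
CREATE TABLE users (id TEXT PRIMARY KEY, role TEXT, status TEXT);
CREATE TABLE registrations (id INTEGER PRIMARY KEY, created_by TEXT, created_at TEXT);
INSERT INTO user_hierarchy VALUES ('raj','a'),('a','b');
INSERT INTO users VALUES ('a','DPOC','active'),('b','VOLUNTEER','active');
INSERT INTO registrations (created_by, created_at) VALUES
  ('a','2024-01-10'),('a','2024-01-10'),('b','2024-01-11');
""")

roles = ["DPOC"]
placeholders = ",".join("?" * len(roles))  # one ? per role, bound as parameters

sql = f"""
WITH RECURSIVE user_tree AS (
    SELECT child_id FROM user_hierarchy WHERE parent_id = ?
    UNION ALL
    SELECT uh.child_id FROM user_hierarchy uh
    JOIN user_tree ut ON ut.child_id = uh.parent_id
),
filtered_users AS (
    SELECT u.id FROM users u
    JOIN user_tree ut ON u.id = ut.child_id
    WHERE u.role IN ({placeholders}) AND u.status = 'active'
)
SELECT r.created_by, DATE(r.created_at) AS reg_date, COUNT(*) AS total_regs
FROM registrations r
WHERE r.created_by IN (SELECT id FROM filtered_users)
  AND r.created_at BETWEEN ? AND ?
GROUP BY r.created_by, reg_date
"""
rows = conn.execute(sql, ["raj", *roles, "2024-01-01", "2024-01-31"]).fetchall()
print(rows)  # -> [('a', '2024-01-10', 2)]
```

Only the user-level filters live in `filtered_users`; the registration-level filters stay in the outer query, which is exactly the layering described above.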
Now came the real kicker: 50,000+ users active. Our recursive queries were solid, but still choking under full-table scans. We couldn’t afford 10s per query.
So I indexed everything that moved:
- `user_id` on `registrations`
- `(parent_id, child_id)` on `user_hierarchy`
- Partial indexes for date + role + location filters
That dropped our query times from ~8-10s to under 1s in many cases.
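The same indexing ideas are easy to demo in SQLite (column order and the `'FIELD'` partial-index predicate are illustrative assumptions, not the production DDL): lead composite indexes with the column you filter on first, and use a partial index for the hot subset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_hierarchy (parent_id TEXT, child_id TEXT);
CREATE TABLE registrations (id INTEGER PRIMARY KEY, created_by TEXT,
                            created_at TEXT, registration_type TEXT);
-- The tree walk always looks up children by parent, so lead with parent_id
CREATE INDEX idx_hierarchy_parent ON user_hierarchy (parent_id, child_id);
-- Stats queries filter by creator first, then by date range
CREATE INDEX idx_regs_creator_date ON registrations (created_by, created_at);
-- Partial index: only rows of one hot type (hypothetical value 'FIELD')
CREATE INDEX idx_regs_recent ON registrations (created_at)
    WHERE registration_type = 'FIELD';
""")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT child_id FROM user_hierarchy WHERE parent_id = 'raj'"
).fetchall()
print(plan[0][-1])  # the plan detail should mention idx_hierarchy_parent
```

Checking `EXPLAIN QUERY PLAN` (or `EXPLAIN ANALYZE` in Postgres) after adding each index is how you confirm the planner actually picks it up instead of guessing.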
I Tested Like a Paranoid Linter on steroids. :/
I fed the query engine all kinds of Rajs:
- Raj with 2 kids
- Raj with 2,000
- Raj with 10,000 kids and half of Bihar under him
It scaled. Mostly.
But one thing saved me more than anything else:
Calculating metrics separately per concern, caching where safe, and never touching the entire org unless explicitly needed.
📈 The Metrics Meltdown
So we’d survived Raj.
But Raj wasn’t the end - he was the beginning of my next nightmare.
See, the original ask was just:
“Show me metrics under Raj.”
But now Product said:
“Let’s make this a leaderboard. I want to see all users, along with:
- How many people they added directly
- How many their downline added (all with different roles, obviously)
- Registrations they did
- Registrations their entire tree did
- Filters by role, location, time window
- Paginated. Sorted. Realtime.”
Translation:
Run recursive CTE for every single user in the system - 50,000+ of them.
Every. Damn. Time.
If you’re wondering what happened next:
Yes, I tried it.
Yes, Postgres cried.
Yes, everything broke.
I needed to do something smart and fast.
Precomputing the Damn Metrics
And then it hit me: I didn’t need to compute these metrics live for every user.
Instead, I needed to pre-aggregate metrics per user, including:
- Their own registrations (`self_reg_count`)
- Registrations by all their descendants (`tree_reg_count`)
- Number of people they created directly
- Number of LPRs / DPOCs / Volunteers in their entire tree
So I created a materialized table:
A snapshot of the org tree, but with real metrics baked in.
Let’s call this baby:
user_metrics (
user_id UUID PRIMARY KEY,
self_reg_count INTEGER,
tree_reg_count INTEGER,
total_users_created INTEGER,
lpr_count INTEGER,
dpr_count INTEGER,
volunteer_count INTEGER,
-- ...and all the other counts
last_updated TIMESTAMP
)
And the data?
populated via background jobs using - guess what - recursive CTEs.
🔁 Building this
Each hour (and later, each 15 mins), we ran a job like this:
-- Fetch all (root, descendant) pairs; seed each user with a "self" row,
-- otherwise self_reg_count can never match
WITH RECURSIVE user_tree AS (
    SELECT id AS root_id, id AS child_id
    FROM users
    UNION  -- UNION (not UNION ALL) dedupes multi-supervisor paths and stops cycles
    SELECT ut.root_id, uh.child_id
    FROM user_tree ut
    JOIN user_hierarchy uh ON ut.child_id = uh.parent_id
)
-- Join with registrations and users to get counts
SELECT
    ut.root_id AS user_id,
    COUNT(DISTINCT r.id) FILTER (WHERE r.created_by = ut.root_id) AS self_reg_count,
    COUNT(DISTINCT r.id) FILTER (WHERE r.created_by != ut.root_id) AS tree_reg_count,
    COUNT(DISTINCT u.id) FILTER (WHERE u.role = 'LPR' AND u.id != ut.root_id) AS lpr_count,
    COUNT(DISTINCT u.id) FILTER (WHERE u.role = 'DPR' AND u.id != ut.root_id) AS dpr_count,
    COUNT(DISTINCT u.id) FILTER (WHERE u.role = 'VOLUNTEER' AND u.id != ut.root_id) AS volunteer_count
FROM user_tree ut
LEFT JOIN users u ON u.id = ut.child_id
LEFT JOIN registrations r ON r.created_by = ut.child_id
GROUP BY ut.root_id;
That output was dumped straight into the `user_metrics` table, replacing rows for each user.
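A stripped-down version of that refresh job, runnable against SQLite (only `tree_reg_count` here, and the data is invented), shows the core trick: one batch pass builds every (root, descendant) pair, aggregates per root, and upserts the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_hierarchy (parent_id TEXT, child_id TEXT);
CREATE TABLE registrations (id INTEGER PRIMARY KEY, created_by TEXT);
CREATE TABLE user_metrics (user_id TEXT PRIMARY KEY,
                           tree_reg_count INTEGER, last_updated TEXT);
INSERT INTO user_hierarchy VALUES ('raj','a'),('a','b');
INSERT INTO registrations (created_by) VALUES ('a'),('b'),('b');
""")

def refresh_user_metrics(conn):
    """One batch pass: expand every (root, descendant) pair, aggregate per root."""
    conn.execute("""
        INSERT OR REPLACE INTO user_metrics (user_id, tree_reg_count, last_updated)
        WITH RECURSIVE user_tree AS (
            SELECT parent_id AS root_id, child_id FROM user_hierarchy
            UNION  -- UNION (not ALL) dedupes paths when a user has many supervisors
            SELECT ut.root_id, uh.child_id
            FROM user_tree ut
            JOIN user_hierarchy uh ON ut.child_id = uh.parent_id
        )
        SELECT ut.root_id, COUNT(r.id), DATETIME('now')
        FROM user_tree ut
        LEFT JOIN registrations r ON r.created_by = ut.child_id
        GROUP BY ut.root_id
    """)
    conn.commit()

refresh_user_metrics(conn)
print(conn.execute(
    "SELECT user_id, tree_reg_count FROM user_metrics ORDER BY user_id"
).fetchall())  # -> [('a', 2), ('raj', 3)]
```

Raj gets credit for all three registrations under him (one by `a`, two by `b`), while `a` gets the two from his own subtree. Scheduling this from cron (or pg_cron) is all the “background job” amounted to.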
Now, when the frontend asked:
“Hey, show me a leaderboard of most active DPOCs by tree-wide registration count,”
…all we needed was:
SELECT
u.name,
u.role,
um.self_reg_count,
um.tree_reg_count,
um.volunteer_count
FROM user_metrics um
JOIN users u ON u.id = um.user_id
WHERE u.role = 'DPOC'
ORDER BY um.tree_reg_count DESC
LIMIT 50 OFFSET 0;
Boom.
Query time dropped from ~6-7 seconds to under 100ms.
And no recursion. No joins with `user_hierarchy` or `registrations`. Just a flat table.
🧱 Caching Hierarchy Metrics: Tradeoffs and Wins
Was it perfect? No.
- Data was eventually consistent - could be a few minutes stale.
- We needed background jobs to run hourly.
- New metrics meant updating the CTE logic.
- Edge cases (like cycles or deleted users) still needed cleanup.
But the benefits?
- Queries went from 🔥 dumpster fire to 🧊 ice cold efficiency
- UI performance soared
- I didn’t have to touch the recursion logic every time someone asked for a new metric
🧘 The End (Sort Of)
Recursive CTEs saved me.
But precomputing and flattening the chaos saved the system.
The truth is: when you’re the only backend dev, working on 50k users, complex org structures, and real-time dashboards…
You don’t need magic.
You need war-hardened SQL.
And a plan to precompute everything that hurts.
I learned this the hard way.
But now, when someone says:
“Can I get all users under X, with metrics per role, sorted by downline performance, and filterable by date, and…”
…I smile.
Because I’ve already got the answer waiting in user_metrics
.
Still in the Trenches: Optimizing the Recursive Beast
Now here’s the twist: while I did get the recursive CTE working, and the hierarchy tree shows up as expected - clean, complete, and shockingly elegant - I’m still in the process of optimizing this.
Why?
Because recursion in SQL is a double-edged sword. It’s great for clarity, but brutal for performance once your dataset scales.
Suddenly, things like:
- Deeply nested hierarchies (I have supervisors of supervisors of supervisors... and sometimes, they loop back 🙃),
- Varying depths across the org (some trees are shallow, some are downright abyssal),
- And those annoying cases where one supervisor controls multiple other supervisors at the same level - or even their own boss (yep, it's as chaotic as it sounds)...
...they all start biting back.
The recursive CTE works, but I’m keeping a close eye on:
- Execution time (especially as the number of users grows),
- Query planner output,
- And whether pre-aggregation or hybrid caches can save the day and could help for high-frequency reads.
Right now, I’m testing a hybrid strategy - think: recursive CTE + a flattened lookup table updated periodically via a cron or trigger-based system. That way, I get both dynamic accuracy and read performance.
But yeah, it’s still a work in progress. Welcome to backend engineering: where “it works” is just the first boss - and performance is the real endgame and the beast you’ve got to tame.
And yeah, I’m seriously thinking about moving to a closure table. Let’s see - if I do implement it, the link to that blog will be down below.
🧑🏻💻 Moving to the Closure Table
TL;DR
- Use recursive CTEs to walk dynamic trees
- Isolate filters at each layer: user-level, data-level
- Break metrics into composable queries
- Precompute what hurts to compute live
- Benchmark, monitor, and optimize iteratively
- And always… test with a Raj who has 10,000 kids.
Until then, may your trees be acyclic, your recursion terminate, and your DB not set itself on fire under load.
See you in the next war story.