the stack overflowed

scaling alone: a backend dev’s analytics war story

There are two kinds of tasks that get assigned to you:

  1. The neat little features you plan, groom, and break into clean stories.

  2. And then there’s the other kind.

Like when your PM casually drops:

“Hey, I need to see how many users Raj added below him in the hierarchy, like, everyone under him. Also, how many registrations they all did. Daily. And oh, it should be filterable by date, role, and should work in realtime... it’s urgent.”

And you blink twice, hoping they’re joking.

But they're not. They follow up with:

“We need org-wide analytics with daily updates, full hierarchy traversal, role-level breakdowns, registration metrics… and it has to scale for 50k users. Oh, and you're the only backend dev.”

Great. :)

This story is about that second kind.
The one that doesn't come with clean specs,
only chaos, coffee, and a very real chance you'll end up writing a recursive CTE that haunts your dreams.

🛠️ The Setup

We had:

  1. Exactly one backend dev (me)
  2. A Postgres database, with a plain parent-child user_hierarchy table already in it
  3. A hierarchy of unpredictable depth, and 50k users to scale for
  4. “Real-time” analytics as a non-negotiable requirement

🔥 Jumping into the Fire

First thought:

“Okay, just a few joins of the users table with other tables on created_by. Boom. Done.”

How cute.
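For the record, the naive version looked roughly like this (a reconstruction of the idea, using the same tables you’ll meet later in this post):

-- One hop down the hierarchy, no recursion
SELECT uh.child_id, COUNT(r.id) AS total_regs
FROM user_hierarchy uh
LEFT JOIN registrations r ON r.created_by = uh.child_id
WHERE uh.parent_id = :raj_id  -- direct children only...
GROUP BY uh.child_id;         -- grandchildren never show up

It answers “who did Raj add?” beautifully. It just can’t answer “who did they add?”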

Here’s what broke:

  1. The hierarchy had no fixed depth - 3 levels for some users, far more for others
  2. A user could sit under multiple parents, so it was many-to-many, not a neat tree
  3. Plain created_by joins only ever saw direct children - the rest of the downline was invisible

Could you imagine encountering this, and a bunch more problems, in just a single day?

And to top it all off?

We had an urgent demo in 48 hours. Classic.

The Schema That Saved Me (and My Sanity)

After crying internally and externally, I realized I needed to model the hierarchy traversal cleanly - without nuking the database every time someone asked for "all users under Raj from the past month."

So I took a breath and went back to first principles.

We needed a way to:

  1. Traverse user hierarchy (even with many-to-many relationships)
  2. Get all descendants for a given user
  3. Aggregate metrics for any node in the tree

…and not kill the DB.

The solution?

A recursive Common Table Expression (CTE) - the SQL wizardry that lets you walk a tree structure using a parent-child mapping.

We already had a table that looked like this:

CREATE TABLE user_hierarchy (
  parent_id UUID,
  child_id UUID
);

No fancy closure table or nested sets - just plain old parent-child pairs. Why? Because we had no fixed depth. The hierarchy could go 3 levels deep… or 30. It was unpredictable and dynamic. I didn’t want to maintain precomputed paths. And frankly, I didn’t have the luxury of time or a data engineer.

So I did what any desperate backend dev would do:

WITH RECURSIVE descendants AS (
  SELECT child_id
  FROM user_hierarchy
  WHERE parent_id = :given_user_id  -- Starting from Raj (for example)

  UNION ALL

  SELECT uh.child_id
  FROM user_hierarchy uh
  INNER JOIN descendants d ON d.child_id = uh.parent_id
)
SELECT * FROM descendants;

This gave me all the users under a given user, no matter how deep the rabbit hole went. Whether it was 3 levels or 13, Postgres handled it like a champ (until it didn’t 😅).

Sounds inefficient? It was. But it worked. Barely. Now I could do the three things on my list: traverse the hierarchy, pull every descendant of a given user, and aggregate metrics over that subtree.
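One safety net worth bolting on, by the way (not in the query above, and the cap below is an arbitrary number I’m picking for illustration): a depth counter, so one weird edge in the hierarchy can’t send the recursion spiraling forever.

WITH RECURSIVE descendants AS (
  SELECT child_id, 1 AS depth
  FROM user_hierarchy
  WHERE parent_id = :given_user_id

  UNION ALL

  SELECT uh.child_id, d.depth + 1
  FROM user_hierarchy uh
  INNER JOIN descendants d ON d.child_id = uh.parent_id
  WHERE d.depth < 20  -- hard stop; tune to your deepest legitimate tree
)
SELECT * FROM descendants;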

But the fun was just beginning.

Now Come the Filters (the Real Pain)

Okay, cool. I had a working recursive structure.

But then Product asked for filters.
Simple ones at first - date ranges and user roles.

No big deal.

Until we hit 30,000 users.

That’s when the queries started to die.

I’m talking 20-second response times, CPU spikes, and watching Postgres visibly suffer in htop.

The filtering logic got wild: roles, villages, registration types, date ranges, app-install status - all optional, all stackable.

Now imagine:

“Give me how many registrations were done by all users under Raj between Jan 1 and Feb 15 who are DPOCs and from East Champaran, and only if they installed the app.”

AND do this live, because “real-time analytics” was apparently a requirement.
(Yes, we hadn’t even built caching yet.)

Instead of smashing all metrics into one mega-query, I broke them down into separate, smaller queries - one per concern.

Was it pretty? No.
Was it clean? Also no.
Was it battle-tested and got the job done for the demo? Hell yes.

We weren’t just filtering users.

We were filtering users by role, village, and status - and registrations by type and date range, all at once.

That meant a layered query: walk the tree, filter the users, then aggregate the registrations:

WITH RECURSIVE user_tree AS (
  -- walk the full downline from the given user
  SELECT child_id
  FROM user_hierarchy
  WHERE parent_id = :given_user_id

  UNION ALL

  SELECT uh.child_id
  FROM user_hierarchy uh
  INNER JOIN user_tree ut ON ut.child_id = uh.parent_id
),
filtered_users AS (
  -- user-level filters: role, village, status
  SELECT u.id
  FROM users u
  INNER JOIN user_tree ut ON u.id = ut.child_id
  WHERE u.role = ANY(:roles)
    AND u.village = ANY(:villages)
    AND u.status = 'active'
),
registration_stats AS (
  -- registration-level filters, aggregated per user per day
  SELECT 
    r.created_by,
    DATE(r.created_at) AS reg_date,
    COUNT(*) AS total_regs
  FROM registrations r
  WHERE r.created_by IN (SELECT id FROM filtered_users)
    AND r.registration_type = ANY(:types)
    AND r.created_at BETWEEN :start_date AND :end_date
  GROUP BY r.created_by, reg_date
)
SELECT * FROM registration_stats;

Now came the real kicker: 50,000+ active users. Our recursive queries were solid, but still choking on full-table scans. We couldn’t afford 10s per query.

So I indexed everything that moved.
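Concretely - the index names here are my reconstruction, but the set covered every column the queries above filter or join on:

CREATE INDEX idx_hierarchy_parent ON user_hierarchy (parent_id);
CREATE INDEX idx_hierarchy_child  ON user_hierarchy (child_id);
CREATE INDEX idx_regs_creator_date ON registrations (created_by, created_at);
CREATE INDEX idx_regs_type ON registrations (registration_type);
CREATE INDEX idx_users_role ON users (role);
CREATE INDEX idx_users_village ON users (village);

The composite on (created_by, created_at) does the heavy lifting for the stats CTE, since it serves both the IN filter and the date range in one pass.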

That dropped our query times from ~8-10s to under 1s in many cases.

I Tested Like a Paranoid Linter on Steroids :/

I fed the query engine all kinds of Rajs: Rajs with three-level trees, Rajs with thirteen-level trees, Rajs with enormous downlines, and Rajs with nobody under them at all.

It scaled. Mostly.

But one thing saved me more than anything else:

Calculating metrics separately per concern, caching where safe, and never touching the entire org unless explicitly needed.
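In practice, “separately per concern” looked something like this sketch (raj_tree and the two metric queries are illustrative, not our exact code): resolve the subtree once, then hit it with small, independent queries.

-- 1. Resolve the subtree once
CREATE TEMP TABLE raj_tree AS
WITH RECURSIVE user_tree AS (
  SELECT child_id FROM user_hierarchy WHERE parent_id = :raj_id
  UNION ALL
  SELECT uh.child_id
  FROM user_hierarchy uh
  JOIN user_tree ut ON ut.child_id = uh.parent_id
)
SELECT child_id FROM user_tree;

-- 2. Each metric gets its own small query against that set
SELECT COUNT(*) AS total_regs
FROM registrations
WHERE created_by IN (SELECT child_id FROM raj_tree)
  AND created_at BETWEEN :start_date AND :end_date;

SELECT role, COUNT(*) AS users_in_role
FROM users
WHERE id IN (SELECT child_id FROM raj_tree)
GROUP BY role;

Each piece stays small, fails independently, and can be cached on its own.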

📈 The Metrics Meltdown

So we’d survived Raj.
But Raj wasn’t the end - he was the beginning of my next nightmare.

See, the original ask was just:

“Show me metrics under Raj.”

But now Product said:

“Let’s make this a leaderboard. I want to see all users, along with: their own registrations, everything registered across their whole tree, and how many LPRs, DPRs, and volunteers sit under them.”

Translation:
Run a recursive CTE for every single user in the system - 50,000+ of them.
Every. Damn. Time.

If you’re wondering what happened next:
Yes, I tried it.
Yes, Postgres cried.
Yes, everything broke.

I needed to do something smart and fast.

Precomputing the Damn Metrics

And I did. The realization: I don’t need to compute these metrics live for every user.

Instead, I needed to pre-aggregate metrics per user: their own registrations, their tree’s registrations, the users they created, and counts per role beneath them.

So I created a materialized table:
A snapshot of the org tree, but with real metrics baked in.

Let’s call this baby:

CREATE TABLE user_metrics (
  user_id UUID PRIMARY KEY,
  self_reg_count INTEGER,
  tree_reg_count INTEGER,
  total_users_created INTEGER,
  lpr_count INTEGER,
  dpr_count INTEGER,
  volunteer_count INTEGER,
  -- ...and all the other counts
  last_updated TIMESTAMP
);

And the data?
Populated via background jobs using - guess what - recursive CTEs.

🔁 Building This

Every hour (and later, every 15 minutes), we ran a job like this:

-- Fetch each user plus all of their descendants
WITH RECURSIVE user_tree AS (
  -- Seed with (user, user) pairs so each root is its own descendant;
  -- without this, the self_reg_count filter below can never match
  SELECT id AS root_id, id AS child_id
  FROM users

  UNION ALL

  SELECT ut.root_id, uh.child_id
  FROM user_tree ut
  JOIN user_hierarchy uh ON ut.child_id = uh.parent_id
)
-- Join with registrations and users to get counts
-- (DISTINCT guards against double-counting when a user is
-- reachable through more than one path in the hierarchy)
SELECT 
  ut.root_id AS user_id,
  COUNT(DISTINCT r.id) FILTER (WHERE r.created_by = ut.root_id) AS self_reg_count,
  COUNT(DISTINCT r.id) FILTER (WHERE r.created_by != ut.root_id) AS tree_reg_count,
  COUNT(DISTINCT u.id) FILTER (WHERE u.role = 'LPR' AND u.id != ut.root_id) AS lpr_count,
  COUNT(DISTINCT u.id) FILTER (WHERE u.role = 'DPR' AND u.id != ut.root_id) AS dpr_count,
  COUNT(DISTINCT u.id) FILTER (WHERE u.role = 'VOLUNTEER' AND u.id != ut.root_id) AS volunteer_count
FROM user_tree ut
LEFT JOIN users u ON u.id = ut.child_id
LEFT JOIN registrations r ON r.created_by = ut.child_id
GROUP BY ut.root_id;
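Getting that output into user_metrics was a plain upsert. A minimal sketch - computed_metrics is just a stand-in name for the job’s result set, and I’ve trimmed the column list:

INSERT INTO user_metrics (user_id, self_reg_count, tree_reg_count, last_updated)
SELECT user_id, self_reg_count, tree_reg_count, NOW()
FROM computed_metrics  -- hypothetical staging of the query output above
ON CONFLICT (user_id) DO UPDATE
SET self_reg_count = EXCLUDED.self_reg_count,
    tree_reg_count = EXCLUDED.tree_reg_count,
    last_updated   = EXCLUDED.last_updated;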

So every run replaced each user’s row wholesale, keeping user_metrics fresh. Now, when the frontend asked:

“Hey, show me a leaderboard of most active DPOCs by tree-wide registration count,”

…all we needed was:

SELECT 
  u.name, 
  u.role,
  um.self_reg_count, 
  um.tree_reg_count,
  um.volunteer_count
FROM user_metrics um
JOIN users u ON u.id = um.user_id
WHERE u.role = 'DPOC'
ORDER BY um.tree_reg_count DESC
LIMIT 50 OFFSET 0;

Boom.
Query time dropped from ~6-7 seconds to under 100ms.
And no recursion.
No joins with user_hierarchy or registrations.
Just a flat table.

🧱 Caching Hierarchy Metrics: Tradeoffs and Wins

Was it perfect? No.

But the benefits? Sub-100ms reads, no recursion at request time, and staleness bounded by a 15-minute refresh - a trade worth making.

🧘 The End (Sort Of)

Recursive CTEs saved me.
But precomputing and flattening the chaos saved the system.

The truth is: when you’re the only backend dev, juggling 50k users, complex org structures, and real-time dashboards…

You don’t need magic.
You need war-hardened SQL.
And a plan to precompute everything that hurts.

I learned this the hard way.
But now, when someone says:

“Can I get all users under X, with metrics per role, sorted by downline performance, and filterable by date, and…”

…I smile.

Because I’ve already got the answer waiting in user_metrics.

Still in the Trenches: Optimizing the Recursive Beast

Now here’s the twist: while I did get the recursive CTE working, and the hierarchy tree shows up as expected - clean, complete, and shockingly elegant - I’m still in the process of optimizing it.

Why?

Because recursion in SQL is a double-edged sword. It’s great for clarity, but brutal for performance once your dataset scales.

Suddenly, things like unbounded depth, duplicate paths from many-to-many links, and ever-larger scans all start biting back.

The recursive CTE works, but I’m keeping a close eye on query times, CPU usage, and recursion depth as the dataset grows.

Right now, I’m testing a hybrid strategy - think: recursive CTE + a flattened lookup table updated periodically via a cron or trigger-based system. That way, I get both dynamic accuracy and read performance.
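A rough sketch of the flattened half of that idea (this is my current working shape, not shipped code):

-- Flattened (root, descendant) pairs, refreshed on a schedule
CREATE TABLE user_tree_flat (
  root_id       UUID NOT NULL,
  descendant_id UUID NOT NULL,
  PRIMARY KEY (root_id, descendant_id)
);

-- Periodic refresh; UNION (not UNION ALL) dedupes shared paths
-- and keeps a cycle from recursing forever
TRUNCATE user_tree_flat;
INSERT INTO user_tree_flat (root_id, descendant_id)
WITH RECURSIVE user_tree AS (
  SELECT parent_id AS root_id, child_id
  FROM user_hierarchy
  UNION
  SELECT ut.root_id, uh.child_id
  FROM user_tree ut
  JOIN user_hierarchy uh ON ut.child_id = uh.parent_id
)
SELECT root_id, child_id FROM user_tree;

Reads then become a plain join against user_tree_flat - no recursion at request time, at the cost of a little staleness between refreshes.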

But yeah, it’s still a work in progress. Welcome to backend engineering: where “it works” is just the first boss - and performance is the real endgame, the beast you’ve got to tame.

And yeah, I’m seriously thinking about moving to a closure table. Let’s see - if I do implement it, the link to that blog will be down below.

🧑🏻‍💻 Moving to the Closure Table
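No full write-up yet, but for the impatient: the classic closure-table shape stores every ancestor-descendant pair along with its depth. A textbook sketch (not our schema yet):

CREATE TABLE user_closure (
  ancestor_id   UUID NOT NULL,
  descendant_id UUID NOT NULL,
  depth         INTEGER NOT NULL,  -- 0 = self, 1 = direct child, ...
  PRIMARY KEY (ancestor_id, descendant_id)
);

It trades write-time bookkeeping for near-instant subtree reads - exactly the trade this whole post has been circling.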

TL;DR: Recursive CTEs got me correctness. Precomputed user_metrics got me speed. A closure table might get me both.

Until then, may your trees be acyclic, your recursion terminate, and your DB not set itself on fire under load.

See you in the next war story.

#analytics #backend #data-modeling #postgres #recursive-cte