SQL COUNT and DISTINCT: Practical Guide with Real-World Examples & Performance Tips

Hey there! So you're working with SQL databases and need to figure out how many unique orders came through last quarter? Or maybe identify how many distinct customers bought specific products? That's exactly where COUNT and DISTINCT become your best friends. I remember struggling with this early in my career – writing queries that returned duplicate records and wondering why my reports looked wrong. That frustration taught me to master these tools properly.

What COUNT Actually Does (And Where People Mess Up)

Let's cut straight to it: COUNT() tallies rows in your results. But here's where folks get tripped up:

| Syntax | What It Counts | NULL Handling | Real-Life Use Case |
| --- | --- | --- | --- |
| COUNT(*) | All rows in table/results | Includes NULLs | Total website visits |
| COUNT(column) | Non-NULL values in that column | Excludes NULLs | Completed user registrations |

I once saw a junior dev spend hours debugging why user counts didn't match – turned out they used COUNT(email) when emails could be NULL. Rookie mistake. Use COUNT(*) when you need absolute row counts. Simple as that.
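Here's a minimal sketch of that exact trap, run through Python's built-in sqlite3 module so you can try it without a database server. The users table and its four rows are made up for illustration.

```python
import sqlite3

# Hypothetical users table: two of the four rows have no email.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (user_id, email) VALUES (?, ?)",
    [(1, "a@example.com"), (2, None), (3, "c@example.com"), (4, None)],
)

# COUNT(*) counts rows; COUNT(email) skips rows where email IS NULL.
all_rows, with_email = conn.execute(
    "SELECT COUNT(*), COUNT(email) FROM users"
).fetchone()
print(all_rows, with_email)  # 4 2
```

Same table, same query, two different numbers. That gap is what the junior dev was chasing.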

Common COUNT Patterns You'll Use Daily

  • Total records: SELECT COUNT(*) FROM orders
  • Active users: SELECT COUNT(user_id) FROM users WHERE last_login > '2023-01-01'
  • Orders by status: SELECT status, COUNT(*) FROM orders GROUP BY status
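To make the last pattern concrete, here's a small sketch with an invented orders table (again via sqlite3, purely as a convenient sandbox). The status values are assumptions, not anything from a real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO orders (status) VALUES (?)",
    [("shipped",), ("pending",), ("shipped",), ("cancelled",), ("shipped",)],
)

# One output row per status, each with its own row count.
orders_by_status = dict(
    conn.execute("SELECT status, COUNT(*) FROM orders GROUP BY status").fetchall()
)
print(orders_by_status)
```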

DISTINCT: Your Duplicate Data Killer

DISTINCT eliminates duplicate rows from your results. But it's not magic – I've seen queries slow to a crawl because someone used DISTINCT on a huge table without indexes. Here's what you need to know:

| Scenario | Without DISTINCT | With DISTINCT | Why It Matters |
| --- | --- | --- | --- |
| Product colors table | Red, Red, Blue, Green | Red, Blue, Green | Accurate inventory options |
| Customer countries | USA, UK, USA, FR, UK | USA, UK, FR | Marketing region planning |

Where DISTINCT bites you: When applied to multiple columns. SELECT DISTINCT city, country gives unique combos – "Paris, France" and "Paris, USA" count as different entries because DISTINCT applies to the whole row, not just the first column. Makes sense when you think about it, but catches many off guard.
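A quick way to convince yourself: run both forms against the same rows. This sketch assumes a toy customers table with only city and country columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (city TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Paris", "France"), ("Paris", "USA"), ("Paris", "France"), ("Lyon", "France")],
)

# DISTINCT on one column: each city appears once.
cities = conn.execute(
    "SELECT DISTINCT city FROM customers ORDER BY city"
).fetchall()

# DISTINCT on two columns: uniqueness is judged on the (city, country) pair.
pairs = conn.execute(
    "SELECT DISTINCT city, country FROM customers ORDER BY city, country"
).fetchall()

print(cities)  # [('Lyon',), ('Paris',)]
print(pairs)   # [('Lyon', 'France'), ('Paris', 'France'), ('Paris', 'USA')]
```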

When You SHOULDN'T Use DISTINCT

  • On primary keys (they're already unique!)
  • As quick fix for JOIN duplicates (fix the JOIN condition instead)
  • With large text/BLOB columns (kills performance)
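The JOIN point deserves an example. In this sketch (hypothetical customers and addresses tables), the duplicates come from an incomplete join condition, and completing it removes them without any DISTINCT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE addresses (customer_id INTEGER, kind TEXT, city TEXT);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO addresses VALUES
        (1, 'billing', 'London'), (1, 'shipping', 'Leeds'),
        (2, 'billing', 'Boston');
""")

# Incomplete join condition: Ada shows up once per address.
loose = conn.execute("""
    SELECT c.name FROM customers c
    JOIN addresses a ON a.customer_id = c.customer_id
    ORDER BY c.name
""").fetchall()

# Complete join condition: one row per customer, no DISTINCT needed.
fixed = conn.execute("""
    SELECT c.name FROM customers c
    JOIN addresses a ON a.customer_id = c.customer_id AND a.kind = 'billing'
    ORDER BY c.name
""").fetchall()

print(loose)  # [('Ada',), ('Ada',), ('Grace',)]
print(fixed)  # [('Ada',), ('Grace',)]
```

Slapping DISTINCT on the first query would hide the symptom, but any COUNT or SUM over that join would still be inflated.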

The Power Combo: COUNT(DISTINCT)

This is where COUNT and DISTINCT get magical together. Need to know how many unique visitors your site has had today?

SELECT COUNT(DISTINCT user_id) FROM site_activity WHERE date = CURRENT_DATE

Done. But watch these gotchas:

Database Compatibility Note

Most databases support COUNT(DISTINCT column), but several (SQL Server and SQLite, for example) reject a multi-column COUNT(DISTINCT col1, col2). A subquery counts distinct pairs on any of them:

SELECT COUNT(*) FROM (SELECT DISTINCT city, country FROM customers) AS temp
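Here's that subquery on toy data. SQLite happens to be one of the engines that rejects the multi-column form, so the sketch also shows the error path; the customers rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (city TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Paris", "France"), ("Paris", "USA"), ("Paris", "France"), ("Lyon", "France")],
)

# Portable form: deduplicate the pairs first, then count the rows.
pair_count = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT city, country FROM customers) AS temp"
).fetchone()[0]

# Multi-column COUNT(DISTINCT ...) is not accepted by SQLite.
try:
    conn.execute("SELECT COUNT(DISTINCT city, country) FROM customers")
    multi_column_supported = True
except sqlite3.Error:
    multi_column_supported = False

print(pair_count, multi_column_supported)  # 3 False
```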

Real talk: I once tried COUNT(DISTINCT) on a 500-million-row table without proper indexes. The query ran for 40 minutes before I killed it. Lesson learned – always check execution plans!

Essential COUNT(DISTINCT) Patterns

| Business Question | SQL Solution | Performance Tip |
| --- | --- | --- |
| How many unique products sold per category? | SELECT category, COUNT(DISTINCT product_id) FROM sales GROUP BY category | Add index on (category, product_id) |
| Daily unique visitors | SELECT visit_date, COUNT(DISTINCT user_id) FROM visits GROUP BY visit_date | Partition table by date |
| Customers buying multiple items | SELECT COUNT(DISTINCT customer_id) FROM orders WHERE item_count > 1 | Filter before counting distinct |
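The first pattern, sketched end to end with a made-up sales table. The index name is my own choice, and whether an engine actually uses such an index depends on its optimizer, so treat the CREATE INDEX as a starting point rather than a guarantee.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (category TEXT, product_id INTEGER);
    INSERT INTO sales VALUES
        ('toys', 1), ('toys', 1), ('toys', 2),
        ('books', 7), ('books', 7);
    -- Composite index matching the GROUP BY column, then the DISTINCT column.
    CREATE INDEX idx_sales_category_product ON sales (category, product_id);
""")

# Repeat sales of the same product count once within each category.
unique_products = conn.execute("""
    SELECT category, COUNT(DISTINCT product_id) AS unique_products
    FROM sales
    GROUP BY category
    ORDER BY category
""").fetchall()
print(unique_products)  # [('books', 1), ('toys', 2)]
```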

Performance Tuning: Making COUNT DISTINCT Fly

Let's be honest: COUNT(DISTINCT) can be slow. Here's what I've learned optimizing these queries:

  • Index smartly: Add indexes on columns used in DISTINCT, WHERE, and GROUP BY
  • Approximate counts: APPROX_COUNT_DISTINCT() in BigQuery and approx_count_distinct() in Spark SQL trade a small error (typically a few percent) for much faster results on large tables
  • Pre-aggregate: Create summary tables nightly for frequent queries
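Pre-aggregation has one catch worth seeing in code: distinct counts don't add up across groups. This sketch builds a hypothetical daily_unique_visitors summary table (the name and the nightly schedule are assumptions) and compares summing it against an exact recount.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE visits (visit_date TEXT, user_id INTEGER);
    INSERT INTO visits VALUES
        ('2024-01-01', 1), ('2024-01-01', 1), ('2024-01-01', 2),
        ('2024-01-02', 1);
    -- Hypothetical summary table, rebuilt by a nightly job.
    CREATE TABLE daily_unique_visitors (
        visit_date TEXT PRIMARY KEY,
        unique_users INTEGER NOT NULL
    );
    INSERT INTO daily_unique_visitors
    SELECT visit_date, COUNT(DISTINCT user_id) FROM visits GROUP BY visit_date;
""")

# Dashboards read the small summary table instead of scanning visits.
daily = conn.execute(
    "SELECT visit_date, unique_users FROM daily_unique_visitors ORDER BY visit_date"
).fetchall()

# Summing daily uniques double-counts user 1, who visited on both days.
summed = conn.execute(
    "SELECT SUM(unique_users) FROM daily_unique_visitors"
).fetchone()[0]
exact = conn.execute("SELECT COUNT(DISTINCT user_id) FROM visits").fetchone()[0]

print(daily)          # [('2024-01-01', 2), ('2024-01-02', 1)]
print(summed, exact)  # 3 2
```

So a daily summary answers daily questions. For weekly or monthly uniques you need a summary at that grain, or a sketch-based approximate count.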

Warning: NULLs in COUNT DISTINCT

COUNT(DISTINCT email) ignores NULL values completely. If you need to count NULLs as distinct values, do this:

SELECT COUNT(DISTINCT COALESCE(email, 'NULL_PLACEHOLDER')) FROM users

(But honestly? Reconsider your data model if NULLs need special counting)
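Here's the COALESCE trick next to the plain version, on an invented users table where two rows have no email. Note that the placeholder must be a value that can never occur as a real email, otherwise it merges with genuine data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?)",
    [("a@example.com",), (None,), ("a@example.com",), (None,), ("b@example.com",)],
)

# Plain COUNT(DISTINCT) ignores the NULL rows entirely.
without_nulls = conn.execute(
    "SELECT COUNT(DISTINCT email) FROM users"
).fetchone()[0]

# COALESCE maps every NULL to one placeholder, so NULL counts as one extra value.
with_nulls = conn.execute(
    "SELECT COUNT(DISTINCT COALESCE(email, 'NULL_PLACEHOLDER')) FROM users"
).fetchone()[0]

print(without_nulls, with_nulls)  # 2 3
```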

GROUP BY vs DISTINCT: Which to Choose?

Both deduplicate data but serve different purposes:

| Operation | Best For | Performance | My Preference |
| --- | --- | --- | --- |
| DISTINCT | Simple duplicate removal | Faster for small datasets | When I need just unique values |
| GROUP BY | Aggregations (COUNT, SUM, AVG) | Better for large grouped data | When counting distinct per group |

Pro tip: For complex aggregations, GROUP BY often outperforms DISTINCT + subqueries. Compare both with your database's plan tool (EXPLAIN in PostgreSQL and MySQL, EXPLAIN PLAN in Oracle).

When GROUP BY Replaces DISTINCT

Instead of:

SELECT DISTINCT department FROM employees

You can write:

SELECT department FROM employees GROUP BY department

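They return identical results, and most modern optimizers produce the same plan for both. Where the plans do differ, GROUP BY sometimes comes out ahead (especially with proper indexes), so check before assuming.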
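If you want to check the "identical results" part yourself, this sketch runs both forms on a made-up employees table. The ORDER BY is only there so the two result lists can be compared directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, department TEXT);
    INSERT INTO employees VALUES
        ('Ada', 'Engineering'), ('Grace', 'Engineering'),
        ('Linus', 'Ops'), ('Margaret', 'Research');
""")

with_distinct = conn.execute(
    "SELECT DISTINCT department FROM employees ORDER BY department"
).fetchall()
with_group_by = conn.execute(
    "SELECT department FROM employees GROUP BY department ORDER BY department"
).fetchall()

print(with_distinct == with_group_by)  # True
```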

Real-World Problems Solved by Count and Distinct in SQL

Let's get practical. Here are actual scenarios where these commands save the day:

E-Commerce Analysis

  • Unique daily shoppers: COUNT(DISTINCT customer_id)
  • Products in multiple categories: COUNT(DISTINCT category_id) per product
  • Abandoned carts: COUNT(DISTINCT session_id) WHERE checkout_complete = 0

User Analytics

  • Monthly active users (MAU): COUNT(DISTINCT user_id) WHERE last_active BETWEEN ...
  • Feature adoption rate: COUNT(DISTINCT user_id) who used feature X
  • Cross-platform usage: COUNT(DISTINCT device_id) per user
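As one worked example from the lists above, here's a rough MAU query. The activity table and its columns are hypothetical, and strftime is SQLite's way of truncating a date to a month; PostgreSQL would use date_trunc and MySQL DATE_FORMAT instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE activity (user_id INTEGER, active_on TEXT);
    INSERT INTO activity VALUES
        (1, '2024-01-03'), (1, '2024-01-20'), (2, '2024-01-05'),
        (1, '2024-02-01');
""")

# A user active on several days in a month still counts once for that month.
mau = conn.execute("""
    SELECT strftime('%Y-%m', active_on) AS month,
           COUNT(DISTINCT user_id) AS active_users
    FROM activity
    GROUP BY month
    ORDER BY month
""").fetchall()
print(mau)  # [('2024-01', 2), ('2024-02', 1)]
```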

Honestly? I use some form of COUNT(DISTINCT) in almost every analytics report I build. It's that fundamental.

Advanced Tactics: Window Functions and CTEs

When basic COUNT DISTINCT isn't enough:

Counting Distinct Over Time

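Rolling 7-day unique users. Most engines (PostgreSQL, MySQL, SQL Server, SQLite) reject DISTINCT inside a window aggregate, and a ROWS frame counts rows rather than days, so a correlated subquery is the portable route (PostgreSQL date syntax shown):

SELECT
  d.date,
  (SELECT COUNT(DISTINCT v.user_id)
   FROM visits v
   WHERE v.date BETWEEN d.date - INTERVAL '6 days' AND d.date) AS rolling_7d_users
FROM (SELECT DISTINCT date FROM visits) AS d
ORDER BY d.date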
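A correlated-subquery version of the rolling count can be tried on toy data like this. The sketch uses SQLite's date(x, '-6 days') for the date arithmetic and a visit_date column name, both of which are assumptions you'd adapt to your own engine and schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE visits (visit_date TEXT, user_id INTEGER);
    INSERT INTO visits VALUES
        ('2024-01-01', 1), ('2024-01-02', 2), ('2024-01-07', 1),
        ('2024-01-08', 3), ('2024-01-09', 3);
""")

# For each day with traffic, count distinct users seen in the 7 days ending on it.
rolling = conn.execute("""
    SELECT d.visit_date,
           (SELECT COUNT(DISTINCT v.user_id)
            FROM visits v
            WHERE v.visit_date BETWEEN date(d.visit_date, '-6 days')
                                   AND d.visit_date) AS rolling_7d_users
    FROM (SELECT DISTINCT visit_date FROM visits) AS d
    ORDER BY d.visit_date
""").fetchall()

for row in rolling:
    print(row)
```

On 2024-01-08 the window still includes user 2's visit from 2024-01-02, so the count is 3. A day later that visit has aged out and the count drops to 2.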

Complex Counting with CTEs

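Users purchasing from multiple categories (FILTER is supported in PostgreSQL and SQLite; on MySQL or SQL Server, use SUM(CASE WHEN ... THEN 1 ELSE 0 END) instead):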

WITH user_cats AS (
  SELECT user_id, COUNT(DISTINCT category) AS cat_count
  FROM purchases
  GROUP BY user_id
)
SELECT
  COUNT(*) FILTER (WHERE cat_count >= 3) AS power_users,
  COUNT(*) FILTER (WHERE cat_count = 1) AS single_cat_users
FROM user_cats
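For engines without the FILTER clause, the same CTE works with conditional sums. This is a sketch on an invented purchases table, not a drop-in for any particular schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE purchases (user_id INTEGER, category TEXT);
    INSERT INTO purchases VALUES
        (1, 'books'), (1, 'toys'), (1, 'games'), (1, 'books'),
        (2, 'books'), (2, 'books'),
        (3, 'toys'), (3, 'games');
""")

# Step 1 (CTE): distinct categories per user. Step 2: bucket users by that count.
power_users, single_cat_users = conn.execute("""
    WITH user_cats AS (
        SELECT user_id, COUNT(DISTINCT category) AS cat_count
        FROM purchases
        GROUP BY user_id
    )
    SELECT
        SUM(CASE WHEN cat_count >= 3 THEN 1 ELSE 0 END),
        SUM(CASE WHEN cat_count = 1 THEN 1 ELSE 0 END)
    FROM user_cats
""").fetchone()
print(power_users, single_cat_users)  # 1 1
```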

Your COUNT DISTINCT FAQ Answered

Does COUNT(DISTINCT) work with multiple columns?

In standard SQL, no. Use a subquery: SELECT COUNT(*) FROM (SELECT DISTINCT col1, col2 FROM your_table) AS t, or check your DB's docs (some, like MySQL, support COUNT(DISTINCT col1, col2) directly).

Why is my COUNT DISTINCT query so slow?

Three main culprits: Missing indexes on the distinct columns, huge dataset sizes, or doing DISTINCT before filtering. Add WHERE clauses first, create appropriate indexes, and consider approximate counts.

How does NULL behave in COUNT DISTINCT?

It depends on where DISTINCT appears. SELECT DISTINCT col treats all NULLs as identical and returns a single NULL row if any are present. COUNT(DISTINCT col), like every COUNT(column), excludes NULLs entirely, so the two can disagree by one. Careful with this difference!
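A quick check on toy data shows how SELECT DISTINCT and COUNT(DISTINCT) each treat NULLs. The table t and its single column are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (nullable_col TEXT);
    INSERT INTO t VALUES (NULL), (NULL), ('x');
""")

# SELECT DISTINCT keeps one NULL row alongside the real value.
distinct_rows = conn.execute("SELECT DISTINCT nullable_col FROM t").fetchall()

# COUNT(DISTINCT ...) skips NULLs, so only 'x' is counted.
distinct_count = conn.execute(
    "SELECT COUNT(DISTINCT nullable_col) FROM t"
).fetchone()[0]

print(len(distinct_rows), distinct_count)  # 2 1
```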

Can I use DISTINCT and ORDER BY together?

Absolutely: SELECT DISTINCT department FROM employees ORDER BY department. But avoid ordering by columns that aren't in the SELECT list; PostgreSQL, SQL Server, and several other databases reject that with an error.

What's faster: DISTINCT or GROUP BY?

For simple deduplication, they're similar. But for aggregations, a plain GROUP BY usually outperforms COUNT(DISTINCT). Always test with your specific data and indexes.

Mistakes I've Made (So You Don't Have To)

After 10 years of SQL work, here's my hall of shame with COUNT and DISTINCT:

  • Overusing DISTINCT as a band-aid: Masked underlying JOIN issues that later caused data inconsistencies
  • Forgetting NULLs in COUNT: Led to undercounted metrics in financial reports
  • COUNT DISTINCT on UUID columns: Brought analytics database to its knees
  • Assuming DISTINCT applies to first column only: Wasted hours debugging "wrong" counts

The worst? Running a COUNT DISTINCT on production during peak hours. Got paged at 2 AM when the system slowed to a crawl. Don't be like me - test big queries on replicas first!

Choosing the Right Tool for the Job

Alternatives to COUNT DISTINCT and when they shine:

| Technique | Best Used When | Example |
| --- | --- | --- |
| EXISTS() | Checking for presence (ignore counts) | "Did customer buy product X?" |
| ROW_NUMBER() | Getting first/last occurrence | "Customer's initial purchase" |
| Approximate functions | Speed critical, precision optional | Real-time dashboard metrics |
| Bitmaps | Extremely high cardinality data | User activity across billions |
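The first two alternatives are easy to sketch. The orders table below is invented, and the ROW_NUMBER() part assumes an engine with window functions (SQLite 3.25 or newer, PostgreSQL, MySQL 8, SQL Server).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, product TEXT, ordered_on TEXT);
    INSERT INTO orders VALUES
        (1, 'X', '2024-01-05'), (1, 'Y', '2024-01-02'),
        (2, 'Y', '2024-01-03');
""")

# EXISTS answers yes/no and can stop at the first matching row.
bought_x = conn.execute(
    "SELECT EXISTS (SELECT 1 FROM orders WHERE customer_id = ? AND product = ?)",
    (1, "X"),
).fetchone()[0]

# ROW_NUMBER() ranks each customer's orders by date; rn = 1 is the first purchase.
first_purchases = conn.execute("""
    SELECT customer_id, product
    FROM (
        SELECT customer_id, product,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id ORDER BY ordered_on
               ) AS rn
        FROM orders
    ) AS ranked
    WHERE rn = 1
    ORDER BY customer_id
""").fetchall()

print(bought_x)         # 1
print(first_purchases)  # [(1, 'Y'), (2, 'Y')]
```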

At the end of the day, nothing beats COUNT(DISTINCT) for straightforward unique value counting. Just use it wisely.

Got war stories with COUNT DISTINCT? I once spent three days debugging why counts decreased after a "fix" - turned out someone changed a LEFT JOIN to INNER JOIN. The joys of SQL! What's your battle scar?
