Hey there! So you're working with SQL databases and need to figure out how many unique orders came through last quarter? Or maybe identify how many distinct customers bought specific products? That's exactly where count and distinct in SQL become your best friends. I remember struggling with this early in my career – writing queries that returned duplicate records and wondering why my reports looked wrong. That frustration taught me to master these tools properly.
What COUNT Actually Does (And Where People Mess Up)
Let's cut straight to it: COUNT() tallies rows in your results. But here's where folks get tripped up:
Syntax | What It Counts | NULL Handling | Real-Life Use Case |
---|---|---|---|
COUNT(*) | All rows in table/results | Includes NULLs | Total website visits |
COUNT(column) | Non-NULL values in that column | Excludes NULLs | Completed user registrations |
I once saw a junior dev spend hours debugging why user counts didn't match – turned out they used COUNT(email) when emails could be NULL. Rookie mistake. Use COUNT(*) when you need absolute row counts. Simple as that.
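Here's a quick way to see the difference for yourself. This is a minimal sketch using Python's built-in sqlite3 module; the users table and its four rows are made up for illustration.

```python
import sqlite3

# Hypothetical users table; two of the four rows have no email on file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (user_id, email) VALUES (?, ?)",
    [(1, "a@example.com"), (2, None), (3, "c@example.com"), (4, None)],
)

# COUNT(*) counts every row; COUNT(email) skips rows where email IS NULL.
total_rows, rows_with_email = conn.execute(
    "SELECT COUNT(*), COUNT(email) FROM users"
).fetchone()

print(total_rows, rows_with_email)
```

The two numbers differ by exactly the number of NULL emails, which is the mismatch that junior dev was chasing.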
Common COUNT Patterns You'll Use Daily
- Total records: SELECT COUNT(*) FROM orders
- Active users: SELECT COUNT(user_id) FROM users WHERE last_login > '2023-01-01'
- Orders by status: SELECT status, COUNT(*) FROM orders GROUP BY status
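To try the total and per-status patterns end to end, here's a small sketch with Python's sqlite3. The orders table layout and the status values are assumptions for the example, not a real schema.

```python
import sqlite3

# Hypothetical orders table with a handful of sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO orders (order_id, status) VALUES (?, ?)",
    [(1, "shipped"), (2, "pending"), (3, "shipped"), (4, "cancelled"), (5, "shipped")],
)

# Total records: one number for the whole table.
total_orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

# Orders by status: one row per status, turned into a dict for easy lookup.
orders_by_status = dict(
    conn.execute("SELECT status, COUNT(*) FROM orders GROUP BY status").fetchall()
)

print(total_orders, orders_by_status)
```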
DISTINCT: Your Duplicate Data Killer
DISTINCT eliminates duplicate rows from your results. But it's not magic – I've seen queries slow to a crawl because someone used DISTINCT on a huge table without indexes. Here's what you need to know:
Scenario | Without DISTINCT | With DISTINCT | Why It Matters |
---|---|---|---|
Product colors table | Red, Red, Blue, Green | Red, Blue, Green | Accurate inventory options |
Customer countries | USA, UK, USA, FR, UK | USA, UK, FR | Marketing region planning |
Where DISTINCT bites you: When applied to multiple columns. SELECT DISTINCT city, country gives unique combos – "Paris, France" and "Paris, USA" count as different entries. Makes sense when you think about it, but catches many off guard.
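A tiny experiment makes the multi-column behavior concrete. This sketch uses Python's sqlite3 with an invented customers table; the rows are sample data only.

```python
import sqlite3

# Hypothetical customers table: the same city name appears in two countries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (city TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customers (city, country) VALUES (?, ?)",
    [("Paris", "France"), ("Paris", "USA"), ("Paris", "France"), ("London", "UK")],
)

# DISTINCT on one column collapses every "Paris" into a single row.
distinct_cities = [
    row[0] for row in conn.execute("SELECT DISTINCT city FROM customers ORDER BY city")
]

# DISTINCT on two columns deduplicates the (city, country) combination.
distinct_pairs = conn.execute(
    "SELECT DISTINCT city, country FROM customers ORDER BY city, country"
).fetchall()

print(distinct_cities)
print(distinct_pairs)
```

Only the exact duplicate (Paris, France) row disappears in the second query; the two different Paris entries both survive.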
When You SHOULDN'T Use DISTINCT
- On primary keys (they're already unique!)
- As quick fix for JOIN duplicates (fix the JOIN condition instead)
- With large text/BLOB columns (kills performance)
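The JOIN point is worth seeing in action. In this sketch (Python's sqlite3, with made-up customers and orders tables), a one-to-many JOIN repeats a customer once per order. Rewriting the question as EXISTS avoids the duplicates without needing DISTINCT at all. Treat it as one possible fix, not the only one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ana"), (2, "Ben"), (3, "Cy")])
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 1), (2, 1), (3, 2)])

# One-to-many JOIN: Ana has two orders, so she shows up twice.
joined_names = [
    row[0]
    for row in conn.execute(
        "SELECT c.name FROM customers c "
        "JOIN orders o ON o.customer_id = c.customer_id "
        "ORDER BY c.name"
    )
]

# EXISTS asks "does at least one order exist?" and returns each customer once.
customers_with_orders = [
    row[0]
    for row in conn.execute(
        "SELECT c.name FROM customers c "
        "WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id) "
        "ORDER BY c.name"
    )
]

print(joined_names)
print(customers_with_orders)
```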
The Power Combo: COUNT(DISTINCT)
This is where count and distinct in SQL become magical. Need to know how many unique visitors your site had today? SELECT COUNT(DISTINCT user_id) FROM site_activity WHERE date = CURRENT_DATE. Done. But watch these gotchas:
Database Compatibility Note
Most databases support COUNT(DISTINCT column), but multi-column support varies: MySQL accepts COUNT(DISTINCT col1, col2), while SQL Server and SQLite don't. For counting distinct pairs portably:
SELECT COUNT(*) FROM (SELECT DISTINCT city, country FROM customers) AS temp
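You can check how your own engine behaves with a quick probe. The sketch below runs against SQLite through Python's sqlite3 (the customers rows are invented). It counts pairs with the subquery form, then tries the multi-column shorthand and records whether the engine accepted it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customers (name, city, country) VALUES (?, ?, ?)",
    [
        ("Ana", "Paris", "France"),
        ("Ben", "Paris", "France"),
        ("Cy", "Paris", "USA"),
        ("Di", "Lyon", "France"),
    ],
)

# Portable form: deduplicate the pairs in a subquery, then count the rows.
pair_count = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT city, country FROM customers) AS temp"
).fetchone()[0]

# Probe whether this engine accepts the multi-column shorthand.
try:
    conn.execute("SELECT COUNT(DISTINCT city, country) FROM customers")
    shorthand_supported = True
except sqlite3.OperationalError:
    shorthand_supported = False

print(pair_count, shorthand_supported)
```

The subquery form works regardless of what the probe reports, which is why it's the safer default in code that has to run on more than one database.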
Real talk: I once tried COUNT(DISTINCT) on a 500-million-row table without proper indexes. The query ran for 40 minutes before I killed it. Lesson learned – always check execution plans!
Essential COUNT(DISTINCT) Patterns
Business Question | SQL Solution | Performance Tip |
---|---|---|
How many unique products sold per category? | SELECT category, COUNT(DISTINCT product_id) FROM sales GROUP BY category | Add index on (category, product_id) |
Daily unique visitors | SELECT visit_date, COUNT(DISTINCT user_id) FROM visits GROUP BY visit_date | Partition table by date |
Customers buying multiple items | SELECT COUNT(DISTINCT customer_id) FROM orders WHERE item_count > 1 | Filter before counting distinct |
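Here's the first pattern from the table as something you can run. It's a sketch using Python's sqlite3, and the sales rows and the index name are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (category TEXT, product_id INTEGER)")
conn.executemany(
    "INSERT INTO sales (category, product_id) VALUES (?, ?)",
    [("toys", 1), ("toys", 1), ("toys", 2), ("books", 3), ("books", 3)],
)

# The composite index suggested in the table; harmless on five rows,
# but it is the kind of index that helps this query on a large table.
conn.execute("CREATE INDEX idx_sales_category_product ON sales (category, product_id)")

# Repeat sales of the same product count once per category.
unique_products = dict(
    conn.execute(
        "SELECT category, COUNT(DISTINCT product_id) FROM sales GROUP BY category"
    ).fetchall()
)

print(unique_products)
```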
Performance Tuning: Making COUNT DISTINCT Fly
Let's be honest - count distinct in SQL can be slow. Here's what I've learned optimizing these queries:
- Index smartly: Add indexes on columns used in DISTINCT, WHERE, and GROUP BY
- Approximate counts: Use APPROX_COUNT_DISTINCT() in BigQuery or Spark SQL when an estimate within a few percent is good enough; it's usually much faster than an exact count
- Pre-aggregate: Create summary tables nightly for frequent queries
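The pre-aggregation idea can be sketched like this with Python's sqlite3. The daily_unique_visitors table name and the sample visits are assumptions; in real life the CREATE TABLE step would be your scheduled job.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (visit_date TEXT, user_id INTEGER)")
conn.executemany(
    "INSERT INTO visits (visit_date, user_id) VALUES (?, ?)",
    [
        ("2024-01-01", 1),
        ("2024-01-01", 1),
        ("2024-01-01", 2),
        ("2024-01-02", 1),
        ("2024-01-02", 3),
    ],
)

# The "nightly job": compute the expensive distinct count once and store it.
conn.execute(
    """
    CREATE TABLE daily_unique_visitors AS
    SELECT visit_date, COUNT(DISTINCT user_id) AS unique_visitors
    FROM visits
    GROUP BY visit_date
    """
)

# Reports read the small summary table instead of rescanning visits.
daily = conn.execute(
    "SELECT visit_date, unique_visitors FROM daily_unique_visitors ORDER BY visit_date"
).fetchall()

# Caveat: daily uniques are not additive, because user 1 visited on both days.
summed_daily = sum(count for _, count in daily)
true_overall = conn.execute("SELECT COUNT(DISTINCT user_id) FROM visits").fetchone()[0]

print(daily, summed_daily, true_overall)
```

One thing to keep in mind: you can't add up stored daily uniques to get weekly or monthly uniques, so each reporting grain needs its own summary.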
Warning: NULLs in COUNT DISTINCT
COUNT(DISTINCT email) ignores NULL values completely. If you need NULL to count as one more distinct value, do this:
SELECT COUNT(DISTINCT COALESCE(email, 'NULL_PLACEHOLDER'))
(But honestly? Reconsider your data model if NULLs need special counting)
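Here's how those numbers come out on sample data, as a sketch using Python's sqlite3 with a made-up contacts table. The third expression is an alternative I'm adding for illustration: it adds 1 when any NULL exists, so it doesn't depend on the placeholder string never appearing as a real email.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (email TEXT)")
conn.executemany(
    "INSERT INTO contacts (email) VALUES (?)",
    [("a@example.com",), ("a@example.com",), (None,), (None,), ("b@example.com",)],
)

row = conn.execute(
    """
    SELECT
        COUNT(DISTINCT email),
        COUNT(DISTINCT COALESCE(email, 'NULL_PLACEHOLDER')),
        COUNT(DISTINCT email) + MAX(CASE WHEN email IS NULL THEN 1 ELSE 0 END)
    FROM contacts
    """
).fetchone()

# Plain count skips NULLs; the other two treat all NULLs as one extra value.
ignoring_nulls, with_placeholder, without_placeholder = row

print(ignoring_nulls, with_placeholder, without_placeholder)
```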
GROUP BY vs DISTINCT: Which to Choose?
Both deduplicate data but serve different purposes:
Operation | Best For | Performance | My Preference |
---|---|---|---|
DISTINCT | Simple duplicate removal | Faster for small datasets | When I need just unique values |
GROUP BY | Aggregations (COUNT, SUM, AVG) | Better for large grouped data | When counting distinct per group |
Pro tip: For complex aggregations, GROUP BY often outperforms DISTINCT + subqueries, but it depends on your database and data. Test both with EXPLAIN (EXPLAIN PLAN FOR in Oracle).
When GROUP BY Replaces DISTINCT
Instead of:
SELECT DISTINCT department FROM employees
You can write:
SELECT department FROM employees GROUP BY department
They return identical results, and many optimizers give them the same or a very similar execution plan, so check the plan on your own database before assuming one is faster.
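If you want to compare the two forms yourself, something like this works as a starting point. It's a sketch against SQLite via Python's sqlite3, with an invented employees table. EXPLAIN QUERY PLAN is SQLite's plan command, and other databases spell it differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT)")
conn.executemany(
    "INSERT INTO employees (name, department) VALUES (?, ?)",
    [("Ana", "Sales"), ("Ben", "Sales"), ("Cy", "Ops"), ("Di", "Legal"), ("Ed", "Ops")],
)

distinct_sql = "SELECT DISTINCT department FROM employees"
group_by_sql = "SELECT department FROM employees GROUP BY department"

# Sort in Python so the comparison does not depend on row order.
via_distinct = sorted(row[0] for row in conn.execute(distinct_sql))
via_group_by = sorted(row[0] for row in conn.execute(group_by_sql))

# Column 3 of EXPLAIN QUERY PLAN output is the human-readable plan step.
for sql in (distinct_sql, group_by_sql):
    plan = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]
    print(sql, "->", plan)
```

The printed plans will vary by SQLite version, so read them rather than relying on any one rule of thumb.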
Real-World Problems Solved by Count and Distinct in SQL
Let's get practical. Here are actual scenarios where these commands save the day:
E-Commerce Analysis
- Unique daily shoppers: COUNT(DISTINCT customer_id)
- Products in multiple categories: COUNT(DISTINCT category_id) per product
- Abandoned carts: COUNT(DISTINCT session_id) WHERE checkout_complete = 0
User Analytics
- Monthly active users (MAU): COUNT(DISTINCT user_id) WHERE last_active BETWEEN ...
- Feature adoption rate: COUNT(DISTINCT user_id) who used feature X
- Cross-platform usage: COUNT(DISTINCT device_id) per user
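As full queries, the MAU and cross-platform fragments might look like the sketch below. It uses Python's sqlite3, and the activity table, its columns, and the March date range are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE activity (user_id INTEGER, device_id TEXT, activity_date TEXT)")
conn.executemany(
    "INSERT INTO activity (user_id, device_id, activity_date) VALUES (?, ?, ?)",
    [
        (1, "phone-a", "2024-03-02"),
        (1, "laptop-a", "2024-03-15"),
        (2, "phone-b", "2024-03-20"),
        (2, "phone-b", "2024-03-21"),
        (3, "phone-c", "2024-02-27"),
    ],
)

# Monthly active users: each user counts once, however many events they have.
mau = conn.execute(
    "SELECT COUNT(DISTINCT user_id) FROM activity WHERE activity_date BETWEEN ? AND ?",
    ("2024-03-01", "2024-03-31"),
).fetchone()[0]

# Cross-platform usage: number of different devices seen per user.
devices_per_user = dict(
    conn.execute(
        "SELECT user_id, COUNT(DISTINCT device_id) FROM activity GROUP BY user_id"
    ).fetchall()
)

print(mau, devices_per_user)
```

Dates are stored as ISO-8601 text here, which is why a plain BETWEEN comparison sorts them correctly.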
Honestly? I use some form of count distinct SQL in almost every analytics report I build. It's that fundamental.
Advanced Tactics: Window Functions and CTEs
When basic COUNT DISTINCT isn't enough:
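Counting Distinct Over Time
Rolling 7-day unique users. PostgreSQL, MySQL, and SQL Server all reject DISTINCT inside a window function, and a ROWS frame counts rows rather than days, so use a correlated subquery over the date range instead (the date arithmetic below is PostgreSQL-style; adjust for your database):
SELECT
d.date,
(SELECT COUNT(DISTINCT v.user_id)
FROM visits v
WHERE v.date BETWEEN d.date - 6 AND d.date) AS rolling_7d_users
FROM (SELECT DISTINCT date FROM visits) d
ORDER BY d.date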
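For a version you can run locally, here's a sketch of a rolling 7-day distinct count in SQLite through Python's sqlite3. It uses a correlated subquery over a date range, because SQLite (like several other engines) doesn't accept DISTINCT inside a window function. The visit_date column name and the sample rows are assumptions, and date(x, '-6 days') is SQLite-specific.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (visit_date TEXT, user_id INTEGER)")
conn.executemany(
    "INSERT INTO visits (visit_date, user_id) VALUES (?, ?)",
    [
        ("2024-01-01", 1),
        ("2024-01-01", 2),
        ("2024-01-03", 1),
        ("2024-01-08", 3),
        ("2024-01-09", 3),
        ("2024-01-09", 4),
    ],
)

# For each date with traffic, count distinct users seen on that day
# or in the six calendar days before it.
rolling = conn.execute(
    """
    SELECT d.visit_date,
           (SELECT COUNT(DISTINCT v.user_id)
              FROM visits v
             WHERE v.visit_date BETWEEN date(d.visit_date, '-6 days')
                                    AND d.visit_date) AS rolling_7d_users
    FROM (SELECT DISTINCT visit_date FROM visits) AS d
    ORDER BY d.visit_date
    """
).fetchall()

for visit_date, users in rolling:
    print(visit_date, users)
```

The window is defined by calendar days, so gaps in the data (no visits from the 4th to the 7th here) don't stretch it. The trade-off is that the subquery runs once per date, so on big tables you'll want an index on the date column.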
Complex Counting with CTEs
Users purchasing from multiple categories:
WITH user_cats AS (
SELECT user_id, COUNT(DISTINCT category) AS cat_count
FROM purchases
GROUP BY user_id
)
SELECT
COUNT(*) FILTER (WHERE cat_count >= 3) AS power_users,
COUNT(*) FILTER (WHERE cat_count = 1) AS single_cat_users
FROM user_cats
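The FILTER clause isn't available in every database (MySQL, for one, lacks it). A portable way to get the same two numbers is SUM over a CASE expression. Here's a sketch of that variant using Python's sqlite3; the purchases rows are sample data I made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (user_id INTEGER, category TEXT)")
conn.executemany(
    "INSERT INTO purchases (user_id, category) VALUES (?, ?)",
    [
        (1, "a"), (1, "b"), (1, "c"), (1, "a"),  # user 1: three categories
        (2, "a"), (2, "a"),                      # user 2: one category, bought twice
        (3, "a"), (3, "b"),                      # user 3: two categories
        (4, "c"),                                # user 4: one category
    ],
)

# Same CTE as above; SUM(CASE ...) stands in for COUNT(*) FILTER (WHERE ...).
power_users, single_cat_users = conn.execute(
    """
    WITH user_cats AS (
        SELECT user_id, COUNT(DISTINCT category) AS cat_count
        FROM purchases
        GROUP BY user_id
    )
    SELECT
        SUM(CASE WHEN cat_count >= 3 THEN 1 ELSE 0 END),
        SUM(CASE WHEN cat_count = 1 THEN 1 ELSE 0 END)
    FROM user_cats
    """
).fetchone()

print(power_users, single_cat_users)
```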
Your COUNT DISTINCT FAQ Answered
Does COUNT(DISTINCT) work with multiple columns?
In standard SQL, no. Use a subquery: SELECT COUNT(*) FROM (SELECT DISTINCT col1, col2 FROM your_table) AS t, or check your DB's docs (some, like MySQL, support COUNT(DISTINCT col1, col2)).
Why is my COUNT DISTINCT query so slow?
Three main culprits: Missing indexes on the distinct columns, huge dataset sizes, or doing DISTINCT before filtering. Add WHERE clauses first, create appropriate indexes, and consider approximate counts.
How does NULL behave in COUNT DISTINCT?
SELECT DISTINCT treats all NULLs as identical and returns a single NULL row. COUNT(DISTINCT col), however, excludes NULLs entirely, so the count can be one lower than the number of rows SELECT DISTINCT returns. Careful with this difference!
Can I use DISTINCT and ORDER BY together?
Absolutely: SELECT DISTINCT department FROM employees ORDER BY department. But avoid ordering by columns that aren't in the SELECT list; many databases reject that when DISTINCT is used.
What's faster: DISTINCT or GROUP BY?
For simple deduplication, they're similar. But for aggregations, GROUP BY usually outperforms COUNT DISTINCT in SQL. Always test with your specific data and indexes.
Mistakes I've Made (So You Don't Have To)
After 10 years of SQL work, here's my hall of shame with count and distinct in SQL:
- Overusing DISTINCT as a band-aid: Masked underlying JOIN issues that later caused data inconsistencies
- Forgetting NULLs in COUNT: Led to undercounted metrics in financial reports
- COUNT DISTINCT on UUID columns: Brought analytics database to its knees
- Assuming DISTINCT applies to first column only: Wasted hours debugging "wrong" counts
The worst? Running a COUNT DISTINCT on production during peak hours. Got paged at 2 AM when the system slowed to a crawl. Don't be like me - test big queries on replicas first!
Choosing the Right Tool for the Job
Alternatives to COUNT DISTINCT and when they shine:
Technique | Best Used When | Example |
---|---|---|
EXISTS() | Checking for presence (ignore counts) | "Did customer buy product X?" |
ROW_NUMBER() | Getting first/last occurrence | "Customer's initial purchase" |
Approximate functions | Speed critical, precision optional | Real-time dashboard metrics |
Bitmaps | Extremely high cardinality data | User activity across billions |
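As an example of the ROW_NUMBER() row, here's a sketch that pulls each customer's first purchase. It uses Python's sqlite3 and needs a SQLite build with window functions (3.25 or newer). The purchases table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer_id INTEGER, purchase_date TEXT, product TEXT)")
conn.executemany(
    "INSERT INTO purchases (customer_id, purchase_date, product) VALUES (?, ?, ?)",
    [
        (1, "2024-02-10", "lamp"),
        (1, "2024-01-05", "desk"),
        (2, "2024-03-01", "chair"),
    ],
)

# Number each customer's purchases by date, then keep only the earliest one.
first_purchases = conn.execute(
    """
    SELECT customer_id, purchase_date, product
    FROM (
        SELECT customer_id, purchase_date, product,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id ORDER BY purchase_date
               ) AS rn
        FROM purchases
    ) AS ranked
    WHERE rn = 1
    ORDER BY customer_id
    """
).fetchall()

print(first_purchases)
```

Unlike DISTINCT, this keeps the whole row for the first occurrence, which is usually what "initial purchase" questions need.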
At the end of the day, nothing beats count and distinct SQL for straightforward unique value counting. Just use it wisely.
Got war stories with COUNT DISTINCT? I once spent three days debugging why counts decreased after a "fix" - turned out someone changed a LEFT JOIN to INNER JOIN. The joys of SQL! What's your battle scar?