When reading the article
Crunching 30 Years of NBA Data with MongoDB Aggregation I coulnd't help but
think that we've been enjoying
aggregates in SQL for 3 or 4 decades already.
When using
PostgreSQL it's even easy to actually add your own aggregates
given the SQL command
create aggregate.
Photo Credit: Copyright All rights reserved by Segward GraupnerThe next step after thinking how obvious the queries written in the
mentionned article would be to express in SQL was to actually load the data
into PostgreSQL and write the aggregate queries, of course.
Loading the data
With the help of a little bit of
Common Lisp code and using the
mongo-cl-driver it was easy enough to parse the given
BSON file. What was
more complex was to actually understand enough of the data model to produce
a relational design out of it, avoiding data redundancy as much as possible.
We call that step
normalization in old-style relational databases, and the
goal of that process is to avoid
functional dependency so that the data is
easier to understand, verify and process the data once loaded.
For instance, rather than have both scores from each team and a column
won
per team, which would be a boolean but is a number in the given
BSON file,
we store only the scores. Here's the main table definition of the stats we
are going to be playing with, the
game table:
create table nba.game (
id serial primary key,
date timestamptz,
host int references nba.team(id),
guest int references nba.team(id),
host_score int,
guest_score int
);
As much of the aggregates in the referenced article are playing with
statistics from teams who actually
won the game, let's create a view to
simplify our SQL queries thereafter:
create view winners as
select id,
date,
case when host_score > guest_score
then host
else guest
end as winner
from game;
If you're not doing much SQL, remember that creating such a view is common
practice in the relational world.

Running the Aggregates
Now that we have the extra useful view, it's possible to implement the first
MongoDB query in SQL. First, let's have a look at the MongoDB query:
db.games.aggregate([
{
$match : {
date : {
$gt : ISODate("1999-08-01T00:00:00Z"),
$lt : ISODate("2000-08-01T00:00:00Z")
}
}
},
{
$unwind : '$teams'
},
{
$match : {
'teams.won' : 1
}
},
{
$group : {
_id : '$teams.name',
wins : { $sum : 1 }
}
},
{
$sort : { wins : -1 }
},
{
$limit : 5
}
]);
I don't know about you, but I have quite a hard time deciphering what that
query is actually doing, and when the explanation text talks about
using a
6-stage pipeline my understanding is that the application developper has
been writing the
execution plan of the query here. Let's ignore the query
format itself, as it's obviously meant to be generated by a tool rather than
typed by a human being.
Here's the same query in SQL, with the result this time:
SELECT abbrev, name, count(*)
FROM winners JOIN team ON team.id = winners.winner
WHERE date > '1999-08-01T00:00:00Z'
AND date < '2000-08-01T00:00:00Z'
GROUP BY winner, abbrev, name
ORDER BY count(*) DESC
LIMIT 5;
abbrev | name | count
--------+------------------------+-------
LAL | Los Angeles Lakers | 67
POR | Portland Trail Blazers | 59
IND | Indiana Pacers | 56
UTA | Utah Jazz | 55
SAS | San Antonio Spurs | 53
(5 rows)
Time: 8.101 ms
What we have here is a pretty basic query using a
join, a
where clause to
restrict the data set we are playing with, a
group by clause to define which
data to computa the aggregates against, with an
order by and a
limit clause
for presenting the result. To be realistic, if you've ever done any SQL at
all, then you know how to read that query because you've been writing dozens
of similar ones.
Here's, as in the original article, the same query against a much larger
data set this time, with all games of the 2000s decade:
SELECT abbrev, name, count(*)
FROM winners join team on team.id = winners.winner
WHERE date > '2000-08-01T00:00:00Z'
AND date < '2010-08-01T00:00:00Z'
GROUP BY winner, abbrev, name
ORDER BY count(*) DESC
LIMIT 5;
abbrev | name | count
--------+--------------------+-------
SAS | San Antonio Spurs | 579
DAL | Dallas Mavericks | 568
LAL | Los Angeles Lakers | 524
PHO | Phoenix Suns | 495
DET | Detroit Pistons | 489
(5 rows)
Time: 24.713 ms

Correlating stats with wins
The goal here is to compute how often a team wins when they record more
defensive rebounds than their opponent across the entire data set.
To be able to compute the percentage, we have to have a count of all
registered games, of course. Then we are going to count how many times the
winner team registered a greater
team_stats.drb than the loser, and
count
how many times in SQL is usually written as a
sum(case when <condition> then
1 else 0 end), which is what we're doing here:
select count(*) as games,
sum(case when ws.drb > ls.drb then 1 else 0 end) as drb,
sum(case when ws.drb > ls.drb then 1 else 0 end)::float / count(*) * 100 as pct
from winlose wl
join team w on wl.winner = w.id
join team l on wl.loser = l.id
join team_stats ws on ws.game = wl.id and ws.team = wl.winner
join team_stats ls on ls.game = wl.id and ls.team = wl.loser;
games | drb | pct
-------+-------+------------------
31686 | 22292 | 70.3528372151739
(1 row)
Time: 276.669 ms
We note here than in the original MongoDB article the aggregation query is
short of computing the
percentage directly, apparently it's been done in the
client tool, maybe using a
spreadsheet application or something.

Defensive Rebounds and Total Rebounds Versus Win Percentage
Next, still following on our inspirational article
Crunching 30 Years of NBA Data with MongoDB Aggregation, we’re going to
compute what percentage of the time a team wins as a function of the number
of defensive rebounds they recorded.
I'm not sure I understand what they achieve with averaging ones when a team
wins and zero when a team loses, so I couldn't quite reproduce their result.
Here's an approaching query tho:
with game_stats as (
select t.id, count(*)
from team t join game on game.host = t.id or game.guest = t.id
group by t.id
)
select ts.team, round(avg(drb), 2) as drb,
round(count(*) / gs.count::numeric * 100, 2) as winpct,
count(*) as wins, gs.count as games
from team_stats ts
join game on game.id = ts.game
and game.host = ts.team
and game.host_score > game.guest_score
join game_stats gs on gs.id = ts.team
group by ts.team, gs.count;
team | drb | winpct | wins | games
------+-------+--------+------+-------
4 | 31.59 | 31.46 | 710 | 2257
7 | 32.55 | 31.89 | 720 | 2258
16 | 31.70 | 37.62 | 849 | 2257
I only pasted the first few lines of the result because I'm not sure how to
make sense of it, really.
Interesting factoid
What I find most interesting in the following
factoid proposed in the
MongoDB article is the complete lack of the query you need to run in order
to grab the matching data:
An interesting factoid: the team that recorded the fewest defensive rebounds
in a win was the 1995-96 Toronto Raptors, who beat the Milwaukee Bucks 93-87
on 12/26/1995 despite recording only 14 defensive rebounds.
When doing the necessary query in SQL, using a
Common Table Expression (the
WITH syntax) and a
Window Function for good measure, we get actually 4
different games with the minimum defensive rebounds in our history of NBA
games, 14:
with stats(game, team, drb, min) as (
select ts.game, ts.team, drb, min(drb) over ()
from team_stats ts
join winners w on w.id = ts.game and w.winner = ts.team
)
select game.date::date,
host.name || ' -- ' || host_score as host,
guest.name || ' -- ' || guest_score as guest,
stats.drb as winner_drb
from stats
join game on game.id = stats.game
join team host on host.id = game.host
join team guest on guest.id = game.guest
where drb = min;
-[ RECORD 1 ]----------------------------
date | 1995-12-26
host | Toronto Raptors -- 93
guest | Milwaukee Bucks -- 87
winner_drb | 14
-[ RECORD 2 ]----------------------------
date | 1996-02-02
host | Golden State Warriors -- 114
guest | Toronto Raptors -- 111
winner_drb | 14
-[ RECORD 3 ]----------------------------
date | 1998-03-31
host | Vancouver Grizzlies -- 101
guest | Dallas Mavericks -- 104
winner_drb | 14
-[ RECORD 4 ]----------------------------
date | 2009-01-14
host | New York Knicks -- 128
guest | Washington Wizards -- 122
winner_drb | 14
Time: 126.276 ms
To understand all there's to know about
window functions, have a look at my
article on the topic:
Understanding Window Functions.

Total rebounds and wins
The next interesting aside is the following:
As an aside, the Cleveland Cavaliers beat the New York Knicks 101-97 on
April 11, 1996, despite recording only 21 total rebounds. Inversely, the San
Antonio Spurs lost to the Houston Rockets, 112-110, on January 4, 1992
despite recording 75 total rebounds.
Which we translate in SQL as the following query:
with stats as (
select ts.game, ts.team, trb,
min(trb) over () as min,
max(trb) over () as max
from team_stats ts
join winners w on w.id = ts.game and w.winner = ts.team
)
select game.date::date,
host.name || ' -- ' || host_score as host,
guest.name || ' -- ' || guest_score as guest,
stats.trb as winner_trb
from stats
join game on game.id = stats.game
join team host on host.id = game.host
join team guest on guest.id = game.guest
where trb = min or trb = max;
-[ RECORD 1 ]--------------------------
date | 1995-12-28
host | Dallas Mavericks -- 103
guest | Vancouver Grizzlies -- 101
winner_trb | 76
-[ RECORD 2 ]--------------------------
date | 1996-04-11
host | New York Knicks -- 97
guest | Cleveland Cavaliers -- 101
winner_trb | 21
-[ RECORD 3 ]--------------------------
date | 2007-01-29
host | Utah Jazz -- 115
guest | New Jersey Nets -- 116
winner_trb | 21
Time: 127.771 ms
Again it's easy enough in SQL to have more details about the
aside presented
in our source article, and we get a slightly different story.
Conclusion
It's quite hard for me to appreciate the work done in the
MongoDB
aggregation framework really, when we've been enjoying advanced aggregation
and statistics in
PostgreSQL for a very long time. With the addition of
Window Functions and
Aggregate Functions for Statistics it's possible to
implement advanced analysis right into your SQL queries.
In next PostgreSQL release the set of analytical functions is going to
expand again withe addition of both
Ordered-Set Aggregate Functions (also
known as
inverse distribution functions) and
Hypothetical-Set Aggregate Functions (also known as
WITHIN GROUP).
PostgreSQL is YeSQL!When the problem you have to solve involves analyzing data, one of the more
advanced tooling you can find around certainly is the SQL language, in
particular its implementation in PostgreSQL!