9 Joining Data in SQL

https://learn.datacamp.com/courses/joining-data-in-postgresql

Main functions and concepts covered in this BP chapter:

  1. inner join
  2. left join
  3. right join
  4. full join
  5. cross join
  6. semi and anti join
  7. case
  8. union
  9. intersect
  10. except
  11. subqueries

Packages used in this chapter:

## Load all packages used in this chapter
library(RSQLite)

For the SQL chapters, you can use the “chinook.db” SQL database (which comes from here: https://www.sqlitetutorial.net/sqlite-sample-database/) and load it using the dbConnect function (made available via the RSQLite package):

con <- dbConnect(SQLite(), "data/chinook.db")

You can then create working examples using SQL code chunks, like this:

select FirstName, LastName FROM customers
Table 8.1: Displaying records 1 - 10
FirstName LastName
Luís Gonçalves
Leonie Köhler
François Tremblay
Bjørn Hansen
František Wichterlová
Helena Holý
Astrid Gruber
Daan Peeters
Kara Nielsen
Eduardo Martins

9.1 Introduction to joins

9.1.1 Inner Join

In this example, the INNER JOIN clause matches each row from the tracks table with every row from the albums table based on the join condition (tracks.AlbumID = albums.AlbumID) specified after the ON keyword.

If the join condition evaluates to true (or 1), the columns of rows from both albums and tracks tables are included in the result set.

SELECT 
    Title,
    Name
FROM 
    tracks
INNER JOIN albums 
    ON tracks.AlbumID = albums.AlbumID
Table 8.2: Displaying records 1 - 10
Title Name
For Those About To Rock We Salute You For Those About To Rock (We Salute You)
For Those About To Rock We Salute You Put The Finger On You
For Those About To Rock We Salute You Let’s Get It Up
For Those About To Rock We Salute You Inject The Venom
For Those About To Rock We Salute You Snowballed
For Those About To Rock We Salute You Evil Walks
For Those About To Rock We Salute You C.O.D.
For Those About To Rock We Salute You Breaking The Rules
For Those About To Rock We Salute You Night Of The Long Knives
For Those About To Rock We Salute You Spellbound

Alternative method using USING where we sort the table in alphabetical order by title:

SELECT 
    Title,
    Name
FROM 
    tracks
INNER JOIN albums USING(AlbumId)
ORDER BY Title
Table 8.3: Displaying records 1 - 10
Title Name
…And Justice For All Blackened
…And Justice For All …And Justice For All
…And Justice For All Eye Of The Beholder
…And Justice For All One
…And Justice For All The Shortest Straw
…And Justice For All Harvester Of Sorrow
…And Justice For All The Frayed Ends Of Sanity
…And Justice For All To Live Is To Die
…And Justice For All Dyers Eve
20th Century Masters - The Millennium Collection: The Best of Scorpions Rock You Like a Hurricane

You can join on two things using ON and AND. For example, we might want to join a country data set by country name, and then match up the years. If there is one year column for 2010 and one for 2015, we can do AND year1 = year1 - 5;. This way, if the year is 2015, it will be matched with 2010.

9.2 Outer joins and cross joins

9.2.1 Left Join

The LEFT JOIN clause selects data starting from the left table (artists) and matching rows in the right table (albums) based on the join condition (artists.ArtistId = albums.ArtistId) .

The left join returns all rows from the artists table (or left table) and the matching rows from the albums table (or right table).

If a row from the left table doesn’t have a matching row in the right table, SQLite includes columns of the rows in the left table and NULL for the columns of the right table.

Note that RIGHT JOIN does the exact same as left, but it returns all rows from the right table and the matching rows from the left table

If a row from the right table doesn’t have a matching row in the left table, SQLite includes columns of the rows in the right table and NULL for the columns of the left table.

SELECT
    Name, 
    Title
FROM
    artists
LEFT JOIN albums ON
    artists.ArtistId = albums.ArtistId
ORDER BY Name
Table 8.4: Displaying records 1 - 10
Name Title
A Cor Do Som NA
AC/DC For Those About To Rock We Salute You
AC/DC Let There Be Rock
Aaron Copland & London Symphony Orchestra A Copland Celebration, Vol. I
Aaron Goldberg Worlds
Academy of St. Martin in the Fields & Sir Neville Marriner The World of Classical Favourites
Academy of St. Martin in the Fields Chamber Ensemble & Sir Neville Marriner Sir Neville Marriner: A Celebration
Academy of St. Martin in the Fields, John Birch, Sir Neville Marriner & Sylvia McNair Fauré: Requiem, Ravel: Pavane & Others
Academy of St. Martin in the Fields, Sir Neville Marriner & Thurston Dart Bach: Orchestral Suites Nos. 1 - 4
Academy of St. Martin in the Fields, Sir Neville Marriner & William Bennett NA

If you want to find artists who don’t have any albums, you can add a WHERE clause as shown in the following query:

SELECT
    Name,
    Title
FROM
    artists
LEFT JOIN albums ON
    artists.ArtistId = albums.ArtistId
WHERE Title IS NULL   
ORDER BY Name
Table 8.5: Displaying records 1 - 10
Name Title
A Cor Do Som NA
Academy of St. Martin in the Fields, Sir Neville Marriner & William Bennett NA
Aerosmith & Sierra Leone’s Refugee Allstars NA
Avril Lavigne NA
Azymuth NA
Baby Consuelo NA
Banda Black Rio NA
Barão Vermelho NA
Bebel Gilberto NA
Ben Harper NA

9.2.2 Full Join

In theory, the result of the FULL OUTER JOIN is a combination of a LEFT JOIN and a RIGHT JOIN

SELECT
    Name,
    Title
FROM
    artists
FULL JOIN albums ON
    artists.ArtistId = albums.ArtistId
ORDER BY Name
Table 8.6: Displaying records 1 - 10
Name Title
A Cor Do Som NA
AC/DC For Those About To Rock We Salute You
AC/DC Let There Be Rock
Aaron Copland & London Symphony Orchestra A Copland Celebration, Vol. I
Aaron Goldberg Worlds
Academy of St. Martin in the Fields & Sir Neville Marriner The World of Classical Favourites
Academy of St. Martin in the Fields Chamber Ensemble & Sir Neville Marriner Sir Neville Marriner: A Celebration
Academy of St. Martin in the Fields, John Birch, Sir Neville Marriner & Sylvia McNair Fauré: Requiem, Ravel: Pavane & Others
Academy of St. Martin in the Fields, Sir Neville Marriner & Thurston Dart Bach: Orchestral Suites Nos. 1 - 4
Academy of St. Martin in the Fields, Sir Neville Marriner & William Bennett NA

9.2.3 Case

The CASE expression evaluates a list of conditions and returns an expression based on the result of the evaluation. It is similar to the IF-THEN-ELSE statement in other programming languages.

In the following example, we look at if the customer’s country is the USA. If it is, we return the value “domestic.” If it is something else besides the USA, we return “foreign”. We call this new variable CustomerGroup.

SELECT customerid,
       firstname,
       lastname,
       CASE country 
           WHEN 'USA' 
               THEN 'Domestic' 
           ELSE 'Foreign' 
       END CustomerGroup
FROM 
    customers
ORDER BY 
    LastName,
    FirstName;
Table 9.1: Displaying records 1 - 10
CustomerId FirstName LastName CustomerGroup
12 Roberto Almeida Foreign
28 Julia Barnett Domestic
39 Camille Bernard Foreign
18 Michelle Brooks Domestic
29 Robert Brown Foreign
21 Kathy Chase Domestic
26 Richard Cunningham Domestic
41 Marc Dubois Foreign
34 João Fernandes Foreign
30 Edward Francis Foreign

9.2.4 Cross Join

If you use a LEFT JOIN, INNER JOIN, or CROSS JOIN without the ON or USING clause, SQLite produces the Cartesian product of the involved tables. The number of rows in the Cartesian product is the product of the number of rows in each involved tables.

In the following example, we create a dataset that cross joins genres with media types, which produces a data set with each combination of media type and genre (i.e. MPEG-Rock, MPEG-Jazz, etc.)

create view mediatypes as
select Name as mediatype
from media_types
## Error: view mediatypes already exists
create view genres1 as
select Name as genre
from genres
## Error: view genres1 already exists
SELECT
    mediatype, genre
FROM
    mediatypes
CROSS JOIN genres1 
Table 8.9: Displaying records 1 - 10
mediatype genre
MPEG audio file Rock
MPEG audio file Jazz
MPEG audio file Metal
MPEG audio file Alternative & Punk
MPEG audio file Rock And Roll
MPEG audio file Blues
MPEG audio file Latin
MPEG audio file Reggae
MPEG audio file Pop
MPEG audio file Soundtrack

9.3 Set theory clauses

9.3.1 Union

Sometimes, you need to combine data from multiple tables into a complete result set. It may be for tables with similar data within the same database or maybe you need to combine similar data from multiple databases.

To combine rows from two or more queries into a single result set, you use SQLite UNION operator. You can also use UNION ALL. UNION and UNION ALL operators combine rows from result sets into a single result set. The UNION operator removes eliminate duplicate rows, whereas the UNION ALL operator does not.

The following example combines the names of employees and customers into a single list:

SELECT FirstName, LastName, 'Employee' AS Type
FROM employees
UNION
SELECT FirstName, LastName, 'Customer'
FROM customers;
Table 8.10: Displaying records 1 - 10
FirstName LastName Type
Aaron Mitchell Customer
Alexandre Rocha Customer
Andrew Adams Employee
Astrid Gruber Customer
Bjørn Hansen Customer
Camille Bernard Customer
Daan Peeters Customer
Dan Miller Customer
Diego Gutiérrez Customer
Dominique Lefebvre Customer

9.3.2 Intersect

SQLite INTERSECT operator compares the result sets of two queries and returns distinct rows that are output by both queries.

The following example intersects the customers and invoices tables on CustomerID to find which customers appear in the invoices table.

select * 
from customers
Table 8.11: Displaying records 1 - 10
CustomerId FirstName LastName Company Address City State Country PostalCode Phone Fax Email SupportRepId
1 Luís Gonçalves Embraer - Empresa Brasileira de Aeronáutica S.A. Av. Brigadeiro Faria Lima, 2170 São José dos Campos SP Brazil 12227-000 +55 (12) 3923-5555 +55 (12) 3923-5566 3
2 Leonie Köhler NA Theodor-Heuss-Straße 34 Stuttgart NA Germany 70174 +49 0711 2842222 NA 5
3 François Tremblay NA 1498 rue Bélanger Montréal QC Canada H2G 1A7 +1 (514) 721-4711 NA 3
4 Bjørn Hansen NA Ullevålsveien 14 Oslo NA Norway 0171 +47 22 44 22 22 NA 4
5 František Wichterlová JetBrains s.r.o. Klanova 9/506 Prague NA Czech Republic 14700 +420 2 4172 5555 +420 2 4172 5555 4
6 Helena Holý NA Rilská 3174/6 Prague NA Czech Republic 14300 +420 2 4177 0449 NA 5
7 Astrid Gruber NA Rotenturmstraße 4, 1010 Innere Stadt Vienne NA Austria 1010 +43 01 5134505 NA 5
8 Daan Peeters NA Grétrystraat 63 Brussels NA Belgium 1000 +32 02 219 03 03 NA 4
9 Kara Nielsen NA Sønder Boulevard 51 Copenhagen NA Denmark 1720 +453 3331 9991 NA 4
10 Eduardo Martins Woodstock Discos Rua Dr. Falcão Filho, 155 São Paulo SP Brazil 01007-010 +55 (11) 3033-5446 +55 (11) 3033-4564 4
select *
from invoices
Table 8.12: Displaying records 1 - 10
InvoiceId CustomerId InvoiceDate BillingAddress BillingCity BillingState BillingCountry BillingPostalCode Total
1 2 2009-01-01 00:00:00 Theodor-Heuss-Straße 34 Stuttgart NA Germany 70174 1.98
2 4 2009-01-02 00:00:00 Ullevålsveien 14 Oslo NA Norway 0171 3.96
3 8 2009-01-03 00:00:00 Grétrystraat 63 Brussels NA Belgium 1000 5.94
4 14 2009-01-06 00:00:00 8210 111 ST NW Edmonton AB Canada T6G 2C7 8.91
5 23 2009-01-11 00:00:00 69 Salem Street Boston MA USA 2113 13.86
6 37 2009-01-19 00:00:00 Berger Straße 10 Frankfurt NA Germany 60316 0.99
7 38 2009-02-01 00:00:00 Barbarossastraße 19 Berlin NA Germany 10779 1.98
8 40 2009-02-01 00:00:00 8, Rue Hanovre Paris NA France 75002 1.98
9 42 2009-02-02 00:00:00 9, Place Louis Barthou Bordeaux NA France 33000 3.96
10 46 2009-02-03 00:00:00 3 Chatham Street Dublin Dublin Ireland NA 5.94
SELECT CustomerId
FROM customers
INTERSECT
SELECT CustomerId
FROM invoices
ORDER BY CustomerId;
Table 9.2: Displaying records 1 - 10
CustomerId
1
2
3
4
5
6
7
8
9
10

9.3.3 Except

SQLite EXCEPT operator compares the result sets of two queries and returns distinct rows from the left query that are not output by the right query.

The following example finds artist ids of artists who do not have any album in the albums table:

SELECT ArtistId
FROM artists
EXCEPT
SELECT ArtistId
FROM albums;
Table 9.3: Displaying records 1 - 10
ArtistId
25
26
28
29
30
31
32
33
34
35

9.3.4 Semi and Anti Join

These joins determine which records to keep in the left table. In other words, you use these last two joins (semi-join and anti-join) in a way similar to a WHERE clause dependent on the values of a second table. The previous examples using intersect and except can function as semi and anti join respectively. Intersect kept only the customer IDs that were found in the invoice table. Except kept only the artist IDs that were not found in the album table.

9.4 Subqueries

A subquery is a SELECT statement nested in another statement. See the following statement.

9.4.1 Where

You can use a simple subquery as a search condition. For example, the following statement returns all the tracks in the album with the title “Let There Be Rock.”

SELECT trackid,
       name,
       albumid
FROM tracks
WHERE albumid = (
   SELECT albumid
   FROM albums
   WHERE title = 'Let There Be Rock'
);
Table 9.4: 8 records
TrackId Name AlbumId
15 Go Down 4
16 Dog Eat Dog 4
17 Let There Be Rock 4
18 Bad Boy Boogie 4
19 Problem Child 4
20 Overdose 4
21 Hell Ain’t A Bad Place To Be 4
22 Whole Lotta Rosie 4

9.4.2 From

Sometimes you want to apply aggregate functions to a column multiple times. For example, first, you want to sum the size of an album and then calculate the average size of all albums. You may come up with the following query.

SELECT
    AVG(album.size)
FROM
    (
        SELECT
            SUM(bytes) SIZE
        FROM
            tracks
        GROUP BY
            albumid
    ) AS album;
Table 9.5: 1 records
AVG(album.size)
338288920

We also might want the average milliseconds of

select *
from tracks
Table 9.6: Displaying records 1 - 10
TrackId Name AlbumId MediaTypeId GenreId Composer Milliseconds Bytes UnitPrice
1 For Those About To Rock (We Salute You) 1 1 1 Angus Young, Malcolm Young, Brian Johnson 343719 11170334 0.99
2 Balls to the Wall 2 2 1 NA 342562 5510424 0.99
3 Fast As a Shark 3 2 1 F. Baltes, S. Kaufman, U. Dirkscneider & W. Hoffman 230619 3990994 0.99
4 Restless and Wild 3 2 1 F. Baltes, R.A. Smith-Diesel, S. Kaufman, U. Dirkscneider & W. Hoffman 252051 4331779 0.99
5 Princess of the Dawn 3 2 1 Deaffy & R.A. Smith-Diesel 375418 6290521 0.99
6 Put The Finger On You 1 1 1 Angus Young, Malcolm Young, Brian Johnson 205662 6713451 0.99
7 Let’s Get It Up 1 1 1 Angus Young, Malcolm Young, Brian Johnson 233926 7636561 0.99
8 Inject The Venom 1 1 1 Angus Young, Malcolm Young, Brian Johnson 210834 6852860 0.99
9 Snowballed 1 1 1 Angus Young, Malcolm Young, Brian Johnson 203102 6599424 0.99
10 Evil Walks 1 1 1 Angus Young, Malcolm Young, Brian Johnson 263497 8611245 0.99
create view albumseconds as
SELECT 
    Title,
    Name,
    Milliseconds
FROM 
    tracks
INNER JOIN albums 
    ON tracks.AlbumID = albums.AlbumID
## Error: view albumseconds already exists

We also might want the average milliseconds of each album.

SELECT
    AVG(album.length)
FROM
    (
        SELECT
            SUM(milliseconds) LENGTH
        FROM
            albumseconds
        GROUP BY
            Title
    ) AS ALBUM;
Table 9.7: 1 records
AVG(album.length)
3973424

9.4.3 Correlated Subquery

The correlated subquery is a subquery that uses the values from the outer query. Unlike an ordinal subquery, a correlated subquery cannot be executed independently.

The correlated subquery is not efficient because it is evaluated for each row processed by the outer query.

The following query uses a correlated subquery to return the albums whose size is less than 10MB.

SELECT albumid,
       title
  FROM albums
 WHERE 10000000 > (
                      SELECT sum(bytes) 
                        FROM tracks
                       WHERE tracks.AlbumId = albums.AlbumId
                  )
 ORDER BY title;
Table 9.8: Displaying records 1 - 10
AlbumId Title
296 A Copland Celebration, Vol. I
285 A Soprano Inspired
307 Adams, John: The Chairman Dances
272 Adorate Deum: Gregorian Chant from the Proper of the Mass
273 Allegri: Miserere
319 Armada: Music from the Courts of England and Spain
277 Bach: Goldberg Variations
327 Bach: Orchestral Suites Nos. 1 - 4
300 Bach: The Brandenburg Concertos
278 Bach: The Cello Suites

We can also find the albums where their total length is greater than 1 hour.

SELECT albumid,
       title
  FROM albums
 WHERE 3600000 < (
                      SELECT sum(milliseconds) 
                        FROM tracks
                       WHERE tracks.AlbumId = albums.AlbumId
                  )
 ORDER BY title;
Table 9.9: Displaying records 1 - 10
AlbumId Title
156 …And Justice For All
94 A Matter of Life and Death
139 A TempestadeTempestade Ou O Livro Dos Dias
203 A-Sides
224 Acústico
167 Acústico MTV
26 Acústico MTV [Live]
24 Afrociberdelia
14 Alcohol Fueled Brewtality Live! [Disc 1]
75 Angel Dust

9.4.4 Select

The following query uses a correlated subquery in the SELECT clause to return the number of tracks in an album.

SELECT albumid,
       title,
       (
           SELECT count(trackid) 
             FROM tracks
            WHERE tracks.AlbumId = albums.AlbumId
       )
       tracks_count
  FROM albums
 ORDER BY tracks_count DESC;
Table 9.10: Displaying records 1 - 10
AlbumId Title tracks_count
141 Greatest Hits 57
23 Minha Historia 34
73 Unplugged 30
229 Lost, Season 3 26
230 Lost, Season 1 25
251 The Office, Season 3 25
83 My Way: The Best Of Frank Sinatra [Disc 1] 24
231 Lost, Season 2 24
253 Battlestar Galactica (Classic), Season 1 24
24 Afrociberdelia 23