Contingency Tables

There are many options for producing contingency tables and summary tables in R.

We will review the following methods: dplyr & tidyr, table(), and gmodels::CrossTable().

dplyr & tidyr

The more you can accomplish within the tidyverse of R packages, the better (IMO). Using dplyr to produce your summary stats lets you continue the code seamlessly into the next task (filtering, plotting, etc.).

The group_by(), summarize(), and spread() commands are a useful combination for producing aggregate or summary values of our data.

First, let's load dplyr, tidyr, and ggplot2 (for the sample data), plus knitr for printing the tables.

library(ggplot2)
library(dplyr)
library(tidyr)
library(knitr) # for printing html-friendly tables

We will use the mpg dataset from ggplot2 for these exercises.

manufacturer model displ year cyl trans drv cty hwy fl class
audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
audi a4 2.0 2008 4 auto(av) f 21 30 p compact
audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
audi a4 2.8 1999 6 manual(m5) f 18 26 p compact

Here, we can get the total number of cars with each class & cyl combination using group_by() and summarize().

mpg %>%
  group_by(class, cyl) %>%
  summarize(n = n()) %>%
  kable()
class cyl n
2seater 8 5
compact 4 32
compact 5 2
compact 6 13
midsize 4 16
midsize 6 23
midsize 8 2
minivan 4 1
minivan 6 10
pickup 4 3
pickup 6 10
pickup 8 20
subcompact 4 21
subcompact 5 2
subcompact 6 7
subcompact 8 5
suv 4 8
suv 6 16
suv 8 38

dplyr & tidyr: Crosstabs

To turn our summary data into a crosstab or contingency table, we need variable A (class) to be listed by row, and variable B (cyl) to be listed by column.

We can achieve this by including the spread() command, to create columns for each cyl value, with n as the crosstab response value.

mpg %>%
  group_by(class, cyl) %>%
  summarise(n = n()) %>%
  spread(cyl, n) %>%
  kable()
class 4 5 6 8
2seater NA NA NA 5
compact 32 2 13 NA
midsize 16 NA 23 2
minivan 1 NA 10 NA
pickup 3 NA 10 20
subcompact 21 2 7 5
suv 8 NA 16 38
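
The NA cells above are class & cyl combinations that never occur in the data. As a minimal sketch (the object name crosstab_filled is only an illustrative choice), spread() accepts a fill argument that replaces those missing combinations, so a count table can show 0 instead of NA:

```r
library(ggplot2)  # only for the mpg sample data
library(dplyr)
library(tidyr)

# Count cars per class & cyl, then spread cyl into columns.
# fill = 0 puts a 0 in every class/cyl combination that has no rows.
crosstab_filled <- mpg %>%
  group_by(class, cyl) %>%
  summarise(n = n()) %>%
  spread(cyl, n, fill = 0)

crosstab_filled
```

This is mainly useful for counts. For summaries such as a mean or a max, leaving NA is usually the more honest choice, since there are no cars to summarise in those cells.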

dplyr & tidyr: Summary Statistics Other Than Frequency

One advantage of dplyr is that we can determine what kind of summary statistic we want to see very easily by adjusting our summarize() input.

Here, instead of displaying frequencies, we can get the average city mileage (cty, in miles per gallon) by class & cyl.

mpg %>%
  group_by(class, cyl) %>%
  summarise(mean_cty = mean(cty)) %>%
  spread(cyl, mean_cty) %>%
  kable()
class 4 5 6 8
2seater NA NA NA 15.40000
compact 21.37500 21 16.92308 NA
midsize 20.50000 NA 17.78261 16.00000
minivan 18.00000 NA 15.60000 NA
pickup 16.00000 NA 14.50000 11.80000
subcompact 22.85714 20 17.00000 14.80000
suv 18.00000 NA 14.50000 12.13158

Or the maximum city mileage by class & cyl.

mpg %>%
  group_by(class, cyl) %>%
  summarise(max_cty = max(cty)) %>%
  spread(cyl, max_cty) %>%
  kable()
class 4 5 6 8
2seater NA NA NA 16
compact 33 21 18 NA
midsize 23 NA 19 16
minivan 18 NA 17 NA
pickup 17 NA 16 14
subcompact 35 20 18 15
suv 20 NA 17 14

dplyr & tidyr: Proportions

We can find proportions by creating a new, calculated variable dividing row frequency by table frequency.

mpg %>%
  group_by(class) %>%
  summarize(n = n()) %>%
  mutate(prop = n / sum(n)) %>% # our new proportion variable
  kable()
class n prop
2seater 5 0.0213675
compact 47 0.2008547
midsize 41 0.1752137
minivan 11 0.0470085
pickup 33 0.1410256
subcompact 35 0.1495726
suv 62 0.2649573

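We can create a contingency table of proportion values by applying the same spread() command as before. Because summarize() drops only the last grouping variable (cyl), the data are still grouped by class when mutate() runs, so each value below is the proportion within its class and every class column sums to 1. Vary the group_by() and spread() arguments to produce proportions of different variables.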

mpg %>%
  group_by(class, cyl) %>%
  summarize(n = n()) %>%
  mutate(prop = n / sum(n)) %>%
  subset(select = c("class", "cyl", "prop")) %>% # drop the frequency value
  spread(class, prop) %>%
  kable()
cyl 2seater compact midsize minivan pickup subcompact suv
4 NA 0.6808511 0.3902439 0.0909091 0.0909091 0.6000000 0.1290323
5 NA 0.0425532 NA NA NA 0.0571429 NA
6 NA 0.2765957 0.5609756 0.9090909 0.3030303 0.2000000 0.2580645
8 1 NA 0.0487805 NA 0.6060606 0.1428571 0.6129032
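
To make the grouping behaviour concrete, here is a small sketch (table_props, class_props, and class_totals are illustrative names, not part of the original code) that computes the proportions two ways: within each class, as in the table above, and as a share of all 234 cars by calling ungroup() before mutate().

```r
library(ggplot2)  # only for the mpg sample data
library(dplyr)

# Within-class proportions: after summarise() the data are still grouped
# by class, so sum(n) is the number of cars in that class.
class_props <- mpg %>%
  group_by(class, cyl) %>%
  summarise(n = n()) %>%
  mutate(prop = n / sum(n))

# Whole-table proportions: ungroup() first, so sum(n) is the total
# number of cars in the data set.
table_props <- mpg %>%
  group_by(class, cyl) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  mutate(prop = n / sum(n))

# Sanity checks: each class adds up to 1, and so does the whole table.
class_totals <- class_props %>% summarise(total = sum(prop))
class_totals
sum(table_props$prop)
```

Which denominator is right depends on the question: "what share of compacts have 4 cylinders?" calls for the within-class version, while "what share of all cars are 4-cylinder compacts?" calls for the ungrouped one.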

table()

table() is a quick way to pull together row/column frequencies and proportions for categorical variables.

Using the basic table() command, we can get a contingency table of vehicle class by number of cylinders.

table(mpg$class, mpg$cyl)
##
##               4  5  6  8
##   2seater     0  0  0  5
##   compact    32  2 13  0
##   midsize    16  0 23  2
##   minivan     1  0 10  0
##   pickup      3  0 10 20
##   subcompact 21  2  7  5
##   suv         8  0 16 38

Table, Column, and Row Frequencies

We can store the table in an object (here, mpg_table) so that it can be reused, and display it as a flat table with the ftable() command.

mpg_table <- table(mpg$class, mpg$cyl)
ftable(mpg_table)
##             4  5  6  8
##
## 2seater     0  0  0  5
## compact    32  2 13  0
## midsize    16  0 23  2
## minivan     1  0 10  0
## pickup      3  0 10 20
## subcompact 21  2  7  5
## suv         8  0 16 38

For row frequencies, we use the margin.table() command, with the 1 argument.

margin.table(mpg_table, 1)
##
##    2seater    compact    midsize    minivan     pickup subcompact
##          5         47         41         11         33         35
##        suv
##         62

For column frequencies, we use the margin.table() command, with the 2 argument.

margin.table(mpg_table, 2)
##
##  4  5  6  8
## 81  4 79 70
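
If you want the counts and both sets of totals in one table, base R's addmargins() is one option. The sketch below rebuilds mpg_table so it runs on its own; mpg_with_totals, row_totals, and col_totals are illustrative names.

```r
library(ggplot2)  # only for the mpg sample data

mpg_table <- table(mpg$class, mpg$cyl)

# Append a "Sum" row and a "Sum" column holding the marginal totals.
mpg_with_totals <- addmargins(mpg_table)
mpg_with_totals

# The appended margins are the same numbers margin.table() returns.
row_totals <- margin.table(mpg_table, 1)
col_totals <- margin.table(mpg_table, 2)
all(mpg_with_totals[1:7, "Sum"] == row_totals)
all(mpg_with_totals["Sum", 1:4] == col_totals)
```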

Table, Column, and Row Proportions

We can get the proportion values for our variable combinations as well.

For proportion of the entire table, we use the prop.table() command.

prop.table(mpg_table) #proportion of entire table
##
##                        4           5           6           8
##   2seater    0.000000000 0.000000000 0.000000000 0.021367521
##   compact    0.136752137 0.008547009 0.055555556 0.000000000
##   midsize    0.068376068 0.000000000 0.098290598 0.008547009
##   minivan    0.004273504 0.000000000 0.042735043 0.000000000
##   pickup     0.012820513 0.000000000 0.042735043 0.085470085
##   subcompact 0.089743590 0.008547009 0.029914530 0.021367521
##   suv        0.034188034 0.000000000 0.068376068 0.162393162

For row proportions, we use the prop.table() command, with the 1 argument following the table name.

prop.table(mpg_table, 1) #proportion of entire row
##
##                       4          5          6          8
##   2seater    0.00000000 0.00000000 0.00000000 1.00000000
##   compact    0.68085106 0.04255319 0.27659574 0.00000000
##   midsize    0.39024390 0.00000000 0.56097561 0.04878049
##   minivan    0.09090909 0.00000000 0.90909091 0.00000000
##   pickup     0.09090909 0.00000000 0.30303030 0.60606061
##   subcompact 0.60000000 0.05714286 0.20000000 0.14285714
##   suv        0.12903226 0.00000000 0.25806452 0.61290323

For column proportions, we use the prop.table() command, with the 2 argument following the table name.

prop.table(mpg_table, 2) #proportion of entire column
##
##                       4          5          6          8
##   2seater    0.00000000 0.00000000 0.00000000 0.07142857
##   compact    0.39506173 0.50000000 0.16455696 0.00000000
##   midsize    0.19753086 0.00000000 0.29113924 0.02857143
##   minivan    0.01234568 0.00000000 0.12658228 0.00000000
##   pickup     0.03703704 0.00000000 0.12658228 0.28571429
##   subcompact 0.25925926 0.50000000 0.08860759 0.07142857
##   suv        0.09876543 0.00000000 0.20253165 0.54285714
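
To see what the margin argument is doing, the row and column proportions can be reproduced by hand: divide each cell by its row total or by its column total. This is a sketch for checking your understanding rather than a replacement for prop.table(); manual_row_props and manual_col_props are illustrative names.

```r
library(ggplot2)  # only for the mpg sample data

mpg_table <- table(mpg$class, mpg$cyl)

# Row proportions: divide every cell by the total of its row (margin 1).
manual_row_props <- sweep(mpg_table, 1, rowSums(mpg_table), "/")

# Column proportions: divide every cell by the total of its column (margin 2).
manual_col_props <- sweep(mpg_table, 2, colSums(mpg_table), "/")

# Both should match what prop.table() returns.
all.equal(as.vector(manual_row_props), as.vector(prop.table(mpg_table, 1)))
all.equal(as.vector(manual_col_props), as.vector(prop.table(mpg_table, 2)))

# Row proportions sum to 1 across each row, column proportions down each column.
rowSums(prop.table(mpg_table, 1))
colSums(prop.table(mpg_table, 2))
```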

gmodels::CrossTable()

The CrossTable() command from the gmodels package produces frequencies along with table, row, & column proportions in a single command. The values are not as easy to pull into tables of their own or to manipulate further as they are with the dplyr/tidyr tables, but this is a handy command nonetheless.

Install & Load the gmodels package

install.packages("gmodels")
library(gmodels)

Run the CrossTable() command, with your two variables as inputs.

CrossTable(mpg$class, mpg$cyl)
##
##
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table:  234
##
##
##              | mpg$cyl
##    mpg$class |         4 |         5 |         6 |         8 | Row Total |
## -------------|-----------|-----------|-----------|-----------|-----------|
##      2seater |         0 |         0 |         0 |         5 |         5 |
##              |     1.731 |     0.085 |     1.688 |     8.210 |           |
##              |     0.000 |     0.000 |     0.000 |     1.000 |     0.021 |
##              |     0.000 |     0.000 |     0.000 |     0.071 |           |
##              |     0.000 |     0.000 |     0.000 |     0.021 |           |
## -------------|-----------|-----------|-----------|-----------|-----------|
##      compact |        32 |         2 |        13 |         0 |        47 |
##              |    15.210 |     1.782 |     0.518 |    14.060 |           |
##              |     0.681 |     0.043 |     0.277 |     0.000 |     0.201 |
##              |     0.395 |     0.500 |     0.165 |     0.000 |           |
##              |     0.137 |     0.009 |     0.056 |     0.000 |           |
## -------------|-----------|-----------|-----------|-----------|-----------|
##      midsize |        16 |         0 |        23 |         2 |        41 |
##              |     0.230 |     0.701 |     6.059 |     8.591 |           |
##              |     0.390 |     0.000 |     0.561 |     0.049 |     0.175 |
##              |     0.198 |     0.000 |     0.291 |     0.029 |           |
##              |     0.068 |     0.000 |     0.098 |     0.009 |           |
## -------------|-----------|-----------|-----------|-----------|-----------|
##      minivan |         1 |         0 |        10 |         0 |        11 |
##              |     2.070 |     0.188 |    10.641 |     3.291 |           |
##              |     0.091 |     0.000 |     0.909 |     0.000 |     0.047 |
##              |     0.012 |     0.000 |     0.127 |     0.000 |           |
##              |     0.004 |     0.000 |     0.043 |     0.000 |           |
## -------------|-----------|-----------|-----------|-----------|-----------|
##       pickup |         3 |         0 |        10 |        20 |        33 |
##              |     6.211 |     0.564 |     0.117 |    10.391 |           |
##              |     0.091 |     0.000 |     0.303 |     0.606 |     0.141 |
##              |     0.037 |     0.000 |     0.127 |     0.286 |           |
##              |     0.013 |     0.000 |     0.043 |     0.085 |           |
## -------------|-----------|-----------|-----------|-----------|-----------|
##   subcompact |        21 |         2 |         7 |         5 |        35 |
##              |     6.515 |     3.284 |     1.963 |     2.858 |           |
##              |     0.600 |     0.057 |     0.200 |     0.143 |     0.150 |
##              |     0.259 |     0.500 |     0.089 |     0.071 |           |
##              |     0.090 |     0.009 |     0.030 |     0.021 |           |
## -------------|-----------|-----------|-----------|-----------|-----------|
##          suv |         8 |         0 |        16 |        38 |        62 |
##              |     8.444 |     1.060 |     1.162 |    20.403 |           |
##              |     0.129 |     0.000 |     0.258 |     0.613 |     0.265 |
##              |     0.099 |     0.000 |     0.203 |     0.543 |           |
##              |     0.034 |     0.000 |     0.068 |     0.162 |           |
## -------------|-----------|-----------|-----------|-----------|-----------|
## Column Total |        81 |         4 |        79 |        70 |       234 |
##              |     0.346 |     0.017 |     0.338 |     0.299 |           |
## -------------|-----------|-----------|-----------|-----------|-----------|
##
##
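
The second line in each cell, "Chi-square contribution", is the one value here that the earlier methods did not produce. As a rough sketch of the arithmetic, it is (observed - expected)^2 / expected, where the expected count is row total x column total / table total. The helper chisq_contribution() below is a hypothetical name written for this illustration, not a gmodels function; the inputs are read straight off the output above.

```r
# Hypothetical helper: chi-square contribution of a single cell,
# given its observed count and the relevant totals.
chisq_contribution <- function(observed, row_total, col_total, n_total) {
  expected <- row_total * col_total / n_total  # count expected under independence
  (observed - expected)^2 / expected
}

# 2seater with 8 cylinders: N = 5, row total = 5, column total = 70
round(chisq_contribution(5, 5, 70, 234), 3)  # 8.21, printed as 8.210 in the table

# 2seater with 4 cylinders: N = 0, row total = 5, column total = 81
round(chisq_contribution(0, 5, 81, 234), 3)  # 1.731
```

Note that an empty cell still contributes: with no cars observed, the contribution is simply the expected count.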