Exploring R objects


Keywords
tip, tutorial, R, computing

Posted 2010-08-09 18:25:52 UTC

Let's assume, since you are reading this article, that you use R, the data analysis and statistics–focussed programming language; or you are at least considering it. As a result it's likely that you're aware that R is often accused of having a steep learning curve—or quite possibly you have discovered that for yourself. There are various possible reasons for the perception or fact of this learning curve, but for me a big part of the problem is the difficulty with self-discovery. Because R's object orientation features are based around the idea of generic functions, rather than class and object methods, it is rarely clear exactly what can be done with a particular object. Nevertheless, there are some tools in base R that can help with this, and this tutorial will give an overview of them.

Assuming that R's base, stats and datasets packages are loaded—which they usually are in an interactive session—you should be able to run the following command. (The initial angle bracket indicates the command prompt and should not be copied.)

> x <- lm(drivers ~ kms * law, data=Seatbelts)

It doesn't matter if you know what this command is doing or not—in fact, it is possibly more useful for the exercise if you don't! The important effect for our purposes is that an R object, x, will be created. Our aim will be to figure out what x is, and what we can do with it.

First we establish the class of x.

> class(x)
[1] "lm"

It is a common convention that R functions create objects with a class name matching the name of the function, and lm() follows this pattern. So, if we didn't know that x had been created by lm(), we could fruitfully try the command

> ?lm

and obtain quite a bit of information about the object. This doesn't always work, but it's usually worth a try. If it doesn't help, we can try broadening the search to functions which mention "lm" somewhere, viz.

> ??lm
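For reference, ??lm is shorthand for help.search("lm"), which searches the help pages. A related tool, not otherwise covered here, is apropos() from the utils package, which instead searches the names of objects on the search path. A minimal sketch (hits and lm_funs are just illustrative names, and the results depend on which packages are attached):

```r
# List objects on the search path whose names contain "lm"
# (matching is case-insensitive by default).
hits <- apropos("lm")
head(hits)

# Narrow the search: functions whose names start with "lm".
lm_funs <- apropos("^lm", mode = "function")
print(lm_funs)
```

Because apropos() takes a regular expression, anchoring the pattern with ^ is a simple way of cutting down a long list of matches.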

The "see also" section of the help page for lm() (hereafter simply ?lm) points us to some functions which relate to lm objects, but not all help pages are so helpful. An extremely useful function is methods, from the standard utils package, which can be used to discover which generic functions have methods defined for objects of a particular class.

> methods(class="lm")
 [1] add1.lm*           alias.lm*          anova.lm           case.names.lm*    
 [5] confint.lm*        cooks.distance.lm* deviance.lm*       dfbeta.lm*        
 [9] dfbetas.lm*        drop1.lm*          dummy.coef.lm*     effects.lm*       
[13] extractAIC.lm*     family.lm*         formula.lm*        hatvalues.lm      
[17] influence.lm*      kappa.lm           labels.lm*         logLik.lm*        
[21] model.frame.lm     model.matrix.lm    plot.lm            predict.lm        
[25] print.lm           proj.lm*           residuals.lm       rstandard.lm      
[29] rstudent.lm        simulate.lm*       summary.lm         variable.names.lm*
[33] vcov.lm*          

   Non-visible functions are asterisked

We can find out about any of these functions by looking at their own help pages (for example, ?add1). The actual code behind the methods specific to the lm class can be obtained by typing the name alone (e.g. plot.lm); or for "non-visible" functions, asterisked above, the getAnywhere() function can be used (e.g. getAnywhere(add1.lm)).
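Which methods count as visible differs between R versions, so a method that can be printed by typing its name in one release may need getAnywhere() in another. A lookup that does not depend on visibility is getS3method(), also from utils, which takes the generic and the class separately. The following is only a sketch, with found, add1_lm and summary_methods as illustrative names; no output is shown because it varies with the R version and loaded packages.

```r
# getAnywhere() searches package namespaces as well as the search path,
# so it finds methods that are not exported.
found <- getAnywhere("add1.lm")
found$where              # the places where a definition was found

# getS3method() returns the method itself, given generic and class.
add1_lm <- getS3method("add1", "lm")
is.function(add1_lm)
head(deparse(add1_lm))   # first few lines of its source

# methods() also works in the other direction: given a generic,
# it lists the methods defined for it.
summary_methods <- methods("summary")
print(summary_methods)
```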

The methods function is far from perfect, however. It picks up relevant functions by performing pattern matching on their names, rather than through any more fundamental mechanism. As a result it misses nongeneric functions; generic functions which can handle the object but do not have a specific method for it (e.g. coef, which handles lm objects through its default method, coef.default); and functions which are defined for the implicit class of the object, which is list for lm objects. (It also does not work for so-called S4 objects; another function, showMethods(), should be used in that case.)
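The coef case can be checked directly. The sketch below assumes that getS3method() is called with optional = TRUE, so that it returns NULL rather than stopping with an error when no method exists; coef_lm and cf are illustrative names.

```r
x <- lm(drivers ~ kms * law, data = Seatbelts)

# Look for a coef method specific to the lm class.
# With optional = TRUE a missing method gives NULL instead of an error.
coef_lm <- getS3method("coef", "lm", optional = TRUE)
is.null(coef_lm)

# coef() nevertheless works on x, by falling back to another method.
cf <- coef(x)
all.equal(cf, x$coefficients)
```

So a generic being absent from the methods() listing for a class does not mean it cannot be called on objects of that class.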

At the risk of muddying the water somewhat, the issue of implicit classes merits an extra parenthetic mention. In addition to its explicit class, which is usually what class() returns, every object has an implicit class. An object without an explicit class has only the implicit one, so the implicit class of x can be obtained by unclassing it.

> class(unclass(x))
[1] "list"
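The distinction can be probed with oldClass(), a base function that reports only the explicit class attribute. This is a small sketch rather than a complete account of how R determines classes.

```r
x <- lm(drivers ~ kms * law, data = Seatbelts)

# oldClass() returns the class attribute, or NULL if there is none.
oldClass(x)          # the explicit class set by lm()
oldClass(1:3)        # a plain integer vector has no class attribute

# class() falls back to the implicit class when no attribute is set.
class(1:3)
class(unclass(x))

# x is an lm object and, underneath, a list.
inherits(x, "lm")
is.list(x)
```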

It is worth knowing that while x will act as an lm object for the standard generic functions returned by methods(), it will be treated as a list by R's internal generic functions, which include the $ operator, length() and names(). Hence

> length(x)
[1] 12

This outcome has nothing to do with the lm class, and everything to do with the fact that at its core, x is a list with 12 elements. This can be seen in full with the str function, which shows the internal structure of objects.

> str(x)
List of 12
 $ coefficients : Named num [1:4] 2155.5983 -0.0303 -610.0763 0.0184
  ..- attr(*, "names")= chr [1:4] "(Intercept)" "kms" "law" "kms:law"
 $ residuals    : Named num [1:192] -194 -415 -347 -439 -166 ...
  ..- attr(*, "names")= chr [1:192] "1" "2" "3" "4" ...
 $ effects      : Named num [1:192] -23144 -1780 1043 -140 -123 ...
  ..- attr(*, "names")= chr [1:192] "(Intercept)" "kms" "law" "kms:law" ...
 $ rank         : int 4
 $ fitted.values: Named num [1:192] 1881 1923 1854 1824 1798 ...
  ..- attr(*, "names")= chr [1:192] "1" "2" "3" "4" ...
 $ assign       : int [1:4] 0 1 2 3
 $ qr           :List of 5
  ..$ qr   : num [1:192, 1:4] -13.8564 0.0722 0.0722 0.0722 0.0722 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:192] "1" "2" "3" "4" ...
  .. .. ..$ : chr [1:4] "(Intercept)" "kms" "law" "kms:law"
  .. ..- attr(*, "assign")= int [1:4] 0 1 2 3
  ..$ qraux: num [1:4] 1.07 1.17 1.03 1.02
  ..$ pivot: int [1:4] 1 2 3 4
  ..$ tol  : num 1e-07
  ..$ rank : int 4
  ..- attr(*, "class")= chr "qr"
 $ df.residual  : int 188
 $ xlevels      : list()
 $ call         : language lm(formula = drivers ~ kms * law, data = Seatbelts)
 $ terms        :Classes 'terms', 'formula' length 3 drivers ~ kms * law
  .. ..- attr(*, "variables")= language list(drivers, kms, law)
  .. ..- attr(*, "factors")= int [1:3, 1:3] 0 1 0 0 0 1 0 1 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:3] "drivers" "kms" "law"
  .. .. .. ..$ : chr [1:3] "kms" "law" "kms:law"
  .. ..- attr(*, "term.labels")= chr [1:3] "kms" "law" "kms:law"
  .. ..- attr(*, "order")= int [1:3] 1 1 2
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. ..- attr(*, "predvars")= language list(drivers, kms, law)
  .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
  .. .. ..- attr(*, "names")= chr [1:3] "drivers" "kms" "law"
 $ model        :'data.frame':  192 obs. of  3 variables:
  ..$ drivers: num [1:192] 1687 1508 1507 1385 1632 ...
  ..$ kms    : num [1:192] 9059 7685 9963 10955 11823 ...
  ..$ law    : num [1:192] 0 0 0 0 0 0 0 0 0 0 ...
  ..- attr(*, "terms")=Classes 'terms', 'formula' length 3 drivers ~ kms * law
  .. .. ..- attr(*, "variables")= language list(drivers, kms, law)
  .. .. ..- attr(*, "factors")= int [1:3, 1:3] 0 1 0 0 0 1 0 1 1
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:3] "drivers" "kms" "law"
  .. .. .. .. ..$ : chr [1:3] "kms" "law" "kms:law"
  .. .. ..- attr(*, "term.labels")= chr [1:3] "kms" "law" "kms:law"
  .. .. ..- attr(*, "order")= int [1:3] 1 1 2
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. .. ..- attr(*, "predvars")= language list(drivers, kms, law)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
  .. .. .. ..- attr(*, "names")= chr [1:3] "drivers" "kms" "law"
 - attr(*, "class")= chr "lm"
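For an object this large the full listing can be overwhelming. As a sketch of one way to tame it, str() accepts a max.level argument to limit the nesting depth and a give.attr argument to suppress attributes; the list-oriented functions mentioned earlier give an even shorter overview.

```r
x <- lm(drivers ~ kms * law, data = Seatbelts)

# Show only the top level of the list, without attributes.
str(x, max.level = 1, give.attr = FALSE)

# The element names alone are often enough to get oriented.
names(x)

# These functions treat x as a plain list, so dropping the class
# attribute makes no difference to the result.
length(x) == length(unclass(x))
identical(names(x), names(unclass(x)))
```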

Thus we can see that x contains a whole load of information, wrapped up inside an R list. (Not all objects are this complicated!) As far as I know there is no special function for extracting the original function call from this substantial object, but by pulling it apart with str() and knowing that it is fundamentally a list, we see that the call can be obtained using

> x$call
lm(formula = drivers ~ kms * law, data = Seatbelts)
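The element is an unevaluated call rather than a character string, so it can itself be taken apart. The sketch below uses cl as an illustrative name. R releases later than the one this article was written against also provide getCall() in the stats package; the last line assumes such a version and would need to be dropped on older ones.

```r
x <- lm(drivers ~ kms * law, data = Seatbelts)

cl <- x$call
is.call(cl)       # a language object, not text
cl$formula        # arguments of the call can be extracted by name
deparse(cl)       # convert the call to text, e.g. for a log message

# Assumes an R version that provides getCall().
identical(getCall(x), cl)
```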

There is, however, a special function for extracting the residuals, and in line with the R convention for such "accessor" functions, its name matches that of the corresponding list element, so it is simply called residuals.
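A quick check that the accessor and the list element agree, with one caveat about the naming convention: the match is not always exact, since fitted() corresponds to the element named fitted.values. This is a sketch for this particular model, which has no missing values; r1 and r2 are illustrative names.

```r
x <- lm(drivers ~ kms * law, data = Seatbelts)

# Accessor function versus direct list extraction.
r1 <- residuals(x)
r2 <- x$residuals
all.equal(r1, r2)

# The accessor name does not always match the element name exactly.
all.equal(fitted(x), x$fitted.values)
```

Where an accessor exists it is generally the safer choice, because a method can do more than return the raw list element.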

To summarise, we have seen in this overview that R provides a number of functions to help make sense of unfamiliar objects, and to recover their contents. Unfortunately these functions are not particularly obvious, and have some significant limitations, but they are quite useful for objects with an explicit class. Finding out what functions can be called with an object of a core type (like a matrix, for example) is, to my knowledge, somewhat more difficult, because many such functions are not generic. Hopefully the situation will improve with time—and in fact, I am working on a new R package which I hope may help a bit with this. With any luck, the learning curve will only get shallower.