Visualizing linear relationships

    In the spirit of Tukey, the regression plots in seaborn are primarily intended to add a visual guide that helps to emphasize patterns in a dataset during exploratory data analysis. That is to say, seaborn is not itself a package for statistical analysis. To obtain quantitative measures related to the fit of regression models, you should use statsmodels. The goal of seaborn, however, is to make exploring a dataset through visualization quick and easy, as doing so is just as important as (if not more important than) exploring a dataset through tables of statistics.

    import seaborn as sns
    tips = sns.load_dataset("tips")

    Two main functions in seaborn are used to visualize a linear relationship as determined through regression. These functions, regplot() and lmplot(), are closely related and share much of their core functionality. It is important to understand the ways they differ, however, so that you can quickly choose the correct tool for a particular job.

    In the simplest invocation, both functions draw a scatterplot of two variables, x and y, and then fit the regression model y ~ x and plot the resulting regression line and a 95% confidence interval for that regression:

    sns.regplot(x="total_bill", y="tip", data=tips);

    sns.lmplot(x="total_bill", y="tip", data=tips);

    http://seaborn.pydata.org/_images/regression_8_0.png
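    To make the two ingredients of these plots concrete, here is a minimal pure-Python sketch of a least-squares line plus a bootstrap percentile band. It illustrates the idea only and is not seaborn's implementation; the helper names fit_line and bootstrap_band and the synthetic data are assumptions made up for this example.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def bootstrap_band(xs, ys, grid, n_boot=1000, ci=95, seed=0):
    """Percentile confidence band for the fitted line, evaluated on grid."""
    rng = random.Random(seed)
    n = len(xs)
    curves = []
    for _ in range(n_boot):
        # Resample observations with replacement and refit the line.
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(bx)) < 2:  # skip a degenerate resample
            continue
        a, b = fit_line(bx, by)
        curves.append([a + b * g for g in grid])
    tail = (100 - ci) / 200
    band = []
    for j in range(len(grid)):
        column = sorted(curve[j] for curve in curves)
        lower = column[int(tail * (len(column) - 1))]
        upper = column[int((1 - tail) * (len(column) - 1))]
        band.append((lower, upper))
    return band

# Synthetic data, made up for illustration: y = 1 + 0.5 * x + noise.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [1 + 0.5 * x + rng.gauss(0, 1) for x in xs]

intercept, slope = fit_line(xs, ys)
grid = [0, 2.5, 5, 7.5, 10]
band = bootstrap_band(xs, ys, grid)
for g, (lower, upper) in zip(grid, band):
    print(f"x={g:>4}: fit={intercept + slope * g:.2f}, band=({lower:.2f}, {upper:.2f})")
```

    The band is typically narrowest near the mean of x and widens toward the edges, which is the shape of the shaded region in the plots above.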

    Note that the resulting plots are identical, except that the figure shapes are different. We will explain why shortly. For now, the other main difference to know about is that regplot() accepts the x and y variables in a variety of formats, including simple numpy arrays, pandas Series objects, or references to variables in a pandas DataFrame object passed to data. In contrast, lmplot() has data as a required parameter, and the x and y variables must be specified as strings. This data format is called “long-form” or “tidy” data. Other than this input flexibility, regplot() possesses a subset of lmplot()’s features, so we will demonstrate them using the latter.

    It’s possible to fit a linear regression when one of the variables takes discrete values; however, the simple scatterplot produced by this kind of dataset is often not optimal:

    sns.lmplot(x="size", y="tip", data=tips);

    One option is to add some random noise (“jitter”) to the discrete values to make the distribution of those values more clear. Note that jitter is applied only to the scatterplot data and does not influence the regression line fit itself:

    sns.lmplot(x="size", y="tip", data=tips, x_jitter=.05);

    http://seaborn.pydata.org/_images/regression_12_0.png
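    The separation between what is drawn and what is fitted can be sketched in a few lines. This is an illustration of the idea rather than seaborn's code; the jitter and fit_line helpers and the small dataset are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def jitter(values, width, seed=0):
    """Return a copy of values with uniform noise in [-width, width] added."""
    rng = random.Random(seed)
    return [v + rng.uniform(-width, width) for v in values]

# Hypothetical party sizes and tips, made up for this example.
sizes = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6]
tips_paid = [1.5, 2.0, 2.5, 3.0, 3.2, 2.8, 4.0, 4.5, 4.2, 5.0]

# Jittered positions are used only for drawing the scatter points.
display_x = jitter(sizes, 0.05)

# The regression is fitted on the original, unjittered values.
intercept, slope = fit_line(sizes, tips_paid)
print(f"fit on raw x: intercept={intercept:.3f}, slope={slope:.3f}")
```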

    A second option is to collapse over the observations in each discrete bin to plot an estimate of central tendency along with a confidence interval:

    import numpy as np
    sns.lmplot(x="size", y="tip", data=tips, x_estimator=np.mean);
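    Collapsing over each discrete bin amounts to grouping the y values by x, taking an estimate of central tendency per group, and attaching an interval to it. The sketch below does this with a mean and a bootstrap percentile interval; estimate_by_level is a hypothetical helper and the five observations are made up, so treat it as an illustration of the idea rather than seaborn's procedure.

```python
import random
from collections import defaultdict

def estimate_by_level(xs, ys, n_boot=1000, ci=95, seed=0):
    """Mean of y and a bootstrap percentile interval for each distinct x."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)

    tail = (100 - ci) / 200
    summary = {}
    for level in sorted(groups):
        values = groups[level]
        mean = sum(values) / len(values)
        # Resample within the group and recompute the mean each time.
        boots = sorted(
            sum(rng.choice(values) for _ in values) / len(values)
            for _ in range(n_boot)
        )
        lower = boots[int(tail * (n_boot - 1))]
        upper = boots[int((1 - tail) * (n_boot - 1))]
        summary[level] = (mean, lower, upper)
    return summary

# Made-up observations: two discrete x levels.
xs = [1, 1, 2, 2, 2]
ys = [1.0, 3.0, 2.0, 4.0, 6.0]

summary = estimate_by_level(xs, ys)
for level, (mean, lower, upper) in summary.items():
    print(f"x={level}: mean={mean:.2f}, interval=({lower:.2f}, {upper:.2f})")
```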

    The simple linear regression model used above is very simple to fit; however, it is not appropriate for some kinds of datasets. The Anscombe’s quartet dataset shows a few examples where simple linear regression provides an identical estimate of a relationship where simple visual inspection clearly shows differences. For example, in the first case, the linear regression is a good model:

    anscombe = sns.load_dataset("anscombe")
    sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'I'"), ci=None, scatter_kws={"s": 80});

    http://seaborn.pydata.org/_images/regression_17_0.png
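    The claim that the quartet's datasets share essentially the same fitted line can be checked numerically. The sketch below hard-codes the published values for datasets I and II and fits each with a small least-squares helper (fit_line is a name introduced here for illustration).

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Anscombe's quartet, datasets I and II (they share the same x values).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

fit_1 = fit_line(x, y1)
fit_2 = fit_line(x, y2)
print("I:  intercept=%.2f slope=%.2f" % fit_1)
print("II: intercept=%.2f slope=%.2f" % fit_2)
```

    Both fits round to an intercept of 3.00 and a slope of 0.50, so the summary numbers alone cannot tell the two datasets apart.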

    The linear relationship in the second dataset is the same, but the plot clearly shows that this is not a good model:

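    sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'II'"),
               ci=None, scatter_kws={"s": 80});

    For higher-order relationships like this one, passing order=2 fits a polynomial regression instead, which can capture simple kinds of nonlinear trends:

    sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'II'"),
               order=2, ci=None, scatter_kws={"s": 80});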

    http://seaborn.pydata.org/_images/regression_21_0.png

    A different problem is posed by “outlier” observations that deviate for some reason other than the main relationship under study:

    sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'III'"),
               ci=None, scatter_kws={"s": 80});

    In the presence of outliers, it can be useful to fit a robust regression, which uses a different loss function to downweight relatively large residuals:

    sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'III'"),
               robust=True, ci=None, scatter_kws={"s": 80});

    http://seaborn.pydata.org/_images/regression_25_0.png
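    To show what "downweighting large residuals" means in practice, here is a minimal sketch of iteratively reweighted least squares with Huber weights, applied to the published values of Anscombe's dataset III. seaborn delegates robust fitting to statsmodels, so this hand-written version is an illustration of the idea under simplifying assumptions, not the code that runs behind robust=True; the helper names and the tuning constant c=1.345 are choices made for this example.

```python
from statistics import median

def weighted_fit(xs, ys, ws):
    """Weighted least squares for y = intercept + slope * x."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def huber_fit(xs, ys, c=1.345, n_iter=50):
    """Iteratively reweighted least squares with Huber weights."""
    ws = [1.0] * len(xs)
    for _ in range(n_iter):
        a, b = weighted_fit(xs, ys, ws)
        res = [y - (a + b * x) for x, y in zip(xs, ys)]
        # Robust scale estimate: median absolute deviation of the residuals.
        med = median(res)
        scale = max(median(abs(r - med) for r in res) / 0.6745, 1e-12)
        # Residuals beyond c * scale get a weight below 1.
        ws = [1.0 if abs(r) <= c * scale else c * scale / abs(r) for r in res]
    return weighted_fit(xs, ys, ws)

# Anscombe's quartet, dataset III: a clean line plus one outlier at x = 13.
x3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

ols_intercept, ols_slope = weighted_fit(x3, y3, [1.0] * len(x3))
robust_intercept, robust_slope = huber_fit(x3, y3)
print(f"OLS slope:    {ols_slope:.3f}")
print(f"robust slope: {robust_slope:.3f}")
```

    The ordinary fit is pulled toward the outlier, while the reweighted fit settles close to the line through the remaining ten points.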

    When the y variable is binary, simple linear regression also “works” but provides implausible predictions:

    tips["big_tip"] = (tips.tip / tips.total_bill) > .15
    sns.lmplot(x="total_bill", y="big_tip", data=tips,
               y_jitter=.03);

    The solution in this case is to fit a logistic regression, such that the regression line shows the estimated probability of y = 1 for a given value of x:

    sns.lmplot(x="total_bill", y="big_tip", data=tips,
               logistic=True, y_jitter=.03);

    http://seaborn.pydata.org/_images/regression_29_0.png
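    The difference between the two models for a binary outcome can be seen on a small example. The sketch below fits a straight line and a logistic curve to made-up 0/1 data; the logistic fit uses a hand-written Newton iteration, which is an assumption of this example (seaborn relies on statsmodels for logistic=True), and fit_logistic and fit_line are hypothetical helper names.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def fit_logistic(xs, ys, n_iter=25):
    """Maximum-likelihood fit of P(y=1) = sigmoid(a + b*x) by Newton's method."""
    a = b = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1 - p)
            g0 += y - p          # gradient with respect to a
            g1 += (y - p) * x    # gradient with respect to b
            h00 += w             # entries of the 2x2 information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Made-up binary outcomes that become more likely as x grows.
xs = list(range(20))
ys = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1]

lin_a, lin_b = fit_line(xs, ys)
log_a, log_b = fit_logistic(xs, ys)

for x in (0, 10, 19):
    linear = lin_a + lin_b * x
    prob = sigmoid(log_a + log_b * x)
    print(f"x={x:>2}: linear={linear:.2f}, logistic={prob:.2f}")
```

    At the ends of the x range the straight line predicts values below 0 and above 1, while the logistic curve always stays strictly between them.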

    Note that the logistic regression estimate is considerably more computationally intensive (this is true of robust regression as well) than simple regression, and as the confidence interval around the regression line is computed using a bootstrap procedure, you may wish to turn this off for faster iteration (using ci=None).

    An altogether different approach is to fit a nonparametric regression using a lowess smoother. This approach has the fewest assumptions, although it is computationally intensive and so currently confidence intervals are not computed at all:

    sns.lmplot(x="total_bill", y="tip", data=tips,
               lowess=True);
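    For readers curious about what a lowess smoother does, the following is a simplified sketch: for every point it fits a straight line to the nearest neighbours, weighted by a tricube kernel, and keeps the fitted value at that point. It omits the robustness iterations that full implementations usually add, assumes at least three distinct x values, and is not the routine seaborn calls; the function name and the frac default are choices made for this illustration.

```python
import math

def lowess(xs, ys, frac=2 / 3):
    """Simplified LOWESS: one tricube-weighted local linear fit per point."""
    n = len(xs)
    k = max(3, math.ceil(frac * n))  # number of neighbours in each window
    fitted = []
    for x0 in xs:
        # Window half-width: distance to the k-th nearest point (assumed > 0).
        h = sorted(abs(x - x0) for x in xs)[k - 1]
        ws = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
        sw = sum(ws)
        mx = sum(w * x for w, x in zip(ws, xs)) / sw
        my = sum(w * y for w, y in zip(ws, ys)) / sw
        sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
        sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
        slope = sxy / sxx
        fitted.append(my + slope * (x0 - mx))
    return fitted

# Made-up curved data: a sine wave sampled on a regular grid.
xs = [i * 0.5 for i in range(21)]
ys = [math.sin(x) for x in xs]
smooth = lowess(xs, ys)
for x, y, s in list(zip(xs, ys, smooth))[::5]:
    print(f"x={x:>4}: y={y:+.2f}, smoothed={s:+.2f}")
```

    Because every point requires its own local fit, the cost grows quickly with the number of observations, which is one reason bootstrapped confidence intervals are expensive for this method.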

    The residplot() function can be a useful tool for checking whether the simple regression model is appropriate for a dataset. It fits and removes a simple linear regression and then plots the residual values for each observation. Ideally, these values should be randomly scattered around y = 0:

    sns.residplot(x="x", y="y", data=anscombe.query("dataset == 'I'"),
                  scatter_kws={"s": 80});

    http://seaborn.pydata.org/_images/regression_33_0.png

    If there is structure in the residuals, it suggests that simple linear regression is not appropriate:
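    As a numeric illustration of structured residuals, the sketch below fits a straight line to the published values of Anscombe's dataset II and lists the residuals in order of x. The fit_line helper is a name introduced for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Anscombe's quartet, dataset II.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

intercept, slope = fit_line(x, y2)
residuals = {xi: yi - (intercept + slope * xi) for xi, yi in zip(x, y2)}

for xi in sorted(residuals):
    print(f"x={xi:>2}: residual={residuals[xi]:+.2f}")
```

    The residuals are negative at both ends of the x range and positive in the middle. That arch is the signature of a curved relationship that a straight line cannot capture.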

    When the question is how a relationship changes across the levels of a third, categorical variable, the best way to separate the levels out is to plot them on the same axes and to use color to distinguish them:

    sns.lmplot(x="total_bill", y="tip", hue="smoker", data=tips);

    http://seaborn.pydata.org/_images/regression_37_0.png
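    Conceptually, conditioning on a hue variable means splitting the observations by that variable and fitting one regression per level. The sketch below shows that grouping step on a handful of made-up rows; the values are not taken from the tips dataset, and fit_by_group is a hypothetical helper.

```python
from collections import defaultdict

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def fit_by_group(rows):
    """Fit a separate line for each level of the grouping variable."""
    groups = defaultdict(lambda: ([], []))
    for x, y, level in rows:
        groups[level][0].append(x)
        groups[level][1].append(y)
    return {level: fit_line(xs, ys) for level, (xs, ys) in groups.items()}

# Made-up (total_bill, tip, smoker) rows for illustration only.
rows = [
    (10, 1.5, "No"), (20, 3.0, "No"), (30, 4.5, "No"),
    (10, 2.0, "Yes"), (20, 2.5, "Yes"), (30, 3.0, "Yes"),
]

fits = fit_by_group(rows)
for level, (intercept, slope) in fits.items():
    print(f"smoker={level}: intercept={intercept:.2f}, slope={slope:.2f}")
```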

    In addition to color, it’s possible to use different scatterplot markers to make plots that reproduce better in black and white. You also have full control over the colors used:

    sns.lmplot(x="total_bill", y="tip", hue="smoker", data=tips,
               markers=["o", "x"], palette="Set1");

    To add another variable, you can draw multiple “facets”, with each level of the variable appearing in the rows or columns of the grid:

    sns.lmplot(x="total_bill", y="tip", hue="smoker", col="time", data=tips);

    http://seaborn.pydata.org/_images/regression_41_0.png

    sns.lmplot(x="total_bill", y="tip", hue="smoker",
               col="time", row="sex", data=tips);

    Earlier we noted that the default plots made by regplot() and lmplot() look the same but sit on axes that have a different size and shape. This is because regplot() is an “axes-level” function that draws onto a specific axes. This means that you can make multi-panel figures yourself and control exactly where the regression plot goes. If no axes object is explicitly provided, it simply uses the “currently active” axes, which is why the default plot has the same size and shape as most other matplotlib plots. To control the size, you need to create a figure object yourself.

    import matplotlib.pyplot as plt
    f, ax = plt.subplots(figsize=(5, 6))
    sns.regplot(x="total_bill", y="tip", data=tips, ax=ax);

    http://seaborn.pydata.org/_images/regression_44_0.png

    In contrast, the size and shape of the lmplot() figure are controlled through the FacetGrid interface using the height and aspect parameters, which apply to each facet in the plot, not to the overall figure itself:

    sns.lmplot(x="total_bill", y="tip", col="day", data=tips,
               col_wrap=2, height=3);

    sns.lmplot(x="total_bill", y="tip", col="day", data=tips,
               aspect=.5);

    http://seaborn.pydata.org/_images/regression_47_0.png
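    Because height and aspect describe a single facet, the overall figure size follows from simple arithmetic: each facet is height tall and height * aspect wide. The helper below is a hypothetical back-of-the-envelope calculation, not a seaborn function, and it ignores any extra room seaborn may add for a legend. It assumes a default height of 5, which matches lmplot() in recent seaborn versions as far as I know.

```python
import math

def facet_figure_size(n_levels, height=5, aspect=1, col_wrap=None):
    """Approximate (width, height) in inches for facets laid out in columns."""
    if col_wrap is None:
        ncol, nrow = n_levels, 1
    else:
        ncol = col_wrap
        nrow = math.ceil(n_levels / col_wrap)
    return (ncol * height * aspect, nrow * height)

# Four days wrapped into two columns, each facet 3 inches tall and square.
print(facet_figure_size(4, height=3, col_wrap=2))

# Four days in one row, each facet 5 inches tall and half as wide.
print(facet_figure_size(4, aspect=0.5))
```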

    A few other seaborn functions use regplot() in the context of a larger, more complex plot. The first is the jointplot() function that we introduced in the distributions tutorial. In addition to the plot styles previously discussed, jointplot() can use regplot() to show the linear regression fit on the joint axes by passing kind="reg":

    sns.jointplot(x="total_bill", y="tip", data=tips, kind="reg");

    Using the pairplot() function with kind="reg" combines regplot() and PairGrid to show the linear relationship between variables in a dataset. Take care to note how this is different from lmplot(). In the figure below, the two axes don’t show the same relationship conditioned on two levels of a third variable; rather, PairGrid() is used to show multiple relationships between different pairings of the variables in a dataset:

    http://seaborn.pydata.org/_images/regression_51_0.png

    sns.pairplot(tips, x_vars=["total_bill", "size"], y_vars=["tip"],
                 kind="reg");