These functions are used to subset a data frame, applying the expressions in ... to determine which rows should be kept (for filter()) or dropped (for filter_out()).

Multiple conditions can be supplied, separated by commas. These are combined with the & operator. To combine comma-separated conditions using | instead, wrap them in when_any().
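
As a small sketch of this rule, assuming dplyr is attached and using a made-up tibble df (not part of the original page), comma-separated conditions should select the same rows as a single condition joined with &:

```r
library(dplyr)

# Hypothetical toy data, for illustration only
df <- tibble(x = c(1, 5, 10), y = c("a", "b", "b"))

# Two comma-separated conditions ...
by_comma <- df |> filter(x > 2, y == "b")

# ... select the same rows as one condition joined with `&`
by_and <- df |> filter(x > 2 & y == "b")

# Both results contain rows 2 and 3 of df
all.equal(by_comma, by_and)
```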

Both filter() and filter_out() treat NA like FALSE. This subtle behavior can impact how you write your conditions when missing values are involved. See the section on Missing values for important details and examples.

# S3 method for class 'SpatialExperiment'
filter(.data, ..., .preserve = FALSE)

Arguments

.data

A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details.

...

<data-masking> Expressions that return a logical vector, defined in terms of the variables in .data. If multiple expressions are included, they are combined with the & operator. To combine expressions using | instead, wrap them in when_any(). Only rows for which all expressions evaluate to TRUE are kept (for filter()) or dropped (for filter_out()).

.preserve

Relevant when the .data input is grouped. If .preserve = FALSE (the default), the grouping structure is recalculated based on the resulting data, otherwise the grouping is kept as is.
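
The effect of .preserve can be sketched on a small grouped tibble. This example uses dplyr's data frame method and a hypothetical dataset; whether the SpatialExperiment method handles grouping the same way is not stated here, so treat it as an illustration of the argument's documented meaning only.

```r
library(dplyr)

# Hypothetical grouped data: group "a" has no rows with x > 2
grouped <- tibble(g = c("a", "a", "b"), x = c(1, 2, 3)) |>
  group_by(g)

# Default (.preserve = FALSE): the grouping is recalculated from the
# remaining rows, so the emptied group "a" disappears
recalculated <- grouped |> filter(x > 2)
n_groups(recalculated)
#> [1] 1

# .preserve = TRUE: the original grouping structure is kept, including
# the now-empty group "a"
preserved <- grouped |> filter(x > 2, .preserve = TRUE)
n_groups(preserved)
#> [1] 2
```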

Value

An object of the same type as .data. The output has the following properties:

  • Rows are a subset of the input, but appear in the same order.

  • Columns are not modified.

  • The number of groups may be reduced (if .preserve is not TRUE).

  • Data frame attributes are preserved.

Missing values

Both filter() and filter_out() treat NA like FALSE. This results in the following behavior:

  • filter() drops both NA and FALSE.

  • filter_out() keeps both NA and FALSE.

This means that filter(data, <conditions>) + filter_out(data, <conditions>) captures every row within data exactly once.
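
The partition property can be sketched in base R. This is only a model of the "treat NA like FALSE" rule on a hypothetical vector, not dplyr's implementation:

```r
# Hypothetical data and condition
class <- c("suv", NA, "coupe")
condition <- class == "suv"      # TRUE, NA, FALSE

# Treating NA like FALSE: only a literal TRUE counts as a match
matched <- condition %in% TRUE   # TRUE, FALSE, FALSE

kept_by_filter     <- class[matched]    # what filter() would keep
kept_by_filter_out <- class[!matched]   # what filter_out() would keep

# Every element lands in exactly one of the two pieces
length(kept_by_filter) + length(kept_by_filter_out) == length(class)
```

Because matched never contains NA, matched and !matched split the rows cleanly, which is why the two functions together cover every row exactly once.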

The NA handling of these functions has been designed to match your intent. When your intent is to keep rows, use filter(). When your intent is to drop rows, use filter_out().

For example, if your goal with this cars data is to "drop rows where the class is suv", then you might write this in one of two ways:

cars <- tibble(class = c("suv", NA, "coupe"))
cars
#> # A tibble: 3 x 1
#>   class
#>   <chr>
#> 1 suv
#> 2 <NA>
#> 3 coupe

cars |> filter(class != "suv")
#> # A tibble: 1 x 1
#>   class
#>   <chr>
#> 1 coupe

cars |> filter_out(class == "suv")
#> # A tibble: 2 x 1
#>   class
#>   <chr>
#> 1 <NA>
#> 2 coupe

Note how filter() drops the NA rows even though our goal was only to drop "suv" rows, but filter_out() matches our intuition.

To generate the correct result with filter(), you'd need to use:

cars |> filter(class != "suv" | is.na(class))
#> # A tibble: 2 x 1
#>   class
#>   <chr>
#> 1 <NA>
#> 2 coupe

This quickly gets unwieldy when multiple conditions are involved.

In general, if you find yourself:

  • Using "negative" operators like != or !

  • Adding in NA handling like | is.na(col) or & !is.na(col)

then you should consider whether swapping to the other filtering variant would make your conditions simpler.
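
To illustrate how the NA handling grows with the number of conditions, here is a sketch using filter() alone on a hypothetical version of the cars data with an added mpg column. The goal is to drop rows where the class is "suv" and mpg is below 15, while keeping rows with missing values:

```r
library(dplyr)

# Hypothetical data with missing values in both columns
cars <- tibble(
  class = c("suv", "suv", NA, "coupe", "suv"),
  mpg   = c(10, 20, 10, NA, NA)
)

# With filter(), the condition must be negated and every column needs
# explicit NA handling so that rows with missing values are not lost.
kept <- cars |>
  filter(!(class == "suv" & mpg < 15) | is.na(class) | is.na(mpg))

# Only the first row (suv, 10) is dropped
nrow(kept)
#> [1] 4

# Following the description of filter_out() on this page, the same
# intent would read: cars |> filter_out(class == "suv", mpg < 15)
```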

Comparison to base subsetting

Base subsetting with [ doesn't treat NA like TRUE or FALSE. Instead, it generates a fully missing row, which is different from how both filter() and filter_out() work.

cars <- tibble(class = c("suv", NA, "coupe"), mpg = c(10, 12, 14))
cars
#> # A tibble: 3 x 2
#>   class   mpg
#>   <chr> <dbl>
#> 1 suv      10
#> 2 <NA>     12
#> 3 coupe    14

cars[cars$class == "suv",]
#> # A tibble: 2 x 2
#>   class   mpg
#>   <chr> <dbl>
#> 1 suv      10
#> 2 <NA>     NA

cars |> filter(class == "suv")
#> # A tibble: 1 x 2
#>   class   mpg
#>   <chr> <dbl>
#> 1 suv      10
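
If you need to stay in base R, one common idiom is to wrap the condition in which(), which returns only the positions where the condition is TRUE. A minimal sketch on the same toy data, stored here as a plain data frame:

```r
# Same toy data as above, as a base data frame
cars <- data.frame(class = c("suv", NA, "coupe"), mpg = c(10, 12, 14))

# which() drops NA and FALSE positions, so no fully missing row appears
suv_rows <- cars[which(cars$class == "suv"), ]
suv_rows
```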

Useful filter functions

There are many functions and operators that are useful when constructing the expressions used to filter the data:

  • ==, >, >= etc.

  • &, |, !, xor()

  • is.na()

  • between(), near()

  • when_any()

Grouped tibbles

Because filtering expressions are computed within groups, they may yield different results on grouped tibbles. This will be the case as soon as an aggregating, lagging, or ranking function is involved. Compare this ungrouped filtering:

starwars |> filter(mass > mean(mass, na.rm = TRUE))

With the grouped equivalent:

starwars |> filter(mass > mean(mass, na.rm = TRUE), .by = gender)

In the ungrouped version, filter() compares the value of mass in each row to the global average (taken over the whole data set), keeping only the rows with mass greater than this global average. In contrast, the grouped version calculates the average mass separately for each gender group, and keeps rows with mass greater than the relevant within-gender average.
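
The same contrast is easier to check by hand on a tiny dataset. The tibble below is hypothetical, and the .by argument assumes dplyr 1.1.0 or later:

```r
library(dplyr)

# Hypothetical data: two groups with very different averages
df <- tibble(
  group = c("a", "a", "b", "b"),
  mass  = c(10, 20, 100, 200)
)

# Ungrouped: each row is compared to the overall mean (82.5),
# so only the rows with mass 100 and 200 are kept
ungrouped <- df |> filter(mass > mean(mass))

# Grouped: each row is compared to its own group's mean
# (15 for "a", 150 for "b"), so the rows with mass 20 and 200 are kept
grouped <- df |> filter(mass > mean(mass), .by = group)
```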

Methods

This function is a generic, which means that packages can provide implementations (methods) for other classes. See the documentation of individual methods for extra arguments and differences in behaviour.

The following methods are currently available in loaded packages: dplyr (data.frame, ts), plotly (plotly), tidySingleCellExperiment (SingleCellExperiment), tidySpatialExperiment (SpatialExperiment).

Examples

example(read10xVisium)
#> 
#> rd10xV> dir <- system.file(
#> rd10xV+   file.path("extdata", "10xVisium"), 
#> rd10xV+   package = "SpatialExperiment")
#> 
#> rd10xV> sample_ids <- c("section1", "section2")
#> 
#> rd10xV> samples <- file.path(dir, sample_ids, "outs")
#> 
#> rd10xV> list.files(samples[1])
#> [1] "raw_feature_bc_matrix" "spatial"              
#> 
#> rd10xV> list.files(file.path(samples[1], "spatial"))
#> [1] "scalefactors_json.json"    "tissue_lowres_image.png"  
#> [3] "tissue_positions_list.csv"
#> 
#> rd10xV> file.path(samples[1], "raw_feature_bc_matrix")
#> [1] "/home/runner/work/_temp/Library/SpatialExperiment/extdata/10xVisium/section1/outs/raw_feature_bc_matrix"
#> 
#> rd10xV> (spe <- read10xVisium(samples, sample_ids, 
#> rd10xV+   type = "sparse", data = "raw", 
#> rd10xV+   images = "lowres", load = FALSE))
#> Warning: 'read10xVisium' is deprecated.
#> Use 'VisiumIO::TENxVisium(List)' instead.
#> See help("Deprecated")
#> # A SpatialExperiment-tibble abstraction: 99 × 7
#> # Features = 50 | Cells = 99 | Assays = counts
#>    .cell              in_tissue array_row array_col sample_id pxl_col_in_fullres
#>    <chr>              <lgl>         <int>     <int> <chr>                  <int>
#>  1 AAACAACGAATAGTTC-1 FALSE             0        16 section1                2312
#>  2 AAACAAGTATCTCCCA-1 TRUE             50       102 section1                8230
#>  3 AAACAATCTACTAGCA-1 TRUE              3        43 section1                4170
#>  4 AAACACCAATAACTGC-1 TRUE             59        19 section1                2519
#>  5 AAACAGAGCGACTCCT-1 TRUE             14        94 section1                7679
#>  6 AAACAGCTTTCAGAAG-1 FALSE            43         9 section1                1831
#>  7 AAACAGGGTCTATATT-1 FALSE            47        13 section1                2106
#>  8 AAACAGTGTTCCTGGG-1 FALSE            73        43 section1                4170
#>  9 AAACATGGTGAGAGGA-1 FALSE            62         0 section1                1212
#> 10 AAACATTTCCCGGATT-1 FALSE            61        97 section1                7886
#> # ℹ 89 more rows
#> # ℹ 1 more variable: pxl_row_in_fullres <int>
#> 
#> rd10xV> # base directory 'outs/' from Space Ranger can also be omitted
#> rd10xV> samples2 <- file.path(dir, sample_ids)
#> 
#> rd10xV> (spe2 <- read10xVisium(samples2, sample_ids, 
#> rd10xV+   type = "sparse", data = "raw", 
#> rd10xV+   images = "lowres", load = FALSE))
#> Warning: 'read10xVisium' is deprecated.
#> Use 'VisiumIO::TENxVisium(List)' instead.
#> See help("Deprecated")
#> # A SpatialExperiment-tibble abstraction: 99 × 7
#> # Features = 50 | Cells = 99 | Assays = counts
#>    .cell              in_tissue array_row array_col sample_id pxl_col_in_fullres
#>    <chr>              <lgl>         <int>     <int> <chr>                  <int>
#>  1 AAACAACGAATAGTTC-1 FALSE             0        16 section1                2312
#>  2 AAACAAGTATCTCCCA-1 TRUE             50       102 section1                8230
#>  3 AAACAATCTACTAGCA-1 TRUE              3        43 section1                4170
#>  4 AAACACCAATAACTGC-1 TRUE             59        19 section1                2519
#>  5 AAACAGAGCGACTCCT-1 TRUE             14        94 section1                7679
#>  6 AAACAGCTTTCAGAAG-1 FALSE            43         9 section1                1831
#>  7 AAACAGGGTCTATATT-1 FALSE            47        13 section1                2106
#>  8 AAACAGTGTTCCTGGG-1 FALSE            73        43 section1                4170
#>  9 AAACATGGTGAGAGGA-1 FALSE            62         0 section1                1212
#> 10 AAACATTTCCCGGATT-1 FALSE            61        97 section1                7886
#> # ℹ 89 more rows
#> # ℹ 1 more variable: pxl_row_in_fullres <int>
#> 
#> rd10xV> # tabulate number of spots mapped to tissue
#> rd10xV> cd <- colData(spe)
#> 
#> rd10xV> table(
#> rd10xV+   in_tissue = cd$in_tissue, 
#> rd10xV+   sample_id = cd$sample_id)
#>          sample_id
#> in_tissue section1 section2
#>     FALSE       28       27
#>     TRUE        22       22
#> 
#> rd10xV> # view available images
#> rd10xV> imgData(spe)
#> DataFrame with 2 rows and 4 columns
#>     sample_id    image_id   data scaleFactor
#>   <character> <character> <list>   <numeric>
#> 1    section1      lowres   ####   0.0510334
#> 2    section2      lowres   ####   0.0510334
spe |>
    filter(in_tissue == TRUE)
#> # A SpatialExperiment-tibble abstraction: 44 × 7
#> # Features = 50 | Cells = 44 | Assays = counts
#>    .cell              in_tissue array_row array_col sample_id pxl_col_in_fullres
#>    <chr>              <lgl>         <int>     <int> <chr>                  <int>
#>  1 AAACAAGTATCTCCCA-1 TRUE             50       102 section1                8230
#>  2 AAACAATCTACTAGCA-1 TRUE              3        43 section1                4170
#>  3 AAACACCAATAACTGC-1 TRUE             59        19 section1                2519
#>  4 AAACAGAGCGACTCCT-1 TRUE             14        94 section1                7679
#>  5 AAACCGGGTAGGTACC-1 TRUE             42        28 section1                3138
#>  6 AAACCGTTCGTCCAGG-1 TRUE             52        42 section1                4101
#>  7 AAACCTCATGAAGTTG-1 TRUE             37        19 section1                2519
#>  8 AAACGAAGAACATACC-1 TRUE              6        64 section1                5615
#>  9 AAACGAGACGGTTGAT-1 TRUE             35        79 section1                6647
#> 10 AAACGGTTGCGAACTG-1 TRUE             67        59 section1                5271
#> # ℹ 34 more rows
#> # ℹ 1 more variable: pxl_row_in_fullres <int>