--- title: Managing expiration of versioned subdirectories author: - name: Aaron Lun email: infinite.monkeys.with.keyboards@gmail.com date: "Revised: September 4, 2022" output: BiocStyle::html_document: toc_float: true package: dir.expiry vignette: > %\VignetteIndexEntry{Managing directory expiration} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo=FALSE} knitr::opts_chunk$set(error=FALSE, message=FALSE, warnings=FALSE) library(BiocStyle) ``` # Background Several Bioconductor packages (e.g., `r Biocpkg("basilisk")`, `r Biocpkg("rebook")`) use external cache directories to manage its resources. This typically involves creating a package cache directory in which versioned directories are created: ``` ~/.cache/ basilisk/ # package cache directory 0.99.0/ # versioned directory 1.0.1/ 1.1.3/ ``` The use of separate versioned caches allows the simultaneous use of multiple R installations with different versions of the same package. However, it requires some housekeeping to remove stale caches and avoid an large increase in disk usage - especially for packages with frequent version updates. One might think of using POSIX access time but this is not a reliable indicator of whether a cache is truly being used by the package. For example, external processes like virus scans, `find -exec grep` and so on can update the last access time without actually using the cache's contents. Some filesystem administrators may even turn off access time tracking for performance. # Setting the access time `r Biocpkg("dir.expiry")` implements its own tracker to monitor the last access time for a particular cache version. To illustrate, let's set up a versioned cache directory. ```{r} cache.path <- tempfile(pattern="expired_demo") dir.create(cache.path) version <- package_version("1.0.0") version.dir <- file.path(cache.path, version) dir.create(version.dir) ``` The path to this versioned directory should contain the version number as the `basename` and the package cache directory as the `dirname`. The expectation is that there may be multiple versioned directories nested within a single package's cache directory. ```{r} version.dir ``` Assume that we successfully accessed the versioned directory (for some arbitrary definition of success). We can then update the last access time by calling the `touchDirectory()` function with the path to the versioned directory: ```{r} library(dir.expiry) touchDirectory(version.dir) ``` Doing so creates or updates an `*_dir.expiry` file containing the last access date as an integer. (This implies that the process creating the cache should not make any of its own files ending with `*_dir.expiry-info`, lest they be confused with `r Biocpkg("dir.expiry")`'s files.) Only directories with Bioconductor-style versions should be present in the package cache directory. ```{r} list.files(cache.path) cat(readLines(file.path(cache.path, "1.0.0_dir.expiry")), sep="\n") ``` # Maintaining thread safety `r Biocpkg("dir.expiry")` makes extensive use of file locks (via the `r CRANpkg("filelock")` package) to avoid deleting caches that might be in use by other R processes. Developers are strongly advised to lock their directories of interest before doing anything with them, including any calls to `touchDirectory()`. This should be done using the `lockDirectory()` function: ```{r} v <- package_version("1.0.0") v.dir <- file.path(cache.path, v) lock <- lockDirectory(v.dir) # on.exit(unlockDirectory(lock)), in a function context. # Do stuff with the versioned cache 'v.dir' here... dir.create(v.dir, showWarnings=FALSE) # Finally, touch the directory on successful completion. touchDirectory(v.dir) ``` In real functions, `on.exit()` is the preferred approach for releasing the locks created by `lockDirectory()`. In this vignette, though, we will just call this function manually. ```{r} unlockDirectory(lock) ``` For read-only applications, developers may wish to set `exclusive=FALSE` in `lockDirectory()`. This allows multiple processes to read from the versioned cache at the same time, only being blocked when one process needs to perform a write operation. # Deleting old caches `unlockDirectory()` will automatically delete all expired versions that it finds in the package cache directory. By default, this is defined as all directories that were last touched more than 30 days ago. To illustrate, let's create another versioned directories and pretend it was created 100 days ago: ```{r} cache.path <- tempfile(pattern="expired_demo") old.version <- package_version("0.99.0") old.version.dir <- file.path(cache.path, old.version) lck <- lockDirectory(old.version.dir) dir.create(old.version.dir) touchDirectory(old.version.dir, date=Sys.Date() - 100) unlockDirectory(lck, clear=FALSE) list.files(cache.path) ``` Locking and unlocking a more recent versioned directory will delete the files associated with the older version. This includes the versioned directory itself as well as any lock and expiry files. ```{r} new.version <- package_version("1.0.0") new.version.dir <- file.path(cache.path, new.version) lck <- lockDirectory(new.version.dir) dir.create(new.version.dir) touchDirectory(new.version.dir) unlockDirectory(lck) list.files(cache.path) ``` However, the converse is not true; locking/unlocking the older version will never delete files associated with a newer version. This favors retention of caches corresponding to later versions, which is generally a reasonable outcome. ```{r} new.version2 <- package_version("1.0.1") new.version.dir2 <- file.path(cache.path, new.version2) # Newer version but earlier access. lck2 <- lockDirectory(new.version.dir2) dir.create(new.version.dir2) touchDirectory(new.version.dir2, date=Sys.Date() - 100) unlockDirectory(lck2) # Re-accessing the older version. lck <- lockDirectory(new.version.dir) touchDirectory(new.version.dir) unlockDirectory(lck) list.files(cache.path) ``` Under the hood, `unlockDirectory()` calls `clearDirectories()` to destroy the expired versioned directories. Users can tune the expiration threshold by setting the number of days in the `limit=` argument or the `BIOC_DIR_EXPIRY_LIMIT` environment variable. When deleting old versions, `clearDirectories()` will attempt to acquire an exclusive lock on the corresponding directories. As long as other processes call `lockDirectory()` before working on them, we ensure that `clearDirectories()` does not inadvertently delete a versioned directory that another process is simultaneously operating on. # Session information {-} ```{r} sessionInfo() ```