### Reproducible pkg envs for Modeling Workflows Pascal Sauer
notes: https://events.hifis.net/event/994/contributions/7946/ 15 min presentation, 5 min questions -- ### storytime - HPC, ~50 users using same pkg library - ~30 pkgs developed in-house, ~10 updates per day, most people want all updates 1. 👩🏻🎓 finishing up paper, run results today differ from yesterday 2. 🧑🏼💻 testing new feature in a central tool 3. 👨🏾🔧 reproduce run from a month ago notes: b) get installed automatically on push to github day at work 1. "did not change anything, how can that be?" -> ask RSE 2. ...by installing it system-wide, of course there's a bug -> everyone affected 3. "can I install all pkg versions from back then? I have written them down on a piece of paper" can help all of them with pkg envs 3min --- ### What is a pkg env? - installed pkgs - specific to a project (folder) - isolated from system-wide installed pkgs - tools to - install/update/remove pkgs - document pkg env - restore pkg env notes: conceptual part, let's look at concrete example --
-- #### renv files & folders ```sh renv/library/ .Rprofile renv.lock ``` -- #### renv.lock ```json "magclass": { "Package": "magclass", "Version": "6.13.2", "Source": "Repository", "Repository": "https://rse.pik-potsdam.de/r/packages", "RemoteUrl": "https://github.com/pik-piam/magclass", "RemoteRef": "HEAD", "RemoteSha": "3c385d684d5cc3e7b90b8", "Hash": "dbc9e33e04a823f7219", "Requirements": [ "abind", "data.table" ] }, ``` notes: 7min ---
-- ### MAgPIE pkg envs - clone MAgPIE from GitHub - one time only: fill empty pkg env - on each start: - check for updates - create new pkg env specific to that run notes: open source land use optimization model in GAMS run prep + postprocessing in R ...most users want all updates ...to reproduce later 9min x8:30min -- ### differences to regular setup - daily updates vs. stable pkg env - for each run one stable pkg env - quick pkg env creation needed -- ### technical differences - lockfile not commited - lockfile archive - main pkg env + one stable pkg env per run - global symlink-based pkg cache - ✅ little disk space required (8.3GB for 3400 pkgs) - ✅ instant installation of cached packages - ⛔️ cache corrupt = all pkg envs break - ⛔️ dev version with same version number in cache - ⛔️ file permission issues notes: a) pkg envs focus on stability; usually not wanted here (also: merge conflicts) which pkgs to install? not renv.lock, but DESCRIPTION + auto dep analysis b) per project, timestamps -> time travel alternative to git tracked lockfile c) sync / manage async case d) shared between all users -> pkg envs symlink into cache 8.3 GB for 3400 pkg versions enables one pkg env per run single point of failure, contradicting isolation some user did not set umask clearly trade-off 16min --- ### conclusion - ⛔️ need to learn pkg env workflows - ⛔️ many different pkg envs 1. ✅ stable pkg env 👩🏻🎓 2. ✅ isolated testing pkg env 🧑🏼💻 3. ✅ greatly increased reproducibility 👨🏾🔧 - ⚠️ e.g. system libs not covered notes: will run into problems debugging harder --- ## final slide [pascal.sauer@pik-potsdam.de](mailto:pascal.sauer@pik-potsdam.de) [github.com/pascal-sauer](https://github.com/pascal-sauer) [slides on GitHub](https://github.com/pascal-sauer/deRSE24-reproducible-pkg-envs) [renv homepage](https://rstudio.github.io/renv/) [slides made with reveal.js](https://revealjs.com/) notes: 19min