# git rev-parse -q --verify c0a572d9d32fe1e95672f24e860776dba0750a38^{commit}
c0a572d9d32fe1e95672f24e860776dba0750a38
already have revision, skipping fetch
# git checkout -q -f -B kisskb c0a572d9d32fe1e95672f24e860776dba0750a38
# git clean -qxdf
# < git log -1
# commit c0a572d9d32fe1e95672f24e860776dba0750a38
# Merge: 1f2300a73821 6ac392815628
# Author: Linus Torvalds
# Date:   Mon Jun 26 10:27:04 2023 -0700
#
#     Merge tag 'v6.5/vfs.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
#
#     Pull vfs mount updates from Christian Brauner:
#      "This contains the work to extend move_mount() to allow adding a mount
#       beneath the topmost mount of a mount stack.
#
#       There are two LWN articles about this. One covers the original patch
#       series in [1]. The other in [2] summarizes the session and roughly the
#       discussion between Al and me at LSFMM. The second article also goes
#       into some good questions from attendees.
#
#       Since all details are found in the relevant commit with a technical
#       dive into semantics and locking at the end I'm only adding the
#       motivation and core functionality for this from commit message and
#       leave out the invasive details. The code is also heavily commented and
#       annotated as well which was explicitly requested.
#
#       TL;DR:
#
#         > mount -t ext4 /dev/sda /mnt
#         |
#         └─/mnt    /dev/sda    ext4
#
#         > mount --beneath -t xfs /dev/sdb /mnt
#         |
#         └─/mnt    /dev/sdb    xfs
#           └─/mnt  /dev/sda    ext4
#
#         > umount /mnt
#         |
#         └─/mnt    /dev/sdb    xfs
#
#       The longer motivation is that various distributions are adding or are
#       in the process of adding support for system extensions and in the
#       future configuration extensions through various tools. A more detailed
#       explanation on system and configuration extensions can be found on the
#       manpage which is listed below at [3].
#
#       System extension images may, dynamically at runtime, extend the
#       /usr/ and /opt/ directory hierarchies with additional files. This is
#       particularly useful on immutable system images where a /usr/ and/or
#       /opt/ hierarchy residing on a read-only file system shall be extended
#       temporarily at runtime without making any persistent modifications.
#
#       When one or more system extension images are activated, their /usr/
#       and /opt/ hierarchies are combined via overlayfs with the same
#       hierarchies of the host OS, and the host /usr/ and /opt/ overmounted
#       with it ("merging"). When they are deactivated, the mount point is
#       disassembled, again revealing the unmodified original host version of
#       the hierarchy ("unmerging"). Merging thus makes the extension's
#       resources suddenly appear below the /usr/ and /opt/ hierarchies as if
#       they were included in the base OS image itself. Unmerging makes them
#       disappear again, leaving in place only the files that were shipped
#       with the base OS image itself.
#
#       System configuration images are similar but operate on directories
#       containing system or service configuration.
#
#       On nearly all modern distributions mount propagation plays a crucial
#       role and the rootfs of the OS is a shared mount in a peer group
#       (usually with peer group id 1):
#
#         TARGET  SOURCE  FSTYPE  PROPAGATION  MNT_ID  PARENT_ID
#         /       /       ext4    shared:1     29      1
#
#       On such systems all services and containers run in a separate mount
#       namespace and are pivot_root()ed into their rootfs. A separate mount
#       namespace is almost always used as it is the minimal isolation
#       mechanism services have. But usually they are even much more isolated
#       up to the point where they almost become indistinguishable from
#       containers.
#
#       Mount propagation again plays a crucial role here. The rootfs of all
#       these services is a slave mount to the peer group of the host rootfs.
#       This is done so the service will receive mount propagation events from
#       the host when certain files or directories are updated.
#
#       In addition, the rootfs of each service, container, and sandbox is
#       also a shared mount in its separate peer group:
#
#         TARGET  SOURCE  FSTYPE  PROPAGATION         MNT_ID  PARENT_ID
#         /       /       ext4    shared:24 master:1  71      47
#
#       For people not too familiar with mount propagation, the master:1 means
#       that this is a slave mount to peer group 1. Which as one can see is
#       the host rootfs as indicated by shared:1 above. The shared:24
#       indicates that the service rootfs is a shared mount in a separate peer
#       group with peer group id 24.
#
#       A service may run other services. Such nested services will also have
#       a rootfs mount that is a slave to the peer group of the outer service
#       rootfs mount.
#
#       For containers things are just slightly different. A container's rootfs
#       isn't a slave to the service's or host rootfs' peer group. The rootfs
#       mount of a container is simply a shared mount in its own peer group:
#
#         TARGET                    SOURCE  FSTYPE  PROPAGATION  MNT_ID  PARENT_ID
#         /home/ubuntu/debian-tree  /       ext4    shared:99    61      60
#
#       So whereas services are isolated OS components a container is treated
#       like a separate world and mount propagation into it is restricted to a
#       single well known mount that is a slave to the peer group of the
#       shared mount /run on the host:
#
#         TARGET                  SOURCE              FSTYPE  PROPAGATION  MNT_ID  PARENT_ID
#         /propagate/debian-tree  /run/host/incoming  tmpfs   master:5     71      68
#
#       Here, the master:5 indicates that this mount is a slave to the peer
#       group with peer group id 5. This allows to propagate mounts into the
#       container and served as a workaround for not being able to insert
#       mounts into mount namespaces directly. But the new mount api does
#       support inserting mounts directly. For the interested reader the
#       blogpost in [4] might be worth reading where I explain the old and the
#       new approach to inserting mounts into mount namespaces.
#
#       Containers of course, can themselves be run as services. They often
#       run full systems themselves which means they again run services and
#       containers with the exact same propagation settings explained above.
#
#       The whole system is designed so that it can be easily updated,
#       including all services in various fine-grained ways without having to
#       enter every single service's mount namespace which would be
#       prohibitively expensive. The mount propagation layout has been
#       carefully chosen so it is possible to propagate updates for system
#       extensions and configurations from the host into all services.
#
#       The simplest model to update the whole system is to mount on top of
#       /usr, /opt, or /etc on the host. The new mount on /usr, /opt, or /etc
#       will then propagate into every service. This works cleanly the first
#       time. However, when the system is updated multiple times it becomes
#       necessary to unmount the first update on /opt, /usr, /etc and then
#       propagate the new update. But this means, there's an interval where
#       the old base system is accessible. This has to be avoided to protect
#       against downgrade attacks.
#
#       The vfs already exposes a mechanism to userspace whereby mounts can be
#       mounted beneath an existing mount. Such mounts are internally referred
#       to as "tucked". The patch series exposes the ability to mount beneath
#       a top mount through the new MOVE_MOUNT_BENEATH flag for the
#       move_mount() system call. This allows userspace to seamlessly upgrade
#       mounts. After this series the only thing that will have changed is
#       that mounting beneath an existing mount can be done explicitly instead
#       of just implicitly.
#
#       The crux is that the proposed mechanism already exists and that it is
#       so powerful as to cover cases where mounts are supposed to be updated
#       with new versions. Crucially, it offers an important flexibility.
#       Namely that updates to a system may either be forced or can be delayed
#       and the umount of the top mount be left to a service if it is a
#       cooperative one"
#
#     Link: https://lwn.net/Articles/927491 [1]
#     Link: https://lwn.net/Articles/934094 [2]
#     Link: https://man7.org/linux/man-pages/man8/systemd-sysext.8.html [3]
#     Link: https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html [4]
#     Link: https://github.com/flatcar/sysext-bakery
#     Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_1
#     Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_2
#     Link: https://github.com/systemd/systemd/pull/26013
#
#     * tag 'v6.5/vfs.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
#       fs: allow to mount beneath top mount
#       fs: use a for loop when locking a mount
#       fs: properly document __lookup_mnt()
#       fs: add path_mounted()
# < /opt/cross/kisskb/korg/gcc-5.5.0-nolibc/powerpc64-linux/bin/powerpc64-linux-gcc --version
# < /opt/cross/kisskb/korg/gcc-5.5.0-nolibc/powerpc64-linux/bin/powerpc64-linux-ld --version
# < git log --format=%s --max-count=1 c0a572d9d32fe1e95672f24e860776dba0750a38
# make -s -j 160 ARCH=powerpc O=/kisskb/build/linus_44x_fsp2_defconfig_powerpc-gcc5 CROSS_COMPILE=/opt/cross/kisskb/korg/gcc-5.5.0-nolibc/powerpc64-linux/bin/powerpc64-linux- 44x/fsp2_defconfig
# < make -s -j 160 ARCH=powerpc O=/kisskb/build/linus_44x_fsp2_defconfig_powerpc-gcc5 CROSS_COMPILE=/opt/cross/kisskb/korg/gcc-5.5.0-nolibc/powerpc64-linux/bin/powerpc64-linux- help
# make -s -j 160 ARCH=powerpc O=/kisskb/build/linus_44x_fsp2_defconfig_powerpc-gcc5 CROSS_COMPILE=/opt/cross/kisskb/korg/gcc-5.5.0-nolibc/powerpc64-linux/bin/powerpc64-linux- olddefconfig
# make -s -j 160 ARCH=powerpc O=/kisskb/build/linus_44x_fsp2_defconfig_powerpc-gcc5 CROSS_COMPILE=/opt/cross/kisskb/korg/gcc-5.5.0-nolibc/powerpc64-linux/bin/powerpc64-linux-
Completed OK
# rm -rf /kisskb/build/linus_44x_fsp2_defconfig_powerpc-gcc5
# Build took: 0:01:56.874255