FreeBSD Capsicum vs. Linux Seccomp Process Sandboxing

Posted by vermaden 1 day ago

Comments

Comment by brynet 1 day ago

EDIT: Article seems to have been updated to remove mention of Chromium.

This article contains a lot of errors. For example, Chromium on FreeBSD does NOT use Capsicum; it never has. That was experimental and invasive work done 17 years ago that was NEVER committed to their official ports repository. In fact, not a single browser on FreeBSD uses Capsicum or any form of sandboxing _at all_.

https://github.com/rwatson/chromium-capsicum

https://www.freshports.org/www/chromium/

https://cgit.freebsd.org/ports/log/www/chromium/Makefile?qt=...

Contrast that with OpenBSD, where the Chromium port has used pledge(2) since January 2016, and unveil(2) since 2018. Both are enabled by default. Mozilla Firefox ports also use both pledge and unveil since 2018-2019, with refinements over the years.

https://marc.info/?l=openbsd-ports-cvs&m=145211683609002&w=2

https://marc.info/?l=openbsd-ports-cvs&m=153250162128188&w=2

OpenBSD's fork of tcpdump has been privsep for ~22 years, and its packet parser runs with no privileges. It's pledged tightly to "stdio", has no network/filesystem access, and uses OpenBSD-specific innovations like BPF descriptor locking (BIOCLOCK) missing from both FreeBSD/Linux tcpdump today (despite FreeBSD adding the ioctl in 2005).

In the years since it was added, the reason Capsicum has only been applied to a handful of utilities is because it's a tree barren of decades worth of incremental work on privilege separation and security research.

Comment by limagnolia 1 day ago

I would like to see a comparison of capsicum and pledge/unveil. Is capsicum much more difficult to use? Is it inherently less secure?

Comment by brynet 1 day ago

It's very difficult to reason about, for instance compare the OpenSSH sshd sandbox implementations.

https://github.com/openssh/openssh-portable/blob/master/sshd...

https://github.com/openssh/openssh-portable/blob/master/sand...

w/ Capsicum, beyond faffing around with some file descriptors, it's unclear what security cap_enter() adds:

https://github.com/openssh/openssh-portable/blob/master/sand...

Comment by brynet 1 day ago

> EDIT: Article seems to have been updated to remove mention of Chromium.

Archive: https://archive.ph/rLmTq

Comment by PeterWhittaker 1 day ago

Interesting article, but it compares apples to a fruit stand: The approach could be improved by comparing Capsicum to using seccomp in the same way.

Some time ago I wrote a library for a customer that did exactly that: Open a number of resources, e.g., stdin, stdout, stderr, a pipe or two, a socket or two, make the seccomp calls necessary to restrict the use of read/write/etc. to the associated file descriptors, then lock out all other system calls - which includes seccomp-related calls.

Basically, the library took a very Capsicum-like approach of whitelisting specific actions then sealing itself against further changes.

This is a LOT of work, of course, and the available APIs don't make it particularly easy or elegant, but it is definitely doable. I chose this approach because the docker whitelist approach was far too open ended and "uncurated", if you will, for the use-case we were targeting.

In this particular case, I was aided by the fact the library was written to support the very specific use-case of filters running in containers using FIFOs for IPC, logging, and reporting: Every filter saw exactly the same interfaces to the world, so it was relatively easier to lock things down.

Having said that, I wish Linux had a Capsicum-equivalent call, or, even better for the approach I took, a friendlier way to whitelist specific calls.

Comment by thomashabets2 1 day ago

A problem with that approach is that libc can after an upgrade decide to start doing syscalls you were not expecting. Like the first time you call `printf()` it calls `newfstatat()`. Only the first time. Maybe in the future it'll call it more often than that, and then your binary breaks.

I'm not sure what glibc's latest policy is on linking statically, but at least it used to be basically unsupported and bugs about it were ignored. But even if supported, you can't know if it under some configurations or runtime circumstances uses dlopen for something.

Or maybe once you juggle more than X file descriptors some code switches from using `poll()` to using `select()` (or `epoll()`).

My thoughts last time I looked at seccomp: https://blog.habets.se/2022/03/seccomp-unsafe-at-any-speed.h...

Comment by staticassertion 1 day ago

This is a problem but fwiw libc's should be falling back to old system calls. You can block clone3 today and see that your libc will fall back to clone.

Comment by thomashabets2 1 day ago

Yeah. But it still means wandering into de facto unsupported territory in a way that pledge/unveil/landlock does not.

Your example may be true, but I'm guessing it's not a guarantee. Not to mention if one wants to be portable to musl or cosmopolitan libc. The others inherently are more likely to work in a way that any libc would be "unsurprised" by.

Comment by staticassertion 1 day ago

Yeah for sure, it's a real issue. In general, seccomp feels hard to use unless you own your stack top to bottom.

Comment by Someone 1 day ago

> A problem with that approach is that libc can after an upgrade decide to start doing syscalls you were not expecting.

That would break capsicum, too, so I don’t see how that’s a problem when “comparing Capsicum to using seccomp in the same way”.

Comment by thomashabets2 1 day ago

That's the approach I meant by "that approach", the library the parent commenter was talking about writing for a customer. Compare this to Landlock or OpenBSD's pledge/unveil.

Comment by chuckadams 1 day ago

Now that Landlock actually is a thing, have you considered writing another followup? Given what I've seen of landlock, I expect it'll be spicy...

Comment by WalterGR 1 day ago

I took the bait.

“The goal of Landlock is to enable restriction of ambient rights (e.g. global filesystem or network access) for a set of processes. Because Landlock is a stackable LSM [(Linux Security Module)], it makes it possible to create safe security sandboxes as new security layers in addition to the existing system-wide access-controls. ... Landlock empowers any process, including unprivileged ones, to securely restrict themselves.”

https://docs.kernel.org/userspace-api/landlock.html

Comment by thomashabets2 1 day ago

I've actually found it pretty fine. It doesn't have full coverage, but they have a system of adding coverage (ABI versions), and it covers a lot of the important stuff.

The one restriction I'm not sure about is that you can't say "~/ except ~/.gnupg". You have to actually enumerate everything you do want to allow. But maybe that's for the best. Both because it mandates rules not becoming too complex to reason about, and because that's a weird requirement in general. Like did you really mean to give access to ~/.gnupg.backup/? Probably not. Probably best to enumerate the allowlist.

And if you really want to, I guess you can listdir() and compose the exhaustive list manually, after subtracting the "except X".
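A minimal Python sketch of that manual expansion, assuming only direct children of the directory matter. The helper name `allowlist_except` is made up for illustration, and nothing here calls Landlock itself; it only builds the list of paths you would then create allow rules for.

```python
import os


def allowlist_except(base, excluded):
    """Return every direct child of `base` except the names in `excluded`.

    Landlock rules are allow-only, so an "everything except X" policy has to
    be expanded into an explicit list of paths before any rule is created.
    """
    excluded = set(excluded)
    return sorted(
        os.path.join(base, name)
        for name in os.listdir(base)
        if name not in excluded  # exact name match only
    )
```

Calling `allowlist_except(os.path.expanduser("~"), {".gnupg"})` still returns `~/.gnupg.backup` if it exists, since only exact names are subtracted. The list is also a snapshot: anything created in `~` afterwards is not covered.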

I find seccomp unusable and not fit for purpose, but landlock closes many doors.

Maybe you know better? I'd love to hear your take.

Comment by chuckadams 1 day ago

I definitely don't know better, and after taking a few more looks at landlock, I'm not even sure what my objections were, probably got it confused with something else entirely. Confusion and ignorance on my part I guess.

Comment by hrmtst93837 1 day ago

You can make seccomp mimic Capsicum by whitelisting syscalls and checking FD arguments with libseccomp, but that quickly becomes error prone once you factor in syscall variants and helper calls. Read and write take the FD as arg0 while pread and pwrite shift it, and sendfile, splice and io_uring change semantics, and ioctl or fcntl can defeat naive filters, so you wind up with a huge BPF program and still miss corner cases.
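To make the FD-argument check concrete, here is a rough Python sketch that installs a raw seccomp-BPF filter through `prctl(2)` instead of libseccomp. It is not a real allowlist: everything except `write(2)` stays allowed so the interpreter keeps running, and the syscall numbers are hard-coded for x86_64 and aarch64 only. The helper names (`ins`, `restrict_write_to`, `demo`) are invented for this example.

```python
import ctypes
import errno
import os
import platform
import struct

# Constants from <linux/prctl.h>, <linux/seccomp.h> and <linux/audit.h>.
PR_SET_NO_NEW_PRIVS = 38
PR_SET_SECCOMP = 22
SECCOMP_MODE_FILTER = 2
SECCOMP_RET_KILL_PROCESS = 0x80000000
SECCOMP_RET_ERRNO = 0x00050000
SECCOMP_RET_ALLOW = 0x7FFF0000
# machine -> (AUDIT_ARCH_* value, __NR_write); assumption: only these two.
ARCHES = {"x86_64": (0xC000003E, 1), "aarch64": (0xC00000B7, 64)}
LD_W_ABS, JEQ_K, RET_K = 0x20, 0x15, 0x06  # classic BPF opcodes


class SockFprog(ctypes.Structure):  # struct sock_fprog
    _fields_ = [("len", ctypes.c_ushort), ("filter", ctypes.c_void_p)]


def ins(code, jt, jf, k):
    return struct.pack("HBBI", code, jt, jf, k)  # one struct sock_filter


def restrict_write_to(fd):
    """Install a filter so write(2) succeeds only on `fd` (EPERM elsewhere)."""
    audit_arch, nr_write = ARCHES[platform.machine()]
    prog = b"".join([
        ins(LD_W_ABS, 0, 0, 4),            # load seccomp_data.arch
        ins(JEQ_K, 1, 0, audit_arch),      # expected ABI: skip the kill
        ins(RET_K, 0, 0, SECCOMP_RET_KILL_PROCESS),
        ins(LD_W_ABS, 0, 0, 0),            # load seccomp_data.nr
        ins(JEQ_K, 0, 3, nr_write),        # not write(2): jump to allow
        ins(LD_W_ABS, 0, 0, 16),           # low 32 bits of args[0] (little-endian)
        ins(JEQ_K, 1, 0, fd),              # the permitted descriptor: allow
        ins(RET_K, 0, 0, SECCOMP_RET_ERRNO | errno.EPERM),
        ins(RET_K, 0, 0, SECCOMP_RET_ALLOW),
    ])
    buf = ctypes.create_string_buffer(prog, len(prog))
    fprog = SockFprog(len(prog) // 8, ctypes.addressof(buf))
    libc = ctypes.CDLL(None, use_errno=True)
    zero = ctypes.c_ulong(0)
    if libc.prctl(PR_SET_NO_NEW_PRIVS, ctypes.c_ulong(1), zero, zero, zero) != 0:
        raise OSError(ctypes.get_errno(), "PR_SET_NO_NEW_PRIVS failed")
    if libc.prctl(PR_SET_SECCOMP, ctypes.c_ulong(SECCOMP_MODE_FILTER),
                  ctypes.byref(fprog), zero, zero) != 0:
        raise OSError(ctypes.get_errno(), "PR_SET_SECCOMP failed")


def demo():
    """Fork a child, filter it, and report what each pipe received."""
    allowed_r, allowed_w = os.pipe()
    other_r, other_w = os.pipe()
    pid = os.fork()
    if pid == 0:  # child: sandbox itself, then probe both descriptors
        status = 1
        try:
            restrict_write_to(allowed_w)
            os.write(allowed_w, b"ok")
            try:
                os.write(other_w, b"leak")
            except PermissionError:
                status = 0  # the filter rejected the second descriptor
        finally:
            os._exit(status)
    os.close(allowed_w)
    os.close(other_w)
    _, wait_status = os.waitpid(pid, 0)
    allowed, other = os.read(allowed_r, 16), os.read(other_r, 16)
    os.close(allowed_r)
    os.close(other_r)
    child_ok = os.WIFEXITED(wait_status) and os.WEXITSTATUS(wait_status) == 0
    return allowed, other, child_ok
```

Even this tiny filter shows the problem: `pwrite64`, `writev`, `sendfile` and io_uring are untouched, so a write to the other descriptor is still possible through any of them until each one gets its own rule.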

Capsicum attaches rights to descriptors and gives kernel enforced primitives like cap_enter and cap_rights_limit, so delegation is explicit and easier to reason about. If you want Linux parity, use libseccomp to shrink the syscall surface, combine it with mount and user namespaces and Landlock for filesystem constraints, and design your app around FD based delegation instead of trying to encode every policy into BPF.

Comment by adiabatichottub 1 day ago

One question I've always had about these capability systems is: why isn't there a way to set capabilities from the parent process when execing? Why trust a program to set its own capabilities? I know that having a process set capabilities on itself doesn't break existing tools, but it seems like if you really wanted a robust system it would make sense to have the parent process, the user's shell for example, set the capabilities on its children, and have those capabilities be inheritable so the child could spawn other processes with the same or fewer capabilities (if it's allowed to do that at all). Is there an existing system that works this way, in or outside of the UNIX family? Or maybe some research paper written on the subject? I'd love to know.

Comment by g0xA52A2A 1 day ago

You may be interested in OpenBSD's pledge[1][2][3].

> Why trust a program to set its own capabilities?

An example may be that a program starts needing a wide range of capabilities but can then ratchet down to a reduced set once running, aka "privdrop".

> why isn't there a way to set capabilities from the parent process when execing?

There have been replies about other systems, so just to stick with pledge: it provides the ability to set "execpromises" to do this.

[1] https://man.openbsd.org/pledge

[2] https://www.openbsd.org/papers/eurobsdcon2017-pledge.pdf

[3] https://www.openbsd.org/papers/BeckPledgeUnveilBSDCan2018.pd...

Comment by adiabatichottub 1 day ago

I think you're talking about "execpromises"?[1] I'll have to study it a bit.

[1] https://bsdb0y.github.io/posts/openbsd-intro-to-update-on-pl...

Comment by somat 14 hours ago

I am less sure about the others (Capsicum, seccomp), but the threat model for OpenBSD's pledge is not that you don't trust the process. You do trust the process; otherwise you would not be running it. The threat pledge is trying to address is the process getting corrupted by a malicious agent while it is running, and the goal is to keep the fallout minimal. Under this threat model the process notifies the kernel to shed capabilities as soon as it no longer needs them, something that can only be done in-process.

OpenBSD had a neat external syscall sandboxing system at one point (systrace); it was removed for reasons I don't fully understand. But I think it boils down to "optional security isn't": external policies are hard to maintain and problematic, and the first thing you do is disable them (cough selinux cough).

Comment by toast0 1 day ago

I've only really messed with capsicum. You can certainly cap_enter between fork and exec, but depending on exactly what your target does, it's really not simple to do anything meaningful beyond the basic capsicum mode without changes to the program.

The way capabilities usually work is you more or less turn off the usual do whatever you want syscalls, and have to do restricted things through FDs that have the capability to do them. So like, no more open any path, you have to use openat with a FD in your directory of interest. But that requires the program to understand how to use the capabilities and how to be passed them. It's not something that you can just impose.
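Roughly what that looks like from the program's side, as a Python sketch: `os.open(..., dir_fd=...)` maps to `openat(2)`. The function name `read_via_dirfd` is hypothetical, and plain Python on Linux enforces nothing here; under Capsicum's capability mode it is the kernel that rejects absolute paths and `..` escapes.

```python
import os


def read_via_dirfd(dir_fd, name):
    """Read a file by name, resolved relative to an open directory descriptor."""
    # Equivalent to openat(dir_fd, name, O_RDONLY): no global path lookup.
    fd = os.open(name, os.O_RDONLY, dir_fd=dir_fd)
    try:
        return os.read(fd, 4096)
    finally:
        os.close(fd)
```

The caller has to be handed `dir_fd` from somewhere (for example `os.open(directory, os.O_RDONLY | os.O_DIRECTORY)` before entering the sandbox), which is exactly the part an unmodified program does not know how to do.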

My understanding of SELinux is that it can be imposed on a program without the program's knowledge, because it's more or less matching rules for syscalls: rather than giving a restricted FD to use with openat, you restrict the options for open.

Comment by gizmo686 14 hours ago

You can mostly do that with Seccomp on Linux (I have no experience with FreeBSD).

Child processes inherit the restrictions from the parent. You can therefore have the parent fork, set up its rules, then exec. This is exactly how syscall filtering (and a bunch of other lockdowns) are implemented in systemd.
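A small Python sketch of that fork, restrict, exec pattern. As a stand-in for a full seccomp filter it only sets `no_new_privs` (the flag unprivileged seccomp requires first), because that is easy to read back in the child. `lock_down`, `run_locked_down` and `PROBE` are names made up for the example, and this is not a description of how systemd itself does it.

```python
import ctypes
import subprocess
import sys

PR_SET_NO_NEW_PRIVS = 38  # from <linux/prctl.h>


def lock_down():
    """Runs in the child after fork() and before exec()."""
    libc = ctypes.CDLL(None, use_errno=True)
    zero = ctypes.c_ulong(0)
    if libc.prctl(PR_SET_NO_NEW_PRIVS, ctypes.c_ulong(1), zero, zero, zero) != 0:
        raise OSError(ctypes.get_errno(), "PR_SET_NO_NEW_PRIVS failed")


def run_locked_down(argv):
    """Start an unmodified program with the restriction already in place."""
    return subprocess.run(argv, preexec_fn=lock_down,
                          capture_output=True, text=True, check=True)


# Child program: knows nothing about the sandbox, it only reports the flag
# via prctl(PR_GET_NO_NEW_PRIVS), which is option 39.
PROBE = ("import ctypes; z = ctypes.c_ulong(0); "
         "print(ctypes.CDLL(None).prctl(39, z, z, z, z))")
```

Running `run_locked_down([sys.executable, "-c", PROBE])` shows the child seeing the flag as set even though it never asked for it; a seccomp filter installed in `lock_down` would be inherited across the exec the same way.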

Comment by harporoeder 1 day ago

This is essentially what containers are. Bubblewrap / Docker / Podman. I think the primary issue is very few applications on Desktop systems are actually designed with sandboxing in mind unlike say something on a phone.

Comment by adiabatichottub 1 day ago

I'm not terribly familiar with Linux container systems, cgroups and all that, but I have been down the rabbit-hole with FreeBSD's jails, and I definitely wouldn't call them a capabilities system. You can lock down the environment quite a bit, and limit or even virtualize the network stack, but you can't say, "Here process, have your standard IO streams and nothing more. Go forth and compute." The process isn't blind to its environment. You're still in the same basic UNIX user security model. It's really somewhere between chroot and full virtualization.

Comment by harporoeder 1 day ago

A default container seccomp profile will let you do quite a few things, but you can use a different profile (some JSON) and limit it to just a few system calls if you want, such as doing IO on already-open FDs without the ability to open new ones. I think the runtime opens the FDs before the child process starts and they are inherited.

Comment by black_knight 1 day ago

Answering without reading TFA here. But I am familiar with capsicum.

But I am pretty sure you CAN get your capabilities from a parent process using capsicum, since they are just file descriptors.
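A quick sketch of that idea in Python: the parent opens the file and the child only ever sees a descriptor number. The names `CHILD` and `run_with_descriptor` are hypothetical, and this shows just the inheritance; on FreeBSD the parent could additionally narrow the descriptor with `cap_rights_limit(2)` before the exec, which this sketch does not do.

```python
import os
import subprocess
import sys

# Child program: it is told a descriptor number and never opens a path itself.
CHILD = ("import os, sys; "
         "sys.stdout.write(os.read(int(sys.argv[1]), 4096).decode())")


def run_with_descriptor(path):
    """Open `path` in the parent and hand only the descriptor to the child."""
    fd = os.open(path, os.O_RDONLY)
    try:
        done = subprocess.run(
            [sys.executable, "-c", CHILD, str(fd)],
            pass_fds=(fd,),  # keep this one descriptor open across exec
            capture_output=True, text=True, check=True,
        )
    finally:
        os.close(fd)
    return done.stdout
```

The child still has to be written to accept a descriptor instead of a path, which is the same "the program has to cooperate" caveat raised elsewhere in the thread.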

Comment by thomashabets2 1 day ago

Yeah I'm not a fan of seccomp (https://blog.habets.se/2022/03/seccomp-unsafe-at-any-speed.h...).

On Linux I understand that Landlock is the way to go.

Comment by 0x457 1 day ago

Landlock right now doesn't offer a lot for things that aren't file system access. Other than that it's great, you can have different restrictions per-thread if you want to.

Comment by thomashabets2 1 day ago

Yeah, but the file system is where I put most of my files. :-)

Between file system, bind/connect, and sending signals, that covers most of it. Probably the biggest remaining risk is any unpatched bugs in the kernel itself.

So one would need to first gain execution in the process, and then elevate that access inside the kernel, in a way that doesn't just grant you root but still Landlocked, and with a much smaller effective syscall attack surface. Like even if there's a kernel bug in ioctl on devs, landlock can turn that off too.

Comment by 0x457 1 day ago

I agree, but it would be nice if it had similar fine-grained APIs for network calls. That said I solved it by using LD_PRELOAD and socks5. It's not perfect, but good enough.

Landlock is one of my favorite Linux-only APIs; it almost feels like it was FreeBSD's answer to some Linux feature.

Comment by littlestymaar 1 day ago

I've seen AI-written blog posts before, but this is one step above: the entire blog (~90 articles) has been AI generated over the past three months.

I already find it very frustrating that most open source projects spawning on HN's front page are resume-boosting AI slop but if blogs start being the same the internet is definitely dead.

Edit: it doesn't even look like it's resume-boosting in this case, the “person” behind it doesn't even appear to exist. We can only speculate about the intent behind this.

Comment by smartbit 11 hours ago

The person https://www.linkedin.com/in/vvoss/ seems to exist, I even have a mutual linkedin connection. What makes you think the “person” behind it doesn't even appear to exist?

Comment by littlestymaar 11 hours ago

I can't log-in to linkedin right now, but here's a few things:

- the profile picture is almost certainly (like 99% certainty) AI-generated (I can even tell you it's ChatGPT-generated, the style is way too characteristic to miss).

- the LinkedIn profile shows prolific activity for the past few days, but almost nothing before that, I'm not sure the profile existed before.

- the github account is just 2 weeks old.

Having a mutual connection doesn't mean much; the interesting question would be who the mutual is and for how long he has been a connection. It's not hard to get to 500 LinkedIn connections in a few days, you just need to add headhunters and other hiring specialists, they'll never refuse an invitation from a profile that looks interesting. They could also have added someone who interacted with their LinkedIn slop submission, making the person more likely to accept the invitation.

Comment by renox 13 hours ago

And the Chrome capsicum hallucination got me..

Comment by shirro 1 day ago

It is getting more difficult to research now. Increasingly I just grab the source code locally and don't bother with the browser. Every search returns pages of wordy AI generated docs. At best they restate the code. At worst they read like badly written brochures. I am avoiding any project that doesn't have a long history. Large, feature packed projects that appeared out of nowhere on github with a single commit with no history or users are essentially stolen code that has been machine translated to obscure the original authors' work.

I hate becoming the old person shaking their fist at the sky but the AI bros have just gone too far. I don't know why there isn't a bigger political and social movement against them. I would sign up in an instant to see their companies and practices regulated out of existence.

Comment by ruslan 1 day ago

Excuse me for being ignorant, is Seccomp what SELinux is based on?

Also, what are some well-known pieces of software that use Capsicum on FreeBSD? Can someone name a few?

Comment by gizmo686 14 hours ago

No. SELinux is based on the Linux Security Module framework, which places explicit hooks at key points within the kernel.

They also operate under pretty fundamentally different philosophies. Seccomp is based on a program dropping its own permissions. SELinux is based on a system integrator writing an ahead of time policy restricting what a program can do.

Comment by kevincloudsec 1 day ago

subtraction vs filtration is the right framing even if the article is slop. removing capabilities is structurally different from filtering syscalls because the set of things to filter grows every kernel release but the set of things a process actually needs doesn't.

Comment by jmclnx 1 day ago

This site is a perfect example showing why people are complaining about grey text, to me it is unreadable. See:

https://news.ycombinator.com/item?id=47268574

Comment by dddddaviddddd 1 day ago

And without Javascript enabled, the page refreshes in a loop!

Comment by szszrk 1 day ago

I can't read it normally even on 300% zoom. Somehow even reading mode is broken, due to diagrams being rendered in browser - I did not expect that.

But hey, it's a game!

Comment by icedchai 1 day ago

The font and color combination is terrible. It looks blurry to me, even at high zoom.

Comment by szszrk 1 day ago

Game in background doesn't help either.

It reminds me of the pinnacle of design - Microsoft Authenticator. On Android, out of the blue, it displays a global overlay to select one of the 3 numbers to confirm login.

The overlay is ... transparent.

Comment by littlestymaar 1 day ago

You're not missing anything, the entire blog is AI slop.

Comment by szszrk 1 day ago

I'd love to hear this explained. Deeply.

The UI is fun but unreadable, but content is solid. Explain how this is slop please.

Comment by capnrefsmmat 1 day ago

Several reasons:

1. The post mainly reiterates a single idea (Capsicum enumerates what the process can do, seccomp provides a configurable filter) in many different ways. There is not much actual depth, code samples notwithstanding. Nothing on why different designs were chosen, how easy each is to use, outcomes besides the Chrome example, etc.

2. There are a lot of AI writing tells, like staccato sentences, parallelism ("Same browser. Same threat model. Same problem."), pointless summary tables, "it's not X, it's Y" contradiction ("This is not a bug. It is the original Unix security model"), etc.

3. The author has roughly a blog post a day, all in the same writing style and on widely varied topics. Unless the author has deep expertise on a remarkably wide range of topics and spends all their time writing, these can't reflect deep insight or experience, only minimal editing of AI output.

So yes, it's pretty sloppy.

Comment by Jolter 1 day ago

It's not solid. It’s overly long and repetitive.

Comment by Bnjoroge 1 day ago

It's pretty obvious. Lots of LLM signs. Short sentences that keep repeating the same idea. It's not x, it's this. In fact, the entire blog seems to be LLM-generated.

Comment by jajuuka 1 day ago

The game happening at the same time is just distraction central too.

Comment by thedatamonger 1 day ago

so .. if i'm getting this right, this is an article about security, but the author can't be bothered to configure https correctly?

Comment by craftkiller 1 day ago

What'd they get wrong? Firefox and curl aren't reporting any TLS errors for me.

Comment by thedatamonger 1 day ago

  $ dig vivianvoss.net A +short @ns11.infomaniak.ch.
  78.46.78.181

  $ curl -v https://vivianvoss.net/ 2>&1 | tail -3
  * OpenSSL/3.0.13: error:0A00010B:SSL routines::wrong version number
  * Closing connection
  curl: (35) OpenSSL/3.0.13: error:0A00010B:SSL routines::wrong version number

  $ curl -v http://vivianvoss.net/ 2>&1 | grep Location
  < Location: https://www.safebrowse.io/warn.html?url=http://vivianvoss.ne...

  $ whois 78.46.78.181 | grep -i netname
  netname: HETZNER-RZ-NBG-NET

  $ host 78.46.78.181
  181.78.46.78.in-addr.arpa domain name pointer min2max.run.

The domain's authoritative nameserver (Infomaniak) points vivianvoss.net at 78.46.78.181 — a Hetzner box in Germany with rDNS min2max.run. That server redirects HTTP to SafeBrowse.io and responds to TLS handshakes with garbage. Not a local issue, not a DNS hijack — the A record itself is wrong.

Comment by craftkiller 1 day ago

Hmm so oddly enough this works fine for me:

  $ curl -v https://vivianvoss.net/ 2>&1 | tail -3
        <script src="/assets/scripts/perf.js"></script>
    </body>
    </html>

And the logs show it is going to the same address:

  * Established connection to vivianvoss.net (78.46.78.181 port 443) from 172.16.245.55 port 36208

Any chance you're a comcast xfinity customer? Searching for safebrowse.io shows that xfinity "advanced security" does this whole redirect to safebrowse.io.

Unrelated, but the site also returns an AAAA record for an ipv6 address that does not work. So they've misconfigured their server in that regard.

  $ drill vivianvoss.net AAAA  @1.1.1.1
  [...]
  vivianvoss.net. 3600 IN AAAA 2a01:4f8:120:34ad::1
  [...]
  
  $ curl --header 'Host: vivianvoss.net' 'https://[2a01:4f8:120:34ad::1]:443'
  <hangs forever>

  $ curl https://ipv6.google.com
  <works immediately>

Comment by thedatamonger 21 hours ago

Some further digging ...

-------

  $ dig vivianvoss.net A +short @8.8.8.8
  78.46.78.181

  $ curl -v4 https://vivianvoss.net/ 2>&1 | grep -E "Connected|error"
  * Connected to vivianvoss.net (78.46.78.181) port 443
  * OpenSSL/3.0.13: error:0A00010B:SSL routines::wrong version number

  $ curl -s https://ipinfo.io | grep org
  "org": "AS7922 Comcast Cable Communications, LLC",

Same IP you're hitting, same port, but Comcast's xFi Advanced Security seems to be MITMing the connection before TLS completes.

I hate Comcast so much ...