bpf, docs: document open-coded BPF iterators

Extract BPF open-coded iterators documentation spread out across a few
original commit messages ([0], [1]) into a dedicated doc section under
Documentation/bpf/bpf_iterators.rst. Also make explicit expectation that BPF
iterator program type should be accompanied by a corresponding open-coded BPF
iterator implementation, going forward.

  [0] https://lore.kernel.org/all/20230308184121.1165081-3-andrii@kernel.org/
  [1] https://lore.kernel.org/all/20230308184121.1165081-4-andrii@kernel.org/

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20250509180350.2604946-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-04 20:19:47 +08:00 · 2025-05-09 11:03:50 -07:00 · 2025-05-09 11:03:50 -07:00 · 7220eabff8
commit 7220eabff8
parent c8ce7db0ca
1 changed files with 110 additions and 3 deletions
--- a/Documentation/bpf/bpf_iterators.rst
+++ b/Documentation/bpf/bpf_iterators.rst
@@ -2,10 +2,117 @@
 BPF Iterators
 =============
 --------
 Overview
 --------
----------
-Motivation
----------
+BPF supports two separate entities collectively known as "BPF iterators": BPF
+iterator *program type* and *open-coded* BPF iterators. The former is
+a stand-alone BPF program type which, when attached and activated by user,
 will be called once for each entity (task_struct, cgroup, etc) that is being
 iterated. The latter is a set of BPF-side APIs implementing iterator
 functionality and available across multiple BPF program types. Open-coded
 iterators provide similar functionality to BPF iterator programs, but give
 more flexibility and control to all other BPF program types. BPF iterator
 programs, on the other hand, can be used to implement anonymous or BPF
 FS-mounted special files, whose contents are generated by an attached BPF
 iterator program, backed by seq_file functionality. Both are useful depending on
 specific needs.
 When adding a new BPF iterator program, it is expected that similar
 functionality will be added as an open-coded iterator for maximum flexibility.
 It's also expected that iteration logic and code will be maximally shared and
 reused between the two iterator API surfaces.
 ------------------------
 Open-coded BPF Iterators
 ------------------------
 Open-coded BPF iterators are implemented as tightly-coupled trios of kfuncs
 (constructor, next element fetch, destructor) and an iterator-specific type
 describing on-the-stack iterator state, which is guaranteed by the BPF
 verifier to not be tampered with outside of the corresponding
 constructor/destructor/next APIs.
 Each kind of open-coded BPF iterator has its own associated
 struct bpf_iter_<type>, where <type> denotes a specific type of iterator.
 bpf_iter_<type> state needs to live on the BPF program stack, so make sure
 it's small enough to fit there. For performance reasons it's best to avoid
 dynamic memory allocation for iterator state and to size the state struct big
 enough to fit everything necessary. If that is not feasible, dynamic memory
 allocation is a way to bypass BPF stack limitations. Note that the state
 struct size is part of the iterator's user-visible API, so changing it will
 break backwards compatibility. Be deliberate about designing it.
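
 As a rough illustration of fixing the state size up front, the userspace
 sketch below pairs an opaque public struct with a private layout and checks
 at compile time that the two agree. The names `bpf_iter_demo`,
 `bpf_iter_demo_kern` and `demo_state_size()` are hypothetical and exist only
 for this example; it is one possible arrangement, not a required one.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical public (user-visible) state: opaque, fixed-size, 8-byte
 * aligned. Its size is part of the API, so growing it later would break
 * existing users. */
struct bpf_iter_demo {
	uint64_t __opaque[1];
} __attribute__((aligned(8)));

/* Hypothetical private layout that the implementation overlays on the
 * opaque storage above. */
struct bpf_iter_demo_kern {
	int cur; /* last value handed out */
	int end; /* exclusive upper bound */
} __attribute__((aligned(8)));

/* Compile-time checks: the private layout must fit the public one exactly,
 * and the public size must be a positive multiple of 8 bytes. */
_Static_assert(sizeof(struct bpf_iter_demo_kern) == sizeof(struct bpf_iter_demo),
	       "private state must be the same size as public state");
_Static_assert(sizeof(struct bpf_iter_demo) > 0 &&
	       sizeof(struct bpf_iter_demo) % 8 == 0,
	       "state size must be a positive multiple of 8");

/* Helper for this sketch only: reports the public state size in bytes. */
static size_t demo_state_size(void)
{
	return sizeof(struct bpf_iter_demo);
}
```

 Keeping the public struct opaque leaves room to change the private fields
 later, as long as the overall size stays the same.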
 All kfuncs (constructor, next, destructor) have to be named consistently as
 bpf_iter_<type>_{new,next,destroy}(), respectively. <type> represents iterator
 type, and iterator state should be represented as a matching
 `struct bpf_iter_<type>` state type. Also, all iter kfuncs should have
 a pointer to this `struct bpf_iter_<type>` as the very first argument.
 Additionally:
  - Constructor, i.e., `bpf_iter_<type>_new()`, can have an arbitrary number
  of extra arguments. Return type is not enforced either.
  - Next method, i.e., `bpf_iter_<type>_next()`, has to return a pointer
  type and should have exactly one argument: `struct bpf_iter_<type> *`
  (const/volatile/restrict and typedefs are ignored).
  - Destructor, i.e., `bpf_iter_<type>_destroy()`, should return void and
  should have exactly one argument, similar to the next method.
  - `struct bpf_iter_<type>` size is enforced to be positive and
  a multiple of 8 bytes (to fit stack slots correctly).
 Such strictness and consistency make it possible to build generic helpers that
 abstract away important, but boilerplate, details so that open-coded iterators
 can be used effectively and ergonomically (see libbpf's bpf_for_each() macro).
 These rules are enforced by the kernel at kfunc registration time.
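
 To show why the naming rules matter, here is a simplified userspace sketch of
 a generic loop macro built purely from the bpf_iter_<type>_{new,next,destroy}
 convention. The `demo` iterator and the `demo_for_each()` and `demo_sum()`
 names are hypothetical stand-ins, not libbpf or kernel APIs, and the macro
 assumes the GCC/Clang `cleanup` attribute is available.

```c
#include <stddef.h>

/* Hypothetical "demo" iterator over the integer range [start, end).
 * State is left non-opaque here for brevity. */
struct bpf_iter_demo {
	int next; /* next value to hand out */
	int end;  /* exclusive upper bound */
	int val;  /* storage for the element returned by next() */
} __attribute__((aligned(8)));

static int bpf_iter_demo_new(struct bpf_iter_demo *it, int start, int end)
{
	if (start > end) {
		it->next = it->end = 0; /* invalid range: empty iterator */
		return -1;
	}
	it->next = start;
	it->end = end;
	return 0;
}

static int *bpf_iter_demo_next(struct bpf_iter_demo *it)
{
	if (it->next >= it->end)
		return NULL;
	it->val = it->next++;
	return &it->val;
}

static void bpf_iter_demo_destroy(struct bpf_iter_demo *it)
{
	it->next = it->end = 0;
}

/* Because names follow bpf_iter_<type>_{new,next,destroy}, one macro can
 * serve any iterator type via token pasting. The cleanup attribute runs the
 * destructor whenever the loop scope is left, including via break. */
#define demo_for_each(type, cur, ...)						\
	for (struct bpf_iter_##type iter_state_					\
		__attribute__((cleanup(bpf_iter_##type##_destroy))),		\
	     *init_once_ __attribute__((unused)) =				\
		(bpf_iter_##type##_new(&iter_state_, __VA_ARGS__), (void *)0);	\
	     ((cur) = bpf_iter_##type##_next(&iter_state_)) != NULL; )

/* Usage: sum all values in [start, end). */
static int demo_sum(int start, int end)
{
	int sum = 0;
	int *v;

	demo_for_each(demo, v, start, end)
		sum += *v;
	return sum;
}
```

 The caller never spells out the constructor, next, or destructor calls; the
 macro derives all three from the iterator type name alone.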
 Constructor/next/destructor implementation contract is as follows:
  - constructor, `bpf_iter_<type>_new()`, always initializes iterator state on
    the stack. If any of the input arguments are invalid, constructor should
    make sure to still initialize it such that subsequent next() calls will
    return NULL. I.e., on error, *return error and construct empty iterator*.
    Constructor kfunc is marked with KF_ITER_NEW flag.
  - next method, `bpf_iter_<type>_next()`, accepts a pointer to iterator state
    and produces an element. Next method should always return a pointer. The
    contract with the BPF verifier is that next method *guarantees* that it
    will eventually return NULL when elements are exhausted. Once NULL is
    returned, subsequent next calls *should keep returning NULL*. Next method
    is marked with KF_ITER_NEXT (and should also have KF_RET_NULL, as it is
    a NULL-returning kfunc, of course).
  - destructor, `bpf_iter_<type>_destroy()`, is always called once, even if
    the constructor failed or next returned nothing. Destructor frees up any
    resources and marks stack space used by `struct bpf_iter_<type>` as usable
    for something else. Destructor is marked with KF_ITER_DESTROY flag.
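
 The contract above can be mimicked in plain userspace C. The sketch below is
 only a model: the `demo` iterator and `demo_count()` are hypothetical, and no
 kfunc flags or verifier are involved. It shows a constructor that returns an
 error yet still leaves an empty iterator behind, a next method that keeps
 returning NULL once exhausted, and a destructor called exactly once on every
 path.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical "demo" iterator over the integer range [start, end). */
struct bpf_iter_demo {
	int next; /* next value to hand out */
	int end;  /* exclusive upper bound */
	int val;  /* storage for the element returned by next() */
} __attribute__((aligned(8)));

/* Constructor: always initializes state. On bad input it returns an error
 * AND constructs an empty iterator. */
static int bpf_iter_demo_new(struct bpf_iter_demo *it, int start, int end)
{
	if (start > end) {
		it->next = it->end = 0;
		return -EINVAL;
	}
	it->next = start;
	it->end = end;
	return 0;
}

/* Next: returns a pointer to the current element, or NULL when done.
 * Once NULL is returned, every later call returns NULL as well. */
static int *bpf_iter_demo_next(struct bpf_iter_demo *it)
{
	if (it->next >= it->end)
		return NULL;
	it->val = it->next++;
	return &it->val;
}

/* Destructor: releases resources (none in this model) and resets state. */
static void bpf_iter_demo_destroy(struct bpf_iter_demo *it)
{
	it->next = it->end = 0;
}

/* Caller side: counts elements; returns -1 if next() ever "revives"
 * after exhaustion. The constructor's result is reported through *err. */
static int demo_count(int start, int end, int *err)
{
	struct bpf_iter_demo it;
	int n = 0;

	*err = bpf_iter_demo_new(&it, start, end);
	while (bpf_iter_demo_next(&it))
		n++;
	if (bpf_iter_demo_next(&it)) /* must still be NULL */
		n = -1;
	bpf_iter_demo_destroy(&it); /* exactly once, even after an error */
	return n;
}
```

 Note that the caller does not need a separate error path: a failed
 constructor simply yields zero iterations, and the destructor call stays
 unconditional.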
 Any open-coded BPF iterator implementation has to implement at least these
 three methods. It is enforced that for any given type of iterator only
 applicable constructor/destructor/next are callable. I.e., verifier ensures
 you can't pass number iterator state into, say, cgroup iterator's next method.
 From a 10,000-foot BPF verification point of view, next methods are the points
 of forking a verification state, which are conceptually similar to what
 verifier is doing when validating conditional jumps. Verifier branches at the
 `call bpf_iter_<type>_next` instruction and simulates two outcomes: NULL
 (iteration is done) and non-NULL (new element is returned). NULL is simulated
 first and is supposed to reach exit without looping. After that non-NULL case
 is validated and it either reaches exit (for trivial examples with no real
 loop), or reaches another `call bpf_iter_<type>_next` instruction with the
 state equivalent to already (partially) validated one. State equivalency at
 that point means we technically are going to be looping forever without
 "breaking out" of the established "state envelope" (i.e., subsequent
 iterations don't add any new knowledge or constraints to the verifier state,
 so running 1, 2, 10, or a million of them doesn't matter). But taking into
 account the contract stating that iterator next method *has to* return NULL
 eventually, we can conclude that loop body is safe and will eventually
 terminate. Given we validated logic outside of the loop (NULL case), and
 concluded that loop body is safe (though potentially looping many times),
 verifier can claim safety of the overall program logic.
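
 The convergence argument can be pictured with a deliberately tiny userspace
 toy. It is not the kernel's algorithm: the abstract state is a single
 integer, state equivalence is plain equality, and every name here
 (`toy_check_loop()`, the `toy_body_*` functions, `TOY_MAX_STATES`) is
 hypothetical. It only shows how exploration stops once the non-NULL branch
 returns to the next call in a state that was already checked.

```c
#include <stdbool.h>

#define TOY_MAX_STATES 16 /* arbitrary exploration budget for this toy */

/* Abstract effect of one pass through the loop body on the toy state. */
typedef int (*toy_body_fn)(int state);

/* Body that adds no new knowledge: state is unchanged. */
static int toy_body_stable(int s)     { return s; }
/* Body whose effect saturates after a few passes. */
static int toy_body_saturating(int s) { return s < 3 ? s + 1 : 3; }
/* Body that changes state on every pass and never settles. */
static int toy_body_growing(int s)    { return s + 1; }

/* Returns true if the loop converges, i.e. the non-NULL branch arrives back
 * at `call next` in a state already seen there. Returns false if the budget
 * runs out first. *visits counts distinct states explored at `call next`. */
static bool toy_check_loop(int state, toy_body_fn body, int *visits)
{
	int seen[TOY_MAX_STATES];
	int nseen = 0;

	*visits = 0;
	for (;;) {
		/* We are at `call next` with `state`. */
		for (int i = 0; i < nseen; i++)
			if (seen[i] == state)
				return true; /* equivalent state already checked */
		if (nseen == TOY_MAX_STATES)
			return false; /* budget exhausted, give up */
		seen[nseen++] = state;
		(*visits)++;
		/* NULL outcome: loop exits, nothing further to explore here.
		 * Non-NULL outcome: apply the loop body, return to next. */
		state = body(state);
	}
}
```

 In this toy, a body that leaves the state unchanged converges after a single
 visit, a saturating body converges after a handful, and a body that keeps
 producing new states exhausts the budget.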
 ------------------------
 BPF Iterators Motivation
 ------------------------
 There are a few existing ways to dump kernel data into user space. The most
 popular one is the ``/proc`` system. For example, ``cat /proc/net/tcp6`` dumps