Introduction
It is extremely common to integrate multiple languages in a single program or system. This might happen because different languages are better suited to different components, because new code must work with old code written in an older language, or because a new language is being adopted incrementally.
In some cases, different languages can be isolated in different processes (either on the same machine or different ones); language interoperation is then just a matter of finding compatible IPC solutions. Where this isn't feasible, libraries produced by different compilers will be linked together into the same executable and run together in the same process.
In the Rust world, the most common foreign language to interoperate with is C. Interop between Rust and C is easier than much other language interop because neither language has a required runtime, Rust has built-in support for interoperating with C and for the C ABI, and there are good supporting tools. However, there are many rough edges, and dealing with the difference in safety invariants is an involved task.
Stepping back, there are two aspects of Rust interop: the mechanics of aligning runtime semantics across different languages, and ensuring Rust's safety invariants are upheld. With interop between Rust and C, the former is mostly solved by tooling and built-in features. For interop between Rust and C++ or other languages, there is more work to do (work which is often specific to a project, rather than being a reusable solution). The safety aspect is solved by designing a safe API for foreign components. Some techniques for this are generic and apply to most situations, some are specific to the constraints of a project.
TODO - organisation of docs
Overview
When you need Rust code to work with code written in another language (we'll often call this a foreign language), there are two ways to do it: the code can be kept in different executables running in different processes, or the code can be linked together and run in the same process. Keeping code in different processes makes the language-specific issues much easier, but communication must happen by IPC, which can complicate design and hurt performance. Rust has good support for several IPC mechanisms, so this is a good option where it is possible. If your system requires closer integration, then code must be separately compiled and linked together. Code written in different languages must then interoperate; that is, functions are called across the language boundary and data is passed from one language to another. Such interop is the subject of these docs.
A high-level view of interop
A programming language gives semantics to source code which is compiled to machine code. When we have two programming languages, we have two different sets of semantics which are compiled differently into machine code. For the programming languages to interoperate, they must have a shared understanding of the semantics behind the machine code, at least enough for communication. In practice, this means functions must have compatible ABIs and data must have a representation in memory which is known to both languages and compatible with both languages. Representation and ABI combine to produce features like dynamic dispatch of methods and implicit adjustment of types such as referencing or dereferencing pointers.
Furthermore, a programming language's semantics extends beyond the nuts and bolts of ABI and representation to concepts of correctness and validity. For Rust, these semantics give rise to Rust's guarantees around memory safety. For interop, we must ensure that a programming language's expectations of validity are not subverted by foreign code.
Both in the abstract and in the concrete design of interoperating programs, we can think of interop as two layers: a reflection of foreign code into Rust (that is, bindings for foreign functions and Rust declarations of data types which have parallel foreign declarations), and an abstraction layer (which provides some of the higher-level semantics of the foreign language and Rust, and provides a more idiomatic interface).
This guide aims to provide a precise understanding of issues around the mechanical layer of interop. That should facilitate writing tools for generating bindings, or for writing or editing bindings by hand. The abstraction layer of interop is fundamentally a design problem rather than a tooling problem and projects will require mostly unique solutions. This guide will aim to describe design patterns, best practices, and issues to help you solve the design problem.
As with much software design, the key to successful interop is modularity. The more you can isolate code from different languages and the more you can draw strong boundaries, the easier interop will be. The more fine-grained your interop must be, and the more closely foreign code must be integrated, the harder things get.
To give some concrete examples: passing around foreign objects opaquely is much easier than operating on those objects; calling straightforward functions is easy, but calling virtually dispatched functions is more difficult; and accessing data via well-encapsulated functions is much easier than accessing and manipulating data directly.
Other languages
In this guide, we'll mostly cover interop with C and C++ because that is the most common and most fundamental integration. Rust interoperating with other languages is very much possible, but is a less well-trodden path. Different languages have different issues and different levels of existing tooling. As described above, the issues come down to making different languages agree on the semantics of representation, operation, and correctness.
Any language which provides a mechanism for interop with C can also interoperate with Rust. This can be done with an intermediate C library, or by directly binding foreign functions and data types in Rust. In some cases we can do better than this, since the foreign language and Rust might share some features or invariants which are lost when translating to C as an intermediate step.
Old text
TODO most of this is copy/pasted from the old doc, the rest needs updating to the new layout
The good news is that interop is extremely low cost (interop with C is cheap and interop with other languages is no more expensive than interop with C) and the fundamentals (ABI compatibility of many datatypes and functions, extern declarations, etc.) are built in to Rust. In many cases, there is no need for data marshalling or serialization, or adaptation for calling conventions, etc. Calling a Rust function from C or vice versa is no more expensive than calling a C function in a different library (and since LTO works across the language boundary, it can even be as cheap as a call within the same file).
Generally speaking, Rust is ABI-compatible with C. That means that Rust can interoperate with any language which can interoperate with C (though FFI with languages other than C is likely to be more complex and to have some runtime overhead). There is community support for interop with C++, Ruby, JavaScript, and Python. Interop with .NET and Java is supported via P/Invoke and JNI respectively.
TODO:
- mechanics vs safety (FFI types vs idiomatic types)
- levels of abstraction (making FFI ergonomic/idiomatic)
- ffi and safety, and what is the challenge to writing an FFI layer
- differences with other interop (managed langs, etc)
- assessing the feasibility/difficulty of interop
- Kinds of interop - C lib, Rust lib, mixed program, etc.
- using a library vs building a library
Architecture
The more well-defined the boundary between Rust and foreign code, the easier things will be. At the limit, if your Rust and foreign code can live in different processes (i.e., are different programs compiled separately) and can communicate via some form of IPC then you won't have to worry about a lot of the issues with interop at all! Rust has great support for serialization/deserialization, gRPC, and other IPC/RPC technologies which can facilitate this.
If you need Rust and foreign code in the same process, then they should be used in separate 'components' of your design. Do not attempt to have Rust and foreign code interoperate in a fine-grained way within a single component. If you are migrating from another language to Rust, plan the migration on a per-component rather than a per-file basis. It is worth putting some up-front effort into designing the API of these components and the language boundary. As well as the usual API design issues, making the API coarse-grained (i.e., avoiding many calls), using simple datatypes (the more C-like, the better) with simple invariants, and avoiding bidirectional interaction will make FFI issues simpler.
Using a generic FFI option, such as COM/WinRT, is a good option if components can be separate to this extent. You will still have to consider safety issues, but the mechanical issues of corresponding types, etc., are much easier. The windows-rs crate offers good support for COM and WinRT.
In terms of dependencies, Rust code can be either upstream (e.g., R -> C) or downstream (C -> R) of foreign code (it is possible to have many layers of dependencies, e.g., C -> R -> C, but each dependency can be considered separately). It is possible to have Rust code embedded in a foreign library and thus have a bidirectional dependency; however, you should avoid this! It makes the code difficult to manage and build, and makes interop error-prone.
In other words, you can think of interoperating code as either exposing a Rust API to C code (or other languages which interoperate with a C ABI), or as exposing a C API to Rust code. The former is usually encountered when writing a Rust component which can be used from other languages, the latter when new Rust code must interoperate with legacy code.
Using Rust code from C
When designing a Rust library to be used from other languages, the design depends on whether the library is only designed to be used from other languages or if it is meant to be used from Rust code too (and in this case, whether the usage from Rust or from other languages is primary). If the Rust code will only be used from other languages, then design a crate with no public items other than `extern` ones which are C-compatible. If the Rust code must be used from both Rust and other languages, then it is usually better to have a pure Rust crate and a second wrapper crate which provides the C API. If the primary consumer will be Rust code, then design the Rust crate to have a Rust-idiomatic API; the wrapper crate may need to do considerable work to project a C API. If the primary consumer will be other languages, then design the API of the Rust crate to be C-idiomatic (but expressed in Rust), and the wrapper crate can be a thin wrapper (perhaps entirely auto-generated by cbindgen).
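As a rough sketch of the 'pure Rust crate plus wrapper crate' split, the following shows both halves in one file. The names `count_words` and `foo_count_words` and the `-1` error convention are hypothetical choices made for illustration; in a real project the two functions would live in separate crates and the wrapper would also carry the `#[no_mangle]` attribute.

```rust
use std::ffi::{c_char, c_int, CStr};

/// Idiomatic Rust API (this would live in the pure Rust crate).
pub fn count_words(text: &str) -> usize {
    text.split_whitespace().count()
}

/// C-compatible wrapper (this would live in the wrapper crate).
/// Returns the word count, or -1 if `text` is null or not valid UTF-8.
///
/// # Safety
/// If non-null, `text` must point to a NUL-terminated string which stays
/// alive and unmodified for the duration of the call.
pub unsafe extern "C" fn foo_count_words(text: *const c_char) -> c_int {
    if text.is_null() {
        return -1;
    }
    // SAFETY: the caller guarantees a valid NUL-terminated string.
    let c_str = unsafe { CStr::from_ptr(text) };
    match c_str.to_str() {
        // Saturate rather than wrap if the count does not fit in a C int.
        Ok(s) => c_int::try_from(count_words(s)).unwrap_or(c_int::MAX),
        Err(_) => -1,
    }
}
```

The wrapper does all of the pointer and encoding checks, so the Rust crate's API stays free of FFI concerns.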
Using C code from Rust
When wrapping a foreign library for use in Rust, consider writing a first layer in C (especially if the legacy code is C++) with an API better suited for interacting with Rust. Then have a crate which is only bindings of C code into Rust (either hand-written or auto-generated). The next layer is a crate which only has the functionality of the foreign library (i.e., no client logic), but presented in a Rust-idiomatic way. The bindings crate will be all `unsafe`; the idiomatic crate should aim to have a 100% safe API. Clients should only use the idiomatic crate and never use the bindings crate (some advanced usages may require using the bindings in unanticipated ways; however, these clients should create safe abstractions of their own rather than use the bindings directly). If following this pattern, it is common to give the idiomatic Rust crate the same name as the foreign library, and the bindings crate the same name with the `-sys` suffix, e.g., `foo` and `foo-sys`. (On the topic of naming, it is idiomatic to avoid using an `-rs` suffix on any Rust crate: it is nearly always obvious from context that the crate is a Rust library, so `-rs` usually adds nothing.)
------------------------
C/C++ library libfoo
------------------------
C wrapper libfoo-ffi
------------------------
Rust bindings (unsafe) foo-sys
------------------------
Rust wrapper (idiomatic) foo
------------------------
Rust users
------------------------
Building
If you have a mostly Rust project with some foreign libraries, you should use Cargo. If you have a project with only a small amount of Rust, then you will probably want to use the existing build system and will need to find a way to integrate the Rust build into it. Integrating Cargo and rustc with other build systems is a big topic, and this section will only be a brief summary.
To build foreign libraries inside a Cargo project, the usual approach is to orchestrate the foreign builds from build.rs. The cc crate is often used to build C/C++/ASM from a build script.
To build Rust code from a different build system you have several options, depending on your project's constraints. The simplest approach is to have the build system just call `cargo build`. However, this means that the build system treats the whole Rust build as a black box, that Cargo will need network access (or you can vendor the crates, see below), and that if you have multiple Rust crates they will not share dependencies (unless they can all be built with a single Cargo invocation).
Another approach is to use `cargo vendor` to compute and download dependencies and keep these checked in to version control ('vendored'). Building the Rust sub-project can then be handled by the build system, which will call rustc directly.
There are also more sophisticated, part-automated approaches available for some build systems. E.g., cargo-raze for Bazel or reindeer for Buck.
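You will probably want to build a cdylib rather than the default rlib or a dylib. A cdylib is intended to be loaded from other languages: it exposes only the symbols you export with a C ABI, whereas rlib and dylib are meant to be consumed by other Rust code and rely on Rust's unstable ABI.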
It is common to end up with multiple disjoint components in each language within a single project. You probably don't want to 'split' the project by language (e.g., having a single Cargo project for all Rust code or having a high-level 'rust' directory). It is usually better to have independent builds for each component (i.e., one Cargo project for each Rust component and separate sub-projects for non-Rust components), and the main library/application build system composes the output of each sub-project.
As well as promoting more componentized design, this has practical benefits for Cargo feature propagation, dependency versioning, etc. However, it might make builds slower because there is less sharing of artefacts.
Bindings and types
Bindings between Rust and foreign functions can either be hand-written or auto-generated. To generate bindings for C/C++ functions in Rust, use bindgen. To generate bindings for Rust functions in C, use cbindgen. These tools can either be called from build.rs to create bindings on the fly, or used from the command line to generate bindings which can be adjusted and checked in to version control (the latter being a good compromise between generated and hand-written bindings).
Choosing hand-written or generated bindings is a trade-off. Automatically generated bindings are less work, stay up to date if the foreign code changes, and are more likely to be bug-free. Getting all the types right in bindings is sometimes subtle and tricky, and is not checked at compile time. Furthermore, some bindings can be target-dependent, so any approach which does not generate bindings with knowledge of the target platform has an increased likelihood of bugs.
On the other hand, hand-written bindings can sometimes be higher quality since the programmer has more knowledge of how the code is used, and binding-generating tools have limitations including around modularity (bindgen does not expect to run multiple times in a single project and therefore types which are logically the same will have multiple definitions which can lead to incompatibility).
We recommend using auto-generated bindings where possible. In particular, wrapping generated bindings with idiomatic Rust code is less fragile in the face of change or consistency issues than trying to write better bindings by hand.
Another approach, if you really need custom bindings but have a significant amount of code (or target-dependence), is to write your own bindings generator, either from scratch or by forking bindgen. This is more reasonable if you have some source of truth for the generated bindings other than C headers.
Whether bindings are hand-written or auto-generated, they must follow the same rules and idioms.
To call a foreign function from Rust, it must be redeclared in a Rust module inside an `extern` block, e.g.:

    #[link(name = "some_c_library")]
    extern "C" {
        fn callable_from_rust();
    }

To expose a Rust function to C code, declare it using the `extern` keyword in its signature. Any extern function should use the `#[no_mangle]` attribute to prevent name mangling, e.g.:

    #[no_mangle]
    pub extern "C" fn callable_from_c() {
        // function body
    }

In the Rust 2024 edition, the block is written `unsafe extern "C"` and the attribute is written `#[unsafe(no_mangle)]`.
For primitive types (e.g., `long`, `double`) in these bindings, it is recommended to use the type aliases in the libc crate which match with C types. Libc also provides Rust versions of non-primitive types used in the C standard library; windows-rs provides similar Windows-specific types.
Rust integers, floats, and booleans correspond with C equivalents and no conversion is required (see the aliases in libc for the correspondence between Rust and C types). Note that booleans in Rust must be either `0` or `1`. Technically this is true in C/C++ too; however, it is common there to use integers as booleans and to treat any non-zero value as true. You must ensure that a value is `0` or `1` before treating it as a Rust `bool`.
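One way to satisfy the `0` or `1` requirement is to receive the value as a C integer and compare it with zero, rather than declaring the parameter as `bool`. A minimal sketch (the helper names `bool_from_c` and `bool_to_c` are made up for this example):

```rust
use std::ffi::c_int;

/// Converts a C-style truth value into a Rust `bool`.
/// Any non-zero value counts as true, so we compare against zero instead
/// of casting or transmuting the integer.
pub fn bool_from_c(value: c_int) -> bool {
    value != 0
}

/// Converts a Rust `bool` into a C-style truth value (always 0 or 1).
pub fn bool_to_c(value: bool) -> c_int {
    value as c_int
}
```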
Rust raw pointers can correspond with C pointers. Use `std::ffi::c_void` for `void` pointers. 'Opaque pointers' (where the pointee is only used in one language) can be handled trivially. If the pointee is to be accessed from multiple languages, then you must consider the pointee type for compatibility.
Treating objects as opaque is a common idiom for interop (and in C++). For foreign types which should be opaque in Rust, you can use a struct with a single private field which is a zero-sized array. (There used to be advice to use a zero-variant enum for opaque types, but that is no longer recommended: the compiler might assume a zero-variant enum can never be created, which can lead to UB in some circumstances.) If you must pass an opaque struct by value, then you can make it the correct size (though this is obviously fragile). For Rust types which should be opaque in C, you can declare but not define the type. Both bindgen and cxx have built-in support for such opaque types.
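A minimal sketch of the zero-sized-array idiom for a foreign type that Rust only ever handles behind a pointer. `Widget` and `is_null_widget` are hypothetical names used purely for illustration:

```rust
/// Opaque stand-in for a foreign type. The private zero-sized field means
/// code outside this module cannot construct a `Widget` or look inside it.
#[repr(C)]
pub struct Widget {
    _private: [u8; 0],
}

/// Rust code can still pass pointers to the opaque type around and inspect
/// the pointer itself, without ever touching the pointee.
pub fn is_null_widget(ptr: *const Widget) -> bool {
    ptr.is_null()
}
```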
Slices in Rust combine a pointer to data with the length of the slice into a wide pointer. These components can be passed to C for use as an array without any deep conversion. The slice must be disassembled when passed to C, and if an array is passed to Rust, then it can be re-assembled (see the FFI omnibus for details).
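The disassemble/re-assemble steps might look like the sketch below. Both sides are written in Rust here so the example is self-contained; `sum_u32` is a hypothetical function standing in for whatever sits on the other side of the boundary.

```rust
use std::slice;

/// Receives an array as "pointer plus length", as a C caller would pass it.
///
/// # Safety
/// `ptr` must be non-null, aligned, and point to `len` initialised `u32`
/// values which are not mutated for the duration of the call.
pub unsafe extern "C" fn sum_u32(ptr: *const u32, len: usize) -> u64 {
    // Re-assemble the slice from its two components.
    let values = unsafe { slice::from_raw_parts(ptr, len) };
    values.iter().map(|&v| u64::from(v)).sum()
}

/// Disassembles a Rust slice into the pointer and length which cross the boundary.
pub fn sum_via_ffi(values: &[u32]) -> u64 {
    // SAFETY: both components come from a valid slice which outlives the call.
    unsafe { sum_u32(values.as_ptr(), values.len()) }
}
```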
User-defined Rust types (structs, unions, and some enums) can be passed to foreign code and accessed there. You will need to declare structs and unions using `#[repr(C)]` (or rarely `#[repr(packed)]`), and ensure that all field types are C-compatible.
Only enums with no fields are C-compatible. You must specify the type of the discriminant and may want to specify the values of variants.
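To make these rules concrete, here is a sketch of a C-compatible struct and a field-less enum with an explicit discriminant type. The types `Point` and `Colour` are invented for the example. The checked conversion is included because foreign code can hand over any integer, and an enum value with an invalid discriminant is UB in Rust.

```rust
/// Laid out as a C compiler would lay out
/// `struct Point { int32_t x; int32_t y; };`.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Point {
    pub x: i32,
    pub y: i32,
}

/// A field-less enum with an explicit discriminant type and explicit values.
#[repr(u8)]
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Colour {
    Red = 0,
    Green = 1,
    Blue = 2,
}

/// Checked conversion for raw values received from foreign code.
pub fn colour_from_raw(raw: u8) -> Option<Colour> {
    match raw {
        0 => Some(Colour::Red),
        1 => Some(Colour::Green),
        2 => Some(Colour::Blue),
        _ => None,
    }
}
```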
Other Rust types should not be passed to C, unless they will be treated completely opaquely. This includes zero-sized types, trait objects, other dynamically sized types (such as slices and strings without being adapted), tuples, and enums with fields (technically it is possible to share enums with fields which are `#[repr(C)]`, but the correspondence between C and Rust types is complicated and we advise against it).
Consider the traits derived for types which will cross the FFI boundary (e.g., `Send`, `Sync`, `Copy`, `Clone`, `Default`, `Debug`). These can affect the semantics of the types in Rust (e.g., `Copy`), can affect how tools generate bindings, and/or affect the ways in which types must be handled in foreign code (e.g., if a type does not implement `Send` then it must not be moved between threads, even in foreign code where this is not enforced by the compiler). If you're using a tool to generate bindings, the documentation for that tool should have more details.
Error handling
It is never OK to unwind across the FFI boundary, therefore neither Rust panics nor C++ exceptions can be used. Rust's `Result` type is an enum with fields and therefore cannot cross the FFI boundary. This all makes error handling somewhat challenging. I don't think there is a general solution; you basically just have to do whatever fits best with the C code and convert that error handling to idiomatic Rust error handling as part of the wrapping of the FFI bindings into idiomatic Rust (e.g., implement a set of functions and macros to convert a C error code into a Rust `Result`).
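As one possible shape for that conversion, the sketch below maps integer status codes onto a Rust error enum. Everything named here is hypothetical: the `FOO_*` codes, `FooError`, and `foo_halve`, which is written in Rust only as a stand-in for a real foreign function so that the example is self-contained.

```rust
use std::ffi::c_int;

// Hypothetical status codes, as a C header might define them.
const FOO_OK: c_int = 0;
const FOO_ERR_INVALID: c_int = 1;
const FOO_ERR_NOMEM: c_int = 2;

#[derive(Debug, PartialEq)]
pub enum FooError {
    InvalidArgument,
    OutOfMemory,
    Unknown(c_int),
}

/// Converts a C status code into a `Result`.
pub fn check(code: c_int) -> Result<(), FooError> {
    match code {
        FOO_OK => Ok(()),
        FOO_ERR_INVALID => Err(FooError::InvalidArgument),
        FOO_ERR_NOMEM => Err(FooError::OutOfMemory),
        other => Err(FooError::Unknown(other)),
    }
}

/// Stand-in for a foreign function: halves even numbers, writes the result
/// through `out`, and reports failure with a status code.
unsafe extern "C" fn foo_halve(input: c_int, out: *mut c_int) -> c_int {
    if out.is_null() || input % 2 != 0 {
        return FOO_ERR_INVALID;
    }
    unsafe { *out = input / 2 };
    FOO_OK
}

/// Idiomatic wrapper: callers see a `Result`, never a status code.
pub fn halve(input: c_int) -> Result<c_int, FooError> {
    let mut out: c_int = 0;
    // SAFETY: `out` is a valid, writable location for the duration of the call.
    check(unsafe { foo_halve(input, &mut out) })?;
    Ok(out)
}
```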
Safety
All foreign code is considered unsafe by Rust. Therefore, working with foreign code is intimately related to working with unsafe code. If you are writing code which involves FFI you should have a good understanding of unsafe code in Rust. That is a big topic! Too big to cover in depth here, but I'll try and cover some of the basics and some of the interop-specific parts. See the resources below - the Nomicon is probably the best place to start.
Unsafe code does not give the programmer permission to violate Rust's safety invariants. Unsafe code requires the programmer to uphold those invariants rather than relying on the compiler to check them. Safety is not a local property: it is possible to do things in unsafe code which cause runtime errors in safe code. Safety is often subtle and unintuitive to reason about; see this blog post for some examples. The programmer must therefore carefully consider safety for any data which passes the FFI boundary, including how it is accessed in foreign code.
When a function is marked `unsafe` then its whole body is treated as unsafe code. However, there is a big difference between an `unsafe` function and a safe function with an `unsafe` block: the former is unsafe to call, the latter is safe to call. You should make a function `unsafe` if the caller must help maintain safety invariants in any way. Making a function safe (with or without internal unsafety) indicates that the library and compiler will ensure safety with no requirements on the caller.
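The distinction can be seen in a small sketch (both function names are invented for the example). The first function pushes a requirement onto its caller; the second discharges that requirement itself and so can be safe.

```rust
/// Unsafe to call: the caller must guarantee that `ptr` is non-null,
/// aligned, and points to an initialised `i32`.
pub unsafe fn read_unchecked(ptr: *const i32) -> i32 {
    unsafe { *ptr }
}

/// Safe to call: the reference type already guarantees validity, so
/// nothing is required of the caller.
pub fn read_checked(value: &i32) -> i32 {
    // SAFETY: a reference is always non-null, aligned, and initialised.
    unsafe { read_unchecked(value) }
}
```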
Safety invariants must be enforced at the boundary between safe and unsafe code. When interoperating with foreign code that means that safety invariants must be established as part of the FFI boundary. There are several techniques for helping to ensure safety at the boundary:
- runtime assertions (e.g., asserting that a pointer is non-null),
- types (both Rust and foreign types can encode information which can help ensure invariants),
- documentation (clearly documenting safety invariants makes them easier to understand and maintain).
Ultimately, we rely on invariants being upheld in foreign code which the Rust compiler cannot check. This is mostly up to the programmer, but can be helped with the above techniques.
Safety in the context of unsafe Rust specifically means memory safety. This can be divided into a few areas which might feel disjoint:
Uniqueness and mutability invariants around pointers
Rust's key invariant for ensuring memory safety is that all values must be immutable or unique. This property can be ensured statically or dynamically, but must always be upheld. Even in foreign code, this invariant must be respected, at least as far as it is observable to Rust code. I.e., if Rust code has a reference to a value, then foreign code must not mutate that value unless it can be guaranteed that the value cannot be read by the Rust code.
Pointer validity invariants
If a raw pointer may be dereferenced in Rust code or converted to a safe reference, then it must be valid. Since it is usually too late to ensure validity at the point of dereference/conversion, the validity requirement must be well-documented at all points where the pointer is passed, in particular at any FFI boundary. Some aspects of validity can be checked with assertions and the FFI boundary is usually the right place to do that.
Pointer validity includes:
- pointers must be non-null,
- pointers must point to initialised data which has not been deallocated,
- pointers must point to well-aligned data,
- if the size of a value derived from the pointer's type (including any padding) is n bytes, then the pointer must point to at least n bytes from a single allocated object.
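Only some of these properties can be checked at runtime. The sketch below (with a hypothetical helper name, `checked_ref`) checks for null and misaligned pointers at the boundary and leaves the rest as a documented requirement on the caller.

```rust
/// Converts a raw pointer received from foreign code into a shared
/// reference, rejecting null and misaligned pointers.
///
/// # Safety
/// If `ptr` is non-null and aligned, it must point to a live, initialised
/// `T` which is not mutated while the returned reference is in use. These
/// properties cannot be checked here.
pub unsafe fn checked_ref<'a, T>(ptr: *const T) -> Option<&'a T> {
    if ptr.is_null() || (ptr as usize) % std::mem::align_of::<T>() != 0 {
        return None;
    }
    // SAFETY: non-null and aligned were checked above; the caller
    // guarantees the remaining validity requirements.
    Some(unsafe { &*ptr })
}
```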
Thread safety invariants
You must ensure that data which is not `Send` is not passed between threads and data which is not `Sync` is not shared between threads, even in foreign code. Furthermore, if dealing with multi-threaded code, the uniqueness and mutability invariant will be especially difficult to uphold. Therefore, it is easiest if Rust data is always kept on a single thread in foreign code.
Panics
Stack unwinding due to Rust panics, C++ exceptions, or any other cause, must never cross the FFI boundary. On the Rust side, you can use `catch_unwind` to help with this. Note that when catching panics, exceptions, etc., you must ensure that no data is left in an inconsistent state. That is often impossible to achieve, and aborting the thread or process is then the only reasonable behaviour.
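A minimal sketch of using `catch_unwind` at an exported function, assuming the crate is built with the default unwinding panic strategy (with `panic = "abort"` there is nothing to catch). The names `checked_div` and `div_ffi` and the `0`/`-1` status convention are made up for the example.

```rust
use std::ffi::c_int;
use std::panic;

/// Ordinary Rust code which may panic.
fn checked_div(a: i32, b: i32) -> i32 {
    if b == 0 {
        panic!("division by zero");
    }
    a / b
}

/// Entry point intended for foreign callers. Returns 0 and writes the
/// result through `out` on success; returns -1 if `out` is null or if
/// the Rust code panicked.
///
/// # Safety
/// If non-null, `out` must be valid for writing a `c_int`.
pub unsafe extern "C" fn div_ffi(a: c_int, b: c_int, out: *mut c_int) -> c_int {
    if out.is_null() {
        return -1;
    }
    // The closure only captures plain integers, so no state can be left
    // half-updated if it panics.
    match panic::catch_unwind(|| checked_div(a, b)) {
        Ok(value) => {
            unsafe { *out = value };
            0
        }
        Err(_) => -1,
    }
}
```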
Derived safety invariants
Many types have their own invariants required in order to preserve safety. These are usually not exposed to the user, except in `unsafe` functions where some requirements on the caller should be documented. All such requirements must be satisfied even if the function is called from foreign code. In addition, foreign code may be able to create objects in ways which are impossible in Rust (e.g., by deserialization or casting from raw bytes). In these cases, you must ensure all invariants are properly established (this can be difficult, since if these invariants are not user-facing in Rust code they may be poorly documented).
A good example of a 'derived' invariant is utf-8 validity. Rust strings must always be valid utf-8 and this is relied upon to ensure memory safety, even though utf-8 validity is not directly a memory safety issue. Whenever you create a Rust string, you must ensure it is valid utf-8 (see the String docs for details).
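For strings arriving from C, one way to establish the invariant is to go through `CStr` and its checked `to_str` conversion, as in this sketch (the wrapper name `str_from_c` is hypothetical):

```rust
use std::ffi::{c_char, CStr};
use std::str::Utf8Error;

/// Borrows a C string as a Rust `&str`, validating that it is UTF-8.
///
/// # Safety
/// `ptr` must be non-null and point to a NUL-terminated string which
/// stays alive and unmodified for the lifetime `'a`.
pub unsafe fn str_from_c<'a>(ptr: *const c_char) -> Result<&'a str, Utf8Error> {
    // SAFETY: the caller guarantees a valid NUL-terminated string.
    let c_str = unsafe { CStr::from_ptr(ptr) };
    // `to_str` checks UTF-8 validity and returns an error on failure.
    c_str.to_str()
}
```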
Memory management
There are several aspects of the object lifecycle to consider: deallocating memory, calling destructors, and ensuring expected lifetimes of objects. In Rust the object lifecycle is closely tied to ownership, so we discuss these aspects in terms of ownership. The tl;dr is that keeping ownership of an object (in terms of program design, not necessarily Rust types) in the language in which it was created is usually the best strategy.
Independent of FFI, memory must usually be deallocated by the same allocator which allocated it. Without some rather specialist effort, the allocators used from different languages will be different. Therefore, you must deallocate memory in the same language where it was allocated. If objects are passed across the FFI boundary by pointer, and that pointer is morally borrowed, then there is no tidying-up required. If ownership is transferred, then the programmer must keep around a callback to the creating language to deallocate the memory, or pass the object back for destruction.
Note that destructors will not be called automatically in the foreign language, so they must be called explicitly when the object is destroyed.
A common pattern for this is that the foreign language has a wrapper type whose destructor handles calling the creating language's destructor explicitly and calls back into that language to deallocate memory (this pattern works to or from Rust).
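The sketch below illustrates both halves of that pattern under some assumptions. `Counter` and the `counter_*` functions play the role of the creating language's API (here they happen to be Rust functions with a C ABI, handing out a boxed value as a raw pointer). `CounterHandle` plays the role of the wrapper type on the consuming side; in practice it would be written in the foreign language, or in Rust around foreign functions. All names are hypothetical.

```rust
/// Type owned by the creating side and handed out as an opaque pointer.
pub struct Counter {
    count: u64,
}

/// Allocates a `Counter` and transfers ownership to the caller.
pub extern "C" fn counter_new() -> *mut Counter {
    Box::into_raw(Box::new(Counter { count: 0 }))
}

/// # Safety
/// `ptr` must come from `counter_new` and must not have been freed.
pub unsafe extern "C" fn counter_increment(ptr: *mut Counter) -> u64 {
    let counter = unsafe { &mut *ptr };
    counter.count += 1;
    counter.count
}

/// Takes ownership back, runs the destructor, and frees the memory with
/// the allocator which created it.
///
/// # Safety
/// `ptr` must be null or come from `counter_new`, and must not be used again.
pub unsafe extern "C" fn counter_free(ptr: *mut Counter) {
    if !ptr.is_null() {
        drop(unsafe { Box::from_raw(ptr) });
    }
}

/// Wrapper on the consuming side: its destructor calls back into the
/// creating side rather than freeing the memory itself.
pub struct CounterHandle {
    raw: *mut Counter,
}

impl CounterHandle {
    pub fn new() -> Self {
        Self { raw: counter_new() }
    }

    pub fn increment(&mut self) -> u64 {
        // SAFETY: `raw` came from `counter_new` and is freed only in `drop`.
        unsafe { counter_increment(self.raw) }
    }
}

impl Drop for CounterHandle {
    fn drop(&mut self) {
        // SAFETY: `raw` is still live and is never used after this call.
        unsafe { counter_free(self.raw) }
    }
}
```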
If objects are passed by value rather than by pointer, then they must implement `Copy` in Rust. Otherwise they will be copied in the foreign language where Rust assumes they will be moved. Note that objects cannot implement both `Drop` and `Copy`, so you will not need to worry about calling `drop` in this case.
Any object accessed from Rust (whether the object was created in Rust code or not) must abide by Rust's ownership and borrowing discipline. With regards to lifecycle events, this means that destroying a borrowed pointer must not destroy the underlying object, that an owned object must not be destroyed if there are any borrowed pointers to it (or owning pointers to it if there is multiple ownership, e.g., via `Rc`), and that an object should be fully destroyed when it goes out of scope if held by value, or when all owning pointers are destroyed if it is held by pointer. Regarding FFI, this generally requires that documentation is clear about whether raw pointers/C pointers are morally owning or borrowed (and that this is tracked through foreign code), and that the FFI boundary should not transfer ownership when there are extant borrowed pointers to the object.
C++
Interoperating with C++ is much more complicated than interoperating with C. If you follow the advice above to only interoperate at component boundaries and you design your component APIs in a conservative, C-like way (possibly by having a C-like library wrap the C++ one), then Rust/C++ interop can be fine; it is even quite well supported by bindgen. If you must have more fine-grained interop, then things get interesting.
If you can (and plain bindgen is not enough), we recommend using cxx to generate a bridge layer and bindings between Rust and C++ code. autocxx is an extension if you prefer auto-generated bindings.
Quite a lot of C++ features work well across FFI, see the bindgen docs for details. There are more links to docs on C++ interop below. It can be a bit hit and miss figuring out exactly what works and what doesn't and unfortunately some issues are not caught at compile time.
TODO exec summary
Estimating the complexity of interop
This chapter aims to help you estimate the costs and risks of integrating Rust into an existing mixed-language project. You'll be able to make a better judgement if you read the rest of the documentation and can therefore understand the issues more deeply. Hopefully this chapter can give you a framework for estimation, and if you don't have time to dive deeper, then at least give you a rough idea.
This chapter does not aim to help with the question of whether Rust is a good choice for a project, only to cover the integration component of that choice.
Rewriting a project or adding to a project?
Integrating Rust with a foreign language typically happens in one of two scenarios: adding Rust to an existing non-Rust project, or converting an existing non-Rust project into a Rust one. The difference being that in the second case, the goal is for the entire project (or most of it) to be Rust, whereas in the first case, only a small portion may ever be Rust.
When adding to a project, the parameters which affect integration are usually well-known. Typically the project will not change much at the same time as adding a Rust portion, so there is not much opportunity to change the cost or risk of adding Rust. In this case, the more self-contained the Rust portion is, the better. Adding a little Rust all over the project is a much riskier undertaking than adding a single Rust component with a limited interface.
When converting a project to Rust, as good software engineers, we want to do the conversion as incrementally and iteratively as possible. However, this has the requirement that existing foreign code integrates with new Rust code. The more incremental the conversion, the more fine-grained the integration must be and this will make interop more difficult. From the perspective of integration, the best-case scenario is to do a clean rewrite with zero temporary interop (taking advantage of existing test suites, documentation, requirements, etc. but not doing an incremental rewrite). Where this is not possible, doing a rewrite one component at a time, rather than one function, data structure, or file at a time is highly desirable. In my opinion, the former is likely to be successful (although depending on many other factors), but the latter is doomed to failure in nearly all circumstances.
Languages
As a general rule, integrating Rust with C is the easiest scenario. If the foreign language is C++, then things will be harder, but how hard will depend on the style used in the C++ code and the nature of the project (see below for more discussion). The more C-like the code (and especially the interface with Rust), the easier integration will be.
If the foreign language is a managed language (C#, Java, Python, Ruby, etc.) then integration is possible but will have very different characteristics to integrating with native languages. The costs and risks will be specific to the foreign language (some have much better support for Rust integration than others). These docs should be expanded in the future to better cover this scenario.
Tooling
Build and dependency management
Rust projects are natively built with Cargo. Cargo is a build system and package manager. These areas of functionality are tightly integrated and Cargo does not integrate easily with other build systems. Using rustc directly is a rare path and is thus unpleasant and poorly supported.
One straightforward scenario is where a Rust crate is not depended on by any foreign code (i.e., it is a leaf node in the dependency graph). In this case all foreign code is upstream of Rust code and build system integration is fairly straightforward. Similarly, if a Rust crate does not depend on any foreign code (at least outside of the standard library), then the Rust code can be compiled in relative isolation and again build system integration is simpler. Having multiple Rust components in one or other of these situations is also fairly straightforward, as long as they don't depend on each other and you don't mind some duplication of compilation where there are shared dependencies (i.e., the extra time and potential version incompatibility).
More overlap and constraints between components make build system integration more complicated. If you will require multiple 'layers' of dependency between Rust and foreign code, and require the Rust components to play nicely with respect to Cargo's version resolution, then integration will be complicated and require effort (how much depends on which build system you're integrating with, see below, and the details of the project).
The most common pattern for build system integration is to run Cargo's package management functionality offline. Rust dependency sources can then be stored in-tree and when building, they only need to be compiled (which, with some setup, does not require Cargo). Tooling exists to help with this (cargo vendor). The downside of this approach is that updating dependencies is a manual step. It also makes the source tree much larger and the version control history a little less clear. If your project can accommodate this pattern, then build system integration will be much, much easier than if it can't.
Some build systems have existing tools for Rust/Cargo integration (or are a well-trodden path with good docs). If you are using these, integration will be easier:
- Buck
- Bazel
- make/cmake?
- TODO any more? Links and descriptions for above
Linking Rust binaries with other native binaries is generally straightforward because Rust follows native standards for each platform. Integration may be more difficult if you require dynamic linking since Rust has a bias towards static linking, and although dynamic linking is supported, it is a less common path. Link-time optimisation (and PGO) will work with Rust, including across languages (which means that there is not even an optimisation penalty for mixing Rust with C/C++ in many cases), however, Rust uses LLVM for its backend, and so LTO will only work within the LLVM ecosystem. TODO is this true? Is it possible to LTO without LLVM?
CI
TODO
Static analysis and related tooling
- ELF binaries, debuginfo, LLVM ecosystem
- Valgrind seems to work quite well
- relying on GCC or VS means hard work
- analyses which rely on optimisation might have issues
- anything which relies on source code or AST will not work
- Rust has fairly good tooling (rustfmt, clippy, miri, etc.) which often, but not always, works in the presence of foreign code
- TODO - more specific stuff for widely used tools
Debugging, profiling, and other developer tools
- debugging and profiling generally work across language boundaries, IDEs generally don't (although usually you can get pretty good independent coverage of the different languages)
- only likely to be problems if tooling is tied to the GCC or VS ecosystems
Nature of the project
This section covers some topics which are about various aspects of a project's architecture, design, and code style. These are mostly not 'black and white' topics, but rather have a bunch of nuance and subtlety. These are also mostly things where you can make choices early in your adoption of Rust which will make integration a lot easier.
If your project has a microservices-style architecture with components running on their own nodes or in their own processes and you can keep Rust in entirely new components, then integration should be easy (relatively). Rust has good support for many forms of IPC, networking, serialization/deserialization, etc. You may still have integration issues with build or CI which need considering, but for the code itself, you are in a best-case scenario.
Only slightly worse than the previous situation is if you have a monolithic application, but you can keep the new Rust code in a separate process. The integration will still be easy, but you're likely to have more design issues.
Assuming Rust code must be in the same process as existing foreign code, the more modular the architecture, the easier integration will be. Most challenges with interop are design challenges where it is difficult to ensure modularity (and the specific kind of modularity which makes interop easier) in the face of requirements which favour tight integration of components. Keeping Rust to modules/components with strong, well-designed APIs will make interop easier.
More specifically, the kinds of modularity that benefit interop are:
- TODO
And a few things which don't help much:
- keeping data private and using accessor methods (unless those accessor methods enforce invariants of the data),
- TODO
Rust interop primarily uses the C ABI, but many C++ features work across the language boundary too. There are both C and C++ features which can make interop more or less difficult. Difficult can mean requiring more tooling, more conversion of data types at the language boundary, more invariants which must be enforced by the programmer, language features which must be manually emulated rather than implemented by the compiler, riskier code (i.e., bugs are more likely and/or harder to see), or that direct sharing of data or calling of functions is not possible.
The following features work without any runtime conversion and don't require tooling (though using bindgen may make things easier):
- primitive numeric and boolean data,
- structs and field access,
- simple (C-like) enums,
- function calls with C ABI,
- opaque pointers (i.e., pointers which are never dereferenced),
The following features require either manual emulation in Rust or some encoding:
TODO, for each, discuss the implications
- lifecycle methods (constructors, copy constructors, destructors, etc.),
- method calls,
TODO - features, C++ features enums inheritance move semantics pointer arithmetic, casting, etc generics, template tomfoolery Rust features into C/C++ - enums, traits, etc, etc
The mechanics of FFI
No runtime, bare metal means low-level interop is free
where the ABI of Rust and foreign lang coincide, interop is free. Where it doesn't we require abstraction layer work stable and de facto stable Rust ABI
function calls and data must agree
runtime behaviour async/concurrency unwinding
Declaring and defining functions
For a function to be called across languages, it must be declared in both languages (but only defined in one) so that both compilers can find it. The two declarations must end up associated with the same definition, and this requires linking object files correctly, see the chapter on building and linking for details. If declared correctly, calling the function requires no special effort and no runtime overhead (it can even be inlined across languages if using link-time optimisation (LTO)).
TODO dyn linking TODO calling is unsafe
Global variables can also be accessed across the FFI boundary, see the reference chapter.
Function defined in Rust
This section will cover defining a Rust function which can be called from a foreign language.
For a Rust function to be callable from a foreign language you must use the extern
keyword to specify the ABI. "C"
is the default and most common option, see the reference for a full list. If the function should be name-able from outside Rust, you should use the #[no_mangle]
attribute to prevent name mangling. You'll usually need this, unless the function is only used via a function pointer.
E.g.,
#[no_mangle]
pub extern "C" fn foo() -> i32 {
    // ...
}
You'll need to declare the function in foreign code to use it. In C/C++ this will look the same as declaring a C function (usually declared in headers as required). E.g.,
int foo();
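The function-pointer case mentioned above can be sketched as follows. The names Callback, double_it, and apply_callback are hypothetical; apply_callback stands in for foreign code which would receive the pointer, and is written in Rust here only so that the sketch is self-contained.

```rust
use std::ffi::c_int;

/// The type a C API would spell as `int (*)(int)`.
pub type Callback = extern "C" fn(c_int) -> c_int;

/// Uses the C calling convention but needs no `#[no_mangle]`: it is only
/// ever reached through a pointer, never looked up by symbol name.
pub extern "C" fn double_it(x: c_int) -> c_int {
    x.wrapping_mul(2)
}

/// Stand-in for a foreign function which stores or invokes a callback
/// (hypothetical; real foreign code would be declared in an extern block).
pub fn apply_callback(cb: Callback, value: c_int) -> c_int {
    cb(value)
}
```

Because the ABI is part of the function pointer type, passing a plain Rust-ABI function as a Callback is a compile-time error rather than a runtime surprise.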
Function defined in C/C++
This section will cover defining a C/C++ function which can be called from Rust.
A function defined in C/C++ will be just a regular function, just don't declare it static
(the default is extern
, which is what we want).
In Rust the function must be declared. This is done inside an extern block
. Functions declared in an extern
block may not have bodies, are implicitly unsafe to call, and have a specified ABI ("C"
by default). Various attributes control how function definitions are discovered and linked, see the reference.
E.g.,
extern "C" {
    fn foo() -> i32;
}
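A slightly fuller sketch, assuming the C standard library's abs function is available to the linker (it is on common platforms, since Rust's standard library already links against the C library). The wrapper name c_abs is hypothetical. The block is written as unsafe extern, which compilers from Rust 1.82 accept and the 2024 edition requires; on older compilers, drop the leading unsafe.

```rust
use std::ffi::c_int;

// Declaration only: the definition lives in the C standard library.
unsafe extern "C" {
    fn abs(input: c_int) -> c_int;
}

/// A safe wrapper around the foreign function (hypothetical name).
/// `abs(INT_MIN)` is undefined behaviour in C, so that input is rejected
/// before crossing the FFI boundary.
pub fn c_abs(value: c_int) -> Option<c_int> {
    if value == c_int::MIN {
        return None;
    }
    // SAFETY: `abs` takes no pointers and is defined for every other input.
    Some(unsafe { abs(value) })
}
```

The wrapper shows the usual division of labour: the declaration states what the foreign function looks like, and the safe function takes responsibility for the conditions under which calling it is sound.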
Data
When data is passed across the FFI boundary, its type must be known on both sides (i.e., there must be declarations in both languages) and those types must be compatible. Compatibility involves both a type's representation (how it is laid out in memory) and invariants of the type (due to either the language or the type itself). Compatibility is not quite a symmetric relation because while the representations must match, the invariants must be the same or stronger on the callee side than the caller side. Where representations are compatible but invariants are not, the missing invariants can be made requirements for the caller to satisfy. Compatibility is also platform-dependent since different syntactic types may have different representations on different platforms.
Type compatibility is a big topic. We'll summarise here and give a complete description in the reference.
std::ffi defines type aliases for common numeric types which are platform-accurate; libc defines a few more aliases for less common types. Using these aliases is usually easier than using Rust types directly.
Primitive types (integers, floating point types, and booleans) are straightforwardly compatible, though the names are different in C and Rust and the correspondence is platform-dependent (using std::ffi
in Rust is the easiest solution). Characters are also a primitive type in C and Rust, but have different sizes and encodings in the different languages. C char
s are compatible with Rust's i8
or u8
. A Rust char
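is compatible with a 32-bit unsigned integer in C: uint32_t, which is unsigned int on most platforms
(unsigned long is only suitable where it is 32 bits wide)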
, however, a Rust char
must always be a valid Unicode scalar value, this requirement must be satisfied in the C code before the data has the Rust char
type.
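One way to respect the char requirement is to accept a plain 32-bit integer at the boundary and only produce a char after checking it. A minimal sketch; the function names char_from_c and is_uppercase_c are hypothetical:

```rust
/// Converts a 32-bit value received from C into a Rust `char`, returning
/// `None` for anything which is not a Unicode scalar value (surrogates and
/// values above 0x10FFFF).
pub fn char_from_c(raw: u32) -> Option<char> {
    char::from_u32(raw)
}

/// An exported function which takes the character as `u32`, so that
/// invalid input from C can never exist with the `char` type.
pub extern "C" fn is_uppercase_c(raw: u32) -> bool {
    match char_from_c(raw) {
        Some(c) => c.is_uppercase(),
        None => false,
    }
}
```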
If T_rust
and T_c
correspond, then *const T_rust
and const T_c*
correspond, and *mut T_rust
and T_c*
correspond. When calling foreign functions from Rust code, you can use &T_rust
and &mut T_rust
, respectively in the Rust function declaration (Rust references and raw pointers have the same representation). You can also use Option<&mut T_rust>
and Option<&T_rust>
, see the discussion on enums, below. You cannot do this in the opposite direction (i.e., use a Rust reference in a Rust function definition and a pointer in the C declaration), since the Rust reference types have more invariants which must be satisfied. (You technically can do this without any compile-time errors and require the additional invariants in the foreign code, however, these invariants are difficult to guarantee and failing to do so will cause undefined behaviour).
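For functions defined in Rust and called from foreign code, a common approach is to keep the raw pointer in the signature and convert to a reference inside the function, where the requirements can be documented. A minimal sketch, with a hypothetical function name:

```rust
use std::ffi::c_int;

/// Increments the pointed-to value. Returns `false` if `value` is null.
///
/// # Safety
///
/// `value` must be null, or point to a valid, aligned `int` which nothing
/// else reads or writes for the duration of the call.
pub unsafe extern "C" fn increment(value: *mut c_int) -> bool {
    // SAFETY: the caller upholds the requirements documented above, which are
    // exactly what is needed to create a `&mut c_int` from the pointer.
    match unsafe { value.as_mut() } {
        Some(v) => {
            *v = v.wrapping_add(1);
            true
        }
        None => false,
    }
}
```

The extra invariants of the reference type become documented safety requirements on the caller, rather than silent assumptions in the signature.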
TODO ^ the pointees don't have to correspond fully. Box, Rc, etc.
C and Rust structs are compatible if they have corresponding fields and the field types are compatible (and the Rust struct is repr(C)
). Sometimes a C struct may have hidden fields (sometimes called an opaque struct). Such types can be represented in Rust as a zero-sized type with a private field and no constructor function (so the type cannot be instantiated), you can include a field with type PhantomData<(*mut u8, PhantomPinned)>
to prevent the type implementing the Send
, Sync
, or Unpin
auto-traits. See the opaque struct pattern for more.
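The opaque struct description above can be written down as follows. The type name Foo and the helper is_null_handle are hypothetical:

```rust
use std::marker::{PhantomData, PhantomPinned};

/// Rust-side stand-in for a C struct whose fields are hidden.
/// It has no public fields and no constructor, so Rust code can only ever
/// hold pointers to it which were produced by foreign code.
#[repr(C)]
pub struct Foo {
    _data: [u8; 0],
    // Opts out of the `Send`, `Sync`, and `Unpin` auto-traits.
    _marker: PhantomData<(*mut u8, PhantomPinned)>,
}

/// Pointers to the opaque type can be passed around and compared,
/// but never dereferenced to anything useful on the Rust side.
pub fn is_null_handle(handle: *const Foo) -> bool {
    handle.is_null()
}
```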
C and Rust unions are compatible if the Rust union is repr(C)
and their fields are compatible. Field compatibility is a bit more complex than for structs: the unions must end up the same size and any pair of fields which are used together must be compatible. It is up to the programmer to ensure compatible usage of the unions.
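A minimal sketch of a union which would correspond to a C declaration along the lines of `union IntOrFloat { uint32_t i; float f; };`. The names are hypothetical:

```rust
/// Both fields are four bytes, so the union is four bytes, matching the
/// assumed C declaration.
#[repr(C)]
#[derive(Clone, Copy)]
pub union IntOrFloat {
    pub i: u32,
    pub f: f32,
}

/// Writes the float field and reads the integer field.
pub fn float_bits(value: f32) -> u32 {
    let u = IntOrFloat { f: value };
    // SAFETY: the fields have the same size and every bit pattern is a valid
    // `u32`, so reading `i` after writing `f` is sound.
    unsafe { u.i }
}
```

Reading a union field is always unsafe in Rust; which pairs of fields may be used together is exactly the usage agreement the programmer must maintain on both sides.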
Rust enums with no embedded data which are repr(C)
are compatible with C++ enums if they have the same number of variants, the specified (or implicit) discriminant types match, and the values of all variants match. It is possible to loosen these restrictions slightly. It is invalid for a Rust enum to have a value which is not a declared variant. So as long as there is a Rust variant which matches any C++ variant which may be passed across the FFI boundary, the enums are compatible.
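When you cannot be sure that foreign code only ever passes declared variants, one defensive option is to accept an integer at the boundary and convert it explicitly. A minimal sketch; the enum Colour and the function colour_from_c are hypothetical:

```rust
use std::ffi::c_int;

/// Would correspond to a foreign `enum Colour { Red, Green, Blue };`.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Colour {
    Red = 0,
    Green = 1,
    Blue = 2,
}

/// Converts an integer from foreign code into a `Colour`, rejecting values
/// with no matching variant instead of creating an invalid enum.
pub fn colour_from_c(raw: c_int) -> Option<Colour> {
    match raw {
        0 => Some(Colour::Red),
        1 => Some(Colour::Green),
        2 => Some(Colour::Blue),
        _ => None,
    }
}
```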
Enums which are 'option-like' (i.e., have two variants, one with a single field and one with no embedded data) and where the payload type is a non-null pointer type are compatible with the corresponding (nullable) pointer type in C/C++. E.g., Option<&Foo>
is compatible with const Foo*
(subject to the restrictions described above for pointer compatibility). Similarly an option-like enum with the non-zero numeric types is compatible with the (zero-able) numeric type in C/C++.
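A short sketch of the option-like case, with hypothetical function names. The size checks show that no extra tag is stored: the null pointer (or zero) is used to represent None.

```rust
use std::ffi::c_int;
use std::mem::size_of;
use std::num::NonZeroU32;

/// From C this would be declared as `int value_or_default(const int*, int);`
/// and a null pointer arrives as `None`.
pub extern "C" fn value_or_default(ptr: Option<&c_int>, default: c_int) -> c_int {
    match ptr {
        Some(value) => *value,
        None => default,
    }
}

/// True if the option-like types occupy exactly the space of their C
/// counterparts.
pub fn option_like_sizes_match() -> bool {
    size_of::<Option<&c_int>>() == size_of::<*const c_int>()
        && size_of::<Option<NonZeroU32>>() == size_of::<u32>()
}
```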
Most tuple types are incompatible with C/C++ types because their representation cannot be specified. The exception is the empty tuple, ()
which is compatible with the void
type in C/C++ when passed by pointer.
In general, zero-sized types are incompatible with foreign types.
C strings are pointers to char
s and as expected are compatible with *mut u8
or *mut i8
. Despite their resident module, std::ffi::CStr/CString
should not be used directly for FFI because their representations are not guaranteed.
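In practice this means that raw pointers appear in the function signatures, and CStr and CString are used on the Rust side of the boundary to convert. A minimal sketch with hypothetical function names:

```rust
use std::ffi::{c_char, CStr, CString};

/// Returns the length in bytes of a NUL-terminated string from C.
///
/// # Safety
///
/// `s` must be null or point to a valid NUL-terminated string.
pub unsafe extern "C" fn byte_len(s: *const c_char) -> usize {
    if s.is_null() {
        return 0;
    }
    // SAFETY: non-null, and the caller guarantees NUL termination.
    unsafe { CStr::from_ptr(s) }.to_bytes().len()
}

/// Hands an owned string to C. It must be given back to `free_greeting`,
/// not passed to C's `free`.
pub extern "C" fn make_greeting() -> *mut c_char {
    CString::new("hello").expect("no interior NUL").into_raw()
}

/// # Safety
///
/// `s` must be null or a pointer returned by `make_greeting` which has not
/// already been freed.
pub unsafe extern "C" fn free_greeting(s: *mut c_char) {
    if !s.is_null() {
        // SAFETY: the pointer came from `CString::into_raw`.
        drop(unsafe { CString::from_raw(s) });
    }
}
```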
TODO C++ types
Patterns
Architectural patterns
- Modular interop - a high level approach for ensuring effective interop
- Layered library design - how to structure libraries and crates for interop
- -sys crate
- Wrap a C library
- Serialization
- Cross-language ownership
Design patterns
- Foreign dtor
- Object-based API (https://rust-unofficial.github.io/patterns/patterns/ffi/export.html)
- Rust version of C object
- Something about intermediate types like CString/OsString (https://rust-unofficial.github.io/patterns/idioms/ffi/accepting-strings.html, https://rust-unofficial.github.io/patterns/idioms/ffi/passing-strings.html)
- Transparent smart pointer
- Consolidated wrapper (https://rust-unofficial.github.io/patterns/patterns/ffi/wrappers.html)
- Strings (how to actually use them, see strings links above, https://snacky.blog/en/string-ffi-rust.html, https://dev.to/kgrech/7-ways-to-pass-a-string-between-rust-and-c-4iebZ)
Programming idioms and best practices
- Representing Rust errors in C (https://rust-unofficial.github.io/patterns/idioms/ffi/errors.html)
- Representing C errors in Rust
Anti-patterns
- Disguising pointers as values (unclear, disguises unsafety)
- Using C structs directly in Rust (back compat hazards including padding, due to different back compat between C and Rust)
Layered library design
When wrapping a foreign library for use in Rust, consider writing a first layer in C (especially if the legacy code is C++) with an API better suited for interacting with Rust. Then have a crate which is only bindings of C code into Rust (either hand-written or auto-generated). The next layer is a crate which only has the functionality of the foreign library (i.e., no client logic), but presented in a Rust-idiomatic way. The bindings crate will be all unsafe
, the idiomatic crate should aim to have a 100% safe API. Clients should only use the idiomatic crate and never use the bindings crate (some advanced usages may require using the bindings in unanticipated ways, however these clients should create safe abstractions of their own rather than use the bindings directly).
If following this pattern, it is common to give the idiomatic Rust crate the same name as the foreign library, and the bindings library that name with the -sys suffix, e.g., foo
and foo-sys
. (On the topic of naming, it is idiomatic to always avoid using an -rs
suffix on any Rust crate: it is nearly always obvious from context that the crate is a Rust library, so -rs
usually adds nothing).
------------------------
C/C++ library libfoo
------------------------
C wrapper libfoo-ffi
------------------------
Rust bindings (unsafe) foo-sys
------------------------
Rust wrapper (idiomatic) foo
------------------------
Rust users
------------------------
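The two Rust layers in the diagram can be sketched in a single file. Here the C standard library's strlen stands in for libfoo, and the function name length is hypothetical; in a real project the extern block would live in the foo-sys crate and the safe function in the foo crate. The block is written as unsafe extern, which compilers from Rust 1.82 accept; on older compilers, drop the leading unsafe.

```rust
use std::ffi::{c_char, CString, NulError};

// Bindings layer (would be the `foo-sys` crate): declarations only, all
// unsafe to call. `size_t` is declared as `usize`.
unsafe extern "C" {
    fn strlen(s: *const c_char) -> usize;
}

// Idiomatic layer (would be the `foo` crate): Rust types in, Rust types
// out, and a completely safe API.
pub fn length(text: &str) -> Result<usize, NulError> {
    // Fails if `text` contains an interior NUL byte, which C could not see past.
    let c_text = CString::new(text)?;
    // SAFETY: `c_text` is a valid NUL-terminated string which outlives the call.
    Ok(unsafe { strlen(c_text.as_ptr()) })
}
```

Clients call length and never see a raw pointer; the conversion and the unsafe block are confined to the idiomatic layer.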
When making a Rust library available to foreign code, you can adopt a similar strategy. Here, we have an idiomatic Rust crate which can be used directly by Rust users and is idiomatic and mostly safe code. There is then a Rust wrapper which is more C-like and presents an API which is more convenient to use for FFI and includes unsafe functions which make clear the invariants callers must maintain. C bindings reflect this wrapper into the C world. This can be used directly by C code, or there can be a C/C++ wrapper which is more idiomatic (this is much more useful for C++ than for C, since it is possible to have an idiomatic C API with the direct bindings, but that is much harder for C++). C/C++ users (again, more likely C++) then use this wrapper library rather than the bindings.
------------------------
Rust crate (idiomatic) foo
------------------------
Rust wrapper (unsafe) foo-ffi
------------------------
C bindings libfoo-ffi
------------------------
C/C++ wrapper (optional) libfoo
------------------------
C/C++ users
------------------------
There are no strong naming conventions in this direction, and the above example names are not great.
Tooling
You can use bindgen to generate the bindings layer (foo-sys in the example). You can use cxx to generate both the C wrapper and the bindings layer, or at least parts of both.
In the other direction, you can use cbindgen to generate C bindings (e.g., libfoo-ffi) or cxx to generate all of the Rust wrapper, C bindings, and C++ wrapper (although in this case the layers are not clearly defined).
See also
- -sys crate - separating the Rust bindings from the idiomatic Rust wrapper - a component of this pattern,
- Wrap a C library - the C wrapper layer - a component of this pattern.
Reference
This section is designed as a reference and you probably don't want to read it end to end. It is primarily aimed at those implementing and designing tools and low-level libraries, or users who need to do unusual and/or low-level interop work. Hopefully, if you're doing common integration work you mostly won't need this level of detail.
TODO mechanics vs safety I think this is the same as FFI types vs idiomatic types
TODO semi-opinionated some stuff is still undecided, but we describe the current state of the art try to cover different points of view and where there are differences, but not all
TODO assumes C/C++
- Functions and Methods
- statics and consts - TODO used attribute. Using the no_mangle attribute implies used. Use extern for external linkage
- FFI types
- Idiomatic types
- Numeric types
- Strings
- Pointers, references, and arrays void pointers, fat pointers, const, arrays and slices, null/non-null, single allocation, no pointers into middle of an object, ZSTs, pointers to deallocated (e.g., dangling) mem, invalid metadata in wide pointers
- structs, tuples, and unions
- enums
- properties - send, sync, eq, hash, etc.
- classes? trait objects?
Linking
building C and Rust such that object files can be found, etc.
extern blocks
#[link(...)]
attribute
Safety and validity
Rust has several useful safety invariants. These are guaranteed to hold (by the compiler and by authors of unsafe code) in most Rust code, but they may be temporarily broken in code which is marked unsafe
. Unsafe code does not permit writing code which is not safe, but rather indicates the author of the code is responsible for safety, rather than the compiler.
Since all foreign code is outside of the remit of the Rust compiler, all foreign code is unsafe and must be called from within an unsafe
block or function. When writing an FFI layer, we usually want to present a safe API to Rust code, therefore the FFI layer must establish Rust's safety invariants and the safety invariants of any types in its API.
Precisely where safety invariants must hold is still being decided by the Rust project. The current preferred proposal is that the unsafe boundary is the public API of a module (including its sub-modules). Note that this is not defined in terms of unsafe blocks or functions! In more detail, for any public (i.e., visible from an outside module, not necessarily pub
, pub(crate)
, pub(super)
, etc. would count if there is an enclosing module in the crate) function (or method) of a module, there is a set of safety requirements. For safe functions, this set is empty, for unsafe
functions, this set should be documented. If all these requirements are satisfied, then calling the function will never result in a memory safety error occurring. During the function call, safety invariants (both Rust's and the module's) may be violated, but they must be re-established (perhaps relying on the function's safety requirements) by the time the function returns (or unwinds).
Rust also has validity invariants. These must hold in all Rust code both safe and unsafe (otherwise it is undefined behaviour). Validity invariants do not need to hold in foreign code, but must be re-established before control is returned to Rust (i.e., before crossing the FFI boundary). Contrast this with safety invariants which may be violated in foreign code and Rust code but must be re-established before crossing the unsafety boundary.
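As a small illustration of a validity invariant: a Rust bool must be exactly 0 or 1, whereas a flag arriving from C as an unsigned char can hold any byte. The sketch below (the function name flag_from_c is hypothetical) establishes the invariant by comparison rather than by reinterpreting the byte.

```rust
/// Converts a C-style flag byte into a Rust `bool`.
///
/// Reinterpreting the byte directly (e.g., with `std::mem::transmute`) would
/// be undefined behaviour for any value other than 0 or 1, so the byte is
/// kept as `u8` at the boundary and compared instead.
pub extern "C" fn flag_from_c(raw: u8) -> bool {
    raw != 0
}
```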
TODO safety and validity are per type and so establishing invariants might occur when data is reinterpreted.
Example
TODO passing a pointer and length from C code and using it as an array
C code:
#include <stddef.h>
extern void do_thing_impl(char* arr, size_t c_arr);
void do_thing(char* arr, size_t c_arr) { do_thing_impl(arr, c_arr); }
Rust code:
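/// Safety: arr must be non-null and point to c_arr initialised bytes which
/// are not accessed by anything else for the duration of the call.
#[no_mangle]
pub unsafe extern "C" fn do_thing_impl(arr: *mut u8, c_arr: usize) {
    // Establish the invariants of &mut [u8], relying on the requirements above.
    do_thing_idiomatic(unsafe { std::slice::from_raw_parts_mut(arr, c_arr) });
}

fn do_thing_idiomatic(arr: &mut [u8]) {
    // regular Rust code
}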
Functions and Methods
Interoperation of functions (including function-like things such as methods and closures) means calling Rust functions from a foreign language or foreign functions from Rust. This means declaring a function in Rust and defining it in C or vice versa. The definition and declaration must agree or there will be undefined behaviour at runtime. The definition must also be discoverable by code in the other language (this is partly an aspect of linking, described previously and partly an aspect of the function definition). This section describes what it means for function declarations and definitions to agree across languages, and how Rust functions must be defined in order to be discoverable by foreign code.
Functions
Definition
TODO
- visibility
Extern blocks
TODO https://doc.rust-lang.org/reference/items/external-blocks.html
link attribute ABI implicit unsafe see also statics
Name
The names of functions (and other items) are mangled by the compiler by default. Name mangling means that the name of the symbol in the compiled binary is not the same as the name in the source code. Name mangling is not stable, and you should not rely on mangled names being the same between compiler versions.
Use the no_mangle
attribute to prevent name mangling of a function's name. E.g.,
#[no_mangle]
pub extern "C" fn foo() {}
If you will call a function from foreign code by name then you must use no_mangle
(not doing so may cause linking errors or may cause incorrect runtime behaviour). If you will only call a function via a function pointer, then you don't need to.
Alternatively, you can use the export_name
attribute to explicitly specify the name to use for the exported symbol. E.g.,
#[export_name = "bar"]
pub extern "C" fn foo() {}
This function can be called using foo
from Rust, and bar
from C.
Likewise, by default C++ will mangle function names. This is inconsistent between platforms and compilers, so it is not advisable to use the mangled names (this can be done if absolutely necessary and tools like bindgen and cxx can help with this TODO is this true?). To prevent name mangling, define functions in an extern "C"
block.
You can specify which section of the binary the function is placed in using the link_section
attribute.
Calling convention
The extern
keyword is used on function definitions to specify the calling convention (aka the ABI) used to call them (and on extern blocks to define the calling convention used to call the foreign functions declared inside, see above). The syntax is extern "ABI"
where ABI
is the optional ABI identifier string. E.g.,
pub extern "C" fn foo() {}
If no ABI identifier is supplied, then C
is used. If extern
is not used at all, then Rust
is used.
The calling convention used in the declaration and definition of functions must match. This is likely to entail a somewhat complex interaction of defaults across different platforms and languages, and attributes in different languages. If you have control of both sides of the FFI, then making both extern "C"
(either explicitly or by default) is probably the easiest option. You'll need to use other options if you want to match a calling convention in a library which cannot be changed.
The platform independent ABI identifier strings are:
- Rust: Rust's ABI; this is unstable and should not be used for FFI code,
- C: the default C calling convention,
- system: the platform default calling convention for calling 'system functions'. Usually the same as extern "C", except on Win32, in which case it's "stdcall".
The platform-specific ABI identifier strings are:
- cdecl: for x86_32 C code,
- stdcall: for the Win32 API on x86_32,
- win64: for C code on x86_64 Windows,
- sysv64: for C code on non-Windows x86_64,
- aapcs: for ARM,
- fastcall: corresponds to MSVC's __fastcall and GCC and clang's __attribute__((fastcall)),
- vectorcall: corresponds to MSVC's __vectorcall and clang's __attribute__((vectorcall)).
You might also come across rust-intrinsic
, rust-call
, and platform-intrinsic
. These are used by the compiler and standard library, but you shouldn't use them in user code.
There also exist -unwind
versions of the ABI identifier strings, e.g., C-unwind
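. These were originally unstable, but C-unwind and the -unwind variants of the other stable ABI strings have been stable since Rust 1.71; see the section below on unwinding for more details.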
TODO thiscall
was unstable in older compilers (stable since Rust 1.73), see discussion on methods
C/C++ linkage
C/C++ functions must have external linkage (this is the default, i.e., functions may not be marked static
).
Signature
The types of all arguments in the function and its return type as written in the declaration and definition must agree. For more on type agreement, see the sections on data types. The names of arguments do not need to match. In Rust declarations of foreign functions, _
may be used instead of an argument name. No other patterns may be used in arguments. Patterns may be used instead of names in the usual way for Rust functions which are exported; the foreign declarations should use a name instead of a pattern.
The number of arguments in definition and declaration must match, including variadic arguments. Declarations (but not definitions) in Rust may be variadic (to match variadic functions defined in C). E.g.,
extern "C" {
    fn foo(format: *const u8, args: ...);
}
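As a concrete sketch of calling a variadic function, the following declares snprintf from the C standard library and wraps it in a safe function. The wrapper name format_sum is hypothetical, and the sketch assumes a platform where snprintf is an ordinary linkable symbol (as it is with glibc). The block is written as unsafe extern, which compilers from Rust 1.82 accept; on older compilers, drop the leading unsafe.

```rust
use std::ffi::{c_char, c_int, CStr};

unsafe extern "C" {
    // `size_t` is declared as `usize`; the trailing `...` makes the
    // declaration variadic.
    fn snprintf(buf: *mut c_char, size: usize, format: *const c_char, ...) -> c_int;
}

/// Formats "a + b = sum" using C's formatting machinery.
pub fn format_sum(a: c_int, b: c_int) -> String {
    let mut buf = [0 as c_char; 64];
    // SAFETY: `buf` is writable for `buf.len()` bytes, the format string is
    // NUL-terminated, and each `%d` is matched by a `c_int` argument.
    let written = unsafe {
        snprintf(
            buf.as_mut_ptr(),
            buf.len(),
            b"%d + %d = %d\0".as_ptr().cast(),
            a,
            b,
            a.wrapping_add(b),
        )
    };
    assert!(written >= 0, "snprintf reported an encoding error");
    // SAFETY: snprintf always NUL-terminates the output when `size` is non-zero.
    unsafe { CStr::from_ptr(buf.as_ptr()) }
        .to_string_lossy()
        .into_owned()
}
```

The compiler cannot check variadic arguments against the format string, so the match between the two is one more invariant for the programmer to uphold.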
If a function diverges, then in Rust it should have the !
return type. In C/C++ the function should have a 'no return' attribute (__attribute__((noreturn))
, [[noreturn]]
, [[__noreturn__]]
, [[_Noreturn]]
, etc. depending on the language, version, and compiler).
TODO what if sigs don't agree?
If the return type must be used, then the Rust function should have the #[must_use]
attribute and the C/C++ function the __attribute__((warn_unused_result))
attribute. TODO [[nodiscard]] on ctors. Getting this wrong will lead to missing warnings which may in turn lead to runtime errors.
Function calls
TODO
calling convention (should work) calling variadics (just works) see also data unsafe
const functions
TODO
Unwinding
TODO
TODO -unwind ABIs
Exceptions
TODO
Closures
TODO
Methods
TODO
- Virtual/static dispatch
- ctors
- dtors
- operator overloading
Other
TODO
- async
- generators
- templates/generics
Data Types
Data in both Rust and C/C++ is just ones and zeros in the computer's memory. These ones and zeros are independent of the language which generated them and there is no intrinsic sense of 'compatibility'. However, when data is used, the compiler must have semantics for those ones and zeros which in Rust, C, and C++ is determined by the types of the data. When data is passed across the FFI boundary, two compilers are involved and each must have a type defined in its own language with which to understand the data. For this to produce correct results, corresponding types in the two languages must agree. As an abstract example, if a function f
is declared in both Rust and C and has a single argument with type T_Rust
in Rust and T_C
in C, then T_Rust
and T_C
must agree. If they do not then any operation on the data will be undefined behaviour.
This concept of agreement goes beyond what the bytes represent (e.g., that some sequence of four bytes should be interpreted as a little-endian, 32 bit, unsigned integer) and includes the safety and validity invariants of the type. These invariants may be due to rules of the language, or due to the specific type itself. What makes this difficult is that invariants due to the language may be specified in a reference or spec, but may just be assumed by the compiler authors and otherwise undocumented. Invariants due to a specific type may be documented but, especially if they are invariants which users would not usually need to be aware of or are considered implementation details, may not be documented (or only documented in the source code). These may still be a concern when writing interop code since C/C++ allows treating data in ways usually forbidden in Rust.
TODO safety and validity invariants - https://www.ralfj.de/blog/2018/08/22/two-kinds-of-invariants.html
Invariants due to specific data types can be found in their documentation or source code. Invariants due to the kind of data types can be found below and in sub-chapters. Rust also has some invariants which apply to all data (or nearly all data), we'll cover those in the next few paragraphs.
Much of Rust's ABI and invariants are de jure undefined and may be subject to change. There have been several RFCs which cover this kind of thing, and work is ongoing within the Rust project and in academia to better specify language-wide invariants. However, there is a large body of code which works today and is unlikely to be broken, so much of this stuff is de facto standardized.
TODO what does this all mean for doing FFI?
TODO there is a two step process c_type -> binding_type -> rust_type, the C and binding types must agree, binding type typically has minimal requirements, binding type to rust type is a pure Rust conversion but to be valid we must establish the invariants of the rust type which may have to be done in foreign code. E.g., int* -> *mut i32 -> &mut i32
equivalence of int*
and *mut i32
is trivial (only wrinkle is the int types), but for the conversion from *mut i32
to &mut i32
, we must guarantee that the pointer is unique, which depends on what is happening in the foreign code.
- let's break the whole chapter up along these lines
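The two-step process in the note above can be sketched as follows. This is only an illustration under stated assumptions: increment is a hypothetical function (a real export would also carry an export attribute such as no_mangle, omitted here so the sketch compiles on its own), and the safety contract in the comments is an assumption about what the foreign caller promises.

```rust
use std::ffi::c_int;

/// Hypothetical function that C would declare as `void increment(int *value);`.
///
/// # Safety
/// `value` must be null, or point to a valid, aligned `int` that nothing else
/// reads or writes for the duration of the call.
pub unsafe extern "C" fn increment(value: *mut c_int) {
    // Step 1 (C type -> binding type): `int *` and `*mut c_int` agree, and the
    // raw pointer itself promises almost nothing.
    // Step 2 (binding type -> Rust type): `&mut c_int` requires a non-null,
    // aligned, unique pointer. `as_mut` only checks for null; the remaining
    // requirements must be guaranteed by the foreign caller.
    let Some(value) = (unsafe { value.as_mut() }) else {
        return;
    };
    *value = value.wrapping_add(1);
}
```

Keeping the raw pointer in the signature and converting inside the function makes the point where the Rust invariants are assumed explicit.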
Uniqueness and mutability
In Rust, data is immutable by default. Data must be known to be mutable from its type to be mutated (TODO phrasing). It is undefined behaviour to convert immutable data into mutable data, or to directly mutate immutable data (contrast this to C, where const-ness can be cast away).
Data can be mutated only if it is known to be unique, i.e., data cannot be accessed other than via the reference used to mutate the data. Such uniqueness may be established either statically (e.g., references &T
and &mut T
) or dynamically (e.g., RefCell
). All dynamic tracking of uniqueness must use unsafe
code and raw pointers at some level, usually wrapped so that end users only see safe abstractions.
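As a small illustration of dynamically established uniqueness, the sketch below uses RefCell from the standard library. The function name is made up for this example; the point is only that the uniqueness check happens at runtime rather than at compile time.

```rust
use std::cell::RefCell;

/// Returns whether a mutable borrow was refused while a shared borrow was
/// live, and the value after a later, successful mutation.
pub fn dynamic_uniqueness() -> (bool, i32) {
    let cell = RefCell::new(1);

    // A shared borrow is live, so a mutable borrow is refused at runtime.
    let shared = cell.borrow();
    let refused = cell.try_borrow_mut().is_err();
    drop(shared);

    // With no other borrows live, the data is unique and can be mutated.
    *cell.borrow_mut() += 1;
    let value = *cell.borrow();
    (refused, value)
}
```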
References and values can only be mutated if they are declared as mutable (e.g., a &mut T
can be mutated and a &T
can never be mutated) and the compiler can prove they are unique at the point mutation occurs (e.g., a &mut T
cannot be mutated if there exists a live &T
referencing the same value). Raw pointers have the former constraint but not the latter. A *mut T
pointer can always be dereferenced (an unsafe
operation) and mutated. The programmer must ensure that when the pointer is dereferenced, TODO undefined?
An important aspect of Rust for preserving uniqueness is move semantics. When data is passed from one location to another (e.g., assigned to a variable or passed to a function), it is logically moved (you can think of this as a bitwise copy, then deleting the old copy, although the compiler may optimise that). That means that if data is unique before being passed, it is also unique after being passed. Compare this to C/C++ where data is copied or Java-like languages where passing most data implicitly passes a reference.
Some data in Rust is copied rather than moved. Primitive types, immutable references, raw pointers, and any data structure which implements the Copy
marker trait are copied rather than moved.
Note that both moving and copying are simple, bitwise operations. Neither invokes a constructor or destructor (in contrast to C++).
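The difference between copied and moved data can be seen in a short sketch. The types Point and Buffer and the functions below are hypothetical and exist only for this illustration.

```rust
/// Implements `Copy`, so passing it leaves the original usable.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Point {
    pub x: i32,
    pub y: i32,
}

/// Owns a heap allocation and is not `Copy`, so passing it moves it.
pub struct Buffer {
    pub bytes: Vec<u8>,
}

pub fn consume(buffer: Buffer) -> usize {
    buffer.bytes.len()
}

pub fn copy_then_move() -> (Point, Point, usize) {
    let a = Point { x: 1, y: 2 };
    let b = a; // bitwise copy: `a` is still usable afterwards

    let buffer = Buffer { bytes: vec![0; 4] };
    let len = consume(buffer); // bitwise move: `buffer` is no longer usable
    // consume(buffer); // would not compile: use of moved value

    (a, b, len)
}
```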
If passing an object from Rust to C/C++, care must be taken around uniqueness. If passing by reference, then pointers/references in C/C++ are copied and so will not be unique. If passing by value, then data will be copied, not moved. Therefore, Rust data which implements Copy
can be safely passed by value to C/C++ and passed around or stored. Immutable references/pointers can also be safely passed to C/C++ as long as the data they point to is never mutated through them. If non-Copy
data is to be passed by value, or data is passed by reference and is mutated, more care must be taken around these invariants.
Invariants around pointer and reference types are covered in detail in the chapter on pointers, references, and arrays.
TODO raw mut pointers requirement in foreign/unsafe code (interior mutability) UnsafeCell scope of requirement (data referenced from Rust? Allocated in Rust?) validity and safety invariants
Borrowing
TODO uniqueness mutable and immutable borrows overlapping borrows lifetimes lifetime due to scope - presence of dtor/attribute? and NLL drop check and phantomdata storing data 'static borrows variance
Initialization
In Rust, all memory must always be initialized unless explicitly marked using MaybeUninit
. For most data, it should be ensured that data is initialized on the foreign side of the FFI boundary. If data may not be initialized, then the Rust type must be MaybeUninit
(e.g., if passing a Foo
, then the Rust type must be MaybeUninit<Foo>
). See also the discussion on null
in the section below on pointers and references
.
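A minimal sketch of the out-parameter case, assuming a foreign initializer that fills in every field. Here foo_init is written in Rust as a stand-in for the foreign function so that the example is self-contained; Foo and both function names are hypothetical.

```rust
use std::mem::MaybeUninit;

#[repr(C)]
#[derive(Debug, PartialEq)]
pub struct Foo {
    pub id: i32,
    pub ready: bool,
}

/// Stand-in for a foreign initializer declared in C as
/// `void foo_init(Foo *out);` (hypothetical).
///
/// # Safety
/// `out` must be valid for writes and properly aligned.
pub unsafe extern "C" fn foo_init(out: *mut Foo) {
    // `write` does not read or drop the (uninitialized) old contents.
    unsafe { out.write(Foo { id: 7, ready: true }) };
}

pub fn make_foo() -> Foo {
    // The memory is explicitly marked as possibly uninitialized.
    let mut slot = MaybeUninit::<Foo>::uninit();
    unsafe {
        foo_init(slot.as_mut_ptr());
        // Sound only because `foo_init` is known to initialize every field.
        slot.assume_init()
    }
}
```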
concurrency
TODO effect on uniqueness send/sync unsafe and dynamic guarantees (arc, mutex, scoped threads) thread-safety and FFI Rust stuff thread-safety guarantees from C/C++ atomic/non-atomic access to shared memory (even volatile ops) memory model is C++20 (nomicon atomics link)
Layout and alignment
TODO alignment and storage address size = multiple of alignment bounds/OOB access can't assume layout without repr, see structs and enums DSTs ZSTs
Platform-specific invariants
CHERI WASM - function/data pointers ARM? padding bits two's complement
Kinds of data type
Rust and C/C++ have many different kinds of data type. These include primitive data, compound data (enums and structs, etc.), pointers, and more. For data to agree across the FFI boundary the kind of data type must correspond (and then the details must agree, which will be covered in the following chapters).
TODO what goes here vs in the sub-chapters?
Primitive types
Primitive types are numeric (signed and unsigned integers, and floating point numbers), characters (but not strings), or booleans. These types have the same semantics and interpretation in C/C++ and in Rust. In particular, they are always passed by simple copying (i.e., they are never moved and no constructor is invoked). The names of types and some details of their interpretation vary between C/C++ and Rust, see the chapter on numeric types for details. In particular, which Rust type corresponds to a given C/C++ type can vary depending on the platform.
Because of the matching semantics and lack of aliasing, using these types for interop is usually very simple and efficient.
Both C and Rust have a void type: void
in C and ()
in Rust (or can be implicit in both languages). These types trivially agree. Most zero-sized types cannot be used for interop; ()
is an exception when used as a return type, but cannot be used as a type parameter. For void pointers, see the pointers and references section, below.
Compound data
Compound data types are structs, unions, and enums in C/C++ and Rust, tuples and tuple structs in Rust, and classes in C++. Structs, unions, and some enums basically correspond between C/C++ and Rust, see the following sub-chapters for details. Tuples in Rust cannot be used in FFI because they always have the default representation (see below and the chapter on structs, tuples, and unions). Tuple structs correspond with foreign structs. Classes in C++ correspond with structs in Rust (although this is a complex correspondence), see the chapter on classes.
Individual compound data types are likely to have their own invariants which will need to be maintained in foreign code (or by Rust code for foreign types).
By default, the Rust compiler can lay out data however it likes and this can change between compiler versions (or even with the same compiler version, in theory). This is incompatible with FFI, and so you must specify an alternative representation for data types for them to agree with a foreign type. We'll cover this in detail in the following sub-chapters.
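As a sketch of specifying a representation, the hypothetical struct below uses repr(C), which lays fields out in declaration order with C's padding rules. The sizes in the comments assume a typical platform where a 32 bit integer has 4 byte alignment.

```rust
use std::mem::{align_of, size_of};

/// Intended to agree with a C declaration such as
/// `struct Header { uint8_t tag; uint32_t len; uint8_t flag; };`
/// (hypothetical).
#[repr(C)]
pub struct Header {
    pub tag: u8,  // offset 0, followed by 3 bytes of padding
    pub len: u32, // offset 4
    pub flag: u8, // offset 8, followed by 3 bytes of trailing padding
}

/// Returns (size, alignment) of `Header` in bytes.
pub fn header_layout() -> (usize, usize) {
    (size_of::<Header>(), align_of::<Header>())
}
```

Without the repr attribute the compiler would be free to reorder the fields (for example, to reduce padding), so the layout could not be relied on from C.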
Aliases (typedef
in C/C++ or type
in Rust) are present only at compile time and do not affect the representation or the invariants of the data. Rust's 'newtypes' (usually a tuple struct with a single field) are not aliases and have the same behaviour as other compound types, i.e., can introduce new invariants and may have a different representation (unless explicitly specified), compared with the underlying data.
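A sketch of a newtype which introduces an invariant while explicitly keeping the representation of the underlying data. Percent is a hypothetical type; repr(transparent) is the standard attribute for guaranteeing that a single-field struct has the same layout and ABI as its field.

```rust
/// A value in the range 0..=100. The range is an invariant of the newtype,
/// not of the underlying `u8`.
#[repr(transparent)]
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Percent(u8);

impl Percent {
    /// Establishes the invariant; foreign code producing a `Percent`
    /// directly would have to uphold it itself.
    pub fn new(value: u8) -> Option<Percent> {
        (value <= 100).then_some(Percent(value))
    }

    pub fn get(self) -> u8 {
        self.0
    }
}
```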
Strings, smart pointers, and array-like collections (e.g., Vec
in Rust) are all compound data types in both Rust and C/C++. In principle, these do not require any special treatment over other user types. However, they are more likely to have important invariants which must be maintained for the sake of soundness. Several examples will be covered in the following sub-chapters.
pointers and references
TODO
layout - same as C DSTs/wide pointers validity: non-null, dangling, unaligned, aliasing (if &mut), pointed-to value is valid (safe?) smart pointers void pointers
arrays and slices
TODO
trait objects and class objects, and methods
TODO
Generic types
TODO
Numeric types
For numeric types to agree across an FFI, their kind (unsigned integer, signed integer, or floating point), size, and invariants must match. The size of most C/C++ types and usize
/isize
in Rust can vary depending on the platform. For all numeric types, if the size matches then the alignment will also match (on a single platform).
std::ffi defines type aliases for common numeric types which are platform-accurate; libc defines a few more aliases for less common types. Using these aliases is usually easier than using Rust types directly.
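A small sketch of using the std::ffi aliases in a signature rather than fixed-size Rust types. The function add_ints is hypothetical; alias_sizes only reports what the aliases resolve to on the platform being compiled for.

```rust
use std::ffi::{c_int, c_long, c_longlong, c_short};
use std::mem::size_of;

/// Hypothetical function that C would declare as `int add_ints(int a, int b);`.
/// Using `c_int` keeps the signature correct on platforms where `int` is
/// not 32 bits.
pub extern "C" fn add_ints(a: c_int, b: c_int) -> c_int {
    // Wrapping is a choice made for this sketch: plain `+` would panic on
    // overflow in debug builds.
    a.wrapping_add(b)
}

/// Sizes in bytes of `short`, `int`, `long`, and `long long` as Rust sees them.
pub fn alias_sizes() -> [usize; 4] {
    [
        size_of::<c_short>(),
        size_of::<c_int>(),
        size_of::<c_long>(),
        size_of::<c_longlong>(),
    ]
}
```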
Integers, booleans, and characters
Rust integers
u8
... u64
and i8
... i64
are unsigned and signed respectively with the number in the type indicating the number of bits.
usize
and isize
are 32 bits on 32 bit platforms and 64 bits on 64 bit platforms.
C/C++ integers
A C/C++ integer is unsigned if it uses the unsigned
keyword and signed otherwise.
A char
is always 8 bits, a short
is always 16 bits, and a long long
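is always 64 bits (the C standard only guarantees minimum sizes, but these are the sizes on all platforms Rust supports).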
The size of int
and long
are platform dependent, see std::ffi::c_{int|long}_definition
128 bit integers
Rust supports i128
and u128
. These types are mostly not safe for FFI (will lead to UB) and must be avoided. In particular, they are not compatible with C's 128 bit integer types where those exist. However, they can be used on non-Windows aarch64.
booleans
Rust (bool
) and C's (strictly, C99 and later, _Bool
) boolean types are compatible. Technically, C++'s bool
is not guaranteed to be the same representation as C's _Bool
, but they are on all known platforms, so it is safe to assume that Rust's bool
is compatible with C++'s.
It is common to use integers to represent booleans in C programs (especially older programs or when using older toolchains). These can be converted to Rust bool
s if the size matches and they are guaranteed to only have values 0
or 1
. (It is possible to use 0
for false and non-zero for true with C's boolean operators, however, storing any value other than 0
or 1
in a Rust bool
is UB. You can check and convert in either Rust or C code, but in the latter case you must not use a Rust bool
in your FFI).
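Checking and converting on the Rust side might look like the sketch below, assuming the foreign API uses int as its boolean type. The function names are made up for this example.

```rust
use std::ffi::c_int;

/// Converts a C-style integer boolean (0 is false, anything else is true).
/// The FFI signature uses `c_int`, never `bool`, so no invalid `bool` value
/// can be created.
pub fn bool_from_c_int(flag: c_int) -> bool {
    flag != 0
}

/// Converts back, always producing exactly 0 or 1.
pub fn c_int_from_bool(flag: bool) -> c_int {
    c_int::from(flag)
}
```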
characters
Rust and C character types are incompatible.
C character types can be converted to or from Rust's 8 bit integer types. unsigned char
is always u8
, signed char
is always i8
. char
may be either i8
or u8
depending on the platform, see std::ffi::c_char_definition.
A Rust char
is a 32 bit type which must be a valid Unicode scalar value. It is UB to create a char
which is not valid Unicode. You should probably avoid using char
in FFI unless you have a custom character type with the same size and invariant in your foreign code. Otherwise it is usually better to pass numeric bytes and use helper methods on char
to create the Rust char
.
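For example, a code point received from foreign code as a 32 bit integer can be validated with char::from_u32 rather than being transmuted. The wrapper functions here are hypothetical.

```rust
/// Returns `None` for surrogates and for values above 0x10FFFF, neither of
/// which is a valid Unicode scalar value.
pub fn char_from_code_point(code_point: u32) -> Option<char> {
    char::from_u32(code_point)
}

/// The reverse direction is always valid.
pub fn code_point_from_char(c: char) -> u32 {
    u32::from(c)
}
```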
TODO wchar_t
Non-zero integers
There are (currently unstable) type aliases for non-zero integers in core::ffi
. These map to the non-zero integer types in core::num
with the correct size for the C integer types. The user must maintain the non-zero invariant (whether that is a safety issue depends on how the types are used); i.e., Rust does not ensure that values with this type are in fact non-zero.
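Since the aliases are unstable, the sketch below uses the stable NonZeroU32 type directly and assumes the foreign value is a 32 bit unsigned integer. The function name is hypothetical. Checking at the boundary means the non-zero invariant is established in Rust rather than assumed.

```rust
use std::num::NonZeroU32;

/// Converts a raw foreign handle, treating 0 as "no handle".
pub fn handle_from_foreign(raw: u32) -> Option<NonZeroU32> {
    NonZeroU32::new(raw)
}
```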
Floating point
A C float
is equivalent to a Rust f32
and a C double
is equivalent to a Rust f64
.
SIMD
SIMD vectors cannot be used in FFI (UB). There is an accepted RFC to address this, but it has not been implemented.
Strings
TODO see string patterns, pointer reference (since C strings are pointers)
Rust, C, and C++ strings
There are many string types in Rust and C/C++. I'll cover them here, focussing on their representations and invariants, since that is what is most important for language interop. For correct FFI, you need to understand a string's layout in memory, whether the string is nul-terminated (and whether nul characters may be embedded within the string), and the encoding of the string (e.g., UTF-8).
Rust
Rust has three classes of string types in the standard library, each of which has owned and borrowed types (the latter of which is usually a dynamically sized type, see the wide pointer section and the note at the end of this subsection). The owned type is called a "string" and the borrowed type a "str". You could also use a sequence of characters or bytes as strings, or define your own custom string type (see the below section on Windows strings for some examples).
The standard Rust string types are String
and str
. Both are UTF-8 strings and must always be valid UTF-8. A String
is a newtype wrapping a Vec<u8>
; str
is a built-in type and always has the same representation as a [u8]
. This means that a String
is a pointer (a unique, non-null pointer to a sequence of u8
s, i.e., essentially a *mut u8
in terms of representation), a capacity (usize
), and length (usize
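), although the order of these fields in memory is unspecified because the standard library types use the default representation. A &str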
is a wide pointer consisting of a (non-null) pointer to a sequence of u8
s and a length (usize
). However, the order of the components of a wide pointer is unspecified and unstable (i.e., may change in the future).
Rust has the std::ffi::CString
and std::ffi::CStr
types for working more easily with foreign language string types. These types are not directly FFI compatible with C strings. These strings must be nul-terminated, have no internal nul characters, but do not have to be valid UTF-8. Use the as_ptr
method to get an FFI-compatible pointer. The representation of these strings is not part of their interface.
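A sketch of passing Rust text to a function expecting a C string. Here count_bytes is written in Rust as a stand-in for a foreign function so the example is self-contained; both function names are hypothetical.

```rust
use std::ffi::{c_char, CStr, CString};

/// Stand-in for a foreign function declared in C as
/// `size_t count_bytes(const char *s);` (hypothetical).
///
/// # Safety
/// `s` must point to a nul-terminated sequence of bytes.
pub unsafe extern "C" fn count_bytes(s: *const c_char) -> usize {
    let c_str = unsafe { CStr::from_ptr(s) };
    c_str.to_bytes().len()
}

pub fn count_bytes_of(text: &str) -> Option<usize> {
    // `CString::new` copies the text, appends the nul terminator, and fails
    // if the text contains an interior nul.
    let owned = CString::new(text).ok()?;
    // The pointer from `as_ptr` is only valid while `owned` is alive, so
    // `owned` must outlive the call.
    Some(unsafe { count_bytes(owned.as_ptr()) })
}
```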
Similar to CString
/CStr
, std::ffi::OsString
and std::ffi::OsStr
are meant to make working with foreign string types easier but are not directly FFI compatible with foreign strings. OsString
/OsStr
are easily convertible to both platform-native strings and Rust strings (String
/str
). Neither their representation nor whether they are valid Unicode is part of their interface. On Unix platforms, OsStr
can be cheaply interconverted with byte slices, however, these are not nul-terminated. On Windows, OsStr
can be losslessly converted into a UTF-16 (wide) string, however, this requires copying and processing the string data; again, the output string is not nul-terminated.
Note: technically, the borrowed types are just dynamically sized string types and are not intrinsically borrowed (e.g., Box<str>
is a valid type). However, in practice these types are nearly always used with borrowed references (e.g., &str
) to represent borrowed strings. These are often called string slices since they can be a slice (aka substring) of the underlying string.
C
C strings are pointers to a nul-terminated sequence of char
s. They may have either pointer or array types (arrays decay to pointers when passed to functions in C). C strings do not have a specified encoding, that is, a program is free to interpret a C string as ASCII, UTF-8, UTF-32, or any other encoding.
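Because a C string has no specified encoding, converting one to a Rust string needs an explicit validation step. The sketch below assumes the foreign code intends UTF-8 and rejects anything else; string_from_c is a hypothetical helper.

```rust
use std::ffi::{c_char, CStr};

/// Copies a C string into an owned Rust `String`.
///
/// # Safety
/// `ptr` must be null, or point to a nul-terminated sequence of bytes which
/// stays valid and unmodified for the duration of the call.
pub unsafe fn string_from_c(ptr: *const c_char) -> Option<String> {
    if ptr.is_null() {
        return None;
    }
    let c_str = unsafe { CStr::from_ptr(ptr) };
    // `to_str` checks that the bytes are valid UTF-8.
    c_str.to_str().ok().map(String::from)
}
```

Where invalid bytes should be tolerated rather than rejected, CStr::to_string_lossy is an alternative to to_str.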
C++
The C++ standard library includes the string
type (which is actually an alias of an instantiated generic type basic_string<char>
). Like the C string type it does not have a specified encoding. Its methods are all byte-oriented (i.e., have no concept of a character beyond a char
). It is not directly compatible with C strings and its representation is not part of its interface. It is easy to get a C string with the c_str
method; whether this is guaranteed to return a pointer to the data in the string or a copy of it depends on the version of C++.
Windows
Windows uses many different string types: HSTRING
, BSTR
, and the PSTR
family of types.
HSTRING
is primarily used with WinRT and is immutable. It is usually (but not always) reference counted. It is nul-terminated, but may also include embedded nuls (it stores a length so doesn't rely on nul-termination). It's UTF-16 encoded. Empty strings are represented as a null pointer.
BSTR
is primarily used with COM. It is a nul-terminated, mutable, UTF-16 string which may include embedded nuls. A null pointer is a valid BSTR
and represents the empty string, though empty BSTR
s may also be used. BSTR
s always work in conjunction with the system allocator (SysAlloc*
) and the length of the string is laid out in memory preceding the data, and a nul character comes after the data in memory; neither is included in the BSTR
's length. A BSTR
is a pointer and points at the first character, not the length.
The PSTR
family of types are 'pointer to char's, pointing to a nul-terminated sequence of characters (similar to C strings). If there is a C
in the name it is an immutable string (otherwise it's mutable), if there is a W
then the characters are wide (two bytes per character) and the string is UTF-16 encoded. If there is no W
, then the characters are one byte and there is no specified encoding (i.e., may be ASCII or UTF-8 or whatever; these are compatible with C strings). An L
in the name can be ignored, e.g., PCWSTR
and LPCWSTR
are the same type.
There are Rust bindings for these types in windows-rs and macros for creating some of these string types in Rust. The type bindings are best used only for FFI: most are newtype wrappers of raw pointers, so it is very easy to create dangling pointers and other memory safety errors when using them.
Windows primarily uses UTF-16. Rust does not have UTF-16 strings in its standard library (though as mentioned above, OsString can losslessly handle UTF-16). The widestring crate provides types including several UTF-16 string types which can make working with Windows strings much easier.
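Using only the standard library, conversion between Rust strings and nul-terminated UTF-16 buffers can be sketched as below. The helper names are made up, and this simple version treats the first nul as the end of the string, so it does not support embedded nuls.

```rust
/// Encodes `text` as UTF-16 and appends a nul terminator.
pub fn to_wide_nul(text: &str) -> Vec<u16> {
    text.encode_utf16().chain(std::iter::once(0)).collect()
}

/// Decodes a nul-terminated UTF-16 buffer. Returns `None` if there is no
/// terminator or the data is not valid UTF-16 (e.g., a lone surrogate).
pub fn from_wide_nul(wide: &[u16]) -> Option<String> {
    let end = wide.iter().position(|&unit| unit == 0)?;
    String::from_utf16(&wide[..end]).ok()
}
```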
FFI with foreign Strings
For the actual FFI, use the Rust string type which agrees with the foreign string type (see table below).
| Foreign type | Rust type |
|---|---|
| C string ([const] char [const] *) | *const c_char or *mut c_char |
| C++ string | cxx::CxxString |
| HSTRING | windows::core::HSTRING |
| BSTR | windows::core::BSTR |
| PSTR / LPSTR | windows::core::PSTR |
| PCSTR / LPCSTR | windows::core::PCSTR |
| PWSTR / LPWSTR | windows::core::PWSTR |
| PCWSTR / LPCWSTR | windows::core::PCWSTR |
Creating most of these strings in Rust is usually possible via some macro or conversion function.
The more interesting question is when and how to convert between the FFI-specific types and more standard Rust types (and which types to use). That is out of scope for the reference, but see TODO patterns.
Memory management
The usual rules of memory management with FFI apply: memory must be released in the same language it was allocated, and using borrowed data is easier.
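For a C string allocated in Rust, releasing memory in the same language usually means exporting a matching free function. The sketch below uses CString::into_raw and CString::from_raw from the standard library; the names greeting_new and greeting_free are hypothetical.

```rust
use std::ffi::{c_char, CString};
use std::ptr;

/// Allocates a C string in Rust and hands ownership to the foreign caller.
pub extern "C" fn greeting_new() -> *mut c_char {
    match CString::new("hello") {
        Ok(s) => s.into_raw(),
        Err(_) => ptr::null_mut(),
    }
}

/// Takes ownership back and frees the string with Rust's allocator.
///
/// # Safety
/// `s` must be null or a pointer returned by `greeting_new` which has not
/// already been freed. Foreign code must not call `free` on it.
pub unsafe extern "C" fn greeting_free(s: *mut c_char) {
    if s.is_null() {
        return;
    }
    drop(unsafe { CString::from_raw(s) });
}
```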
FFI with Rust Strings
It is possible to pass Rust strings across FFI to foreign functions. However, if you are designing an API, it is usually easier to use foreign strings in the FFI and convert these to and from Rust strings internally in Rust code.
If you manipulate the contents of the strings (either in foreign code or unsafe Rust code), then you must respect both the usual invariants around pointers, and Rust's string invariants (from String
docs):
- the memory must have been allocated by the same allocator the standard library uses, with a required alignment of exactly 1,
- the length of the string must be less than or equal to its capacity,
- the capacity of the string must be the correct size of the allocation,
- the first length bytes of the string must be valid UTF-8.
Note that if you are using the string types in Rust functions with foreign bindings, then you must establish these invariants in the foreign code. Doing so in the Rust code is likely to be unsound.
To pass a Rust string to C++, you can use Cxx's bindings for String
or &str
.
To pass a Rust string to C, you can use a struct with the correct layout (you could look at the standard library source code, or just use the Cxx bindings as a reference).
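One possible shape for such a struct is sketched below. RawString and the two helper functions are hypothetical; rather than relying on the internal layout of String, the sketch splits the string into its parts and rebuilds it with String::from_raw_parts, whose safety requirements are the invariants listed above.

```rust
use std::mem::ManuallyDrop;

/// Hypothetical C-compatible view of an owned Rust string. The matching C
/// declaration would be
/// `struct RawString { uint8_t *ptr; size_t len; size_t cap; };`.
#[repr(C)]
pub struct RawString {
    pub ptr: *mut u8,
    pub len: usize,
    pub cap: usize,
}

/// Gives up ownership of `s` without running its destructor.
pub fn string_into_raw(s: String) -> RawString {
    let mut s = ManuallyDrop::new(s);
    RawString {
        ptr: s.as_mut_ptr(),
        len: s.len(),
        cap: s.capacity(),
    }
}

/// Rebuilds the `String` so that Rust can destroy it.
///
/// # Safety
/// `raw` must have come from `string_into_raw`, must not have been rebuilt
/// already, and the first `len` bytes must still be valid UTF-8.
pub unsafe fn string_from_raw(raw: RawString) -> String {
    unsafe { String::from_raw_parts(raw.ptr, raw.len, raw.cap) }
}
```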
Memory management
The easiest scenario is to create a String
in Rust, pass a borrowed &str
to foreign code and ensure that the foreign code does not store the pointer, pass it to another thread, call its destructor, or deallocate it.
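A sketch of this easiest scenario, passing the borrowed string as a pointer and a length. Here count_spaces is written in Rust as a stand-in for the foreign function so the example is self-contained; both names are hypothetical.

```rust
/// Stand-in for a foreign function declared in C as
/// `size_t count_spaces(const uint8_t *ptr, size_t len);` (hypothetical).
/// It only reads the bytes during the call and does not keep the pointer.
///
/// # Safety
/// `ptr` must be valid for reads of `len` bytes.
pub unsafe extern "C" fn count_spaces(ptr: *const u8, len: usize) -> usize {
    let bytes = unsafe { std::slice::from_raw_parts(ptr, len) };
    bytes.iter().filter(|&&b| b == b' ').count()
}

pub fn spaces_in(text: &str) -> usize {
    // The data is not nul-terminated, so the length is passed explicitly.
    // The pointer is only valid while `text` is borrowed, i.e., for the
    // duration of this call.
    unsafe { count_spaces(text.as_ptr(), text.len()) }
}
```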
If you must store the string in foreign code, then you must pass the owned type String
. In this case, you must ensure the pointer remains unique (in particular, you must not keep a reference in the Rust code) and pass it back to Rust for destruction.
If you allocate memory for the string in foreign code, then you must not run its destructor in Rust, and you must pass the string back to foreign code for destruction. The easiest way to do that is to pass &str
to Rust. If you must pass String
(or a raw pointer used to produce a String
in Rust code), then you must ensure that there is no copy of the pointer kept in foreign code, and that the pointer is returned to foreign code for destruction. Using a custom reference counted type might be a better alternative, see TODO pattern.
Resources
Tooling
Bindgen is the most popular and mature tool and is maintained by the Rust project. It is used to create bindings for C code (and some C++) in Rust code. Cbindgen can be used to create C bindings to Rust code. The other tools below are for C++ interop; cxx is the current favourite tool with the community, but is not suitable for all use cases.
- bindgen
- cbindgen
- cxx, repo
- autocxx (Google tool for 'integrating cxx with bindgen')
- diplomat
- rust-cpp (cpp! macro)
- flapigen (formerly rust_swig)
- cxx-async
- ABI Cafe for comparing the output of compilers for ABI compatibility
- crubit experimental C++ interop from Chrome folk
You may want to use COM/WinRT for inter-language interaction, the best Rust support for COM and WinRT is in windows-rs.
Documentation
- Nomicon chapter
- Unofficial FFI guide
- FFI omnibus
- Firefox docs for C++ interop
- FFI idioms and patterns
- Chrome docs for C++ interop
- FFI chapter in ANSSI-FR Secure Rust Guidelines
Unsafe programming
Resources for learning about unsafe programming:
- Chapter in The Book.
- Nomicon
- Unsafe code guidelines
- Ralf's thesis
- Stacked borrows paper
- GhostCell paper
- Ralf's blog
- Gankra's thesis
- Gankra's blog
- MIRI repo