Posted by Chris Wailes – Senior Software Engineer
The performance, safety, and developer productivity provided by Rust have led to rapid adoption in the Android Platform. Since slower build times are a concern when using Rust, particularly within a massive project like Android, we’ve worked to ship the fastest version of the Rust toolchain that we can. To do this we leverage multiple forms of profiling and optimization, as well as tuning C/C++, linker, and Rust flags. Much of what I’m about to describe is similar to the build process for the official releases of the Rust toolchain, but tailored for the specific needs of the Android codebase. I hope that this post will be generally informative and, if you are a maintainer of a Rust toolchain, may make your life easier.
Android’s Compilers
While Android is certainly not unique in its need for a performant cross-compiling toolchain, this fact, combined with the large number of daily Android build invocations, means that we must carefully balance tradeoffs between the time it takes to build a toolchain, the toolchain’s size, and the produced compiler’s performance.
Our Build Process
To be clear, the optimizations listed below are also present in the versions of rustc that are obtained using rustup. What differentiates the Android toolchain from the official releases, besides the provenance, are the cross-compilation targets available and the codebase used for profiling. All performance numbers listed below are based on the time it takes to build the Rust components of an Android image and may not reflect the speedup when compiling other codebases with our toolchain.
Codegen Units (CGU1)
When Rust compiles a crate it will break it into some number of code generation units. Each independent chunk of code is generated and optimized concurrently and then later recombined. This approach allows LLVM to process each code generation unit separately and improves compile time, but it can reduce the performance of the generated code. Some of this performance can be recovered via the use of Link-Time Optimization (LTO), but this isn’t guaranteed to achieve the same performance as if the crate were compiled in a single codegen unit.
To expose as many opportunities for optimization as possible and ensure reproducible builds we add the -C codegen-units=1 option to the RUSTFLAGS environment variable. This reduces the size of the toolchain by ~5.5% while increasing performance by ~1.8%.
Be aware that setting this option will slow down the time it takes to build the toolchain by ~2x (measured on our workstations).
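As a minimal sketch of the RUSTFLAGS change described above, the snippet below appends the flag to whatever is already set rather than overwriting it. The helper name with_rustflags is hypothetical and this is not the Android build scripts' actual code, only an illustration of the idea.

```python
import os


def with_rustflags(env, *flags):
    """Return a copy of env with extra flags appended to RUSTFLAGS."""
    updated = dict(env)
    existing = updated.get("RUSTFLAGS", "").split()
    updated["RUSTFLAGS"] = " ".join(existing + list(flags))
    return updated


# Keep any flags the caller already set, then force a single codegen unit.
build_env = with_rustflags(os.environ, "-C", "codegen-units=1")
print(build_env["RUSTFLAGS"])
```

The resulting mapping can be passed as the env argument of subprocess.run when launching the build, so the setting applies only to that invocation.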
GC Sections
Many projects, including the Rust toolchain, have functions, classes, or even entire namespaces that are not needed in certain contexts. The safest and easiest option is to leave these code objects in the final product. This will increase code size and may decrease performance (due to caching and layout issues), but it should never produce a miscompiled or mislinked binary.
It is possible, however, to ask the linker to remove code objects that aren’t transitively referenced from the main() function using the --gc-sections linker argument. The linker can only operate on a per-section basis, so if any object in a section is referenced, the entire section must be retained. This is why it is also common to pass the -ffunction-sections and -fdata-sections options to the compiler or code generation backend. This will ensure that each code object is given an independent section, thus allowing the linker’s garbage collection pass to collect objects individually.
This is one of the first optimizations we implemented and, at the time, it produced significant size savings (on the order of 100s of MiBs). However, most of these gains have been subsumed by those from setting -C codegen-units=1 when the two are used in combination, and there is now no difference in size or performance between the two produced toolchains. Due to the extra overhead, though, we do not always use CGU1 when building the toolchain. When testing for correctness, the final speed of the compiler is less important and, as such, we allow the toolchain to be built with the default number of codegen units. In these situations we still run section GC during linking, as it yields some performance and size benefits at a very low cost.
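To make the section-granularity point concrete, here is a toy model of section garbage collection. It is not how any real linker is implemented. The symbol names, section names, and the live_sections helper are all invented for illustration. The model keeps every section reachable from the roots, and because whole sections are kept, every symbol inside a kept section drags its own references along.

```python
def live_sections(section_of, references, roots):
    """Return the sections a section-granular garbage collector must keep.

    section_of: symbol -> section that contains it
    references: symbol -> symbols it refers to
    roots:      entry-point symbols that are always kept
    """
    # Group symbols by section: whole sections are kept or dropped.
    members = {}
    for symbol, section in section_of.items():
        members.setdefault(section, []).append(symbol)

    kept = set()
    worklist = [section_of[root] for root in roots]
    while worklist:
        section = worklist.pop()
        if section in kept:
            continue
        kept.add(section)
        # Every symbol in a kept section stays, so follow all its references.
        for symbol in members[section]:
            for target in references.get(symbol, ()):
                worklist.append(section_of[target])
    return kept


references = {
    "main": ["parse"],
    "parse": [],
    "debug_dump": ["format_table"],
    "format_table": [],
}

# Coarse layout: parse and the unused debug_dump share one section.
coarse = {
    "main": "a.o:.text",
    "parse": "b.o:.text",
    "debug_dump": "b.o:.text",
    "format_table": "c.o:.text",
}
# Fine layout, as with -ffunction-sections: one section per function.
fine = {name: ".text." + name for name in references}

print(sorted(live_sections(coarse, references, ["main"])))
print(sorted(live_sections(fine, references, ["main"])))
```

In the coarse layout the unused debug_dump survives because it shares a section with parse, and it keeps format_table alive as well. With one section per function only main and parse remain.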
Link-Time Optimization (LTO)
A compiler can only optimize the functions and data it can see. Building a library or executable from independent object files or libraries can speed up compilation, but at the cost of optimizations that depend on information that’s only available when the final binary is assembled. Link-Time Optimization gives the compiler another opportunity to analyze and modify the binary during linking.
For the Android Rust toolchain we perform thin LTO on both the C++ code in LLVM and the Rust code that makes up the Rust compiler and tools. Because the IR emitted by our clang might be a different version than the IR emitted by rustc, we can’t perform cross-language LTO or statically link against libLLVM. However, the performance gains from using an LTO-optimized shared library are greater than those from using a non-LTO-optimized static library, so we’ve opted to use shared linking.
Using CGU1, GC sections, and LTO produces a speedup of ~7.7% and size improvement of ~5.4% over the baseline. This works out to a speedup of ~6% over the previous stage in the pipeline due solely to LTO.
Profile-Guided Optimization (PGO)
Command line arguments, environment variables, and the contents of files can all influence how a program executes. Some blocks of code might be used frequently while other branches and functions may only be used when an error occurs. By profiling an application as it executes we can collect data on how often these code blocks are executed. This data can then be used to guide optimizations when recompiling the program.
We use instrumented binaries to collect profiles from both building the Rust toolchain itself and from building the Rust components of Android images for x86_64, aarch64, and riscv64. These four profiles are then combined and the toolchain is recompiled with profile-guided optimizations.
As a result, the toolchain achieves a ~19.8% speedup and 5.3% reduction in size over the baseline compiler. This is a 13.2% speedup over the previous stage in the pipeline.
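The instrument, merge, and recompile cycle can be sketched for an ordinary Rust project as follows. This is a generic illustration rather than the Android toolchain's build scripts. The -C profile-generate and -C profile-use flags and the llvm-profdata merge command are the standard rustc and LLVM interfaces, but the file names, the directory name, and the pgo_steps helper are hypothetical. The snippet only assembles the flags and the command line and does not run anything.

```python
def pgo_steps(raw_profiles, merged="merged.profdata", profile_dir="pgo-data"):
    """Describe the three PGO steps as rustc flags and a merge command."""
    return {
        # 1. Build with instrumentation; running the result writes raw profiles.
        "instrument_flag": f"-Cprofile-generate={profile_dir}",
        # 2. Combine every raw profile into a single indexed profile.
        "merge": ["llvm-profdata", "merge", "-o", merged, *raw_profiles],
        # 3. Rebuild, letting the merged profile guide optimization.
        "use_flag": f"-Cprofile-use={merged}",
    }


# Hypothetical file names standing in for the four profiles described above.
profiles = [
    "toolchain.profraw",
    "android-x86_64.profraw",
    "android-aarch64.profraw",
    "android-riscv64.profraw",
]
steps = pgo_steps(profiles)
print(steps["instrument_flag"])
print(" ".join(steps["merge"]))
print(steps["use_flag"])
```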
BOLT: Binary Optimization and Layout Tool
Even with LTO enabled the linker is still in control of the layout of the final binary. Because it isn’t being guided by any profiling information, the linker might accidentally place a function that is frequently called (hot) next to a function that is rarely called (cold). When the hot function is later called, all functions on the same memory page will be loaded. The cold functions are now taking up space that could be allocated to other hot functions, thus forcing the additional pages that do contain these functions to be loaded.
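The effect of layout on paging can be illustrated with a small model. The 4 KiB page size, the function sizes, and the pages_touched_by_hot helper are assumptions chosen for the example and are not measurements from the toolchain.

```python
PAGE_SIZE = 4096  # bytes; a common page size, assumed for illustration


def pages_touched_by_hot(layout):
    """Count the pages that contain at least one byte of a hot function.

    layout: ordered list of (size_in_bytes, is_hot) pairs in link order.
    """
    touched = set()
    offset = 0
    for size, is_hot in layout:
        if is_hot and size > 0:
            first = offset // PAGE_SIZE
            last = (offset + size - 1) // PAGE_SIZE
            touched.update(range(first, last + 1))
        offset += size
    return len(touched)


# Eight hot and eight cold functions of 1 KiB each, interleaved hot/cold.
interleaved = [(1024, i % 2 == 0) for i in range(16)]
# The same functions with the hot ones grouped at the front.
hot_first = sorted(interleaved, key=lambda function: not function[1])

print(pages_touched_by_hot(interleaved))  # 4
print(pages_touched_by_hot(hot_first))    # 2
```

In this model, grouping the hot functions halves the number of pages that must be resident to run them.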
BOLT mitigates this problem by using an additional set of layout-focused profiling information to reorganize functions and data. For the purposes of speeding up rustc we profiled libLLVM, libstd, and librustc_driver, which are the compiler’s main dependencies. These libraries are then BOLT optimized using the following options:
--peepholes=all
--data=<path-to-profile>
--reorder-blocks=ext-tsp
--reorder-functions=hfsort
--split-functions
--split-all-cold
--split-eh
--dyno-stats
Any additional libraries matching lib/*.so are optimized without profiles using only --peepholes=all.
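As a rough sketch, the two cases above (profiled libraries versus everything else under lib/) could be turned into llvm-bolt command lines as shown below. The option list is copied from this post. The library paths, the profile file name, and the bolt_command helper are hypothetical. The snippet only builds argument lists and does not invoke BOLT.

```python
# Options applied after --data when a profile is available, as listed above.
PROFILED_OPTIONS = [
    "--reorder-blocks=ext-tsp",
    "--reorder-functions=hfsort",
    "--split-functions",
    "--split-all-cold",
    "--split-eh",
    "--dyno-stats",
]


def bolt_command(library, output, profile=None):
    """Build an llvm-bolt argument list for one shared library."""
    argv = ["llvm-bolt", library, "-o", output, "--peepholes=all"]
    if profile is not None:
        # Profile-driven layout options only make sense with profile data.
        argv += [f"--data={profile}"] + PROFILED_OPTIONS
    return argv


# Hypothetical paths: one profiled library and one unprofiled library.
print(" ".join(bolt_command("lib/libLLVM.so", "out/libLLVM.so", "libLLVM.fdata")))
print(" ".join(bolt_command("lib/libother.so", "out/libother.so")))
```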
Applying BOLT to our toolchain produces a speedup over the baseline compiler of ~24.7% at a size increase of ~10.9%. This is a speedup of ~6.1% over the PGOed compiler without BOLT.
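The stage-over-stage figures quoted in each section can be approximately reproduced from the cumulative ones. The sketch below assumes that "speedup" means a percentage reduction in build time relative to the baseline. That is my reading of the numbers rather than something stated explicitly, and small differences from the quoted values (13.1% here versus 13.2% above) are presumably rounding in the published figures.

```python
# Cumulative reduction in build time versus the baseline, taken from this post.
cumulative = [
    ("CGU1", 0.018),
    ("CGU1 + GC sections + LTO", 0.077),
    ("+ PGO", 0.198),
    ("+ BOLT", 0.247),
]


def stage_gain(previous, current):
    """Fraction of build time saved relative to the preceding stage."""
    return 1 - (1 - current) / (1 - previous)


for (_, prev), (name, cur) in zip(cumulative, cumulative[1:]):
    print(f"{name}: {stage_gain(prev, cur):.1%}")
# CGU1 + GC sections + LTO: 6.0%
# + PGO: 13.1%
# + BOLT: 6.1%
```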
If you are interested in using BOLT in your own project/build I offer these two bits of advice: 1) you’ll need to emit additional relocation information into your binaries using the -Wl,--emit-relocs linker argument and 2) use the same input library when invoking BOLT to produce the instrumented and the optimized versions.
Conclusion
By compiling as a single code generation unit, garbage collecting our data objects, performing both link-time and profile-guided optimizations, and leveraging the BOLT tool we were able to speed up the time it takes to compile the Rust components of Android by 24.8%. For every 50k Android builds per day run in our CI infrastructure we save ~10K hours of serial execution.
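As a back-of-the-envelope check on those figures, the snippet below works out what they imply per build. It assumes the quoted numbers are exact and that the savings are spread evenly across builds. The implied baseline time is an inference from those assumptions, not a measured value.

```python
builds_per_day = 50_000
hours_saved_per_day = 10_000
reduction = 0.248  # fraction of Rust build time removed, from this post

# Serial time saved on each build, in minutes.
minutes_saved_per_build = hours_saved_per_day * 60 / builds_per_day
# Baseline Rust build time that would make the two figures consistent.
implied_baseline_minutes = minutes_saved_per_build / reduction

print(minutes_saved_per_build)             # 12.0
print(round(implied_baseline_minutes, 1))  # 48.4
```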
Our industry is not one to stand still and there will surely be another tool and another set of profiles in need of collecting in the near future. Until then we’ll continue making incremental improvements in search of additional performance. Happy coding!