Sunday, 8 January 2023

Beginning Rust

I'd wanted to learn Rust for a while, but it's not enough to read a book or even follow some examples - the way to learn is to use the language. In December I used Rust for all of the Advent of Code 2022 puzzles. AoC is a really good way to learn a new language.

I'm glad I spent a little time before December working on puzzles from earlier years, since Rust is challenging for the same reason it is interesting: it requires the programmer to be explicit about ownership and borrowing. Rust code is very clear about whether things are passed by value or by reference, which code "owns" some data and which is borrowing it, and whether it is borrowed as read-only or mutable. Although these concepts are not unique to Rust, they're not enforced in most other languages. In Rust, the language syntax includes these concepts and the compiler statically checks them. I was very curious about how well this would work and what it would actually involve.

When I worked through the first few AoC 2015 and 2019 exercises, I wrote programs that just used built-in Rust data types such as Vec and HashMap, and this was straightforward, because no pointers or references were involved. All data was owned by the containing data structure and borrowed temporarily for use elsewhere.

But then I reached 2019 day 6, which involved a tree data structure, and I tried to write this in the "obvious" way, where each tree node is an object, and each node has a pointer to its parent. This approach resulted in a big fight with the Rust compiler because it was no longer possible to just borrow data temporarily from a containing Vec or HashMap. Instead, everything was referenced via pointers, and hence it became much harder to satisfy the compiler's rule checks.

In Rust programs, pointers can't be copied freely. They can be moved or borrowed, but at any time, there is only one copy of each pointer. This is a difficulty for a tree where each node has a pointer to its parent, because a node with N children will be referenced by at least N pointers!

The intended tree data structure is not complicated and it would be easy to write in most other languages, but only because those languages avoid questions about ownership.

Initially I tried to make Rust do things the way I wanted. I simplified the code and data structure still further, seeking anything that would work, until it became a singly-linked list. In this case, each node really is only referenced by one pointer. But even this simple data structure is still difficult to write in Rust. The same ownership questions exist.

This learning path is well-trodden. It leads to a book, "Learn Rust With Entirely Too Many Linked Lists". The book is not just about linked lists. It's about all data structures that involve pointers, the problems you encounter when trying to use them in Rust, and the solutions that exist.

The main thing I learned from the book is to think differently. If a tree or a graph data structure is needed, it doesn't have to involve pointers. It can be a hash table in which every node has a name or number (the hash key) and references to other nodes use the same names/numbers. In this case, the ownership problems disappear, because the HashMap owns everything, and the code borrows from it.

In C, Java and many more, pointers are the obvious way to make a tree data structure. Rust turns that over: they're way down the list of good choices.

I was able to do all of the AoC 2022 exercises without using pointers. Complex data structures were sometimes needed, but I just built these around Vec, HashMap or BinaryHeap. I got good results this way. Though I messed up many times, the compiler errors were relatively easy to solve, generally just a result of "moving" when I intended to "borrow", or failing to mark some data structure as Copy or Clone.

I even used this approach for one case where a custom tree data structure provided a better solution than Vec. This was 2022 day 20, and details are here. In this case I needed a binary tree with special properties, so I implemented it myself (well, translated it from my earlier C# implementation, itself based on pseudocode from a textbook). But I still used a HashMap to store the nodes in order to avoid the hassle of using pointers and dealing with ownership issues. This meant that the C# code translated very easily, as hash keys directly replaced the C# references.

Having finished reading "Rust With Entirely Too Many Linked Lists", I have revisited 2019 day 6 and rewritten my solution twice: first to use HashMap with no pointers, and again to use pointers with Rc and RefCell. I think the eventual solution is ok, but the HashMap solution is much clearer. At some point I may also return to 2022 day 20 and switch to Rc and RefCell. I think this will be faster, but it will require a lot more thought, and the code will be harder to understand.

To new Rust users, my suggestion is to treat pointers as a last resort. Don't be afraid to avoid them, don't feel you are "missing out" or not doing it "properly". It's not C/C++ or anything like that, so your approach can be different. You're learning to think in a different language, and this is a good thing. Eventually, you are bound to need pointers for something, but you can do many things without them and get familiar with the rest of the language first.

Generally I liked Rust. I like the strong typing rules, which were almost at Ada's level. If you mix integer types (i8 and i16, say) then you have to be explicit about the type conversions, completely avoiding integer promotion issues. I like the compiler errors and the strictness. I don't enjoy the permissiveness of C and I'm not the sort of person who ignores or disables warnings. I'm one of those -Werror people. If you don't like this, stay away from Rust because it's fussy.

I also liked the ability to write (and run) unit tests alongside the code being tested without any third-party addon. Python has this too, but it's not normally used. Being able to do this encouraged me to write the code for unit testing, and write more unit tests. Unfortunately Rust's library seems to be missing a random number function, which would be useful for some unit tests. I know that random numbers are hard to generate properly, and that they can be misused, and so it's a good idea to force users to think about whether they need random numbers that are good enough for a game, or good enough for science, or good enough for cryptography, with appropriate levels of effort in each case, but still it would be nice if batteries were included in this case.

I liked and disliked Rust's enum type simultaneously. I liked how it is more of a "tagged union" or "variant record" type, since each enum element can have values associated with it. I disliked how this also meant that it was impossible to iterate through all enum values without installing some non-standard feature, and you also can't use enum values as array indexes, which can sometimes be useful. It seems like it might have been better to separate enums and variant records.

What's a bit disappointing about Rust is to find that the standard library itself makes quite widespread use of "unsafe Rust", which is Rust without the ownership rules. There are some chapters on this within "Rust With Entirely Too Many Linked Lists", and unfortunately it seems to be necessary to use unsafe Rust to implement some linked list functionality (e.g. splicing lists). This points to an unsatisfying incompleteness in the language, since in general, the standard library for a language can be written in that language, with exceptions only for the interfaces needed for external I/O.

The whole topic of "unsafe Rust" is rather frightening because of the possibility of code that appears to work but introduces a subtle bug in some cases, for example, by breaking some aliasing assumption made within safe Rust. It seems that unsafe Rust might be even more dangerous than C, because mistakes could propagate into the safe code, as it assumes everything else is safe. There is some discussion of this in the book, and the message that unsafe Rust should be avoided, but that message is undermined by the fact that the top Rust experts did use it in the standard library, and not just in edge cases like system calls or an interface to C. I hear Saruman saying: "Why should we fear to use it?"

Unsafe Rust is a potential problem, given programmer psychology. If this language is mandated at any company, there will be programmers who are happy to learn it properly and write idiomatic code, and there will be others who will get really frustrated with it, just want to get their work done, and see any delay as a problem with the language. The latter group don't like any attempt from a language designer to force them to think differently, and they'll resist it. "unsafe" is the cheat code for the expert mode that says "I know what I'm doing here". It doesn't hassle you with compile errors about ownership.

I have seen similar problems with Ada, which shares some design philosophy with Rust - for example, it's also strongly-typed, with a fussy compiler, and intended to improve the reliability of software. When I used Ada at work, I found that Ada's design was extremely annoying to some colleagues, who just wanted to write C and be awesome 10X programmers. I sense that "unsafe" would have similar appeal to such people. "unsafe" has been used inside the library, and is therefore approved by experts. This is a green light to anyone who might consider themselves an expert, which is not a well-defined term.

I would like to see Rust adopted more widely. I liked using it. The design philosophy appeals to me, and just as with Ada, I see the fussiness as helpful, if sometimes frustrating. Most of all, if the code compiles, it will probably work, and this is a wonderful property for a language to have.