Ruby Deep Dive I: How Ruby Executes Code
After work, I have found myself poking around the Ruby codebase (which is open source). One thing I particularly enjoy when working in C# at DocuSign is diving into the source code, available via the source code browser here (does Ruby have one of these? it should.).
There were times when I was digging into a particular aspect of the language (how async worked, the differences between the synchronization primitives, etc.) and a thorough understanding required a peek into the internals. So when I discovered the Ruby source code on GitHub, I was thrilled.
I spent some time while at Colby College diving into programming language things. Most notably I hacked together a somewhat functioning programming language called Sailfish and developed a novel algorithm for discovering asynchronous method inputs for JavaScript.
Ok, novel was what the crew called my algorithm in the first draft of the paper… which was rejected. In the second draft, which was ultimately the accepted version, they put unsound.
Thus, programming languages have been a lingering interest for a few years at this point (I’d say probably since RustConf 2018? or maybe since I binge listened to the New Rustacean?). Having the opportunity to actually work on the same team as Graydon Hoare during my 2019 summer internship at the Stellar Development Foundation was definitely a huge plus. Anyway, I digress.
While the ultimate goal here is to learn more about Ruby, I have no idea where this source code exploration adventure will take me. Maybe I will hack on my own VM? Maybe I will try to develop an async debug console (check out Tokio’s, it is super neat!). Maybe I will replicate Coyote, but for Ruby? Who knows?
Let’s get to hacking.
How Ruby Executes Code
Answering the question of how Ruby executes code is the goal of this first post in the series. Our journey starts in main.c with a good old main function.
int main(int argc, char **argv) {
    /* shortened for brevity and sanity */
    return rb_main(argc, argv);
}
There is a special case for WebAssembly builds (the preprocessor checks involve __wasm__ and __EMSCRIPTEN__). We ignore it here, so we end up in rb_main:
int
rb_main(int argc, char **argv)
{
    /* shortened for brevity and sanity */
    ruby_sysinit(&argc, &argv);
    {
        RUBY_INIT_STACK;
        ruby_init();
        return ruby_run_node(ruby_options(argc, argv));
    }
}
Initializing Ruby
ruby_sysinit grabs some args and then ruby_init() does some setup work. An interesting check along the way, which bails out early if a VM already exists, is:
if (GET_VM())
    return 0;
where GET_VM() is defined as:
#define GET_VM() rb_current_vm()
and finally rb_current_vm():
static inline rb_vm_t *
rb_current_vm(void)
{
    /* shortened for brevity and sanity */
    return ruby_current_vm_ptr;
}
The ruby_current_vm_ptr pointer is actually set as part of Init_BareVM.
Note to self: if I want to eventually point to my own VM, this is where I’d do it.
Executing Ruby
Jumping back to where we were in rb_main, we next look at where the code is actually executed:
int
rb_main(int argc, char **argv)
{
    /* shortened for brevity and sanity */
    ruby_sysinit(&argc, &argv);
    {
        RUBY_INIT_STACK;
        ruby_init();
        return ruby_run_node(ruby_options(argc, argv)); /* <---- WE ARE HERE ---- */
    }
}
Digging through the source code, we see this calls ruby_run_node, which in turn calls ruby_exec_node, which calls rb_ec_exec_node, and eventually we get to rb_iseq_eval_main (seen below).
VALUE
rb_iseq_eval_main(const rb_iseq_t *iseq)
{
    /* shortened for brevity and sanity */
    val = vm_exec(ec, true);
    return val;
}
So at this point, we are executing our instruction sequence (Ruby's bytecode) on the VM. Woohoo!
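You can poke at the same idea from Ruby itself. Here is a minimal sketch, assuming you are on CRuby (the RubyVM module is implementation specific). The helper name run_on_vm is my own invention for illustration; it compiles a string into an instruction sequence and asks the VM to evaluate it, which is roughly the Ruby-level cousin of what rb_iseq_eval_main does in C.

```ruby
# Hypothetical helper (not part of Ruby): compile a source string into an
# instruction sequence, then execute that sequence on the VM.
# Assumes CRuby, where RubyVM::InstructionSequence is available.
def run_on_vm(source)
  iseq = RubyVM::InstructionSequence.compile(source)
  iseq.eval
end

puts run_on_vm("1 + 2")
```

The point is only that "compile" and "execute" are two separate steps, even when you never see them as such.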
But wait!!! Where did this instruction sequence come from?
Stepping back in the code, this instruction sequence was generated as part of ruby_options(argc, argv), which we see is passed to ruby_run_node above in the rb_main function. Investigating, we see how ruby_options is defined:
void *
ruby_options(int argc, char **argv)
{
    /* shortened for brevity and sanity */
    void *volatile iseq = 0;
    ruby_init_stack((void *)&iseq);
    EC_PUSH_TAG(ec);
    if ((state = EC_EXEC_TAG()) == TAG_NONE) {
        SAVE_ROOT_JMPBUF(GET_THREAD(), iseq = ruby_process_options(argc, argv)); /* <--- Look here --- */
    }
    else {
        /* shortened for brevity and sanity */
    }
    EC_POP_TAG();
    return iseq;
}
This eventually gets us to process_options, which takes in the command line options as well. A lot happens here, and a few things stuck out to me. In the function linked here we:
- create a new parser
- parse the source into an AST (this happens here and here as well, depending on a few things I glossed over…)
- get the instruction sequence
- dump the instructions, assuming this is --dump=insns
Fascinating!
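The steps above can be watched from Ruby land too. This is a rough sketch, assuming CRuby, and it is not what process_options literally does. RubyVM::AbstractSyntaxTree is an experimental API, and pipeline_stages is a name I made up for illustration. It parses a string into an AST, compiles the same string into an instruction sequence, and returns the disassembly, which is similar in spirit to what --dump=insns prints.

```ruby
# Hypothetical helper (my own name, not from the Ruby source): expose the
# intermediate stages for a snippet of Ruby code. Assumes CRuby.
def pipeline_stages(source)
  ast  = RubyVM::AbstractSyntaxTree.parse(source)      # source -> AST
  iseq = RubyVM::InstructionSequence.compile(source)   # source -> instruction sequence
  { ast_type: ast.type, disasm: iseq.disasm }          # disasm is a human-readable dump
end

stages = pipeline_stages("puts 1 + 2")
puts stages[:ast_type]
puts stages[:disasm]
```

The exact disassembly text varies between Ruby versions, so treat the output as something to read rather than something to depend on.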
Conclusion
In this post, we discovered the basics of what happens when you run ruby <FILEPATH>.
- Ruby initializes, including initializing the stack
- the CLI options are parsed
- the code is parsed and an AST is generated
- the AST is walked and bytecode (instruction sequence) is generated
- the instruction sequence is passed to the VM
- the VM executes this code
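For a file on disk, the list above can be approximated in two explicit calls. This is a sketch under assumptions: it requires CRuby, run_file_in_steps is a hypothetical helper of mine, and the real ruby binary does far more (option handling, load path setup, and so on) than this does.

```ruby
require "tempfile"

# Hypothetical helper: mimic the parse/compile and execute steps of
# `ruby <FILEPATH>` separately. Assumes CRuby.
def run_file_in_steps(path)
  iseq = RubyVM::InstructionSequence.compile_file(path) # parse + compile the file
  iseq.eval                                             # execute on the VM
end

# Write a throwaway script so the example is self-contained.
Tempfile.create(["demo", ".rb"]) do |file|
  file.write("[1, 2, 3].sum\n")
  file.flush
  puts run_file_in_steps(file.path)
end
```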
By understanding this process, we discovered where we could dig deeper to learn more about each compilation step and even where we could insert our own VM.
Hope you enjoyed this post and learned a thing or two (besides how awful C macros are). On to the next one!