In this post, I am gonna walk you through compiling Python code to CPython bytecode: what code objects are, how to construct them, how to disassemble them, and how to decompile them.
I will be using CPython 3.6.5.
>>> codestr = """
print('Witness me!')
"""
>>> compiled_codestr = compile(codestr, '<string>', 'exec')
>>> type(compiled_codestr)
<class 'code'>
Whoo, we have created our first code object.
We passed arguments to the compile function as follows:
codestr is, as you might have guessed, our code as a string.
The second argument is the filename of the file from which the code was read. Since we typed this into an interpreter rather than reading it from a file, we passed '<string>', as per the documentation.
The third argument is called mode in the documentation, and it specifies what kind of code is being compiled. We could have used eval here, since we’re compiling a single expression. Refer to the compile function documentation for more details on the mode argument, and refer to this for a detailed explanation of eval, exec, and the differences between them.
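To make the difference between the two modes concrete, here is a minimal sketch. The snippets "1 + 2" and "x = 1 + 2" are made up for illustration and are not part of the example above.

```python
# "eval" mode: the source must be a single expression; eval() returns its value.
expr_code = compile("1 + 2", "<string>", "eval")
expr_value = eval(expr_code)

# "exec" mode: the source may contain statements. exec() always returns None,
# so we read the result back out of the namespace the code ran in.
stmt_code = compile("x = 1 + 2", "<string>", "exec")
namespace = {}
exec(stmt_code, namespace)

# An assignment is a statement, not an expression, so "eval" mode rejects it.
try:
    compile("x = 1 + 2", "<string>", "eval")
    eval_accepts_statement = True
except SyntaxError:
    eval_accepts_statement = False

print(expr_value, namespace["x"], eval_accepts_statement)  # 3 3 False
```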
Let’s look at some of the attributes that this code object has:
>>> compiled_codestr.co_consts
('Witness me!',)
>>> compiled_codestr.co_filename
'<string>'
co_consts is a tuple of constants; you can see the string from our codestr here as the first element of the tuple.
co_filename is the name of the file this code object was compiled from. Since we passed '<string>' to the compile function, that’s what we get here.
We will explore more (but not all) attributes of the code object later on. For now, let’s see the list of all available attributes:
>>> dir(compiled_codestr)
['__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__',
'__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__',
'__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__',
'co_argcount', 'co_cellvars', 'co_code', 'co_consts', 'co_filename', 'co_firstlineno', 'co_flags', 'co_freevars',
'co_kwonlyargcount', 'co_lnotab', 'co_name', 'co_names', 'co_nlocals', 'co_stacksize', 'co_varnames']
We’re interested in the attributes that start with co_. For a complete description of these attributes you can refer to:
Code objects in CPython’s documentation.
A description of the attributes of a code object is available in this table, under the code entry.
The definition of the PyCodeObject struct in CPython.
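If you would rather see the co_ attributes together with their values, something along these lines should work. It is only a sketch: the exact set of attributes depends on the CPython version, and the warnings filter is there because newer versions warn when deprecated attributes such as co_lnotab are accessed.

```python
import warnings

code = compile("print('Witness me!')", "<string>", "exec")

with warnings.catch_warnings():
    # Newer CPython versions emit a DeprecationWarning for some attributes
    # (for example co_lnotab); silence that for this listing.
    warnings.simplefilter("ignore")
    co_attrs = {
        name: getattr(code, name)
        for name in dir(code)
        # Skip helper methods that newer versions also expose under co_*.
        if name.startswith("co_") and not callable(getattr(code, name))
    }

for name, value in sorted(co_attrs.items()):
    print(f"{name:20} {value!r}")
```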
Well, we have now constructed a code object and looked at some of its attributes. What else can we do with it? We can exec it:
>>> exec(compiled_codestr)
Witness me!
Now let’s look at the code object of a function:
>>> def hello(name):
...     print("Hello,", name)
...
>>> codeobj = hello.__code__
>>> type(codeobj)
<class 'code'>
Since this function takes one argument, let’s see if the co_argcount attribute of the code object reflects this:
>>> codeobj.co_argcount
1
This code object should also have the string that we use in the call to print in its co_consts attribute:
>>> codeobj.co_consts
(None, 'Hello,')
We’d also like to see the name of the function. We can find this by checking the co_name attribute:
>>> codeobj.co_name
'hello'
Let’s check out the local variables of our codeobj
# Number of locals
>>> codeobj.co_nlocals
1
# Names of locals
>>> codeobj.co_varnames
('name',)
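In hello the only local is the argument itself, so co_argcount and co_nlocals happen to be equal. A slightly bigger function (area is a made-up example, not from the post) shows how the three attributes relate: arguments come first in co_varnames, followed by the other locals.

```python
def area(width, height):
    result = width * height  # 'result' is a local that is not an argument
    return result

code = area.__code__
print(code.co_argcount)                      # 2
print(code.co_nlocals)                       # 3
print(code.co_varnames)                      # ('width', 'height', 'result')
print(code.co_varnames[:code.co_argcount])   # ('width', 'height')
```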
Now, let’s look at one final attribute of code objects, co_code:
>>> codeobj.co_code
b't\x00d\x01|\x00\x83\x02\x01\x00d\x00S\x00'
Oh, look at that, it’s a bytes object. But what does it represent? This is the bytecode of the hello function.
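Since CPython 3.6 every instruction occupies two bytes: one for the opcode and one for its argument. So we can roughly decode the bytes above by hand. The sketch below works on the literal bytes from the 3.6.5 session; the small OPNAMES_36 table is my own helper and lists only the opcode numbers this function uses, as they are numbered in CPython 3.6. Other versions number them differently.

```python
# The co_code value shown above, copied from the CPython 3.6.5 session.
raw = b't\x00d\x01|\x00\x83\x02\x01\x00d\x00S\x00'

# Opcode numbers for just the instructions used here, as defined in
# CPython 3.6 (hypothetical helper table; other versions differ).
OPNAMES_36 = {
    1: "POP_TOP",
    83: "RETURN_VALUE",
    100: "LOAD_CONST",
    116: "LOAD_GLOBAL",
    124: "LOAD_FAST",
    131: "CALL_FUNCTION",
}

decoded = []
for offset in range(0, len(raw), 2):
    # First byte of each pair is the opcode, second byte is its argument.
    opcode, arg = raw[offset], raw[offset + 1]
    decoded.append((offset, OPNAMES_36[opcode], arg))

for offset, name, arg in decoded:
    print(f"{offset:>3} {name:<15} {arg}")
```

Instructions that take no argument, such as POP_TOP and RETURN_VALUE, still carry an argument byte; it is simply 0 and ignored.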
Yes, it looks unwieldy in this raw form, but fortunately there’s a better way to understand code objects.
The CPython virtual machine (CPython VM) is a stack-based VM. This means that bytecode instructions work by pushing things onto a stack and popping things off of it.
Let’s see an example, continuing with the code object of the hello function that we defined before.
>>> import dis
>>> dis.dis(codeobj)
  2           0 LOAD_GLOBAL              0 (print)
              2 LOAD_CONST               1 ('Hello,')
              4 LOAD_FAST                0 (name)
              6 CALL_FUNCTION            2
              8 POP_TOP
             10 LOAD_CONST               0 (None)
             12 RETURN_VALUE
Before we dig into the meaning of the instructions, let’s first define what each column in the previous output means:
  2           0 LOAD_GLOBAL              0 (print)
  |           |      |                   |    |
  |           |      |                   |    +--- Interpretation of the parameters in parentheses.
  |           |      |                   +-------- Operation parameters.
  |           |      +---------------------------- The operation code name.
  |           +----------------------------------- The address of the instruction.
  +----------------------------------------------- The line number, for the first instruction of each line.
Sometimes there can be more columns, but we’ll stick to the ones that were generated from our code object. For a complete description of the output of dis() you can refer to its documentation.
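The same columns are also available programmatically through dis.get_instructions(), which yields one Instruction object per row. A minimal sketch follows; the exact instructions you get back depend on your CPython version, so newer interpreters will print more rows than the 3.6.5 listing above.

```python
import dis

def hello(name):
    print("Hello,", name)

# One tuple per row: address, operation name, parameter, interpretation.
rows = [
    (instr.offset, instr.opname, instr.arg, instr.argrepr)
    for instr in dis.get_instructions(hello)
]

for offset, opname, arg, argrepr in rows:
    print(offset, opname, arg, argrepr)
```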
Now let’s dig into the disassembly and figure out what each line means.
  2           0 LOAD_GLOBAL              0 (print)
Here, the LOAD_GLOBAL instruction pushes the global named co_names[namei] onto the stack. In this case it’s loading co_names[0]. Let’s verify this by inspecting our code object:
>>> codeobj.co_names
('print',)
Cool, now let’s move to the 2nd line
              2 LOAD_CONST               1 ('Hello,')
The LOAD_CONST instruction pushes co_consts[consti] onto the stack. In our case this will load co_consts[1]. Verifying:
>>> codeobj.co_consts
(None, 'Hello,')
Moving on to the next line
              4 LOAD_FAST                0 (name)
The LOAD_FAST instruction pushes a reference to the local co_varnames[var_num] onto the stack, here co_varnames[0]. Verifying:
>>> codeobj.co_varnames
('name',)
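We can cross-check all three lookups in one go. The sketch below collects the resolved value (argval) of every LOAD_* instruction and confirms that each one comes from the matching tuple on the code object. It filters on the "LOAD_" prefix rather than exact names because newer CPython versions use slightly different instruction names and argument encodings than 3.6.

```python
import dis

def hello(name):
    print("Hello,", name)

code = hello.__code__

# (instruction name, resolved value) for every load instruction.
loaded = [
    (instr.opname, instr.argval)
    for instr in dis.get_instructions(code)
    if instr.opname.startswith("LOAD_")
]
loaded_values = [value for _, value in loaded]

print(loaded)
print("print" in code.co_names)      # global names live in co_names
print("Hello," in code.co_consts)    # constants live in co_consts
print("name" in code.co_varnames)    # locals live in co_varnames
```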
Alright, things are about to get a little more interesting, but first let’s review what our stack looks like currently.
We’ve done 3 operations which push things onto the stack, roughly:
push print
push 'Hello,'
push (ref name)
Translating this into a visual representation, this is what our stack looks like:
|        |
+--------+
|ref name|
+--------+
|'Hello,'|
+--------+
| print  |
+--------+
Alright, let’s continue looking at the disassembly. The next line is
              6 CALL_FUNCTION            2
CALL_FUNCTION, as is obvious from its name, will call a function. But what does the argument that it takes, 2 in our case, mean?
It indicates the number of positional arguments that the function will be called with. (In CPython 3.5 and earlier this number was interpreted as a 2-byte (16-bit) number, where the low byte indicated the number of positional arguments and the high byte the number of keyword arguments. As of CPython 3.6, CALL_FUNCTION is only used for calls with positional arguments, and calls with keyword arguments use CALL_FUNCTION_KW instead.)
In our case it means that the function will be called with 2 positional arguments, which are popped off the stack. The arguments sit on the stack in the order they were pushed, so the rightmost argument is on the top of the stack and is popped first.
So to summarize what CALL_FUNCTION will do here:
Pop 2 arguments off of the stack.
Pop the callable that was below them on the stack, in our case print, and call it with those arguments.
Push the return value of the function onto the stack.
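The steps above can be modelled with a plain Python list standing in for the stack. This is only a toy sketch of the behaviour described here, not how the CPython VM is actually implemented. call_function is a made-up helper, and "reader" is a made-up value for the local name.

```python
def call_function(stack, argc):
    """Toy model of CALL_FUNCTION for positional arguments only."""
    # The rightmost argument is on top, so popping yields the arguments in
    # reverse; flip them back into left-to-right order.
    args = [stack.pop() for _ in range(argc)][::-1]
    func = stack.pop()          # the callable sat below its arguments
    stack.append(func(*args))   # push the return value

name = "reader"                 # hypothetical value for the local 'name'
stack = []
stack.append(print)             # LOAD_GLOBAL 0 (print)
stack.append("Hello,")          # LOAD_CONST  1 ('Hello,')
stack.append(name)              # LOAD_FAST   0 (name)
call_function(stack, 2)         # CALL_FUNCTION 2, prints: Hello, reader
print(stack)                    # [None]
```

After the call, the only thing left on the stack is the return value of print, which is None.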
Awesome, let’s move to the next line in the disassembly
              8 POP_TOP
The POP_TOP instruction simply removes the item on the top of the stack, so we’ve now discarded the value that was returned by our last CALL_FUNCTION instruction.
The final 2 lines in our disassembly are
             10 LOAD_CONST               0 (None)
             12 RETURN_VALUE
The first is a LOAD_CONST as before, the constant this time being None. The second returns the value on the top of the stack, i.e. it returns None to the caller of the function.
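That trailing pair of instructions is why a function without an explicit return statement still hands None back to its caller. A quick check, with "world" as an arbitrary argument:

```python
def hello(name):
    print("Hello,", name)

# No return statement in hello, yet the call still produces a value.
result = hello("world")
print(result is None)  # True
```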
WOOHOO! We’ve now compiled Python code into bytecode, disassembled it, and explored the disassembly. The dis module documentation has information on more bytecode instructions if you want to dig deeper.
Suppose you somehow stumbled upon Python bytecode in the wild, maybe from malware, maybe from proprietary code, and you want to understand what’s going on but you are not in the mood to read disassembly. What can you do? Well, you can use a decompiler. In this case we’ll use uncompyle6; go ahead and install it.
First, let’s start by creating a file hello.py which contains the hello function:
def hello(name):
    print('Hello,', name)
Now we want to compile this file into Python bytecode. Let’s do this by running this in a shell:
>: python -m compileall .
You will now find a __pycache__ directory in your current working directory. Change into it and run the decompiler on the compiled file:
>: cd __pycache__
>: uncompyle6 hello.cpython-36.pyc
This should output
# ...
# ...
def hello(name):
    print('Hello,', name)
# okay decompiling hello.cpython-36.pyc
Voila, you now have your source code back.
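If you would rather poke at a .pyc from Python itself instead of (or before) decompiling it, the standard library can compile a file and load the code object back without the source. This is a sketch under a few assumptions: it writes hello.py into a temporary directory and picks its own hello.pyc file name, rather than using the __pycache__ layout from above.

```python
import importlib.machinery
import os
import py_compile
import tempfile
import types

source = "def hello(name):\n    print('Hello,', name)\n"

with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "hello.py")
    with open(src_path, "w") as f:
        f.write(source)

    # Compile a single file; the return value is the path of the written .pyc.
    pyc_path = py_compile.compile(
        src_path, cfile=os.path.join(tmp, "hello.pyc"), doraise=True
    )

    # Load the module-level code object from the .pyc, without the source.
    loader = importlib.machinery.SourcelessFileLoader("hello", pyc_path)
    module_code = loader.get_code("hello")

# The function body is a nested code object among the module's constants.
func_code = next(
    const for const in module_code.co_consts
    if isinstance(const, types.CodeType)
)
print(func_code.co_name, func_code.co_varnames)  # hello ('name',)
```

From here you can hand func_code to dis.dis() exactly as we did with hello.__code__ earlier.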