How does reverse engineering of software work?

hardypart@feddit.de · 1 year ago

SHITPOSTING_ACCOUNT@feddit.de · 1 year ago

Software consists of instructions for a computer to do something. These are made to be easy to follow for a computer, not for a human.

Humans write software in a human-readable form, the source code. This then (usually) gets converted to a machine-readable form, called machine code (or bytecode for some languages).

Depending on the programming language and the build settings, some or most of the original information is lost in this process. For some languages (.NET, Java), you can recover most of the structure from the bytecode, sometimes even with most of the original variable and function names, and see relatively easily what the program does.
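To make that concrete, here is a small sketch using Python as a stand-in (the post talks about .NET and Java, but Python also compiles to bytecode and ships a disassembler in its standard library). The function `greet` and its variable names are made up for illustration. The point is only that the names and string constants survive compilation:

```python
import dis

# Hypothetical function, only here so we have something to inspect.
def greet(user_name):
    message = "hello " + user_name
    return message

# The compiled code object still carries the original names.
code = greet.__code__
print(code.co_name)      # name of the function
print(code.co_varnames)  # names of arguments and local variables
print(code.co_consts)    # constants, including the "hello " string

# Human-readable listing of the bytecode instructions.
# The exact instructions differ between Python versions.
dis.dis(greet)
```

Tools for .NET and Java work on the same principle, just with a different bytecode format.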

For other languages (e.g. C/C++), even the structure is lost: you can’t always reliably tell which parts of the program belong to the same function. You can read the machine code, and it “clearly” says what it does, but making sense of that mess is slow and error-prone. You won’t fully understand every part (there’s just too much), so you will mostly be looking for parts that seem related to what you’re interested in.

For example, if you’re looking for an encryption algorithm, you may look for code that opens two files, reads from one and writes to the other, then look for a piece of code “nearby” that’s doing a lot of math. Or for malware, you may want to focus on network connections. Since the software needs to talk to the operating system to make network connections, this tends to happen in a standardized way, and you can quickly find the part of the code that talks to the network features of the OS.
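One cheap way to find "interesting" spots is to scan the binary for readable text, since names of OS functions and messages often sit in the file as plain ASCII. The sketch below is a simplified take on that idea, not how a real reverse engineering tool resolves imports. The function `extract_strings` and the bytes in `blob` are invented for the example:

```python
import re

def extract_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of printable ASCII that are at least min_len bytes long."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [match.decode("ascii") for match in re.findall(pattern, data)]

# Made-up blob: a few non-text bytes with some names and text embedded in it.
blob = b"\x55\x48\x89\xe5\x00connect\x00\x90\x90hello\x00\xc3send\x00\x01ab\x00"

print(extract_strings(blob))  # prints ['connect', 'hello', 'send']
```

Seeing something like `connect` or `send` in the output would be a hint about where to start reading. Short runs like `ab` are dropped because random bytes often look like text by accident.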

You can also run the program step by step and observe what it does (possibly messing with it while doing so to see how that changes the behavior).
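As a rough illustration of stepping through a program, here is a sketch that uses Python's own tracing hook instead of a native debugger (for compiled programs you would use a real debugger, but the idea is the same). The function `check_password` and the secret inside it are hypothetical. The tracer records the local variables every time a new line of that function is about to run:

```python
import sys

# Hypothetical "program under analysis".
def check_password(guess):
    secret = "hunter2"
    if guess == secret:
        return "access granted"
    return "access denied"

observed = []  # (line offset inside the function, snapshot of local variables)

def tracer(frame, event, arg):
    # Only record line-by-line steps inside the function we care about.
    if frame.f_code is check_password.__code__ and event == "line":
        offset = frame.f_lineno - frame.f_code.co_firstlineno
        observed.append((offset, dict(frame.f_locals)))
    return tracer  # keep tracing inside this frame

sys.settrace(tracer)
result = check_password("letmein")
sys.settrace(None)

for step in observed:
    print(step)
print(result)
```

Just by watching the variables while it runs, you get to see the secret the function compares against, without reading its code at all.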

For an example of how machine code looks, what in source code would be ShowDialog('hello') could become

put 0x1005f225 (your reverse engineering tool will helpfully add a note that this is the address where the text “hello” is stored) into register R1
increment stack pointer by 4
push R1 onto the stack
call 0x1000443C (you now look at the code there)
put the value of R1 into R4
if R1 is zero jump to 0x10004458
push R4 onto the stack
call the OS function to show a dialog (if you’re lucky your reverse engineering tool has identified this for you.)

(Made up inaccurate example just to illustrate the idea. It’s horrible to read.)

WalrusByte@lemmy.world · 1 year ago

To understand this, you need to know how code is compiled into machine code. Basically, computers only understand ones and zeros, but that’s really hard for humans to work with. So we created something called assembly, which lets us write more human-understandable mnemonics like “add” and “sub” that map to specific machine code instructions (AKA ones and zeros). But it turns out that writing everything in assembly was also pretty tedious, so people created languages like C, where another program called a compiler takes in C code, which is easier for humans to understand, and converts it to the equivalent assembly automatically.
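Here is a toy version of the assembly-to-machine-code step, just to show the mapping. The instruction set is completely made up (one byte per opcode, one byte per operand) and does not match any real CPU. The names `OPCODES` and `assemble` are my own:

```python
# Hypothetical opcodes for a made-up machine, not a real CPU.
OPCODES = {"load": 0x01, "add": 0x02, "sub": 0x03, "halt": 0xFF}

def assemble(source: str) -> bytes:
    """Turn lines like 'add 3' into bytes: one opcode byte, then operand bytes."""
    out = bytearray()
    for line in source.strip().splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        mnemonic, operands = parts[0], parts[1:]
        out.append(OPCODES[mnemonic])
        out.extend(int(op, 0) & 0xFF for op in operands)
    return bytes(out)

program = """
load 5
add 3
sub 1
halt
"""

machine_code = assemble(program)
print(machine_code.hex(" "))                         # prints 01 05 02 03 03 01 ff
print(" ".join(f"{b:08b}" for b in machine_code))    # the same bytes as ones and zeros
```

A real assembler deals with registers, labels, and variable-length instructions, but at its core it is a lookup like this.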

So most software you run on a computer is a binary, meaning it’s a bunch of machine code that was previously compiled from some other language like C. You can disassemble these binaries back into assembly, which you can then manually read and convert back to a more human-readable language. There are also other tools out there, such as decompilers, that make this process easier, but that’s the basic idea: take ones and zeros, convert them back into assembly, then try to figure out how it works from there.