Notes
Introduction
Reverse Engineering in Software
Code languages
- High level (similar to English, readable) vs low level language (computer understands, human generally doesn't)
- Ex: C, C++, Java, Python (relatively high) --> compile --> binary (low level)
- Note: not all use compilation, some use interpreters
Example of C code:
int main() {
    int variable = 1;
    return 0;
}
Program binaries are just a bunch of gibberish and not readable. Ex:
Assembly
- A low level programming language
- It provides a way for humans to read machine code
- Direct hardware control - asm is almost 1 to 1 with machine code instructions
- Gives direct control over the processor, memory, and hardware components
- Architecture specific - x86, arm, mips, etc
- Each has its own instruction set, registers, etc
- Source code: malware authors don't provide source code, so analysts must work with binaries.
- Misconceptions
- Machine code is not faster or lower-level than asm
- 2 representations of the same thing
- Machine code
- Bits, cpu instruction list, appears as hex
- ASM
- Text representation of bits, human-readable instruction names, ex: MOV, XCHG
- ASM --> machine code --> ASM
- OPcodes
- each asm command = unique number
- object code = sequence of opcodes + operands
- CPU cycle = fetch --> decode --> execute
- translated with assembler (asm to binary) and disassembler (bin to asm)
Compilers
- high level languages to machine code
- Translation process:
- Input source code files (c, c++)
- Use compiler (gcc, clang)
- Output machine code file executable by cpu
- Modern compiler optimizations
- Minimize code size, improve execution performance, higher efficiency
- Impact on reverse engineering:
- Straightforward instructions --> mathematically equivalent but obscure instructions
- Optimized code becomes difficult to read
- Original program logic gets buried under optimizations
Disassemblers
- Inputs binary file
- Output asm code
- Consider architecture
- entire program or specific parts
- types of disassemblers
- standalone disassemblers - dedicated tools with features
- built-in disassemblers - embedded in debuggers
Decompilers
- Binary back to high level code (ex: binary --> c)
- ex:
- Input binary
- output example.c
- Limitations
- original source code recovery usually not possible
- Decompiler vs Disassembler
- Disassembler --> asm
- Decompiler --> high level code (c, cpp, etc)
Registers
- CPU has a few major parts
- ALU
- CU (control unit)
- many registers
- registers are much faster than ram
- Advantages
- fastest data access method
- central to asm operations
- Limitations
- very few available
- short-term only
- manual management
- Programmers must manually
- load data from ram --> registers
- store data from registers --> ram
- manage register space
Assembly
- Only about 14 asm instructions account for 90% of code
Data types
- Binary, decimal, hex, etc
- In C, int, short, long, float, etc
- Negative numbers
- stored in two's complement: to negate a value, flip every bit and add 1
- the most-significant bit acts as the sign bit (learned this from Stack Overflow)
I made a gif explaining it found here.
Architectures
- Intel uses CISC - Complex Instruction Set Computer
- Many special-purpose instructions (likely won't ever see most of them)
- Variable-length instructions, 1-15 bytes long
- Other major architectures
- RISC - Reduced Instruction Set Computer
- typically more registers, and fewer, fixed-size instructions
Endian
- Little Endian
- little end first: the least-significant byte is stored at the lowest address
- intel is little endian
- Big Endian
- big end first: the most-significant byte is stored at the lowest address
- network traffic is big endian
Registers
- 8 general purpose registers + instruction pointer (points at next instruction to execute)
- x86-32 registers are 32 bits long
- x86-64 registers are 64 bits long
Register Conventions (intel)
- registers
- EAX - stores function return values
- EBX - base pointer to data section
- ECX - counter for string and loop operations
- EDX - i/o pointer
- ESI - Source pointer for string operations
- EDI - Destination pointer for string operations
- ESP - Stack pointer
- EBP - Stack frame base pointer
- EIP - Pointer to next instruction to execute (“instruction pointer”)
- caller-save registers (eax, edx, ecx)
- if something in registers needs to be stored, the caller is in charge of saving the value before calling a subroutine and restoring the values after the call returns
- caller-save registers are likely to be modified
- callee-save registers (ebp, ebx, esi, edi)
- if the callee wants to use any of these registers, it must save the old values first and restore them before returning
- EFLAGS
- register holds many single bit flags
- zero flag (zf) - set if the result of some instruction is zero
- sign flag (sf) - set equal to most-significant bit of the result, which is the sign bit of a signed integer (0 = positive, 1 = negative)
- instructions
- NOP - do nothing
- used to pad/align bytes or delay time
- can be used to make exploits more reliable
The Stack
- the stack is a conceptual area of main memory (ram) which is designated by the os for programs
- last-in-first-out (LIFO/FILO) data structure
- by convention the stack grows toward lower memory addresses
- when adding to the stack, the "top" of the stack moves to a lower memory address
- ESP points to the top of the stack (the lowest address in use)
- the stack keeps track of which functions were called before the current one. It holds local variables and is used to pass arguments to the next function
note:
- push --> stack grows (ESP decreases)
- pop --> stack shrinks (ESP increases)
Calling conventions
- how code calls a subroutine is compiler-dependent and configurable
- example using cdecl and stdcall
- cdecl (c declaration) - most common calling convention
- function parameters pushed onto stack right to left
- callee saves the old stack frame pointer and sets up a new stack frame
- EAX or EDX:EAX returns the result for primitive data types
- caller must clean up the stack
- stdcall
- same as cdecl, except the callee must clean up the stack - not the caller