1. Emulation
2. Pre-scan code manipulation
3. JIT code manipulation
4. A combination of the above
Emulation is the obvious method, however it's slow. In reality, the ARM CPU is not particularly powerful, due to its low clock speed and the RISC nature of its instructions. As such, emulating each instruction is extremely CPU intensive. At best a Pi would achieve ARM2 speeds, possibly ARM3 with careful coding.
Pre-scan code manipulation is really only useful for 32-bit code analysis, as it doesn't support self-modifying code, which would break most game protection. Detecting certain code, such as jump tables, is also problematic, often requiring some understanding of what the code is trying to achieve.
JIT holds the most promise as once the code has been converted, it can run at native speed. It can also be extended to support self-modifying code and requires no prior knowledge of the code itself.
The ideal would be a combination of emulation and targeted JIT, using instruction emulation initially and JIT for actual game code that's frequently used. The main advantage here is in handling self-modifying code, which is mainly used for game protection in loaders etc. Ideally this would be handled via emulation, with the main code being handled by JIT for full CPU speed.
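As a rough illustration of the combined approach, here is a minimal Python sketch of a dispatcher that emulates a block until it has been run often enough, then switches to a translated version. Everything in it is an assumption for illustration only: the `HybridDispatcher` name, the `interpret` and `translate` callbacks, and the threshold of 16 are all hypothetical, and a real implementation would need to decide what a "block" is.

```python
from collections import defaultdict

HOT_THRESHOLD = 16  # hypothetical: runs before a block is considered "frequently used"


class HybridDispatcher:
    """Emulate cold blocks, switch to a translated version once they get hot."""

    def __init__(self, interpret, translate, threshold=HOT_THRESHOLD):
        self.interpret = interpret    # callback: emulate the block at addr
        self.translate = translate    # callback: return a callable native version
        self.threshold = threshold
        self.counts = defaultdict(int)
        self.translated = {}

    def run_block(self, addr):
        native = self.translated.get(addr)
        if native is not None:
            return native()           # already converted: run at full speed
        self.counts[addr] += 1
        if self.counts[addr] >= self.threshold:
            # Hot enough: convert now so the next run uses the native version.
            self.translated[addr] = self.translate(addr)
        return self.interpret(addr)

    def invalidate(self, addr):
        """Drop a translation, e.g. after the block has been modified."""
        self.translated.pop(addr, None)
        self.counts.pop(addr, None)
```

With this shape, a loader that modifies itself never gets hot enough to be translated, or is simply dropped back to emulation through `invalidate`, while the main game loop ends up running translated code.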
For simplicity I'm going to focus on JIT and how it could be implemented, but first here are the main issues when running 26bit code on the ARM11:
1. SWP instruction is deprecated
2. PSR manipulation (LDM Rx, {PC}^ ... MOVS PC, R14 ... ORRS PC, R14, #&80000000 ... TEQP R0, #3 etc)
3. NV is deprecated
4. Self-modifying code is deprecated due to the split cache
5. Supervisor mode
6. Stored PC (STMFD R13!, {PC} ... STR PC, [R13, #-4]!)
7. PC as Rn (MOV R0, PC ... ORR R0, PC, #3)
8. PC as Rd (MOV PC, R14)
SWP instruction
It's unlikely this is used in games, however we still need to support it as it's a valid 26bit instruction. On the ARM11, LDREX/STREX is the preferred method; alternatively, as we're only single core, we could use LDR/STR with interrupts disabled. SWP is deprecated rather than removed on ARMv6, so it's still available and could be used on the Pi.
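For reference, here is a small Python model of what any replacement for SWP Rd, Rm, [Rn] has to preserve. The `swp` function and the dict-based registers and memory are hypothetical stand-ins, not part of any real emulator. The point it shows is that the value to store must be taken before Rd is overwritten, so a naive LDR Rd / STR Rm pair breaks when Rd and Rm are the same register.

```python
def swp(regs, mem, rd, rm, rn):
    """Model of SWP Rd, Rm, [Rn]: load the old word, store Rm, return old in Rd.

    regs maps register numbers to values, mem maps addresses to words.
    On real hardware the load and store must not be interrupted.
    """
    addr = regs[rn]
    old = mem[addr]          # LDR part
    mem[addr] = regs[rm]     # STR part, using Rm's value from before Rd changes
    regs[rd] = old           # only now update Rd
    return regs, mem
```

For example, SWP R0, R0, [R1] with R0 = 5 and the word at [R1] = 9 should leave R0 = 9 and the memory word = 5.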
PSR manipulation
This includes both flag manipulation (setting V for example) and changes made to IRQ / FIQ or CPU mode.
There are quite a few methods to manipulate the PSR. Where the PC is manipulated it's fairly obvious: for example, MOVS PC, R14 is clearly attempting to set the PSR flags in a 26bit PC. There are not so obvious methods though, take for example the following code fragment:
Code:
BL manipulate_flags
MOV PC, R14
.manipulate_flags
TST R14, #&80000000
MOVEQ PC, R14
...
Some PSR manipulations do have direct equivalents on ARMv6. In each example below, the 26bit instruction is shown first, followed by one or more ARMv6 replacements:
Code:
ORRS PC, R14, #3 << 26
Code:
CPSID if
MOV PC, R14
Code:
TEQP PC, #3
Code:
CPSIE if, #%10011
NOP
MSR CPSR_f, #0
Code:
MSR CPSR_all, #%10011
Code:
TEQP R0, #&C0000000
Code:
STMFD R13!, {R0, R1}
EOR R0, R0, #&C0000000
AND R1, R0, #(%1111 << 28) + %11
AND R0, R0, #%11 << 26
ORR R1, R1, R0, LSR #20
MRS R0, CPSR
BIC R0, R0, #(%1111 << 28) + %11
BIC R0, R0, #%11 << 6
ORR R0, R0, R1
MSR CPSR_all, R0
MOV R0, R0
LDMFD R13!, {R0-R1}
Code:
BICS PC, R14, #1 << 27
Code:
CPSIE i
MOV PC, R14
Code:
ORRS PC, R14, #1 << 28
Code:
MSR CPSR_f, #1<<28
MOV PC, R14
Code:
TSTP R0, #&C0000000
Code:
STMFD R13!, {R0, R1}
AND R0, R0, #&C0000000
MRS R1, CPSR
BIC R1, R1, #&C0000000
ORR R1, R1, R0
MSR CPSR_f, R1
LDMFD R13!, {R0-R1}
TEQP:
Code:
EOR Rx, Rx, #<immediate>
AND Rx, Rx, #(%111111 << 26) + %11
BIC PC, PC, #(%111111 << 26) + %11
ORR PC, PC, Rx
Code:
AND Rx, Rx, #<immediate>
AND Rx, Rx, #(%111111 << 26) + %11
BIC PC, PC, Rx
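The bit shuffling in the TEQP R0, #&C0000000 expansion above can be checked with a small Python model. This is only a sketch of the mapping, assuming privileged mode (in user mode TEQP can only change the flags); the function and constant names are my own. NZCV stay at bits 31-28, the mode bits stay at bits 1-0, and I/F move from bits 27-26 of the 26bit PSR to bits 7-6 of the CPSR.

```python
FLAGS_MASK = 0xF << 28     # N, Z, C, V: bits 31-28 in both layouts
IF26_MASK = 0x3 << 26      # I, F: bits 27-26 of a 26bit R15
IF32_MASK = 0x3 << 6       # I, F: bits 7-6 of the CPSR
MODE_MASK = 0x3            # M1, M0: bits 1-0 in both layouts


def psr26_to_cpsr(cpsr, psr26):
    """Merge the PSR fields of a 26bit R15-style word into a 32bit CPSR value."""
    new = cpsr & ~(FLAGS_MASK | IF32_MASK | MODE_MASK) & 0xFFFFFFFF
    new |= psr26 & (FLAGS_MASK | MODE_MASK)   # flags and mode keep their positions
    new |= (psr26 & IF26_MASK) >> 20          # bits 27-26 move down to bits 7-6
    return new


def teqp(cpsr, rn, imm):
    """TEQP Rn, #imm in a privileged mode: the EOR result becomes the new PSR."""
    return psr26_to_cpsr(cpsr, (rn ^ imm) & 0xFFFFFFFF)
```

Note the remaining CPSR bits, including M4, are carried over unchanged, which matches the BIC / ORR sequence in the assembler version.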
NV conditional instructions
Ignoring self-modifying code for the minute (some legacy C compilers used NV conditional instructions at the entry points, which are later changed to AL), NV instructions can be presumed to be a NOP and simply changed to MOV R0, R0.
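As a minimal sketch of that rewrite, assuming the converter works on raw 32-bit instruction words (the names below are hypothetical): the condition code lives in bits 31-28, NV is %1111, and MOV R0, R0 with the AL condition encodes as &E1A00000.

```python
COND_NV = 0xF                 # condition field value for NV (never)
NOP_MOV_R0_R0 = 0xE1A00000    # MOV R0, R0 with the AL condition


def rewrite_nv(instr):
    """Replace any NV-conditional instruction word with MOV R0, R0."""
    if ((instr >> 28) & 0xF) == COND_NV:
        return NOP_MOV_R0_R0
    return instr
```

This deliberately ignores the self-modifying case mentioned above, where the NV condition is later patched to AL.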
Self-modifying code
This is where the real fun starts! Although there's nothing inherently wrong with self-modifying code or on-the-fly code creation (RISC OS for example uses this for sprite handling I believe), the split cache nature of the StrongARM (SA) onward means these code changes won't be consistent in the Data cache, Instruction cache and memory at the same time.
The obvious solution is to surround self-modifying code with a cache flush, or force a cache flush whenever code memory is written to. However, how do you determine on-the-fly if a write is going to code or data?
The other issue in our case is where the instruction written is one that's not 32bit compatible and would need re-interpreting before being allowed to execute.
Supervisor mode
Care has to be taken around how any implementation enters the re-interpreter whilst in Supervisor mode, to avoid R14 corruption. If BL or SWI are used when the CPU is in SVC32 mode and the code hasn't already preserved R14, it will be corrupted on exit from the re-interpreter. One possible solution is to ensure the re-interpreter is entered in another mode, such as Abort, to ensure SVC32 R14 isn't touched at any point.
Stored PC
Storing the PC on ARM3 stores <instruction address + 12>, whereas on SA+ it stores <instruction address + 8>. Some C compilers use stacked PCs to free up R14. Take for example the following code sequence, which is quite commonly used, shown first in its original form and then rearranged so the same address is stored on SA+:
Code:
STMFD R13!, {PC}
MOVNV R0, R0
B label
ADD R1, R0, R1
Code:
MOV R0, R0
STMFD R13!, {PC}
B label
ADD R1, R0, R1
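The arithmetic behind moving the NOP can be checked with a few lines of Python. This is just a sketch using the +12 / +8 offsets quoted above; the names are hypothetical and only the address part of the stored value is modelled, not the PSR bits.

```python
STORE_PC_OFFSET = {"ARM3": 12, "StrongARM": 8}   # offsets as quoted in the text


def stored_pc(instr_addr, cpu):
    """Address written to memory when the instruction at instr_addr stores PC."""
    return instr_addr + STORE_PC_OFFSET[cpu]


base = 0x8000                                   # arbitrary example address

# Original sequence: STMFD is the first instruction, run on an ARM3.
original = stored_pc(base, "ARM3")

# Rearranged sequence: MOV R0, R0 comes first, so STMFD sits 4 bytes later.
rearranged = stored_pc(base + 4, "StrongARM")
```

Both `original` and `rearranged` come out as `base + 12`, the address of the ADD following the branch, so the callee returns to the same place in either version.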
PC as Rn
Where PC is used as the first operand, it needs to be set up as its 26bit equivalent, with the NZVC, FI and CPU mode flags in the correct place, e.g.
Code:
MOV R0, PC
Code:
STMFD R13!, {R1}
MRS R0, CPSR
AND R1, R0, #%11000000
AND R0, R0, #(%1111 << 28) + %11
ORR R0, R0, R1, LSL #(26-6)
ORR R0, R0, PC
LDMFD R13!, {R1}
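The MOV R0, PC expansion above is the reverse of the PSR mapping, and can be modelled in Python as follows. This is a sketch with names of my own choosing. It relies on the I and F bits sitting at bits 7-6 of the CPSR, and it also masks the PC to the 26bit address range, which the assembler version doesn't do explicitly.

```python
def r15_26bit(cpsr, pc):
    """Build the value a 26bit CPU would read from R15: flags, I/F, address, mode."""
    flags = cpsr & (0xF << 28)            # N, Z, C, V stay at bits 31-28
    irq_fiq = (cpsr & (0x3 << 6)) << 20   # I, F move from bits 7-6 up to bits 27-26
    mode = cpsr & 0x3                     # low two mode bits stay at bits 1-0
    return flags | irq_fiq | (pc & 0x03FFFFFC) | mode
```

For example, a CPSR of &600000D3 (Z and C set, IRQ and FIQ disabled, SVC) with the PC reading &8008 gives &6C00800B.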
PC as Rd
Where the PC is set from a register, we need to ensure no flags are taken across.
Code:
MOV PC, R14
Code:
BIC PC, R14, #&FC000003
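The effect of that mask is easy to check in Python. A minimal sketch, assuming R14 holds a 26bit-style return address with the PSR in bits 31-26 and 1-0 (the function name is hypothetical):

```python
PSR26_MASK = 0xFC000003   # NZCV + I/F in bits 31-26, mode in bits 1-0


def branch_target(r14):
    """Strip the 26bit PSR bits from a return address before loading it into PC."""
    return r14 & ~PSR26_MASK & 0xFFFFFFFF
```

So a return value of &6C00800B branches to &8008, with none of the flag or mode bits leaking into the 32bit PC.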
JIT implementation
Resolving the above issues with a JIT adds additional complications, which I'll go into in more detail at a later date. Examples include:
1. Where an instruction needs expanding into several instructions to perform the same task under 32bit, code blocks need generating elsewhere and some form of jump to them implemented.
2. Memory referencing. If code is being re-interpreted into code blocks elsewhere, there's no longer a 1:1 memory relationship, which would break both self-modifying code and branches.
3. If code is being run outside of its normal address, PC relative addressing breaks, requiring further instructions to refer to the correct address and potentially slowing the code considerably if code blocks are required.
4. Supervisor mode should ideally be avoided to ensure SVC32 R14 isn't corrupted when 26bit code switches into SVC26.
5. How to enter the JIT and know where it was entered from. One solution is to use an undefined instruction, which will cause the JIT to be entered in the Undefined (und) CPU mode, with PC+4 in R14_und and the CPSR saved in SPSR_und. The instruction chosen is:
Code:
MRC CP8, 0, R0, C0, C0
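To show how a handler might recognise that instruction, here is a Python sketch that builds the MRC encoding from its fields and works back from R14_und to the trapping address. The helper names are hypothetical, and opcode 2 is assumed to be zero as it isn't given in the instruction above.

```python
def encode_mrc(coproc, opc1, rd, crn, crm, opc2=0, cond=0xE):
    """Encode MRC: cond 1110 opc1 1 CRn Rd coproc opc2 1 CRm."""
    return ((cond << 28) | (0xE << 24) | (opc1 << 21) | (1 << 20) |
            (crn << 16) | (rd << 12) | (coproc << 8) | (opc2 << 5) |
            (1 << 4) | crm)


# MRC CP8, 0, R0, C0, C0 with the AL condition
JIT_TRAP = encode_mrc(coproc=8, opc1=0, rd=0, crn=0, crm=0)


def is_jit_trap(word):
    """True if the word that raised the undefined exception is the JIT entry marker."""
    return word == JIT_TRAP


def trap_address(r14_und):
    """R14_und holds the address after the undefined instruction, so step back 4."""
    return r14_und - 4
```

The handler would read the word at `trap_address(R14_und)`, confirm it with `is_jit_trap`, and otherwise pass the exception on to the previous undefined instruction handler.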