26bit CPU support

JonAbbott · Post by **JonAbbott** » Tue Nov 12, 2013 11:48 am

Providing 26bit instruction support on ARM11 can be done via several methods:

1. Emulation
2. Pre-scan code manipulation
3. JIT code manipulation
4. A combination of the above

Emulation is the obvious method, but it's slow. The ARM CPU is not particularly powerful, given its low clock speed and the RISC nature of the instruction set, so emulating each instruction is extremely CPU intensive. At best a Pi would achieve ARM2 speeds, possibly ARM3 with careful coding.

Pre-scan code manipulation is really only useful for 32-bit code analysis, as it doesn't support self-modifying code, which would break most game protection. Detecting certain code, such as jump tables, is also problematic, often requiring some understanding of what the code is trying to achieve.

JIT holds the most promise as once the code has been converted, it can run at native speed. It can also be extended to support self-modifying code and requires no prior knowledge of the code itself.

The ideal would be a combination of emulation and targeted JIT: emulate instructions initially, and use JIT for the game code that's run frequently. The main advantage is in handling self-modifying code, which is mainly used for game protection in loaders etc. That would be best handled via emulation, with the main code handled by JIT for full CPU speed.

For simplicity I'm going to focus on JIT and how it could be implemented, but first here are the main issues when running 26bit code on the ARM11:

1. SWP instruction is deprecated
2. PSR manipulation (LDM Rx, {PC}^ ... MOVS PC, R14 ... ORRS PC, R14, #&80000000 ... TEQP R0, #3 etc)
3. NV is deprecated
4. Self modifying code is deprecated due to the split cache
5. Supervisor mode
6. Stored PC (STMFD R13!, {PC} ... STR PC, [R13, #-4]!)
7. PC as Rn (MOV R0, PC ... ORR R0, PC, #3)
8. PC as Rd (MOV PC, R14)

SWP instruction
It's unlikely this is used in games, but we still need to support it as it's a valid 26bit instruction. On the ARM11, LDREX/STREX is the preferred method. Alternatively, as we're only single core, we could use LDR/STR with interrupts disabled. SWP is deprecated but still available on ARMv6, so it could simply be left as-is on the Pi.

PSR manipulation
This includes both flag manipulation (setting V, for example) and changes made to IRQ / FIQ or CPU mode.

There are quite a few methods to manipulate the PSR. Where the PC is manipulated it's fairly obvious: MOVS PC, R14, for example, is clearly attempting to set the PSR flags in a 26bit PC. There are less obvious methods though. Take for example the following code fragment:

Code: Select all

BL  manipulate_flags
MOV PC, R14

.manipulate_flags
TST R14, #&80000000
MOVEQ PC, R14

...

In this example, the TST instruction is expecting flags to be present in R14, which on ARM11 they won't be. We can't simply recode anything that touches R14 assuming there are flags, as R14 could validly be used for other things. The only solution in this case is to ensure R14 has flags on entry to a BL, which are then moved into the PSR on exit.

Some PSR manipulations do have direct equivalents on ARMv6, for example:

Code: Select all

ORRS PC, R14, #3 << 26

could be encoded as:

Code: Select all

CPSID if
MOV PC, R14

Code: Select all

TEQP PC, #3

could be encoded as ARMv6:

Code: Select all

CPSIE if, #%10011
NOP
MSR CPSR_f, #0

ARMv4:

Code: Select all

MSR CPSR_all, #%10011

Code: Select all

TEQP R0, #&C0000000

could be encoded as:

Code: Select all

STMFD R13!, {R0, R1}
EOR R0, R0, #&C0000000
AND R1, R0, #(%1111 << 28) + %11
AND R0, R0, #%11 << 26
ORR R1, R1, R0, LSR #20
MRS R0, CPSR
BIC R0, R0, #(%1111 << 28) + %11
BIC R0, R0, #%11 << 6
ORR R0, R0, R1
MSR CPSR_all, R0
MOV R0, R0
LDMFD R13!, {R0-R1}

Code: Select all

BICS PC, R14, #1 << 27

could be encoded as:

Code: Select all

CPSIE i
MOV PC, R14

Code: Select all

ORRS PC, R14, #1 << 28

could be encoded as:

Code: Select all

MSR CPSR_f, #1<<28
MOV PC, R14

Code: Select all

TSTP R0, #&C0000000

could be encoded as:

Code: Select all

STMFD R13!, {R0, R1}
AND R0, R0, #&C0000000
MRS R1, CPSR
BIC R1, R1, #&C0000000
ORR R1, R1, R0
MSR CPSR_f, R1
LDMFD R13!, {R0-R1}

NOTE: TEQP, TSTP, CMPP and CMNP write bits 0-1 and 26-31 (f and c flags) of the ALU result directly into the PSR, effectively performing:

TEQP:

Code: Select all

EOR Rx, Rx, #<immediate>
AND Rx, Rx, #(%111111 << 26) + %11
BIC PC, PC, #(%111111 << 26) + %11
ORR PC, PC, Rx

TSTP:

Code: Select all

AND Rx, Rx, #<immediate>
AND Rx, Rx, #(%111111 << 26) + %11
BIC PC, PC, #(%111111 << 26) + %11
ORR PC, PC, Rx
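To make the NOTE above concrete, here is a small Python model (not assembler, and not anyone's actual translator code) of what TEQP and TSTP write to the PSR and where those bits end up in a 32bit CPSR. The function names are made up for illustration, and the model ignores the rule that User mode code can only change the flags.

```python
# Illustrative model only: names below are hypothetical, not from any real translator.

PSR26_MASK = 0xFC000003  # N Z C V I F in bits 31-26, mode in bits 1-0


def p_result(op, rn, operand2):
    """Return the 26bit PSR bits that TEQP or TSTP would write."""
    if op == "TEQP":
        alu = rn ^ operand2
    elif op == "TSTP":
        alu = rn & operand2
    else:
        raise ValueError("only TEQP and TSTP are modelled here")
    return alu & PSR26_MASK


def psr26_to_cpsr(psr26, cpsr):
    """Merge 26bit PSR bits into a 32bit CPSR value."""
    flags = psr26 & 0xF0000000       # NZCV stay in bits 31-28
    irq_fiq = (psr26 >> 20) & 0xC0   # I and F move from bits 27,26 to bits 7,6
    mode = psr26 & 0x3               # low two mode bits; bit 4 of the CPSR is kept
    keep = cpsr & ~(0xF0000000 | 0xC0 | 0x3) & 0xFFFFFFFF
    return keep | flags | irq_fiq | mode


# TEQP PC, #3: PC used as Rn reads with the PSR bits clear, so rn is modelled as 0.
new_cpsr = psr26_to_cpsr(p_result("TEQP", 0, 3), 0x600000D0)
```

Starting from USR32 with interrupts disabled (&600000D0), the TEQP PC, #3 case ends up as SVC32 with interrupts enabled and flags clear, which matches the ARMv6 and ARMv4 encodings given earlier.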

NV conditional instructions
Ignoring self-modifying code for the moment (some legacy C compilers used NV conditional instructions at the entry points, which are later changed to AL), NV instructions can be presumed to be a NOP and simply changed to MOV R0, R0.
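As a rough sketch of that rewrite, modelled in Python rather than assembler: check the condition field in bits 28-31 and swap the word for MOV R0, R0. The function name is hypothetical, and returning the original word alongside is my assumption, on the basis that code which later patches NV to AL will need it.

```python
NOP = 0xE1A00000  # MOV R0, R0 with the AL condition


def replace_nv(word):
    """Return (word_to_run, original_word_if_replaced_else_None)."""
    if (word >> 28) & 0xF == 0xF:   # condition field is NV
        return NOP, word
    return word, None


movnv = replace_nv(0xF1A00000)   # MOVNV R0, R0
normal = replace_nv(0xE1A0F00E)  # MOV PC, R14 is left alone by this pass
```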

Self-modifying code
This is where the real fun starts! Although there's nothing inherently wrong with self-modifying code or on-the-fly code creation (RISC OS, for example, uses this for sprite handling I believe), the split cache of the StrongARM onward means these code changes won't be consistent across the Data cache, Instruction cache and memory at the same time.

The obvious solution is to surround self-modifying code with a cache flush, or force a cache flush whenever code memory is written to. However, how do you determine on-the-fly whether a write is going to code or data?

The other issue in our case is where the instruction written is one that's not 32bit compatible and would need re-interpreting before being allowed to execute.

Supervisor mode
Care has to be taken over how any implementation enters the re-interpreter whilst in Supervisor mode, to avoid R14 corruption. If BL or SWI are used when the CPU is in SVC32 mode and the code hasn't already preserved R14, it will be corrupt on exit from the re-interpreter. One possible solution is to ensure the re-interpreter is entered in another mode, such as Abort, so that SVC32 R14 isn't touched at any point.

Stored PC
Storing the PC on ARM3 stores <instruction address + 12>; on SA+ it stores <instruction address + 8>. Some C compilers use stacked PCs to free up R14. Take for example the following code sequence, which is quite commonly used:

Code: Select all

STMFD R13!, {PC}
MOVNV R0, R0
B label
ADD R1, R0, R1

On ARM3 (ARMv3), PC will be stacked at the "ADD R1, R0, R1" instruction. On SA+ (ARMv4+) it will be stacked at "B label" and cause a circular loop. It should be recoded as:

Code: Select all

MOV R0, R0
STMFD R13!, {PC}
B label
ADD R1, R0, R1
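The arithmetic behind that can be checked with a few lines of Python. This is only a model: the base address &8000 is an arbitrary assumption, the function name is made up, and the PSR bits that a 26bit core would also store are ignored.

```python
# Offsets taken from the text above: ARM3 stores address+12, StrongARM onward stores address+8.
STORE_OFFSET = {"ARM3": 12, "SA+": 8}


def stored_pc(instr_addr, core):
    """Address written to memory when the instruction at instr_addr stores PC."""
    return instr_addr + STORE_OFFSET[core]


BASE = 0x8000  # arbitrary load address, for illustration only
original = {
    BASE: "STMFD R13!, {PC}",
    BASE + 4: "MOVNV R0, R0",
    BASE + 8: "B label",
    BASE + 12: "ADD R1, R0, R1",
}
recoded = {
    BASE: "MOV R0, R0",
    BASE + 4: "STMFD R13!, {PC}",
    BASE + 8: "B label",
    BASE + 12: "ADD R1, R0, R1",
}

on_arm3 = original[stored_pc(BASE, "ARM3")]     # where the original returns to on ARM3
on_sa = original[stored_pc(BASE, "SA+")]        # where the original returns to on SA+
fixed_sa = recoded[stored_pc(BASE + 4, "SA+")]  # recoded sequence on SA+
```

With the STMFD moved down one word, the SA+ case lands on the ADD again, which is what the original sequence did on ARM3.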

PC as Rn
Where PC is used as the first operand, it needs to be set up as its 26bit equivalent, with NZCV, IF and CPU mode flags in the correct place, e.g.

Code: Select all

MOV R0, PC

should be encoded as:

Code: Select all

STMFD R13!, {R1}
MRS R0, CPSR
AND R1, R0, #%11000000
AND R0, R0, #(%1111 << 28) + %11
ORR R0, R0, R1, LSL #(26-6)
ORR R0, R0, PC
LDMFD R13!, {R1}
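The same packing, modelled in Python so the bit positions are easy to check. The function name is hypothetical, and masking the address to 26 bits is my assumption (the assembler above simply ORRs PC in). I and F sit in bits 7 and 6 of the CPSR and in bits 27 and 26 of a 26bit R15.

```python
def pc26_from_cpsr(cpsr, pc):
    """Build the value 26bit code expects to read from R15."""
    flags = cpsr & 0xF0000000        # NZCV are already in bits 31-28
    irq_fiq = (cpsr & 0xC0) << 20    # I,F: bits 7,6 -> bits 27,26
    mode = cpsr & 0x3                # low two mode bits
    return flags | irq_fiq | mode | (pc & 0x03FFFFFC)


# SVC32 with Z and C set and both interrupts disabled, PC reading as &8008
r15 = pc26_from_cpsr(0x600000D3, 0x8008)
```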

PC as Rd
Where the PC is set from a register, we need to ensure no flags are taken across.

Code: Select all

MOV PC, R14

should be encoded as:

Code: Select all

BIC PC, R14, #&FC000003
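Going the other way, here is a Python model of what that BIC does: it separates the 26bit word address from the PSR bits packed around it. The function name is made up for the example.

```python
PSR26_MASK = 0xFC000003  # same constant as the BIC immediate above


def split_r15(value):
    """Split a 26bit-style R14/R15 value into (address, psr_bits)."""
    address = value & ~PSR26_MASK & 0xFFFFFFFF
    psr_bits = value & PSR26_MASK
    return address, psr_bits


address, psr_bits = split_r15(0x6C00800B)
```

Only the address half should ever reach the 32bit PC. The PSR half is what a flag-restoring return such as MOVS PC, R14 would have to feed into the CPSR instead.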

JIT implementation
Resolving the above issues with a JIT adds additional complications, which I'll cover in more detail at a later date. Examples include:

1. Where an instruction needs expanding into several instructions to perform the same task under 32bit, code blocks need generating elsewhere and some form of jump to them implemented.
2. Memory referencing: if code is being re-interpreted into code blocks elsewhere, there's no longer a 1:1 memory relationship, which would break both self-modifying code and branches.
3. If code is being run outside of its normal address, PC relative addressing breaks, requiring further instructions to refer to the correct address. This could slow the code considerably if code blocks are required.
4. Supervisor mode should ideally be avoided to ensure SVC32 R14 isn't corrupt when 26bit code switches into SVC26.
5. How to enter the JIT and know where it was entered from. One solution is to use an undefined instruction, which will cause the JIT to be entered in the und CPU mode, with PC+4 in R14_und and the CPSR saved in SPSR_und. The instruction chosen is:

Code: Select all

MRC CP8, 0, R0, C0, C0
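For items 1 and 2 in the list above, the "some form of jump" could be a plain B out to the generated code block and another B back to the following instruction. A Python sketch of how such a branch word is built and decoded follows. The addresses are invented and the helper names are hypothetical. The offset is relative to the branch address + 8, which is why the 1:1 memory relationship matters.

```python
def encode_b(site, target, cond=0xE):
    """Encode 'B target' for a branch instruction placed at address 'site'."""
    offset = target - (site + 8)          # PC reads as site + 8
    if offset % 4:
        raise ValueError("target must be word aligned")
    words = offset >> 2
    if not -(1 << 23) <= words < (1 << 23):
        raise ValueError("target out of range for a 24bit branch offset")
    return (cond << 28) | (0b101 << 25) | (words & 0xFFFFFF)


def decode_b(word, site):
    """Return the address a B instruction at 'site' jumps to."""
    words = word & 0xFFFFFF
    if words & 0x800000:                  # sign-extend the 24bit field
        words -= 1 << 24
    return site + 8 + (words << 2)


SITE, BLOCK = 0x8000, 0x20000                # invented addresses
jump_out = encode_b(SITE, BLOCK)             # replaces the original instruction
jump_back = encode_b(BLOCK + 16, SITE + 4)   # last word of a 5-word code block
```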

DavidS · Post by **DavidS** » Tue Nov 12, 2013 4:53 pm

I am rewriting my JIT for your use and adding support for VIDC1/1a/2/20 as I go. Also, your list of instructions is a bit incomplete; do not forget:
CMPP, TEQP, TSTP, CMNP.

I am currently working on getting it working by replacing most ops with compatible ops, and using a small area of code at another location for those instructions that cannot be handled in a single instruction, reached with a BL (your suggestion, thank you).

Though I am just mapping areas that correspond to HW registers as not present and doing the manipulations in the abort handler, with the exception of the frame buffers, as these can be handled much more easily.

As to the issue of self-modifying code, I am still looking into that, as there are often areas of code mixed with data, so this one is a bit difficult without incurring a speed penalty.

tlsa1 · Post by **tlsa1** » Tue Nov 12, 2013 6:20 pm

JonAbbott wrote:Providing 26bit instruction support on ARM11 can be done via several methods

Are you particularly targeting the ARM11, or aiming for compatibility with the xscale, cortex-a8, and cortex-a9 too?

DavidS · Post by **DavidS** » Tue Nov 12, 2013 6:40 pm

The target is to include Cortex CPUs, and it should work well on the XScale and even SA110, ARM710, ARM610 etc in 32bit mode as well. Or at least this is my understanding and what I am aiming for. We will see what Jon has to say though.

Though it should be noted that these CPUs are similar enough that there will not need to be any difference in the implementation as long as care is taken with any loads that are not word aligned.

JonAbbott · Post by **JonAbbott** » Tue Nov 12, 2013 7:12 pm

The only reason I list the ARM11 specifically is due to the additional deprecated features that need covering. There's no need to use anything that's ARM11 specific, so I see no issues with XScale or StrongARM support.

My post above isn't complete. I got dragged away, so didn't write up all my notes, hence the missing instructions.

David - we'll need to use B instead of BL to avoid R14 corruption, and B explicitly to the next instruction on return. Self-modifying code: I'll write up my notes when I get a chance. In short, my idea was to leave all RAM as read/write, use a table to track which words are instructions (as noted by the fact we scanned them), switch each page to read only as we touch it with the translator, and on an abort check if the word being written to is flagged as data or an instruction. Data writes can be ignored; instruction writes need to be written to the original memory, with the corresponding new memory replaced by a jump back into the interpreter. Hopefully that makes sense.
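A toy Python model of that bookkeeping, to show how the table and the read-only pages interact. Everything here is an assumption made for illustration: the class and method names, the 4K page size, and my reading of "data writes can be ignored" as "the write completes with no further action". A real version would live in the data abort handler.

```python
PAGE_SIZE = 4096  # assumed page size


class CodeTracker:
    """Tracks which words have been translated and what a write to them implies."""

    def __init__(self):
        self.original = {}             # original memory: address -> word
        self.instruction_words = set() # words the translator has scanned
        self.read_only_pages = set()   # pages switched to read only
        self.retranslate = set()       # words whose translation is now stale

    def translate(self, addr):
        """Record that the translator has scanned the word at addr."""
        self.instruction_words.add(addr)
        self.read_only_pages.add(addr // PAGE_SIZE)

    def write(self, addr, value):
        """Model a store; returns how the write was classified."""
        self.original[addr] = value    # the write always lands in original memory
        if addr // PAGE_SIZE not in self.read_only_pages:
            return "no abort"          # page never touched by the translator
        if addr in self.instruction_words:
            self.retranslate.add(addr) # new memory needs a jump back to the interpreter
            return "instruction"
        return "data"                  # data sharing a page with code


tracker = CodeTracker()
tracker.translate(0x8000)
```

The interesting case is the "data" one: a write to a word that shares a page with translated code still takes the abort, so heavily mixed code and data would pay a speed penalty even though nothing needs retranslating.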

DavidS · Post by **DavidS** » Tue Nov 12, 2013 7:22 pm

What follows is a quick overview of how I am looking at the implementation of a translator for running 26bit code on 32bit only ARM CPUs. This is only a quick footnote version, hastily compiled from my source code, so any inaccuracies are the result of reading my own source too quickly. That said, I think this will show my solutions to most of the problems.

The issue of self-modifying code is one that is still a hassle for me, as I do not have any older software that uses self-modifying code.

**** **** **** ****
First things first: handling the instructions that will not behave as expected on the 32bit ARM.

**** EFFECTIVE NOP (NV) ****
Any op with bits 28-31 set to %1111.

Replace the op with a MOV R0,R0, and store the original in case needed for self modifying code later.

**** WRITE R15 WITH STATUS BITS ****
Any instruction containing %00 in bits 26-27, having bit 20 set and having the value %1111 in bits 12-15.

In this case we need to substitute a group of instructions to perform the operation and update the PSR. So replace the instruction with a B to a specially created small code block.

**** P POSTFIX ****
Having bits 21-24 contain %1011, %1001, %1010, or %1000, with %00 in bits 26-27 and bits 12-15 set to %1111.

Create an effective replacement using MSR. As this is most often used as a quick way back into user mode it should be a one for one replacement in 99% of cases.

**** LDM ****
Any instruction having %100 in bits 25 through 27, with bit 15 set, having bit 22 set, and bit 20 Set.

Going to have to create a custom code block and go there by a B op. We will need to update the PSR and PC as appropriate for the op.

**** SWP ****
Any case where bits 23-27 are %00010, bits 20-21 are %00, and bits 4-11 are %00001001.

We create a custom code block, replacing the op with a B op. This custom code block disables interrupts and does an LDR/STR combo, preserving the temporary register.

**** BL ****
We do not have flags in R14 so we have to implement them.

We will have to replace every original BL with a code block that merges the status bits into the LR and performs as normal. This will again require a B to a custom code block.
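Pulling the bit tests above into one place, a classifier might look something like this Python sketch. It is not the 26Bitter source. The category names are invented, it uses the standard ARM field positions (opcode in bits 21-24, S in bit 20, Rd in bits 12-15), and it makes no attempt to exclude multiplies or other encodings that overlap the data processing space.

```python
def classify(word):
    """Return a label for instructions that need special handling."""
    if (word >> 28) == 0xF:
        return "NV"
    if (word & 0x0FB00FF0) == 0x01000090:
        return "SWP"                       # bits 23-27 %00010, 20-21 %00, 4-11 %00001001
    if (word >> 25) & 0x7 == 0b100:
        s_bit, load, pc = word & (1 << 22), word & (1 << 20), word & (1 << 15)
        if s_bit and load and pc:
            return "LDM_PC_PSR"            # LDM ..., {..., PC}^
        return "OTHER"
    if (word >> 24) & 0xF == 0b1011:
        return "BL"
    if (word >> 26) & 0x3 == 0 and word & (1 << 20) and (word >> 12) & 0xF == 0xF:
        opcode = (word >> 21) & 0xF
        if opcode in (0b1000, 0b1001, 0b1010, 0b1011):
            return "P_SUFFIX"              # TSTP, TEQP, CMPP, CMNP
        return "WRITE_R15_S"               # e.g. MOVS PC, R14
    return "OTHER"


samples = {
    0xF1A00000: "MOVNV R0, R0",
    0xE1020091: "SWP R0, R1, [R2]",
    0xE8FD8000: "LDMFD R13!, {PC}^",
    0xEB000000: "BL",
    0xE33FF003: "TEQP PC, #3",
    0xE1B0F00E: "MOVS PC, R14",
    0xE1A00000: "MOV R0, R0",
}
labels = {text: classify(word) for word, text in samples.items()}
```

Anything returned as "OTHER" would be copied across unchanged. Everything else gets either a one-for-one replacement or a B to a code block, as described above.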

**** **** **** ****
And that should cover the issue of incompatible code. Next up is the issue of self-modifying code.

If a write occurs to the instruction area we will need to catch it, so we will have to have that area set as read only. This could be interesting though, as there is also data that is likely mixed in with the code, and we will have to make sure to transfer the data over during the translation, or make the page non-readable (I do not know a way to do that on the ARM and still keep it executable) until all of the contents of the block have been translated. Perhaps I need another look through the ARM ARM.

**** **** **** ****

Managing the simulated HW is simple though: just keep an eye out for dirty pages. This could require recoding for different MMUs, but it should work unmodified with most current 32bit only ARMs.

DavidS · Post by **DavidS** » Tue Nov 12, 2013 7:25 pm

JonAbbott wrote:David - we'll need to use B instead of BL to avoid R14 corruption and B explicitly to the next instruction on return.

Yes we shall. I made that post a bit too quickly, my apologies.

JonAbbott wrote:Self-modifying code, I'll write up my notes when I get a chance, in short my idea was to leave all RAM as read/write, use a table to track which words were instructions (as noted by the fact we scanned them), switch each page to read only as we touch them with the translator and on an abort check if the word being written to is flagged as data or an instruction. Data writes can be ignored, instructions need to be written to the original memory and the new memory a jump back into the interpreter. Hopefully that makes sense.

I like that solution. I think that I will run with it (for now anyway).

DavidS · Post by **DavidS** » Sat Nov 16, 2013 5:55 pm

I have almost got my code cleaned up enough to show off what is working.

So it should not be long before I have something to show off, I hope.

The system is fairly simple at this time:

A module that does the actual translation, and will be extended for more things. The module filename is 26Bitter and the chunk name is 26Bit; it is currently using chunk &CE000 until I get around to getting an allocation from ROOL.
A stub application that is used to load the actual application and call the module to begin translation.
A loader application that is used to explicitly load any application or module that needs to be translated.

I intend to expand this to include support for loading applications that are run from the Filer or from a *Command, though first I am going to be tracing down the filing vectors to see what I need to sit on and watch for in order to correctly scan for a non 32 bit application to be loaded.

Also I still need to add some wrapper replacements for some SWI Calls in order to get things working well, as many things that do work tend to fail on some SWIs.

Currently it is only working with some well behaved WIMP based apps, and text mode apps that follow all of the rules. It seems as if every other modification breaks something, so I would call the current state pre-alpha. The list of what works will expand with time.

JonAbbott · Post by **JonAbbott** » Sat Nov 16, 2013 9:47 pm

Excellent progress by the sound of things, I'm looking forward to testing it out.

JonAbbott · Post by **JonAbbott** » Tue Nov 26, 2013 9:41 pm

Pac-mania (F1044701) is a good one to test 26bit support on, as far as I can tell it's only its use of MOVNV R0, R0 that's breaking it. You'll need to run it under ADFFS500221 using the F1044701 Obey file to fix the 4-bit mode though.
