
Re: Lets actually try Hybrid Emulation

Posted: Wed Aug 10, 2022 3:21 pm
by foft
Real dhrystones:
400MHz: 47169 dhrystones/s, 26 DMIPS (and mouse jerky!)
800MHz: 84033 dhrystones/s, 47 DMIPS
1000MHz: 126582 dhrystones/s, 72 DMIPS
1200MHz: 169491 dhrystones/s, 96 DMIPS
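
The DMIPS column above appears consistent with the usual convention of dividing Dhrystones per second by 1757 (the VAX 11/780 baseline) and truncating. The post does not say how the figures were derived, so treat the constant and the truncation as assumptions. A minimal sketch:

```python
# Convert Dhrystones/s to DMIPS.
# Assumption: the common VAX 11/780 baseline of 1757 Dhrystones/s per DMIPS,
# truncated to an integer. The post does not state its method.
VAX_DHRYSTONES_PER_SEC = 1757

# ARM clock in MHz -> dhrystones/s, as reported in the post
results = {400: 47169, 800: 84033, 1000: 126582, 1200: 169491}

def dmips(dhrystones_per_sec):
    """Integer DMIPS for a given Dhrystones/s score."""
    return dhrystones_per_sec // VAX_DHRYSTONES_PER_SEC

for mhz, score in results.items():
    print(f"{mhz}MHz: {score} dhrystones/s, {dmips(score)} DMIPS")
```

Under that assumption the four reported DMIPS values (26, 47, 72, 96) are reproduced.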

Re: Lets actually try Hybrid Emulation

Posted: Wed Aug 10, 2022 3:31 pm
by foft
Now, back to "why does it feel slow".
Sysinfo drive speed (DH1):
with ARM as cpu: 701,545 bytes/second
with 68020 soft core: 2,665,871 bytes/second

So drive reads are almost 4x slower (2,665,871 / 701,545 is about 3.8), which would make things like browsing disk feel slow. All down to interrupt latency?

Re: Lets actually try Hybrid Emulation

Posted: Wed Aug 10, 2022 3:47 pm
by foft
One more data point, at 1200MHz Musashi is still not worth it. Still significantly slower than TG68 (like 30% of the speed...). Qemu seems worth it, other than the latency...

Re: Lets actually try Hybrid Emulation

Posted: Wed Aug 10, 2022 3:53 pm
by foft
I also noticed in sysinfo that 'chip speed vs A600' is 12 for the tg68k. It's about 3.18 in qemu.

Now 701545*12/3.18 =~2600000. Very similar to the drive speed fraction, hmmm.
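
The comparison above can be checked numerically. This sketch only uses the four sysinfo figures quoted in the thread; the variable names are made up for illustration.

```python
# Figures quoted in the thread (sysinfo readings).
arm_drive_bps = 701_545      # DH1 speed with ARM (qemu) as CPU, bytes/s
tg68_drive_bps = 2_665_871   # DH1 speed with the 68020 soft core, bytes/s
chip_speed_tg68 = 12.0       # 'chip speed vs A600' on TG68K
chip_speed_qemu = 3.18       # 'chip speed vs A600' on qemu

# If drive speed scaled purely with chip RAM speed, the ARM figure
# scaled up by the chip speed ratio should land near the TG68K figure.
predicted_tg68_drive = arm_drive_bps * chip_speed_tg68 / chip_speed_qemu

drive_ratio = tg68_drive_bps / arm_drive_bps
chip_ratio = chip_speed_tg68 / chip_speed_qemu

print(f"predicted TG68K drive speed: {predicted_tg68_drive:,.0f} bytes/s")
print(f"drive ratio {drive_ratio:.2f} vs chip speed ratio {chip_ratio:.2f}")
```

The two ratios come out within about 1% of each other, which supports the suspicion that drive throughput is tracking chip RAM access speed. It does not prove causation.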

Re: Lets actually try Hybrid Emulation

Posted: Wed Aug 10, 2022 3:54 pm
by foft
So, since I vanished in January, did anyone try anything fun with this? Aranym JIT, Emu68?

Re: Lets actually try Hybrid Emulation

Posted: Wed Aug 10, 2022 4:09 pm
by foft
Some thoughts on chipram speed...

Actually on the bustest qemu does about the same as an A1200 (6MB/s). However the TG68k does much better than the A1200 (18MB/s).

The ARM should be able to benefit from the same, even if its own caching is off. However, it sits behind the HPS-FPGA bridge bottleneck.

Re: Lets actually try Hybrid Emulation

Posted: Wed Aug 10, 2022 7:46 pm
by Neocaron
foft wrote: Wed Aug 10, 2022 3:21 pm Real dhrystones:
400MHz: 47169 dhrystones/s, 26 DMIPS (and mouse jerky!)
800MHz: 84033 dhrystones/s, 47 DMIPS
1000MHz: 126582 dhrystones/s, 72 DMIPS
1200MHz: 169491 dhrystones/s, 96 DMIPS
Thanks for the testing!
The upgrade is still very good!
Any instability during benchmarks at 1.2ghz?

Re: Lets actually try Hybrid Emulation

Posted: Thu Aug 11, 2022 6:24 pm
by foft
Yes it seems stable at 1.2GHz. Though I didn't run it for long...

I'm investigating speeding up the chip/chipram access. This goes via the full HPS-FPGA bridge.

I have a few thoughts on this:
i) Frequency the bridge is clocked at.
ii) Increase bridge width
iii) Try the lightweight bridge (supposed to be faster)
iv) Burst support

So far on these:
i) I was previously using 114MHz having found 32MHz glacially slow. I just tried it at 228MHz with some improvement.
ii) I made a 32-bit version of the bridge (I was using 16-bit). This also supports the 'longword' feature of the core for faster chipram access, though I'm not yet sure I'm using it right.
iii) I tried to change this setting hoping it would 'just work'. It gave me an error that it can only address 17 bits, though I need to confirm this.
iv) Not tried yet.

At 228MHz no burst, arm at 800MHz, if I do 32-bit writes in a tight loop to something FPGA side that is always ready, I get ~27MB/s.
With the arm at 1200MHz that increases slightly to ~35MB/s.
(For reference, if the bridge was 'ideal' it could get to 1GB/s at this speed...)

At 114MHz no burst, arm at 800MHz, if I do 32-bit writes in a tight loop to something FPGA side that is always ready, I get ~20MB/s.
With the arm at 1200MHz that increases slightly to ~25MB/s.

At 32MHz no burst, arm at 800MHz, if I do 32-bit writes in a tight loop to something FPGA side that is always ready, I get ~8MB/s.
With the arm at 1200MHz that increases slightly to ~10MB/s.
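
To put these measurements next to the 'ideal' figure, here is a small sketch that computes the ideal throughput and the bridge clock cycles spent per 32-bit write. It assumes one 32-bit transfer per bridge clock as the ideal and takes 1MB as 10**6 bytes; neither is stated in the post.

```python
BYTES_PER_WRITE = 4  # 32-bit writes

# (bridge clock MHz, ARM clock MHz) -> measured MB/s, from the post
measured = {
    (228, 800): 27, (228, 1200): 35,
    (114, 800): 20, (114, 1200): 25,
    (32, 800): 8,   (32, 1200): 10,
}

def ideal_mb_per_s(bridge_mhz):
    """Ideal rate: one 32-bit transfer per bridge clock (assumption)."""
    return bridge_mhz * BYTES_PER_WRITE

def bridge_cycles_per_write(bridge_mhz, mb_per_s):
    """Bridge clock cycles that elapse per completed 32-bit write."""
    writes_per_s = mb_per_s * 1e6 / BYTES_PER_WRITE
    return bridge_mhz * 1e6 / writes_per_s

for (bridge_mhz, arm_mhz), rate in measured.items():
    ideal = ideal_mb_per_s(bridge_mhz)
    cycles = bridge_cycles_per_write(bridge_mhz, rate)
    print(f"bridge {bridge_mhz}MHz, ARM {arm_mhz}MHz: {rate}MB/s "
          f"({100 * rate / ideal:.1f}% of ideal, ~{cycles:.0f} bridge cycles/write)")
```

On these numbers each write costs roughly 16 to 34 bridge cycles, and the cost in cycles goes up as the bridge clock goes up. One possible reading is that part of the per-write latency does not scale with the bridge clock, but the data here is too thin to confirm that.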

Another idea just popped in my head when writing this. Perhaps I should see if the lightweight bridge is faster despite the limited address range. It may be possible to bank-switch. Though I'm not sure how I could plug that into qemu, but that is another problem.
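
A rough sketch of what bank-switching through a small window would involve. Everything here is hypothetical: the 17-bit window comes from the unconfirmed error message above, and there is no existing bank register in the core. It only shows the address split and how often a bank switch would be needed for a given access pattern.

```python
WINDOW_BITS = 17  # from the unconfirmed error message; treat as an assumption

def split_address(addr, window_bits=WINDOW_BITS):
    """Split a full address into (bank, offset within the window)."""
    bank = addr >> window_bits
    offset = addr & ((1 << window_bits) - 1)
    return bank, offset

def count_bank_switches(addresses, window_bits=WINDOW_BITS):
    """Count how often a (hypothetical) bank register must be rewritten."""
    current_bank = None
    switches = 0
    for addr in addresses:
        bank, _ = split_address(addr, window_bits)
        if bank != current_bank:
            current_bank = bank
            switches += 1
    return switches

# Sequential accesses stay in one bank; ping-ponging between regions does not.
sequential = [0x100000 + 4 * i for i in range(8)]
ping_pong = [0x000000, 0x100000] * 4
print(count_bank_switches(sequential), count_bank_switches(ping_pong))
```

Each bank switch would itself be a bridge write, so whether this wins depends on how local the accesses are. A copy between two distant chip RAM regions would be the bad case.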

Re: Lets actually try Hybrid Emulation

Posted: Thu Aug 11, 2022 6:59 pm
by Neocaron
foft wrote: Thu Aug 11, 2022 6:24 pm Yes it seems stable at 1.2GHz. Though I didn't run it for long...

I'm investigating speeding up the chip/chipram access. This goes via the full HPS-FPGA bridge.

I have a few thoughts on this:
i) Frequency the bridge is clocked at.
ii) Increase bridge width
iii) Try the lightweight bridge (supposed to be faster)
iv) Burst support

So far on these:
i) I was previously using 114MHz having found 32MHz glacially slow. I just tried it at 228MHz with some improvement.
ii) I made a 32-bit version of the bridge (I was using 16-bit). This also supports the 'longword' feature of the core for faster chipram access, though I'm not yet sure I'm using it right.
iii) I tried to change this setting hoping it would 'just work'. It gave me an error that it can only address 17 bits, though I need to confirm this.
iv) Not tried yet.

At 228MHz no burst, arm at 800MHz, if I do 32-bit writes in a tight loop to something FPGA side that is always ready, I get ~27MB/s.
With the arm at 1200MHz that increases slightly to ~35MB/s.
(For reference, if the bridge was 'ideal' it could get to 1GB/s at this speed...)

At 114MHz no burst, arm at 800MHz, if I do 32-bit writes in a tight loop to something FPGA side that is always ready, I get ~20MB/s.
With the arm at 1200MHz that increases slightly to ~25MB/s.

Another idea just popped in my head when writing this. Perhaps I should see if the lightweight bridge is faster despite the limited address range. It may be possible to bank-switch. Though I'm not sure how I could plug that into qemu, but that is another problem.
Would faster DDR3 RAM help?

I know Coolbho3k was looking into getting the DDR3 ram running at its rated 1066 speed instead of the current 800. My guess is that for any latency or bandwidth limiting scenarios it could make a massive difference. Maybe you should talk to him about this, or investigate on your own to see what's possible.

Here's what he said on the subject:
" There may be a way to overclock the memory too. The memory chips on the DE10 Nano BOM are rated at DDR3-1066, while the DE10 Nano runs them at DDR3-800. I'm not sure if this will affect the FPGA side of things. If so, I'm also not sure if this would help alleviate the need for the SDRAM for some cores. It might be worth looking into."

Re: Lets actually try Hybrid Emulation

Posted: Thu Aug 11, 2022 7:15 pm
by foft
Well, faster RAM is always good. It probably doesn't help much with this bottleneck though.

Although... there are always alternative approaches. In this one I took the approach of leaving the core mostly intact. Then I plumbed in the CPU emulator using the HPS-FPGA bridge to access chip ram and hardware registers.

I think we're stuck with the HPS-FPGA bridge to write to the hardware registers.

Chip ram though, that could be changed. The DDR ram can already be accessed from both HPS and FPGA pretty transparently. This is used by e.g. the scaler. So we could put chip ram in the DDR and then point the hardware logic DMA at this instead.

Without changing that we could also enable the caching on the chip ram area. I tried it and it booted fine, but when I went to sysinfo I saw a corrupted screen. So we need to flush the cache sometimes - but when and how? Is chip ram uncachable on all 'real' accelerators?

Re: Lets actually try Hybrid Emulation

Posted: Fri Aug 12, 2022 8:33 pm
by foft
So I wrote a very simple program in devpac:
loop:
move.l $100000,d0
move.l $100004,d1
move.l $100008,d2
move.l $10000c,d3
move.l $100010,d4
move.l $100004,d5
move.l $100008,d6
move.l $10000c,d7
jmp loop

Which I captured in signaltap, specifically the chip sdram reading signals.

Firstly here is how it looks on TG68:
[Attachment: read_tight_loop_unrolled_tg68.png]
Then using qemu with the HPS/FPGA bridge clocked at 28MHz:
[Attachment: read_tight_loop_unrolled_arm_28.png]
Finally with qemu and the HPS/FPGA bridge clocked at 28*7MHz: (*8 does not always synthesize ok)
[Attachment: read_tight_loop_unrolled_arm_28x7.png]
Now on TG68 you can see how it takes about 20 cycles at 114MHz (28*4) to read 4 bytes. So about 21MB/s (4*4*28000000/20/1024/1024).
On the ARM with the natural bridge (32-bit and 28MHz) it takes about 56 cycles at 114MHz (28*4) to read 4 bytes. So about 7.5MB/s.
On the ARM with the sped up bridge (32-bit and 28*7) it takes about 28 cycles at 114MHz (28*4) to read 4 bytes. So about 15MB/s.
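
The same arithmetic as the formula above, written out for all three cases. It uses the post's round 28MHz figure and 1MB = 1024*1024 bytes, as in the original calculation.

```python
# Throughput from SignalTap cycle counts, following the post's formula
# 4*4*28000000/cycles/1024/1024.
CLOCK_HZ = 4 * 28_000_000   # the "114MHz (28*4)" clock, using the round 28MHz figure
BYTES_PER_READ = 4          # one move.l

def mb_per_s(cycles_per_read):
    """MB/s (1MB = 1024*1024 bytes) for a given number of cycles per 4-byte read."""
    return BYTES_PER_READ * CLOCK_HZ / cycles_per_read / 1024 / 1024

cases = {
    "TG68": 20,
    "ARM, bridge at 28MHz": 56,
    "ARM, bridge at 28*7MHz": 28,
}
for name, cycles in cases.items():
    print(f"{name}: {cycles} cycles -> {mb_per_s(cycles):.1f}MB/s")
```

This gives about 21.4, 7.6 and 15.3MB/s, matching the rounded figures quoted.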

In the January release I was using the sped up bridge at 28*4MHz and a 16-bit HPS-FPGA bridge. Can you guess the improvement I see in 'bustest' by changing it to 28x7MHz and a 32-bit HPS-FPGA bridge? None, arg! It shows me 6MB/s.

So what is going on with bustest? Well it seems like every other transaction is slow for some reason:
[Attachment: bustest_longword.png]
Oh and this last picture has the signal names, which I accidentally chopped off the others - oops.

Re: Lets actually try Hybrid Emulation

Posted: Fri Aug 12, 2022 8:53 pm
by foft
Actually TG68 has a similar pattern on bustest, i.e. slow/fast/slow/fast. It's just that its fast is very fast!
[Attachment: tg68_bustest.png]

Re: Lets actually try Hybrid Emulation

Posted: Fri Aug 12, 2022 9:04 pm
by foft
Anyway long story short, there is a 10 cycle (at 114MHz) 'waste' overhead due to the HPS-FPGA bridge. Then another 4 cycles (average) due to clock domain alignment (from 28x7 to 28). So 14 cycles waste per transaction. So I guess we have only ~6 cycles (1.5 cycles at 28MHz) to do the actual memory access to still reach TG68 level memory access performance.
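
The cycle budget above, spelled out. The 20-cycle TG68 figure is taken from the earlier SignalTap post; the constant names are just for illustration.

```python
# All counts are in 114MHz (28*4) cycles unless noted.
TG68_CYCLES_PER_READ = 20   # from the earlier SignalTap capture
BRIDGE_OVERHEAD = 10        # 'waste' in the HPS-FPGA bridge
CDC_ALIGN_AVG = 4           # average clock domain alignment cost (28x7 -> 28)
FAST_CYCLES_PER_28MHZ = 4   # 114MHz cycles per 28MHz cycle

waste = BRIDGE_OVERHEAD + CDC_ALIGN_AVG
budget_fast = TG68_CYCLES_PER_READ - waste
budget_28mhz = budget_fast / FAST_CYCLES_PER_28MHZ

print(f"waste per transaction: {waste} cycles")
print(f"left for the memory access: {budget_fast} cycles "
      f"({budget_28mhz} cycles at 28MHz)")
```

So matching TG68 would leave about 6 fast cycles, or 1.5 cycles at 28MHz, for the actual access.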

Re: Lets actually try Hybrid Emulation

Posted: Sat Aug 13, 2022 9:58 pm
by foft
Some promising experiments with fifos for immediate write completion and pipelined reads...

Re: Lets actually try Hybrid Emulation

Posted: Sun Aug 14, 2022 3:05 pm
by foft
Well, I uploaded a new core and matching qemu to GitHub with what I have so far: hardware description, patched qemu and the compiled setup:
https://github.com/scrameta/Minimig-AGA_MiSTer_Hybrid
https://github.com/scrameta/qemu_MiSTer_Hybrid
https://github.com/scrameta/MiSTer_Hybrid_Support

The changes did not give as big a boost as I hoped, though they are as follows:
i) HPS-FPGA bridge changed to 32-bit from 16-bit.
ii) HPS-FPGA bridge clock changed from 114MHz to 170MHz.
iii) HPS-FPGA 16-deep fifo for writes.
iv) HPS-FPGA 16-deep pipelined read support.
v) Expose CACR and VBR to the FPGA. For now qemu just defaults them to 1 and 0.

Note that I've only seen memcpy native use the pipelined read and only then 2 deep, so it doesn't help much in reality.

edit: Update, I reverted the rtg cache change, it caused corruption. Also note that e.g. doom runs much nicer overclocked.

Re: Lets actually try Hybrid Emulation

Posted: Sun Aug 14, 2022 7:43 pm
by foft
So I was thinking, perhaps it'd be better to change how the hard disk works in hybrid mode.

It's kind of bonkers to go the route it goes... sd card->ide.cpp->spi->fpga->hps/fpga bridge->qemu :lol:
It'd probably make more sense to go sd card->qemu.

I've noticed the MiSTer poll slows down when qemu is used. I don't know yet if this is cpu contention or down to it waiting for something from the FPGA.

Re: Lets actually try Hybrid Emulation

Posted: Sun Aug 14, 2022 8:24 pm
by kolla
Yes, keep as much as possible "close" to the CPU and fast ram, especially I/O. Ideally, when RTG is used, the FPGA should have almost no use :)

Re: Lets actually try Hybrid Emulation

Posted: Sun Aug 14, 2022 9:15 pm
by LamerDeluxe
Really great that progress is being made again on this project. It is very fascinating to follow.

Re: Lets actually try Hybrid Emulation

Posted: Mon Aug 15, 2022 6:56 am
by Caldor
This does sound like it could end up making the CPU emulation faster than just using the FPGA overall :)

I was speculating a bit on what might make for faster disk access, but ended up concluding I just do not know enough about the FPGA code and how much work different solutions might require, or what might and might not be possible. But some way of accessing disks differently ought to help.

I do think its a similar problem the PiStorm has? Well... I think its disk access is faster, but the PiStorm problem I think is access to the slow RAM? So if that is the case I would suspect that giving QEMU direct access to the disk would give similar results to what PiStorm sees.

Re: Lets actually try Hybrid Emulation

Posted: Tue Aug 30, 2022 7:31 pm
by foft
I'm trying again to get uae4all working.

This code is really non-trivial to tear apart, I've tried about 3 times, so I'm trying to instead get it running with minimal changes.

Memory -> point to hps/fpga bridge or memory instead
GUI -> just make it start straight away
video/audio -> point to the dummy device

Once that lives I can try turning off some more parts!

So far... diagrom runs, but its set to 68k for some reason and no jit, but that is probably just a setting. So I'll change that setting then add interrupts. Then, fingers crossed:)

Re: Lets actually try Hybrid Emulation

Posted: Tue Aug 30, 2022 8:07 pm
by foft
OK, jit from uae4arm lives too.

Next, wire up interrupts again, then to workbench... Tomorrow!

I also think its probably pausing the jit to do other uae stuff, so I should find/remove that too.

Re: Lets actually try Hybrid Emulation

Posted: Wed Aug 31, 2022 8:11 pm
by foft
Interrupts wired up, a bunch of unneeded graphics/sound code and threads disabled.

DiagROM seems to be running well. :)

Real kickstart gives me a yellow screen briefly then it reboots. :? This is even before getting to the disk prompt etc.

Re: Lets actually try Hybrid Emulation

Posted: Fri Sep 02, 2022 12:05 pm
by Solskogen
are you using amiberry or the old uae4arm?

Re: Lets actually try Hybrid Emulation

Posted: Fri Sep 02, 2022 2:07 pm
by foft
I’m using TomB’s uae4arm

Re: Lets actually try Hybrid Emulation

Posted: Thu Sep 15, 2022 1:57 pm
by SuperFrog
Can someone please explain how to test minimig hybrid?!

I would really love to check it, but I have no idea where to start from. :(

Re: Lets actually try Hybrid Emulation

Posted: Fri Sep 16, 2022 4:31 pm
by foft
Pop this on your sd card, then start the minimig hybrid core:
https://github.com/scrameta/MiSTer_Hybrid_Support

Re: Lets actually try Hybrid Emulation

Posted: Fri Sep 16, 2022 5:46 pm
by SuperFrog
foft wrote: Fri Sep 16, 2022 4:31 pm Pop this on your sd card, then start the minimig hybrid core:
https://github.com/scrameta/MiSTer_Hybrid_Support
Will try it tonight!

Thank you!!!

Re: Lets actually try Hybrid Emulation

Posted: Mon Apr 22, 2024 10:25 am
by Juri

Hi, what happened to the minimig hybrid core? Dead project? Thanks


Re: Lets actually try Hybrid Emulation

Posted: Sat Apr 27, 2024 4:02 pm
by Arek0xff

Is the project alive?


Re: Lets actually try Hybrid Emulation

Posted: Fri Jun 28, 2024 8:51 am
by foft

I come back to it every year or so.

No-one else is interested in pushing this further though? Thought someone might try porting, for instance, Emu68, or merging the core changes etc.

I was hopeful with the uae4arm cpu. Got that working a while back with diagrom but the actual os didn’t boot. Must be something simple…