Benchmark for Allwinner A10 SRAM
In the kernel:

diff --git a/drivers/char/sun4i_g2d/g2d_driver.c b/drivers/char/sun4i_g2d/g2d_driver.c
index be4efe2..33c43ba 100644
--- a/drivers/char/sun4i_g2d/g2d_driver.c
+++ b/drivers/char/sun4i_g2d/g2d_driver.c
@@ -317,15 +317,11 @@ int g2d_mmap(struct file *file, struct vm_area_struct * vma)
 	unsigned long mypfn;
 	unsigned long vmsize = vma->vm_end-vma->vm_start;
-	if(g2d_mem[g2d_mem_sel].b_used == 0)
-	{
-		ERR("mem not used in g2d_mmap,%d\n",g2d_mem_sel);
-		return -EINVAL;
-	}
-
-	physics = g2d_mem[g2d_mem_sel].phy_addr;
+	physics = 0x20000; // g2d_mem[g2d_mem_sel].phy_addr;
 	mypfn = physics >> PAGE_SHIFT;
+	vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
+
 	if(remap_pfn_range(vma,vma->vm_start,mypfn,vmsize,vma->vm_page_prot))
 		return -EAGAIN;
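
The results below also cover a run with the SRAM mapped as pgprot_noncached. Presumably that run uses the same hack with only the page protection call changed; a minimal sketch of the variant (an assumption, placed at the same spot in g2d_mmap() as in the diff above):

	physics = 0x20000; /* physical address of the benchmarked SRAM window */
	mypfn = physics >> PAGE_SHIFT;
	/* noncached instead of writecombine for the second run below */
	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

	if(remap_pfn_range(vma,vma->vm_start,mypfn,vmsize,vma->vm_page_prot))
		return -EAGAIN;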
In userland:

int g2d;
uint8_t *sram_addr;
#define SRAM_SIZE (64 * 1024)

if ((g2d = open("/dev/g2d", O_RDWR)) == -1)
{
    printf("Failed to open /dev/g2d\n");
    return 1;
}
sram_addr = mmap(NULL, SRAM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, g2d, 0);
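
For reference, here is a self-contained version of the same userland mapping code (a minimal sketch, not necessarily the exact test program used for the results below); it adds the headers and the mmap() error check that the fragment above omits, plus a trivial write/read sanity check:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define SRAM_SIZE (64 * 1024)

int main(void)
{
    int g2d;
    uint8_t *sram_addr;

    /* the patched g2d_mmap() ignores its buffer bookkeeping and maps
       the on-chip SRAM at physical address 0x20000 instead */
    if ((g2d = open("/dev/g2d", O_RDWR)) == -1)
    {
        printf("Failed to open /dev/g2d\n");
        return 1;
    }

    sram_addr = mmap(NULL, SRAM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, g2d, 0);
    if (sram_addr == MAP_FAILED)
    {
        printf("Failed to mmap the SRAM\n");
        close(g2d);
        return 1;
    }

    /* quick sanity check: fill the window with a pattern and read it back */
    memset(sram_addr, 0xA5, SRAM_SIZE);
    printf("first byte: 0x%02x, last byte: 0x%02x\n",
           sram_addr[0], sram_addr[SRAM_SIZE - 1]);

    /* a benchmark (e.g. the patched tinymembench) would use sram_addr
       as its working buffer at this point */

    munmap(sram_addr, SRAM_SIZE);
    close(g2d);
    return 0;
}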
And the results from tinymembench, patched to run its tests on this SRAM mapping:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!! Benchmarking 64KiB of SRAM (mapped as pgprot_writecombine) !!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
tinymembench v0.2.9 (simple benchmark for memory throughput and latency)
==========================================================================
== Memory bandwidth tests ==
== ==
== Note 1: 1MB = 1000000 bytes ==
== Note 2: Results for 'copy' tests show how many bytes can be ==
== copied per second (adding together read and writen ==
== bytes would have provided twice higher numbers) ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
== to first fetch data into it, and only then write it to the ==
== destination (source -> L1 cache, L1 cache -> destination) ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in ==
== brackets ==
==========================================================================
C copy backwards : 71.0 MB/s (0.2%)
C copy : 71.0 MB/s
C copy prefetched (32 bytes step) : 74.8 MB/s
C copy prefetched (64 bytes step) : 74.6 MB/s (0.2%)
C 2-pass copy : 70.0 MB/s
C 2-pass copy prefetched (32 bytes step) : 70.0 MB/s
C 2-pass copy prefetched (64 bytes step) : 70.1 MB/s
C fill : 382.7 MB/s
---
standard memcpy : 78.1 MB/s (0.2%)
standard memset : 382.8 MB/s
---
NEON read : 667.9 MB/s
NEON read prefetched (32 bytes step) : 667.9 MB/s
NEON read prefetched (64 bytes step) : 667.8 MB/s
NEON copy : 243.7 MB/s (0.1%)
NEON copy prefetched (32 bytes step) : 243.7 MB/s
NEON copy prefetched (64 bytes step) : 243.7 MB/s
NEON unrolled copy : 243.7 MB/s
NEON unrolled copy prefetched (32 bytes step) : 243.6 MB/s (0.2%)
NEON unrolled copy prefetched (64 bytes step) : 243.7 MB/s
NEON copy backwards : 243.7 MB/s
NEON copy backwards prefetched (32 bytes step) : 243.7 MB/s
NEON copy backwards prefetched (64 bytes step) : 243.7 MB/s (0.2%)
NEON 2-pass copy : 240.9 MB/s
NEON 2-pass copy prefetched (32 bytes step) : 240.9 MB/s
NEON 2-pass copy prefetched (64 bytes step) : 240.9 MB/s
NEON unrolled 2-pass copy : 239.6 MB/s (0.2%)
NEON unrolled 2-pass copy prefetched (32 bytes step) : 239.1 MB/s
NEON unrolled 2-pass copy prefetched (64 bytes step) : 239.3 MB/s
NEON fill : 382.6 MB/s
NEON fill backwards : 382.7 MB/s
ARM fill (STRD) : 381.5 MB/s (0.2%)
ARM fill (STM with 8 registers) : 381.5 MB/s
ARM fill (STM with 4 registers) : 381.5 MB/s
ARM copy prefetched (incr pld) : 78.1 MB/s (0.2%)
ARM copy prefetched (wrap pld) : 77.9 MB/s
ARM 2-pass copy prefetched (incr pld) : 71.5 MB/s
ARM 2-pass copy prefetched (wrap pld) : 71.5 MB/s
==========================================================================
== Memory latency test ==
== ==
== Average time is measured for random memory accesses in the buffers ==
== of different sizes. The larger is the buffer, the more significant ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM ==
== accesses. For extremely large buffer sizes we are expecting to see ==
== page table walk with total 3 requests to SDRAM for almost every ==
== memory access (though 64MiB is not large enough to experience this ==
== effect to its fullest). ==
== ==
== Note 1: All the numbers are representing extra time, which needs to ==
== be added to L1 cache latency. The cycle timings for L1 cache ==
== latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
== two independent memory accesses at a time. In the case if ==
== the memory subsystem can't handle multiple outstanding ==
== requests, dual random read has the same timings as two ==
== single reads performed one after another. ==
==========================================================================
block size : read access time (single random read / dual random read)
2 : 80.5 ns / 162.1 ns
4 : 80.5 ns / 162.1 ns
8 : 80.6 ns / 162.1 ns
16 : 80.6 ns / 162.1 ns
32 : 80.5 ns / 162.1 ns
64 : 80.5 ns / 162.1 ns
128 : 80.5 ns / 162.1 ns
256 : 80.6 ns / 162.1 ns
512 : 80.5 ns / 162.1 ns
1024 : 80.6 ns / 162.1 ns
2048 : 80.5 ns / 162.1 ns
4096 : 80.5 ns / 162.1 ns
8192 : 80.7 ns / 162.4 ns
16384 : 80.9 ns / 162.5 ns
32768 : 80.7 ns / 162.3 ns
65536 : 80.8 ns / 162.5 ns
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!! Benchmarking 64KiB of SRAM (mapped as pgprot_noncached) !!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
tinymembench v0.2.9 (simple benchmark for memory throughput and latency)
==========================================================================
== Memory bandwidth tests ==
== ==
== Note 1: 1MB = 1000000 bytes ==
== Note 2: Results for 'copy' tests show how many bytes can be ==
== copied per second (adding together read and writen ==
== bytes would have provided twice higher numbers) ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
== to first fetch data into it, and only then write it to the ==
== destination (source -> L1 cache, L1 cache -> destination) ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in ==
== brackets ==
==========================================================================
C copy backwards : 40.6 MB/s (5.0%)
C copy : 40.6 MB/s (0.2%)
C copy prefetched (32 bytes step) : 40.3 MB/s
C copy prefetched (64 bytes step) : 40.3 MB/s
C 2-pass copy : 39.9 MB/s
C 2-pass copy prefetched (32 bytes step) : 39.5 MB/s (0.1%)
C 2-pass copy prefetched (64 bytes step) : 39.5 MB/s (1.7%)
C fill : 78.8 MB/s (6.0%)
---
standard memcpy : 40.6 MB/s
standard memset : 78.8 MB/s
---
NEON read : 144.9 MB/s
NEON read prefetched (32 bytes step) : 142.9 MB/s
NEON read prefetched (64 bytes step) : 144.9 MB/s (0.1%)
NEON copy : 49.0 MB/s (4.4%)
NEON copy prefetched (32 bytes step) : 48.8 MB/s (0.2%)
NEON copy prefetched (64 bytes step) : 48.9 MB/s
NEON unrolled copy : 69.2 MB/s
NEON unrolled copy prefetched (32 bytes step) : 68.7 MB/s
NEON unrolled copy prefetched (64 bytes step) : 69.2 MB/s (0.1%)
NEON copy backwards : 49.0 MB/s (1.7%)
NEON copy backwards prefetched (32 bytes step) : 48.8 MB/s (6.1%)
NEON copy backwards prefetched (64 bytes step) : 48.9 MB/s
NEON 2-pass copy : 47.0 MB/s
NEON 2-pass copy prefetched (32 bytes step) : 47.0 MB/s
NEON 2-pass copy prefetched (64 bytes step) : 47.0 MB/s (0.1%)
NEON unrolled 2-pass copy : 67.2 MB/s
NEON unrolled 2-pass copy prefetched (32 bytes step) : 66.0 MB/s (6.1%)
NEON unrolled 2-pass copy prefetched (64 bytes step) : 67.0 MB/s
NEON fill : 132.2 MB/s
NEON fill backwards : 132.4 MB/s
ARM fill (STRD) : 78.8 MB/s
ARM fill (STM with 8 registers) : 80.6 MB/s
ARM fill (STM with 4 registers) : 80.0 MB/s
ARM copy prefetched (incr pld) : 40.6 MB/s (0.1%)
ARM copy prefetched (wrap pld) : 40.6 MB/s
ARM 2-pass copy prefetched (incr pld) : 40.2 MB/s (4.8%)
ARM 2-pass copy prefetched (wrap pld) : 40.2 MB/s
==========================================================================
== Memory latency test ==
== ==
== Average time is measured for random memory accesses in the buffers ==
== of different sizes. The larger is the buffer, the more significant ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM ==
== accesses. For extremely large buffer sizes we are expecting to see ==
== page table walk with total 3 requests to SDRAM for almost every ==
== memory access (though 64MiB is not large enough to experience this ==
== effect to its fullest). ==
== ==
== Note 1: All the numbers are representing extra time, which needs to ==
== be added to L1 cache latency. The cycle timings for L1 cache ==
== latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
== two independent memory accesses at a time. In the case if ==
== the memory subsystem can't handle multiple outstanding ==
== requests, dual random read has the same timings as two ==
== single reads performed one after another. ==
==========================================================================
block size : read access time (single random read / dual random read)
2 : 86.5 ns / 174.1 ns
4 : 86.5 ns / 174.1 ns
8 : 86.5 ns / 174.1 ns
16 : 86.5 ns / 174.1 ns
32 : 86.5 ns / 174.1 ns
64 : 86.5 ns / 174.1 ns
128 : 86.5 ns / 174.1 ns
256 : 86.5 ns / 174.1 ns
512 : 86.5 ns / 174.1 ns
1024 : 86.5 ns / 174.1 ns
2048 : 86.5 ns / 174.0 ns
4096 : 86.5 ns / 174.1 ns
8192 : 86.5 ns / 174.1 ns
16384 : 86.5 ns / 174.1 ns
32768 : 86.5 ns / 174.1 ns
65536 : 86.5 ns / 174.1 ns