Optimization guide
This guide covers some techniques one can use to optimize DS/DSi homebrew with BlocksDS.
- Measure CPU performance on hardware only! As of writing (May 2024), the most popular emulators are not sufficiently cycle-accurate to provide accurate profiling results; this might lead, at worst, to mistaken optimizations.
- Avoid using floating point values (
float
,double
). As the DS does not include any hardware for accelerating floating point math, operations on them have to be emulated in software, which is quite costly. - On the ARM9, place “hot” (often used) code and data in ITCM and DTCM, respectively. More information is available below.
ITCM stands for Instruction Tightly Coupled Memory, and on the DS refers to 32 kilobytes of fast memory connected directly to the ARM9 CPU to be used for instruction code. Analogously, DTCM refers to 16 kilobytes of Data Tightly Coupled Memory.
To place a function in ITCM or DTCM, you can use one of two approaches:
- Rename the .c or .cpp file containing it to use the .itcm.c, .itcm.cpp, .dtcm.c or .dtcm.cpp suffix instead - to move the file’s entire contents,
- Annotate the specific function with one of the following macros:
ITCM_CODE
- for code,DTCM_DATA
- for data pre-initialized with values,DTCM_BSS
- for data which is not pre-initialized or is filled with zeros.
Note that neither ITCM nor DTCM are accessible to hardware external to the ARM9 main CPU, such as the ARM7 or the DMA hardware.
More details, as well as information on how to use ITCM/DTCM in assembly code, is available in the usage notes.
- By default, the versions of
printf()
andscanf()
linked by the toolchain contain code to handlefloat
anddouble
values. The required conversion code costs a few kilobytes of memory. Instead, integer versions can be utilized, as documented below.
By default, the build of picolibc
of BlocksDS makes printf()
, sscanf()
and
similar functions support float and double conversions. This is done to maintain
compatibility with existing code (and the C standard), but it increases the size
of the final binaries. It is possible to switch to integer-only versions of the
functions by adding the following line to the DEFINES
of your Makefile:
DEFINES := [all other options go here] \
-DPICOLIBC_LONG_LONG_PRINTF_SCANF
There are additional variants available, depending on the codebase’s specific requirements:
-DPICOLIBC_DOUBLE_PRINTF_SCANF
: the default; supports float and double values.-DPICOLIBC_FLOAT_PRINTF_SCANF
: supports float values, but not doubles.-DPICOLIBC_LONG_LONG_PRINTF_SCANF
: supports all integer types, but not float types.-DPICOLIBC_INTEGER_PRINTF_SCANF
: supports 32-bit integer types, but not 64-bit integer types such aslong long
or float types.-DPICOLIBC_MINIMAL_PRINTF_SCANF
: the smallest version; supports only 32-bit integer types and omits support for many non-basic printf/scanf features. Full compatibility withlibnds
is not guaranteed at this level, and it should be used with caution.
For more information, please read the relevant picolibc documentation.
If you’ve programmed for ROM-based platforms before, like the GBA, you might be used to accessing in-ROM assets or filesystems instantly. That is not the case on BlocksDS; only data loaded to memory is fast to access, while filesystem reads including NitroFS are comparatively slow. Note that the speed also differs depending on hardware. Emulators and the DSi SD card are faster than commonly available flashcarts. Some SD cards are also slower than others. In addition, fragmented FAT filesystems can be slower to read than defragmented ones. As such, one should not rely on filesystem accesses being of a particular minimum speed.
There are things which can be done to make the most out of filesystem I/O:
- Read larger chunks of data at a time. As starting an SD card read has latency of its own, processing more than one sector in one read call is much faster than doing so across multiple read calls. An optimal value is the size of one cluster (typically between 4 and 32 KB), though BlocksDS can also sometimes optimize reads larger than that.
- If your data format is based around small sequential reads, you may benefit
from enabling the C standard library’s buffering mechanism using the
setvbuf
function. Note that this slows down random reads, as a buffered random read will require re-filling the whole buffer. For example, to allocate a 4 KB buffer for a file, one can use:
setvbuf(file, NULL, _IOFBF, 4 * 1024);
- If your data format is based around random reads, you may benefit from enabling a lookup cache. This reduces the amount of SD card reads required for finding the location of the specific chunk of data in the filesystem. The lookup cache has to be of a certain maximum size, but on defragmented devices will typically not exceed a few hundred bytes. Note that the lookup cache is automatically enabled for NitroFS accesses. For example, to try and enable the lookup cache on a file:
fatInitLookupCacheFile(file, 2 * 1024); /* 2 KB maximum lookup cache */
The GL_RGB
texture format isn’t actually supported by the hardware. When a
texture is loaded with glTexImage2D()
with format GL_RGB
, it is actually
converted to GL_RGBA
internally by setting the alpha bit of all pixels to 1.
This is pretty slow, and it isn’t normally needed because any graphics
conversion tool (such as grit
) can set the alpha bit to 1 when the image is
converted. None of the 3D examples of BlocksDS use GL_RGB
. The ones that use
GL_RGBA
also show how to convert images the right way.
Also, consider using other texture formats. GL_RGBA
isn’t a very efficient way
to store textures. All other formats (the ones that use palettes) use a lot less
memory. In many cases GL_COMPRESSED
is the best format you can use (but
converting images to this format is only currently supported by ptexconv
).
BlocksDS has examples of how to use all texture formats (except for GL_RGB
,
which works the same way as GL_RGBA
, and is never used in order to not
encourage new users to use it). Take a look at the examples for more details.
Final note: The video capture circuit of the NDS sets the alpha bit to 1, so it isn’t needed even if you want to use the saved image as a texture, or as a 16-bit bitmap background. There is an example of how to render to a texture too.