# For more options and information see
# http://rpf.io/configtxt
# Some settings may impact device functionality. See link above for details
# uncomment if you get no picture on HDMI for a default "safe" mode
#hdmi_safe=1
# uncomment the following to adjust overscan. Use positive numbers if console
# goes off screen, and negative if there is too much border
If you use Intel Integrated Performance Primitives (IPP), or possibly any other library that performs memory allocation for you, then you may have experienced this issue before: heap overflows seem to cause trouble only at random, or rather, only very 'large' overruns seem to crash your program and/or trigger the corresponding errors in checkers like Valgrind/AddressSanitizer.
This, it seems, is entirely because the IPP documentation explicitly states that memory allocated by its functions (ippMalloc, ippsMalloc, ippsMalloc_L and others) is aligned to a 64-byte boundary.
As one may find after a bit of googling, getting memory aligned to a certain byte boundary requires allocating more memory than was requested; specifically, the allocation grows by exactly the alignment. For example, if we require 100 bytes aligned to a 64-byte boundary, then we ask the system for (64 + 100) bytes in total.
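The over-allocation arithmetic can be sketched with Python's `ctypes`. This only illustrates the generic trick for the 100-byte, 64-byte-aligned example; it is not a claim about how IPP is implemented internally, and `aligned_view` and its return values are hypothetical names made up for this sketch.

```python
import ctypes


def aligned_view(nbytes, alignment=64):
    """Over-allocate by `alignment` bytes and locate the aligned start.

    Hypothetical helper for illustration only.
    Returns (raw_buffer, offset, slack) where:
      offset = bytes skipped at the front to reach the next boundary
      slack  = bytes past the requested region that are still inside
               the same underlying allocation
    """
    raw = ctypes.create_string_buffer(nbytes + alignment)
    addr = ctypes.addressof(raw)
    offset = (-addr) % alignment   # 0 .. alignment-1
    slack = alignment - offset     # 1 .. alignment
    return raw, offset, slack


raw, offset, slack = aligned_view(100, 64)
aligned_addr = ctypes.addressof(raw) + offset

print("total bytes requested from the system:", len(raw))
print("aligned address mod 64:", aligned_addr % 64)
print("bytes past the 100 requested that are still allocated:", slack)
```

The `slack` bytes past the requested 100 still belong to the same underlying allocation, which would be consistent with small overruns going unnoticed while large ones get caught.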
NOTE: Technicall
There are few posts on this topic, and many of them just fall back on 'oh, let the compiler handle this'. Some people want to learn what the compiler is doing.
My question started with this: in developing a library I started by using template specializations to prevent 'disallowed' types from working. However, I soon realised that this may have been overkill, and just using simple overloading would have been enough, since I was template specializing for every type allowed anyway.
But then one should begin to ask:
- Since templates are supposedly compile-time constructs which only 'generate code' when they are invoked, the compiled template specializations should contain less 'dead' code than a bunch of overloaded functions would, right?
- In that case, when do unused overloaded functions get stripped out?
- Or more generally, when do unused functions/templates get stripped out?
Compiled for commit d87888b, i.e. 1.7.4:
cmake .. -DFAISS_ENABLE_GPU=OFF -DFAISS_ENABLE_PYTHON=OFF -DBUILD_SHARED_LIBS=ON -DCMAKE_BUILD_TYPE=Release -DFAISS_OPT_LEVEL=generic -DBLA_VENDOR=Intel10_64ilp_seq -DMKL_LIBRARIES=%MKLROOT%\lib\intel64\mkl_core.lib;%MKLROOT%\lib\intel64\mkl_sequential.lib;%MKLROOT%\lib\intel64\mkl_intel_ilp64.lib -DBUILD_TESTING=OFF
You have a repository on github, with a submodule that points to another of your repos on github.
You pull both, and push them both to your internal/offline github/gitea.
There's an issue when updating the one that requires a submodule: The submodule still has its path pointing to something on github.
- Go to `.gitmodules` and change the url to your internal github/gitea repository for the submodule.
- Run `git submodule sync` in the repo with the submodule.
- Then you can run `git submodule update --remote` as per normal.
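The `.gitmodules` edit in the first step can be sketched as a small text rewrite. This is a minimal sketch, assuming the usual `[submodule "name"]` section layout; the function name `retarget_submodule`, the submodule names and both URLs are hypothetical placeholders.

```python
def retarget_submodule(gitmodules_text, name, new_url):
    """Return .gitmodules text with the url of submodule `name` replaced.

    Hypothetical helper: only the `url = ...` line inside the matching
    [submodule "name"] section is rewritten; everything else is kept.
    """
    out, in_section = [], False
    for line in gitmodules_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            # Track whether we are inside the section we want to change.
            in_section = stripped == f'[submodule "{name}"]'
        elif in_section and stripped.split("=", 1)[0].strip() == "url":
            indent = line[: len(line) - len(line.lstrip())]
            line = f"{indent}url = {new_url}"
        out.append(line)
    return "\n".join(out) + "\n"


# Made-up example content; real names and URLs will differ.
sample = (
    '[submodule "libfoo"]\n'
    "\tpath = extern/libfoo\n"
    "\turl = https://github.com/example/libfoo.git\n"
    '[submodule "other"]\n'
    "\tpath = extern/other\n"
    "\turl = https://github.com/example/other.git\n"
)
patched = retarget_submodule(
    sample, "libfoo", "https://gitea.internal.example/me/libfoo.git"
)
print(patched)
```

The same edit should also be possible with `git config --file .gitmodules submodule.<name>.url <new-url>`. Either way, the `git submodule sync` and `git submodule update --remote` steps still need to be run afterwards.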
Just look at this directly.
This seems like a better replacement for nvprof than ncu. Had to add this to PATH myself to find it though..
This gist serves to document certain cupy calls and the wall-clock timings associated with them. For each code snippet (usually timed simply by %%timeit), the corresponding timing output will be provided. For most/all calls, -n 100 is added because the processing time taken increases exponentially when the GPU is flooded with too many loops from %%timeit (not sure why, but probably tries to queue too many kernel calls together).
Yes, wall-clock time is not the correct way to measure GPU processing times, but when control switches back and forth between the interpreter and the GPU, this is usually the more 'reasonable' time to look at.
If you're like me, you have some (or most) of your github repositories stored in some parent folder like `~/gitrepos`. Sometimes you want to just update multiple repositories together without having to `git pull` each one individually. Hopefully this will help you.
First off, SSH access is required for private repos, so we'll assume you have this set up already. Github has enough documentation to cover this.
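With SSH access in place, one way to update every repository under a parent folder is sketched below. This is not necessarily the approach the original note had in mind; `find_git_repos` and `pull_all` are hypothetical helper names, and only immediate subfolders are checked.

```python
import subprocess
from pathlib import Path


def find_git_repos(parent):
    """Return immediate subdirectories of `parent` that contain a .git entry."""
    parent = Path(parent).expanduser()
    return sorted(
        p for p in parent.iterdir() if p.is_dir() and (p / ".git").exists()
    )


def pull_all(parent, dry_run=False):
    """Run `git pull` in every repository found under `parent`.

    Returns {repo name: exit code}, or None as the value when dry_run is
    set and the command is only printed instead of executed.
    """
    results = {}
    for repo in find_git_repos(parent):
        cmd = ["git", "-C", str(repo), "pull"]
        if dry_run:
            print(" ".join(cmd))
            results[repo.name] = None
        else:
            results[repo.name] = subprocess.run(cmd).returncode
    return results
```

Calling `pull_all("~/gitrepos", dry_run=True)` first prints the commands without running them; dropping `dry_run` performs the pulls one repository at a time.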
If your keys are password locked (as they should be), you should run
https://github.com/falkenber9/falcon Tested on 1.3.0 release source tar.gz. Tested with 3.15.0 UHD/1.72.0 Boost.
In order to use your own self-built UHD + self-built Boost, simply supply the corresponding CMake variables in cmake-gui when prompted, for both UHD and Boost.
However, the boost includes are not properly set up when the makefiles are generated. So during the build, for any files that error on boost includes, open the corresponding flags.make file and append the boost directory yourself accordingly. The following files were the culprits for me:
- build/srsLTE-build/lib/src/phy/rf/CMakeFiles/srslte_rf.dir/flags.make (add to both C_INCLUDES and CXX_INCLUDES)
- build/src/gui/model/CMakeFiles/model.dir/flags.make
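The manual flags.make edit can also be scripted. The sketch below assumes the file contains lines of the form `C_INCLUDES = -I...` and `CXX_INCLUDES = -I...`; the helper name `add_include_dir` and the Boost path are hypothetical.

```python
def add_include_dir(flags_text, include_dir,
                    variables=("C_INCLUDES", "CXX_INCLUDES")):
    """Append -I<include_dir> to the given variables in flags.make text.

    Hypothetical helper: lines that already contain the flag are left
    untouched, so running it twice gives the same result.
    """
    flag = f"-I{include_dir}"
    out = []
    for line in flags_text.splitlines():
        name = line.split("=", 1)[0].strip()
        if "=" in line and name in variables and flag not in line.split():
            line = f"{line.rstrip()} {flag}"
        out.append(line)
    return "\n".join(out) + "\n"


# Made-up flags.make content and Boost location, for illustration only.
sample = (
    "C_FLAGS = -O2\n"
    "C_INCLUDES = -I/usr/include/foo\n"
    "CXX_INCLUDES = -I/usr/include/foo\n"
)
patched = add_include_dir(sample, "/opt/boost_1_72_0")
print(patched)
```

To apply it to a real file, read the flags.make text, pass it through `add_include_dir`, and write the result back before resuming the build.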