This option applies only to MIC cards. It first scatters threads across the cores so that each core has at least one thread, and then numbers the threads so that threads using different hardware threads of the same core have thread numbers close to each other.
Below is an example of setting KMP_AFFINITY to various options to allocate 6 OpenMP threads on one MIC card. To keep the illustration simple, assume each MIC card has only 3 cores instead of 60 cores.
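The placement for this 6-thread, 3-core case can be sketched with a small toy model. This is not the Intel OpenMP runtime: the function name show_placement is hypothetical, and the model assumes 4 hardware threads per core and a thread count that divides evenly among the cores.

```shell
# Toy model only (assumption: NOT the Intel runtime's algorithm).
# Prints the core each OpenMP thread would land on for the
# compact, scatter, and balanced affinity types.
# Arguments: <number of threads> <number of cores> <hardware threads per core>
show_placement() {
    sp_nthreads=$1
    sp_ncores=$2
    sp_hw=$3
    # Threads per core under "balanced" (even split assumed).
    sp_per_core=$(( sp_nthreads / sp_ncores ))
    sp_t=0
    while [ "$sp_t" -lt "$sp_nthreads" ]; do
        echo "thread $sp_t: compact=core$(( sp_t / sp_hw )) scatter=core$(( sp_t % sp_ncores )) balanced=core$(( sp_t / sp_per_core ))"
        sp_t=$(( sp_t + 1 ))
    done
}

# 6 OpenMP threads on a simplified card with 3 cores, 4 hardware threads each.
show_placement 6 3 4

# Requesting one of the layouts for a real run:
export OMP_NUM_THREADS=6
export KMP_AFFINITY=balanced
```

In this model, compact fills core 0 before touching core 1 and leaves core 2 idle, scatter places threads 0 and 3 on the same core, and balanced keeps threads 0 and 1 together on core 0.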

This is a new environment variable available only for the MIC cards. It does not replace KMP_AFFINITY, but works with it to set exact but still generic thread placement.
NOTE: The operating system (OS) always runs on logical processor 0, which lives on physical core 59 on Babbage. The OS procs on core 59 are threads 0, 237, 238, and 239. Please avoid using proc 0; i.e., use max_threads=236 on Babbage.
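The arithmetic behind the 236-thread limit can be sketched as follows. The KMP_PLACE_THREADS value shown (59 cores with 4 threads each) is an assumed setting for illustration; verify on Babbage that it leaves the OS core unused.

```shell
cores=60        # physical cores per MIC card
hw_threads=4    # hardware threads per core
reserved=4      # logical processors on the OS core (0, 237, 238, 239)

max_threads=$(( cores * hw_threads - reserved ))
echo "max_threads=$max_threads"    # prints max_threads=236

export OMP_NUM_THREADS=$max_threads
# Assumption: 59 cores x 4 threads per core keeps the run off the OS core.
export KMP_PLACE_THREADS=59c,4t
export KMP_AFFINITY=balanced
```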
The environment variable I_MPI_PIN_DOMAIN can be used to set MPI process affinity. Thread affinity then works within process affinity. The value of I_MPI_PIN_DOMAIN can be set in 3 different formats.
I_MPI_PIN_DOMAIN=<size>:<layout>, where <size> can be omp, auto, or an explicit value <n>. Here the value of <n> ranges from 1 to 240, which is the max number of logical cores. auto is the default. <layout> can be platform, compact, or scatter. scatter is the default.
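A minimal sketch of combining process and thread affinity, assuming 4 OpenMP threads per MPI task. The application name and the launch line in the comment are hypothetical.

```shell
export OMP_NUM_THREADS=4

# Size "omp": each MPI process gets a domain of OMP_NUM_THREADS logical cores.
export I_MPI_PIN_DOMAIN=omp
# Alternative with an explicit size and layout:
#   export I_MPI_PIN_DOMAIN=4:compact

# Thread affinity then applies inside each process domain.
export KMP_AFFINITY=balanced

echo "I_MPI_PIN_DOMAIN=$I_MPI_PIN_DOMAIN OMP_NUM_THREADS=$OMP_NUM_THREADS"
# Hypothetical launch line:
#   mpirun.mic -n 8 -host bc0903-mic0 ./my_app.mic
```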
Notice that core numbering is different on the MIC cards than on the host nodes. On the host nodes, core numbering starts with 0, while on the MIC cards, core numbering starts from 1, since core 0 is reserved for the operating system. Multi-core shape and explicit shape schemes as listed above will automatically account for this.
Nested OpenMP
• Use "get_hostfile" instead of "get_micfile" in your batch script, and use "-hostfile hostfile.$SLURM_JOB_ID" in the mpirun command. You can also create a custom hostfile with lines such as "bc1013-ib."
• When running on the Babbage host processors, the number of MPI tasks times the number of OpenMP threads per compute node can currently be at most 16, because Hyper-Threading (HT) is not enabled. This compares with a maximum of 240 on each MIC card.
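The host-node limit can be checked before launching. This is a sketch only: the helper name fits_on_host and the application name are hypothetical, and the launch line assumes it runs inside a batch job after get_hostfile.

```shell
# Hypothetical helper: succeeds only if tasks x threads fits on one host node.
# Arguments: <MPI tasks per node> <OpenMP threads per task>
fits_on_host() {
    [ $(( $1 * $2 )) -le 16 ]
}

ntasks=4
nthreads=4
if fits_on_host "$ntasks" "$nthreads"; then
    echo "ok: $ntasks tasks x $nthreads threads fits on a host node"
else
    echo "too many: $ntasks tasks x $nthreads threads exceeds 16"
fi

export OMP_NUM_THREADS=$nthreads
# Launch line inside a batch job, after running get_hostfile:
#   mpirun -n $ntasks -hostfile hostfile.$SLURM_JOB_ID ./my_app.host
```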
When building software and libraries on MIC cards using autoconf/configure scripts, sometimes a test program needs to be run. Since the build is a cross-compile from the login node (or host compute nodes), the binary generated from the test program is for running on the MIC cards, so this test program will fail during the configure process (binaries are not compatible between the host nodes and MIC cards).
For such a configure run to succeed and generate a Makefile that can be used to build the intended software or libraries for the MIC cards, we suggest two workarounds:
The first option to try is to use the "--host=x86_64-unknown-linux-gnu" option for configure so that many test programs can be skipped. If this fails, another trick is to define "-DMIC" in the compiler variables such as CC, CXX, FC, etc. used by "configure": export CC="icc -DMIC", … . Then replace all "-DMIC" in the generated Makefile with "-mmic", and compile and build.
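The second workaround can be sketched as below. The helper name mic_fix_makefile is hypothetical, and the demonstration runs on a throwaway Makefile rather than a real source tree. Note that the substitution also rewrites any longer macro that merely starts with "-DMIC", so inspect the result.

```shell
# Hypothetical helper: swap the -DMIC placeholder for -mmic in a generated
# Makefile, keeping a backup copy as <file>.bak.
mic_fix_makefile() {
    sed -i.bak 's/-DMIC/-mmic/g' "$1"
}

# Intended workflow in a real source tree (not executed here):
#   export CC="icc -DMIC"
#   ./configure
#   mic_fix_makefile Makefile
#   make

# Demonstration on a throwaway Makefile.
tmp=$(mktemp -d)
printf 'CC = icc -DMIC\nFC = ifort -DMIC\n' > "$tmp/Makefile"
mic_fix_makefile "$tmp/Makefile"
cat "$tmp/Makefile"
```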
• Intel MPI dynamically selects the most appropriate network fabric for communications. Intra-node communication uses "shm"; inter-node communication uses tcp, or dapl and ofa based on InfiniBand. Set the environment variable "I_MPI_FABRICS" to "<intranode fabric>:<internode fabric>" at run time to specify the network fabric. The default fabric is "shm:dapl". Available I_MPI_FABRICS choices on Babbage are "shm:dapl", "shm:ofa", and "shm:tcp". Try different fabrics with your application to choose the one that helps performance the most. The MPI fabrics used are displayed if the environment variable "I_MPI_DEBUG" is set to 2 or higher.
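One way to compare the available fabrics is to loop over them and rerun the same job. In this sketch the launch line is a hypothetical placeholder left as a comment, so the loop only shows the settings each run would use.

```shell
for fabric in shm:dapl shm:ofa shm:tcp; do
    export I_MPI_FABRICS="$fabric"
    export I_MPI_DEBUG=2    # 2 or higher reports the fabrics in use
    echo "next run would use I_MPI_FABRICS=$I_MPI_FABRICS I_MPI_DEBUG=$I_MPI_DEBUG"
    # Hypothetical launch line; time each run and compare:
    #   mpirun.mic -n 8 -host bc0903-mic0 ./my_app.mic
done
```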
When -opt-assume-safe-padding is specified, the compiler assumes that variables and dynamically allocated memory are padded past the end of the object. This means that code can access up to 64 bytes beyond what is specified in your program. To satisfy this assumption, you must increase the size of static and automatic objects in your program when you use this option.
Enables generation of streaming stores for optimization. It helps especially when the application is memory bound. With streaming stores, the original content of a cache line is not read from memory when the whole line is about to be overwritten.
Lets you specify a level of accuracy (precision) that the compiler should use when determining which math library functions to use. Low is equivalent to accuracy-bits = 11 for single-precision functions; accuracy-bits = 26 for double-precision functions.
The compiler may change floating-point division computations into multiplication by the reciprocal of the denominator. For example, A/B is computed as A * (1/B) to improve the speed of the computation. It gives slightly less precise results than full IEEE division.
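The four options described above could be combined on one compile line, as sketched here. The option spellings for streaming stores, math library precision, and reciprocal division (-opt-streaming-stores, -fimf-precision, -no-prec-div) are the Intel compiler names assumed to match these descriptions, and the source file name is hypothetical. Check "icc -help" for the compiler version in use.

```shell
MIC_FLAGS="-mmic"
MIC_FLAGS="$MIC_FLAGS -opt-assume-safe-padding"       # objects padded past their end
MIC_FLAGS="$MIC_FLAGS -opt-streaming-stores always"   # assumed name: streaming stores
MIC_FLAGS="$MIC_FLAGS -fimf-precision=low"            # assumed name: low-accuracy math functions
MIC_FLAGS="$MIC_FLAGS -no-prec-div"                   # assumed name: A/B computed as A * (1/B)

# Print the compile line instead of running it (my_app.c is hypothetical).
echo "icc $MIC_FLAGS -o my_app.mic my_app.c"
```

Since the last two options trade accuracy for speed, compare results against a build without them before adopting them.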
Users should explore the single-node performance of their codes on Babbage in order to prepare their applications for the N8 architecture (Intel KNL). Fully utilizing vectorization and thread scalability on the Babbage KNC cards is especially important.
• NERSC Application Readiness Case Studies: Some examples of challenges and strategies used to optimize the performance of scientific applications and kernel codes on Babbage.
Intel Trace Analyzer and Collector (ITAC) is a tool to help understand the MPI application behavior, quickly find bottlenecks and achieve high performance on parallel applications.
Then compile with the flag "-trace" (with the VT library to trace the entrance of each MPI call), or "-tcollect" (with full tracing). At run time, add the "-trace" flag to the mpirun.mic options. A "*.stf" file will be generated, which can be opened in a GUI via the "traceanalyzer" command. ITAC can be used on both the host and the MIC cards.
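The ITAC steps above can be sketched as a dry run that only prints the commands. The application name, the mpiicc wrapper, the task count, and the host name are assumptions for illustration.

```shell
app=my_app    # hypothetical application name

# 1. Compile with tracing enabled (mpiicc assumed as the MPI compiler wrapper).
echo "mpiicc -mmic -trace -o $app.mic $app.c"
# 2. Run with -trace so that a *.stf file is written.
echo "mpirun.mic -trace -n 8 -host bc0903-mic0 ./$app.mic"
# 3. Open the resulting trace in the GUI.
echo "traceanalyzer *.stf"
```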
-bash-4.1$ amplxe-cl -collect advanced-hotspots -target-system=mic-host-launch -app-working-dir /global/homes/y/yunhe/MIC/test_codes -- mpirun.mic -n 8 -host bc0903-mic0 /global/homes/y/yunhe/MIC/test_codes/jacobi_mpiomp.mic
You can also do "-collect general-exploration" or "-collect bandwidth", or add additional flags to the above command line by using the show "command line" button in the GUI.
You can also do "-collect knc-general-exploration" or "-collect knc-bandwidth", or add additional flags to the above command line by using the show "command line" button in the GUI. See the "amplxe-cl" man page for more command line options.
from within a batch script as above. The GUI can be used to open performance data collected from a command line session, or to start a new analysis directly. See our VTune web page for more information.
The goal of the Advisor is to help you find sections of your application to which parallelism can be added to give you the best performance gains and scalability while maintaining correct results.
Intel Math Kernel Library (MKL) is a library of optimized math routines for science, engineering, and financial applications. Core math functions include BLAS, LAPACK, ScaLAPACK, sparse solvers, FFT, and vector math. The MKL path and the environment variable $MKLROOT are defined as part of the default loaded "intel" module.