Modern x86 machines use two levels of caching, L1 and L2. L1 is a split
cache consisting of an instruction cache (I1) and a data cache (D1).
L2 is a unified cache.
The configuration of a cache means its size, associativity and number
of lines. If the data requested by the processor is found in the upper
level, the request is called a hit. If it is not found there, the request
is called a miss, and the lower level in the hierarchy is then accessed
to retrieve the block containing the requested data. In modern machines,
L1 is searched first for the data or instruction requested by the
processor. On a hit, the data or instruction is copied to a register in
the processor. Otherwise L2 is searched. On an L2 hit, the data or
instruction is copied to L1 and from there to a register. If the request
to L2 also misses, main memory has to be accessed.
Valgrind can simulate the cache, that is, it can report what happens in
the cache while a program is running. To use this, first compile your
program with the -g option as usual. Then run it with the shell script
cachegrind instead of valgrind.
Sample output:
==7436== I1 refs: 12,841
==7436== I1 misses: 238
==7436== L2i misses: 237
==7436== I1 miss rate: 1.85%
==7436== L2i miss rate: 1.84%
==7436==
==7436== D refs: 5,914 (4,626 rd + 1,288 wr)
==7436== D1 misses: 357 ( 324 rd + 33 wr)
==7436== L2d misses: 352 ( 319 rd + 33 wr)
==7436== D1 miss rate: 6.0% ( 7.0% + 2.5% )
==7436== L2d miss rate: 5.9% ( 6.8% + 2.5% )
==7436==
==7436== L2 refs: 595 ( 562 rd + 33 wr)
==7436== L2 misses: 589 ( 556 rd + 33 wr)
==7436== L2 miss rate: 3.1% ( 3.1% + 2.5% )
L2i misses is the number of instruction fetches that miss in the L2
cache.
L2d misses is the number of data accesses that miss in the L2 cache.
Total number of data references = number of reads + number of writes.
Miss rate is the fraction of references that miss, that is, the number
of misses divided by the number of references.
The shell script cachegrind also produces a file, cachegrind.out, that
contains line-by-line cache profiling information in a form that is not
human-readable. The program vg_annotate interprets this information.
Run without any arguments, vg_annotate reads the file cachegrind.out and
produces a human-readable summary.
When C, C++ or assembly source files are passed to vg_annotate as
arguments, it also displays the number of cache reads, writes, misses
and so on for each source line.
I1 cache: 16384 B, 32 B, 4-way associative
D1 cache: 16384 B, 32 B, 4-way associative
L2 cache: 262144 B, 32 B, 8-way associative
Command: ./a.out
Events recorded: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Events shown: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Event sort order: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Thresholds: 99 0 0 0 0 0 0 0 0
Include dirs:
User annotated: valg_flo.c
Auto-annotation: off
User-annotated source: valg_flo.c:
Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
 .    .    .  .    .    .  .    .    .  #include<stdlib.h>
 .    .    .  .    .    .  .    .    .  int main()
 3    1    1  .    .    .  1    0    0  {
 .    .    .  .    .    .  .    .    .      float *p, *a;
 6    1    1  .    .    .  3    0    0      p = malloc(10*sizeof(float));
 6    0    0  .    .    .  3    0    0      a = malloc(10*sizeof(float));
 6    1    1  3    1    1  1    1    1      a[3] = p[3];
 4    0    0  1    0    0  1    0    0      free(a);
 4    0    0  1    0    0  1    0    0      free(p);
 2    0    0  2    0    0  .    .    .  }