Tuesday, September 25, 2007

Update: Using Valgrind/massif with fpc

In a previous post I claimed that the option to hide memory allocation wrappers in massif was broken or not working with fpc.

The problem is that I was using only the Pascal name of the function (CMEM_CGETMEM) instead of the mangled internal name (CMEM_CGETMEM$LONGINT$$POINTER).
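
For reference, fpc mangles a routine name roughly as UNITNAME_ROUTINENAME$PARAMTYPES$$RESULTTYPE. A sketch of how the cmem allocator maps to the name above (the exact declaration may differ between fpc versions, so treat it as an illustration only):

// Declaration in the cmem unit (roughly):
function CGetMem(Size: PtrInt): Pointer;
// On a 32-bit target PtrInt is LongInt, so the emitted symbol is
// CMEM_CGETMEM$LONGINT$$POINTER, which is the form --alloc-fn has to match.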

Thanks to Michalis Kamburelis, who gave the hint and also provided this script:

#!/bin/sh
set -eu

valgrind --tool=massif \
--alloc-fn='CMEM_CGETMEM$LONGINT$$POINTER' \
--alloc-fn='CMEM_CREALLOCMEM$POINTER$LONGINT$$POINTER' \
--alloc-fn='SYSTEM_GETMEM$LONGINT$$POINTER' \
--alloc-fn='SYSTEM_GETMEM$POINTER$LONGINT' \
--alloc-fn='SYSTEM_REALLOCMEM$POINTER$LONGINT$$POINTER' \
--format=html \
"$@"

I updated the previous charts so they show where in Pascal code the memory is allocated.

PS: it's annoying that the function names are truncated in the charts.

Sunday, September 23, 2007

Zlibar memory behavior compressing many small files

The previous analysis of zlibar memory behavior was done taking as example the compression of one big file. Let's now try with many small files (all *.pas files under the lazarus/lcl dir).

Original (load the entire file in memory):
The heaptrc dump:
18573 memory blocks allocated : 496119560/496173048
18573 memory blocks freed : 496119560/496173048
0 unfreed memory blocks : 0
True heap size : 4423680
True free heap : 4423680


After memory optimization (load file in small buffer):
The heaptrc dump:
17446 memory blocks allocated : 439659064/439708048
17446 memory blocks freed : 439659064/439708048
0 unfreed memory blocks : 0
True heap size : 2326528
True free heap : 2326528



We can draw some conclusions:
  • The memory usage looks almost the same in both runs (the graph scale does not help much here)
    UPDATE: the heaptrc dump shows that the original code really takes more memory.
  • In the optimized build the memory is allocated in a continuous fashion, always growing. In the original build the memory is allocated and freed all the time while still growing overall. This can lead to more memory fragmentation.
  • In the optimized build the final peak is absent. This is not really expected, since the section of code responsible for the peak was not changed. Some possibilities: 1) the peak exists but valgrind does not detect it 2) a bug in the optimized code 3) an unexpected (and welcome) side effect

Reduce zlibar memory usage - step 1

Previously we learned that zlibar uses 1.5MB of memory to compress a 1.1MB file. It compresses each file in three passes: 1) load the whole file into memory 2) feed the deflate function with this data, using a small buffer as a bridge 3) calculate the md5 signature, traversing the file data again.

The idea is to compress using only one pass: load the file data incrementally into a memory buffer and then feed the deflate and md5 functions with it. This has two advantages: the memory usage (in this step) is constrained by the size of the buffer, and you save the memory copy from the stream (that holds the file data) to the deflate buffer and to the md5 buffer.

The modified InternalCompressStream function would be something like:

MD5Init(Context);
// Read the first chunk of the input file into the fixed-size buffer
// and feed it to the md5 calculation right away.
z.next_in := @input_buffer;
z.avail_in := FileRead(InStream, input_buffer, MAX_IN_BUF_SIZE);
MD5Update(Context, input_buffer, z.avail_in);
while z.avail_in > 0 do
begin
  // Deflate the current chunk, draining the output buffer as many
  // times as necessary until all the input has been consumed.
  repeat
    z.next_out := @output_buffer;
    z.avail_out := MAX_OUT_BUF_SIZE;
    err := deflate(z, Z_NO_FLUSH);
    OutStream.Write(output_buffer, MAX_OUT_BUF_SIZE - z.avail_out);
  until z.avail_out > 0;
  // Fetch the next chunk and update the md5 signature with it.
  z.next_in := @input_buffer;
  z.avail_in := FileRead(InStream, input_buffer, MAX_IN_BUF_SIZE);
  MD5Update(Context, input_buffer, z.avail_in);
end;
MD5Final(Context, Result.Md5Sum);
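
One detail the sketch above leaves out is the final flush: after the input is exhausted, deflate still holds buffered data that has to be drained with Z_FINISH before calling deflateEnd. A hedged sketch of that last step (not the actual zlibar code, just the usual paszlib pattern):

// Drain whatever deflate still holds after the last input block.
repeat
  z.next_out := @output_buffer;
  z.avail_out := MAX_OUT_BUF_SIZE;
  err := deflate(z, Z_FINISH);
  OutStream.Write(output_buffer, MAX_OUT_BUF_SIZE - z.avail_out);
until err = Z_STREAM_END;
deflateEnd(z);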

Let's run valgrind/massif to see what we got:

Comparing with the previous graph we notice a great reduction in memory usage: from 1.5MB to 0.5MB.

The heaptrc dump:
55 memory blocks allocated : 2811961/2812056
55 memory blocks freed : 2811961/2812056
0 unfreed memory blocks : 0
True heap size : 950272
True free heap : 950272

Let's do a deeper analysis:
  • The memory used by the deflate functions is close to the expected 256KB.
  • The pink area is the memory used by the stream that holds the compressed data. It is allocated incrementally, hence the ascending slope.
  • There's a peak after the deflate memory is freed and just before the program finishes. This represents the copy from the compressed stream to the output stream, which doubles the data in memory. More on this later.
Someone may say that loading a file through a small buffer is slower than reading all the data into memory at once. This is true, but that approach is reasonable only with small files. In a general-purpose packer it would be a big limitation.

Tuesday, September 18, 2007

Using Valgrind to profile fpc applications

Before starting to optimize, it is necessary to know what and where to optimize. This is where profiling tools play a role. Valgrind, and its companion KCachegrind, are well-known unix profiling tools that are popular with the C crowd. Let's see if they are useful to fpc programmers.

Zlibar is an fpc component that encapsulates the paszlib functions in a programmer-friendly way. I use it in the Becape application and for some time I have been planning to optimize it.

The heart of the component is the TZlibWriteArchive.CreateArchive method, which compresses the files (InputFiles) into a stream (OutputStream):

TmpStream := TMemoryStream.Create;
TmpFile := TMemoryStream.Create;
[..]
for X := 0 to fInputFiles.Count-1 do begin
  [..]
  TmpFile.LoadFromFile(fInputFiles.FileName[X]);
  [..]
  FileInfo.CompressedSize := InternalCompressStream(X, TmpFile, TmpStream); //(1)
  FileInfo.Md5Sum := StreamMD5(TmpFile);
  [..]
end;
WriteHeader(AHeader);
OutStream.CopyFrom(TmpStream, TmpStream.Size); //(2)
[..]


It creates two temporary memory streams (TmpFile and TmpStream). TmpFile will hold the uncompressed data of the file being added. TmpStream will be filled with the compressed data in step (1). The process continues until all files are compressed into TmpStream.
After that, the header is written to the OutputStream and then the compressed data is copied to the OutputStream.

The problems:
  1. The file is stored entirely in memory before it is processed. For small files this is fine, but for larger files it would be a problem.
  2. Even for small files, the memory of TmpFile will be reallocated in most of the LoadFromFile calls.
  3. After step (2), you will have three streams in the heap: an uncompressed file, the compressed data of all files, and a header + the compressed data of all files.
To see this in action I created a small application that compresses just one file (VirtualTrees.pas, 1.1MB in size), and compiled it with -gv (generate code for Valgrind) and -gl. Run it with valgrind (callgrind):

valgrind --tool=callgrind ./zlibar_opt

A file following the pattern callgrind.out.[pid] was created, which I loaded in KCachegrind. In this tool it is possible to see most of the function calls the application made, how many times each function was called, who called whom, and how much time was spent in each function.
To my surprise, the memory allocation routines did not spend much time (in fact it was zero). The most expensive were, of course, the compression-related functions.





Now let's use the massif tool:

valgrind --tool=massif ./zlibar_opt

It creates two files: massif.[pid].ps and massif.[pid].txt. The txt file contains info about the call stack and how much memory each function allocated. The ps file contains a graph showing the functions that were responsible for most of the memory allocation and their evolution over time. See below:


Now you say: "what the hell is this?"
To work with valgrind, the -gv option forces the use of the cmem memory manager, which is a wrapper around malloc, so massif understands the cmem* functions as the programmer's allocation routines. I tried to use the --alloc-fn option to force the display of the Pascal functions, without success.
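
As an aside, the cmem manager that -gv pulls in can also be used explicitly. A minimal sketch of what that looks like (the program name is hypothetical):

program cmem_demo;  // hypothetical example
uses
  cmem;  // must be the first unit; replaces the default heap manager with malloc/free wrappers
var
  p: Pointer;
begin
  GetMem(p, 4096);  // now routed through the cmem wrappers (and thus visible to massif)
  FreeMem(p);
end.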

Update: I was passing only the Pascal name of the function (CMEM_GETMEM) to --alloc-fn, while the mangled parameter list is also necessary. I updated the charts to show where in Pascal code the memory is allocated.

Anyway, in the graph we can see that 1.5MB of heap memory is allocated (used?*) at its peak, and that is the point where we can optimize.

Here's the heaptrc dump:
58 memory blocks allocated : 3926105/3926208
58 memory blocks freed : 3926105/3926208
0 unfreed memory blocks : 0
True heap size : 950272
True free heap : 950272
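
For reference, these numbers come from the heaptrc unit. A minimal sketch of how to get such a dump yourself (the program name is hypothetical; compiling with -gh has the same effect as listing heaptrc manually):

program heaptrc_demo;  // hypothetical example
uses
  heaptrc;  // must be the first unit; or simply compile with -gh
var
  p: Pointer;
begin
  GetMem(p, 1024);  // one allocation...
  FreeMem(p);       // ...and its release; the dump is printed when the program exits
end.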

In the next articles, I will take a look at the possible optimizations.

Notes:
  • * The fpc heap manager pre-allocates space in the heap that sometimes is not all used, but I don't know if this still applies when using the cmem functions
  • The -gv option is necessary to run the massif tool, but is dispensable when using callgrind
  • The valgrind memcheck tool is of little use to fpc, since fpc already provides the heaptrc unit, which has a lot of advantages
  • One drawback of valgrind is that it is exclusive to unix. No Windows.
  • For more info see the valgrind manual

Back to blogging and to lower levels

This blog has been quiet lately, but it is not dead.

In the last months I played a lot with the LCL, sometimes on the VirtualTreeView port, sometimes writing original components. I will take a rest from high level/GUI programming and return to lower levels.

Starting today, I will post a sequence of articles about memory usage and general optimizations for fpc/Lazarus applications.

Later, I plan to talk about file system integration and IPC techniques.