Recently, a question arose about the most efficient way to store certain data in a program. In essence, we need a unique identifier for a detector component, which can be broken down into properties A, B, C, and D. Many of these identifiers will need to be initialized, stored, and sorted, and we wanted to know which would be more efficient: a C++ struct (essentially identical to a C++ class, but conventionally used to group related data together; in my experience, structs tend to have far fewer methods, and their fields are almost entirely public) or a sort of hashed integer (multiply property A by 10000, add B times 1000, and so on).
A struct can be given a manual comparison function, making it sortable; to see how efficient that sorting is compared to sorting plain old integers, I ran a quick little study.
The three candidate data structures are:
An integer, which is defined as $(A \times 10000) + (B \times 1000) + (C \times 100) + D$. The sorting function is a simple comparison, but is provided manually in order to ensure similarity between the trials.
A Specification struct, which contains fields for track number, view, plane, and wire (properties A through D above), with a comparison function that compares the fields in order, stopping at the first one that differs; henceforth known as the naive struct.
A Specification struct with the same fields as above, plus an additional integer uniqueID field, computed from the formula above once all the other fields have been set. The sorting function compares only the uniqueID. This type is referred to as the smart struct. A rough sketch of all three candidates follows.
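For concreteness, here is a minimal C++ sketch of what the three candidates might look like; the field names, value ranges, and helper names (makeID, compareNaive, and so on) are my own guesses for illustration, since the original code isn't shown.

// 1. The hashed integer: A*10000 + B*1000 + C*100 + D.
inline int makeID(int a, int b, int c, int d) {
    return a * 10000 + b * 1000 + c * 100 + d;
}
// Comparison provided manually, to keep the trials on equal footing.
inline bool compareID(int lhs, int rhs) { return lhs < rhs; }

// 2. The naive struct: the comparison walks the fields in order,
//    stopping at the first one that differs.
struct NaiveSpec {
    int track, view, plane, wire;
};
inline bool compareNaive(const NaiveSpec& lhs, const NaiveSpec& rhs) {
    if (lhs.track != rhs.track) return lhs.track < rhs.track;
    if (lhs.view  != rhs.view)  return lhs.view  < rhs.view;
    if (lhs.plane != rhs.plane) return lhs.plane < rhs.plane;
    return lhs.wire < rhs.wire;
}

// 3. The smart struct: compute uniqueID once, then compare only that.
struct SmartSpec {
    int track, view, plane, wire;
    int uniqueID;
    void finalize() { uniqueID = makeID(track, view, plane, wire); }
};
inline bool compareSmart(const SmartSpec& lhs, const SmartSpec& rhs) {
    return lhs.uniqueID < rhs.uniqueID;
}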
What I investigated was the execution time to create and sort $10^7$ objects of each of these data types, using C++'s clock_t type and the clock() function to measure processor time spent in the program (so as to avoid counting time lost to system interruptions).
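The timing harness would look something like the following sketch (again my reconstruction rather than the original code); it reuses the SmartSpec type and compareSmart function from the sketch above, and the random field ranges are arbitrary:

#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <vector>

int main() {
    const std::size_t kCount = 10000000;   // 10^7 objects
    std::vector<SmartSpec> specs(kCount);

    std::clock_t t0 = std::clock();
    for (std::size_t i = 0; i < kCount; ++i) {
        specs[i].track = std::rand() % 10;
        specs[i].view  = std::rand() % 10;
        specs[i].plane = std::rand() % 10;
        specs[i].wire  = std::rand() % 100;
        specs[i].finalize();
    }
    std::clock_t t1 = std::clock();

    std::sort(specs.begin(), specs.end(), compareSmart);
    std::clock_t t2 = std::clock();

    // clock() measures processor time, so time the OS spends elsewhere
    // doesn't count against us.
    std::printf("create: %.2f s, sort: %.2f s\n",
                double(t1 - t0) / CLOCKS_PER_SEC,
                double(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}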
I timed both creating and sorting the data structures, first with no optimization and then with gcc's -O3 optimization. Results are summarized in the figures below, with discussion following each figure.
Unoptimized times to create and sort $10^7$ objects of the three types
Here, unsurprisingly, we see that the integer is faster in all respects than the structs. With $10^7$ objects in total, the time difference works out to roughly 150 nanoseconds of extra time per object. As we would expect, the smart struct takes a little longer to create than the naive one, since we have to set up the uniqueID, but it sorts much faster. The interesting stuff starts to happen when the code is compiled with maximum (standards-compliant) optimization.
Optimized times to create and sort $10^7$ objects of the three types. The purple/pink is an integer with its default sorting mechanism (the bare integer discussed below).
For one thing, it's obvious that optimization speeds up the code, especially the sorting, by a lot: it runs around a factor of four faster than in the unoptimized version. It also brings the performance of the three data structures much closer together, and for some strange reason, the smart struct appears to actually sort faster than the integers; what a mystery!
The last consideration is what I call the bare integers; that is, the same uniqueID as held by the regular integers, but sorted with the default integer comparison rather than a custom-provided function. The running time difference between the bare and bloated integer types is on the order of 20 nanoseconds, which corresponds to around 40 assembly instructions; probably about the right number for a function call.
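The difference comes down to what std::sort is handed. A sketch of the contrast (my own illustration; the post doesn't show this code): with no comparator, the compiler uses the built-in operator< directly, while a comparison function passed by pointer can cost an extra call per comparison unless it gets inlined.

#include <algorithm>
#include <vector>

// The manually provided comparison used for the non-bare integers.
bool compareInts(int lhs, int rhs) { return lhs < rhs; }

void sortBare(std::vector<int>& ids) {
    std::sort(ids.begin(), ids.end());              // default operator<
}

void sortWithComparator(std::vector<int>& ids) {
    std::sort(ids.begin(), ids.end(), compareInts); // goes through a function pointer
}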
My conclusion from this study is that when compiled with optimization, structs and ints are roughly identical in terms of CPU time required to create and sort them. Since structs pretty dramatically increase readability of the code, I'm sticking with those instead of tiptoeing around a funky hashed integer.
Math is pretty awesome, but there are some problems that just have no closed-form solution. Similarly, there are many problems in physics that are most easily solved by numerical simulation. There are many approaches to doing so; one of the easiest (quite probably the easiest, actually) is called Monte Carlo, and it operates by injecting randomness into a system, seeing how it pans out, and repeating many times to reduce statistical uncertainty. That's how I would define it, in any case - you'll get different answers if you ask different people. The presence of randomness means that by increasing the number of samples, you can decrease your error, which contrasts sharply with deterministic methods, where you're pretty much stuck with what you have.
Some examples of the usefulness of Monte Carlo techniques include risk analysis in investment, particle physics simulations (how do you think I heard about it?), and randomizing the parameters of a simulation in order to determine sources of error. It's a useful technique in many fields.
A simple application of Monte Carlo techniques is numerical integration. Some functions have no friendly antiderivative, which means we're stuck with one method of numerical integration or another. I'm by no means saying that Monte Carlo is the right way to go here; there are plenty of deterministic methods that will also work quite well, but it's an easy example to get your head around the power of randomness in numerical simulation. And it turns out that once you get to around 20-dimensional integrals (not at all uncommon in theoretical physics), Monte Carlo gives lower uncertainties than deterministic methods.
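The usual back-of-the-envelope reason for that claim (standard textbook scaling, not something worked out in this post): the Monte Carlo error goes as $\epsilon_{MC} \sim N^{-1/2}$ regardless of dimension, while a grid-based rule of order $k$ in $d$ dimensions goes as $\epsilon_{grid} \sim N^{-k/d}$ for the same number of function evaluations $N$, so once $d$ is larger than roughly $2k$, the random approach pulls ahead.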
Here's the plan: we have a function $f(x)$ that we want to integrate from $x_0$ to $x_1$. To do so, we'll generate a lot of random points in the rectangular region from $x_0$ to $x_1$ and from $y=0$ to $y_{max}$ (suppose for the sake of simplicity that $f(x)$ is positive everywhere - the method extends to negative functions as well with a little more hassle). We'll then check each point to determine whether it's under the curve. In the end, we take the ratio of the number of points under the curve to the total number of points generated, scale by the area of the rectangular region, and the result is our estimate of the integral! Here's a demonstration of how it works:
Demonstration of Monte Carlo numerical integration on a Gaussian function with standard deviation 0.3. 1000 points (little blue dots) were generated, but only 379 were under the curve (circled in black). Thus this particular simulation yielded a value of 0.7580 for the integral, as compared with the true value, 0.7468.
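To spell out the arithmetic behind those numbers: the bounding box runs from $-1$ to $1$ in $x$ and from $0$ to $1$ in $y$, so its area is 2, and the estimate is $(379/1000) \times 2 = 0.758$.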
This plot was generated with the following simple MATLAB code:
function result = nIntegrate(func, minX, maxX, maxY, n)
% Use maxY instead of finding the maximum to avoid evaluating the
% function unnecessarily.
totalArea = (maxX - minX) * maxY;
% Generate and scale the random points
xs = (rand(n,1) * (maxX - minX)) + minX;
ys = rand(n,1) * maxY;
pts = [xs ys];
% Figure out which ones are under the curve
filtered = pts(ys < func(xs),:);
nUnderCurve = size(filtered, 1); % number of rows, i.e. points under the curve
result = (nUnderCurve / n) * totalArea;
% Make a nice plot to go with the analysis.
figure; hold on;
title(sprintf('Monte Carlo integration: %d points', n));
xlabel('x');
ylabel('y');
% All points
scatter(xs, ys,1);
% The function
realX = linspace(minX, maxX, 100);
realY = func(realX);
plot(realX, realY,'Color','red','LineWidth',2);
% The points under the curve
scatter(filtered(:,1), filtered(:,2), 15, 'k');
hold off;
end
The function was then called like this:
nIntegrate(@(x)gaussmf(x,[.3,0]),-1,1,1,1000)
(the at sign makes an anonymous function; in this case, a function that will return the value of the Gaussian centered at zero with standard deviation 0.3 at a given value of x.)
I can't quite figure out how to get MATLAB syntax highlighting in there, but you get the idea. The marvelous thing is that it takes only six lines of code to do the whole integration numerically. The rest is just plotting the results.
Furthermore, you can easily see that the statistical uncertainty decreases as you increase the number of samples:
The error in the integration decreases with an increase in the number of points.
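As noted above, the statistical error of a Monte Carlo estimate is expected to fall off as $1/\sqrt{N}$, so using $100\times$ as many points should shrink the error by roughly a factor of 10.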
Cute, huh? And that's barely scratched the surface of what Monte Carlo can do.