Tinkering‎ > ‎

Monte Carlo Simulations and solving multi-variable differential equations

posted Nov 13, 2014, 8:56 AM by Ted Bibby   [ updated Nov 13, 2014, 1:19 PM ]
Here's a generalized run down of how I have solved multiple variable equations with multiple acceptable solutions, and then further modeled the probability of the solution based on measurement error. Dr. Daniel Morgan (Vanderbilt) has been instrumental in applying this method to modeling surface processes from concentrations of cosmogenic isotopes in rocks. I use Matlab to run my models but other languages work too.

Papers in my field that use Chi-Squared minimization:
1.                   Balco, G. & Stone, J. O. Measuring middle Pleistocene erosion rates with cosmic-ray-produced nuclides in buried alluvial sediment, Fisher Valley, southeastern Utah. Earth Surf. Process. Landforms 30, 1051–1067 (2005).         
2.                   Morgan, D. J., Putkonen, J. K., Balco, G. & Stone, J. O. Degradation of glacial deposits quantified with cosmogenic nuclides, Quartermain Mountains, Antarctica. Earth Surf. Process. Landforms 36, 217–228 (2010).
3.                   Balco, G. Contributions and unrealized potential contributions of cosmogenic-nuclide exposure dating to glacier chronology, 1990–2010. Quat. Sci. Rev. 30, 3–27 (2011).
4.                   Balco, G., Stone, J. O., Lifton, N. a. & Dunai, T. J. A complete and easily accessible means of calculating surface exposure ages or erosion rates from 10Be and 26Al measurements. Quat. Geochronol. 3, 174–195 (2008).
5.                   Auer, M. et al. Cosmogenic 26Al in the atmosphere and the prospect of a 26Al/10Be chronometer to date old ice. Earth Planet. Sci. Lett. 287, 453–462 (2009).
6.                   Vermeesch, P. et al. Interlaboratory comparison of cosmogenic 21Ne in quartz. Quat. Geochronol. (2012).

Background: When we measure things in the lab, like concentration of arsenic in water, there is an associated error with that measurement. One normally sees measurements reported as: 25 +/- 0.23 g/ml.
Now, if we have an equation that we solve for some other variable(s) that use the measured arsenic concentration, then there is an associated error with the solution. This is general propagation of measurement error and there are many (better than I) resources that explains how it works.

In my work, I measure concentrations of isotopes (think chemicals) in depth profiles from the ground surface to about a meter below ground. In the picture below you'll see our sampling strategy. Every 10 cm or so we collect some dirt and measure the concentration of isotopes (represented by purple dots). Now, based on some principles of the cosmogenic isotope method, the concentration at depth should follow an expected pattern (pink line) which in my case is mathematically described by a multi-variable equation with 3 unknowns (simplest version). For example sake, the concentration at depth  is controlled by 1) time, 2) previous isotope concentrations (think contamination), and 3) the local erosion rate.

Chi-Squared minimization:

So, how do you solve for these three unknowns? You change the 3 variables over and over and over till the resulting line matches the measured concentrations. We can then calculate the fit of our line to our data (see linear regression). I use chi-squared.

In Matlab, you'll need a Function to solve your equation(Csolution=x+y+z). We have our inputs (3 unknowns) x, y, z, (and 1 known) Cmeasured. So you call on the function, input your variables and calculate Csolution. You can change the variables over and over and over and generate many solutions for c, but we want to match our solution to actual measured concentrations.

for    fit=Function(x,y,z)

            Csolution=x+y+z
end

This gives us an answer for Csolution.

Now we modify our function to find the difference between our solution and our measurements

for    fit=Function(x,y,z,Cmeasured)

            Csolution=x+y+z 
            Diff=sum(Cmeasured-Csolution)

end

So every time you change your input variables you'll get two answers. Csolution (calculated from your unknowns), and Diff.
To generate a line that fits our data as closely as possible we want to minimize Diff, so that it's as close to zero as possible.
We could manually do this over and over and over, and make a spreadsheet to record our results, but Matlab will do it for us. This is the basic foundation of chi-squared minimization. The above equation to calculate Diff , is used as a simplified example, the equation for chi-squared is slightly more complicated.

I use a function written by John D'Errico called fminsearchcon(), it's modified from fminsearch(), but allows you to put limits on acceptable answers. Such as if you know x is greater than 10 but less than 50 and you only want solutions for x between 10 and 50.

So this function's job is to modify the input variables x, y, z, over and over and over till it generates the smallest chi-squared value. And once it does that, it spits out what x, y, z, are. 
Ok great! now I have three answers to three unknowns, but how can I prove that they are statistically valid given the my measurement errors? If this doesn't make sense think about this: If my Cmeasured is 25 +/- 0.3, then my solution of x, y, z, only takes into account Cmeasured as 25. But, what if Cmeasured is 25.03? What would be my resulting solution of x, y, z, then? What if Cmeasured is 24.07? This becomes increasingly complicated when you're generating a single solution to fit 2 or more points. 


Here, two plots show the variable solutions. The initial solution (1st figure) over shoots fitting a line to the circle data points (Diff=20), the second solution (2nd figure) adjust the unknowns, thereby lowering the expected concentrations and giving a better fit (Diff = 2). Ignore the square sample points, they were not used in the line fit statistics.

Think about this: If you have four data points of measurements (Cmeasured), then you will have one solution for each data point, thus four (Csolution)'s, as a function of three unknowns.

ex) You've measure 4 data points, so Cmeasured = [ 2+/-0.03, 2.3+/-0.03, 2.8+/-0.03, 3+/-0.03] (I have the error sub scripted to remind us that there is measurement error)
        and you use your function to solve for Csolution, then you've generated a solution for each data point Csolution= [ 2, 2, 2, 2] (or whatever the corresponding solution is to depth)
        and our function calculates the differences between Cmeasured and Csolution, so Diff = [0, 0.3, 0.8, 1]
        and the sum of Diff = 2.1, for sake of argument this is our minimized chi-squared value
        now, what if your Cmeasured was actually the max error, or min error, or any fraction anywhere in between? What would be Csolution then? And what would the resulting Diff be?
You get where i'm going? All of a sudden you have generated a ton of solutions for  x, y, z, and they are all, for better or for worse, acceptable.
Now you need to prove that there is a single "best"  x, y, zsolution, and define what the +/-  x, y, z, boundaries of the acceptable solution are.

On to Monte Carlo Simulations:

Ok, so we're able to solve for 3 unknowns, but we want to model what all other solutions would be if our Cmeasured was any other number within the rage of the reported errors. To do this, we need to write some code that randomly generates new Cmeasured numbers that fall within the measured error for each point. The idea is that for any number of times say 10, we generate a new random set of Cmeasured values. Time 1, generate random numbers and solve for minimized Diff. Time 2, again generate random numbers and solve, and again, and again, and again. The more times you run this loop the better and all along the way record the solutions.

The code would look something like this:

for i=1:100         %time steps (i.e. do 100 times)

Cmeasured(i)=random_number_between_(Cmeasured+error)_and_(Cmeasured-error)        %generate random data points with error bounds for each point

    for    fit=Function(x,y,z,Cmeasured)        %solve/minimize x, y, z, for randomly generated Cmeasured

            Csolution=x+y+z 
            Diff=sum(Cmeasured(i)-Csolution)
            
            Recordx(i)=x                                 % record the x value for each iteration in an array
            Recordy(i)=y                                 % record the y value for each iteration in an array   
            Recordz(i)=z                                 % record the z value for each iteration in an array  
            RecordDiff(i)=x                              % record the Diff value for each iteration in an array

    end
end

Once the code has cycled through itself 100 times (or more depending on robustness) you will have results of all your possible solutions given any random data point with the error.
The next step is to visualize this. Matlab has a convenient "cumulative distribution function" plot that organizes all the potential solutions and ranks them.
The top figure here is a bar plot of my solution distributions divided into bins. In the above plot "time" is one of my solved variables. Red line marks mean solution, green lines mark 68% confidence intervals. The bottom plot is what cdfplot() returns (blue line) within Matlab, showing the cumulative distributions of my solved variables. For the original example at hand, you can generate 3 cdfplot()'s, one for each solution (x, y, z). These will be your reported solutions +/- the confidence interval of your choosing (generally 1, 2, or 3 sigmas).

There are a number of other nuances to write about, considerations, yadda, yadda, but this should be a good jumping off point. Have fun!


Comments