Generating Uniform, Normal, or Lognormal Random Numbers in SQL Server (2024)

By: Rick Dobson |Updated: 2024-05-31 |Comments | Related: > TSQL

Problem

I support a data science team that often asks for datasets with different distributionvalues in uniform, normal, or lognormal shapes. Please present and demonstratethe T-SQL code for populating datasets with random values from each distributiontype. I also seek graphical and statistical techniques for assessing how arandom sample corresponds to a distribution type.

Solution

Data science projects can require processing and simulation of all kinds of data.Three of these kinds of data are briefly referenced below.

Uniform Value Distribution:Sometimes, a data science project may require the creation of sample datasetswhere each of several items occurs an approximately equal number of times.For example, a lottery can require winning lottery numbers from uniformly distributeddata—that is, every possible lottery number has an equal chance of winning.
Normal Value Distribution:Other times, a data science project may need data that are normally distributedwhere values occur most commonly for a central value, such as a mean, and therelative frequency of values declines in a symmetric way the farther away theyare from a central value. This kind of distribution is common for theIQ scores or the heights of students at a school.
Lognormal Value Distribution:Another type of distribution sometimes used in data science projects is onewhere values are skewed to the right. A lognormal function is one kindof distribution that exhibits this tendency. In a lognormal distribution,there are relatively many values near the lower end of a distribution of values,and the relative frequency of values declines sharply for larger values.The distribution of personal incomes or the population of cities in a collectionof cities can have this type of distribution.

This tip presents and describes the T-SQL code for generating the preceding threedistribution types. You will also be exposed to quantitative as well as graphicaltechniques for verifying the distribution of values in a dataset.

Generating and Verifying Uniform Value Distributions

The main objective of this tip section is to illustrate how to generate a setof random integer values that are uniformly distributed – that is, whose frequenciesof occurrence are about the same for all integers.

TheMicrosoft documentation for SQL Server reveals that the T-SQL rand() functionreturns pseudo-random float values from 0 through 1, exclusive. Floatvalues returned by the rand() function can be transformed to integers by anexpression that depends on the minimum and maximum integer values as well asa floor() function.
This section also demonstrates how to program a Chi Square goodness of fittest to assess statistically how well the rand() function and floor functionwork together to return uniformly distributed integer values. The statisticaltest draws on values that can be returned by the CHISQ.INV.RT() function inExcel. If you decide to use another approach for generating uniformlydistributed random integers besides the one demonstrated in this tip, you shouldconsider using the statistical test to ensure that your alternative approachalso satisfies the requirement of uniformity across a range of integer values.
The section closes with a histogram chart for the results set from a T-SQLselect statement for the output from the process described in this section.You can use the histogram to visually assess how uniformly the integers aredistributed.

The following script shows a T-SQL script for generating the random integersas well as implementing the Chi Square test to assess if the random integers areuniformly distributed across a designated set of values. The script also showsthe outcome of copying a T-SQL results set into Excel for creating a histogram.

The script has four sections; each section is separated from the preceding oneby a line of comment markers.

The first section starts with a use statement to designate a default database(DataScience); you can use any other database you prefer.

The @min_integer and @max_integer local variables allow you to designatethe range of integer values in your results set of random integer values.
The @loop_ctr and @max_loop_ctr local variables are for controlling thenumber of uniform random integers generated.
The @number_of_integers local variable is for storing the count of integervalues that are randomly generated.

The second section is to create a table for storing the random integer values(#rand_integers). The rows of the table are populated with the rand() functioninside a while loop. The loop populates the #rand_integers table with @max_loop_ctrrandom integer values. At the conclusion of this section, the uniform randomnumbers are generated. The next two sections are for helping you to verifyby either a statistical test or a histogram chart if the generated values are uniformlydistributed.

MSSQLTips.com previously publisheda tip for introducing the rand() function to T-SQL developers. One ofthe readers of that tip left a comment with his preferred approach that he believesis sufficiently faster to use in place of the approach demonstrated in this tipand the earlier one. I do not believe that while loops are an inherently evilapproach – especially for dashing off a solution that is quick to code andeasy for many beginners T-SQL developers to understand. If you adapt the commenter'sapproach you should consider running some assessment of the uniformity of the valuesgenerated as well as its performance in comparison to the solution illustrated below.

use DataSciencego -- This code sample returns a uniform random distribution of -- digits from @min_integer (1) through @max_integer (10) -- declare min and max random integer values-- and compute number of integers in a set-- echo assigned/computed local variable values-- set up to generate 100 uniform random digits for a distribution declare @min_integer tinyint = 1,@max_integer tinyint = 10,@number_of_integers tinyint ,@loop_ctr int = 0,@max_loop_ctr int = 100 set @number_of_integers = (@max_integer - @min_integer) + 1 select @max_loop_ctr [@max_loop_ctr],@max_integer [@max_integer],@min_integer [@min_integer],@number_of_integers [@number_of_integers] -------------------------------------------------------------------------- -- create a fresh copy of #rand_integersdrop table if exists #rand_integers create table #rand_integers(rand_integer tinyint) -- loop @max_loop_ctr times while @loop_ctr < @max_loop_ctrbegin -- generate a random digit from @min_integer through @max_integer-- and insert it into #rand_integers insert #rand_integers(rand_integer)select floor(rand()*(@max_integer - @min_integer + 1) + @min_integer) set @loop_ctr = @loop_ctr + 1 end -------------------------------------------------------------------------- -- assess the goodness of fit-- to a uniform distribution -- create a fresh copy of #ChiSquareCriticalValues-- the critical values are for a distribution testdrop table if exists #ChiSquareCriticalValuescreate table #ChiSquareCriticalValues( [df/prob.] tinyint null,[.05] float null,[.01] float null,[.001] float null) -- bulk insert forChiSquareCriticalValues.cs file -- to #ChiSquareCriticalValuesbulk insert #ChiSquareCriticalValuesfrom 'C:\My Text Files for SQL Server\forChiSquareCriticalValues.csv'with ( format = 'CSV',firstrow = 2,fieldterminator = ',')--,rowterminator = '0x0a') select * from #ChiSquareCriticalValues -------------------------------------------------------------------------- -- compute, save, and display chi_square_component for each integerdrop table if exists #chi_square_components select for_chi_square.rand_integer,for_chi_square.frequency,for_chi_square.[expected frequency],power((for_chi_square.frequency - for_chi_square.[expected frequency]),2) /for_chi_square.[expected frequency] [chi_square component] into #chi_square_componentsfrom(-- compute frequency and expected frequency values for each integerselect rand_integer,count(*) [frequency],cast(@max_loop_ctr as real)/cast(@number_of_integers as real) [expected frequency]from #rand_integersgroup by rand_integer) for_chi_squareorder by for_chi_square.rand_integer select * from #chi_square_components -- Compare Chi Square test statistic to -- Critical Chi Square value at .05 probability levelselect sum([chi_square component]) [Chi Square test statistic],@number_of_integers-1 [degrees of freedom],(select [.05] from #ChiSquareCriticalValues where [df/prob.] = (@number_of_integers-1)) [Critical Chi Square for at least the .05 probability level], (case when sum([chi_square component]) < (select [.05] from #ChiSquareCriticalValues where [df/prob.] = (@number_of_integers-1)) then 'it is a uniform distribution' else 'it is not a uniform distribution' end) [test result]from #chi_square_components

The last two sections in the preceding code segment illustrate how to run a ChiSquare goodness-of-fit test. This test depends on the CHISQ.INV.RT functionfrom within Excel. The following screenshot excerpt shows critical Chi Squaregoodness-of-fit values for degrees of freedom from 1 through 30 for significancelevels of .05, .01, and .001. For those who want more in-depth coverage fromMSSQLTips.com of the Chi Square goodness-of-fit test, you are referred tothis prior tip.

You can copy the values from the spreadsheet content to a CSV file so the criticalChi Square values appear as below. This CSV file is saved at C:\My Text Filesfor SQL Server\forChiSquareCriticalValues.csv for use by the preceding script.

Generating and Displaying Normal Value Distributions

The normal distribution is among the most popular distributions for data scienceprojects. Many kinds of data are normally distributed, and it is thereforeuseful to be able to create normally distributed datasets for analysis and simulationobjectives. Dallas Snider authored a tip onpopulating with T-SQL datasets of normally distributed random values based onthe Box-Muller transform. According toWikipedia,this transform converts pairs of uniformly distributed values to normally distributedvalues. Snider's implementation of the transform shows how to use SQLServer's rand() function output to create normally distributed value sets.

The normal distribution values in this tip section are different from the uniformdistribution values in the preceding section in several key respects.

Normally distributed values have a bell-shaped frequency count by normaldeviate value. Uniformly distributed values have a rectangular-shapedfrequency count by uniform deviate value.
The frequency counts for normally distributed values near the mean are greaterthan those farther away from the mean. The frequency counts for uniformlydistributed values are generally the same, no matter how close or far away theyare from the mean.

This tip section briefly describes a slightly updated version of the originalT-SQL code included in the Snider tip on generating normally distributed values.This tip also offers fresh comments about the code to help you derive your own normallydistributed datasets. The section concludes with a couple of histogram chartsshown in Excel for two different parameter sets to control the shape of a normallydistributed value set.

The script below has four sections for deriving normally distributed values.

The first section starts by designating a default database (Database), butyou can name any other database you prefer. Next, a fresh version of the##tblNormalDistribution table is created. The first section concludesby declaring and populating a set of local variables (from @pi through @precision).The initial assignments for the local variables are:
- @iteration is set equal to 0.
- @pi is set equal to pi().
- @2pi is set equal to 2.0 times @pi.
- @mean and @stdDev are set to 0 and 1, respectively, for the set of normaldeviates that you are generating because the code is for normal deviatesfrom a standardized normal distribution. If your target distributionis not a standardized normal distribution, then you can use other valuesbesides 0 and 1 for initializing @mean and @stdDev.
- @precision is the number of spaces to the right of the decimal pointfor computed normal deviate values.
- @numberOfIterations is the number of iterations that generate normallydistributed values. The initial setting of 500 returns 1000 normaldeviate values – two normal deviates for each iteration.
The second section includes a while loop for transforming 1000 uniform deviatesinto 1000 normal deviates. Two uniform deviates are transformed to normaldeviates and inserted into the ##tblNormalDistribution table on each pass throughthe while loop.
The third section computes and displays selected statistics on the contentsof the ##tblNormalDistribution table after the while loop in the second sectionconcludes.
The fourth section bins the values for display as a histogram in Excel.Each successive run of the following script can generate a different set ofrandom normal deviates. Furthermore, modifying the initial values forlocal variables can also change the generated normal deviates in predictableways.

-- adapted from SQL Server T-SQL Code to Generate A Normal Distribution-- in MSSQLTips.com Use DataSciencego -- creating a temporary table for normal deviatesdrop table if exists ##tblNormalDistribution create table ##tblNormalDistribution (x float)go -- declare and set variablesdeclare @pi float, @2pi float, @randNum1 float, @randNum2 floatdeclare @value1 float, @value2 floatdeclare @iteration int, @numberOfIterations intdeclare @mean floatdeclare @stdDev float --standard deviationdeclare @precision int --number of places to the right of the decimal point select @iteration = 0select @pi = pi()select @2pi = 2.0 * @piselect @mean = 0 -- specifies the mean for a normal distributionselect @stdDev = 1 -- specifies the standard deviation for a normal distributionselect @precision = 1select @numberOfIterations = 500 --------------------------------------------------------------------------------- -- loop for number of iterations-- each loop generates two random normal deviates-- in x column of ##tblNormalDistributionwhile (@iteration < @numberOfIterations)begin select @randNum1 = rand() select @randNum2 = rand() select @value1 = round((sqrt(-2.0*log(@randNum1))*cos(@2pi*@randNum2))*@stdDev, @precision)+@mean select @value2 = round((sqrt(-2.0*log(@randNum1))*sin(@2pi*@randNum2))*@stdDev, @precision)+@mean insert into ##tblNormalDistribution (x) values (@value1) insert into ##tblNormalDistribution (x) values (@value2) select @iteration = @iteration + 1end --------------------------------------------------------------------------------- -- generate statistics for random normal deviatesselect count(*) as [Count], min(x) as [Min], max(x) as [Max], avg(x) as [Average], stdev(x) as [Standard Deviation] from ##tblNormalDistribution --------------------------------------------------------------------------------- -- binned normal deviates for a histogramselect round(x,0) as testValue,count(*) as testValueCountfrom ##tblNormalDistributiongroup by round(x,0)order by testValue

The following screenshot shows two histograms of normal deviation values by frequencycounts.

The top histogram is for a standard normal distribution with a mean of 0and a standard deviation of 1. The values along the horizontal axis areexpressed in standard deviation units.
The second histogram is for an ordinary normal distribution with a meanof 0 and a standard deviation of 1.5. This histogram also shows horizontalcoordinates in standard deviation units.
Notice that the bottom histogram has more testValues along the x-axis thanthe top histogram. Also, the top histogram is taller than the bottom histogram,with a noticeably larger frequency at its central bin than the bottom histogram.Both outcomes are the result of initializing @stdDev to 1.5 in the bottom histogramversus 1 in the top histogram.

Generating and Displaying Lognormal Value Distributions

This section adapts techniques presented in previous sections for the uniformand normal distributions for generating and displaying a value set with a lognormaldistribution. According toBrilliant.organdWikipedia, the probability density for a lognormal variate is defined by thefollowing equation:

X = eµ+σZWhereX is the lognormal probability distribution valueµ is the mean of the logarithm of Xσ is the standard deviation of the logarithm of XZ is the standard normal variable with a mean of 0 and a standard deviation of 1

If you take the log of both sides of the preceding equation, it may help to clarifywhy the preceding characterizes the values in a lognormal distribution. Theform of the equation after taking the log of both sides changes to:

Log X = µ + σZ

Because Z is a standard normal variate value, log X is also normally distributed.The distribution of log X values has a mean of µ, and its standard deviationis σ. Thus, µ designates the horizontal axis value, and σcontrols the shape of the relative frequency or probability density for log X values.

At the end of this section, you will see that a lognormal distribution for Xvalues is especially well suited for describing the distribution of values witha right skew. This is because the X values are the antilog of the lognormalvalues. The log X values are normally distributed and thus symmetric, butthe X values are stretched to the right by positive values of σ. Thegreater the value of σ, the greater the skew.

Because lognormal values are derived from the log of normally distributed values,the lognormal values must be positive real number values. However, it is commonfor normally distributed values to display the relative frequency of both positiveand negative underlying values. Examine, for example, the two normal distributionscharted at the end of the preceding section. Therefore, when computing lognormalvalues from an underlying normal distribution with negative and/or zero values,it is desirable to transform the distribution of values to the right so that thetransformed values start at a value greater than 0.

The following code is designed to transform the normal distributions from thepreceding equation with negative, zero, or positive values to only positive valuesfor use in this section. The transformation returns only positive values asinput to the log function for the underlying X values.

The code has two sections separated by a row of comment markers.

The first section computes values for @min and @relo local variables.
- @min is the minimum value in the underlying distribution of normal values.As noted above, it is possible for @min to be a negative or a zero valueas well as a positive value.
- @relo is the number of units to move normal values to the right on thehorizontal axis.
- Defining @relo_value as abs(@min) + 1 ensures that adding @relo_valueto all the underlying normal horizontal distribution values results in acollection of normal horizontal values that are all positive with the sameshape parameter (σ) as the original underlying values.
The second section performs three roles
- It commences by adding @relo_value to all x values in the original normaldistribution.
- Next, it calculates e raised to the (x + @relo_value) power.
- The results from these two steps are saved in a fresh copy of the #forlognormals table.
- Next, the transformed x values are rounded to zero places after thedecimal before grouping and counting the transformed x values, which arejust lognormal values.
After the second section is completed, the final results set is copied toan Excel spreadsheet for charting and displaying the lognormal values.

use DataSciencego --find min(x) in ##tblNormalDistributiondeclare @min float = (select min(x) from ##tblNormalDistribution); -- declare and assign relocation value (@relo_value) -- for x in ##tblNormalDistribution-- transformed x values are moved @relo_value units-- to the right on the x axis declare @relo_value float set @relo_value = abs(@min) + 1 -- display @min and @relo values based on ##tblNormalDistributionselect @min [@min], @relo_value [@relo_value] ---------------------------------------------------------------------- -- relocate x values in ##tblNormalDistribution-- by @relo units to the right so that-- all the transformed x values are positive-- this allows all transformed x values to have log values (log (x))-- the antilog value of log (x), which is derived,-- with the exp function, is the lognormal valueselect x,@min [@min],@relo_value [@relo_value] ,x + @relo_value [transformed x value],exp(x + @relo_value) [lognormal value for transformed x value] -- antilog of log(x)from ##tblNormalDistribution order by x -- save [transformed x value] and [lognormal value for transformed x value]-- in #forlognormalsdrop table if exists #forlognormals select x + @relo_value [transformed x value],exp(x + @relo_value) [lognormal value for transformed x value] -- antilog of log(x)into #forlognormalsfrom ##tblNormalDistribution order by x -- display [rounded lognormal value] along with [rounded_lognormal_value_count]-- round to zero places after the decimalselect round([lognormal value for transformed x value],0) [rounded lognormal value] ,count(*) [rounded_lognormal_value_count]from #forlognormals group by round([lognormal value for transformed x value],0)order by round([lognormal value for transformed x value],0)

Here is the relative frequency chart of the lognormal values. The chartappears as a scattergram of the counts for the lognormal values on the verticalaxis with lognormal values on the horizontal axis. A curved line connectsthe points in the scattergram. The scattergram with a curved line connectingthe points makes it easy to visualize the skew to the right for the lognormal values.Also, notice that the rounded_lognormal_value counts in the following chart datacorrespond to the test value counts in the first chart for the normally distributedvalues in the “Generating and Displaying Normal Value Distributions”section.

Next Steps

This tip introduces the basics of computing distribution function values forthree different distribution types –function values that are uniformly, normally,and lognormally distributed. These three distribution types have differentfunctional forms for displaying frequency counts relative to their correspondingdistribution values. The computation of frequency counts for each distributioncount is based on T-SQL code in SQL Server scripts.

You can assess the correspondence between a set of values in a dataset and aknown distribution shape by charting in Excel the frequency counts for functionalvalues. You can adapt the three examples in this tip to your own datasetsto verify how well the values in your datasets match any of the three distributionsreviewed in this tip.This tip's download includes an Excel workbook file with a separate workbook tab for each distributiontype.

Instead of a visual evaluation for the shape of the values in a dataset, youcan use a Chi Square goodness of fit test for actual frequency counts versus expectedfrequency counts. This approach can yield a precise statistical test resultwith a probability of getting the test value. This tip implemented the ChiSquare goodness of fit test for uniformly distributed values. The same generalapproach can be adapted to either of the other two distributions examined in thistip.

About the author

Rick Dobson is an author and an individual trader. He is also a SQL Server professional with decades of T-SQL experience that includes authoring books, running a national seminar practice, working for businesses on finance and healthcare development projects, and serving as a regular contributor to MSSQLTips.com. He has been growing his Python skills for more than the past half decade -- especially for data visualization and ETL tasks with JSON and CSV files. His most recent professional passions include financial time series data and analyses, AI models, and statistics. He believes the proper application of these skills can help traders and investors to make more profitable decisions.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips

Article Last Updated: 2024-05-31