Need some stats help - Standard Deviation

AsSiMiLaTeD
AsSiMiLaTeD Posts: 11,725
edited January 2009 in The Clubhouse
This isn't a school exercise; I'm in the real world here, needing help with something for work. It can be argued that I shouldn't be working on this given my limited understanding, but we'll just skip that part.

What I really want to know is whether you can really do standard deviation on data that's basically 1s and 0s, or yes/no. I don't see how it could be possible, because isn't the whole point of standard deviation to measure how far something is from the center, or the mean? Well, if every value is a 1 or a 0, I don't see how you can do that.

I know standard deviation is used a lot with count data, or variable data, but I'm not seeing how it can be used with yes/no data.

If anyone can explain to me I'd appreciate it.

Comments

  • thsmith
    thsmith Posts: 6,082
    edited January 2009
    Would it not come down to how many times the 0s and 1s were supposed to be 0s and 1s but were not (or were)?
    Speakers: SDA-1C (most all the goodies)
    Preamp: Joule Electra LA-150 MKII SE
    Amp: Wright WPA 50-50 EAT KT88s
    Analog: Marantz TT-15S1 MBS Glider SL| Wright WPP100C Amperex BB 6er5 and 7316 & WPM-100 SUT
    Digital: Mac mini 2.3GHz dual-core i5 8g RAM 1.5 TB HDD Music Server Amarra (memory play) - USB - W4S DAC 2
    Cables: Mits S3 IC and Spk cables| PS Audio PCs
  • AsSiMiLaTeD
    AsSiMiLaTeD Posts: 11,725
    edited January 2009
    Not really. Let's say I have the following dataset; we'll call this the error rate, where 1 represents an error and 0 represents no error.

    1
    1
    1
    0
    0
    0
    0
    0
    0
    0

    So if I average that column, I get .3, but if I use Excel to get the standard deviation, I get .48. I don't see how that number is either valid or meaningful.
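A quick check of those two numbers, sketched as T-SQL against an inline copy of that column (assuming SQL Server 2008 or later; the column name error is made up). Excel's STDEV uses the sample formula (dividing by n-1), which is likely where the .48 comes from; the population version is sqrt(.3 * .7) ≈ .458:

    -- The 10-row example: 3 errors out of 10 rows, so p = 0.3.
    SELECT
        AVG(CAST(error AS float))    AS error_rate,  -- 0.30, the mean of the column
        STDEVP(CAST(error AS float)) AS sd_pop,      -- ~0.458 = sqrt(0.3 * 0.7), population SD
        STDEV(CAST(error AS float))  AS sd_sample    -- ~0.483, sample SD (n-1), matches Excel's STDEV
    FROM (VALUES (1),(1),(1),(0),(0),(0),(0),(0),(0),(0)) AS t(error);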
  • unc2701
    unc2701 Posts: 3,587
    edited January 2009
    Yep, even binomials have SDs: sqrt(np(1-p)). Basically you can use that to set up inference. With a small sample size you calculate exact probabilities; with a large one you can just pretend it's Gaussian. (Worked numbers just below this post.)
    Gallo Ref 3.1 : Bryston 4b SST : Musical fidelity CD Pre : VPI HW-19
    Gallo Ref AV, Frankengallo Ref 3, LC60i : Bryston 9b SST : Meridian 565
    Jordan JX92s : MF X-T100 : Xray v8
    Backburner:Krell KAV-300i
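To make that concrete with the 10-row example above (n = 10, p = 0.3): sqrt(np(1-p)) is the SD of the count of errors, and dividing by n gives the SD of the error rate itself.

    sqrt(np(1-p)) = sqrt(10 * 0.3 * 0.7) ≈ 1.45 (errors)
    sqrt(p(1-p)/n) = sqrt(0.3 * 0.7 / 10) ≈ 0.145 (as a rate)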
  • unc2701
    unc2701 Posts: 3,587
    edited January 2009
    ...but what you describe is a sample, so you're probably interested in the SE. What is the exact question you want to answer with your data?
    Gallo Ref 3.1 : Bryston 4b SST : Musical fidelity CD Pre : VPI HW-19
    Gallo Ref AV, Frankengallo Ref 3, LC60i : Bryston 9b SST : Meridian 565
    Jordan JX92s : MF X-T100 : Xray v8
    Backburner:Krell KAV-300i
  • thsmith
    thsmith Posts: 6,082
    edited January 2009
    Not really. Let's say I have the following dataset; we'll call this the error rate, where 1 represents an error and 0 represents no error.

    1
    1
    1
    0
    0
    0
    0
    0
    0
    0

    So if I average that column, I get .3, but if I use Excel to get the standard deviation, I get .48. I don't see how that number is either valid or meaningful.


    You need a larger sample data set; I think it has to be at least 25, depending on what level of sigma you are trying to achieve. I could be totally off base, though. I've had the Green Belt training but no project yet.
    Speakers: SDA-1C (most all the goodies)
    Preamp: Joule Electra LA-150 MKII SE
    Amp: Wright WPA 50-50 EAT KT88s
    Analog: Marantz TT-15S1 MBS Glider SL| Wright WPP100C Amperex BB 6er5 and 7316 & WPM-100 SUT
    Digital: Mac mini 2.3GHz dual-core i5 8g RAM 1.5 TB HDD Music Server Amarra (memory play) - USB - W4S DAC 2
    Cables: Mits S3 IC and Spk cables| PS Audio PCs
  • AsSiMiLaTeD
    AsSiMiLaTeD Posts: 11,725
    edited January 2009
    Let's say I have 100,000 rows of data for the past 2 months - not sample data, but the entire population. I have a field, let's call it error, on each row of data. When I have an error, that field is stamped with a 1.

    I want the error rate, which is the average of the error field - so AVG(error*1.000) in SQL. So if I have 30,000 rows where error = 1, then my average is .30. Now I want the standard deviation on that same dataset, but I don't think I can really do that with just 1s and 0s, right?
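A sketch of that in T-SQL (assuming SQL Server, since SQL comes up here and SQL RS later in the thread; error_log and error are made-up names). Because this is the whole population, STDEVP is the natural choice:

    SELECT
        AVG(CAST(error AS float))    AS error_rate,  -- e.g. 0.30 if 30,000 of 100,000 rows have error = 1
        STDEVP(CAST(error AS float)) AS sd           -- population SD = sqrt(p * (1 - p)), ~0.458 at p = 0.30
    FROM error_log;                                  -- made-up table name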
  • Sami
    Sami Posts: 4,634
    edited January 2009
    You can. With a 30% error rate, you would have a standard deviation of 0.46. With a 10% error rate it would be 0.30.

    sqrt( (30000 * (1-0.3)^2 + 70000 * (0-0.3)^2) / 100000 ) = 0.46

    sqrt( (10000 * (1-0.1)^2 + 90000 * (0-0.1)^2) / 100000 ) = 0.30
  • AsSiMiLaTeD
    AsSiMiLaTeD Posts: 11,725
    edited January 2009
    I'm good with the calculation; I'm more curious whether it's meaningful in measuring a simple yes/no. Would that .46 be 46%, or .46%?
  • unc2701
    unc2701 Posts: 3,587
    edited January 2009
    Yep, Sami nailed it. I'll say it again, though: statistics are only as good as the question you're asking. The SD might not answer your question.
    Gallo Ref 3.1 : Bryston 4b SST : Musical fidelity CD Pre : VPI HW-19
    Gallo Ref AV, Frankengallo Ref 3, LC60i : Bryston 9b SST : Meridian 565
    Jordan JX92s : MF X-T100 : Xray v8
    Backburner:Krell KAV-300i
  • AsSiMiLaTeD
    AsSiMiLaTeD Posts: 11,725
    edited January 2009
    I think I know what I'm missing; I really haven't had much time to think this through until now. But let me explain more...

    What I'm trying to do is build a control chart, sorta. So I have data for the last two months, in the 1s and 0s format I mentioned. For that time I need to calculate average and standard deviation.

    Then I take the previous week's data and get the average for that as well. Then I compare that with the average over the last couple of months to basically paint the cell that value is in - if it's within one SD of the 2-month average it's green, within 2 SDs it's yellow, and more than 2 it's red.

    I've got the technical stuff figured out and have the standard deviation function working. The issue is that I'm seeing very high standard deviations that don't make sense. I'm wondering if I should just group the data down at the very lowest level, calculate averages at that level, and then build my SD off of those...
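For concreteness, here is roughly what the chart as described looks like in T-SQL (made-up names throughout: error_log, log_date, error; assuming SQL Server). Note that with raw 0/1 rows the baseline SD comes out near sqrt(p(1-p)) - about 0.46 at a 30% error rate - so almost any weekly rate lands within one SD and the chart stays green, which would explain standard deviations that look too high to be useful:

    WITH baseline AS (   -- 2-month baseline computed straight off the raw 0/1 rows
        SELECT AVG(CAST(error AS float))    AS mean_rate,
               STDEVP(CAST(error AS float)) AS sd_rate   -- roughly sqrt(p * (1 - p)), large relative to the rate
        FROM error_log
        WHERE log_date >= DATEADD(month, -2, GETDATE())
    ),
    last_week AS (       -- previous week's error rate
        SELECT AVG(CAST(error AS float)) AS week_rate
        FROM error_log
        WHERE log_date >= DATEADD(day, -7, GETDATE())
    )
    SELECT w.week_rate, b.mean_rate, b.sd_rate,
           CASE WHEN ABS(w.week_rate - b.mean_rate) <= 1 * b.sd_rate THEN 'green'
                WHEN ABS(w.week_rate - b.mean_rate) <= 2 * b.sd_rate THEN 'yellow'
                ELSE 'red'
           END AS cell_color
    FROM last_week w CROSS JOIN baseline b;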
  • vlam
    vlam Posts: 282
    edited January 2009
    What stats tool are you using? Do you have access to SAS, SPSS or STATA?
    Main Gear
    Panasonic 50" Plasma, Polk LSi15 (Front), LSiC, LSi7 (Rear), Sherwood Newcastle AVP-9080, AM-9080 bi-amp to LSi15, AM-9080 bi-amp to LSiC and LSi7.
  • AsSiMiLaTeD
    AsSiMiLaTeD Posts: 11,725
    edited January 2009
    I'm not using any stats package, just SQL RS.
  • Sami
    Sami Posts: 4,634
    edited January 2009
    I'm good with the calculation; I'm more curious whether it's meaningful in measuring a simple yes/no. Would that .46 be 46%, or .46%?

    It's not a percentage; it's just the expected delta between the average and the actual value. Kind of a limit on what's within an acceptable range (depending on how you use it). I'm not a stats expert, so I hope that's not way off. With 0s and 1s, yes, it doesn't really make a lot of sense, since each individual value is always a 0 or a 1 and never anywhere near the average, no matter what the values are.

    Do you have enough information to get the error rate (ER) % and then get the SD from the monthly/weekly/daily/hourly ER? Just an idea. The team you work with/for would most likely have plenty of input for you.
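A sketch of that grouping idea, continuing with the same made-up T-SQL names: one error rate per day over the 2-month window, then the mean and SD taken across those daily rates, which typically gives a much tighter band than the SD of the raw 0/1 rows:

    WITH daily AS (
        SELECT CAST(log_date AS date)    AS d,          -- assumes SQL Server 2008+ for the date type
               AVG(CAST(error AS float)) AS daily_rate  -- one error rate per day
        FROM error_log
        WHERE log_date >= DATEADD(month, -2, GETDATE())
        GROUP BY CAST(log_date AS date)
    )
    SELECT AVG(daily_rate)    AS mean_daily_rate,   -- center line for the control chart
           STDEVP(daily_rate) AS sd_daily_rate      -- day-to-day spread of the error rate
    FROM daily;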
  • AsSiMiLaTeD
    AsSiMiLaTeD Posts: 11,725
    edited January 2009
    Do you have enough information to get the error rate (ER) % and then get the SD from the monthly/weekly/daily/hourly ER? Just an idea.
    That's what I was getting at in post 11 above; I think that's the direction I really need to go.
  • unc2701
    unc2701 Posts: 3,587
    edited January 2009
    Goddammit, my wife just closed my browser... I had a decent explanation, but oh well. Anyhow, you want to use:
    Z = (phat - p) / sqrt[p(1-p)/n], where p is the 2-month value and n is the count for that week's sample. We're gonna pretend that the 2-month value is a constant even though it isn't. phat is your weekly sample rate. (There's a sketch of this below.)
    Gallo Ref 3.1 : Bryston 4b SST : Musical fidelity CD Pre : VPI HW-19
    Gallo Ref AV, Frankengallo Ref 3, LC60i : Bryston 9b SST : Meridian 565
    Jordan JX92s : MF X-T100 : Xray v8
    Backburner:Krell KAV-300i
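A sketch of that z-score, again with the made-up T-SQL names: p is the 2-month rate treated as fixed, phat is the previous week's rate, and n is the previous week's row count. Roughly, |z| > 1.96 puts the week outside the usual 95% band:

    WITH two_month AS (
        SELECT AVG(CAST(error AS float)) AS p          -- 2-month rate, treated as a constant
        FROM error_log
        WHERE log_date >= DATEADD(month, -2, GETDATE())
    ),
    last_week AS (
        SELECT AVG(CAST(error AS float)) AS phat,      -- last week's rate
               COUNT(*)                  AS n          -- last week's row count
        FROM error_log
        WHERE log_date >= DATEADD(day, -7, GETDATE())
    )
    SELECT w.phat, m.p, w.n,
           (w.phat - m.p) / NULLIF(SQRT(m.p * (1 - m.p) / w.n), 0) AS z   -- Z = (phat - p) / sqrt(p(1-p)/n)
    FROM last_week w CROSS JOIN two_month m;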