Statistics for Data Science | Probability and Statistics | Statistics Tutorial | Ph.D. (Stanford)


Data science and machine learning is the hottest job of the 21st century, with an average salary of a hundred and twenty thousand dollars per year. According to LinkedIn, the data science job profile is among the top five jobs in the entire world. If you are to foray into the world of data science, you need a good command of statistics, as it forms the basis of all data science concepts. With the help of statistics, you can make predictions such as "New York will be hit with multiple tornadoes at the end of this month" or "the stock market is going to crash by this weekend." Now, all of this sounds magical, doesn’t it? Well, to be honest, it’s just statistics and not magic, and you don’t really need a crystal ball to see into the future.
So, keeping the importance of statistics in mind, we have come up with this comprehensive course by Dr. Abhinanda Sarkar. Dr. Abhinanda Sarkar has his PhD in statistics from Stanford University. He has taught applied mathematics at the Massachusetts Institute of Technology, been on the research staff at IBM, led quality, engineering development, and analytics functions at General Electric, and has co-founded OmiX Labs. We are uploading this high-quality classroom session by Dr. Abhinanda Sarkar from Great Learning’s business analytics and business intelligence course, which has been ranked the number one analytics program consecutively for the past four years. This tutorial will be on YouTube for only a limited period of time, so that learners across the world can have access to high-quality content. So please do subscribe to Great Learning’s YouTube channel and share the video with your peers so that everyone can learn from the best.
Now, without further delay, let’s have a quick glance at the agenda. We’ll start by understanding the difference between statistics and machine learning. Then we’ll go through the different types of statistics, which are descriptive, predictive, and prescriptive. After that,
we will understand the different types of data available. Going ahead, we’ll understand the concepts of correlation and covariance comprehensively, following which we’ll head on to probability and learn how to implement conditional probability with Bayes’ theorem. And finally, we’ll look at two types of probability distribution: the binomial distribution and the Poisson distribution. So let’s start off with the session.
What you now need to do is to be able to get the data to solve this problem. So the statistical way of thinking typically says: you formulate a problem, and then you get the data to solve that problem. The machine learning way of looking at things typically says: here is the data, tell me what that data is telling you. Many of my colleagues, and I myself, have run into this problem when going through interviews and so on. Statisticians say that we’re not getting jobs out there. So I go to the people who are hiring and ask, why don’t you hire statisticians? And I reached an interesting conclusion to this entire discussion. It sometimes goes this way: the interviewer who’s interviewing the statistician for a data scientist job asks the question, "Here is my data. What can you say?" And the statistician answers with something like, "What do you want to know?" And the business guy says, "But that’s why I want to hire you." And the statistician says, "But if you don’t tell me what you want to know, how do I know what to tell you?" And this goes round and round, right? No one’s happy about this entire process.
So there’s a difference in the way these two communities approach things. My job is not to resolve that, because in the world that you will face, you will see a lot more of this kind of thinking than you will see of this. Because in this world, the data is cheap and the question is expensive, and you’re paid for asking the question. In this world, the question is cheap and the data is expensive, and you’re paid for collecting the data.
so sometimes you will be in a situation where this is going to be important. for example, let’s suppose you’re trying to understand. who’s going to buy my product? you’re asking the question. let’s say that my products aren’t selling. and you want to find out why? what will you do? get what data so let’s say that you’re selling your i don’t know. what do you want to sell? go to sell watches say so let’s suppose people aren’t buying buying watches anymore, which is a reality, correct? so your watch company who buys watches this the entire business model of a watch is disappearing. do you have watches some of you have he has actually a surprising number of you have maybe they do different things these days that that seems like a very that’s a fitness device is not really a watch at all. so something like this was actually with my daughter at lunch today. so she got something like this. i’m not sure my my wife who’s an entrepreneur runs her own company. she came back from delhi she came back with two of these. i don’t know where she picked them up. so my daughter the first thing she did she took one of this and she took this thing out. because she thought of the whole wrist band was an unnecessary idea. she that didn’t occur to her. i mean that’s a separate thing. that’s a nice little beautiful red wristband etc. so what is different thing but let’s say that you are watch company. nobody’s buying your watches or fewer. people are buying your watches. now, how you going to solve this problem or how you going to process this information? what you want to do? what do you want to know? what you remember? i’m asking this question also from an analytical. effective so when you say that to check the model and see what is not so let us use the whole data question. so you so first order you see sense. for whom and when and how how do you structure your data? how will you how will you arrange the problem? 
okay, that makes problems even harder because now you’re going to look for data that is in with you. no. no, he’s right. he’s right. he’s maybe people are not buying watches because they buy something else. that’s a reasonable thing. well, let’s keep the problem simple. let’s consider only data that is within you will go outside not to worry, but let’s say that i am looking at my data. what data do i want to see and what questions do i want to ask a favor? so sales year-by-year types. and then what comparisons do i want to do here region wife is with what purpose what question am i asking the data what sex? our customers are buying my product or what section of customers are buying my product compared to what what am i biggest set of customers? so that’s also what’s happening who are my biggest customers? okay. that’s a very interesting question to ask except that that question implies that i needed to know who my biggest set of customers sort of could have been but it’s a good point. where is the bulk of my sales coming from then? someone else says something about time here. is it going is it going down so you can look at things like saying that for which group of customers are by sales going down the most for example, you could ask that. i’m not saying that’s the question about that’s a possible question to ask. so let’s suppose you follow that approach then i’m trying to understand. i know that my sales are going down. that’s an obvious thing when i see you is telling my cfos telling me if i don’t stop this we all going to be out of a job. correct, the hmt factories in bangalore and not in good shape. one of them i think has become the income tax office. somewhere in the polish forum area. so there is that’s going to happen to me if i don’t do this well, so i know my sales equipment, but i don’t know by how much and particularly for home. 
So, obvious segments for which the sales are going down: in which segments are sales going down the most? In which segments are they going down a little bit? How fast are they going down? I can ask questions of that sort. Now, what conclusions do I want to be able to draw at the end of this? How do I want to use this information? For this you usually follow something like a three-step process, and you may have seen this. This covers both these sides, and these words should be familiar to some extent. The first is called descriptive, the second is called predictive, and the third is called prescriptive. Have these words been introduced to you? At least in this context, at least you’ve read it. I’m sure you all cruise the web and look at blogs and things like that. Nothing new in this, I’m sure, but I just want to set a context, because I’m going to talk a little bit about what descriptive is here. So: descriptive, predictive, and prescriptive.
Now, what is the descriptive problem? The descriptive problem is a problem that says: describe for me where I’m losing my sales and when I’m losing my sales. It just describes the problem for me. It tells me where the problem is; it locates it, it isolates it.
The predictive problem says: look at this data and give me an idea as to what might happen, or what would happen if I change this, that, or the other. So let’s suppose I do the following kind of thing. I say, let me relate my sales to my prices. Let me try to understand: if I reduce the prices of my watches, will more people buy them? Conversely, suppose I make my watches luxury items and increase the price of a watch; make a watch an aspirational thing, a decorative item, a luxury item, a brand item, so that people wear a watch not to see the time but as a prestige statement, as a fashion statement, whatever it is. If I do this, then what will happen? That’s predictive. I’m trying to predict something based on it.
I’m trying to see, if something happens to, let’s say, one part of my data, what will happen to the other part of my data. And then, based on that, the doctor carries out a predictive analysis of you: because I see this, I now think you have this issue. You have this thing going on. Let’s say I’m diagnosing you as being pre-diabetic. You’re not yet diabetic, but you’re happily on the way to becoming a diabetic. Now, because of this, I have to issue you a prescription. I now should tell you what to do. So this is the data that comes from you. The data in some way is modeled using the domain knowledge that the doctor has, and that model is translated into an action. That action is designed to do something; typically it is designed to do something actually fairly complicated. The first thing the doctor tries to do is, let’s say, do no harm: the Hippocratic oath. First, let me make sure that I don’t do any unnecessary harm to the patient. Then let me, shall I say, optimize his or her welfare by making sure that I control the blood sugar the best I can and that I postpone the onset of diabetes as best I can. It’s a complex optimization problem of some sort.
In a business also, it’s a complex optimization problem, right? I need to be able to sell more watches, but I also need to be able to make money doing so. I can increase my sales, but if I increase my sales and my profits go down, or my earnings go down based on the cost, then that’s a problem. But at the same time, if I try to run a profitable business and nobody buys my product, that also is not a particularly good idea. And there are other issues. I’m running the company; I’ve got employees that I want to keep on the boat. How do I run the company in such a way that it maintains that particular labor force? I have finances to take care of; I have loans to repay. How do I get the cash flow in order to repay the bank loans that I have? So the prescription has to meet lots and lots of requirements.
If you are building an autonomous vehicle, you’ll have situations saying the car has to do this, but it also has to follow certain other rules. For example, if it sees someone crossing the road, it should stop, but it shouldn’t stop very suddenly, because if it stops very suddenly it will hurt the car and is also probably going to hurt the driver. So it needs to stop, but not stop too suddenly. It has to follow the rules of the road. Because otherwise the computer will simply say, "Oh, you want me to avoid the person crossing the road? I’m just going to go behind the person." And you’ve got to go and tell the car, "Please don’t do that, because there’s a house next to it. You can’t just do that." "Oh, you didn’t tell me that. You just told me to avoid the person; you didn’t tell me about the house." "Okay, we’ll put that as a constraint in our program and see how well it goes." So prescription is problematic.
Another simple way of looking at it might be to say that description is: how many centuries has Virat Kohli scored? Look up Cricinfo; it will give you the answer. Prediction might be: try to guess how many centuries Virat Kohli will score in the World Cup. Prescription might be: how do we get Virat Kohli to score more centuries in the World Cup? And as you can figure out, you’re going from a purely data-based version of the problem into something that’s only notionally about the data. The data will help you, but there’s a lot more than the data when it gets to that.
What we’ll do today, once I finish talking to you, is take a look at the descriptive part of analytics. The descriptive part of analytics is about simply describing the data without necessarily trying to build any prediction or any models into it; simply telling you the way it is. This is hard. This in itself is not necessarily an easy thing to do, because you need to know very well how to do that. And what are the ways in which one looks at data?
This is skillful in itself. So, for example, let’s suppose that you go to the doctor, and the doctor is looking at you, looking at your symptoms, and the doctor recommends a blood test. Now, how does the doctor know what blood test to recommend? Based on the symptoms. But remember that potentially there’s an enormous amount of information in you. All of us, as biological things, carry an enormous amount of information in our blood, in our neurons, in our genes, or wherever. If you’re talking about big data, as I said, there’s two meters inside every cell, and there are a few billion neurons in your head. You don’t need to go far to see big data; you are big data. You’re one walking example of big data. We all are.
In that big data, what little data does the doctor know to see? That’s the descriptive analytics problem. The doctor is not doing any inference on it. The doctor’s not building a conclusion, and the doctor is not building any AI system on it. But it’s still a hard problem, because given the vast amount of data that the doctor could potentially see, the doctor needs to know that this is interesting to me, and this is interesting to me, and this is interesting to me in this particular way.
For example, a blood test. Let’s suppose that I draw blood from you for a particular purpose, let’s say for blood sugar. Okay, leaving aside the biology of how much blood to draw, etc. None of you, I guess, are doctors? Are there doctors in the room? No doctors, so I can say whatever I want. You understand what I’m saying, you know, but I’m old enough that this is a real problem for me. So you have a large amount of blood that’s flowing through you; we all do. This blood carries nutrients. What that does is that every time there is a nutrient inflow, the blood looks a little different.
So if you eat, your blood looks a little different, because that’s your blood’s job. The blood’s job is to carry nutrients. If you want to run, if you want to walk; if I’m walking around, my legs are getting energy from somewhere. The energy needed by my legs is being carried by the blood, and it is being generated through inputs that I get: some of it from the air that I breathe, from where it gets the oxygen to burn things, and some of it from the food that I have eaten, the nice lunch that I had, from where it gets the calories to do that. So, based on what my energy requirements are and based on what I’ve eaten, my blood is not constant. My blood content is what is known as a random variable. What’s random about it? It looks a little different all the time. Your blood at twelve o’clock at midnight will look a little different from your blood at twelve o’clock at noon, because it’s doing something a little different.
The same phenomena are everywhere. If I were to, for example, measure the temperature of the oil in your car or in your two-wheeler, what do you think the temperature will be? It depends. First of all, it depends on whether the car is running or not. It depends on whether it has run or not. It depends on how much oil there is. It depends on how you drive. It depends on the temperature of the car. The answer is: it depends. And the same is true for your bodily fluids.
So this becomes a hard problem, because if it is random, then from a random quantity, how do I conclude what your blood sugar is? How does a doctor reach a conclusion of any sort? An average? Average of what? Average over a particular duration? So there are multiple averages that you can get. First of all, there is the question of saying that if I take blood from you, how is the blood usually collected? So the phlebotomist comes in and usually takes an injection.
from one point, let’s say by some strange accident is throwing advised policy by same some strange accident two different people are drawing blood. from two hands at the same time do not try this at home. well, i suppose they do do this. we look at the same blood ideally. yes. at the same time as i say do not do this at home, but the same time you are getting two different samples. there’s not just a question of time. your blood is not going to look the same even within your body at one period of time. even from the left hand in from the hand it exactly the same period of time is not going to love this game. there is a slight there is a slight problem that some old in love. you said that you know, your heart is in the middle. your heart is actually middle, but it beats to the left. why? because the the heart is what the heart is both a pump and a suction device. the pump side is on the left. the suction side is on the right. so your blood pours out from your left side and it goes back in on the side. so this is site asymmetry in your body between left and right. once i tends to go out the other side tends to come in its slight it mixes up all in the middle. so one sampling idea is that i’m taking a sample of blood from you and is just one example. the second question is as you’re saying is a question of time. so you can average over time if you average over time. this is at least here. you can say i’m going to do this maybe before eating. after eating real after eating. so those if you have blood pressure test, for example of sorry blood sugar test, once they ask you to do it fasting and then they ask you to do some two hours after eating. do they tell you what to eat? sometimes a glucose sometimes they don’t this is sort of say that based on what you naturally eat. let me figure out what you are processing the expect you to eat a typical meal and not go and eat, you know large amounts of kfc. that is not what you normally eat. 
Just eat what you normally eat; if you’re vegetarian, eat a normal vegetarian meal. Eat normal food and then we’ll figure it out. Let’s see how good your body is at handling it. Do a normal thing and I’ll take another normal sample. Then one of you said something very interesting: the average evens things out. But what does the averaging do? "Neutralize" is an interesting word to use; it neutralizes things. It provides context. Context of what? Context is a good point.
So what is the doctor trying to do? Let’s simplify things a little bit and suppose that the doctor has a threshold. Let’s give it a number. Let’s say the doctor says that if your blood sugar is above 140, I’m going to do something; if your blood sugar is less than 140, I’m not going to do anything. I don’t know whether this is the number or not, but let’s just make it up. Now the doctor is going to see from you a number. It may be a single reading. It may be an average. It may be a number of things. How is the doctor going to translate what they see from you and compare it to the 140? How is that comparison going to be made?
So let’s suppose I have just one reading, and that reading, oh, I don’t know, is 135. I’ve just got one reading from you: 135. What does that tell me? One argument is simple. Let’s take a very machine learning, computer science view of this: 135 is less than 140. So now we say, yeah, but you know what? Let’s say there is this 135 and another guy who is, say, 120. There should be something that says that this 135 is a little bit more trouble than 120: closer to the threshold, as he says. So in other words, this threshold isn’t quite as simple as I thought it was. I can solve this problem in one of two ways. One way to do this is to make this 140 a little range. This sometimes is called fuzzy logic; in other words, the question you’re asking becomes fuzzy, not as crisp.
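A minimal sketch of the "make 140 a little range" idea, not the lecturer's own code. The 130 to 150 band and the function names are assumptions made up for illustration; only the 140 threshold and the readings 120 and 135 come from the discussion above.

```python
def crisp_flag(reading, threshold=140.0):
    """Crisp rule: act only when the reading is above the threshold."""
    return reading > threshold


def fuzzy_concern(reading, low=130.0, high=150.0):
    """Degree of concern between 0 and 1, rising linearly from `low` to `high`.

    The 130-150 band around the 140 threshold is a hypothetical choice.
    """
    if reading <= low:
        return 0.0
    if reading >= high:
        return 1.0
    return (reading - low) / (high - low)


# 120 and 135 look identical to the crisp rule, but not to the fuzzy one.
for reading in (120, 135, 145):
    print(reading, crisp_flag(reading), fuzzy_concern(reading))
```

With the crisp rule, 120 and 135 get the same answer; with the widened boundary, 135 registers as closer to trouble than 120, which is the distinction being asked for.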
You’re not fiddling with the data; you’re fiddling with the boundary. You’re fiddling with the standard. The other way to do that is to create a little uncertainty, a little plus-minus, around the reading itself, around 135. Let’s suppose that I go and get another reading, and the second reading that I get is, say, 130, and the third reading that I get on the day after that is, say, 132. And I’ll say, okay, seems to be fine. But let’s suppose that after 135 the guy goes and I do my usual thing and measure it again, and this time it comes out as 157. And I do it again and it comes out as 128. And I do it again and it comes out to be 152. So in both cases 135 is probably a good number. But in one case 135 was varying very little, and in the other case 135 was varying a lot, which gives me different ideas as to how to process it. So what descriptive analytics talks about, essentially, is trying to understand certain things about data that help me get to conclusions of this kind a little more rigorously.
Now, to be able to quantify what these plus-minuses are is going to take us a little bit of time, and we will not get there this residency; we’ll get there next residency. In order to say that I saw 135, 135 plus-minus something, that question now needs to be answered. But to do that I need to have two particular instruments at my disposal. One instrument is to be able to know what to measure. I need to say what an error means. I need a statement that says that maybe I’m 95% confident that something is happening; I’m 95% sure that this is below 140. I need a way to express it, and that is the language of probability. So what we will do tomorrow is introduce a little bit of the language of probability and relate it to what we’re doing today. So there’s going to be a little bit of a disconnect.
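A small sketch of the two sets of readings mentioned above. Using the sample standard deviation as the "plus-minus" is an assumption made here for illustration; the lecture defers the actual quantification to a later session. The variable and function names are hypothetical.

```python
from statistics import mean, stdev

# Readings quoted in the lecture: both series start at 135.
steady = [135, 130, 132]
erratic = [135, 157, 128, 152]


def summarize(readings):
    """Return (average, sample standard deviation) of the readings."""
    return mean(readings), stdev(readings)


for label, readings in (("steady", steady), ("erratic", erratic)):
    average, spread = summarize(readings)
    print(f"{label}: mean = {average:.1f}, plus-minus about {spread:.1f}")
```

The first reading is the same in both series, but the spread around it is several times larger in the second, so a single 135 carries much less information there.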
But what we’re going to do is create two sets of instruments: one set of instruments that is purely descriptive in nature, and one set of instruments that is purely mathematical in nature, so that I can put a mathematical statement on top of a description. And the reason I need to do that is because the pure description is not helping me solve the problem that I have set. So what will happen is, you will see in certain medical tests that you will not see points like this. You will see intervals. Your numbers should be between this and this; your cholesterol number, your HDL, whatever, should be between this and this. You won’t see a number; you will see a range that typifies a variation. And in certain cases you will see thresholds, maybe just a lower limit and an upper limit, but you’ll also see a recommendation that says, please do this again. Comparing one number to one number is typically a very bad place for any kind of analyst to be in, because you’ve got no idea which is error-prone and where the error is. So what happens is you try to improve one of those numbers, either by fiddling around with the range or by getting more measurements, and you’ll see that as we go along a little later.
So this is the context for what we have in terms of data. Let’s see. This is a set of files that has been loaded. It’s a very standard set of files; it’s not mine, to be honest. I just want to make sure that I’m doing what I’m supposed to be doing. For reasons that have more to do with security, my understanding is that the notebook will not access your drives, so keep it on your desktop and don’t complicate life. And there is this notebook. It’s called CardioGoodFitness. The word "statistics" refers to the idea that this comes from the statistical way of thinking,
which, as I said, as opposed to the machine learning way of thinking, tends to be a little more problem first, data next, which means we worry about things like hypotheses and populations and sampling and questions like that. And the "descriptive" part refers to the fact that it is not doing any inference. It is not predicting anything. It’s not prescribing anything. It is simply telling you what is there, with respect to certain questions that you might possibly ask of it.
What is the context of the case? The market research team at a company is assigned the task of identifying the profile of the typical customer for each treadmill product offered by the company. The market research team decides to investigate whether there are differences across product lines with respect to customer characteristics. Exactly what you guys were suggesting that I should do with respect to the watch: understand who does what. Entirely logical. The team decides to collect data on individuals who purchased a treadmill at a particular store during the past three months. So, like watches, we’re now looking at data for treadmills. And that is in the CSV file. So what you should have is a CSV file in the same directory, and through the magic of Python you don’t have to worry about things like paths.
Before we get there, remember, because we’re looking at this statistically, before we get the data we should have a rough idea as to what we’re trying to do. And so they say that here are the kinds of data that we are looking at: the kinds of products, the gender, the age in years, education in years, relationship status, annual household income, the average number of times the customer plans to use the treadmill each week, the average number of miles the customer expects to run or walk each week, and a self-rated fitness scale of 1 to 5, where 1 is in poor shape and 5 is in excellent shape.
Some of this is data, some of this is opinion, and some of this is opinion masquerading as data, like, for example, the number of times a customer plans to use a treadmill. Hopeful, wishful thinking, but it’s still data. You’re asking someone, how many times will you use it? "Oh, daily. No problem, seven times a week." We’ll see. But it’s still data; it has come from somewhere. So the way to think about this is to say that I want to understand a certain something, and that certain something has to do with customer characteristics. And to do this you can take, let’s say, a marketing point of view: who buys it? Or you can take a product engineering point of view: what sells, what kind of product should I make, etcetera.
In business, as you probably know; are any of you entrepreneurs? One hand up. There’s actually one hand up, and there are closet entrepreneurs from what I can figure out. Sometimes it’s unclear what that word means. In other words, you think you are, or you’re not confident enough to call yourself one, or you’re doing it at an early stage. If you are an entrepreneur, for example in the physical product space or even in the software space, one of the things you often think about is what’s called the product-market fit, which is: you’re making something; how do you match what you can make with what people will buy? Because if you make something that people do not buy, that doesn’t make any sense. On the other hand, if you identify what people buy and you can’t make it, that also doesn’t make too much sense. So the conclusion that we will draw on this we will not draw today, but the purpose is to be able to go towards conclusions of that kind: either isolate products or isolate customers and try to figure out what they tell us.
Pandas generally has a fair amount of statistics built into it. That’s what it was originally built for.
NumPy is something that was built more for mathematical problems than anything else, so some of the mathematical algorithms that are needed are there. There are other things: plots in matplotlib, seaborn, and many other things that you’ve seen already. Python is still figuring out how to arrange these libraries well enough; shall we say, the programming bias sometimes shows through in the libraries. I, for one, do not remotely know this well enough to know what to import upfront, but in a good session you know what to import upfront, and you do all this upfront so you don’t get stuck with what you want to do. The naming is up to you. If you like the names as they are, then that’s fine, if you want a standard set of names.
So when you load the data set, if the file is in the path, just this will work: the file name with .csv. It’s usually smart enough to convert Excel forms into CSV. In other words, if you have this as an Excel file and things like that, it’s usually smart enough, but if it isn’t, then just go in and save the XLS file as a CSV file and operate that way, in case it doesn’t do it on its own. But more often than not, what you’ll see is that when Jupyter sees it, it will read an XLS file as a CSV file; or go and make the change yourself. Oh, you can have other read statements in it as well. You can change functions inside it, and you can figure out how much to head.
What this shows you is the head and the tail of the data. This is simply to give you a visualization of what the data is. It gives a sense of what variables are available and what kinds of variables they are. We’ll see a little bit of a summary after this, etc. So, for example, some of these are numbers. Income: what is income? Income is annual household income. That’s a number. Some, for example, let’s say gender, male or female: this is a categorical variable. This is not entered as a number.
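A rough sketch of the load-and-peek step described above, not the notebook's actual cell. The column names, the four rows, and the variable name `mydata` are assumptions made up for illustration, and an in-memory string stands in for the CSV file so the example runs on its own; with the real file you would pass its name to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the course's CSV file; every value here is invented.
sample_csv = io.StringIO(
    "Product,Age,Gender,Education,MaritalStatus,Usage,Fitness,Income,Miles\n"
    "A,18,Male,14,Single,3,4,29562,112\n"
    "A,19,Female,15,Single,2,3,31836,75\n"
    "B,33,Male,16,Partnered,4,3,52302,106\n"
    "C,50,Female,18,Partnered,5,5,95508,180\n"
)

mydata = pd.read_csv(sample_csv)

print(mydata.head(2))   # first rows
print(mydata.tail(2))   # last rows
print(mydata.dtypes)    # which columns were read as numbers, which as text
```

The `dtypes` output is where the distinction above shows up: a column like income is read as a number, while a column like gender is read as text.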
It is entered as a text field. If you are in Excel, for example, right at the top, if you go in and look, it’ll tell you how many distinct entries there are, how many distinct settings there are. So usually what happens at the beginning, when a data frame like this gets created, is that the software knows whether it is talking about a number or whether it is talking about categories.
You can see one particular challenge with this. What does this 180 mean? Counts. Why do you think there are so many decimal places here? 14 years of experience, 16 years of experience; why is it going 0 0 0 0? It does this because there are other numbers for which those decimal places are needed. So what any software typically does when it sees data is ask, at what granularity do I need to store the data? Sometimes this is driven by the computer, whether you’re 64-bit or 32-bit and things like that. But what it means is that the data is stored in the data frame to a certain number of digits. Usually you don’t see it this way, but sometimes, for example when you set include equal to "all" and ask for a full description, the data comes out in this slightly irritating way, because of something here, let’s say the income figure.
Now, when it looks at the descriptions of this, what is the description that it is reporting, and how does it choose to report out the description in this particular situation? So let’s take a little bit of a closer look at this. Look at the way it’s done here: count, unique, top, frequency. And then there are certain things here: mean, standard deviation, minimum, 25%, 50%, 75%, and max. When it sees a variable like gender, it reports out lots and lots of NaNs. What does that tell you off the bat? It can’t do that, which means it’s not a number. This is not a number, in other words.
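A minimal sketch of the full-description call discussed above, using a tiny invented data frame rather than the course data. The column names, values, and the name `mydata` are assumptions; `describe(include="all")` is the pandas call that produces the count, unique, top, freq, mean, std, min, quartile, and max rows.

```python
import pandas as pd

# Invented data: one text (categorical) column and one numeric column.
mydata = pd.DataFrame(
    {
        "Gender": ["Male", "Female", "Male", "Female"],
        "Age": [18, 19, 33, 50],
    }
)

summary = mydata.describe(include="all")
print(summary)

# For Gender the numeric rows (mean, std, min, ...) come back as NaN,
# and for Age the categorical rows (unique, top, freq) come back as NaN.
print(summary.loc["mean", "Gender"])
print(summary.loc["unique", "Age"])
```

The NaN entries are the software saying that the requested description does not apply to that kind of variable, which is the point being made about gender.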
If you ask me to find the mean of something and you're giving me male and female as inputs, I don't know what to do, which is an entirely reasonable stand to take for any reasonable algorithm, right? It requires another kind of description for that to work. The problem with describe, with this call's syntax, is that it is asking for the same description for all of them, whether in significant digits, whether in columns, et cetera. Having chosen this description, it says that's all that I'm going to give you. But where it makes sense, let's say for example I look at age. For age I've got a hundred eighty observations, and it is calculating certain descriptions for it. Correct? So what are the descriptions it is calculating? Let's look at these. It is calculating a description like the minimum. The minimum is what? 18. The maximum is 50. These are easy to understand. Then let's look at something a little more interesting. Suppose I want to report one number, one representative age, for this data set. This is like asking the question: how do I get a representative blood sugar number for you? I can give you a minimum and a maximum, but to get the minimum and the maximum I need to draw blood from you many, many times. But let's suppose I want one representative number for you. Somebody asks you, what is your blood sugar? You want to give them one number. Similarly, somebody is looking at this data and asks: give me a representative age. How old is your typical user, or what age do you want to build it for? Or you're even asking, let's say, a product question. You're a product designer building a treadmill. Now, how do you design a product, those of you who are engineers? Based on the weight? Very good. What weight? Whose weight? Who's the user? What is the weight of the user? He's got a good point. As a design engineer, I need to know what weight will be on that treadmill. Now, what is your answer to that question? Max?
So there's a question of saying: if I want to measure a variable by one number, how should I even frame that question? What makes sense? What is the one number? Average? No. Max? In this particular case you might argue the max is the number, because I want to be able to say that if I can support you, I can support anyone. But there's also a downside to that. I've now engineered that product so that, you could argue, I have, shall I say, over-engineered it. Okay. So let's suppose that you're doing this for a mattress. We're all relatively wealthy, based on the fact that we are here, so we probably sleep on a mattress. Not everyone is fortunate enough to sleep on a mattress, but let's suppose you do. How much weight should that mattress be designed to take? If you over-engineer it, what will happen is that for a reasonable weight, a weight a lot below that, the mattress is not going to sink. Let's say that you design it for a hundred kilos. Now if you are 50 kilos or 60 kilos, that mattress is not going to sink for you. It is going to be comfortable for someone who is a hundred kilos; someone who's 50 kilos is just going to bounce on it. You're not going to feel its soft silkiness, or whatever it is you want to feel from the mattress. It won't work. So what to do? That's a hard problem. The description too is a hard problem. Who do I engineer it for? So people have different versions of what it means to represent the data. Here's one version of it. This is what is called a five-point summary. I report out the minimum, the 25% point, the 50% point, the 75% point and the maximum. Variable by variable I report five numbers. I report the lowest. What does 25% mean? 25% of my data set, of the people, are younger than 24. The youngest is 18, so 25%, or a quarter of them, are between 18 and 24, a quarter are between 24 and 26, a quarter are between 26 and 33, and a quarter are between 33 and 50.
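The five-point summary can be sketched with nothing more than the standard library. The `ages` list is invented for illustration, and `five_number_summary` is a hypothetical helper name. Quartiles have several competing conventions; the `method="inclusive"` rule used here is one reasonable choice, and other tools may report slightly different 25% and 75% points on the same data.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, 25% point, median, 75% point, max) of a list of numbers."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

# Invented ages, not the course data.
ages = [18, 19, 21, 23, 24, 25, 26, 28, 33, 35, 40, 50]
summary = five_number_summary(ages)
print(summary)
```

Reading the result left to right gives the same story as in the lecture: a quarter of the observations fall between each pair of neighbouring numbers.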
This breakdown into quarters is what is known as a distribution. Statisticians love distributions. They capture the variability in the data, and you can do all kinds of things with them. So I'm going to draw a typical shape of a distribution. We will make more sense of it later on. This is a theoretical distribution. For example, let's say it has a minimum, it has a maximum, it has, say, a 25% point, a 50% point and a 75% point. In terms of probabilities, there is 25% here, 25% here, 25% here and 25% here. If you want to think in terms of pure description, this is not a probability; it is just a proportion. If you want to think in terms of probabilities, what this means is that out of a hundred eighty people, if I draw one person at random, there's a 25% chance that that person's age is going to be below 24. Correct? If you want to think in terms of probabilities, we'll do that tomorrow. But this is a description. What this description does is give you an idea as to what value to use in which situation. So for example, you could say: I'm going to use about 26 as my representative age. If I do that, what is the logic I'm using? This 50% point, so to speak, is called the median. The median means the age of the average person. Sort, pick the middle person and ask: how old are you? That is the age of the average person. I could also ask for the average age of a person, which is the mean, which is $\bar{x} = \frac{1}{n}(x_1 + \dots + x_n)$. Now, this is algebra. What you have to do is put n equal to 180. This is the first age, the second age, the third age, up to the 180th: $\frac{1}{180}(\text{age}_1 + \dots + \text{age}_{180})$. This is called the mean. This value is what? 28.79. The average age is about 28 years, or twenty-eight and a half years, 28.8 years, whereas the age of the average person is 26. Yes.
The mean, the median, and the difference between the two. I described the median as the age of the average person, and I described the mean as the average age of a person. He's looking at me as if to say, you have to be kidding me, that's confusing. I admit to it. The easy way to understand it could be this. What is the mean? Add them all up and divide by how many there are. What is the median? Sort them from the smallest to the largest and pick the middle. If there's an even number, what do you do? You take the average of the two middle ones. If they're the same, it will be the same number; if they're not, it will be a number between them. So sometimes the median may show up with a point five or something like that, even if the values are integers, when there is an even number of them. Now, which do you think is better? You're giving the answer: it depends. You've figured out that I like that answer. They both make sense. It depends on what context you're going to use it for. Yes. Okay, if you're talking in terms of parameters. He used an interesting term; he is asking what parameter I'm after. Parameter is an interesting word. A parameter refers to something about what is generating the population. It's an unknown thing that I'm trying to get at. For example, blood sugar is a parameter. It exists, but I don't know it. I'm trying to get a handle on it. Correct? So if I'm thinking in terms of parameters, then these are different parameters. Let's look at a distribution here. I'm not sure whether this is picking things up; I hope so. The median is a parameter such that on this side I have fifty percent, and on this side I have fifty percent. This is the median. The mean is what is called the first moment. What that means is: think of this as a plate of metal, and I want to balance it on something. Where do I put my finger so that it balances? It is the CG of the data, the center of gravity of the data.
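The two recipes above ("add them all up and divide" versus "sort and pick the middle") translate almost word for word into code. This is a minimal sketch; `mean_of` and `median_of` are hypothetical helper names, and in practice the standard library's `statistics.mean` and `statistics.median` do the same job.

```python
def mean_of(values):
    """Add them all up, divide by how many there are."""
    return sum(values) / len(values)

def median_of(values):
    """Sort from smallest to largest and pick the middle.

    With an even count, average the two middle values.
    """
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(mean_of([1, 2, 3, 4]))    # integers in, possibly a fraction out
print(median_of([3, 1, 2]))     # odd count: the middle value itself
print(median_of([1, 2, 3, 4]))  # even count: average of the two middle values
```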
You can understand the difference between these two now. If, for example, I push the data out to the right, what happens to the median? Nothing happens to the median, because the 50-50 split remains the same. But if I push the data out to the right, the mean will change; it will move to the right. It is the lever principle: if there's more weight on one side, I have to move my finger in order to counterbalance that weight. So these are two different parameters. If the distribution, for example, is what is called symmetric, which means it looks the same on the left as on the right, then these two will be equal, because the idea of half to the left and half to the right will be the same as the idea of where do I balance, since the left is equal to the right. So when the mean is not equal to the median, that's a signal that the left is not equal to the right. And when the mean is a little more than the median, it says that there is some data that has been pushed to the right. That should be something that you can guess here, because the mean and the median are, to some extent, what, 24, 26, et cetera. The lowest is 18; that's about six years, eight years less than that. But what is the maximum? 50. That's about 25 years beyond. The data is pushed to the right. The technical term is right skewed. People are, shall I say, more non-average on the older side than on the younger side. There was a hand up somewhere. Yes. Yes. So one reason that the median often doesn't move is that it is not that sensitive to outliers. Let's suppose, for example, we look at us and we ask ourselves: what is our mean income, or our median income? Each of us makes a certain amount of money. We can sort that and put that in. Now, let's suppose that Mr. Mukesh Ambani walks into the room. What is going to happen to these numbers? He alone probably makes a very large multiple of all our incomes put together, possibly.
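The thought experiment of one very rich person walking into the room can be played out with invented numbers. Everything below is hypothetical: the five incomes and the single very large income are placeholders chosen only to show the direction of the effect.

```python
from statistics import mean, median

# Invented incomes for the people already in the room.
incomes = [40_000, 45_000, 50_000, 55_000, 60_000]
before = (mean(incomes), median(incomes))

# One extreme income joins the room (hypothetical figure).
with_outlier = incomes + [1_000_000_000]
after = (mean(with_outlier), median(with_outlier))

print("mean, median before:", before)
print("mean, median after: ", after)

# A mean well above the median is the signal of data pushed to the right.
print("mean minus median after:", after[0] - after[1])
```

The median moves only from one person to the average of two neighbouring people, while the mean is dragged far to the right by the single large value.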
I don't know how much you make; I know how much I make. But what's going to happen to the median? It's going to stay almost the same. The typical person may move by at most half a position. What is the typical person going to be? The typical person is going to be an actual individual in the room, or maybe an average of two individuals in the room, and that person is not going to change. Yes. Yes, that's one conclusion we can draw from this. There are other plots below which will also show the same thing. You won't be able to draw that conclusion yet, for a good logical reason: I haven't shown you the full data. We'll see the histogram and do that, so hold on to that question. The conclusion that was drawn is that there are two things to see here. If I simply look at this without seeing any more graphics, where is the middle of the data from a median perspective? At 26. Good. Now from 26, look at the difference between 26 and the smallest, 18. Between 18 and 26, that's eight years. These eight years contain 90 observations, because there are a hundred eighty in total. Now what is on the opposite side of this, 26 to 50? That's how many years? 24 years. These 24 years now contain how many observations? The same, 90. So there are 90 observations between 18 and 26, and 90 observations between 26 and 50. If I were to draw a picture, what would that picture look like? Yes, exactly as you are drawing it. This, by definition, is called skewed. Is it left skewed or right skewed? Right. It's called right skewed. More data to the right. Sorry, more data is a dangerous phrase; no, that's the same number of observations. I'll say the data is pushed to the right, or more variation on the right side is probably a safer way of putting it. Yes. Skewness is often measured in various ways. One measure of skewness is typically, for example, mean minus median. If mean minus median is positive,
it usually corresponds to right skewness; mean minus median negative usually corresponds to left skewness. This is a statistical rule, but sometimes it is used as a definition for skewness. There are many definitions of skewness. Skewed data sometimes causes difficulties in analysis, because the idea of variation changes: variation on one side means something really different from variation on the other side. By the way, what's happening with respect to things like books? Are you getting books or not getting books? I have no idea what the books are. You got one book, which is what? The statistics book? Okay, I'll take a look at that book later. So this book, right? Okay, show me the book. Okay. Comment one: very nice book. Comment two: not a Python book. Right, that doesn't make it a bad book. So if you're looking for help on how to code things up, this is not the book. Get a book like Think Stats or something like that. But if you want to understand the statistics side of it, it is an excellent book. Everything that I'm talking about is going to be here. I might talk about which chapters and things like that at some point, and I might talk about how to use things in the book. So for example, at the back of this book there are tables, which you will learn how to use, and then I'll try to convince you that you shouldn't use them. But remember, many of these methods are done in ways in which either you don't have access to computers or, if you do have access to computers, you don't have them, shall we say, at run time. In other words, I can build a model using a computer, but when I want to run the application I can't run it with one. The runtime environment for statistics is often one where there are no computers around.
The build environment can include computers but the runtime environment can't. A lot of statistics is done under that kind of situation, even now, probably. Yes, very much so. Okay, so for definitions of skewness and things like that, do it the way you usually use a book, which means you go to the index and see if the word is there, then you go back and figure it out, and we'll give you some ideas as to how that works. It's a nice book. It is one of the best books that you have in business statistics, but it's not necessarily a book that will tell you how to code things up. That is not a deficiency of the book; not every book can do things of that sort. There are other books around that will tell you how to code things up but will not explain what you are doing. It's important to know what you are doing; it is also important to know why you're doing it. But books can't often be written with everything in them, I guess. The thinking is here. I think this is good for thinking; I would actually recommend this book on the thinking side. Yes. Yes, and that answer I think is very good. What you won't get here is this: it will say do this, and it won't give you the Python syntax to do it. That will not be here. So you have to solve that through some other means. I used to have a colleague in corporate life who had a very big sticker on his board. It said: Google search is not research. Now nobody agrees with him anymore. So I suppose that when in doubt you do what Homo sapiens normally do today, which is you google for an answer. So one possibility is that you understand something from a book such as this, and if you want to understand the syntax, you google for the term, say Python and that term, whatever, and it will probably give you the code. Things are very well organized these days. There's also a question, and I should give you a very slight warning here, not to discourage you from anything.
But in the next nine months or thereabouts, the duration of your program, there's going to be a fair amount of material that will be thrown at you. Correct? The look and feel will sometimes be like what we would often call, at MIT, drinking from a fire hose. You can if you want to, but you will get very wet. So pick your battles. If you want to understand the statistics side of it, please, please go into the depth of it. But if you try to get into equal depth on every topic that you want to learn, that will take up a lot of your professional time. Now, the reason we do the statistics first is, one, it's a little easier from a computational perspective, although harder from a conceptual perspective. So we begin it this way, but hold on to that idea, and then as you keep going, see if this is something that you want to learn more on. If it is, you're welcome: just write to us and let us know, and we'll get the references to you. But for the first residency, please read the book and see what happens. If there are doubts, yes, ask. It's a well-written book, and one of our colleagues here can also help explain things. So this is a summary. What did the summary tell you? The summary gives you what's called the five numbers, five numbers that help you describe the data: minimum, 25, 50, 75, max. We'll see another graphical description of this. It also gives you a mean. There is also another number here, and this number is indicated by the letters std. std refers to standard deviation. And what is the formula for a standard deviation? std is equal to $\sqrt{\frac{1}{n-1}\left((x_1 - \bar{x})^2 + \dots + (x_n - \bar{x})^2\right)}$. It's a mess, so take it in steps. Step 1: calculate the average. Step 2: take the distance from the average for every observation. Ask the question: how far is every data point from the middle?
If it is very far from the middle, say that the deviation is more; if it is not far from the middle, say the deviation is less, deviation being used as a synonym for variation. Variation can be more or it can be less: more than the average, less than the average. If someone is much older than average, there's variation; if someone is much younger than average, there is variation. So both of these are variation. So what I do is, when I take the difference from the average, I square it, so more than x bar becomes positive and less than x bar also becomes positive. Then I add it up and average it. There is a small question as to why it is n minus 1, and that is because I'm taking a difference from an average that has itself already been computed from the data. Now I have squared. My original unit was in age, so when I have squared, this has become age squared. So I take the square root in order to get my measure back onto the scale of years. So the standard deviation is a measure of how far a typical observation is from the average. It is a standard deviation, where a deviation is how far from the average you are, and because of the squaring you need to work with a square root. In modern machine learning, people sometimes use something called the mean absolute deviation, M-A-D, MAD, very optimistically named. With MAD you don't take a square; you take an absolute value, and then you do not have a square root outside it. That is sometimes used as a measure of how much variability there is. So why do we square it? Because we want to look at both positive and negative deviations. If I didn't square it, what would happen is it would cancel out. What was the word that one of you used? Neutralize, right? I love that term. Your positive deviations would neutralize your negative deviations. After squaring, the number is going to be positive. So let's look at the first number here.
So if I look at the head command here, what did the head command give me? The first few observations. Now, this is an eighteen-year-old; this is probably sorted by age. This is an 18-year-old, correct? Now, I'm trying to explain the variability of this data with respect to this 18-year-old. Why is there variation? This number, 18, is not the same as 28, and 18 is less than 28. So what I want to do is go 18 minus 28.7. What I'm interested in is this 10, this ten-year difference between the two. Now the oldest person in this data set is how old? 50. When I get to that row, this 50 will also differ from this 28, by 22 years. So I'm interested in that 10, and I'm interested in the 22. I'm not interested in the minus sign, in a minus 10 or a minus 22. I can do that in another way: I can represent 18 minus 28 as 10, and I can represent 28 minus 50 as 22, and that is this, as I said: $\frac{1}{n-1}\left(|x_1 - \bar{x}| + \dots + |x_n - \bar{x}|\right)$, with n minus 1. This is what is called the mean absolute deviation, and many machine learning algorithms use this. You are correct: in today's world, this is simpler. Now, when standard deviations first came up, this was actually harder, but people did argue about it. I think it was, well, a hundred fifty years ago, maybe more; I forget my history that much. There were two famous mathematicians, one named Gauss and one named Laplace, who argued as to whether to use this or whether to use this. Laplace said you should use the absolute value, and Gauss said you should use the square. Now, the reason Gauss won was simply because Gauss found it easier to do calculations. Why is the square easy to calculate with? Because calculus had come up a century or so before that.
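The two measures being contrasted, the square-then-root version and the absolute-value version, can be written side by side. This is a sketch on an invented list; `sample_std` and `mean_abs_dev` are hypothetical helper names. The standard deviation below divides by n − 1, as in the lecture. For the mean absolute deviation the sketch divides by n, which is a common convention; the version written on the board divides by n − 1, and for 180 observations the two differ very little.

```python
import math
from statistics import stdev

def sample_std(values):
    """Square root of the average squared distance from the mean (n - 1 divisor)."""
    n = len(values)
    xbar = sum(values) / n
    return math.sqrt(sum((x - xbar) ** 2 for x in values) / (n - 1))

def mean_abs_dev(values):
    """Average absolute distance from the mean (n divisor, an assumption)."""
    n = len(values)
    xbar = sum(values) / n
    return sum(abs(x - xbar) for x in values) / n

# Invented observations, not the course data.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_std(data))    # agrees with statistics.stdev(data)
print(mean_abs_dev(data))
```

Because large distances are squared before averaging, a single far-away point raises `sample_std` much more than it raises `mean_abs_dev`.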
So for example, let's suppose that you want to minimize variability, which is something that we often need to do in analytics. That means you need to minimize things involving the standard deviation, which means you need to differentiate this function. The square function is differentiable; you can minimize it using calculus. The absolute value is not. So what happened was that Gauss could do calculations but Laplace could not, and Laplace lost and Gauss won the definition of the standard deviation. The 25 percent or the 75 percent, okay, why do we not do that? Today this entire argument makes no sense, because today how do we minimize anything? With a computer program. You don't use any calculus; you run fmin or something of that sort, you run a program to do it. So that argument is gone: you can do calculations equally well with this as with that. So today what is happening is that Laplace's way of thinking is being used more and more. The absolute version is a lot less sensitive to outliers. With the squared version, if a point is far away, the 22 squares to 484 or something like that, which is a large number. So the standard deviation is often driven by very large deviations; the larger the deviation, the more it blows up. And therefore it is often heavily criticized. If you read, for example, the finance literature, there is a writer called Taleb, Nassim Taleb. He has written books called The Black Swan and Fooled by Randomness, and he has very often criticized the standard deviation as a measure of anything. So today this argument doesn't make a great deal of sense, and in practice, when something like the absolute deviation makes sense, it is often used. A lot of this is historical: it looks this way because of a certain historical definition, and then it's hard to change. So today, centuries after Gauss, people like me are trying to explain it and having trouble doing it.
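The point that a program can minimize either criterion without any calculus can be illustrated with a plain grid search. This is only a stand-in for a real optimizer such as the `fmin` mentioned in the lecture; the five ages and the function names are invented for the sketch. It also shows a well-known fact: the total squared distance is smallest at the mean, while the total absolute distance is smallest at the median.

```python
def total_squared(center, values):
    """Sum of squared distances from a candidate center (Gauss's criterion)."""
    return sum((x - center) ** 2 for x in values)

def total_absolute(center, values):
    """Sum of absolute distances from a candidate center (Laplace's criterion)."""
    return sum(abs(x - center) for x in values)

# Invented ages; their mean is 30.2 and their median is 26.
ages = [18, 24, 26, 33, 50]

# Try every center from 18.0 to 50.0 in steps of 0.1 and keep the best one.
candidates = [18 + i / 10 for i in range(321)]
best_sq = min(candidates, key=lambda c: total_squared(c, ages))
best_abs = min(candidates, key=lambda c: total_absolute(c, ages))

print("best center under squared distance: ", best_sq)
print("best center under absolute distance:", best_abs)
```

Neither search differentiates anything, which is why the historical argument for squaring carries less weight once a computer does the minimizing.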
There was a logic to it, and that logic doesn't hold at all anymore. Yes. How far, on the average, is an observation from the average? Confusing statement again. But how far on the average is an observation from the average? If that answer is 0, that means everything is at the average. You're asking the question: how far from the average is an observation, on the average? If I take your blood pressure, how far from your average blood pressure is this reading? If it is always exactly equal, then I don't need to worry about variability; every time I measure blood pressure I'll see the same thing. What is your average bank balance? Don't tell me that, but you know what I mean. You have an average bank balance. Your bank account manager, or your bank, actually tracks what your average bank balance is. But your balance is almost never, or very, very rarely, equal to your actual average bank balance. It's more and it's less. How much more and how much less is something that the bank is also interested in, in order to try and figure out how much of your money, so to speak, to put out there. The bank is going to make money by lending it out, correct? But when it lends it out, it can't give it to you. So it makes an assessment as to how much money it keeps. I don't want to get into finance now, but you get the drift. So the standard deviation is a measure of variability, but it is not the only measure of it. Here's another measure. Remember the 25 number and the 75 number that you were asking about. Let's say that I calculate a number that looks like this: 33 minus 24. Let's say this is my 24 and this is my 33. How much of the data lies between them? Fifty percent. Why? Because this is 25% and this is 25%, so this now contains 50%. This is sometimes called the interquartile range.
Interquartile range: big word, I know. Why is it called an interquartile range? The reason is that sometimes this is called Q3 and this is called Q1. Q3 stands for upper quartile. You can understand quartile: quarter. So this is the upper quartile and this is the lower quartile, and the difference between the upper quartile and the lower quartile is sometimes called the interquartile range. Why is it called a range? Because what is the actual range of the data? The range of the data in this particular case is 50 minus 18. Fifty minus 18, which is your max minus your min, is sometimes simply called the range. The range is the maximum minus the minimum; the interquartile range is the upper quartile minus the lower quartile. These measures are used; they see certain uses based on certain applications, and you can see certain advantages to them. For example, let's suppose that I calculate my five-point summary. With my five-point summary I can now give you a measure of location, which is my median, and I can give you two measures of dispersion, which are my interquartile range and my range. So those five numbers have now been twisted to give me a summary number, which is the median, and a range number. Interestingly, I can also draw mental conclusions from these five numbers in the following way. 24 and 33: half my customers are between 24 and 33. So if I want to deal with half my customers, I need to be able to deal with a range of about nine years. Within these nine years is all that I'm interested in getting right. So if I'm building my machine, I'm going to make sure, let's say, that the 33-year-old is okay with this and the 24-year-old is okay with this. Will the 50-year-old be okay with this? May not be. Or I might make the 50-year-old okay with this and have trouble with the 18-year-old. So I can do a lot with even these five numbers. We'll see more descriptive statistics as we go along.
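Using the five numbers quoted for age in the lecture (18, 24, 26, 33, 50), the location and dispersion measures fall out with simple arithmetic. The dictionary layout and variable names below are just one hypothetical way to hold the summary.

```python
# Five-point summary for age, as read off the describe output in the lecture.
five_point = {"min": 18, "q1": 24, "median": 26, "q3": 33, "max": 50}

location = five_point["median"]                  # one representative number
iqr = five_point["q3"] - five_point["q1"]        # upper quartile minus lower quartile
full_range = five_point["max"] - five_point["min"]  # maximum minus minimum

print("median:", location)
print("interquartile range:", iqr)   # the middle half of the customers spans this many years
print("range:", full_range)
```

The interquartile range ignores the outer quarters entirely, so it does not change if the oldest customer were 70 instead of 50; the range does.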
By the way, this is only for age. I can do this for usage, I can do this for fitness, I can do this for income, I can do this for miles. Income is interesting. Here the median income is $50,000 and the mean income is about $53,000. If you look at income in almost all real cases, the mean income is going to be more than the median income. The per capita income of India is more than the income of the typical Indian. Now, what does this command do, if I say mydata.info()? First of all, mydata is a data frame that I've created. Just to review: I read the csv file, this was describe, and this here is info. Now describe and info in the English language are similar things, description and information. In the software they are interpreted as two completely different things. Information is like your variable type: your integer field, your real field, things like that. It is giving you information on the data as data. The word data means different things to different people. To a statistician, data means what? A number. To an IT professional, what does data mean? Bytes. Information. You know: I've lost my data. I don't particularly care what the data is; replace my data. So this is that kind of information. It tells you about the data as an object: which fields are 64-bit stored integers, which are objects. It tells you which are numeric and which are categorical, the kind of data that's available. In other words, there are objects in these fields, and so on. There are so many integer fields, which are stored at 64 bits because that is what this computer works with, and there are three categorical variables. This is, shall we say, a data-object summary of what is there in that data frame, not a statistical summary. It is useful in its own way, particularly if you're processing the data and storing it. For those of you who are going to go into data-curation-like careers, this kind of a database is a nightmare.
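The difference between a statistical summary and a storage summary can be seen on a small invented frame. This is a sketch assuming pandas; the three rows are made up, and the exact labels printed for text columns (for example `object`) depend on the pandas version, so the code checks the kind of each column rather than a specific label.

```python
import pandas as pd

# Invented rows: two integer columns and one text column.
mydata = pd.DataFrame({
    "Age": [18, 26, 50],
    "Gender": ["Male", "Female", "Male"],
    "Income": [29562, 50000, 64809],
})

# Storage view: column names, non-null counts and how each column is stored.
mydata.info()

# The same information, one column at a time.
for name in mydata.columns:
    kind = "numeric" if pd.api.types.is_numeric_dtype(mydata[name]) else "categorical/text"
    print(name, mydata[name].dtype, kind)
```

`info` says nothing about means or quartiles; it answers the IT professional's question about the data, not the statistician's.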
The reason is that when you store real data, in addition to the data you often store what's called a data dictionary. Sometimes that is called metadata, data about the data, because simply storing a bunch of numbers is not enough; you have to say what the numbers are about. A mixed data frame adds a layer of complexity to the metadata. You now have to store not only what the variable is about but what kind of a variable it is. So what many professional organizations say is that archival data should never be a mixture of both numerical and categorical objects, and they pay a price for that: numerical things have to become categorical, or categorical things become numerical. But if you are storing large volumes of data, archiving it and making it available for people who have not seen it before, this sometimes gets convenient. So views like this are often useful to see how big a problem you have. Now, I want to plot a few things. You can plot anything; Seaborn, I think, is coming later. But this plot is from the Matplotlib library, and it is plotting through a command called hist. hist means histogram, which you have already seen. You've covered histograms, right? I think you've seen histograms. So this is a histogram. Now, the histogram call has a syntax with bin sizes and figure sizes. What you can do is play around with these and see the differences in what the histogram does, but there is a certain default that shows up, and the default is quite good. And here is a histogram, a distribution of the age. This is not a set of numbers. This is a picture. What does this picture have? This picture has a set of bins, and it has a count within each bin. Between these two numbers, from the lowest age to whatever this is, say 22 or thereabouts, I have a count of, let's say, 17. So it gives a count, and it does this by getting a sense of how many bins there are and plotting this shape. It is a little bit of an art to write a histogram program.
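The "set of bins with a count within each bin" can be computed by hand before any plotting. The sketch below uses equal-width bins and plain Python; `histogram_counts` is a hypothetical helper, the ages are invented, and real plotting libraries make their own choices about bin edges, so their pictures may differ in detail.

```python
def histogram_counts(values, bins):
    """Split [min, max] into equal-width bins and count the values in each.

    Returns (edges, counts); the largest value is placed in the last bin.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; cannot form bins")
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        index = int((v - lo) / width)
        if index == bins:  # v equals the maximum
            index = bins - 1
        counts[index] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts

# Invented ages, not the course data.
ages = [18, 19, 21, 23, 24, 25, 26, 28, 33, 35, 40, 50]
edges, counts = histogram_counts(ages, bins=4)

# A crude text picture: one star per observation in each bin.
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f} | {'*' * count}")
```

Changing `bins` changes the picture while the data stays the same, which is part of why writing a histogram is called a bit of an art.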
There's a Python book out there, I think it is Think Stats or something like it, in which sort of the first one third of the book is how to write histogram code. It's a wonderful book, but because it uses this example it got terrible reviews. Reviewers said: why do I want to learn how to code a histogram? And the book's author says that teaching you how to code a histogram is an example of how to do all of that. And I tend to agree. If you want to test your understanding of data and your understanding of any programming language and any visualization language, code a histogram in it, and have fun. It's a nice challenge from many perspectives: the data challenge, the language challenge, the visualization challenge, all of it. Yes. Yes, companies do that: they want archival data to be of only one data form, only one format. Why is that so? Because, as I said, when you store data, how do you store it? Let's say that you've done an analysis. The analysis is done, correct, and you've decided not to destroy the data. You're going to keep the data in your company's databases or in your own database. How will you keep it? You can take a technology. Let's pick an example: SQL, Excel, whatever it takes. Let's say I keep it in Excel. Now if I keep it in Excel, what will I do? Let's say I have an Excel spreadsheet. Let's say it holds my cardio data, this data set. Now, in addition to the data, what do I need to store with it? Yes, one possibility is that I can have a text file, like the one I had at the top of this, describing all of this, which is typically what happens in data storage. There's one file called dot data and another file called dot describe or something of that sort, which describes the variables. The idea is that they have the same name, and one extension gives you the data while the other extension gives you the description of the variables that are in this data.
correct. now this is good. now what’s going to happen? on that data certain code is going to be run. that code is going to assume certain things about the data. what do you want that code to assume about that data? whatever you want that code to assume about the data should be available in the data dictionary. now if that code is robust enough to say, whatever field you give me i will run on, that’s cool. but suppose the code requires you to know what kind of data is being used, let’s say discrete data or continuous data. in the future you’ll be doing things like linear regression and logistic regression. linear regression will make sense if the variable is a number. logistic regression will make sense if the variable is a 0 or a 1. if you have that problem, then in the metadata you need to be able to tell not only what business information this variable contains but also what kind of a computational object it is, so the code can run. so what people often say is, i’m going to make it very simple and assume that my entire data frame consists of only one kind of variable, so that when i run any algorithm on it, i know exactly what kind of data input that algorithm is going to get. what i’m giving you is a practical answer that many companies often have. i have worked in a couple of companies, and in one company this was very seriously done. when we put data back in, we had to convert it, and in the situation that i was in, it wanted everything in categories. so what we would do is take continuous data and do what’s called fine classing, which means that we would divide it not into four pieces but into ten pieces: decile 1, decile 2, decile 3, decile 4, up to decile 10. every variable was then stored not in its original numbers but as ten, nine, eight, seven, six, five, four, three, two, one. so let’s suppose i’m told his income is 9. what that means is, i know he’s in the ninth decile.
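as a rough illustration of fine classing, here is a sketch that replaces each value by a decile code from 1 to 10 based on its rank. the function name `decile_codes`, the tie handling and the incomes are assumptions for illustration; with pandas one would typically reach for `pd.qcut` instead.

```python
def decile_codes(values):
    """Return a code from 1 (lowest tenth) to 10 (highest tenth) for each value.

    Codes are assigned by rank, so tied values may land in neighbouring deciles.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    codes = [0] * n
    for rank, i in enumerate(order):
        codes[i] = rank * 10 // n + 1
    return codes


# hypothetical incomes in thousands, not from the session's data set
incomes = [52, 31, 47, 75, 29, 64, 38, 91, 45, 58,
           33, 70, 41, 83, 36, 60, 49, 104, 55, 67]
codes = decile_codes(incomes)

for income, code in sorted(zip(incomes, codes)):
    print(f"income {income:4d} -> decile {code}")
```

once every variable is stored as a code from 1 to 10, any downstream algorithm can assume the same input format, at the cost of throwing away the original scale.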
10% of the people or more have income more than him, and 80% have less than him. he’s in that bracket. and all variables are stored that way. now what happens is that every algorithm knows that every variable is going to be stored that way, and you can keep writing algorithms. otherwise, every algorithm would need to be written differently. let’s say you’re doing credit scoring, or you’re doing crm models, something of this sort, and you have built a very sophisticated crm model that tracks your customers, and it works. now suddenly you’ve got a new variable coming in, a twitter feed, and suddenly nothing works. what do you do? go back and rebuild the entire model? that’s going to set you back three or four months. that’s going to set you back a few thousand dollars. so you say no: any variable that has to go in has to go in in this form, and if it goes in in this form my algorithm can deal with it. might it not affect the efficiency of the model that we generate? yes. yes. and in fact, i’m going far away from the topic now, but in practice a professional analyst has to struggle between doing the right thing badly and the wrong thing well. you want to do the right thing, but the right thing is going to cost you time, money, data and everything. so you struggle between saying, i’m going to get a flawed model built quickly on a new data set, or i’m going to get an inefficient answer from a model that’s already been built, and let’s see how far it goes. so these are more cultural issues with how an analytical solution is often deployed in companies. they vary very much from industry to industry. they vary very much from company to company, from the culture of one company to the culture of another. they depend on regulatory environments. in certain environments an auditor-like entity comes in and insists on seeing your data: show me your data. in finance this sometimes happens with regulatory agencies. the reserve bank of india
goes into a bank and says, show me your data, all these npas and so on. show me your order book. show me your loan book. quick. and now that has to be done, and the decisions you made have to have been made in a way that makes it patently clear why you’ve done this. so very often people say, i don’t want to make the best risk decision, i want to make the most obvious risk decision, which may not be the same thing at all. but i’m being audited. so that’s a practical question and i don’t have a clean answer to it, but i do know what happens. is it right? no, it’s not, but we live in a world that has a kind of imperfection. one of my teachers, his name was jerry friedman, you’ll see some of his work later on. he came up with algorithms like projection pursuit, cart, mars, gradient boosting. he created many of the algorithms that we’ll be studying. he was one of my teachers at stanford, and when he ran our consulting classes he would say this: solve the problem assuming you had an infinitely smart client and an infinitely fast computer. after you’ve done that, solve the real problem, when you do not have an infinitely smart client and you do not have an infinitely fast computer. this was in the early 1990s. computer speeds were a lot slower; we didn’t have powerful machines like this around. so a lot of this is done in that kind of situation, where you’re struggling for continuity while you’re figuring it out. imagine yourself as an analytics manager, and i hope many of you will be, and you have an analytics team sitting in front of you. correct. you’re looking at them in the eye, and you know how much you’re paying them, and you know that half of them are going to leave at the end of the year. what are you going to do with regard to the modeling and things like that? your first order of business is going to be that things should continue in some form. keep it simple, right? keep it simple. keep it obvious for the next bunch of people who are going to come in.
and for that you’d be willing to trade away a little bit of getting it exactly right, so that the new person coming in will not have to sort out a very complicated kind of situation. this is not where you want to be, and i do not want to depress you on day one, but it’s also the fun part of the profession. it’s also what makes it interesting and exciting. it’s not all bad. okay. so the hist command gives summaries in the form of these histograms, and each one gives you a sense of what the distribution is. and as you can see from most of these pictures, most of these variables, when they do have a skew, tend to have a right skew. maybe education has a little bit of a left skew, in that a few people are out here and most people are here. it’s an interesting plot. matplotlib has this as well, but seaborn has a better version of it. this is what’s called a box plot. you’ve seen a box plot? this is a box plot. people are sometimes unsure where the name came from. who’s used this before? the name came from what used to be called a box-and-whisker plot. these are the whiskers. this is the median. the top edge of the box is the upper quartile. the bottom edge of the box is the lower quartile. the end of the whisker is 1.5 times the interquartile range above the box, if you want the formula. the length of the whisker is 1.5 times the iqr. should we have a break now? in a little bit, maybe. so we’ll go up to 3:45. i haven’t stopped, i just got distracted. so, 1.5 times the iqr, if it goes up to that. if a point lies outside it, the point is shown outside. if the data ends before it, the whisker also ends. correct. what is the whisker? all right, let me explain it another way. the top of the whisker is the maximum and the bottom of the whisker is the minimum. okay? not okay.
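the quantities a box plot draws can be computed directly. the sketch below assumes the common convention in which a whisker stops at the most extreme data point that lies within 1.5 times the iqr of the box, and anything beyond that is drawn as a separate point. the function name `box_plot_summary`, the quartile interpolation method and the data are assumptions for illustration, not the plotting library's own code.

```python
from statistics import quantiles


def box_plot_summary(values, whis=1.5):
    """Compute the box, the whisker ends and the points that fall beyond the whiskers."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whis * iqr
    high_fence = q3 + whis * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "whisker_low": inside[0],    # smallest point that is not an outlier
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_high": inside[-1],  # largest point that is not an outlier
        "outliers": outliers,
    }


# hypothetical ages with one unusually large value
ages = [18, 21, 22, 23, 24, 25, 26, 28, 30, 33, 35, 50]
summary = box_plot_summary(ages)
for name, value in summary.items():
    print(f"{name:12s} {value}")
```

when there are no outliers, the two whisker ends are simply the minimum and the maximum, and the five numbers printed are the five-point summary.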
okay, this point here. what is this plot here? it’s for males. so this is the minimum, 18 or whatever it is, and this is the maximum, 48 or whatever it is. the minimum and the maximum. so if you see nothing else on the box plot, no other points other than just the box and whisker, then your five-point summary is sitting there. that’s it. now what happens if you see points like this? outliers. what is an outlier? an outlier is a point that lies more than 1.5 times the interquartile range above the box. so this whisker will not extend indefinitely. it will go up to 1.5 times the height of this box and then stop, and if any points are still left outside, it will show them as dots. you can treat this as a definition of what an outlier is. you say the same thing in the other direction; the logic is symmetric. here, that means the data has ended here. and you can change it. i won’t now, but you can go to the boxplot syntax and change that 1.5. it’s not hard-coded into the algorithm. i’m 95% sure, a statistician is never sure about anything, but it’s a parameter that you pass to the function. the default is 1.5, and you should be able to change it. what’s the colored part? is it the median? which color? these two colors are there because i’ve asked for two things. i’ve asked for male and female. if i had three of them, there would be three. okay, this one here: the lower is q1 and the upper is q3. so for males, from the bottom whisker to the bottom of the box is a quarter of your data, the box is half your data, and from the top of the box to the end of the whisker is a quarter of your data. the line in the middle is the median. there is also an option in the boxplot function
that you can play with, which will give you a dot, and that dot is the mean. you can ask boxplot to do that. but the mean is not a standard component of the five-point summary. it’s a different calculation, not a sort. but if you want to, you can make boxplot give a dot at the mean as well. yes. so, mean and median. half the data is between 24 and 34, or whatever that is: half of all the men in my sample are between those two numbers. i think it doesn’t allow you to change the shape of the box. i think that is set. that’s sort of central to the idea of a box plot. it does allow you to fiddle with the size of the whisker; i don’t think it allows you to fiddle with the size of the box. now, what if you change that to something else, let’s say the 20% point to the 80% point, the 80/20 rule? that’s no longer a box plot. it’s another interesting plot. the significance of it is exactly this, as we have seen before: the data looks like this. it’s right-skewed. think of the picture. so this is your q1, this is your q3, and this is your q2, or the median. then the median is going to be closer to q1 than it is to q3, in the same way that the minimum will be closer to the median than the maximum. same idea. this is a summarization for numbers. if you want to summarize categorical data, you use what’s called a crosstab, or a cross tabulation. this is simply how many products there are in each product category: 195, 498 and 798. they’ve got three kinds of treadmills, and they’re trying to understand who is using what kind of treadmill. the business problem is to understand who is using which products. this is a crosstab. what is this? this is something that will be used for categorical variables. no box plot will make sense here; there are no numbers. so now you can ask interesting questions here if you want to, and you can think about how to answer them. for example, you can ask the question:
is there a difference between the preferences of men and women? possibly. is there a difference in the products that they prefer? irrespective of gender, is there a product that they prefer? you can ask all kinds of interesting questions, and you can find ways to answer them, which we will do, not in this residency but next time around, for categories. so once again, this is descriptive. all this has done is simply tell you the data as it is. what i’m saying is that if you want to do a little more analysis on it, you now have to reach a conclusion based on it. so, for example, one question to ask is: do men and women have the same preferences when it comes to the fitness product they use? now that’s a question. to answer that question, is it enough to look at the data? just looking at it will not give me the answer. i need to be able to find a statistic, a statistic that in some way measures that difference, let’s say the difference between men and women. or what we will do is not measure that directly. what we’ll do is work out, if there were no difference between men and women, what this table should have looked like, and then we’ll compare these counts with that table. that’s the interesting part, a statistic which we will do; it’s called the chi-square test. it’s coming up in the next residency. but that’s the prediction part, or the inference part, of this description. this is just the description. you can do a similar thing here. this, for example, is for marital status and product. does the product you use vary depending on whether you are partnered or single, which is what marital status holds here? maybe it has to do with age, or maybe they’re correlated. should we use one as opposed to the other? okay. you can use counts as well. instead of seeing it as a table, if you want to see it as a plot, you can ask for counts.
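the "what should the table have looked like if there were no difference" idea can be sketched in a few lines. under no difference, each expected count is the row total times the column total divided by the grand total. the function name and the counts below are invented for illustration, and the final number is only the raw chi-square statistic, not the full test that comes up later in the course.

```python
from collections import Counter


def crosstab_with_expected(rows, cols):
    """Return the observed cell counts and the counts expected if rows and columns were unrelated."""
    observed = Counter(zip(rows, cols))
    row_totals, col_totals = Counter(rows), Counter(cols)
    n = len(rows)
    expected = {
        (r, c): row_totals[r] * col_totals[c] / n
        for r in row_totals
        for c in col_totals
    }
    return observed, expected


# hypothetical customers: 10 men and 10 women, each with one of three products
gender = ["male"] * 10 + ["female"] * 10
product = (["195"] * 4 + ["498"] * 3 + ["798"] * 3      # the men
           + ["195"] * 6 + ["498"] * 3 + ["798"] * 1)   # the women

observed, expected = crosstab_with_expected(gender, product)
for cell in sorted(expected):
    print(cell, "observed", observed[cell], "expected", expected[cell])

# sum over cells of (observed - expected)^2 / expected
statistic = sum((observed[cell] - e) ** 2 / e for cell, e in expected.items())
print("chi-square statistic:", round(statistic, 3))
```

a statistic near zero says the observed table looks like the no-difference table; how large is "large" is the inference question deferred to the next residency.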
so there are things like count plots and bar plots, which allow you to do counts. in the lab you will probably do a few more of these. this is simply another visualization of the same thing, for those of you who like things like pivot tables in excel. microsoft has made, you know, wonders of us all in corporate life. it is said that you can have a bachelor’s or a master’s in anything, engineering is good, et cetera, and it is nice if you have phds in a few areas, but what you really need is a phd in powerpoint engineering, and that’s a necessary qualification for success. so certain tools have been used, and therefore those tools have been implemented in many of these softwares as well. this is the pivot table version of the same data set. this is the last, well, not the last, but this is a plot. let me show you this plot and then we’ll take a break. this is a very popular plot because it is a very lazy plot. this plot requires extremely little thinking: pairplot of a data frame, right? you don’t care what the variables are. you’re telling it nothing about the plots. you’re simply saying, figure out a way to plot them pair by pair, and it does that. so, for example, how would you read this plot on this side? it creates a matrix. the rows are variables and the columns are variables. what is this? this is age versus age. age versus age makes no sense, so what it plots there is a histogram of age. it doesn’t like the gap. nature abhors a vacuum; i suppose python does as well. what it should have plotted is age versus age. if you’re right, it should have been a 45-degree line. but a 45-degree line is a useless graphic, particularly as the same 45-degree line shows up in all the diagonals. so to make a more interesting graphic, it plots the histogram there. this kind of analysis sometimes has a name associated with it. the name is univariate.
univariate means looking at it variable by variable, one variable at a time. when i’m looking at age, i’m only looking at age. so univariate analysis is just a word: uni as in uniform, same form; unicycle, a cycle with one wheel; things like that. univariate. yes, it will replicate the same nature of the data; there’ll be a histogram here again. so remember the nature of this graph. let’s see this: where is gender here? is gender in my data? it is. so when i did pairplot on my data, what did it do with gender? remember info, when we ran info here? remember how it stored the data? it had product, gender and marital status, and it had identified them as objects in the data frame. where are they in the plot? so what does that tell you about the pairplot command? yes, it will ignore those objects. so in answer to your question, if the column in the data frame has been captured as int64, if it sees integers or numerics, it will plot them. if it’s only objects, it probably won’t give a plot. this is the histogram. this plot is the same as which plot? this one is the same as this one here. no, this is not age versus age. this is just age. age versus age would have been a 45-degree line, but it is not plotting that in the diagonal. it is simply plotting age’s own distribution. yes, with the counts. what it is doing is essentially running hist on age over all the observations and putting it on the diagonal. yes, each of these is a bin, and for each bin there is a count: it’s a count of the number of people who are in that age group. this is, no, this is miles. this is age. so say between, let’s say, 40.5 and 43.5, or whatever
these numbers are, there are three people. remember, the histogram is a visual thing. you can interrogate a histogram if you want to, which means you can find out what those numbers are. inside hist, just ask for a summary and it will give you what the features of that histogram are. but the histogram is not meant to be used that way. it’s meant to be used as an optical device, to see the shape, to see the counts. it’s an art to do a histogram. if you change the bins a little bit, the histogram will look a little different. so i would suggest that unless you’ve got a lot of experience in this, or you really enjoy the programming, do not fiddle with the histogram: its shape will change. we can look after the break at how to change, not the histogram, but what its shape is. no, not by default. you can go in and change the size, but the bin widths of a histogram take a little more to change. there is stuff out here, and you can find other things with which you can play, so there are ways to do it. okay, so, ending quickly, we are losing our food. so these are different plots, and we’ll continue after the break. the rest of it is simply x versus y. so, for example, this is age versus education, and this is education versus age. is this one a rotation of this? yes. if this is education on the y-axis and age on the x-axis, or vice versa, then these two plots, one-and-two and two-and-one, are just mirror images of each other. rotated or mirrored depends on where you look, on where you put the mirror, but yes, mirrors. i remember when i was a kid, mirrors would confuse me. i would ask the question like this: when i look in a mirror, left and right get switched but top and bottom don’t. i never understood why. is it due to gravity, you might think? i mean, left and right get switched but top and bottom don’t.
i would see this when i came to the mirror, and then i thought it was something to do with my eyes, you know, maybe because they are left and right. so i looked at it this way, and that didn’t help. so, yes, it’s an important point when you see symmetry. it’s a good catch. realize that although there are so many plots, there are actually only half as many plots, because the plot on this side of the histograms and the plot on the opposite side of the histograms are the same. there’s another question that one could ask: many of these plots seem to show the points lined up in rows and columns. what are these rows now? what is this row? what does this mean? it means that this variable, fitness, actually has very few numbers in it. it has the numbers one, two, three, four and five. now why is that? because remember how i defined fitness: it is my perception of whether i was fit or not. in my original definition of the variables, here you go: self-rated fitness on a one-to-five scale, where one is in poor shape and five is in excellent shape. this is how the data was created. so in this data set i now have this variable in it. these kinds of variables sometimes cause difficulty, and there’s a word for them: these are sometimes called ordinal variables. so sometimes data is looked at as numerical and categorical, and categorical is sometimes split into nominal and ordinal. nominal means it’s a name: the name of a person; north, south, east and west; gender, male, female; a place, et cetera. the variable is essentially a name. ordinal is also categorical, but there is a sense of order: satisfied, dissatisfied, very dissatisfied. so there’s an order, therefore ordinal. this variable, the fitness variable, can if you wish be treated as an ordinal categorical variable. so, for example, the likert scale is like that, the seven-point scale: very dissatisfied, dissatisfied, moderately dissatisfied, neutral, moderately satisfied, satisfied, very satisfied.
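one way to handle an ordinal variable is to keep the labels and their order together, so that the numeric code is only ever used for ordering. the labels, the 1 to 7 coding and the responses below are assumptions for illustration.

```python
from statistics import median_low

# the order of this list is the order of the scale
LIKERT = [
    "very dissatisfied", "dissatisfied", "moderately dissatisfied", "neutral",
    "moderately satisfied", "satisfied", "very satisfied",
]
CODE = {label: code for code, label in enumerate(LIKERT, start=1)}

# hypothetical survey responses
responses = ["satisfied", "neutral", "very satisfied", "dissatisfied", "satisfied"]
codes = [CODE[r] for r in responses]

# the codes support ordering, so a median makes sense;
# median_low always returns a code that was actually observed
middle_code = median_low(codes)
middle_label = LIKERT[middle_code - 1]
print(codes, "->", middle_label)
```

averaging the codes would treat the gap between "neutral" and "moderately satisfied" as the same size as every other gap, which the scale itself does not promise; that is the sense in which the data is stored as a number but analysed as a category.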
this generates data on a scale of, say, 1 to 7 or 0 to 6, so it will show up in your database as a number. here, for example, instead of one to five you could have very unfit, moderately unfit, okay, relatively fit, very fit. instead of giving one to five you give it that way, and you code it up this way. your choice. so sometimes you have data that looks like this: python or any database will recognize it as a number because you’ve entered it as a number, but you analyze it as if it is a category. the opposite problem also sometimes exists, in that you get to see a categorical variable show up as a number, but you know it’s a categorical variable. a zip code is an example. a zip code shows up as a number, but it’s obviously not one. you can’t add up zip codes. you take two places in bangalore and you want to find a place between them; that’s not the average of the zip codes. you might be close, but you can’t do arithmetic with zip codes. the other difficulty with zip codes is that there can be many of them, which means that as your data set grows, the number of zip codes also grows. so the number of values that the variable can take grows with the data, and this sometimes causes a difficulty, because in the definition of the variable you now cannot say how many categories there will be. you know that there will be more zip codes coming; you just don’t know how many more zip codes will be coming. but you also know it is a categorical variable, so you can’t treat it like a number. so there are some special types of problems, like zip codes, that require special types of solutions. so the plot itself is a very computational plot: if it recognizes something as a number, it plots it. if you don’t want it to be plotted as a number, change it to a character. most software, including python, will allow you to do that. now this is in some ways a graphical representation of it. for the remainder of this session
we can talk a little bit about the numbers associated with it. so here, for example, is my data, age. you can also write it as dot age if you want to, and things like that. this is the mean: twenty seven point seven eight eight eight is the mean. so you can extract it. there are variations on the mean that can be recovered too, like trimming, et cetera. if you want to, you can calculate the standard deviation. remember the standard deviation formula, that strange formula that i wrote on the board? this is that standard deviation. you can do this for other variables as well. this is an interesting plot. i don’t want to go too far into it, but it’s an interesting plot from seaborn. there’s a warning on the code. what they’re referring to is a distribution plot. so this is a plot that tries to look not at what the data is but at what the distribution is. remember i was drawing these odd pictures, pictures like this, and drawing lines on them? those were distributions. so what it is trying to do is go after the distribution of the data. now what does this mean? it says that there is an underlying distribution of the age variable. this distribution is a distribution you do not know. however, you have a sample from that distribution. how big a sample? about a hundred and eighty observations. from that sample of a hundred and eighty, can you guess at what that distribution is? in other words, can you give me a curve? it’s an answer to that problem, and it gives a curve. why is the raw data not enough? the raw data is not enough, and this goes to the heart of what the statistical problem is, because i am interested not in the age of this particular group of people. i’m interested in the corresponding age of another, very similar group of people. why? what is the problem
i’m trying to solve? i’m trying to solve the problem of who is buying my cardio equipment. now, when are these people going to buy my cardio equipment? at some point in time. okay. now what is my data? for whom is this data? whoever i have the data for, they are people who have already bought. so i have a problem. my problem is that i want to reach a conclusion about my future customers based on my old customers. how do i do this? what mathematical logic allows me to say something about the future based on the past? yes. in short, the way to do this is to assume that there is what is called a population. we’ll talk more about that later. at this stage, assume a distribution. assume that there is a distribution, and from this distribution i have seen a sample today; from this distribution i will see a sample tomorrow. the people are not the same, because the people who used my cardio equipment yesterday are not the ones who are going to use it next year. if they were the same, i would never have a growing business. there’s no point analyzing data on customers unless i want them to buy more things or i have new customers coming in. so what is common between my observed set of data and the data for my new customers? that commonality is what you can think of as a distribution. so it says, from this, can you give me a sense of what this distribution is, and from this distribution i can think of other people coming. so what we’ll do tomorrow is talk about a few specific distributions and about calculations with those distributions. for now, what this graphic does is simply calculate that distribution for you. i’ll explain very briefly how it does that; i won’t go into too many details. what it does is, it takes averages of points. yes. i’m saying that for a sample. why is this so? why is the sample not the distribution itself? it’s a good question. why am i not saying ignore the curve?
why am i not saying that the original histogram, which you have seen three or four times before, is itself the distribution? that’s similar to the following question. let’s suppose that you have had your blood pressure taken, and you’ve got a few measurements. you’ve been tested twice today, let’s say before breakfast and after breakfast. next week you have also done this. let’s say, for whatever reason, you’ve done this once a week for a month. so now you’ve got four readings. no, four times two, eight readings. now these eight readings, are they the distribution? yes, if i want to understand what my blood sugar is and what it will be going forward if i do not get treated, then certainly there’s a relationship between this and what will happen in the future. for example, if i behave exactly the same way, if i eat exactly the same way, if i exercise or don’t exercise in exactly the same way, if i smoke, if my lifestyle stays exactly the same as it is, i would expect my readings to be the same. but what about them is going to be the same and what about them is going to be different, i don’t quite know. yes, it is true that those eight numbers are a representation of the distribution, but they’re not the distribution itself. if they were the distribution itself, i would be forced to say that in the next month i would have exactly these eight readings. but i know that’s not true. but i also know that from these eight readings i can say something about what will happen next month. it’s not that there’s useless information there. so if my readings this month, for example, are, let’s say, 110, 120, 115, 125, say in good health, 130, et cetera, i know that these are my eight readings. i know, with some confidence perhaps, that if things remain the same next month, they will not start becoming 220, 210, 215, 230. they will not become that. how do i know that? because i have these readings this month.
so the idea of a distribution is to be able to separate, in the data, the random part from the systematic part. the systematic part is what remains as the distribution, and around it there’s going to be random variation. that random variation is going to exist from data set to data set: between this month and next month, between this set of customers and another set of customers who buy cardio equipment, maybe from another branch of my store. say, for example, i am running a chain of stores. let’s say that i am, oh, i don’t know, not to pick names, but let’s say reliance fresh or something of that sort, and i want to understand how my stores are doing. let’s say i take five or six stores and study them extensively. how do i know that those results are going to apply to the remainder of my 500 or 600 stores? what is common between these five and the remainder? how are they representative of it? what part of it applies to the rest, and what part of it does not? how do i extend it? how do you extend your blood pressure readings to the next blood pressure readings? how do you figure this out? that is the heart of statistics. it is called statistical inference: to abstract away from the data certain things that remain the same and certain things that do not. so this curve is an estimate of that underlying true distribution of age, and so it’s not as rough; it’s smoother. how smooth is something that the plot figured out on its own, like the histogram, but you are free to change it. there are options within this plot. it’s a fairly sophisticated function with which you can do many things, and there are many, many options available within it.
so for example, this bins argument for the histogram allows you to say where the boundaries of the histogram part of it should be, whether you want to plot it or not, and whether you want to plot what i was calling a distribution. the gaussian kernel density estimate is a sophisticated way of saying the same thing, and there are options available to put into that, so you can change this. it’s one of the most sophisticated plotting functions that you will see. i wouldn’t suggest doing it now; get a little more experience in doing this first. but yes, gaussian will not make too much of a difference. what will happen is, if you smooth it out more here, it will look like a normal distribution. these little wiggles will go. we will discuss later, maybe tomorrow, when it’s a good idea to do that and when it is not. just hold on to that question a little bit. we haven’t talked about the gaussian distribution yet; i’ll deal with that when the gaussian distribution comes. for now, what this gives you is a visual representation, simply a descriptive representation, of the underlying distribution. hence it is called a distribution plot. yes, you take one sample earlier and then you take a current sample, that’s correct. let’s suppose, in an ideal world, that my thinking about the distribution is correct. then here’s what would happen. if i take my old data and i do a histogram, sorry, a distribution plot, and in the new data i do a distribution plot, these two should be very similar. the histograms may be different, but the distributions should be similar if i’ve done my analysis correctly. does that mean variance? i wouldn’t use the word variance; it’s variability. this is called sampling variability. in other words, it is variability that is due to the fact that you’ve taken a sample.
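a gaussian kernel density estimate can be sketched by hand: put a small gaussian bump on each observation and average the bumps. this is only an illustration of the idea, not the plotting library's code. the function name `gaussian_kde`, the fixed bandwidth of 3 years and the ages are assumptions; as noted above, the plot normally chooses the amount of smoothing on its own.

```python
import math


def gaussian_kde(sample, bandwidth):
    """Return a function that estimates the density at x by averaging gaussian bumps."""
    n = len(sample)
    norm = n * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample) / norm

    return density


# hypothetical ages, not the data set used in the session
ages = [18, 19, 21, 22, 23, 23, 24, 25, 25, 26, 28, 30, 33, 35, 38, 40, 44, 48]
density = gaussian_kde(ages, bandwidth=3.0)

# a density curve should enclose a total area of about 1
step = 0.1
grid = [i * step for i in range(0, 801)]  # ages 0 to 80
area = sum(density(x) for x in grid) * step
print("area under the curve:", round(area, 3))
print("density at 25:", round(density(25), 4), " density at 60:", round(density(60), 4))
```

a larger `bandwidth` gives a smoother curve with fewer wiggles, and a smaller one follows the individual observations more closely.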
there is an underlying truth, but you are not seeing that truth because you're taking a sample. there is an underlying level of your blood sugar, but you're not seeing that, because you have taken only a very small sample of your blood, a few milliliters when there are liters flowing around, and only at a few seconds in time when there are many hours in the day. there are so many other things that your reading could have been. but if it is a good sample, then i will be able to cover this variability. if i want to get a sense of what your blood sugar actually is when i sample it, what i will do is take samples in different kinds of situations. one thing to cover, for example, is before eating and after eating, and maybe i want to cover other things as well. for example, in certain kinds of diseases they are very conscious as to where to take the blood from, because the metabolism in the blood changes based on the disease. for example, you draw the blood near the liver. the liver is the body's filtration system, so essentially you want to figure out the nature of the blood when it flows into the liver and then after it flows out of the liver, in order to understand whether the liver is filtering your blood correctly or not. to do that you need to draw the blood in very specific places. so your experimentation should cover all of that. what does that mean in business terms? let's say that you're looking at sales data and you want to understand your sales distribution. don't focus on certain salespeople. look at your bad salespeople. look at your good salespeople. look at your high-selling products. look at your non-selling products. cover the range of possibilities. if you do not cover the range of possibilities, you will not see the distribution.
if you do not see the distribution, you will not know where the future data will come from, and if you don't know that, you'll not be able to do any prediction or prescription. the histogram is just a summary. this is also just a summary, but the histogram summary applies to just this data set; the distribution is pretending to apply to a little bit more. the definition of a distribution doesn't apply to the data. the distribution function, so to speak, is sometimes defined this way: $F(x) = P(X \le x)$, the probability that $X$ is less than or equal to $x$. so $F(15)$ is the chance that age is less than or equal to 15, and likewise for 16, for 17, and so on. now let me confuse you even more. $f(x)$, the derivative of $F(x)$, is the density function, and the distribution function is the area under that curve. the density function is what this plot is plotting. so the distribution function is the integral of the density function, $F(x) = \int_{-\infty}^{x} f(t)\,dt$, and the density function is the derivative of the distribution function, $f(x) = F'(x)$, if you want to be very mathematical about all of this. so what they're plotting is the density function. i should confess that this is actually called the density function. the reason i'm calling it a distribution function is that it says distribution here. i was hoping not to confuse you; clearly i have failed. go ahead. yes, that's the idea. now you've hit the problem of statistics bang on the head. how do i get an idea of a distribution that applies to everyone based on only the one sample sitting in front of me? that is the million dollar question, and that is why people like me exist. that is the whole point of the subject. and it is a hard problem, because you are trying to draw conclusions outside your data.
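A small sketch may help separate the two functions named above. The first part computes an empirical version of the distribution function, the fraction of observations less than or equal to a value. The second part checks numerically that the density is the slope of the distribution function, using a normal model purely as a convenient stand-in (the lecture has not introduced the normal distribution yet). The ages, the helper name `ecdf` and the model parameters are made up for illustration.

```python
from statistics import NormalDist

def ecdf(data):
    """Empirical distribution function: F(x) = fraction of observations <= x."""
    ordered = sorted(data)
    n = len(ordered)
    def F(x):
        return sum(1 for v in ordered if v <= x) / n
    return F

ages = [18, 21, 22, 22, 25, 26, 28, 31, 35, 47]  # made-up ages
F = ecdf(ages)
print(F(15), F(22), F(50))  # chance that age <= 15, <= 22, <= 50

# Density = derivative of the distribution function, checked on a stand-in model.
model = NormalDist(mu=29, sigma=7)  # hypothetical parameters
h = 1e-4
x = 25.0
slope = (model.cdf(x + h) - model.cdf(x - h)) / (2 * h)  # numerical derivative of F
print(slope, model.pdf(x))  # the two numbers should agree closely
```

The empirical function climbs from 0 to 1 in steps as the threshold passes each observation, which is why it describes only this data set, while the smooth model is a statement about where future data might come from.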
nobody is interested in your data. everybody is interested in their data, or in their problem. (laughter) yeah, nobody's interested in your data, but you still have to analyze the data that is in front of you and reach a conclusion that makes sense to them. the bank has to look at its portfolio and figure out what its strategy should be. the clothing store needs to look at its sales and figure out what clothes it should make. great learning has to look at its course reviews and figure out which faculty members to keep. you have to look at your expenditures and figure out how much salary to negotiate for. how do you do all of this, by the way? based on some sense of distribution. so when you go and negotiate for a salary, you're not going to negotiate for a hundred crores. you might, but you say no one's going to give me that anyway. maybe you're good enough, i don't know. but what you do is roughly the following: you figure out how much money you need and how much money you are expecting, and that to some extent is based on your expenditure and what you want to do, and your expenditure is in turn based on the fact that you have a certain income. you're doing all of this on a regular basis. you're standing on the road, correct, and you're trying to decide whether to cross it. how do you decide? experience. you've got past data, and that data is telling you to please cross the road. but that data has not seen that car. car number fifty three three six one nine, with this driver, has not been seen by your data set. so how are you crossing? because you're making the assumption that while i have not seen him, i have seen many others like him. so there's this story, right?
so, you know, a taxi driver is going at night on the road, and he's just running red light after red light; he just keeps going. the passenger is getting very scared and asks the driver to stop. the driver says, in hindi, apologies for the language, i am the lion of the road; who will stop me? he goes through all the red lights, and then there's a green, and then he stops. the passenger asks, why did you stop now? the driver says, what about the lion on the other side? so he's logical, right? his data is saying that there are people who cross the red light. therefore, if i'm at a green light, the cars on the other side, which have a red light, might cross. very logical. and we do this all the time. so while normal people are not trained statisticians, they behave like one, based on experience. now your objective, the objective of an analytics professional, is to translate this logic into an algorithm, into a procedure that the company and the computer understand, and that is not easy. for starters, let's say that you are here, and you say that this is an average, right? this is the mean age. this is 28.7888, and you could say that this is an estimate of the mean of the distribution. this is the mean of the data, but you are not interested in the mean of the data. why not? because you're not interested in this particular set of 180 people; you are interested in the average age of my customers. so now the question becomes: what does the age of my new customers have to do with the number 28? are they related? yes; you say that they are like a copy of what i have.
i think christie said that they're like a copy of what i have. then when i say 28.78 again, you'll say probably not; they're a copy, but not that much of a copy. most likely. now we're talking: how likely is most likely, and what about it is going to be the same, and what about it is going to be different? the y axis of the distribution, you say, is the same also. for example, you could say that this 28.78 is an estimate of the mean of the population distribution, which means that yes, it comes from the histogram, it comes from this data set, but it also comes from this distribution. but there's also this nagging feeling that i do not know for sure. i don't know what this new data is going to be. so what will happen is that we will not give the answer 28.78; we will give an answer that is like 28.78 plus or minus something. you say: i do not know what the population mean is, but i'm going to guess it is around 28. i know it is not going to be exactly 28, but 28 isn't useless for me either. it's going to be around 28. how much around 28? now certain criteria come in. what will this depend on? it will depend on the variation of the data, the standard deviation. if the data is very variable, this plus or minus will be large. it will also depend upon how many things i'm averaging over. if this was 180, i'm so sure; if this was 18,000, i'll be even more sure; if this was 18, i'll be less sure. so it depends on how much data is being averaged over. the more data i have, the surer i am about the repeatability of it, the surer i am that i will see something similar again. it also depends upon how sure i want to be, for example if i want to be 95% sure.
if i want to be 99% sure, or 99.9% sure: the more sure i want to be, the bigger the tolerance i must have. those are things we'll get to. those are also descriptions, but those descriptions are heading towards being able to predict. so if i give you this 28.78, i've given you a description of the data, but i'm not giving you a prediction, even though with 28.78 plus or minus something i have now begun to give you a prediction. today is about descriptive analytics. we're not predicting anything; we will get there. but this plot is in some way a first way of looking at this idea of a population and of a distribution associated with the population. yes? if the variation is less, the curve will be sharp. flat means variation is more. if the curve goes this way, it means that there is a lot of variation; i'm unsure about the middle. it's harder. do you need more data? not necessarily, though with more data the variation of the average would go down. so let's suppose that you have no control over your diet. i'm not accusing you of anything; it happens to humans. but let's suppose that you are doing a job in which your lifestyle is very varied. you travel from place to place, you eat in different hotels, sometimes you don't eat at all, sometimes you stress out a lot, sometimes you're running after trains, and sometimes you're sleeping for 12 hours in a row. your life is highly variable. there's nothing wrong with that; many people have very varied lives. now suppose i'm trying to measure the blood sugar of such a person. what must i do? try to control other variables, or at the very least, if i simply want to get a good blood sugar measurement, i have to measure it under many different circumstances. or you could argue: i don't control your circumstances.
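The three things the "plus or minus something" depends on (how variable the data is, how many observations are averaged, and how sure you want to be) can be seen in one common textbook formula, the normal-approximation margin z * s / sqrt(n). The lecture has not introduced this formula, so treat it as an assumed illustration rather than the method the course will use. The standard deviation of 7.0 and the helper name `margin_of_error` are hypothetical; 28.79 is the quoted mean age, rounded.

```python
import math
from statistics import NormalDist

def margin_of_error(std_dev, n, confidence):
    """Half-width of a normal-approximation interval for a mean (assumed formula)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # e.g. about 1.96 for 95%
    return z * std_dev / math.sqrt(n)

sample_mean = 28.79  # mean age quoted in the lecture, rounded
std_dev = 7.0        # hypothetical standard deviation of age

for n in (18, 180, 18000):
    for confidence in (0.95, 0.99):
        m = margin_of_error(std_dev, n, confidence)
        print(f"n={n:>5} confidence={confidence:.0%}: {sample_mean} +/- {m:.2f}")
```

Reading the printed rows: the margin shrinks as n grows, widens as the confidence level rises, and would scale up in proportion if the standard deviation were larger.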
i can control your circumstances. i can say, for example, go and measure it at this time, or take a glucometer and before going to bed do this, or after you've just had a very hard day do this. i can give certain instructions to cover all the corners. or i can simply say: i don't know, but what you need to do is measure your blood pressure, sorry, your blood sugar, say every six hours, and then tell me what happens. you need to do this often, because i expect your blood sugar to be highly variable, simply because your body is being put through an enormous amount of variation. in a business situation, let's suppose that you've introduced a new product. you do not know if this new product is going to sell or not. what will you do? i mean, how will you measure? you've just introduced it. based on past data, you say. but you've just released it; you can only measure current data. no, this is a different situation: i've just released a product, and all that is over. i have now just released this watch in the market. what typically happens is that people track the market very, very closely: the number of sales made, everything. the reason is that they're not sure how much this will sell. the question is what changed. what changed was your product release; a competitor could be reacting immediately. my point is not that there are many things to look at, which you should. my point is that when there is a change in the distribution, when there is an unknown distribution coming in front of you whose variation you do not know, you tend to get more data. you sample more frequently. you get more data, and you figure this out. we do this all the time. for example, for those of you who have kids, let's say your kid is going to a new school. what will you do? you'll ask a lot of questions.
you'll get more data. you'll find out what is happening in school, what the teachers are like, and so on, because there's too much variability standing in front of you. now with those answers, and after you do a few trips to the school, you may like it, you may not like it, but you are at least more informed. the distribution is now known to you. so you get more and more data. that's how you start getting that experience. if you have that experience already, in other words if you know the distribution very well and you're comfortable with it, fine. but it will take time to get there, and that's why this big data world is becoming so interesting: by the time you've understood a problem, the problem is not important anymore. there's a new problem now. this is good, right? that's why you guys have jobs. but it also means that when you have new data you solve a different problem; you don't solve the older problem better, which is what a statistician to some extent is trained to do: as you get more and more data, get the distribution better, get a better idea of the unknown, make a better product. the alternative view is: make another product, solve a different problem if you have more data. so the ceo is now saying: i have more data, give me more. more of what? solve another problem for me; give me new customers that i can go after, and things like that. that is a problem that statisticians and big data people often face, and it's not an easy problem to go after. as you have more and more data coming in, how do you utilize it? how do you make efficient use of this information? do you get tighter estimates of what you're going after? say you're doing sentiment analysis of text. when you do text analysis you will write, you know, twitter code, and you will do latent semantic analysis.
you will look at positive and negative sentiment scores and things like that. and now the question will be: you know that this is going to change; people's opinions are going to change. so over what granularity do you expect people's opinions to stay the same? do they change every day? if they change every day, there is no point looking at an average over days, because the average means nothing; every day is a different opinion. on the other hand, if their opinions change, let's say, on a monthly basis, then you can look at daily values and average them, and you will get a better estimate of that monthly rating. so you also have to make a guess as to whether i'm estimating a changing thing or whether i'm estimating a stable thing better, and that's not an easy thing to do. it's happened to me; i don't know whether this has happened to you or not. there have been times in my life when i have simply not had haircuts. i've gone six months, eight months; a haircut has been like a weight loss program. i did not care what i looked like. i'm not sure i do now, but when things became very unhygienic i would go and get a haircut. there have also been times in my life when i've been a lot more conscious of what others think about me. you can imagine at what points in my life. then i groom, i'm very careful, i get my hair done, and i'm getting my hair cut much more regularly. now what am i doing? in the second case, i'm trying to make sure that i'm reaching a certain standard, a certain target distribution that i have, and i'm interested in getting there and i'm intolerant of variability. i'm saying that i'm going to estimate this distribution and i'm going to stay close to it. in the first case i was perfectly okay with the variability. in certain cases you will be okay with the variability, and you'll not want to estimate the distribution that closely.
and in certain cases you will want to estimate it very, very well. you will want your hair to be done very correctly. you will want your product to be targeted to a very specific age group. you will want to know, when i am targeting this particular age group, what advertisements i want to show. you want to advertise on television, and you will want to know who is watching the program on which you're advertising. are they college people? are they professionals? are they old people sitting at home? who will use this, and therefore where will i advertise my cardio product? sometimes you will want to know this very precisely, or as precisely as you can. so this mean number, from a description perspective, is perfectly okay; it is just the average. but from an inferential perspective it is just the beginning of the journey. it is just one number, and we're going to have to put a few more bells and whistles around it. go ahead; you have lots of questions, clearly. okay, so we haven't talked about normal distributions; we will do that tomorrow. but statisticians need to make assumptions about data, and one of the assumptions is what he's talking about: it assumes a certain distribution. it says that i'm going to assume that the data has a normal distribution. that is an assumption. why do statisticians make assumptions like that? one reason is that the calculation becomes easier. now just because the calculation becomes easier doesn't mean the calculation is correct: if the assumption is wrong, the calculation is also going to be wrong. but because of the assumption you can do many of these calculations, and if you don't make those assumptions, these calculations become difficult or even impossible given the data at hand.
so a lot of the tests and procedures that we'll be talking about are going to make certain assumptions; we'll see one in about an hour or so. if the assumption is correct, i will have a strong model. but if the assumption is wrong, i will still have a model that is indicative. there was an economist, i think paul samuelson, no, george box, i think, the box of box-cox. he said that all models are wrong, but some models are useful. so the point is that it may still be useful. in many cases the distribution is expressly allowed to be not normal; the domain tells you that. let's say that you are in an engineering domain. you know the data has a certain shape, the engineering domain tells you the shape, and the shape is sometimes called a weibull distribution. what that means is that if you are reporting out, let's say, the failures of gas turbine blades, and i spent a number of years doing that, we had to report out the weibull distribution. we didn't report a normal distribution; we reported a weibull distribution. in the finance industry you report out a lognormal distribution, the means and variances of it. every industry has its own favorite distribution, because every industry has its own generic data form. now, even within the industry, a particular data set could violate that rule, and then it becomes interesting. there a statistician will use a more powerful tool set to solve that. this leads to certain complexities. the first complexity one often runs into is: which one, and do i do it differently from someone else? it is like a doctor who looks at a patient and says: you know what, the textbook says that i should do it this way, but this guy looks different; i've never seen anyone like him before. so let me ignore the textbook and treat him this way.
i think he'll get better. now, could i be right? i could, but i'm taking a risk. every time you make an assumption on your own and follow through on it, you're taking a similar risk. you could be right for that particular case, but you have far fewer precedents to go on, and as a result, when you extend it beyond to someone else, you're going to find it hard to do so. so people often make assumptions about distributions in a sort of historical sense: they've known that this has worked moderately well over a period of time, and they are very hesitant to change it for particular cases. sometimes they are allowed to; in regulatory terms they are often not allowed to. any accountants here? if you are an accountant, you know this: you have to do your books in a certain way. let's say that you are measuring cash flow. there is a certain way in which you will measure cash flow. now you may say that in this particular month your business was done differently, so i'm going to show a better cash flow this way. if you do, you know you're running into trouble. you may be right, in the sense that it may actually be a better way of doing it, but as soon as you go out of a very standard way of doing things, things will be a problem. and the same kind of logic applies to statistical analysis as well. so as a result, like an accountant, you are doing the right thing approximately, most of the time. in machine learning there's a term that you might see, alongside terms like supervised or unsupervised: it is called pac learning. it's a deeply technical field, and pac stands for probably approximately correct. i'm not telling you anything certain; if i'm wrong, don't believe me. but i'm probably approximately correct.
and the probabilistic part comes from statistical thinking, and the approximately part comes from machine learning thinking. it's a deep field, a serious field, but it puts a probabilistic statement on an approximation. so at the end of the day, whatever method you use, there is still the question of how generalizable it is. you will see that fairly soon, in a couple of months, when you do your first hackathon. all your hackathons will have a certain feel to them. a common theme for a hackathon is: i'm going to give you a data set; you build your model on that data set; and i am going to have a data set that i'm not going to show you, and i'm going to tell you how well your model has done on my data set. you have a day or six hours or whatever to work with your data set and show improvement on my data set. this is what you'll do; you'll do it twice, i think, in your schedule. what does that mean? it means that being very good on your data set doesn't necessarily mean you are successful. you have to be good on my data set, but i'm not going to give you my data set. this is not as impossible as it sounds; it is a very standard and typical problem. you will not find this hard; you'll find it very easy by the time you get there. you will get me ninety, ninety-six, ninety-whatever percent accuracy, not to worry. technically this is not hard. so there are two answers to that. one is: if the mean is different from the median, then you ask... no. mean equal to the median, in a distribution sense, means that these two numbers coincide. if the distribution looks like this, and i have a parameter mu, we're going to do this later: when statisticians use a greek letter, they're referring to something that they do not know. it is all greek to them.
so mu is a population parameter: it exists, but it is unknown. now, if the distribution is nice and symmetric like this, then this unknown thing in the middle can be estimated using a mean, or it can be estimated using a median. the question becomes which is better, and the answer, roughly speaking, is this: if there are many outliers, if this distribution tends to spread out in the tails, then use the median, for the reason that i gave: the median is stable to outliers. if the distribution has the more bell-shaped curve of this particular kind, the mean is more efficient; it is a better answer. but what if the distribution is not like that but like this? then the median may be here and the mean may be here. now you're asking a different question, and it is not a statistics question; it's a common sense or a science question: which one are you interested in? are you interested in per capita income, or are you interested in the income of the typical indian? for example, let me ask you this: give me one number, one representation, of the amount of time that you spend on a website. i'm asking for one number. don't tell me the number, but think in your head about how you would answer this. how much time do you spend on a website? by website what i mean is this. yes, the average, but what does the average mean? how would i do this? here's what i'm asking. you're cruising the web every day, let's say. what i'm asking for is a number for the amount of time that you spend on a website. you go to different websites and you spend a variable amount of time on each of them, for whatever be your purpose. sometimes you're just passing through, sometimes you're seeing a video, sometimes you're sending an email, whatever. and every session i'm thinking of as a different website. if you go to google twice, then i'm thinking of that as two
sites, session-wise, so to speak. now, i'm asking for a representative number. how would you come up with that number? what's a fair answer? if i do the mean, here is how i would do it. on a given day, i would take the first website i've gone to and find out how much time i spent there, then the second, the third, the fourth, and so on; i add this up and i divide. that's the mean, right? what would be the median? i look at all those times, i sort them, and i pick the one in the middle. which is going to be larger? it depends, and that is correct. but in this particular case, think of your typical browsing habits. everyone's browsing habits are different, but people who deal with network traffic deal with this problem on a regular basis, so here is what usually happens. most of your sessions are actually quite short. for example, you go to a website and you post a query, or you go to your gmail and check whether there's been a new email, or you go to a favorite news site and see whether something new is there or not. most of the pages you visit, you don't spend a lot of time on. but sometimes you go to a website and you spend a lot of time on it: let's say you write an email, or you watch a video. so what does your data look like? many small numbers and a few big numbers. this is what is called a heavy-tailed distribution. the histogram sometimes looks like this. this end is called the tail of a distribution; a tail, to a statistician, is not an animal thing; it usually refers to the end of a distribution. network traffic is a typical example of a heavy-tailed distribution. so here is what happens in this particular case: the mean and the median are carrying very different kinds of information.
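The browsing example above (many short sessions, a few very long ones) is easy to try with made-up numbers. The session times below are invented for illustration; only the standard-library `mean` and `median` functions are used.

```python
from statistics import mean, median

# Made-up session lengths in minutes: mostly quick visits, two long ones.
sessions = [1, 1, 2, 2, 2, 3, 3, 4, 5, 45, 90]

print("mean  :", round(mean(sessions), 2))  # add them up and divide
print("median:", median(sessions))          # sort them and take the middle one

# Dropping the two long sessions barely moves the median but changes the mean a lot.
short_only = [s for s in sessions if s < 30]
print("mean without long sessions  :", round(mean(short_only), 2))
print("median without long sessions:", median(short_only))
```

With data shaped like this, the median describes the typical quick visit, while the mean is pulled up by the handful of long sessions, so the two numbers answer different questions.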
the median is essentially saying: for a typical website that you go to, how much time do you spend? if that number is low, that is an indication that most of the time you are, shall we say, cruising or browsing. on the other hand, if you're looking at the mean and that number is high, then you know that you're spending a lot of time on certain very specific websites. and this points to two very different kinds of people. so the mean and the median are carrying different kinds of information, both useful. so, to go back to your question, it depends on what you're going after, and in certain settings you will see one of them naturally used as opposed to the other. there's also a third one, called the mode, which is actually harder. when we were students it used to be mean, median, mode. the mode is the peak of the distribution: what is most likely. and the reason the mode isn't talked about much is that the mode is algorithmically very hard to get at. the mean is a very simple algorithm. the median is a very simple algorithm. the mode is a harder algorithm; you can think about how to write a program for the mode if you want to. so what is the mode of this distribution, for example? let's take a look at one of them. this is income for men. what is the modal income? the modal income is here, somewhere around 55,000, where this maximum is. for women it's here, maybe just less than 50,000. so you understand what the mode is: it is the highest frequency, or the most common value. but in practice that's actually really difficult to compute. if i give you a set of numbers, how will you calculate the mode? when you see a spike, what is the spike? say i give you all your ages. how would you calculate the mode? one possibility is that you look at the ages and ask which age is the commonest?
you count the ages. but that almost means that your data is not numerical; you're almost thinking of the data as being categorical, because you're counting how many observations there are at each value. the idea of a numeric variable is that it is sort of continuous; it's not chunked up that way. so for data that is chunked up or categorical you can easily calculate the mode, but not for something that is not. the mode has therefore become less fashionable, because it's not a very easy thing to go after. when we were in college, the mode was actually quite easy to calculate. here is a histogram, and the way we would calculate the mode is this: take the highest class, draw a line from here to here and a line from here to here, so that the two lines cross, and where they cross is the mode. this is the way we would do the mode in the pre-computer era. i went to college when we didn't have laptops and things like that; running a program meant running to the computer center with pieces of paper. so many of these things were done by hand, and this is easy to do manually, but not that easy to do on a computer. the logic is twisted: you have to figure out what the bin width is. so my estimate of the mode and his estimate of the mode will be different from the same data set. that is not true for the mean or the median. and as soon as two different people find different answers to the same question, you know there's a problem with the statistic. so the mode isn't done as much these days. these are the histograms of my data; this is a way of separating out the histograms.
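Returning to the mode discussed above: counting the commonest value treats the numbers as categories, and a histogram-based mode depends on the bin width you happen to pick. The sketch below shows both on a made-up list of ages. The helper names `most_common_value` and `modal_bin`, the choice to start bins at zero, and the tie-breaking rule (lowest bin wins) are assumptions for illustration, and this is not the graphical crossing-lines construction described in the lecture.

```python
from collections import Counter

def most_common_value(values):
    """Mode by counting: treats every distinct value as its own category."""
    return Counter(values).most_common(1)[0][0]

def modal_bin(values, width):
    """Fullest histogram bin (start, end) for bins of the given width starting at 0.
    Ties are broken in favour of the lowest bin (an arbitrary choice)."""
    counts = Counter(int(v // width) for v in values)
    k = max(counts, key=lambda b: (counts[b], -b))
    return (k * width, (k + 1) * width)

ages = [23, 23, 23, 25, 26, 31, 32, 33, 34, 36, 38]  # made-up ages

print(most_common_value(ages))   # commonest single age
print(modal_bin(ages, width=4))  # fullest bin with width 4
print(modal_bin(ages, width=5))  # fullest bin with width 5
```

On this list the counting approach and the two bin widths point to different places, which is the problem raised above: two people using the same data can report different modes, something that does not happen with the mean or the median.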
in other words, looking at these histograms, by equal to gender and column equal to income essentially means this: the column says which variable, and the by says by which grouping, here gender. and they go side by side, because they essentially tell you what the difference in the distributions is. so what does this tell you? i could have shown this other plot here as well, or the code could have, but this says that there is a little bit of a difference between the male and the female distributions, in shape as well as in the actual values, so to speak. and so from a descriptive perspective, you can keep doing analysis of this kind to see whether there is a difference not just in the gender variable, but in other variables as well. do people travel the same number of miles on different devices? a plot like this will tell you. to compare these, what we can do in a later session, or you can do as an assignment after that, is see whether there is a statistical difference between the miles that are traveled on the different products. in other words, is there a difference between these three products in terms of how much usage they see? you can compare three distributions, and we will compare three distributions in time. okay. now the last thing i want to talk about today. we’ve talked mostly about univariate, which is one variable. we saw a little bit of a plot, but i want to talk about, shall we say, bivariate. bivariate means two variables at a time; if you want to talk about many variables at a time, that’s called multivariate. but before we get to many, let’s get to two. so far we’ve looked at two notions. we looked at the notion of location: location means that if there is a distribution, what is its middle, and that can be the mean or the median. and we have looked at variation, like the standard deviation, the range and the interquartile range. but when i look at distributions of two variables, there’s a little bit more to it.
there is a relationship between the two variables that i want to be able to capture: a sense of relation, or a sense of correlation. how do i measure whether one variable is related to the other variable or not? remember, i’m still describing. i’m still trying to find a number, like a mean, like a standard deviation. i’m simply trying to describe with a number: if that number is this, the correlation is high; if that number is this, the correlation is low. what should that number be? there are many, many ways of defining such a number. here is one, and it is there in the book. i’m going to do this slightly abstractly. so i’ve got numbers that look like this. these are my points. for example, if i look at a plot, take one of these: this is, say, miles and income, the amount of exercise done and income. each of these points has an x-coordinate and a y-coordinate. these coordinates i’m calling (x1, y1), (x2, y2), (x3, y3), (x4, y4), up to (x180, y180). you understand, these are pairs of observations: this is, say, (x1, y1), this is, say, (x2, y2), this is, say, (x3, y3), pairs of observations this way. x bar is what? x bar = (1/n)(x1 + ... + xn), and i’m going to write this as (1/n) times the summation, i equal to 1 to n, of x_i, simply because i’m going to try something a little more complicated now. if you don’t like the sigma notation, that’s fine; you can write it with dots. why add a little complexity here? y bar, which is the average of the y’s, is similarly (1/n) times the summation, i equal to 1 to n, of y_i. now i’m going to write something here. i’m going to write the summation of (x_i - x bar)(y_i - y bar). i’m going to write that down, and i’ll tell you why i’m writing that down. but look at that. what is x_i - x bar? it’s sort of like a variation, or a spread, of x_i from its average. similarly y_i - y bar. okay. when is this term, (x_i - x bar)(y_i - y bar), positive? when both of these are positive, or both of them are negative.
both of them are positive means what? both of them positive means x_i is above its average and y_i is above its average. both negative means x_i is below its average and y_i is below its average. so imagine a data set that looks like this. where are x bar and y bar? somewhere in the middle here. here is one line, and here is another line. for all the points here, x_i is above its average and y_i is above its average. for all the points here, x_i is below its average and y_i is below its average. which means most of these terms are going to be positive. i may still have a point, for example this point, where it is negative. but when my data looks like this, this number will be positive. what happens when my data looks like this? when the data looks like this, then x_i is above its average, sorry, y_i is above its average and x_i is below its average; that is, one of these is positive and one of these is negative. that means this term is negative. so when the data looks like this, this becomes negative. what happens if the data looks like this? the positives and the negatives will cancel. this number being negative means that when one is high the other is low. now take, for example, height and weight. height and weight means what? the taller you are, the heavier you are: a positive relationship between the two. my doctor says that i am about four or five kilos overweight. i say, no doctor, i’m about two inches too short. i don’t have a weight problem, i have a height problem. it’s your interpretation. so you want a statistic that captures whether your variables are moving together or in opposite directions. opposite directions, for example, might be something like the weight of a car and the mileage of a car. bigger cars have less mileage, which means that if you have a car of above average weight, it probably has a lower than average mileage. so this particular measure,
this summation, when i divide it by n minus 1 to take an average effect, is called the covariance of x and y. covariances are very heavily used in certain areas. they’re heavily used, for example, in dimension reduction, in principal components; you’ll see that in time. they’re used in finance, in portfolio management and things of that sort. this is called the covariance of x and y. what is the covariance of x and x? the covariance of x and x means that instead of y, i’ll just put x. this becomes 1 over (n minus 1) times the summation, i equal to 1 to n, of (x_i - x bar) into (x_i - x bar), which means (x_i - x bar) squared. this is the square of the standard deviation, and it is sometimes called the variance of x, which is the same as the standard deviation of x, squared. so the thing before i took the square root, that’s called the variance. with the square root it is called the standard deviation; without the square root it is called the variance. by the way, it’s all there in the book, so in case you didn’t get it, you can see the video or you can read the book. these are very standard definitions. so the covariance is a measure of the nature of the relationship between x and y. if the covariance is positive, they’re moving in the same direction. if the covariance is negative, they’re moving in opposite directions. if the covariance is 0, then many things can happen: either the data looks like this, and there’s no relation, or maybe the data looks like this. no, not a normal distribution. this is not a distribution. this is, for example, price and profit. what is the usual relationship between price and profit? by the way, this is a typical relationship between price and profit: on this side, as price goes up, your profit increases, because you’re getting more money per product, and on this side, with an even higher price, fewer people buy your product, so your profit goes down. now for such a thing, the average is somewhere here.
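the definition above can be sketched in a few lines of python. this is an illustration under stated assumptions, not the course’s code: the function name covariance and all the numbers (heights, weights, car weights, mileages, prices, profits) are made up. it checks the three cases just described: a positive covariance, a negative one, and a zero covariance that hides a curved price and profit relationship. it also checks that the covariance of x with itself is the variance.

```python
from statistics import mean, variance

def covariance(xs, ys):
    """Sample covariance: sum of (x_i - x_bar)(y_i - y_bar), divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)

# Made-up data, purely for illustration.
heights = [150, 160, 170, 180, 190]          # cm
weights = [52, 58, 69, 77, 84]               # kg
car_weight = [900, 1100, 1300, 1500, 1700]   # kg
mileage = [24, 21, 18, 16, 13]               # km per litre
price = [1, 2, 3, 4, 5]
profit = [10 - (p - 3) ** 2 for p in price]  # rises, peaks at price 3, then falls

print(covariance(heights, weights))     # positive: taller goes with heavier
print(covariance(car_weight, mileage))  # negative: heavier cars, lower mileage
print(covariance(price, profit))        # 0.0, although profit clearly depends on price
print(covariance(heights, heights) == variance(heights))  # True: cov(x, x) is the variance
```

the price and profit case is the one to remember: the positive products on one side of the peak cancel the negative products on the other side.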
so the covariance also becomes 0. another way to think of it is that it is positive on this side and negative on that side. so if this is 0, it doesn’t mean that there is no relationship. it could mean that there is a complicated relationship, something that is positive on one side and negative on the other side. i remember once doing an analysis about attrition, why people leave companies, and inside it there was a model in which we were, for some reason, trying to understand where people stay: do they stay close to the office or do they stay far away from the office? and what do you think is the relationship between, say, experience and distance to home? we had normalized for that; in other words, think of it as just experience. we were looking at populations in which experience loosely translates to age, though there could be people who join the company very old. i agree with that, but let’s simplify life and say that you have a data set with experience in it. here’s what we found: early on in their careers, people lived close by; in the middle they moved away; and towards the end they again came closer. now this was an observation. there’s no science to this. it was simply seen that in that particular company this particular thing would happen. but remember, the point is not just to describe. the point is also to predict, to understand, and things like that. so we had to build a story around this when we went to the cmd and said, here’s what we have done. and the story we made up, correctly or incorrectly i don’t know, is that in the beginning, to some extent, people have low dependencies: typically, coming in, you’re unmarried, a bachelor, et cetera. you also need to work a lot harder, so staying close by is convenient. you get a pg or you get an apartment.
you stay close to work, because staying far away from work gets you no particular benefit; it is just inconvenient. but as you reach, in some way, middle age, so to speak, things become very complicated. there is a spouse, who may have a job. there are kids, there are schools. there are the kinds of houses that you can afford. and so this becomes a more complicated optimization problem, and you may not be able to find a solution to that problem close to work. but people who survive even longer in the company earn enough to solve this problem through other means, and then what happens is they move back close to work again, you know, buy a villa close by, et cetera. and now there are multiple cars to take people elsewhere, and the kids are often grown up, so the number of dependencies is a lot less. you may agree with the story. you may disagree with the story. but the point is that there’s a complicated relationship that you’re trying to explain based on what the data is. now, the use of it i won’t talk about much. so this number is a number whose sign, positive or negative, tells you about the nature of the relationship. but only the sign tells you that; the value is much harder to interpret. the reason is that i can measure these things in whatever units i want. suppose i am measuring, say, height and weight, and i measure height in centimeters and weight in kilograms. that’s one answer. but i can measure height in feet and weight in pounds and get a different answer. i can even make this number much higher by measuring height in millimeters and weight in milligrams. don’t ask why i’d do that, but i can. so this, as a value, is entirely dependent upon the units of measurement, which makes it a problem. so what statisticians do when they reach this situation is they normalize things; they make the unit go away. and the way the unit goes away is that you divide this by the standard deviation of x and you divide this by the standard deviation of y.
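a small sketch of that normalization, again in plain python with made-up heights and weights. the function names covariance and correlation are assumptions for illustration. changing units from centimeters and kilograms to millimeters and milligrams blows the covariance up by a factor of ten million, while the covariance divided by the two standard deviations stays put.

```python
from statistics import mean, stdev

def covariance(xs, ys):
    """Sample covariance with the n - 1 divisor."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Covariance divided by both standard deviations, so the units cancel."""
    return covariance(xs, ys) / (stdev(xs) * stdev(ys))

# Made-up data, purely for illustration.
height_cm = [150, 160, 170, 180, 190]
weight_kg = [52, 58, 69, 77, 84]
height_mm = [h * 10 for h in height_cm]         # 1 cm = 10 mm
weight_mg = [w * 1_000_000 for w in weight_kg]  # 1 kg = 1,000,000 mg

print(covariance(height_cm, weight_kg))   # one answer
print(covariance(height_mm, weight_mg))   # ten million times bigger, same people
print(correlation(height_cm, weight_kg))  # unit-free, between -1 and 1
print(correlation(height_mm, weight_mg))  # the same value again
```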
now i can do this on the board without writing anything again, but i would suggest you write the whole formula again. when i divide this by the standard deviation of x and the standard deviation of y, the units cancel out. now this value being one means x_i is one standard deviation above average, in whatever its units, and y_i is, say, two standard deviations above average, in whatever its units. the unit has gone away. this number is called the correlation between x and y. and the correlation between x and y is a number between 0 and, i’m sorry, a number between minus 1 and 1. the correlation is between minus 1 and 1. if the data looks like this, then it is 1. if the data looks like this, then it is minus 1. this is the correlation. it is a measure of the relationship between two variables, measured in this very peculiar way. it is not just a measure of the relationship; it is a measure of what i would call the linear relationship between x and y. a nonlinear relationship or a strange relationship could cancel out, positive and negative, and end up with zero or a low number. so if the correlation is close to plus 1, there is a strong positive relationship between the two. a strong positive relationship means what? if one of the variables is above average, then the other is also very likely to be above average, and vice versa. so what i can do is this: when i take mydata and i do .corr() as a function, this gives what is called the correlation matrix. again, it will be calculated only for the things with numbers. in other words, if you give it a data frame and this doesn’t happen, then just make sure that you take only the subset of it which has only the numbers. do not calculate correlations for things that aren’t numbers. if they’re not numbers, there are other ways to calculate association; we’ll see that later as well. now, based on this, what do you see? first of all, the correlation between age and age is 1. why? well, it’s the 45 degree line, right?
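the lecture gets its correlation matrix from the data frame’s .corr() method. as a library-free stand-in, here is a rough sketch of what such a matrix contains. the table below is made up, and correlation and correlation_matrix are hypothetical helper names, not part of any library. the sketch keeps only the numeric columns, and the result has 1 on the diagonal and is symmetric.

```python
from statistics import mean

def correlation(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def correlation_matrix(table):
    """Pairwise correlations for the numeric columns only; other columns are skipped."""
    numeric = {name: col for name, col in table.items()
               if all(isinstance(v, (int, float)) for v in col)}
    return {a: {b: correlation(numeric[a], numeric[b]) for b in numeric} for a in numeric}

# Made-up table, purely for illustration.
table = {
    "Age": [22, 25, 31, 38, 45, 29],
    "Gender": ["M", "F", "F", "M", "M", "F"],  # not numeric, so left out
    "Usage": [3, 4, 2, 5, 3, 4],
    "Miles": [85, 110, 60, 160, 95, 120],
}

matrix = correlation_matrix(table)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

each variable against itself gives 1, and the entry for (usage, miles) equals the entry for (miles, usage), which is why people often read only one triangle of the matrix.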
by definition it is one. the question is: this is a number that comes from one data set with one kind of relationship; does that say anything about the practical world, so to speak? that is another way of saying what i have been saying all along: how does your data have anything to say about these relationships outside the data? the problem is very clear here, maybe, but the problem exists for anything. so for example, there is a correlation of 0.28 between education and age. 0.28 means that there is a positive relationship, but not a very strong one. where is that graph? this is age. education was the second one, right? so this one, right? or this one, whichever way. this shows that there is a weak positive relationship between them. when one goes up, the other does have a slight tendency of going up. now i should warn you that there is no sense of causation here. there is no sense that if x goes up then y goes up, because the correlation of x and y is the same as the correlation of y and x. by definition this is a symmetric concept. it makes no attempt at causation. that’s a different thing altogether. so this is a positive relationship. it’s a weakly positive relationship. usage and education is about 0.4. income and education is about .62. miles and usage is about .48. miles and fitness is about 0.78. let’s see miles and fitness. this is miles and fitness. nothing in this data set has a negative correlation, but you might have seen it if one was negatively related to the other. close to zero: you’re looking for low correlations, right? so age and usage, for example, is a very low number. so is age and miles. in other words, age doesn’t seem to have much to do with things, shall we say,
other than income, but age and income doesn’t really have much to do with your product per se. it will be useful when you do clustering and related things; variables like that are useful to find a segment. rich old people, always an interesting segment. yes? what does a correlation of zero mean, that there is no relationship between the variables? it could mean, for example, a plot that looks like this. let’s take a variable. so closest to 0 is what? age and usage. so where is age and usage? usage, and this one. so this is age and miles; that also is something low, probably. this one. now, there’s no relationship between them in the sense that, well, there probably is a relationship in the variability; in other words, there is more variability here than here. but if i want to draw a line through this, the line doesn’t have a positive slope or a negative slope. there’s no idea that says that if one of them is above average, the other is also likely to be above average. so a low correlation means that there is no sense that one being above average is related to the other being above average. neither increasing nor decreasing. correlations, in toto, are really hard numbers to interpret, but they’re also very useful summaries, particularly for large data. the question that he asks, does this make any sense in the real world, has two components to it. component 1 is: is the relationship between the two related to a linear concept? so for example, i was talking about height and weight. what in the real world should be the relationship between height and weight? linear? so if i plot height versus weight, i should see a straight line. okay. now she’s going to say, not necessarily, after removing outliers. we’re all outliers, aren’t we? okay, have any of you heard of a concept called the bmi, body mass index? in this day and age, we’ve all heard of body mass index. kavitha, what is body mass index? weight by height... no, there’s no age in it. weight by height squared. so bmi is weight by height squared.
so if bmi is weight by height squared, what does that tell you about the human body? weight by height squared is what is called bmi, and this number, let’s say, should be around 25 if you are healthy. what does that tell you? if you are taller, what will happen? how should your weight increase? no, there’s a square here. so roughly, let’s say that this is correct. if this is correct, what does that mean? it means that weight is approximately 25 into height squared if you are healthy. that means if i see a bunch of very healthy people and i plot the weight versus the height, i should see a curve like that, not a straight line. she’s disputing this. now, why? if i’m twice as tall, shouldn’t i be twice as heavy? yes, if you want to give it a fancy name, correct, that is a parabola, undoubtedly true. so you could ask, why is it weight by height squared? let’s ask a different question: why isn’t it weight by height? weight by height means what? so let’s suppose that you take these two bottles. they’re not exactly the same, but suppose that they are. this is a certain height, and this is a certain height. if i put this on top of this, what happens to the weight? if these two are exactly the same, the weight is going to double. or if i take two of these; i don’t see two of them, i apologize. but anyway, okay, here’s one more. so these two: if i put this on top of this, the height doubles and the weight doubles. so therefore, if i look at objects such as this, then by doubling the height i double the weight, so weight by height remains a constant. correct? so if i’m looking at bmi for bottles, it should be weight by height. if you are a bottle, your bmi would be weight by height. okay. now imagine that you are a football. if you’re a football and your height doubles, how much bigger would you be? you understand the problem with the football.
it is now twice as high. how much heavier is it? by a factor of what? the height is double; the volume has gone up by what? no, no. what is the volume? 4 by 3 pi r cubed. it has gone up by a cubic factor. so for a ball, the bmi should be weight by height cubed. so you’re not growing like a cylinder, and you’re not growing like a football. you’re growing like something between a cylinder and a football. we all are, not you personally, which is why it looks like that. we don’t grow like cylinders. if we grew like cylinders we would be a lot thinner. imagine yourself when you were, you know, five or six, and now imagine only your height grew. you’d be looking very thin, right? similarly, you don’t grow like a football either. imagine yourself at five or six, and now imagine you grew in every dimension in the same way. you’d be a lot fatter than you are now. so therefore this relationship depends on the empirical relationship between height and weight for the data that is available, which is of humans growing. and so empirically people have discovered that this is the object that should be invariant. this is an example of what’s called dimension reduction: two variables are being combined into one which is carrying information for you. but it relies on a nonlinear relationship between the two, which is not going to be fully picked up by the correlation. so the correlation goes so far and no further. it is not one of the more analytically useful things. very often we do test a hypothesis, is the correlation 0 versus is the correlation not zero, to ask whether the correlation is real or what is often called spurious. and in a later class, i think about two or three residencies from now, you will spend some time on things like spurious correlations; in other words, i am finding a relationship between x and y, but is it real or is it due to something else? it’s hard as a basis for acquisition.
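the bottle and football argument can be checked numerically. the sketch below is an idealisation and an assumption, not real data: it invents five "healthy" people whose weight is exactly 25 times height squared, with height in metres and weight in kilograms, and then asks which ratio stays constant across them: weight by height (the bottle), weight by height squared (the bmi), or weight by height cubed (the football). ratio_spread is a hypothetical helper name.

```python
# Idealised, made-up "healthy" people: weight is exactly 25 * height^2.
heights_m = [1.50, 1.60, 1.70, 1.80, 1.90]
weights_kg = [25 * h ** 2 for h in heights_m]

def ratio_spread(power):
    """Range (max minus min) of weight / height**power across the group."""
    ratios = [w / h ** power for w, h in zip(weights_kg, heights_m)]
    return max(ratios) - min(ratios)

for power, shape in [(1, "bottle"), (2, "bmi"), (3, "football")]:
    print(shape, power, round(ratio_spread(power), 3))
# Only power 2 gives a spread of (numerically) zero for this idealised group.
```

for data generated this way, weight by height drifts upward with height and weight by height cubed drifts downward; only the squared version is invariant, which is the sense in which bmi is the object that "should be invariant".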
it gives you some summary of the data. it is at best a descriptive measure of association. sometimes people want to see it in another form. this is what’s called a heat map. it is exactly the same thing as a correlation matrix, except that a heat map gives you nice colors, and you can change those colors. here, so to speak, is the index of what the color is: minus one is pale blue, positive et cetera, and positive is in the same direction. so it gives you a sense of what the color means. heat maps are for when you have lots and lots of variables; this is too small a set of variables for a heat map to be useful. so for example, let’s suppose that you’re looking at a product catalog with a few thousand products in it, and you’re trying to find the correlation between sales of those products across time and across geographies. you do a heat map, and you find those regions where the products are sort of clustering up. we often do it in medicine through what are called microarrays. essentially we look at data from genes, and let’s say there are thousands of genes, and you look at the expression levels of each of these genes, and you say these are the genes that have been expressed and these are the genes that haven’t been. so if you are doing correlations of thousands of variables, or hundreds of variables, often a nicely arranged set of variables with a heat map gives you a good picture of the data. so a heat map in this form is exactly the same as the correlation matrix, except that it adds colors to the numbers, so that you’re not looking at the numbers; you get a visual picture. the traditional choice is that hot is related: red is related, and white is not. but there are many ways in which you can change the coding of the heat map’s colors. okay, now comes, to some extent, a tool that is descriptive.
however, it is the first predictive tool that you will see. i do not want to use it like a predictive tool, but i’ll still show it. so let me show you what the end product is. the end product is that i want to summarize the relationship between, say, miles, usage and fitness, relationships between variables of this kind. here’s an equation: miles is equal to minus 56.54 plus 20 into usage plus 27 into fitness. this is, shall we say, a targeted equation. what is this equation as far as i’m concerned today? it is a description of the data. but the description of the data will be used in order to predict how many miles my instrument will run. so think of what the instrument is: the instrument is going to be an engineer-designed instrument, and i’m trying to figure out how much it will be used. how many miles will it be used? to do that, i will figure out whether people consider themselves fit or not and how frequently they use it, and using that i want to get an equation for the number of miles this will run. is there a descriptive way of getting at that equation? this kind of equation is what’s called a linear regression model. this is your first model. this is going from descriptive to predictive. i haven’t done it yet. i’m just saying what i’m trying to do. self-rated fitness on a 1 to 5 scale? mmm, i’ll get there. okay, so maybe i shouldn’t have shown you the output. it is always dangerous to show good people output; never show output, moral of the story. so what i want to do is keep it deliciously vague. so y is equal to beta naught plus beta 1 x1 plus beta 2 x2. i want to fit an equation of that type. why do we have multiple variables? i can do it with one variable. maybe life is simpler with one variable. here the data has been given to me with two variables. so in the code, as you see, you can have one variable, you can have two variables.
you can have whatever number of variables you like. i think they’ve chosen two. you know, i once had a professor who was going from the bivariate distribution to the multivariate distribution. he did everything on bivariate distributions, and then at the end he said, now put 2 equal to n. and we were telling him, i’m sorry, it doesn’t work that way. if you do it for n, i can put n equal to 2, but if you want me to put 2 equal to n, which 2 do i replace? so he’s saying, i’ll show it to you for two. but if i show it for two, you can do it for one, and then you can do it for three; you can do it for any number. we can write it for one also, if you want to. so let’s look at that. what am i trying to put here? i’m trying to put miles here, and i’m trying to put usage here, and i’m trying to put fitness here. i forget which was where, but anyway, these two variables. how am i using it to describe? you want to think of it this way: if i give you three variables, how do i describe the relationship between them? there are three variables. an equation in the form of something like that is one way of doing it. now, does that mean that in reality, as he might say, there is a relationship between these three things? not necessarily. maybe, i don’t know, not necessarily. correct, not necessarily. so when you do linear regression in the future, or any regression for that matter, there will be three uses of it. use one: it is simply descriptive. it will simply describe the nature of the relationship to you. it will make no causal inference, no sense that this causes this. it will give you no predictive model. it simply describes, and we’ll discuss how it describes. two: it predicts. predict means that when i put in another value of x1 and another value of x2,
i will get a different value of y, which means that i’ve looked at data from all of you, and a new person comes into the room with a new x1 and x2, and i’ll then put her numbers in and i will predict her y. that is a predictive use of a model. third, prescriptive: in order to get a different, targeted y, what changes should i make in my x1 and x2? to get different usage of the equipment, what behavioral changes do i need to make in people to get them to use it more? that is an even more complicated use of the same thing. so the same model, the same principle, can be used for different uses. i am using it simply as a description, simply as a way to summarize. not univariate, not bivariate, but trivariate or multivariate. i can do that with a three-by-three correlation matrix, but i choose to do it this way. now, what number am i looking for? fitness is here; the average number of miles a customer expects to walk or run; the average number of times the customer plans to use it. so i’m going to give it this variable and this variable and try to get an outcome for the middle one. the way to do it is something that i won’t talk about too much. there’s sklearn, which is one of the learning modules that they use in the sense of supervised learning. you import linear_model and use linear regression as a function: scikit-learn has a function here called LinearRegression, which comes from linear_model. you’re giving it a y. what is the y? the thing on the left hand side of the equation. what is the x? the thing on the right hand side of the equation. what is regr.fit? regr.fit means regression fit. this fits my x and y, and this outputs something; it doesn’t output anything at this point in time. now i have my regression coefficients, and my regression coefficients are 20 and 27. my regression intercept is minus 56.54, and my equation is miles equal to minus 56.54 plus 20 into usage plus 27 into fitness.
how is this interpreted from a purely descriptive perspective? it means that, for example, if usage remains the same and my fitness goes up by 1 unit, then my miles goes up by 27. if my fitness remains the same and my usage goes up by, say, one hour or one unit, then my miles goes up by 20. what does minus 56 mean? if you don’t use it at all and you have zero fitness, you have done minus 56 miles. that makes no sense, but neither does 0 fitness. so the model is not necessarily written in a way in which this intercept makes sense, which is why in the software the intercept is not treated as a coefficient. the intercept is a part of the equation, but it is not one of the coefficients that you interpret. this is pure description. what does it do, in case you’re asking, and i hope you don’t? what it does is this. it looks at the data, and what is my data? my data is, say, y, x1 and x2. and it looks at (y_i - beta naught - beta 1 x_1i - beta 2 x_2i), whole squared, added up over all the data points. this, beta naught plus beta 1 x1 plus beta 2 x2, is my prediction from the equation. this is my actual. which prediction is the closest to my actual? in what sense? find the difference between the prediction and the actual, square it, and then minimize it with respect to beta naught, beta 1 and beta 2. so what are beta naught, beta 1 and beta 2? they are the parameters, estimated in such a way that, if i estimate them this way, this plane is the closest to the data. in what sense? in the sense that the difference between the predicted and the actual is the smallest. don’t worry, you’ll do this again. this is a very important thing in supervised learning, in prediction mode. in description mode, all that is necessary is that it describes the nature of the relationship between miles, usage and fitness. describes in what way? in addition to the interpretation of the numbers,
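the lecture fits this model with scikit-learn. to show what "minimize the squared difference" amounts to without any library, here is a hedged sketch in plain python that solves the least squares problem through the normal equations. everything in it is an assumption for illustration: the helper names solve, fit_least_squares and sum_squared_error, the made-up usage and fitness values, and the made-up miles, which are generated from the lecture’s equation plus some arbitrary deviations. it is a stand-in for, not a copy of, what the library does internally.

```python
def solve(a, b):
    """Solve the linear system a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_least_squares(x1, x2, y):
    """Return [beta0, beta1, beta2] minimising sum (y_i - b0 - b1*x1_i - b2*x2_i)^2."""
    rows = [(1.0, a, b) for a, b in zip(x1, x2)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    return solve(xtx, xty)

def sum_squared_error(betas, x1, x2, y):
    """Sum of squared differences between actual y and the plane's prediction."""
    b0, b1, b2 = betas
    return sum((yi - b0 - b1 * a - b2 * b) ** 2 for a, b, yi in zip(x1, x2, y))

# Made-up data: the lecture's equation plus arbitrary deviations.
usage = [2, 3, 3, 4, 4, 5, 6, 7]
fitness = [2, 3, 2, 3, 4, 4, 5, 5]
noise = [3, -2, 1, -4, 2, -1, 4, -3]
miles = [-56.54 + 20 * u + 27 * f + e for u, f, e in zip(usage, fitness, noise)]

betas = fit_least_squares(usage, fitness, miles)
print([round(b, 2) for b in betas])
print(sum_squared_error(betas, usage, fitness, miles))
print(sum_squared_error((-56.54, 20.0, 27.0), usage, fitness, miles))  # never smaller than the fitted one
```

the fitted plane is, by construction, the one with the smallest sum of squared errors for this data, so even the equation the data was generated from cannot beat it.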
there’s also something else interesting here: the positive sign. what does the positive sign mean? it means that as fitness goes up, miles goes up, and as usage goes up, miles goes up. so this summarizes the relationship between three variables, treating one of them as an output. this is a descriptive use of linear regression, as a way to describe data. is the description real? to be decided, to be confirmed, to be analyzed, to be understood. right, you do not know. it is empirical. it is based on data. why is it necessarily true? is there a logical reason why this has to be the case? yes, you can do it with one variable. you can remove one. if i remove it, what happens? you can do that yourselves: instead of usage and fitness, just have one of them there. i have not given you any idea as to whether the description is good. i’ve not told you whether this model is a good model or a good equation, in the same way that i did not tell you whether the correlation was good or whether the mean was good. i’ve not given any quality assessment to anything. these are ways to describe. the quality of the model, how accurate is my mean, how good is my prediction: these are things that belong to inference, and we’ll come to them. we can’t answer those questions before we get to probability and have a sense of the language for it. yes? fitness and usage, yes. that’s true. so you’re saying that it doesn’t make sense for certain values, which is true. as i said, i am not saying that this is a good predictive model. what will happen is that you will study a model like this and you will ask certain questions. what questions might you ask? for example, here’s a question that you would ask: if i fit a model like that, is this coefficient that is in front of this variable actually equal to zero?
because if it is actually equal to zero, then there is no relationship between the output and that variable. so what we do is we ask for a statement of this kind: if, say, $y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i}$, i ask, is $\beta_1$ equal to 0? statements like these are called hypotheses. because if $\beta_1$ is 0 then this variable should not be in the model, and therefore this variable has no predictive power over the output, which is where the analytics part becomes interesting. but to answer that question, i need to have a sense of how i know whether this is 0 or not, and for that i need to have a sense of what the error around that number is. so this number is not 20. it is 20 plus minus something, in the same way that my mean age of 28 was not 28. it was 28 plus minus something. and if that plus minus something includes 0, then i can’t say that this is not zero. if on the other hand that plus minus does not include 0, i can say it is not zero, and that makes it a predictive model. that’s coming. but for now, this is simply a way to describe data. like for means, like for correlations, like for standard deviations, and for linear regression, all of these will now see an inferential phase to them. the mean must see a plus minus. the regression coefficient must now see a test: is it equal to 0, is it not equal to 0? all these estimates will now be put to an inferential test, an interpretive test: how useful is it for new data? because just describing current data is not going to be good enough for me. i’m writing an equation like this. i want to write miles is equal to $\beta_0$ plus $\beta_1$ into usage plus $\beta_2$ into fitness. the code now tells me what these numbers are. this number is minus 56, this number is plus 20, and the third number is plus 27. that’s it. you can call it intercept or whatever you like.
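the "plus minus something" idea in this passage can be sketched for a single predictor as follows. this is a rough illustration under stated assumptions, not the inferential procedure the course develops later: the data is made up, the helper names are hypothetical, and the multiplier `k=2.0` is a rule of thumb standing in for a proper t quantile.

```python
import numpy as np

def slope_with_error(x, y):
    """Return the least-squares slope and its standard error for y = a + b*x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    b = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    a = y.mean() - b * x.mean()
    resid = y - (a + b * x)
    s2 = np.sum(resid ** 2) / (n - 2)  # residual variance
    return b, np.sqrt(s2 / sxx)

def interval_includes_zero(b, se, k=2.0):
    """Rough check: does b plus-minus k standard errors cover zero?"""
    return b - k * se <= 0.0 <= b + k * se

x = np.arange(10)
wiggle = np.array([1, -1] * 5, dtype=float)    # made-up noise pattern
b1, se1 = slope_with_error(x, 2 * x + wiggle)  # y really depends on x
b2, se2 = slope_with_error(x, wiggle)          # y does not depend on x
print(interval_includes_zero(b1, se1), interval_includes_zero(b2, se2))
```

in the first case the interval stays away from zero, so the variable carries information about the output; in the second the interval covers zero, so one cannot say the coefficient is not zero.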
whatever your term is. yes. in x, just put in another variable: comma, another variable. it can be any number. you can try it out, and you can do it now. if i could plot it i would, but remember there are three variables. why am i doing this? because with two variables i can plot it. i can also look at many variables, two at a time, and see a correlation. but if i have three variables, plotting things becomes difficult. if i have 4 variables, plotting things becomes even more difficult, but you still do it. i think you have tableau in your curriculum, maybe, i’m not sure, but visualization techniques can help you. but if you’re going to 10 variables then plotting is not a way to do it. so how do i express the relationship between 10 variables? by equations like this. what does this intercept mean? it is: if this is 0 and this is 0, what is this? but as we have said, this zero doesn’t make sense and this zero doesn’t make sense. this is simply a line that goes through the data. if i have data that looks like this, for example, all it does is fit a straight line. does the intercept, where the line cuts the axis, make any sense? maybe, maybe not. this is the place where the data makes sense, but the equation is written so that the line cuts the axis here. if i find a relationship between height and weight and i write the equation as weight is equal to $\beta_0$ plus $\beta_1$ into height, what is $\beta_0$? $\beta_0$ is the weight of someone who has height 0. makes no sense, but giving me the freedom to have a $\beta_0$ here allows me to get a much better line, because i can move this line up and down in order to get the best fit. it allows me extra flexibility. don’t worry, in fitting good models you will have enough experience in doing this. my purpose is just to show it to you as a way to describe three variables in one shot.
again, i’m not building a model. you can do it for two of them, just miles and usage; that gives an equation of this kind with one predictor. with only one variable you wouldn’t do this, because there is nothing to model. remember, my purpose is not to use this to select which variables to model. when i’m calculating means and standard deviations and correlations, i’m not using them to select anything. i’m not saying that i will measure your mean because you’re important, or i’ll measure your standard deviation because you’re low. i’m using this as a tool to summarize three or four variables. which variables to use is a question for the interpretive mode. you can look for high correlations, and there are many other techniques that you will learn in order to figure it out. so just like a mean is a way to do analytics, correlation is a way to do analytics, and a standard deviation similarly. so what we had done yesterday is we had spoken essentially about descriptive statistics. descriptive statistics is the taking of data and simply describing it, with the later purpose of either visualizing it, or writing a report, or using it for inference and prediction in later courses or later applications. it is compared with predictive statistics, or predictive analytics, and then prescriptive. describing is simply a task of summarizing a given set of numbers. you will do sessions in visualization in due course. prediction is a task that is often a machine learning or data mining professional’s requirement: to say that if something changes, then what happens? i should have made a comment that there are two english language words that mean more or less the same thing. one is forecasting and one is prediction. in the machine learning world, these words are used a little differently. forecasting is usually in the context of time. so something has happened in the past; what will happen in the future?
i’m giving you this week, what will happen next week? that is forecasting, in the future. prediction is usually used without any sense of time. prediction is like: i’m giving you an x, you give me a y. i’m giving you one variable, you give me another variable. so predictive analytics doesn’t necessarily forecast anything, despite the fact that prediction itself is what forecasting means. so the words mean slightly different things. it’s a little like, you know, price and worth mean more or less the same thing, but priceless and worthless mean different things. so the words are used in slightly different contexts. so in descriptive statistics, we had looked at certain ways of doing things. for example, we had looked at what is called univariate data. univariate means one variable. for univariate distributions, we have seen certain kinds of descriptive statistics. some of them were about, shall we say, location. location meant where is the distribution, and we had seen for example things like means and medians, which talked about where the distribution is located. we talked about things on variation, where we talked about standard deviation. we’ll talk more about things like this today: standard deviation, range, interquartile range. here also we had terms like the quartiles, the upper quartile, the lower quartile. these are parameters that are used in order to convey a message to someone, saying what the data is about. so for example, a five-point summary talks about the minimum, the 25% point, the 50% point, the 75% point and the maximum. irrespective of the number of data points: you could have 10 of them, you could have a hundred of them, you could have a million of them, you could have a billion of them. it doesn’t matter, it’s still five numbers. sometimes those five numbers tell a lot. they tell about location.
they tell about spread. they talk about skewness: is the distribution sort of tilted towards one side, is there more data on this side than on the other, in terms of the data spreading out towards the tails? and there are plots associated with this as well. we’ll talk a little bit about the plots later. then we went, towards the end, towards the idea of, let’s say, bivariate data. bivariate means that there are two variables, on which we didn’t spend a lot of time. we talked about covariance and correlation. covariance is a sense of the variability of two variables together. its univariate version is the variance, which is the square of the standard deviation. the scaled version of covariance is the correlation. if the correlation is close to plus 1 then it means that there is a strong positive relationship between the variables. positive means if one goes up the other also goes up, and if one goes down the other also goes down. negative means the opposite: as one goes up, the other goes down. correlation is not to be confused with causation. there is nothing in the description that says that this causes this. there is no science to this. this is simple description. the science to it and the logic to it, and the use of it for inference, for business logic and things like that, will come a little later. for now, we are simply describing. then we had taken an even briefer, and perhaps even more confusing, look at a multivariate summary, or a first multivariate summary, where we looked at the idea of linear regression. a linear regression is an equation of the form $y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$, where one variable is written as an equation of the others. this is merely done to describe the nature of the relationship between the variables. it can be used for prediction. it can be used for prescription if you wanted to, but that is not the purpose here. our purpose is simply to describe a relationship.
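the univariate and bivariate summaries recapped in this part of the session can be sketched in a few lines, assuming numpy is available. the function names are hypothetical, the numbers are made up, and the percentile points use numpy's default interpolation, which is only one of several conventions for quartiles.

```python
import numpy as np

def five_point_summary(values):
    """Minimum, 25% point, 50% point, 75% point and maximum."""
    return np.percentile(np.asarray(values, dtype=float), [0, 25, 50, 75, 100])

def covariance(x, y):
    """Sample covariance: how two variables vary together."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

def correlation(x, y):
    """Covariance scaled by both standard deviations, so it lies in [-1, 1]."""
    return covariance(x, y) / np.sqrt(covariance(x, x) * covariance(y, y))

values = range(1, 10)  # made-up data: 1, 2, ..., 9
print(five_point_summary(values))
print(correlation([1, 2, 3], [2, 4, 6]), correlation([1, 2, 3], [6, 4, 2]))
```

note that `covariance(x, x)` is the variance of x, which is the sense in which variance is the univariate version of covariance.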
why is this useful? because let’s say that you’ve got three variables, four variables, 10 variables. you need a mechanism to say how these variables are connected. how do you describe 10 things at a time? there are famous graphics in history where you have many variables being represented on one plot or one visualization. so visualizing things itself is one way. for example, we looked at certain kinds of plots. we looked at histograms. we looked at box plots. we looked at pairs plots, which were essentially what are called scatter plots. these are things for the human eye to see data, and they have their limitations, because we can only see data in a certain way. we can’t see very high-dimensional data. visually, we can see up to three dimensions, maybe. for those of you who are interested in such things, or are in the graphics world: people there spend a lot of time asking, how can i make people see things? so how many dimensions can you actually plot? python itself is good at it, but there are other devices. so for example, let’s say that you’re plotting. you can have of course one variable as x and one variable as y. another dimension can be maybe the size of the point: it gets bigger as another variable z becomes larger. it can be a color, like a heat map: a fourth variable, if it is low, can be blue and, if it is high, can be red. another may be the shape of it: lower values are circles, higher values are more pointy. so there are many ways in which you can get summarization to be done. when you do visualization, you’ll see other ways of summarizing it. but if you want to do it as a number, then something like an equation that looks like this is often a good representation. how one gets at these $\beta_1$s and $\beta_p$s i’ll explain very briefly. what happens is you form this equation and you take those values of $\beta_0$, $\beta_1$ and $\beta_p$ that are closest in some sense to the data.
so if i draw a picture of, say, two of them, y on x, and i say give me a line, which line should i take? take the distance from the line to the points and make this distance the smallest. get a line that goes through the data with the smallest distance to the points. how is small measured? small is measured by the square of these distances, because distance above the line and distance below the line are equivalent. so if this is my $\beta_0 + \beta_1 x$ and this is my $y$, what i do is i look at $y_i - \beta_0 - \beta_1 x_i$, for $i$ equal to 1 to $n$, my $n$ points. square it and add it up: $\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$. this is the sum of the squared distances from the line, and then i minimize this with respect to $\beta_0$ and $\beta_1$. that is how i get the numbers. but if you’re simply interested in what python or r does, then the program will simply give you what the number is. what will you get from it? you will get the value of $\beta_0$ and $\beta_1$. it finds the value of $\beta_0$ and $\beta_1$ such that this is the smallest. for different values of $\beta_0$ and $\beta_1$ this distance will be different; for different lines, this distance from the line will be different. which line will i take? the line such that this is the smallest. how to get the betas? think of this $(y_i - \beta_0 - \beta_1 x_i)^2$ on a plot. the points already exist. the points are here. the line is what i’m trying to find. here is a point, here is a point, here is a point, here is a point, here is a point. let’s say five points. i want to describe the relationship between these five points. therefore what i need to do is find a line that goes through these points. i want to write an equation of the type, say, remove the betas, $y = a + bx$. i want a line like $a + bx$ going through those points. there are many lines. this is one line. this is one line. this is one line. this is one line.
there are many lines. which line will i use to represent the relationship between y and x? i need a criterion. so what i do is i say, let me find a line, let’s say that this is the line, and find out how good it is at describing the data. now when is it good at describing the data? when it passes close to the points, because that is its purpose: to describe the data. i want to say that this line, without any data points, is a description of the data. the line’s position, that’s what i’m talking about. so i need the value of a and b. so how do i find the value of a and b? for every such line, for every a and b, i find the distance of the points to the line. how many points do i have here? i’ve got five points, so i’ve got 5 distances. what are the points? this is the point $(x_1, y_1)$. this is the point $(x_2, y_2)$. then $(x_3, y_3)$, $(x_4, y_4)$ and $(x_5, y_5)$. these are my five points. how far is the point $(x_1, y_1)$ from the line? this distance. and what is this distance? this point is $y_1$. what is this point? it is $a + b x_1$; that’s the point on the line. so the distance is $y_1 - (a + b x_1)$. okay, i can stop here. but if i stop here, what will happen is that this will become a negative distance and this will become a positive distance, and they will cancel, or neutralize, as you say. which is the equation? this is the equation of the line: a is the intercept, b is the slope, and i want to find a and b. so this equation is $a + bx$, this point is $y_1$, and the length of this line is $y_1 - a - b x_1$. square it. plus, for the second one, what is it? $(y_2 - a - b x_2)^2$. do this five times. for every line you will get this number. if you want to, you can take a square root. this number is the sum of squared distances of the line from the data. it tells you how far the line is from the data. the larger this number is, the further the line is from the data.
the smaller this number is, the closer it is to the data. if the data is on the line, what is the value of this? 0. so if every point is on the line, or if the data is itself a straight line, then this will be 0. so i have formed this. now i find the value of a and b such that this is the smallest. for every a and b i will get a value like this. if i take another line, i will get another value of this. for every choice of a and b, i will get a different distance from the data. which a and b will i pick? that a and b such that this distance becomes the smallest. so choose a and b to minimize this, and that is the a and b that the software gives you. this is called a linear regression. does it have an answer? does it have a problem? this is why gauss was so successful and laplace was not: you will get a unique solution. this is called a convex problem, and this is a convex optimization because of the squaring. if you have modulus values here, there is a possibility that you will not get a single answer. but because of this square, and because of the nice bowl-shaped curve that the square function gives you, you will find a unique solution to this. no, the system doesn’t do it that way, by trying every line. the way the system does it is this: it differentiates this with respect to a and b. it differentiates with respect to a and sets it equal to 0, differentiates with respect to b and sets it equal to zero, and solves those two equations. it doesn’t search for the minimum. when this becomes very high-dimensional, this minimization, this differentiating and solving, becomes a very interesting problem in mathematics and numerical analysis.
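the five-point picture above can be sketched numerically, assuming numpy is available. the five points and the candidate lines are made up for illustration; `np.polyfit` is used here only as a convenient way to obtain the least-squares line, not as a claim about what the course software calls internally.

```python
import numpy as np

def sum_sq_dist(x, y, a, b):
    """Sum of squared vertical distances from the points to the line a + b*x."""
    return float(np.sum((y - (a + b * x)) ** 2))

# five made-up points
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# a few arbitrary candidate lines, each given as (a, b)
candidates = [(0.0, 1.5), (1.0, 2.0), (0.5, 1.8)]
for a, b in candidates:
    print(a, b, sum_sq_dist(x, y, a, b))

# the least-squares line; polyfit returns the slope first, then the intercept
b_best, a_best = np.polyfit(x, y, 1)
print(a_best, b_best, sum_sq_dist(x, y, a_best, b_best))
```

every candidate line gives a larger number than the fitted line, and data that lies exactly on a line gives 0.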
to do that you typically need to do linear algebra. in cases such as this, and in machine learning books, you will often find chapters on optimization and linear algebra at the beginning of the book, because of this or something similar to this: to represent a problem you often need a matrix representation, and to get a good learned solution you need an optimization. most machine learning algorithms are built that way. for example, we were saying yesterday that you’re going to tell someone to do something. i’m going to tell a car to behave itself on the road. this morning i read that bmw and daimler are setting up a, you know, 1 billion euro r&d operation somewhere in europe for self-driving cars, etc. two different industries are trying to go towards making cars that don’t need people: the automobile industry as well as the ride-hailing industry, people like uber and lyft and all these companies. so now the car is going to have to be told when to go and when to stop. but how does the car know that it has a good rule? how does it know that it has learned, and what is good learning as opposed to bad learning? what is enough learning as opposed to not enough learning? a computer is stupid. all a computer can do is store a lot of data and do calculations quickly. computers aren’t intelligent. to make the computer intelligent, you have to give it an intelligent function. you have to say, okay, run your algorithm such that this thing becomes the highest it can be, or this thing becomes the lowest it can be, which is an optimization problem. so what machine learning algorithms almost invariably do is they say: here is an input and here is an output. give me an algorithm such that, based on the input, you can come closest to the output.
for example, object recognition. if i’m teaching computer vision or text recognition, let’s say text recognition, i’m trying to understand what the word is. so the computer is reading something, let’s say in handwriting, and trying to identify that as an english language phrase or a kannada or hindi phrase. so i’m going to write something down. in my horrible handwriting i’ll write something, and the camera has to recognize what i wrote and transcribe it into something that you can read. now how does it know it’s done a good job? what it needs to know is: this is what i think the word is, and this is what the correct word is; now tell me whether i’m close. if i’m close, i’m good. if i’m not close, i’m not good. but this has to work not just for one word. this has to work for thousands of words. so i must be close to thousands of words at the same time. therefore i need to measure the distance between my prediction and my actuality over many, many data points. so what all these algorithms do is they take your prediction and compare it with the actuality, and they find a distance between them, and they minimize the totality of the distance between the prediction and the actual. an algorithm that minimizes that distance is a good algorithm. it has learned well. so they all do something like this, where this is the prediction, and this is the actual, and a and b are the parameters in the prediction. in other words, find a prediction such that it is closest to the actual. so this algorithm has become very popular. it’s probably the single most popular fitting algorithm out there. this is called least squares. squares because here’s a square, least because you’re minimizing. the least squares algorithm is a very standard way of doing things. this has nothing to do with the algorithm itself. the algorithm can be anything.
this itself can be a neural network. it can be a support vector machine. it can be a random forest. it can be an association rule. it can be any of your logics. but the question is, if you give me the program, how do i know whether the program is good? so i give it what’s called training data. training data means i tell it what the answer is. yes. so this line is a prediction, if you want to think of it that way, and the data points are the actuality. the problem is this: these data points are also not the actuality. the actuality is going to come in the future. this is a training set. it is the data that is being given to the algorithm to train it, but it is not the data that the algorithm will actually run on. the car will run on the road. the car will see its data points, people and other cars and cows and whatever, for the first time. it will not have seen that data before, but it will need to know what to do. so what do you do? you train the algorithm. what does training the algorithm mean? it means you give it data for which the car is told what to do. in other words, you give it what they call ground truth. you give it the y and you say, here is the data, or here is a situation; please do the right thing. so here’s a person who’s crossing the road: please stop. here is another person crossing the road, but he’s very far away: calculate the distance, compare it with your speed, and decide what to do. he may be close enough for you to be able to see, but you may not stop. if you are driving a car, it’s quite possible that you are seeing someone crossing the road maybe a hundred meters ahead, and you’re not slowing down, because you’re doing the calculation that i have a speed, the person crossing the road also has a speed, and by the time i get there this person would have gone. you do this with very low-level processing.
and we do this all the time. we do this when crossing the road, and we’re happy doing that. there is a car coming but i’m still crossing the road. why? because i know that i will be able to cross the road before it gets there. so the car needs to be taught how to do these things. so the algorithm is given data like this and told: for the training data, get as close as you possibly can to the training data. and then it’s given what’s called test data, and now the algorithm is told: now i’m going to give you new data, and you’re going to tell me how well you did on the new data. so suppose you were given a problem like this. i’m not supposed to talk about this here, your ml instructors are supposed to talk about this, but suppose you are given a problem like this. in other words, i have given you a data set and i’m going to tell you that your performance will be judged not on this data set but on another data set that i am not giving you. what will you do? how will you make your program ready? how will you make your program generalizable? the usual way it’s done is something interesting. you say, okay, you want me to predict data that i have not seen; i will see if i can do that. so of all the data that is available, you take a certain part and you keep it aside. you have it, but you don’t use it. on the remaining part of the data you build your algorithm, and now you test it on the data that you yourself have but have kept aside. this is called validation data. and if your algorithm works on your own held-out data, the data that your algorithm has not seen, you’re more hopeful that it will work on somebody else’s new data. this is called validation, and this entire cycle is often called train, validate, test. you will do this in your hackathons. but to do this, the algorithm needs to know how good it is, and it needs measures like this. there are other measures.
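the hold-out idea described above can be sketched as a random split of row indices, assuming numpy is available. the function name, the 60/20/20 proportions and the seed are illustrative assumptions, not something the lecture prescribes; in the setting described, the test portion stands in for the data set you are not given.

```python
import numpy as np

def train_validate_test_split(n_rows, frac_val=0.2, frac_test=0.2, seed=0):
    """Shuffle row indices and cut them into train, validation and test parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_rows)
    n_test = int(round(n_rows * frac_test))
    n_val = int(round(n_rows * frac_val))
    test = idx[:n_test]                  # judged on this, never used for fitting
    val = idx[n_test:n_test + n_val]     # kept aside to check the fitted model
    train = idx[n_test + n_val:]         # used to fit the model
    return train, val, test

train, val, test = train_validate_test_split(100)
print(len(train), len(val), len(test))
```

the model is fitted on the `train` rows, checked on the `val` rows, and only at the end scored on the `test` rows.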
so for example, if you’re classifying, good or bad, positive sentiment on a tweet or negative sentiment on a tweet, there are no numbers, and then you do not need this. what you need is simply: are you correct or are you incorrect? if it is correct, let’s say you give yourself zero distance. if it is incorrect, let’s say you give yourself one distance. in other words, you made one error, and you just count how many mistakes you made. but what is a mistake when it comes to estimating a number like, say, miles, and things like that? there’s no mistake. you are either close or you are far. so you need a measure of how close, and that’s a measure of how close. so this descriptive method is used as a criterion for building predictive models. this itself can be a predictive model, but it is very rarely good enough. too few things in the world are this simple. are there things in the world that are this simple? yes. but as we discussed yesterday, even things like height and weight are not that simple. there are complexities to that. so for example, you can have theories. you could say, let’s say, there is a savings rate. what is the savings rate? the savings rate is the proportion of money that you save. so if there is a savings rate, what that would mean is that if i take your income data and i take your consumption data, that should form a straight line, because you’re saving the same proportion every month. but it’s not. if you go home and month by month you figure out what your income was, fairly precisely, from your salary or from other sources, etc., and you also plot, again as precisely as you can, how much your household spent that month, it will have an increasing effect probably, but it is very, very unlikely to be a straight line. in certain things you may be going after a law of physics, but the law of physics may hold for gravity and not hold for anything else.
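the two ways of scoring mentioned at the start of this passage, counting mistakes for labels and measuring closeness for numbers, can be sketched side by side, assuming numpy is available. the function names and the example values are hypothetical.

```python
import numpy as np

def count_mistakes(predicted_labels, actual_labels):
    """0 distance for a correct label, 1 for an incorrect one, summed up."""
    return int(np.sum(np.asarray(predicted_labels) != np.asarray(actual_labels)))

def sum_squared_error(predicted, actual):
    """For numbers there is no 'mistake', only how close or how far."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sum((predicted - actual) ** 2))

# made-up sentiment labels: one of the three predictions is wrong
print(count_mistakes(["pos", "neg", "pos"], ["pos", "pos", "pos"]))

# made-up miles predictions: off by 3 and by 4
print(sum_squared_error([100, 80], [103, 76]))
```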
i remember trying to apply this one day. one-day cricket sort of became popular when i was in school or thereabouts, and one calculation was done to figure out whether a team is doing well, or how well a chase is going. one possibility is to simply track the score. the other possibility is to say that if you know how many runs you’re going to get, you tend to begin slow, protect your wickets, and then you accelerate. so what you do is you build models for that. you build models saying, let me assume that the team is going to accelerate constantly, which means that every over that comes later is going to get better. its run rate is going to keep increasing steadily. now if its run rate keeps increasing steadily, then when will it reach the halfway point? that is the same thing as asking the question: if i take a ball and i drop it, how long will it take to get to the halfway point? there is a square root term there, and the answer is about 50 divided by the square root of 2, about the 37th over or something of that sort. so effectively the logic was: if you’ve reached the halfway point by, let’s say, the 37th over, you are on track. if not, you need to accelerate faster. that’s using a physical law to try and predict something that is not a physical law. the laws of physics don’t apply to cricket, at least not in the way that i am describing. so these laws will get you somewhere, like a straight line, etc., but they are approximations. and so what you will do is you will build better versions of this when you use it for an actual prediction. the same argument holds for things like means, standard deviations and many such things: if there’s a specific problem you need to solve, you may get a better estimate for doing it. yes, someone was asking a question. yes. so there are many ways to do that. one is you just put it in: for different values of a and b,
you find what that number is, and then you take the a and b for which it is smallest. if you want to do it the hard way, you can still do it the hard way, and the hard way will end up being something like this. i’m minimizing, and i should not be talking about this, say $\sum_i (y_i - \beta_0 - \beta_1 x_i)^2$. i was using a and b, so say $\sum_i (y_i - a - b x_i)^2$. i’m going to minimize this with respect to a and b. so essentially what i’m going to do is call this, let’s say, $L(a, b)$, and i will say $\partial L / \partial a = 0$ and $\partial L / \partial b = 0$. i will solve these, and this will give me 2 interesting equations. i’ll tell you what the answer is. your $\hat{b}$, your estimate of b, will be this: $\hat{b} = \sum_i (x_i - \bar{x})(y_i - \bar{y}) / \sum_i (x_i - \bar{x})^2$. and your $\hat{a}$ will be this: $\hat{a} = \bar{y} - \hat{b}\,\bar{x}$. so if you want formulas, these are your formulas. to minimize something is the same thing as setting its derivative to 0. now that is also the same as maximizing something, but this is where convex optimization comes in: this will have a minimum but it won’t have a maximum. so by setting the derivative equal to zero, i’m going to arrive at the minimum. yes, x and y are fixed. the data is fixed. the parameters are varying. $x_i$ and $y_i$ are fixed for my data. so my a is written in terms of b, and this is a formula in closed form, if you want. the slope can also be written as the covariance of x and y divided by the variance of x. so if you want to calculate it for two variables, what you need to do is calculate the covariance and divide by the variance. and here, $\hat{a} = \bar{y} - \hat{b}\,\bar{x}$ means that the line passes through $(\bar{x}, \bar{y})$: the line passes through the middle of the data. we’re minimizing with respect to a and b; those are the variables. how will we do this?
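the closed-form answers quoted above can be checked against a library fit, assuming numpy is available. the data is made up and `fit_line` is a hypothetical helper name; the comparison with `np.polyfit` and with covariance divided by variance is only a sanity check.

```python
import numpy as np

def fit_line(x, y):
    """Closed-form least squares for y = a + b*x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    b_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a_hat = y.mean() - b_hat * x.mean()
    return a_hat, b_hat

# made-up data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

a_hat, b_hat = fit_line(x, y)

# the same slope written as covariance(x, y) / variance(x)
b_from_cov = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# a library fit for comparison; polyfit returns the slope first
b_lib, a_lib = np.polyfit(x, y, 1)

print(a_hat, b_hat, b_from_cov, a_lib, b_lib)
```

evaluating the fitted line at the mean of x gives the mean of y, which is the statement that the line passes through the middle of the data.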
so for different values of a and b, the distance will look like this. there’s a particular value of a and b for which the distance will be this. there is another value of a and b for which the distance will be this. there’s a particular value of a and b for which the distance will be this. of course, this is my $\sum_i (y_i - a - b x_i)^2$. then for different values of a and b, i will get this, and i minimize this. you don’t need to do it by hand. all you need to do, if you want to do it, is this. do you have yesterday’s code? open it. you can do it. now, that’s it? i get the line, then what? one use of it is to predict. another use of it is to prescribe. there are many uses of it. a third use is to do nothing but simply to use it to visualize or to summarize the relationship between two variables. and we do this all the time. so for example, how do you measure how price-sensitive your product is? do you understand the question? you’re trying to change the price of your product. why would you want to change the price of your product? profitability. maybe you want to increase it so you get more money. so people in marketing often want to understand how sensitive sales are to price. to do that, they come up with various kinds of measures. one particular measure is what is called the elasticity of demand. elasticity of demand means this: if my price changes by 1%, by what percentage do my sales change? well, if my price goes down, i would expect my demand to go up, but by how much? now there are certain assumptions to this. for example, it’s assumed that the same number works if you increase price as well as if you decrease price. so this is called the elasticity of demand. but what is the elasticity of demand? elasticity of demand is essentially a slope.
a slope that relates something like this: if i have demand on this side, let’s say sales on this side and price on this side, i have this negative slope, and the slope of this is what the elasticity is. so very often you fit equations like these in order to simply get at a number that has a certain meaning for you. so the slope of a linear regression between log sales and log price is the elasticity of demand for that product. i mention log sales and log price because elasticity is defined in terms of percentages: a percentage increase in price and a percentage decrease in sales. if i don’t do it as a percentage, there’s a problem: now my measure depends on my units. is it thousand units per rupee or what? it depends on what i’m selling and in what currency, and that’s not a good measure. so i measure it as percentages, but when i measure in percentages i have to work on the log scale. so there are many models like this where the equation itself is used simply to describe a parameter, something that tells you a little bit about the market, like an elasticity of demand. you’re not using it to predict anything; you’re simply using it as a descriptor, to say that this is an inelastic product. if this is an inelastic product, what does that mean? it means that if you change its price, there won’t be too much of a change in its demand. a classic example of that is salt. if you change the price of salt a little bit, at least certainly domestic salt, there won’t be too much of a change in demand. there might be a little bit. but certain things are highly elastic: you change the price a little bit and the demand will change a lot.
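the log-log idea can be sketched in a few lines. this is only an illustration under stated assumptions: the prices and sales are simulated, the "true" elasticity of -1.5 is an arbitrary made-up value, and numpy’s polyfit stands in for whatever regression routine you prefer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated market (an assumption for illustration): sales fall as price
# rises, with a constant elasticity of -1.5 plus a little multiplicative noise.
true_elasticity = -1.5
price = rng.uniform(50, 150, size=200)
noise = rng.normal(0.0, 0.05, size=200)
sales = 1e6 * price ** true_elasticity * np.exp(noise)

# Regress log(sales) on log(price); the slope is the estimated elasticity.
slope, intercept = np.polyfit(np.log(price), np.log(sales), deg=1)

print("estimated elasticity:", slope)
```

because both variables are on the log scale, the slope reads as "percent change in sales per one percent change in price", so it does not depend on the currency or on the unit in which sales are counted.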
and marketing people are very sensitive to this idea, asking: is my demand elastic or is it inelastic? if i want my prices to go up, then i want the demand to be inelastic, because i don’t want my demand to go down. but if i’m pulling my prices down, then i want the demand to be elastic, because i want people to say that your prices are going down, therefore i will buy more. so marketing analytics is very concerned with things like this. so therefore sometimes an equation of this kind is built just to describe something. so what i’m going to do is go down in the code. and since we are going to do this on just two variables, let’s change this to maybe miles, and let me remove this. so i’m just going to do it on one of them. i suspect one pair of brackets might not work; i suspect this might work. x has to be two-dimensional, simply because of the way it is entered, because i have not done anything on this data set now. this one is a comment. what do i have for the coefficients here? 36 and minus 22? so what is my equation based on this? miles is equal to minus 22 plus 36.29, whatever, into usage. okay, all right. now let’s try to do this manually. if i want to do this manually, i need to get at each of these things. so i need to find, let’s say, the covariance between miles and usage. how do i do that? tell me. no, not a sample; the data is present with me. so now i have things like, for example, mydata miles, correct? so i can calculate things on this. for example, i can do this: this is the mean. i can do calculations on this. okay. so now what is the standard deviation syntax? std, okay. this is the standard deviation. let’s try one more. this is the variance. what is the variance? the square of the standard deviation. now i want to find the covariance of this.
how will i find the covariance? not necessary. remember, i have the correlation function. how did i find the correlation here? i found it from here. so i’ve got a number of ways to potentially do this. one is i can do it with the correlation function, or the covariance function in other words. so, for example, i can try doing this. how do i write it here? mydata. this thing here was the correlation. this gives the covariance matrix. okay. now what is the value of b according to my formula? covariance of which variables? which are my two variables? miles and usage. where is that covariance? is it this number, 42.71? right. okay. divided by what? variance of what? where is it? no, variance of usage. why not the variance of y? because the formula divides by the variance of x. so mydata usage. is this also here in the data? it is, because what is the diagonal element of this? what is this number? usage. this is usage: 1.17. this number here is the variance. okay. in the equation, usage is my x and miles is my y. so covariance of x and y divided by the variance of x. so based on this, what is my answer? the answer for my coefficient is going to be: where is the covariance here? 42.71. divided by, where is the variance? 1.17 something. so 36 point something. where is my value of b? here, this is my slope. how do i get my intercept? mean of y. so mydata mean; this is the mean of all of them. so which mean do i want? what is my y here? miles. where is my mean of miles? let’s say 103.19. minus, what is my slope? say 36 point two or something of that sort, times, what is my x? usage, whose mean is, say, 3.45.
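the manual route, reading the covariance and the variance off the covariance matrix, can be sketched as follows. the usage and miles values here are hypothetical stand-ins (the class data set is not reproduced), and plain numpy is used in place of the dataframe from the session.

```python
import numpy as np

# Hypothetical usage (x) and miles (y) values, made up for illustration.
usage = np.array([2.0, 3.0, 3.0, 4.0, 5.0, 3.0, 4.0, 2.0])
miles = np.array([47.0, 85.0, 94.0, 120.0, 160.0, 75.0, 113.0, 56.0])

# 2x2 sample covariance matrix (divides by n - 1):
# variances sit on the diagonal, the covariance sits off the diagonal.
cov_matrix = np.cov(usage, miles)
var_usage = cov_matrix[0, 0]
cov_usage_miles = cov_matrix[0, 1]

# Slope = cov(x, y) / var(x); intercept = mean(y) - slope * mean(x).
b = cov_usage_miles / var_usage
a = miles.mean() - b * usage.mean()

# Cross-check with a least-squares line fit.
slope_fit, intercept_fit = np.polyfit(usage, miles, deg=1)

print("manual :", b, a)
print("fitted :", slope_fit, intercept_fit)
```

it does not matter whether the covariance and variance both divide by n or both by n - 1, as long as they use the same divisor, because it cancels in the ratio.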
103.19 minus 36.2 times 3.45, something like that; it will work out that way. minus 22. what is my coefficient? minus 22. so if you want to, you can do this from first principles by using that formula. i’m not asking you to; you can do it just by running linear regression. but what it is is this. you can also check the units. what is the unit of b? the unit of b is the unit of miles divided by the unit of usage. what is the unit of the covariance? miles into usage, divided by the variance, which is usage into usage. so the ratio of this is miles by usage. what is the unit of a? miles. this is in miles, this is in miles per usage, and this is in usage, so the units all make sense. which one? the units. what is the unit of the covariance? the unit of x into the unit of y. why is that? remember, the definition of covariance is the product of an x and a y. so this is in the unit of the product of x and y, and this is in the unit of the square of x. so in the product of x and y divided by the square of x, one x cancels out, and this becomes the units of y by x, which is what b should be in: miles per usage. 36 means what? it is 36 miles per usage unit. that’s what b is. because b is in some units, we will run into some difficulty when we use this in predictive models. suppose i want to figure out: is this number equal to 0 or not? because if this number is equal to 0, statistically speaking, then miles doesn’t depend on usage. but because this is not a dimensionless number, i can make it anything i want by simply changing the unit. so that makes the statistics a little hard. i cannot simply look at this number and say whether it is high or low. i can make your height anything i want by simply changing the scale. i can make your height a million by simply using a small enough unit. so simply taking a raw measurement gives you no idea of its magnitude. that same argument will work for any of those parameters.
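the unit argument can be seen directly by rescaling x. this is a small sketch with made-up numbers; the factor of 1000 is an arbitrary choice of a "smaller unit" for usage, and the correlation is shown only as one example of a quantity that does not change with the unit.

```python
import numpy as np

# Hypothetical usage and miles values, made up for illustration.
usage = np.array([2.0, 3.0, 3.0, 4.0, 5.0, 3.0, 4.0, 2.0])
miles = np.array([47.0, 85.0, 94.0, 120.0, 160.0, 75.0, 113.0, 56.0])

def slope(x, y):
    """Regression slope of y on x: cov(x, y) / var(x)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Same data, but usage expressed in a unit 1000 times smaller.
usage_small_unit = usage * 1000

b_original = slope(usage, miles)
b_rescaled = slope(usage_small_unit, miles)

# The correlation is dimensionless, so the change of unit leaves it alone.
r_original = np.corrcoef(usage, miles)[0, 1]
r_rescaled = np.corrcoef(usage_small_unit, miles)[0, 1]

print("slope      :", b_original, "->", b_rescaled)
print("correlation:", r_original, "->", r_rescaled)
```

the slope shrinks by exactly the factor of the unit change while the relationship in the data is unchanged, which is why a raw slope cannot be judged as large or small without normalizing it.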
so therefore when we do hypothesis testing, we need to normalize all these numbers by something, and that something is typically the standard deviation. so we’ll do that. okay, so let’s end this. the purpose of this was just to tell you what that regression line is. there are similar formulas, but as the dimension increases it’s hard to do this manually. for two variables you can do it manually; for three it’s hard to show manually, which is why i changed it, because i would not have been able to do this with two predictor variables. the formula becomes a lot more complicated, which is why people don’t use the formula for several variables. now what i want to do is talk a little bit about probability. this slide deck should also be there with you. the idea of probability is to be able to cope with uncertainty. what is the uncertainty that we’re talking about here? the uncertainty is that when you observe something, you’re not entirely sure what the value is, not because of measurement error, but because you do not know what the corresponding population number is. you do not know the truth of the number. another sample will give you another number. there is uncertainty, and this uncertainty is usually captured by a probability. hmm, this is an interesting question. what is the probability that a man lives a thousand years? the empirical probability. what does empirical probability mean? it means you ask the question: has anyone lived for a thousand years? if the answer is no, then you say that the probability is 0. if anyone has lived for a thousand years, you’d say: tell me how many people have lived a thousand years. so one interpretation of probability is simply: you see it.
there’s a criticism of this point of view. one of our teachers, professor d-bus, many years ago would say: if you want to find the probability that a little girl is going to fall into the river, how many little girls do you want to walk next to the river to find out? in other words, not all probabilities can be thought of as "let me just see how often it happened". so you need a little bit more than this. some words are often useful to know. probability refers to the chance or likelihood of a particular event taking place. an event is an outcome of an experiment. an experiment is a process that is performed to understand and observe probable outcomes. the set of all outcomes of an experiment is called the sample space. this is correct, and it’s easy to understand, with one problem: who’s performing this experiment? well, when you use probability, you are in one of two modes. in one mode, you are performing the experiment. what does that mean? let’s say you are running a marketing campaign, or designing a portfolio, or manufacturing a product, or recruiting people, or testing a piece of code: you are doing the experiment. and sometimes you are not doing the experiment. somebody else is doing the experiment and you’re simply observing. the customer is buying or is not buying. the product is failing in the field or it is not failing. the portfolio is making money or it is not making money. the person you hired is staying on or is quitting. it is not your experiment; you are simply observing the outcome of it. so sometimes you get to do the experiment and sometimes you do not. we used to call these experimental studies and observational studies. an experimental study is something in which you begin by designing the experiment, and you have a handle over how much data you will collect. in an observational study, you just watch and see what data comes in.
you will in your careers mostly be working with observational studies, because of the nature of data today. there’s just a lot more that is simply being generated without anyone asking for it. in certain very peculiar situations, you do experimental studies. for example, nuclear explosions, right? why do countries want to test nuclear devices? it is primarily to collect data, primarily to figure out whether this thing works or not. so they do a little experiment, say boom, and see what happened, because otherwise it’s all computer simulations and you’ve got no idea whether this happens or not. i remember running into trouble with my engineering friends on this, working on the design of a fairly large aircraft engine. there was a question of, you know, what is the thrust, what is the efficiency of the engine, and i stupidly made the observation: why don’t we test it out? and they looked at me, this side, that side, as if to say, how do we explain this to this idiot? eventually one of them, a very courtly gentleman older than me, took the responsibility of telling me, very kindly: where will it go? his point was that if this engine fires up, it is going to want to move. where will it go? he was pointing to the difficulty that i cannot easily do a full-blown test of a jet engine. if i do start it, i’ve got to give it enough room to move somewhere. so where do you want it to go? so you will not be in a situation to do that very often. so when you say experiment, sometimes it is your experiment and sometimes it is not. in rare situations you will be in an experimental setting, like in a/b testing of websites, for example. it’s a common job: marketing people often design websites and then ask which is a good website. so you do an a/b test. in an a/b test you design a website of, say, type a or type b.
you let them loose and you find out how people react to the different websites. this next one is a little tricky, but i want you to think about it. we will not spend a lot of time on it. in a manufacturing unit, three parts of an assembly are selected. we are observing whether they’re defective or not defective. determine the sample space and the event of getting at least two defective parts. what is the question that i’m asking? here’s the situation: there are three parts. for these three parts we are interested in knowing whether they’re good or bad. the question is asking this: describe for me all the possibilities, which is what the sample space is. so what are the possibilities? don’t talk about probabilities now, just talk about the possibilities of what could happen. we’ll talk about the probabilities later on. good, one way of doing it is this: all of them are defective, two of them are defective, one of them is defective, and none of them are defective. if you do it this way, the sample space has four objects in it. great, that is one way of describing it. one minus? we haven’t yet gotten to probability, but yes, when you get to it, it will be 1 minus that. and the event of getting at least two defective parts means two defective or three defective, which is good. so this is one way of describing the sample space. this is not the way the sample space is typically described. you’re not wrong, but there’s a problem, and the problem is this. let’s suppose i describe my sample space as zero defective, one defective, two defective and three defective. let’s say these are my possibilities, and this two-defective thing is here. if i do it this way, i will eventually have to get around to calculating probabilities. and let’s say i want to calculate the probability of this event, at least two defectives.
now, how will i do that calculation? the way probability calculations are done is that you try to split this up: i’m going to find the probability as the sum of the individual outcomes, as a sum of individual events. i’m going to split it into individual components and then add them up. so therefore i will ask: what is the probability of two defectives, and what is the probability of three defectives? so let’s say i want to find out what is the chance of two defects. how will i find the probability of there being two defects in this situation? yes, and how will i do that calculation? you have not allowed me to even think in terms of part one, two and three. there is no one, two and three; there’s only zero defective, one defective, two defective or three defective. your sample space has lost all identity as to which one is defective. so do you want to revise your opinion of what the sample space is? how do you want to define it now? correct. so what you can do is define your sample space not in terms of the count of defectives, but in terms of whether each individual item is actually defective or not. in other words, what you’re essentially saying is, let’s say, good, defective, good, or g d g. what does this mean? the first is good, the second is defective and the third is good. if you do it this way, how many elements are there in the sample space? eight, because each of the three can be good or bad. these are your eight possibilities. now, using these events, you can add them up. if i am looking at, let’s say, two defectives, which ones are relevant? this one has two defectives, this has two defectives, and this has two defectives, so three of them have two defectives in them, correct.
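the eight-element sample space can be listed mechanically. this is a minimal sketch using only the python standard library; the labels "G" and "D" follow the good/defective notation used above.

```python
from collections import Counter
from itertools import product

# Each of the three parts is either good ("G") or defective ("D"),
# and the position in the string says which part it is.
sample_space = ["".join(outcome) for outcome in product("GD", repeat=3)]
print(sample_space)  # the 8 ordered outcomes

# The event "exactly two defectives" and the event "at least two defectives".
exactly_two = [s for s in sample_space if s.count("D") == 2]
at_least_two = [s for s in sample_space if s.count("D") >= 2]
print(exactly_two)
print(at_least_two)

# Number of outcomes with 0, 1, 2 and 3 defectives.
counts = Counter(s.count("D") for s in sample_space)
print(dict(counts))
```

keeping the order of the parts in each outcome is what later allows each outcome to be given its own probability, which the four-element "count only" sample space does not allow.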
one of them has no defectives, three of them have one defective, three of them have two defectives and one of them has all defectives. so this is another way of writing the sample space. what this will do is allow the calculation to be a little easier, and your objective is to make the calculation easier. so in this particular case, just to get the calculation out of the way, let’s suppose the probability of a single defect is, let’s say, 20%. there’s a one in five chance per unit. that seems too high; you would not survive. let’s say ten percent. one in ten is defective. if one in ten is defective, the probability is now ten percent; then what are the chances of all of these? i’m asking for a common sense answer to the question; we will get to the concepts related to it. so the chance that a single one of them is defective is 10%, and let’s say that i want to solve this problem. what is the problem? the event of getting at least two defective parts. in other words, i want to find the probability of, let’s say, two defectives. what is this? let’s work it out. it’s a good example to work out; we will understand many things as we do it. the chance of a single defect is 10%. i’m asking for the chance that i draw three of them and see two of them being defective. this needs a little bit of work. let’s do this patiently. now, the chance of two defectives can happen in how many ways? we just saw it. let’s suppose that i want to calculate the chance of two defectives. to calculate the chance of two defectives, here is what i claim i can do: i can add up the chance of these three. equivalently, i can write it this way: probability of two defectives is equal to probability of g d d or d g d or d d g. is this correct?
there are only three ways in which i can get two defectives. you’re okay with this? now i’m going to do something really interesting. i’m going to write this as p of g d d plus p of d g d plus p of d d g. let me explain what allows me to do this. what allows me to do this is the fact that if this one happens, these two cannot happen. they are mutually disjoint. they cannot happen together, which means that if i draw a picture, it looks like this. so if i want to find the chance of being here or here or here, i can simply add up the chance of being here plus the chance of being here plus the chance of being here. why can i do that? because they’re disjoint. there is nothing in common. if this happens, then this does not happen. these are separate things, disjoint. okay. now let’s look at each of these. probability of g d d: this is what event? the first is good, the second is defective and the third is defective. i’m going to write this as p of g multiplied by p of d multiplied by p of d. i’m going to multiply, and the technical term that i want to hear is independent. independent means that whether the first is good or bad tells me nothing about whether the second is good or bad. they are independently good or bad. this is an assumption, but i think the problem allows me to make that assumption. i’m making a part that is good or bad and making another part that is good or bad, and these are independent of each other. i’m trying to sell a product to him and to him; whether he buys or not is independent of whether he buys or not. that’s an assumption. it may be true. let’s say they’re from two different neighborhoods. or maybe they’re neighbors and i’m going from one house to the other; maybe this is not independent. maybe if he buys, he is more inclined to buy. so independence is an assumption. in this case, i’m making that assumption. when events are independent, i can multiply the probabilities. what does that mean? for example, let’s say that he will buy a product from me.
let’s say ten percent of the time. in other words, for every 10 people i want to sell my product to, only one person buys. so there’s a 10% chance that he will buy my product, and let’s say there’s an independent 10% chance that he will buy my product. what is the chance that they will both buy my product? ten percent multiplied by another 10%. first he has to buy, and then the second person’s 10% will be like ten percent of that. so 10 percent into 10 percent: there’s only a 1% chance that they will both buy my product. multiplication is allowed when things are independent. let me rephrase it with one more step. let me write this as probability of g and d and d, which means the first is good and the second is bad and the third is bad. and this i will now write as probability of g into probability of d into probability of d. in other words, if there’s an "and", then i can multiply, if things are independent. these laws will be clearly written later. so if things are independent, i can multiply when there is an "and". if things are disjoint, then i can add when there is an "or". common sense rules, but they require a little bit of logic in calculating. so i’m going to take this to the top now. what is this going to equal? p of g into p of d into p of d, plus p of d into p of g into p of d, plus p of d into p of d into p of g. now what is p of g? 0.9, 90%. so 0.9 into 0.1 into 0.1, plus... very clever thinking: into 3. you guys are ahead of me here. 3 into 0.9 into 0.1 into 0.1. even more generally, let me be even smarter than you are. i’m going to write this as 3 choose 1, into 0.1 to the power 2, into 0.9 to the power 1. it’s a sophisticated way of writing the same thing. correct. why did i write it as 3 choose 1? why was it three?
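the "multiply for and, add for or" calculation above can be written out outcome by outcome. this is a minimal sketch using the 10% defect rate from the example and assuming, as the lecture does, that the three parts are independent; the helper name outcome_probability is made up for illustration.

```python
from itertools import product
from math import comb, isclose

p_defective = 0.10
prob = {"G": 1 - p_defective, "D": p_defective}

def outcome_probability(outcome):
    """Multiply the per-part probabilities (valid only under independence)."""
    result = 1.0
    for part in outcome:
        result *= prob[part]
    return result

# The disjoint outcomes with exactly two defectives: GDD, DGD, DDG.
two_defective = ["".join(o) for o in product("GD", repeat=3) if o.count("D") == 2]

# Disjoint outcomes, so their probabilities add.
p_two = sum(outcome_probability(o) for o in two_defective)

# The shortcut: (3 choose 2) * 0.1^2 * 0.9^1.
p_two_shortcut = comb(3, 2) * p_defective**2 * (1 - p_defective)

print(two_defective, p_two, p_two_shortcut)
```

both routes give the same number, 0.027 up to floating-point rounding, because the three outcomes each have probability 0.9 × 0.1 × 0.1.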
how many ways could there have been two defectives out of three? so correctly speaking, i should actually have written this as 3 choose 1 or 3 choose 2, because either i can say i choose the one good, or i can say i choose the two bad. 3 choose 2 is the same as 3 choose 1, whichever way you want to write it. so maybe a slightly better way to do it would be to say this is 3 choose 2. what is this two? the two defectives out of three. what is this 0.1? the chance of one defective. what is this power of two? there were two of them. what is this 0.9? the chance of a good. so how many bads were there? two. how many goods were there? one. in how many ways could i have chosen two bads out of three? three. this is the answer. this is an example of a distribution called the binomial distribution, which we will see. and this calculation you don’t need to do by hand; python will do it for you, like all good things. the look and feel for many of these classes is: understand it once and then, you know, ignore it, because someone else can do the calculation for you. not to worry, we’ll get to it after the break. but what is the answer? 3 into 0.1 into 0.1 into 0.9. somebody tell me what that is. i’ve got no idea. 0.0243? someone verify. 0.027, okay. so about 2.7 percent; that is the chance of seeing two defects when the chance of seeing a single defect was 10%. so as you can see, this calculation is not about defects or anything of that sort. this is just a counting argument. so for example, i could have asked the question: i’m trying to sell a product to three people. my chance of success is, say, 10%. what’s the chance that i’ll be able to sell to two of them? i’m a salesperson. i’m going to three houses today.
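as a hedged illustration of "python will do it for you", the binomial formula described above can be written as a small function. this sketch uses only the standard library; the function name binomial_pmf is chosen here for illustration and is not from the class material.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Chance of exactly k 'successes' in n independent trials, each with chance p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Three parts, each defective with a 10% chance (the lecture's example).
n, p = 3, 0.10
distribution = {k: binomial_pmf(k, n, p) for k in range(n + 1)}

for k, prob_k in distribution.items():
    print(f"P({k} defective) = {prob_k:.3f}")

# "At least two" adds the three-defective case to the two-defective case.
p_at_least_two = distribution[2] + distribution[3]
print(f"P(at least 2 defective) = {p_at_least_two:.3f}")
```

the same function answers the salesperson version of the question, since only the number of trials and the chance of a single success enter the calculation.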
let’s say that i sell children’s books. i’ve gone to schools and set up my stall there. i’ve got three addresses of parents who have been kind enough to say, please come to my home, i’m willing to listen to you. so now i have on my cell phone three addresses that i’m going to go to today. i know that my chances of selling are not good; optimistically, even 10%, which means that if i try to sell to 10 people, only one will probably buy. so my chance of success for a single person is about 10%. so now i can ask myself what is going to happen at the end of today. how many will i sell? what is the chance that nobody will buy? what is the chance that a single one will buy? what is the chance that two of them will buy, and what is the chance that all three of them will buy? so what is the chance that two of them will buy? there’s only roughly a two to three percent chance that i’ll end up with two of these people buying. buy or not buy? no, for me, i defined the 10% as buy. so i think that this calculation does not depend upon whether it is a defective part or a purchase or anything else. what it depends on is the probability of an event, and you’re asking the question: how many times will this happen? and that can be a defective part, that can be a sale, that can be the loss of value of a portfolio, that can be the attrition of a person, that can be a hit on a website, that can be a click-through. a click-through rate can be a very small number. for those of you who are in digital marketing, what’s a typical ctr? ctr is click-through rate, if you’re in that industry. so what’s a typical click-through rate for you? yes, website clicks, but what’s a click-through rate?
so what a click-through rate typically means is: of the people who pass through a page on which the ad was shown, for whom it was an impression, as they say, what percentage actually click on it. it is very important for digital marketing. so you’re showing ads; all these websites come with ads. someone’s paying for those ads, and they want to know what the click-through rate is: when people see the ad, what percentage click on it? and it’s typically a very small number. have you ever clicked on an ad? no, most normal people don’t. either way, click-through rates are very low. let’s say it is 0.3 percent; let’s say three out of a thousand people can be expected to click through. now i can ask the question, for example: how many impressions should i have? that depends on how many clicks i expect to see. if i want to have, let’s say, a hundred people clicking on my ad, that gives me a rough idea of how many i should reach, how many impressions i should be having. i can also answer a question like this: what is the chance that fewer than a hundred people click in a month? how do i calculate that? with this i can, because what i need is a reasonable estimate of how many impressions there are in a month. that is my n. from that i can then calculate. yes. yes, i could have, if i wanted to do at least two. right, if i was solving the question for at least two, then you’re absolutely right, i should have added the last one. i’ve written it as the probability of two defectives. if i had written it as at least two, you are absolutely right, i should have done that. i could have done that.
so how do you define this? an impression is this: on a website, if you see the ad, in other words for a session, someone has gone to that website and that ad is present on that website at that point in time, that’s an impression. if someone has actually clicked on that ad in that session, that’s a click. so the click-through rate essentially is: if i’m showing you the impression, are you clicking on it? so i can look at the number of impressions, because that’s the number of times that website has been viewed with that ad on it. i can also calculate the click-through rate: how many of the people looking at it are clicking. and now we can ask the question: what is the chance that i have fewer clicks than this? so then what would i do? this number would be, let’s say, the number of impressions. this would be, say, a hundred. this would be my click-through rate, and this would be one minus my click-through rate. so my 0.003 to the power of, say, a hundred, those hundred people who clicked; one minus 0.003 to the power of the number of impressions minus a hundred, the people who did not click; and the number of ways in which i can get a hundred people out of, let’s say, i don’t know, a million impressions. i do a calculation like that. i wouldn’t do it myself; i’d get someone to do it for me. so this is where we are heading. let me slow down a little bit and take you through this conceptually, just to get these terms understood. so first of all, what is a probability? a probability is a number between 0 and 1. it is often calculated as a ratio: the number of ways that are favorable divided by the total number of outcomes. this is not the only way of calculating probability, and it very rarely works, but it is often a conceptually easy way to understand it. it’s a number between 0 and 1; 0 means impossible, 1 means certain. a probability is a pure number; it doesn’t have units.
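the click-through question above (the chance of fewer than a hundred clicks) can be sketched with the same binomial formula. the click-through rate of 0.003 is the figure used in the lecture; the 30,000 monthly impressions is a made-up assumption, and working with logarithms is just one way, chosen here, to keep the very large "n choose k" terms from overflowing.

```python
import math

def binom_log_pmf(k, n, p):
    """Log of (n choose k) * p^k * (1 - p)^(n - k), computed via log-gamma."""
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_comb + k * math.log(p) + (n - k) * math.log1p(-p)

def binom_cdf(k_max, n, p):
    """Chance of at most k_max successes in n independent trials."""
    return sum(math.exp(binom_log_pmf(k, n, p)) for k in range(k_max + 1))

click_through_rate = 0.003   # 0.3%, i.e. 3 clicks per 1000 impressions
n_impressions = 30_000       # hypothetical monthly impressions (an assumption)

expected_clicks = n_impressions * click_through_rate
p_fewer_than_100 = binom_cdf(99, n_impressions, click_through_rate)

# Rough planning number: impressions needed for 100 expected clicks.
impressions_for_100 = 100 / click_through_rate

print("expected clicks:", expected_clicks)
print("P(fewer than 100 clicks):", p_fewer_than_100)
print("impressions for 100 expected clicks:", impressions_for_100)
```

changing n_impressions to your own estimate of monthly impressions is the only input needed; the calculation assumes every impression is an independent chance of a click with the same rate.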
the philosophical question of whether the glass is half full or half empty. types of probability, different ways of doing it. here’s what i was talking about as mutually exclusive events. these are two things that have nothing in common. mutually exclusive: they exclude each other. an example: if you are drawing from a deck of cards, you can draw a king, or you can draw a queen, or neither of them, but if you are drawing one card, you cannot draw both a king and a queen. just like if you’re drawing a part, it’s either defective or it is not defective. if you are a physicist, you think of schrödinger’s cat, and physicists have a lot of fun with this. you know the story: schrödinger gave the example of a cat for things like the position of the electron. so there’s a cat in a box, and this is very unfortunate for the cat, and there’s a vial of poison. now that vial of poison is a little unsteady, so it could fall down, it could break open, and if it does, then the box fills up with fumes and the cat dies. so you know that there’s a vial of poison in the box and there’s a cat. now the question is: is the cat dead or alive? it’s a closed box. you know that there’s poison in the box and there is a cat. is the cat dead or alive? the answer is you do not know until you open the box. if you open the box, you can see whether the cat is alive or dead. this is the collapse of the wave function in quantum physics, which means that the event has already happened, but until the wave function is collapsed, you do not know whether the cat is alive or dead. if you observe an electron, the electron is here, but if you’re not observing the electron, you don’t know where the electron is. it could be here or there or anywhere. so the electron is buzzing around the room. in physics this is an important idea, and a lot of probability theory has come from physical considerations.
if things are mutually exclusive, you can add up the probabilities, as we said for a king or a queen. what about two independent events? two independent events are events such that if one of them happens, it in no way influences the occurrence of the other one. in other words, if one customer buys, that says nothing about whether another buys; if one unit is defective, it says nothing about the other one being defective. let me ask you a question. let me go back to my previous picture of mutually disjoint events. are these two events independent? no. why are they not independent? yes: if i know that the king is drawn, i know something about whether the queen has been drawn or not. i know in fact that the queen has not been drawn. these two are most certainly not independent. so please don’t confuse these two concepts. there we were talking about the same draw; the king and the queen were the two events. so for example, let’s say one particular unit is g or d, good or defective. if i’m talking about one of them, then the picture looks like this: it can either be g or it can be d. it cannot be both. this is for one of them. but if i am talking about two of them, then say this can be g1 and this can be, say, d2; in other words, the event that the first is good and the event that the second is defective. now these two are no longer disjoint. why? because both these things can happen together. it is quite possible that the first is good and the second is defective. that’s quite possible, but they’re independent. independent means that if i know that the first is good, it tells me nothing about whether the second is defective or not. so if your picture sort of intersects, then you know that you cannot simply add up the probabilities. in fact, you know a little bit more: you know that if you want to add up the probabilities you can, but you have to somehow take out this common part. so for disjoint events, two separate things, you add them up: this or this, not both.
when this situation happens you can add the probabilities up, and you can also multiply, but for that you need to assume independence. we will break all these assumptions soon. this is the simplest possible way to do calculations. we have to get to a little bit of a nightmare called bayes theorem. rules for computing probabilities. this language here, cup and cap language, is from set theory. some people find it comforting to see that language; others find it complicating. this is called the union. union in general means the collection of two things. so this is the probability of a or b. if there is a common part, then probability of a or b is equal to probability of a plus probability of b minus probability of a and b. the chance that either happens is the chance that the first happens plus the chance that the second happens minus the chance that they both happen. if they are disjoint then this term becomes 0, because i know that both cannot happen, but in general this term stays. this is called the intersection: they both happen simultaneously. here’s an example. what is the probability that the selected card is a king or a queen? this assumes that you know what a card deck is: 52 cards, 13 in each of four suits. so how many kings? four kings. how many queens? four. so what is the probability of a king? 4 by 52, or 1 by 13. what is the probability of a queen? 1 by 13. what is the probability of a king or a queen? 1 by 13 plus 1 by 13, which is 2 by 13. the other way to do it is to ask how many ways you can get a king or a queen: 8, so 8 divided by 52, which is the same number. what about the second one? what is the probability that the selected card is a king or a diamond? again, there are two ways of doing this. p of king or diamond is p of king plus p of diamond minus p of king and diamond. let’s stay on 52. p of king is 4 by 52 (1 by 13 is also correct), plus probability of diamond, 13 by 52, minus king and diamond.
there’s only one such card, so 1 by 52. this is 16 by 52. another way of doing it is to ask how many ways you can get a king or a diamond: 16 ways, the whole suit of diamonds and the three remaining kings. there are 13 diamonds, there are four kings, but i have double counted one of them, so i subtract it once. remind me what the second statement was, the question? okay, you’re asking about the selected card being a king or a diamond. you draw one card at random from a deck of cards and you’re asking: is this a king or a diamond? let’s say i’m trying to decide whether to try to sell him something. one event that i’m interested in is: is he going to buy my product or not? the other interesting question is, let’s say for example, is he an it professional or not? now is there a relationship between these two things? not really. but i may be interested in the joint probability of them, not because of this event itself, but because i want to calculate another event that’s interesting to me, which is: if i know that he’s an it professional, can i sell him something? in other words, suppose it is not independent. suppose i now know that whether he buys my product or not depends on whether he’s an it professional or something else. say i’m trying to sell him a computer peripheral, and i may be assuming that if he’s an it professional he may be more interested in a computer peripheral. if he’s not, he may still be interested, but if he’s an it professional he may be more interested in this particular product. in that case what i’m trying to do is use one seemingly unrelated event as information about another one; in other words, i’m saying it is not actually unrelated. so these ands become interesting. so effectively how will my calculation go? say i want to find the probability that he will buy my product given that he’s an it professional.
then my answer will be: let me find the probability that he both will buy the product and is an it professional, divided by the probability that he is an it professional. why? let me first calculate the chance that he’s an it professional; within that, let me now find the chance that he will buy my product. a given b is equal to a and b divided by b. this trick is always used in analytics, and we’ll do it again. i have received an email: is it spam or not? which means: tell me the words and i will tell you whether it is spam. so now i need to relate the words to spam. so i have two seemingly unrelated concepts, but what i want to say is that if i know one of them, maybe i can get some information about the other. similarly here: this may be about a color and this may be about a suit, but if i know about one of them, maybe it gives me a little bit of information about the other. we’ll see examples later. can it be both? yes, it can be both, because i am drawing one card. you’re asking just about this one? that’s 1 by 52. what he is asking is: is this an exclusive or? he’s asking about xor in computer science: when i say or, am i excluding the case where both are true? no. but confusion can still remain if he’s very picky. so you could say he’s making a distinction between two statements: king or a diamond, and king or a diamond or both. do i need to specify or both? and in that, yes, you are correct. so his mind works in ways in which the default is the exclusive or; your mind works in ways in which the default is not the exclusive. but it’s a valid criticism to make, that in the english language, when you use or, which do you mean? in probability theory, if i say a union b and there is an intersection, i include that intersection. set theory is not confused about this. in set theory, a union b is just the set, and if there is a common part, that’s in it, and it’s counted only once. it is this whole region.
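the card calculations and the inclusive-versus-exclusive question above can be checked by listing the deck as a set. this is only a sketch: the names `deck`, `kings`, `diamonds` and `prob` are made up here, and it relies on python’s set operators, where `|` is the union (inclusive or), `&` is the intersection, and `^` is the exclusive or.

```python
from fractions import Fraction

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = {(rank, suit) for rank in ranks for suit in suits}  # 52 cards

kings = {card for card in deck if card[0] == "K"}
queens = {card for card in deck if card[0] == "Q"}
diamonds = {card for card in deck if card[1] == "diamonds"}

def prob(event):
    """favorable outcomes divided by total outcomes, for one card drawn at random."""
    return Fraction(len(event), len(deck))

# disjoint events: the probabilities simply add
p_king_or_queen = prob(kings | queens)

# overlapping events: the union counts the king of diamonds only once
p_king_or_diamond = prob(kings | diamonds)
addition_rule = prob(kings) + prob(diamonds) - prob(kings & diamonds)

# exclusive or: king or diamond but not both
p_king_xor_diamond = prob(kings ^ diamonds)

print(p_king_or_queen, p_king_or_diamond, addition_rule, p_king_xor_diamond)
```

the union gives 16 cards out of 52, while the exclusive version gives 15, which is exactly the distinction raised in the question.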
so what i did was translate this into set theory, and he’s saying that maybe i should have been a little more careful, because there’s a difference between this set and the following set, which is just this part and this part. multiplication rule: when things are independent, i’m allowed to multiply. example: there are two subjects. the chance that you will do well in one of them is 70%; the chance that you will do well in the other is 50%. the chance that you will do well in both of them, getting the corresponding grades, is the multiplication of the two, which is 35%. here comes the interesting part. what happens to events which are not independent? what happens to the multiplication? there are various ways in which this is written. the way the formula is currently written is: probability of a and b is equal to probability of a multiplied by probability of b given a. sometimes it’s easier to understand it this way, sometimes the other way: probability of b given a is equal to probability of a and b divided by probability of a. i want to know what is the chance that b will happen when i’ve already been told that a has happened. so first i find the chance that a will happen, and within that i take the fraction where both a and b happen. this is the same as the top line: a and b is equal to a times b given a. what does this mean? a and b means first a happens, then, given that a has happened, b happens. if a and b are independent, what do i know? if a and b are independent, then probability of a and b is p of a into p of b. that means that if a and b are independent, probability of b given a is equal to probability of b. stare at that for a while. if they’re independent, then the numerator becomes p of a times p of b, the p of a cancels, and so p of b given a equals p of b. but is this not exactly what independence is? if i tell you that a has happened, i have not changed the chance of b.
that is almost by definition what independence is: knowing that one of them has happened has told me nothing about the second one. knowing that the first unit was defective told me nothing about the second one. knowing that the first customer bought my product told me nothing about whether the second one will buy it or not. these statements are understood in different ways: sometimes this form is a good way to understand it, sometimes that one, but this is the more general form. we’ll see examples of this. this one needs a little bit of work to understand. from a pack of cards, two cards are drawn in succession, one after the other. after every draw the selected card is not replaced. so drawing the first one is like a normal deal; the second one now comes after the first one. what is the probability that in both draws you will get spades? in other words, two draws, two spades. what is the chance of that? so here’s a structuring of the problem: a is that you get a spade in the first draw; b is that you get a spade in the second draw. so what is the chance of a? the chance of a is 13 by 52, the chance that the first one is a spade. now i want to find a and b, and the way i do it is this: what is the chance of a, and then what is the chance of b given a? in other words, i’ve drawn a spade, and then what is the chance that i will draw a spade given that i’ve already drawn a spade the first time? the answer to that is one fewer on both counts, because there are now 51 cards left in the deck and there are 12 spades remaining, so 12 by 51. so the answer is 13 by 52 times 12 by 51. what would the answer have been if i had replaced the first card? it would have been 13 by 52 times 13 by 52, because of independence.
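the two-spades calculation above can be written out with exact fractions and cross-checked by listing every ordered pair of distinct cards. this is a sketch with made-up variable names; the brute-force count is only there to confirm the multiplication rule.

```python
from fractions import Fraction
from itertools import permutations

# multiplication rule: p(a and b) = p(a) * p(b given a)
p_first_spade = Fraction(13, 52)
p_second_spade_given_first = Fraction(12, 51)   # one spade and one card fewer
p_two_spades_no_replacement = p_first_spade * p_second_spade_given_first

# with replacement the draws are independent, so p(b given a) = p(b)
p_two_spades_with_replacement = Fraction(13, 52) * Fraction(13, 52)

# cross-check by enumeration: every ordered pair of two different cards
deck = [(rank, suit)
        for suit in ("spades", "hearts", "diamonds", "clubs")
        for rank in range(1, 14)]
ordered_pairs = list(permutations(deck, 2))
both_spades = sum(1 for first, second in ordered_pairs
                  if first[1] == "spades" and second[1] == "spades")
p_enumerated = Fraction(both_spades, len(ordered_pairs))

print(p_two_spades_no_replacement, p_two_spades_with_replacement, p_enumerated)
```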
if i put it back, the second draw looks exactly like the first one, so the knowledge that i had a spade to begin with has been lost, because i’ve already put that spade back in. it is a situation of independent experiments. this one, however, is the case where the result of the second depends upon the result of the first. are we assuming for the second one that we have already picked the first one as a spade? yes, because that is what is being asked for: what is the probability that in both the draws you will get spades? so i’m drawing one and then i’m drawing a second one; what is the chance that they’re both spades? here’s a similar question. what is the chance that i will get two adjacent seats on my flight if i don’t pre-book? it’s a similar kind of calculation. why? you want two adjacent seats, but for two adjacent seats to be picked by you, those two have to be empty adjacent seats. now two adjacent empty seats means what? can you calculate the probability? yes, you can. but when somebody books seats, let’s say that one particular seat has been booked: what happens to the probability of the seat next to it being booked? let’s say the probability of a single seat being booked is, making up a number, 50%. now i’m telling you that one particular seat has been booked, let’s say 15a. and i’m asking the question: given that 15a has been booked, what is the chance that 15b will be booked? will it be 50%, more than 50%, or less than 50%? it will be more than 50%, at least if you’re modeling reasonably well, because a whole bunch of people will be booking in pairs.
so now if i know that one seat has been booked, if i know that 15a has been booked, the chance that 15b has been booked is going to be more than fifty percent, which means that my chances of coming late and finding two adjacent empty seats are going to go down, because i’m looking for seats that are unbooked. the chances will be more, right? no, the chances will be less, because people book adjacent seats more often than at random. so the probability of two adjacent seats being booked is not the product of the individual seats being booked; it’s more than that. so the probability of me finding two empty adjacent seats is going to be less. we are looking for empty seats. so here’s an example of doing this conditional calculation. marginal probability is a term i’ll explain when i do the example. a survey of 200 families was conducted. information regarding family income per year and whether the family buys a car is given in the following table. so the 200 data points, 200 surveys, have come in and they’ve been distributed in a cross tabulation like this. we did a crosstab like this yesterday as well. on one axis is: did they buy a car or did they not buy a car? on the other is income: below 10 lakhs, or 10 lakhs and above. now why would i be interested in this data? to figure out who buys my cars, where the cars can be sold, and whether that has anything to do with income; and if it does have anything to do with income, then is high better or is low better? i don’t know. so i’ve arranged my data in this particular way, and now it’s asking a few questions. what is the probability that a randomly selected family is a buyer of a car? you don’t even need to look at the full table. this is 80 by 200, probability of, let’s say, car. this is called a marginal probability. why marginal? because in the picture it’s at the margin.
it’s at the margin of the table, which is where the term originally came from. there are many things going on, but you’re asking a question about only one margin, in this case the margin of the car; you are not interested in the income. this is called the marginal probability. what is the probability that a randomly selected family is both a buyer of a car and belonging to income of 10 lakhs or above? both buying a car and income 10 lakhs or above: 42 on 200. ok. a family selected at random is found to be belonging to an income of 10 lakhs and above. what is the probability that the family is a buyer of a car? if the income is more than 10 lakhs, what is the chance of a car? so this is probability of car given greater than 10 lakhs: 42 by 80. interesting. why 42 by 80? because you’re here, in this column, and 80 is the sample size there. you understand the logic, but that is exactly the same as this: probability of car and greater than 10 lakhs, divided by probability of greater than 10 lakhs. why? because car and greater than 10 lakhs is 42 divided by 200, and greater than 10 lakhs is 80 divided by 200. the 200s cancel out and this again becomes 42 by 80. but the thinking is absolutely right: the 80 goes in the denominator because it says out of how many people i am going to select, and on the top is how many are both. this is called a conditional probability. by the way, what is this number? it is greater than 50%. what is the chance of buying a car overall? about forty percent. that means if i did not know your income, i would guess that your chance of buying a car is 40%. if i now know that your income is more than 10 lakhs, your chance of buying a car goes up to over 50%. therefore it’s worth my while to find out whether your income is more than 10 lakhs.
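the survey numbers quoted above are enough to rebuild the whole crosstab and check the three probabilities. this is a sketch: only the 200 families, the 80 car buyers, the 80 families at 10 lakhs and above, and the 42 in both groups come from the lecture; the other cells are derived by subtraction, and the variable names are made up.

```python
from fractions import Fraction

total_families = 200
car_buyers = 80              # margin of the "car" row
high_income = 80             # margin of the "10 lakhs and above" column
buyers_and_high_income = 42  # the joint cell

# the remaining cells follow from the margins (derived, not quoted in the lecture)
table = {
    ("car", "high"): buyers_and_high_income,
    ("car", "low"): car_buyers - buyers_and_high_income,
    ("no car", "high"): high_income - buyers_and_high_income,
    ("no car", "low"): total_families - car_buyers - high_income + buyers_and_high_income,
}

p_car = Fraction(car_buyers, total_families)                       # marginal
p_high = Fraction(high_income, total_families)                     # marginal
p_car_and_high = Fraction(buyers_and_high_income, total_families)  # joint
p_car_given_high = p_car_and_high / p_high                         # conditional = joint / marginal

print(table)
print(float(p_car), float(p_car_and_high), float(p_car_given_high))
```

the conditional comes out as 42 by 80, above the 40% marginal, which is the reason given for wanting to know the income.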
that’s because the sample data tells me that it’s going to influence, in a positive direction, whether you will buy my product or not. so i’ll try to find out. in terms of words, this is called the marginal probability, and this is a conditional. you might have a little bit of trouble with these words, but conceptually this is not very hard. and so this is the calculation that we just did. bayes, when he originally wrote this paper, and we talked about it, nobody understood him. only after he died did somebody find it in his papers, say okay, this is going to take a long time, and then explain it to others. let me explain what it tries to do. yes, on the board: car and greater than 10 lakhs, this one? okay. this one is a joint probability. this is a marginal, this is a joint, and this is a conditional. so a conditional is a joint divided by a marginal; a joint is a marginal times a conditional. so the idea of bayes theorem is the following: it switches which event is being conditioned on. it switches between a given b and b given a. now when would you need to do this? here’s an example. you want to find out whether the email that you’re receiving is spam. do you use gmail? gmail often identifies things as spam and moves them somewhere. how does it do that? actually, it looks at the mails and headers and it uses a very, very complicated algorithm. but let’s suppose you are building an application of this sort and you want to do it based just on the content of the email. so you want the following kind of program: a program that says that if i know the words of the email, i can tell you whether it is spam or not. which means i want the probability of spam given words. if i tell you the words, can you tell me whether this is spam or not? this is what i want to do. but how will i solve the problem? i’ll solve the problem by finding the opposite conditional.
what is the opposite conditional? the opposite conditional is: what is the probability of words given spam? now why am i interested in this? because this one is easier for me to do, in the following sense. let’s say in my research lab i can collect lots and lots of documents and identify them as spam or not spam. in other words, i can manually go in and tag them. so let’s suppose that i have looked at a thousand of these and have tagged, let’s say, 800 of them as spam and 200 of them as non-spam. or maybe i go after things that i know are spam and find five thousand of them, and go after things that i know are not spam and find five thousand of them. now i can solve the opposite problem, which means that if i know that it is spam, i know the distribution of words, and if i know that it is not spam, i know the distribution of words. i can do this inside my analytics environment. so now i know that if it is spam, this is what the distribution of words looks like; if it is not spam, this is what the words look like. using that, i will now flip the problem and say: if you give me the words, i will tell you whether it is spam or not. now how do i do that? i do that using this formula. it is a very easy formula to understand. why is this equality true? let me rewrite it slightly. what is the probability of, let’s say, spam and words? there is an and there, so i’m going to write this like a and b. but here’s the interesting thing: i can write a and b in two ways. i can write it as b multiplied by a given b, but i can also write it as a multiplied by b given a. i have a choice as to which is first and which is second. so i can write this as spam given words times words, but i can also write it as words given spam times spam. do you understand the trick?
but what does that mean? that means these two things are equal. and if these two things are equal, then rearranging that expression, i now know that probability of spam given words is equal to probability of words given spam, multiplied by probability of spam, divided by probability of words. so to execute on this, what do i need? words given spam, which i told you how to get. probability of spam, which is an estimate of the proportion of emails that are spam. and probability of words, which has no conditioning in it; this is what’s usually called a lexicon, or a dictionary. so if you give me a dictionary of the language, i can give you this denominator. if you give me, shall we say, an it estimate or a sociological estimate as to the proportion of emails that end up being spam, i can give you the probability of spam. if you give me things tagged as spam, i can find their word distribution; if you give me things that are tagged as not spam, i can find theirs. so therefore i know the right-hand side, and therefore i know the left-hand side. and now if you give me the words, i can tell you the probability that it is spam. so it’s either thought of in the way i just described it, which is sort of flipping these two probabilities, or it is sometimes described the following way: spam given words is an update of just probability of spam. this probability of spam part is, in bayesian language, called a prior, and spam given words is called a posterior, which means that if i know the words, i have a better idea as to whether it is spam or not. if i know he’s an it professional, i have a better idea whether he will buy the product or not; if i know the income is more than 10 lakhs, i have a better idea whether you’ll buy a car or not; if i know the words, i have a better idea whether it is spam or not. and to do that, i flip it this way. and because of applications like this, bayes theorem has become very, very central to machine learning.
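a minimal sketch of the flip described above, for a single word. all the numbers are invented for illustration, and so is the function name `posterior`. one assumption differs from the lecture’s description: instead of reading the probability of the word from a dictionary, the denominator is built from the two tagged collections with the law of total probability, which gives the same quantity when spam and not spam are the only two classes.

```python
def posterior(prior: float, likelihood: float, likelihood_if_not: float) -> float:
    """p(spam given word) from p(spam), p(word given spam), p(word given not spam)."""
    # denominator p(word): the word can arrive via spam or via non-spam
    evidence = likelihood * prior + likelihood_if_not * (1.0 - prior)
    return likelihood * prior / evidence

# hypothetical inputs, not taken from any real data set
p_spam = 0.4                 # prior: share of emails that are spam
p_word_given_spam = 0.20     # share of tagged spam containing the word
p_word_given_not_spam = 0.01 # share of tagged non-spam containing the word

p_spam_given_word = posterior(p_spam, p_word_given_spam, p_word_given_not_spam)
print(p_spam_given_word)
```

when the word is equally common in both collections the posterior equals the prior, which is the independence statement again: the word told you nothing.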
because, for example, think of the autonomous car. what is the autonomous car’s decision problem? something is crossing the road; should i stop? in other words, given cow, should i stop? now think of the problem that has to be solved to do that. i can flip it. to flip it means what? essentially i have to tell the program: i want stop given cow, so now i solve it by finding cow given stop. so i take the situations in which a car has stopped and the situations in which a car has not stopped. in a stop situation, look at what that car saw, and in a not-stop situation, look at what the car saw, like spam and not spam. and now i can flip this and say: therefore, if this is what i saw, i now know whether to stop or not. it’s a neat little logic. so this is what bayes theorem essentially does. it is one style of doing supervised learning, and there are supervised learning algorithms that are explicitly this, for example bayesian belief networks, or bbns. there are some supervised learning algorithms that are this but aren’t explicitly so, for example linear discriminant analysis. what you do is you find the posterior distribution of being in this class given the data, and the class given the data is obtained from the data given the class, and vice versa. so there are at least two of these algorithms that you will study later. discriminant analysis is one, and i think bbns, though i don’t know whether that is regular. but in general you will find it to be a very useful trick. i’ll come back and show you the theory behind it if you’re interested, but this is actually all that needs to be remembered for this application. so, a question on autonomous cars. his question is: why don’t i do the simple thing of saying that if you see something, stop? now, from the computer’s perspective,
following that logic, the computer now has to know: what should i do when i see something, and what should i do when i don’t? so you could say: if i see something on the road, then stop. you didn’t say what happens if i don’t see something: i should keep going. so this becomes a very simple rule that says: if i see something, stop; if i don’t see anything, don’t stop. now what will this do to the car? okay, so this is a translation of a rule. the difficulty will be the following, and you can try doing it: what precisely will the car see? let’s follow that logic explicitly. if it sees a car that is quite far ahead, it will stop. you could say: i’m going to set a threshold; if the car in front is further away than this, then don’t stop, because you’re expected to see a car in front. but you now have to encode that. that way of doing things is entirely feasible. for example, there’s a whole branch of learning called case-based reasoning, and case-based reasoning essentially relies on that: give me all the cases and give me the reasoning for all those cases. but this sometimes becomes difficult, because it becomes very, very difficult to enumerate all the possible cases. for example, in the spam problem i would have to solve this problem for every conceivable word that the email program might see, because it is going to decide based on the words. and if you go for a full case-based approach, then if the program sees a word that it has not seen before, it will say: what do you want me to do? typically when bayesian methods are used, when it sees that word it will do precisely nothing. in other words, it will say: if certain words are there, i will update my decision; if those words are not there, it’s irrelevant to it.
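a sketch of the "do nothing for an unseen word" behaviour described above. it uses a naive-bayes-style update, which treats the words as independent given the class; the lecture does not name that method, so take it as one assumed way of doing the update. the word tables, the prior and the function name `spam_probability` are all invented.

```python
import math

def spam_probability(words, prior_spam, p_word_given_spam, p_word_given_ham):
    """update the prior one word at a time; words missing from the tables are skipped."""
    log_odds = math.log(prior_spam / (1.0 - prior_spam))
    for word in words:
        if word not in p_word_given_spam or word not in p_word_given_ham:
            continue  # unseen word: no evidence either way, so no update
        log_odds += math.log(p_word_given_spam[word] / p_word_given_ham[word])
    return 1.0 / (1.0 + math.exp(-log_odds))

# hypothetical word frequencies learned from tagged emails
in_spam = {"lottery": 0.20, "meeting": 0.02}
in_ham = {"lottery": 0.01, "meeting": 0.10}

print(spam_probability(["lottery"], 0.4, in_spam, in_ham))
print(spam_probability(["lottery", "xylophone"], 0.4, in_spam, in_ham))  # same as above
print(spam_probability(["xylophone"], 0.4, in_spam, in_ham))             # stays at the prior
```

a case-based system would need a rule for `xylophone`; here the word simply contributes nothing and the decision rests on the words that were seen in training.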
there’s no evidence for it to act on. the other point is the probabilistic way of thinking: bayes theorem, or any of these relations, is probabilistic learning. when an autonomous system or any machine learning system decides, what does it decide on? you’ll often find in data sets the following situation. i should have had an example; i’ll pull it up. all the x’s are the same but the y’s are different. two people have exactly the same characteristics, but one has bought the product and one has not. two people applying for a loan have given you the same information: they come from the same village, they have the same income, the same family circumstances, they grow the same crops. one farmer has repaid the loan; the other farmer has not. a car is being tested with someone crossing the road, an identical scene: one test driver decided to stop, the other test driver decided not to stop. same x, different y. what should the computer do? think of this from the computer’s perspective. what is the computer’s problem? the computer’s problem is: if you give me an x, i will give you a y. now what do you want the computer to do in this particular situation, because in your real data the same x is leading to different y’s? what’s an ideal solution here? how do you think through this problem? one possibility is to give it a probability. that’s one approach to the problem. what that means is this: in your data set, let’s say half the people who have this x have given a y of 0 and half the people who have this x have given a y of 1. the computer literally tosses a coin and decides which one to predict. that’s called a randomized response, and sometimes it’s done. it could be a disaster as well; it could also be good. but give me another alternative. we could go for the safest alternative, whichever is safe.
how does the computer know which is safe? the condition given is the same x; its input is identical. see, that consequence has already been worked out in nature. if the consequence was there, it would already have been baked in. if there was a consequence to it, and if stopping had the good consequence, then the test driver would have stopped in all cases. that decision would have been made by the test driver as well, would it not? the raw data would also have shown that bias. or are you teaching a computer to have a sense of value that the real human did not have? two doctors look at the identical medical report. one doctor says cancer; the other doctor says no cancer. you are building an ai system for medicine. what should it say? go for another test. you should see the very nice videos of watson. you know what watson is? you should see the watson videos if you haven’t seen them and you want to be an ai professional or an ml professional. wonderful videos, and you can see how watson decides. these are the jeopardy videos, with watson playing jeopardy. jeopardy is a quiz in which the answer is given and you have to say the question, or something of that sort. when you see the video, you’ll see a bar at the bottom, and the bar is a set of probability statements as to how likely each candidate is to be the answer, etcetera. based on those probabilities watson gives an answer, and sometimes watson does not give an answer, because it is unsure of even its best answer. so when you watch it, watch the bottom of the screen: the data that watson is deciding on.
that is this particular way of doing things. but in general this is a hard problem in machine learning, because in the real world you will have this issue. if it was the case that identical values of x give identical values of y, the machine learning problem would be a mathematical function-fitting problem. it would be a problem of simply saying: if this is the x, map it to the y; just find the rule that maps x to y. it’s not, and the reason it’s not is that identical inputs do not lead to identical outputs. resolving that has many procedures and possibilities. one of them is a probabilistic way of doing things, which answers the following question: i will not tell you whether y is 0 or 1; i will tell you the probability that y is 1. i will not tell you whether you have cancer or not; i will tell you the probability that you have cancer. i will not tell you whether i will hit something; i will tell you what the probability of hitting something will be if i continue. it’s not a definite answer. you’re asking for a 0 or a 1 and i’m not giving you a 0 or a 1; i’m giving you a probability. so at every moment the car, when it is driving, is calculating a number: given the scene, what is the probability that i will hit something? it computes this continuously based on what it is seeing. now you decide, based on that probability, whether you should stop or not, based on your risks etc. the learning system does not do that. the learning system does not say whether you should be diagnosed with cancer; it simply says what the chance is that you have cancer. now you decide whether that’s enough to state whether you have cancer or not. the learning system will not say whether you will default on your loan or not; it will say what the probability is that you will default on the loan. now you decide how much risk you will bear. that’s one solution to the problem.
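a small sketch of the two options discussed above for identical x with different y: report the probability that y is 1, or toss a coin with that probability (the randomized response). the rows below are invented, and the function names `fit_probabilities` and `randomized_response` are made up; a real learning system would estimate the probability with a model rather than by counting exact duplicates.

```python
import random
from collections import defaultdict

def fit_probabilities(rows):
    """rows are (x, y) pairs with hashable x and y in {0, 1}; returns p(y = 1) per x."""
    counts = defaultdict(lambda: [0, 0])  # x -> [number of y=0, number of y=1]
    for x, y in rows:
        counts[x][y] += 1
    return {x: ones / (zeros + ones) for x, (zeros, ones) in counts.items()}

def randomized_response(p_one, rng):
    """predict 1 with probability p_one, otherwise 0."""
    return 1 if rng.random() < p_one else 0

# hypothetical loan data: same village, income band and crop, different outcomes
farmer = ("village_a", "income_band_2", "rice")
other = ("village_b", "income_band_1", "wheat")
rows = [(farmer, 1), (farmer, 0), (farmer, 1), (farmer, 0), (other, 1)]

p_repay = fit_probabilities(rows)
print(p_repay[farmer])  # the system reports a probability, not a 0 or a 1

rng = random.Random(0)  # seeded only so that the sketch is repeatable
print([randomized_response(p_repay[farmer], rng) for _ in range(10)])
```

the first output leaves the decision, and the risk, with whoever reads the probability; the second makes the decision by chance, which the lecture notes can go well or badly.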
it doesn’t even try to predict the answer. it simply gives you a distribution on the possible answers. you decide. as i said, if you see the jeopardy videos you’ll see this in action. you’ll see that the category is 19th century novelists. what watson wants to do then is preserve the lead and not take a big risk, especially with final jeopardy, because just like for humans, final jeopardy is hard for watson. now we come to watson: who is bram stoker? i would have thought technology like this was years away, but it’s here now. i have the bruised phenomenal. that’s one on a terror watson. look at that. what it is doing is giving probabilities on the answers. these don’t add up to 1, but: what is the chance that this is the answer, what is the chance that this one is, etcetera, etcetera. if this number is below this threshold, watson will say pass; it won’t answer. and that is there in the video a few times: it doesn’t know. it says that if i am more sure than a certain threshold, and if i am uniquely sure, i will answer. if multiple of these cross the threshold, which means both of them are probable and i don’t know which, they both sound correct to me, again i might stop. it’ll do that for every question, based on hearing it. so if the probability is given by the python language or any of the machine learning things, then what is it that we are here for? meaning, what is our role in deciding that? deep philosophical questions. why are we required? why are we existing at all? why are we here at all? so real? yeah. so one reason you’re there is to provide test data to the system, or what’s called ground truth. you need to give it spam, and you need to tell it once that this is spam. just like he’s saying, i need to tell it to stop; i need to say that this is a dangerous thing. so a human needs to initiate that. but yes, people are asking that question a lot: is that human initiation necessary?
now the trouble with that is that the value system that is necessary to decide that this is a good thing or a bad thing is something that computers do not have, and it’s extremely difficult to encode that. it’s not easy to encode in a computer, in some way, that this is a good decision or this is a bad decision. and also you may want to encode a cost to it: if i do this, this is what causes suffering. reinforcement learning does this. if you take a wrong decision, there’s a penalty function that hurts the computer in terms of an objective, and the computer knows that if i want to reduce, shall i say, that pain factor, i should avoid doing this. it is like how babies learn. it is called reinforcement learning. i don’t know whether you will do much reinforcement learning in this course or not, but you will. so you build algorithms of that kind. there will come a time when that will not be necessary for us. it is not necessary. but even humans have to come with genetically coded information. we also cannot begin from scratch; we already come coded with this. there’s a school of thought that says that’s all that there is, that this information is passing along. in other words, a hen is an egg’s way of making another egg. so an egg wants to make another egg. how does an egg make another egg? through a hen. it makes a hen, and that hen makes another egg, which means that there is a basic information content. the gene is trying to say, i need to survive. so there is a sequence of a’s, t’s, c’s and g’s that has a survival instinct, and the only way it can do that is to get another organism to create a copy of it. what does this brilliantly? there’s a big war going on on planet earth, for a few billion years and still continuing. it’s a deadly war, it has got no winners and it is going to continue. it is a war between bacteria and viruses. nobody wins.
right, these two have been at each other for donkey’s years, because they have two very different ways of dealing with information. viruses, the retrovirus type of thing: a virus is just dna with protein around it. the way it reproduces is like certain birds we learn about in mythology: its information gets into another organism, typically a bacterium. so a virus forces a bacterium to make another virus, and obviously the bacterium doesn’t like it. okay, so bacteria over billions of years have figured out how to prevent this, and viruses have consequently adapted and have repeatedly kept doing this. and so information transference has a long, long history in the real world. in the computing world, the challenge of saying how do i input the information, how do i get the machine to learn, is something that we are rapidly evolving in. the reason this current generation is so excited about it: i am not that old, but even in my career, and i’ve been doing this for about 25 years or so, roughly speaking, i’ve seen three or four waves of it. it goes up, it goes down, it goes up, it goes down. the current version of it is essentially based on certain deep learning algorithms that have come, and they have made it a lot easier to feed back information. so, you know, recurrent neural networks and all these neural networks now have the ability to feed context in, to feed information in, a lot more efficiently, which means this idea that a computer can pick up context and use it to get better algorithms is there. and that scares a few people mightily, because what it means is that as a car keeps driving very well, it knows that it is driving very well and will keep doing certain things. so there is a school of thought that says that therefore maybe the car should have a few accidents, just like maybe there should be a few nuclear explosions. let’s suppose that you go and get an hiv test done. hiv tests are routinely done. let’s say you have surgery or anything like that,
etcetera, hiv tests are done. so let’s suppose that, for whatever reason, an hiv test gets done and the test turns out to be positive. i hope it never happens to you, but let’s suppose it turns out to be positive. the question is, how scared should you be? very? that’s a reasonable answer. but let’s work it out. so to do that, i’m trying to calculate the probability of hiv given a positive test. this is what i’m interested in calculating, because my life may depend on it. there are many ways to do this. here’s a suggested route. what i’m going to do is write out this version of the formula, and you’ll see what it means here. so i’m going to write this as probability of hiv and positive, divided by probability of positive, correct? conditional is joint divided by marginal. now i’m going to write the numerator as probability of positive given hiv multiplied by probability of hiv. i’m going to twist it. here’s why: these are numbers that are much more available to me. what is this number? this number means that if i have hiv, what is the chance that the test will be positive? that’s called the sensitivity of a test; a test maker has to report that. and this is the proportion of people who have hiv. this is the incidence rate. it has nothing to do with me. it is just the fraction of people who have hiv. so these are numbers that i know, one from epidemiology and one from my test manufacturer. divided by probability of positive, and i’m going to do something very interesting on the positive. i’m going to write this positive in two ways. there are two ways in which someone can become positive: hiv and positive, plus not hiv and positive. okay? these are disjoint. there are two disjoint ways in which i can end up being positive: either i have the disease or i do not have the disease. now i can write this as follows; the first term i’d already calculated, it is the same number.
probability of positive given hiv multiplied by probability of hiv, plus probability of positive given not hiv multiplied by probability of not hiv. this is just the formula written out. we’re going to apply this and see what happens. what are the numbers that i need? i need the probability of hiv. probability of hiv is the incidence rate for hiv. what’s a good number for this? point zero one? okay, that’s one percent. that’s actually very low; the hiv rate is a lot higher, but let’s say one percent. one percent of people have hiv and 99% don’t. what this also means is that the probability of not hiv is 99%. okay. i also need a few other things. i need, for example, this: probability of positive given hiv. this is a measure of how good the test is. if you have hiv, what is the chance that it will report that you have hiv? what’s a good number for this? 99%? 95%? what does 95% mean? it means that if you have hiv, there’s a 95% chance that i will find it. equivalently, for a hundred people who have hiv, for 95 of them i will find it. yes, where does it come from? this is called the sensitivity number. it comes from the test. a very good test may have this at 99% or 99.9%. a not very good test, or a cheap test, may have this at 90%. i’m assuming that this test is at 95%; pick your own number. its sensitivity is 95%. the other number is sometimes called specificity. so, for example, let’s say i go the other way: probability of negative given not hiv, which means if you do not have hiv, what is the chance that it will say you do not have hiv? again 95%. in other words, i have a fairly simple test, which is 95% accurate: whatever your disease state is, 95% of the time it will give you the right answer. okay. now let me re-ask the question. i’ve given you a test that is 95% accurate. i am now telling you that your test is positive. what is the chance that you are hiv positive? 95%? that’s a reasonable guess, right? let’s work it out.
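the formula above can be checked with a few lines of python. this is only a sketch: the function name `posterior` and its parameter names are ours, and the inputs are the classroom numbers (1% incidence, 95% sensitivity, 95% specificity), not real epidemiological data.

```python
def posterior(prior: float, sensitivity: float, specificity: float) -> float:
    """P(disease | positive test), splitting 'positive' into two disjoint cases."""
    true_pos = sensitivity * prior                     # P(+ | hiv) * P(hiv)
    false_pos = (1.0 - specificity) * (1.0 - prior)    # P(+ | not hiv) * P(not hiv)
    return true_pos / (true_pos + false_pos)

# classroom numbers: 1% incidence, 95% sensitivity, 95% specificity
p_hiv_given_positive = posterior(prior=0.01, sensitivity=0.95, specificity=0.95)
print(round(p_hiv_given_positive, 3))
```

try changing `prior` to see how strongly the answer depends on how common the disease is.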
negative given not hiv is 95%, so what is positive given not hiv? five percent. great. okay, now i have everything that i need to calculate this. what is positive given hiv? .95, correct? into probability of hiv, which is .01; we have taken it as one percent. downstairs, again, point nine five into point zero one, plus, what is positive given not hiv? .05, into probability of not hiv, .99. could someone please work this out, on a calculator or by hand? together they cover everything, yes, which means in a particular case you have hiv or you do not have hiv. there are no other possibilities. they are exclusive, why? because either you have hiv or you do not have hiv. and exhaustive events means that there are no other possibilities. so positive given hiv is 95. which one, the last one? five percent. this 5% is one minus this: for not hiv, negative was 95%, so positive will be 5%. what is this number? i seem to have high variance in my answers. anyone else? 0.16. 0.16. there’s a sixteen percent chance you have hiv if you test positive. why is that, with a fairly accurate test, a 95% accurate test? my wife and i have a company. we’re trying to release a product on molecular diagnosis for infectious diseases. if we get 95% we’d be thrilled. our investors would be thrilled. we’d be in business. this is not easy to attain, particularly cheaply. we try to keep the cost of our test fairly low, for things like uti and stuff like that. so where is the problem? false positives. the test is 95% accurate, but there is a problem of false positives here. so another way of seeing pretty much exactly the same calculation is the following. so i’m going to do bayes’ theorem here, which is exactly this; i leave it to you to link this to the b’s et cetera, but sometimes it’s easier to just understand it as an example,
as to how it’s done. but i will now show it to you as a picture. i leave this here. now let’s assume that i begin with a population of, maybe, a hundred thousand people. let’s suppose that i’ve got a hundred thousand people who are being tested. now, of these hundred thousand people, some of them have the disease and some of them do not have the disease. the total is a hundred thousand, and this is my sample space, so to speak. now, how many of them have hiv? 1%. so that’s how many? a thousand. so a thousand of them are here. so these are hiv. and how many are not hiv? 99,000, correct? now, of these 1,000, how many test positive? 950. and how many test negative? 50. okay. of these 99,000, how many test positive and how many test negative? these guys should test negative. so what is 95%, or what is 5%, of ninety-nine thousand? 5% of ninety-nine thousand is 4,950. so five percent are wrong, which means 4,950 are here. this is 5% of 99,000. and so how many are here? about 94,000; this number won’t matter much anyway. so you’re okay with the situation here. now, let’s look at all the people who tested positive. where are all the people who have tested positive? these guys have tested positive and these guys have tested positive. so how many people have tested positive in all? 950 + 4,950. of them, how many have the disease? 950. calculate this: 950 divided by 950 plus 4,950. this is exactly the same calculation you did before; arithmetically it is the same calculation. here 4,950 is the culprit. what does that mean?
it means that there were a lot of people who had a false positive. now, why were there a lot of people who had false positives? because there are a lot of people who did not have the disease. for that large number of people who did not have the disease, only a few false positives will swamp the positives of the people who had the disease, which means most of the people who are testing positive are actually healthy people who have had the misfortune of the test going wrong on them. but because there were so many of them, it affected the probability. so what is the moral of the story? so therefore what will happen? i’m pretty sure this hasn’t happened to you, but if somebody gets a positive hiv test, what will the doctor say? get a retest done. why? because let’s suppose this is my test, and let’s suppose now i’ve changed the test to say that i will say you have hiv only if you test positive twice in a row. you are tested twice and both times you show a positive. now, what happens to these numbers? what is now positive given hiv, and what is now negative given hiv? first of all, what happens to this? what is the chance of a false positive now? so the chance of a false positive, which was previously 5%, now becomes, yes, it must go wrong twice. so point zero five into point zero five, and then 1 minus that. 5% of five percent is what? it’s a quarter of a percent. so 1 minus that now becomes a very large number. so this number becomes much smaller; the chance of a false positive becomes much lower, and because the chance of a false positive becomes much lower, this number becomes a lot lower. and now the answer begins to approximate what you think it should be. but for this to work, i must be able to multiply the two probabilities that both tests went wrong.
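a rough sketch of the head-count picture, for one test and for the "positive twice in a row" rule. it assumes the classroom numbers (100,000 people, 1% incidence, 95% sensitivity and specificity) and assumes the repeated tests are independent, which is exactly the assumption that lets us multiply. the helper name `positives` is ours.

```python
population = 100_000
prior, sensitivity, specificity = 0.01, 0.95, 0.95

def positives(n_tests: int) -> tuple[float, float]:
    """Expected (true positives, false positives) when a person is called
    positive only if all n_tests independent tests come back positive."""
    sick = population * prior
    healthy = population - sick
    true_pos = sick * sensitivity ** n_tests              # test must be right every time
    false_pos = healthy * (1.0 - specificity) ** n_tests  # test must go wrong every time
    return true_pos, false_pos

for n_tests in (1, 2):
    tp, fp = positives(n_tests)
    share = tp / (tp + fp)  # fraction of positives who really have the disease
    print(f"{n_tests} test(s): {tp:.1f} true positives, "
          f"{fp:.1f} false positives, P(hiv | positive) = {share:.3f}")
```

with one test the healthy group contributes far more positives than the sick group; with two independent tests the false positives shrink by a factor of twenty while the true positives barely change.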
that multiplication comes from independence, which means the second test that you do should be from a different laboratory, which would have its own biases and its own problems, but they will be independent of the first one, and you can multiply this out and this problem will go away. if it doesn’t multiply out, the same result happens. in other words, if the same error shows up again, then this probability will not go down. so that is the difficulty with this. this shows up in many things. for example, let’s say i am trying to detect fraud. i have a fraud detection algorithm, and i now say, if i see this signal, what is the chance that it is fraud? by bayes’ theorem that will be low. the reason that will be low is that most transactions are not fraudulent transactions, and so even if there is a small possibility of flagging a non-fraud transaction as a fraud transaction, i have messed up my algorithm. you have to do the test independently; running the same program twice will not help you. so in the biological example, you need to run it again with a different test. in a machine learning situation, what does that mean? it means you have to give it fresh data, different data from the same situation, shall we say, which is a little harder, but that’s fine. so this is bayes here. so that last word-and-spam problem, how does it map to this? okay, well, it looks completely different, does it not? okay, we’ll do it this way. what is the proportion of spam and non-spam? let’s say this is spam and this is not spam. what is the proportion? i need to know this. this is the proportion of things that are spam and not spam, independent of what is in the text. what’s the proportion of emails that are spam? what do you think? 30% are spam? okay, you guys know your inbox. it also points to a healthy social life, right?
so now let’s suppose that we fix the problem, and i’m going to solve the problem not for all words but for one word. what’s a spam-like word, for example? congratulation. right, congratulation. so now i want the probability of congratulation given spam. what is the probability of congratulation given spam? that is, if it is spam, what is the chance that the word congratulation will be there? a hundred percent? let’s lower this a little. 75%, let’s say. let’s say this is 75%. then what else do i need? because what is the problem i’m trying to solve? i’m trying to find the probability of spam given congratulation. this is what i want to find. i want to say that if i see the word congratulation, what is the chance that this email is spam? that is the problem you want to solve. now to solve that, i’m solving the opposite problem. i’m saying: what is spam, what is not spam, what is congratulation given spam, and i need one more probability: congratulation given not spam. what is this? 25%? not necessarily; it need not be one minus this, it is a separate calculation. but it could be 25% if you want. let’s make it 35, which means if it is a genuine email, if it is not spam, there is a 35% chance that the word congratulation will be there. now i don’t need to make this up; as i said, in a laboratory i can look at all the spam emails and i can count how many times congratulation shows up in them. so now let’s suppose i know this. now, can you do the calculation? you can do it using bayes’ rule. you can do it using the tree diagram if you want to. just try. what is the answer? congratulation given not spam is 35%; that is known. so four numbers are known to you. well, actually, you know, two of these are really the same number, so three numbers are known to you.
if it is spam, then the chance of congratulation is 75%. if it is not spam, the chance of congratulation is 35%. now i want to find the probability of spam given that there is congratulation. all four numbers are known. this is, shall we say, the information that’s available to me. some of you can try using the formula; some of you can try using the picture. so if i do it using the formula, what will it look like? probability of spam given congrats is equal to probability of congrats given spam multiplied by probability of spam, divided by: probability of congrats given spam multiplied by probability of spam, plus probability of congrats given not spam multiplied by probability of not spam. so what is congrats given spam? .75, into probability of spam, which is 0.3, divided by 0.75 into 0.3, plus congrats given not spam, .35, into not spam, 0.7. about .48. or you may want to draw a picture like we have drawn before. begin with another typical number, let’s say a hundred thousand. you divide it into spam and not spam. this is a hundred thousand emails. on the spam side, how many will there be? 30,000 on this side, 70,000 on that side. on the spam side, how many will have congratulation? 75% of them will have congratulation. so 75% of 30,000; that’s 22,500. and the remaining will not have congratulation. how many here will have congratulation, for not spam? 35% of 70,000. what is 35% of 70,000? 24,500. and so what is my answer? 22,500 divided by 22,500 plus 24,500, which is about 48%. we could do this as well. without opening the email and seeing the email, the chance that it is spam is 30%, but if the word congratulation is there in the email, the chance that it is spam has gone up to about 48%. now you would not do this just for congratulation; you do this for a whole bunch of words,
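both routes, the formula and the picture, can be written out as a short sketch. the numbers are the ones from the class (30% spam, 75% and 35% for the word); the variable names, including "ham" for not spam, are ours.

```python
p_spam = 0.30
p_word_given_spam = 0.75   # P("congratulation" | spam)
p_word_given_ham = 0.35    # P("congratulation" | not spam)

# formula route: bayes' rule with the denominator split into two disjoint cases
numerator = p_word_given_spam * p_spam
denominator = numerator + p_word_given_ham * (1.0 - p_spam)
by_formula = numerator / denominator

# picture route: count emails out of a hypothetical 100,000
emails = 100_000
spam_with_word = emails * p_spam * p_word_given_spam          # about 22,500
ham_with_word = emails * (1.0 - p_spam) * p_word_given_ham    # about 24,500
by_counting = spam_with_word / (spam_with_word + ham_with_word)

print(round(by_formula, 4), round(by_counting, 4))
```

the two numbers agree because the head count is just the formula multiplied through by 100,000.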
which means that instead of congratulation, read congratulation and something and something, etc. which means that instead of congratulation here, it’ll be congratulation and something and something here, which means for these probabilities you will need to say congratulation and something else. let’s say another word. what’s another word? offer. so you now ask, what is the probability of spam given congratulation and offer? now you would need congratulation and offer given spam, but if you assume independence there, this can be congratulation given spam multiplied by offer given spam. so word by word the probability can be calculated and can be put in. this approach, you will see when studying text mining in one of your courses, is called the bag of words approach. the words are put into a bag irrespective of their order and things like that. yes. yes, so each of these, the e, will then be a new event, and the new event would be different words, and so those different words will be thought of as the product over each word. so the chance that the words congratulation and offer are both there in the email is the chance that congratulation is there in the email multiplied by the chance that offer is there in the email. that’s an assumption and, as i mentioned, that is built into the bag of words model. if you don’t like it, what you have to do is give me the joint probability of offer and congratulation, and those models are also there; they’re called bigram models. no, spam and non-spam are the b’s; here there are just two, spam and non-spam. yes. in the formula there are k possibilities, and there could be other possibilities. here the things i’m deciding between are just two, spam or not spam; in the formula, the number of things that i’m deciding between can be many. for example, in your gmail, how many categories are there? social, promotions and primary.
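here is a minimal sketch of that word-by-word product, under the independence assumption just described. the function `naive_bayes_posterior` is our own illustration, not code from the course; the probabilities for "congratulation" are the classroom numbers, while the ones for "offer" are made up because no value is given in class.

```python
from math import prod

def naive_bayes_posterior(priors, likelihoods, words):
    """Posterior over classes given that all `words` appear, assuming the
    words are independent within each class (the bag-of-words assumption)."""
    scores = {
        label: priors[label] * prod(likelihoods[label][w] for w in words)
        for label in priors
    }
    total = sum(scores.values())  # the denominator: sum over all disjoint classes
    return {label: score / total for label, score in scores.items()}

priors = {"spam": 0.30, "not spam": 0.70}
likelihoods = {
    # "offer" values are hypothetical, chosen only for illustration
    "spam": {"congratulation": 0.75, "offer": 0.60},
    "not spam": {"congratulation": 0.35, "offer": 0.10},
}

one_word = naive_bayes_posterior(priors, likelihoods, ["congratulation"])
two_words = naive_bayes_posterior(priors, likelihoods, ["congratulation", "offer"])
print(round(one_word["spam"], 3), round(two_words["spam"], 3))
```

because the function loops over whatever labels are in `priors`, the same sketch works for more than two classes, for example three mailbox categories, as long as you supply a prior and word probabilities for each.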
so instead of these being spam and not spam, i can define them as primary, social and promotions. so now i need to find the probability of primary given congratulation, promotions given congratulation, and social given congratulation. there are three of these now, and you can apply the formula here: there’s b1, b2 and b3. so you’ve already seen an example of a distribution. i’ll simply tell you what it is: the binomial distribution. what is the binomial distribution? the binomial distribution is a distribution of simply counting the number of things: the number of defective products, the number of customers that receive service, etc., exactly like the applications that we were talking about. this is the statement we have already seen. the probability of getting x successes out of n trials is p of x equals n choose x, into p to the power x, into 1 minus p to the power n minus x, where p is the probability of getting a success in one trial. you remember my formula of point 1 to the power 2? this is that formula. what does this formula say? this formula says that if p is the probability of success in a single trial, then what is the probability of getting x successes out of n trials? n trials, p is the success probability in each trial; what is the probability of x successes? n choose x, p to the power x, 1 minus p to the power n minus x. how do i think this through? what is n? n is the total number of attempts that i’m making, the total number of products that i’m making. i’m making three products. the probability of each product being defective is point one. what is the chance that i will get two defects? three choose two, into point 1 to the power 2, into point 9 to the power 1. p to the power x is p into p into p, x times, for the successes. what is not a success is a failure, whose probability is 1 minus p, and there are n minus x of those. and then there are the ways of choosing that x out of the original n. so in this case, are these trials like with replacement or without replacement? yes, they are with replacement.
it’s not like it’s a population, so to speak. in other words, an actual experiment is not being done; it’s imagined that someone is doing this experiment repeatedly. so yes, if you want to think of it as replacement, it is replacement. it’s a model. for example, here: a bank issues credit card statements to customers under the scheme mastercard. based on past data, the bank has found that 60% of all accounts pay on time following the bill. if a sample of seven accounts is selected at random from the current database, construct the binomial probability distribution of accounts paying on time. what is the question being asked? the question being asked is this: i am looking at seven accounts, and i’m trying to understand how many of those accounts are paying up on time. now, what values can it take? what are the possible values that my x can take? 0, 1, 2, 3, 4, 5, 6 and 7. zero means none pay on time, one means one pays on time, seven means all pay on time. the chance that each one of them individually pays on time is 60%, and i’m going to make the assumption that these people aren’t talking to each other, so they’re behaving independently. the 60% chance applies to everyone separately, which means whether one person has paid has no impact on whether another person has paid or not. correct? let’s do one of these calculations. let’s say, what is the probability that, how many people, two people pay on time. so two pay on time. what is the answer to this? you can use this formula directly, but two people pay on time means point 6 into point 6, not point 6 into 2 but point 6 to the power 2, into point 4 to the power 5. these are the five people who have not paid on time. these are the two people who have paid on time.
so this is point 6 into point 6, into point 4 into point 4 into point 4 into point 4 into point 4, for the seven people. now, that is one arrangement. how many such arrangements are possible? 7 choose 2 arrangements are possible. those two could be the first two, they could be the next two, they could be the first and the last. there are seven choose two of those. each of them is a pattern: paid, paid, not paid, not paid, and so on. every time you see a paid, it’s point 6; every time you see a not paid, it’s point 4. the point 6 you’re going to see twice, and the point 4 you’re going to see five times. therefore this formula. 7 choose 2 is a formula which simply says how many ways can i pick two things out of seven. the formula for it is seven factorial divided by 2 factorial into 5 factorial, which in this case is 7 into 6 divided by 2, which is 21. there are 21 ways to pick 2 out of 7. why two? because i asked for two. but the problem asks for all combinations. i’ve just solved it for one particular answer; i need to do it for 0, 1, 2, 3, 4, all of them. if you add them all up, you’ll get the answer 1, because something must happen. no, the number of trials is 7; the number of outcomes is 8. if i toss one coin, i can see two things. so there are seven trials, the seven people. so 0, 1, 2, 3, 4, 5, 6, 7, that’s eight; there are eight possible outcomes. all right. so now there is a file here. it’s called, i think, binomial distribution example. in it, you import a few things for plotting and for the stats functions. then i’m going to set up the problem. how am i going to set up the problem in this particular case? just by specifying an n and specifying a p. what is n in this case? n is the total number of trials. why is it 7 for me? because there are seven customers, correct? p is .6. where do we get this point six? here, the 60%. what am i doing here?
what i’m doing here is creating the sample space, creating the set of numbers for which i want to calculate the probability. so this one here, the range function, 0 to 8. when i do this, it creates an array of eight numbers, zero to seven. does zero really have a value? of course it does; there is a reasonable probability that nobody pays on time. how does the formula work? x people have paid, so think of it as p into p into p, x times, and think of 1 minus p into 1 minus p, n minus x times, because n minus x people have not paid. and what allows me to multiply the probabilities? because if there’s a sixty percent chance you pay, and also a 60% chance you pay, then the chance that both of you pay is going to be point 6 into point 6. and if he doesn’t pay and i want to multiply that in too, it’s point 6 into point 6 into point 4. now how many point 6’s are there? as many successes as i want. how many point 4’s are there? as many non-successes as there are. and how many such possibilities are there? how many ways can i get two successes? that is what i am calling 7 choose 2, which is 21. why is it 21? you are going to pick two people out of 7. how many ways can you pick them? the first person you can pick in seven ways; the second you can pick in six ways. 7 into 6. but picking you first and you second is the same as picking you first and you second, so i’ve double counted. so divide by 2. so 7 into 6 by 2, which is my 21. this kind of application, or another kind of application: for example, i can change this to sales. i am approaching seven leads. the chance of a conversion for a lead is 60%. what is my sales distribution? it can tell you that information. for example, to figure out, let’s say, how much budget i should have for the sales team. for example, i could say, you know what, i’m going to approach seven leads and i’m going to get sales. however,
how are those sales going to be made? the sales are going to be made on the phone, but to confirm the deal i need to be able to send a salesperson to the person’s house and get their signature. this person is going to take a certain amount of time to travel through the famous city of bangalore, and get stuck in a traffic jam, and get there. so i will be able to get at most three signatures in a day or so. let’s suppose that therefore i employ one person. is that good enough? so now i’m asking the question: what is the probability that i’ll end up making more than three sales in a day? because if i end up making more than three sales in a day, i’ll not be able to close all the sales. so this becomes a salesforce question. it becomes a question of saying that, given my probability of a sale, how big a sales team should i have? if my sales team is too small, there’s a probability that they will not be able to close out all my sales, and i leave money on the table. if my sales team is too big, i’ll be paying for that sales team, but they will not have enough to do. so is the binomial distribution used in contact centers? yes, it is used with the same argument in contact centers. for example, one reason it is used is: how many escalations do you expect? so how do i execute on this? i’ve created the array. now, here’s the command that you need to know. this command calculates that formula, that n choose k formula. by the way, you can manually do this once if you want to, which is your 21 into .6 to the power 2 into .4 to the power 5. does anyone want to manually do it once, just to check? otherwise, we’ll just trust the output. that’s fine.
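for anyone who does want to check by hand, here is a small sketch using only the python standard library. it computes the two-out-of-seven case and the "more than three sales in a day" question, with n = 7 and p = 0.6 as in the class; the helper name `binom_pmf` is ours.

```python
from math import comb

def binom_pmf(x: int, n: int, p: float) -> float:
    """P(exactly x successes in n independent trials with success chance p)."""
    return comb(n, x) * p**x * (1.0 - p) ** (n - x)

n, p = 7, 0.6

# manual check: 7 choose 2 is 21, times .6 squared, times .4 to the power 5
p_two = binom_pmf(2, n, p)

# staffing question: chance of more than three sales, i.e. 4, 5, 6 or 7
p_more_than_three = sum(binom_pmf(x, n, p) for x in range(4, n + 1))

print(comb(7, 2), round(p_two, 4), round(p_more_than_three, 4))
```

with these numbers, the chance of more than three sales in a day is well over half, so a single salesperson who can collect only three signatures would often leave sales unclosed.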
But if I do this, binomial equal to stats.binom.pmf: pmf stands for probability mass function, in case you want to know what that means. So this thing is called a probability mass function. Mass means it is almost as if you're thinking of a solid material, and the probability is physical mass. How much mass is on each number? The pmf is simply this number; it's a calculation of this number. If you do it without the equal to, it'll just print it directly. All right, run it; it just takes a bit of time.
So binomial is an array. What is this number here, for 0? What is this in the business context? This is the chance that nobody pays on time, that the number of people who pay on time is zero. It's about 0.16 percent. What is the chance that one person pays on time? 1.7 percent. Two people pay on time, about 7.7%. Three people, about 19%. Four people, about 29%. Five people, 26%. Six people, 13%. Seven people, about 2.8 percent.
OK, curiosity question. How many people would you expect to pay on time? Remember there's a 60% chance that each one will pay. Yes, four or five. In fact, the answer is 7 into 0.6, which is 4.2. So you'd expect to see a little more than four people pay on time. The chance of four people paying on time is about 29%, and the chance that five people pay on time is about 26 percent.
If you want to plot this, there is a slightly jazzed-up version of a plot here. The first line says what to plot, binomial; then there's a title, there are labels, and then finally the plot command itself. I think that's a plotting artifact. It tells you what to plot; you can remove it and see what happens. Here's an interesting thing someone asked: what happens when I add up all the probabilities?
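The calculation walked through above can be sketched as follows, assuming numpy and scipy. The variable name `binomial` mirrors the one used in the session; the other names are illustrative.

```python
import numpy as np
from scipy import stats

n, p = 7, 0.6
k = np.arange(0, 8)                  # 0, 1, ..., 7 people paying on time

binomial = stats.binom.pmf(k, n, p)  # probability mass on each count

for count, prob in zip(k, binomial):
    print(count, f"{prob:.2%}")

total = binomial.sum()               # the checksum: all the mass adds up to 1
expected = (k * binomial).sum()      # probability-weighted average, n * p
print(round(total, 6), round(expected, 6))
```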
Which is what I get here: one. I don't need it; it's a checksum. So one possible business outcome is: what is the probability that, say, more than six people do not pay their bills on time? The collections team in a bank is certainly interested in that. Will they have to go after those people? There's also the question of how much money I am going to make in a specific month. A bank, or say a telephone company, whoever it is, is going to make money on the amount of the bill that's actually paid. The fact that the bill has been given to a person doesn't necessarily mean they will pay it, like here. So how much money does the bank actually expect to make? It has to have an estimate of its revenue per month. How does it get that? By doing a calculation of this kind.
Here's a formula, if it helps you. The average of a binomial distribution is given by n into p, the total number of trials into the probability: 7 into 0.6. Which means, for example, that if I think my success probability for a sale is 10% and I approach 10 people, the number of sales I expect to make is 10 into 0.1, which is 1. Does it mean I will make one sale? No. The distribution goes from 0, 1, 2, 3 up to 10, but the average is that one. Similarly, the average of this distribution is 4.2. Where in the picture is the average? It is 4.2, somewhere here. This is the center of gravity of the distribution. There is a standard deviation formula too, if you want to know: the square root of n p into 1 minus p. The standard deviation will make a little more sense when we talk about the normal distribution. I hope I'll get there.
Now there's another distribution, which is used a little less in practice. You guys are all very practical types; "how is it used" is the kind of question.
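A short sketch of the two summary formulas, assuming scipy. It compares the hand formulas with the `mean` and `std` methods of `scipy.stats.binom`; the variable names are illustrative.

```python
from math import sqrt
from scipy import stats

n, p = 7, 0.6

mean_by_formula = n * p                 # n into p
sd_by_formula = sqrt(n * p * (1 - p))   # square root of n p (1 - p)

mean_by_scipy = stats.binom.mean(n, p)
sd_by_scipy = stats.binom.std(n, p)

print(mean_by_formula, round(sd_by_formula, 3))

# The sales example: 10 leads at a 10% conversion chance
expected_sales = stats.binom.mean(10, 0.1)
```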
He asked how it is used. So I want to make an estimate, for example, as to how many people will pay my bills. I can do it two ways. I can, for example, ask: what is the number of people I expect to pay my bills? 4.2. What is the number of sales I expect to make? What is the number of errors I'd expect to have in my code? What is the number of defective products? What is the number of expected customer recalls? Whichever industry you're in, there are events that happen in that industry and you're trying to find an estimate for them. One estimate is an expectation, like we discussed yesterday. But remember, this one is not coming from data. 7 into 0.6 is not a calculation based on data. I didn't give you any data on people paying their bills on time. I gave you a theoretical distribution. This is an assumption that I made; it's not an average computed on data. So when I make the distribution assumption and ask, based on the distribution, what is the expected number I should see, will I see that all the time? No. That's why there's a distribution.
So this would be used, and it is often used. What will come from the data? One thing that can come from the data is the p, just the p, not the distribution itself. So for example, I want to find out, for next month, or for a new customer, how many people will pay their bills on time. That's a use case. Here's the way I do it. I ask myself: last month, how many people paid their bills on time? That may come from the data. So the p comes from the data, but the calculation of how many people will pay their bills on time is done for the next month.
It makes no sense to do it for this month, because I already have this month's situation. And yes, the probabilities we had come from data it already has, because the p has come from the past data. Normally, in a real situation, that probability has to be computed based on past data. Let me clarify that, yes; it came too quickly.
So the complexities: one is that you might be supposing that it changes with time. Or it might be a situation like this: I have a collections problem, meaning not enough people are paying. The proportion of people who pay their bills on time is 60% and I'm saying it's too low. We want to increase that. How to increase that? My manager comes in and says: make it such that the chance of, let's say, more than five people not paying on time is less than, let's say, 0.1 percent. That's the goal. To do that, I now need to change my p. I'll set my p so that the answer to this question becomes less than 0.1%. That gives me a target p. Now I must reset my collection process so that that p is attained.
So I can create applications in various ways. Give me the p and I will tell you what happens, or give me a situation that I want to achieve and I will give you a target p that gets you there. What is constant and what is variable keeps changing; what I want to fix keeps changing. This is a mathematical model. How you use it is up to you. This is one particular use case, but there will be many use cases for this. You'll see one in logistic regression, for example.
The Poisson distribution is a very similar distribution, except that the Poisson distribution has a mass function that looks like this.
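The "give me a situation and I will give you a target p" use of the binomial model described above can be sketched with a root finder. This is an illustration under stated assumptions, not the session's code: it keeps n at 7, reads "more than five not paying" as "at most one paying", and uses `scipy.optimize.brentq`, which the session does not mention.

```python
from scipy import stats
from scipy.optimize import brentq

n = 7
goal = 0.001   # manager's target: less than 0.1 percent

def p_more_than_five_unpaid(p):
    # More than 5 of 7 not paying is the same event as at most 1 paying
    return stats.binom.cdf(1, n, p)

# Where we are today, with p = 0.6
current_risk = p_more_than_five_unpaid(0.6)

# Solve for the p at which the risk equals the goal; any higher p does better
p_target = brentq(lambda p: p_more_than_five_unpaid(p) - goal, 0.01, 0.99)

print(round(current_risk, 4), round(p_target, 3))
```

The collection process would then need to lift the on-time payment probability from 0.6 to at least `p_target`.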
Now, this mass function counts, but does not count relative to a maximum. The binomial goes from 0 to n: 0, 1, 2, up to n. For the Poisson, there is no n; there is no total number of things. For example, I might ask the question: how many fraud cases do I expect to see? There's no maximum to that. I could frame it by saying: tell me the total number of cases there are, that is my n, and then I'll figure out, based on a p, how many fraud cases there are. But there are situations where this maximum doesn't quite make sense. How many fraud cases are there? How many microfractures are there on this bottle? It's a count, right? How many eggs will the chicken lay? It's a count; it is not in some way a proportion-like thing. So if you're in a pure count situation, you are in the situation of the so-called Poisson distribution, whose mass function has this slightly different form: e to the power minus lambda, into lambda to the power x, divided by x factorial, where lambda is the average.
If on average six customers arrive every two minutes at a bank during busy working hours, what is the probability that exactly four customers arrive in a given minute? What is the probability that more than three customers will arrive in a given minute? This is slightly different from a binomial. Why? In the previous case they were asking how many customers did not pay, but there was a total universe of customers, 7 customers. There was a sample space. Here there isn't. I'm not telling you how many could have come. There is a series, and the series could go up to anything, so to speak. This is the typical situation of a Poisson distribution, where it's not a question of independent trials and how many were successes. It is simply counting how many there are, and I have no idea how many there could potentially have been. How many fraud cases could there have been? I do not know. How many microfractures? I do not know. How many customers could have arrived?
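A minimal sketch of the mass function just described, written out by hand and compared with `scipy.stats.poisson.pmf`. The helper name `poisson_pmf` is illustrative.

```python
from math import exp, factorial
from scipy import stats

def poisson_pmf(x, lam):
    """e^(-lambda) * lambda^x / x!, where lam is the average count."""
    return exp(-lam) * lam**x / factorial(x)

lam = 6                       # average of six arrivals per interval
manual = poisson_pmf(4, lam)
library = stats.poisson.pmf(4, lam)

print(round(manual, 4), round(library, 4))
```

Note that only the average `lam` enters the formula; there is no n anywhere, which is the difference from the binomial.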
I do not know. There is no maximum to it. So, a similar calculation here, for the same thing, if you open the Poisson distribution example file. For the binomial there were two numbers you needed to put in, the n and the p. For the Poisson there is only one number, and that number is usually called the rate: the rate at which my customers are arriving, the rate at which I get fraud cases, or the density of my cracks. It's a rate number. You can think of this rate as a product of n and p, the total number of opportunities times the probability, if you want to think of it that way. So for the Poisson I need to be able to specify the rate. And now I do exactly the same thing again: calculate the Poisson probability, stats.poisson.pmf.
For computational purposes, I am setting the range from 0 to 20. I can set it to any high number. The 20 is not coming from my data; the 20 is there for a computational reason, because I want to do the calculation for a finite number of points. And as you'll see, after 20 the numbers are very, very small. So the 20 is not there from the problem; the 20 is there for my visualization. I can make it any other number. If you make it too low, you'll be leaving some probability out beyond it; if you make it too high, you'll be calculating a lot of zeros.
So what is my problem? Six customers arrive every two minutes at a bank. What is the probability that exactly four customers arrive in a given minute? For now I put my rate as six, and here is my distribution. This is 2.4 into 10 to the power minus 3, so about 0.002. So the probability of zero is about 0.002. What is it for 1? About 0.015. For 2, about 0.045. For 3, about 0.089. This is zero, one, two, three, four.
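The Poisson table read out above can be reproduced as follows, assuming numpy and scipy. The cutoff of 20 is a computational choice, as explained in the session; `leftover` shows how much probability that cutoff leaves out. Variable names are illustrative.

```python
import numpy as np
from scipy import stats

rate = 6                        # six customers per two-minute interval
n = np.arange(0, 20)            # 0, 1, ..., 19: a finite window for display

poisson = stats.poisson.pmf(n, rate)

for count, prob in zip(n[:9], poisson[:9]):
    print(count, round(prob, 4))

# Probability mass that falls beyond the chosen window
leftover = 1 - poisson.sum()
print(f"{leftover:.2e}")
```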
What is it for 4? 0.13, 13%. For five? 16%. For six? 16%. What is the average number of customers I expect to see? Six. For seven, 13 percent; for eight, 10 percent. Now it starts going down, and by the time I reach the end of the range it is already about 0.00001. If I had gone beyond 20, I would have seen even smaller numbers. But I could have stopped at, say, 15. If I had stopped at 15, it would have stopped here, which is fine. 20 is an approximation; 20 is a guess. Here is a distribution plot of the same thing. This is the plot of the distribution function whose average is at six.
By the way, what is the answer to the question: what is the probability that exactly four customers arrive in a given minute? Be slightly careful. Six customers arrive every two minutes. The question asks for exactly four customers arriving in one minute. Which means that if I'm putting six as the rate, I am answering a question about two-minute intervals, not about a given minute. What I can do is change my rate to 3, three customers per minute.
This next one is the distribution with which you do most of your calculations. This is the normal distribution, the distribution that corresponds to age, to all the continuous variables that we were looking at: numbers. If you're dealing with numbers, then you deal with a distribution that has a shape like that. This is called a normal distribution. The reason I wanted to get to it is this picture, because this picture puts the standard deviation in context. Yesterday we talked about the standard deviation, and a question often asked is: what does the standard deviation mean?
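Returning to the bank question, here is a hedged sketch of the calculation with the rate converted to a per-minute basis, assuming scipy. The variable names are illustrative.

```python
from scipy import stats

rate_per_two_minutes = 6
rate_per_minute = rate_per_two_minutes / 2   # 3 customers per minute

# Exactly four customers in a given minute
p_exactly_four = stats.poisson.pmf(4, rate_per_minute)

# More than three customers in a given minute: sf(3) is P(X > 3)
p_more_than_three = stats.poisson.sf(3, rate_per_minute)

print(round(p_exactly_four, 4), round(p_more_than_three, 4))
```

Using a rate of 6 here instead would answer the question for a two-minute window, which is the slip the session warns about.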
What is standard about the standard deviation? This picture tells you what is standard about the standard deviation. This picture means that if I have a normal distribution, then the chance of being within one standard deviation is 68 percent. As a numerical quantity, this distribution has a mean and it has a standard deviation. The way the standard deviation is defined implies that the chance of being within one standard deviation is 68%, the chance of being within two standard deviations is 95%, and the chance of being within three standard deviations is 99.7%.
So now suppose I tell you that for a group of people the mean height is, let's say, 5 feet 8 inches, with a standard deviation of 2 inches. So mu is 5 feet 8 inches, and the standard deviation, sometimes denoted by sigma, is 2 inches. I've told you some interesting things, if you allow me a normal distribution. I have now told you that 68 percent, or roughly two-thirds of the people, are between 5 feet 6 inches and 5 feet 10 inches. This is 5'8". One standard deviation is 2 on either side, so this is 5'10" and this is 5'6", and this is about 68 percent. Sometimes it's easier to remember it as two-thirds; close enough. Two out of three are between these two heights. 95% are between what and what? 6 feet and 5 feet 4. 95% are between these two heights; one in 20 are outside this range. So if I tell you the mean and the standard deviation, I have actually told you a reasonable amount about how the data is spread.
Sometimes the mean and the standard deviation are reverse-engineered, so to speak. If you are professionals, I often do this. People often ask: where is the data?
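The 68/95/99.7 picture and the height example can be checked with `scipy.stats.norm`. This is a small sketch; heights are converted to inches (5 feet 8 inches is 68), and the helper name `within` is illustrative.

```python
from scipy import stats

# Chance of being within k standard deviations of the mean
def within(k):
    return stats.norm.cdf(k) - stats.norm.cdf(-k)

for k in (1, 2, 3):
    print(k, f"{within(k):.2%}")

# Height example: mean 68 inches (5'8"), standard deviation 2 inches
mu, sigma = 68, 2
p_66_to_70 = stats.norm.cdf(70, loc=mu, scale=sigma) - stats.norm.cdf(66, loc=mu, scale=sigma)
p_64_to_72 = stats.norm.cdf(72, loc=mu, scale=sigma) - stats.norm.cdf(64, loc=mu, scale=sigma)
```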
And they say nobody has any data. So you're trying to figure out how to work around that. So you might ask a question: when do you typically arrive? And someone says, oh, nine o'clock or thereabouts. What's your earliest arrival time? 8:30. What is your latest? Ten o'clock. Looking at this, you can decide to assume that the whole range of the distribution is, say, from 8:30 to 10:00. And now this picture tells you that if I go for three sigma, covering 99.7%, this whole range is about six standard deviations. So to find the mean, you just take the middle of it, and to find the standard deviation, you take the whole range and divide by 6. So I can get an idea of what the average is and what the standard deviation is without even getting any data from you, just by getting a sense of the extremes. It's a nonsense way of doing things, but what it does is allow you to cheat with essentially very minimal information. So remember these pictures; they are helpful. They give you an idea of what the distribution is.
By the way, these numbers are easy enough to calculate, so we'll do some calculations. The normal distribution is a bell-shaped distribution. It's symmetrical, and the tails extend outward. It depends on two parameters, mu and sigma. See the power of it: by giving you two numbers, I have given you characteristics like this, and I can do calculations. This is the density function, that equation, if you want to think of it. Nobody does anything with this directly, but you can do calculations on it. So here's a calculation. This is a calculation that we actually do in some detail. Let's do it.
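The range-divided-by-six shortcut from the arrival-time example can be written out as plain arithmetic. This is only a sketch of the rule of thumb, using the 8:30 and 10:00 extremes quoted above; times are held as minutes after midnight for convenience.

```python
# Extremes elicited from the person, in minutes after midnight
earliest = 8 * 60 + 30    # 8:30
latest = 10 * 60          # 10:00

# Middle of the range as the mean; full range treated as six standard deviations
mean_minutes = (earliest + latest) / 2
sd_minutes = (latest - earliest) / 6

hours, minutes = divmod(int(mean_minutes), 60)
print(f"mean arrival about {hours}:{minutes:02d}, standard deviation about {sd_minutes:.0f} minutes")
```

Note that the midpoint need not match the "typical" time the person reports (nine o'clock here), which is a hint that real arrival times may be skewed.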
So: the mean weight of a morning breakfast cereal pack is 0.295 kilograms, with a standard deviation of 0.025 kilograms. The random variable, the weight of the pack, follows a normal distribution. What is the probability that the pack weighs less than 280 grams? Now, why would someone be interested in this? One possibility is that maybe the target for the pack is something like 300 grams, and you're trying to understand whether you are within tolerances, or more or less, or something of that sort.
So what is the probability that the pack weighs less than 280? What do I need to do? What is my picture like? My average is 295, with a standard deviation of 25, on the gram scale, and I want to find the chance of being to the left of 280. I need this area. Calculating this area is actually quite easy, so let me calculate it. I'm going to do it this way: stats.norm.cdf. cdf stands for cumulative distribution function; I'll tell you what's cumulative about it. Now, what is the number that I'm interested in? The probability of being less than 280, or, in kilograms, 0.280. Then comma, loc equal to; loc, location, means the middle of the distribution. For me, what is the mean? 0.295. Comma, scale equal to; what is the standard deviation? 0.025. Are the numbers correct? 27 percent. This one here is 0.28, and the answer is 0.27.
And that is the answer to this question: what is the probability that the pack weighs less than 280 grams? The way I set it up was to say: what is the chance of being less than 0.280 when the mean is 0.295 and the standard deviation is 0.025? Because of certain technical aspects of other functions, the mean here is referred to as location and the standard deviation is referred to as scale. If those terms, location and scale, confuse you, just ignore it.
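A minimal sketch of the call described above, assuming scipy and working in kilograms. The keyword names `loc` and `scale` are the ones `scipy.stats.norm` uses; the variable names are illustrative.

```python
from scipy import stats

mean_kg = 0.295   # mean pack weight
sd_kg = 0.025     # standard deviation

# Area to the left of 0.280 kg under the normal curve
p_under_280 = stats.norm.cdf(0.280, loc=mean_kg, scale=sd_kg)

print(f"{p_under_280:.1%}")   # about 27%
```

The units only need to be consistent: passing 280, 295 and 25 in grams gives the same probability.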
This first term is the number, but otherwise this one here makes more sense. Go ahead with this. All right, do you understand how the code works? Let's do the second problem. What is the probability that the pack weighs more than 350 grams? What do you think the answer should be? Yes: one minus. One minus what? One minus stats.norm.cdf of 0.350, with the same things. About 1.39%: the chance of being more than 350. Clear? So what does the cdf do? Cumulative distribution function: it calculates the area to the left, less than. Therefore, if I want to calculate the area to the right, more than, I need to go 1 minus, because the whole probability is 1.
What was the third one? What is the probability that the pack weighs between 260 grams and 340 grams? How to do this? I now need to be between 340 and 260, so: less than 340 minus less than 260. Let's get lazy and copy it again. This number is 0.340 and this is 0.260, right? 88 percent. 88% of my packets are going to lie between 260 grams and 340 grams.
It's an assumption we're making. Remember, there isn't any data at all here. What numbers am I using? The mean and the standard deviation. What is the advantage that I have? I don't need the data; all I need is this mean and standard deviation. What is the price I pay? An assumption on the distribution. I could, instead of using norm, have another distribution sitting there. There's a whole range of other possibilities; binom is one. There are other distributions if you want, and you decide based on whichever distribution makes most sense for your application.
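The second and third cereal-pack questions can be sketched the same way, assuming scipy and the same mean and standard deviation in kilograms. Variable names are illustrative.

```python
from scipy import stats

mean_kg, sd_kg = 0.295, 0.025

# More than 350 g: one minus the area to the left of 0.350
p_over_350 = 1 - stats.norm.cdf(0.350, loc=mean_kg, scale=sd_kg)

# Between 260 g and 340 g: area left of 0.340 minus area left of 0.260
p_260_to_340 = (stats.norm.cdf(0.340, loc=mean_kg, scale=sd_kg)
                - stats.norm.cdf(0.260, loc=mean_kg, scale=sd_kg))

print(f"{p_over_350:.2%}", f"{p_260_to_340:.1%}")
```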
Now, in certain cases you know what the nature of those distributions looks like. For example, if you're looking at lifetimes of things, it's an exponential distribution, a gamma distribution, or something of that sort. But there's a certain advantage to the normal distribution because of something called the central limit theorem, and we'll cover that a little bit; it will be covered in the next residency. The central limit theorem essentially says that if I take the averages of things, or the totals of things, I end up with a normal distribution. The normal distribution is a result of averaging. So if my observation is a total of little things, then the normality assumption is probably a good assumption for it. Large data doesn't necessarily mean normal. But if your observation is the total, or the accumulation, of lots of things, it does. For example, height is often normal. Why? Because our height is in some way a random combination of many things, maybe the height of each of our cells and things of that sort. So the normal distribution is often used as an assumption based on the central limit theorem.
The other part of it is that even a sample from a normal distribution doesn't necessarily look like a sample from a normal distribution. So, like we saw yesterday with the bell-shaped curve, it's hard to look at the data and say that it is not normal. So the normal distribution is an assumption that is often made in the absence of any other information on the data. It is obviously wrong in cases where the data has a very strong skew in one direction or another. But remember, in many cases you're not even talking about the data. The question that you're asking is not a data question. The question that you're asking is a probability question, a situational question. Why is an analysis of this kind done?
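The averaging idea behind the central limit theorem can be illustrated with a small simulation. This is a sketch on simulated numbers only, not data from the session: it draws strongly skewed exponential values (the lifetime-style distribution mentioned above), averages them in groups of 30, and compares the skewness before and after. The seed, group size and sample count are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)     # fixed seed so the sketch is repeatable

group_size = 30
draws = rng.exponential(scale=1.0, size=(10_000, group_size))

averages = draws.mean(axis=1)      # one average per group of 30

skew_raw = stats.skew(draws.ravel())   # individual values: strongly skewed
skew_avg = stats.skew(averages)        # averages: much closer to symmetric

print(round(skew_raw, 2), round(skew_avg, 2))
print(round(averages.mean(), 3), round(averages.std(), 3))
```

The individual draws keep their long right tail, while the averages are far more bell-shaped, which is why a total or average of many small effects is often modelled as normal.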
What data is it going after, if anything? You're talking about the data being normal or not normal. What data is that even referring to? Why do I care about the first question, what is the probability that a pack weighs less than 280 grams? One context for it could be: if a person buys a pack, what is the chance that they're getting a light pack, in other words something that is less than 280 grams? True. But my question is this: where in all of that is the data? Is it a data problem at all? What data are you referring to? Is this a data science issue at all, or is it not?
You reach the customer, yes, and the weight of the pack. So what data is that? How many data observations? Which data observations, for whom, for which customer, when? A quality check, for what? I could argue, for example, that this is about saying that if he goes in and buys that breakfast cereal, he gets something that is below 280 grams. Value for the price, yes. But where is the data? It's a business question. What data does it apply to?
What I'm trying to say is that it is not a data problem at all. You can say, I'm going to gather a lot of data to solve the problem. I'm telling you that this could come from the past. So you could say that I'm going to gather the data to get this number and to get this number. That's a good answer: in order to solve my business problem, I need a mean and a standard deviation so that I can get a handle on what the chance is that the pack will be underweight. That mean and standard deviation have to come from somewhere, and I can say I will use data to get that mean and standard deviation. That's a good answer.
Then you will say: why do I need data in order to calculate a mean and a standard deviation? Why do I need a mean and a standard deviation? Because that's the least I need in order to be able to answer this question, which is the question I'm interested in answering. Will he buy the product? Will my network go down? Will the product be underweight? There's a business question that I'm interested in answering. There is a tech question that I'm interested in answering. And often that is framed independently of the data.
So, for example, autonomous vehicles: the car has to stop. The data that the car is going to react to is the scene that the car sees in front of it, and that is the data the algorithm has to act on. What the car sees is what it is reacting to. Similarly, this is reacting to only one number, 280 grams. I am now solving the 280 grams problem. I'm giving you a packet and I'm asking the question: is this underweight? Does this have less weight than it should? I'm interested only in that. I'm not interested in any data.
In hypothesis testing, what we will do when we come back is close out that question and ask: from data, how do we get to numbers like this? Which means I have to put the two pieces of this residency together: the idea of calculating means and standard deviations from data, and the idea that it is a parameter being estimated to solve a problem. So you would say that this 295 comes from data. That immediately raises an issue: if it comes from data, it comes from a sample, and if it comes from a sample it's not accurate, and if it's not accurate, then how well does it solve my problem? And life keeps going in circles like that.
So this is the probability side of it, which explains why I need means and standard deviations in order to do a calculation, and the descriptive part says: I have the means and the standard deviations to do the calculations. So, if it had a normal distribution, then I'd be more relaxed? No. If it had a normal distribution, then maybe I'd be able to get good numbers around this; the plus and minus would be symmetric. This calculation doesn't rely on the normality behind the 295 estimate. This calculation relies on the normality of the future data, which doesn't exist at all yet. But what you're asking is: will these numbers, the mu and the sigma, be more reliable if I had a normal distribution? Not necessarily. If I have normal distributions, I'll be able to use certain very specific formulas that we'll see; if it is not normal, those formulas may break down a little bit. So normality helps me calculate how good these numbers are. It also helps me calculate the answers to questions such as these. But the normality that I used ten minutes ago had nothing to do with data.
And that, to some extent, is the power of probability: you are able to answer a question like "do I expect that the weight is going to be less than 280 grams?" without having data in place for it. The simple answer would be: give me the data and count how many are less than 280 grams. That's the simplest answer, right? What is the chance of the pack being less than 280 grams? Empirically: go collect a hundred packets and find out how many of them weigh less than 280 grams. That's the answer to that question. So why are we doing all of this? Because you don't have that data. Why don't you have that data? Because that's not the question I'm asking. I'm asking the question: is it less than 280? Again:
I'm looking at a computer program in front of me and I'm asking: what is the chance that there are more than five bugs in this code? I'm looking at all the computers in my office and I'm asking: what is the chance that today there are going to be more than two hacks or malicious attempts on my server? There is no data yet. There will be, but only once the hack has happened. But I still need those numbers, and I get those numbers using these distributions. To operationalize those distributions, I need certain numbers, and I can get them. I can beg them, I can borrow them, I can steal them, I can estimate them from data. I can ask a friend. I can read a book, see a standard, look at market research. I can do any number of things to get at those numbers; I can look at an industry standard. Those two pieces we'll put together.
This picture is definitional for the normal distribution. So if you look at six sigma: plus or minus 3 sigma will cover 99.7 percent, so about 3 per thousand will lie outside the plus or minus 3 sigma range. Not everything, but roughly 3 in a thousand. This is 3 sigma. Six Sigma usually says 3.4 defects per million opportunities, which is actually not six sigma; it's 4.5 sigma. 4.5 sigma is about 3.4 into 10 to the power minus 6. So if you look at the Six Sigma literature, there's a confusion there. What it says is that in order to get 3.4 defects per million to the customer, you have to be within six sigma, which is about one in a billion. This picture is at three standard deviations. If I go out to 4.5 standard deviations, the tail beyond it is around 3.4 into 10 to the power minus 6. To reach that for the customer, I need to go to six sigma here, which is about one in a billion. I must be more accurate on my factory floor than for my customer.
So if I reach six sigma, my customer will reach 4.5 sigma, and for the customer 4.5 sigma is the 3.4 into 10 to the power minus 6. So if you look at 3.4 into 10 to the power minus 6, it doesn't correspond to six sigma. It's a little confusing, but that's how the Six Sigma literature is written. The normal distribution is just this. As a formula, plus or minus 2 sigma is 95%; actually, plus or minus 1.96 sigma is 95%, and plus or minus 3 sigma is about 99.7 percent. The distribution by definition goes to infinity. You want to cover everything? Plus or minus infinite standard deviations.
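The sigma figures quoted above can be checked against the standard normal tail areas. This is a hedged sketch assuming scipy: `norm.sf(z)` gives the area in one tail beyond z, and the 3.4-per-million figure is computed here as a single tail at 4.5 sigma. Variable names are illustrative.

```python
from scipy import stats

# Fraction outside plus or minus 3 sigma (both tails)
outside_3_sigma = 2 * stats.norm.sf(3)

# One tail beyond 4.5 sigma: the 3.4 defects per million figure
tail_4_5_sigma = stats.norm.sf(4.5)

# One tail beyond 6 sigma: about one in a billion
tail_6_sigma = stats.norm.sf(6)

# The exact multiplier that covers 95% (2.5% left in each tail)
z_95 = stats.norm.ppf(0.975)

print(f"{outside_3_sigma:.4f}", f"{tail_4_5_sigma:.2e}", f"{tail_6_sigma:.2e}", round(z_95, 2))
```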

93 thoughts on “Statistics for Data Science | Probability and Statistics | Statistics Tutorial | Ph.D. (Stanford)”

  • December 20, 2019 at 7:28 am
    Permalink

    hard working

    Reply
  • December 20, 2019 at 10:38 am
    Permalink

    What is the difference between interquartile range and range?

    Reply
  • December 20, 2019 at 11:33 am
    Permalink

    Is chi square distribution also a type of probability distribution??

    Reply
  • December 20, 2019 at 12:57 pm
    Permalink

    Can u also do a video on hypothesis testing??

    Reply
  • December 21, 2019 at 9:17 am
    Permalink

    Which statistics book did he mention?

    Reply
  • December 21, 2019 at 8:24 pm
    Permalink

    Thanks. Really it is a very interesting course

    Reply
  • December 22, 2019 at 1:00 am
    Permalink

    Wow! Thank you for posting this statistics marathon session by Dr.Abhinanda Sarkar. Exactly what I needed!

    Reply
  • December 23, 2019 at 4:18 pm
    Permalink

    Thanks a lot

    Reply
  • December 25, 2019 at 10:26 am
    Permalink

    Can we have data used ?

    Reply
  • December 25, 2019 at 12:29 pm
    Permalink

    This deserve more and more views

    Reply
  • December 26, 2019 at 4:54 am
    Permalink

    Can you please paste the link for the jupyter notebooks

    Reply
  • December 27, 2019 at 12:48 am
    Permalink

    Thanks!

    Reply
  • December 27, 2019 at 3:53 am
    Permalink

    Post-Graduation in Business Analytics and Business Intelligence is this accepted by WES ??

    Reply
  • December 27, 2019 at 6:19 am
    Permalink

    Business Statistics : A First Course Seventh Edition,by Kathryn A Szabat David M. Levine, P. K. Viswanathan, David Stephan , Publisher : Pearson

    Reply
  • December 27, 2019 at 4:07 pm
    Permalink

    Excellent explanations.

    Reply
  • December 27, 2019 at 5:12 pm
    Permalink

    How long is this video available on YouTube?

    Reply
  • December 27, 2019 at 5:45 pm
    Permalink

    This really is quality teaching from an extremely knowledgeable professor! Thank you very much for this video.

    Reply
  • December 27, 2019 at 5:48 pm
    Permalink

    2:22:39

    Reply
  • December 27, 2019 at 6:36 pm
    Permalink

    I am sorry but this course when I just show it for the first time I directly realize that this teacher is not an Indian because I never see someone who speaks English like him in my school. he is really amazing teacher you must watch it

    Reply
  • December 27, 2019 at 10:21 pm
    Permalink

    its good

    Reply
  • December 27, 2019 at 11:33 pm
    Permalink

    Just wondering if you could share the Jupyter notebooks and related data please!

    Reply
  • December 28, 2019 at 3:09 pm
    Permalink

    Wow, what a great course with an awesome professor. He goes way beyond teaching the mechanics of statistics but makes distinctions and asks thought provoking questions that really help you crystallize in your head what you are trying to do.

    Reply
  • December 28, 2019 at 5:00 pm
    Permalink

    Can you do mathematics for machine learning also , please?

    Reply
  • December 30, 2019 at 2:12 am
    Permalink

    MINE OF INFORMATION

    THANKS FOR THE QUALITY VIDEO

    Reply
  • December 30, 2019 at 8:36 am
    Permalink

    Excellent, clean , straight and simple.

    Reply
  • December 30, 2019 at 10:31 am
    Permalink

    Plz upload statistics with python

    Reply
  • December 30, 2019 at 9:15 pm
    Permalink

    Nice video. Better than my teachers in Italy

    Reply
  • December 30, 2019 at 10:51 pm
    Permalink

    Absolutely a delight

    Reply
  • December 31, 2019 at 5:03 am
    Permalink

    The course starts with an incorrect assumption. The statement "here's the data, tell me what that data is telling" assumes that coherent data exists. In reality, it doesn't.

    Reply
  • January 2, 2020 at 5:15 am
    Permalink

    Thanks a lot Sir! This has been very helpfull!

    Reply
  • January 2, 2020 at 4:04 pm
    Permalink

    I saw first time a person teaching at this age in such a good base of voice really nice work

    Reply
  • January 2, 2020 at 4:58 pm
    Permalink

    Nice

    Reply
  • January 2, 2020 at 6:04 pm
    Permalink

    Thanks for such a wonderful lecture.
    When we will get the next part of it for indepth knowledge?
    And is there any other video to watch for Data Science after this video?

    Reply
  • January 3, 2020 at 7:09 am
    Permalink

    Great lecture..made is so simple
    Can we get this notebook and data

    Reply
  • January 3, 2020 at 7:44 am
    Permalink

    9:05

    Reply
  • January 3, 2020 at 11:14 am
    Permalink

    The camera is focused on the screen too much at times..it's kinda bad when the Professor points smth out on his board..
    Other than that, thanks for uploading!

    Reply
  • January 3, 2020 at 11:53 pm
    Permalink

    Wow, I am really happy to see the Great Learning to come on YouTube and sharing their content. Quality is not questionable at all, best content in the market, I wonder how amazingly these students have been prepared.

    Reply
  • January 4, 2020 at 4:58 pm
    Permalink

    Thanks

    Reply
  • January 5, 2020 at 2:39 am
    Permalink

    very basic but got few valuable insights about business questions and what to ask onself and data. 🙂

    Reply
  • January 5, 2020 at 6:19 am
    Permalink

    Absolutely the best video on statistics that I have found on YouTube – and I've viewed several such videos. I like that the professor explains a topic from many angles, has charisma, patience with class questions and includes many real-world examples. It's the perfect blend of theory and applied statistics.

    Reply
  • January 5, 2020 at 2:07 pm
    Permalink

    Thank you so much sir 😀🤘🙏🙏 please do subscribe

    Reply
  • January 6, 2020 at 8:30 am
    Permalink

    what language is this?

    Reply
  • January 6, 2020 at 4:39 pm
    Permalink

    he is my professor in college and people are sleeping on last benches,don't respect the knowledge…lol

    Reply
  • January 8, 2020 at 5:30 pm
    Permalink

    Thankyou Sir. You can say Stats for short 🙂 or pronounce as STA TI STICS the T's are important.

    Reply
  • January 8, 2020 at 6:42 pm
    Permalink

    Thanks. I liked his explanations and jokes. Helped me to revise my statistics knowledge. It is really needed to proceed my other ML classes.

    Reply
  • January 9, 2020 at 10:25 am
    Permalink

    The median is more robust to outliers, so we would most likely use the median.
    The data are skewed to the right because the mean is greater than the median and the mode; most of the data lie on the left side, so the mean, median and mode are all on the left side.

    Reply
  • January 10, 2020 at 3:49 pm
    Permalink

    Magnificent that, in our spiritual psyches, it has become an exact science of describing different social groups in stressful situations.

    Reply
  • January 11, 2020 at 10:02 am
    Permalink

    Excellent! Prof. Sarkar is a great teacher. Looking forward for more videos in near future.

    Reply
  • January 11, 2020 at 10:57 am
    Permalink

    I can't understand the answer of 'what is probability that 4 customers arrived in 1 minute' in Poisson distribution case,can someone make it?

    Reply
  • January 11, 2020 at 3:22 pm
    Permalink

    Next video in the full course series: https://www.youtube.com/watch?v=FPM6it4v8MY&t=15s

    Reply
  • January 11, 2020 at 8:22 pm
    Permalink

    Thanks Dr.

    Reply
  • January 12, 2020 at 1:38 pm
    Permalink

    This is just amazing content

    Reply
  • January 12, 2020 at 5:09 pm
    Permalink

    It's just status sticks

    Reply
  • January 14, 2020 at 6:54 am
    Permalink

    Guys, what else do you want to learn from us? Please do comment below

    Reply
  • January 14, 2020 at 9:36 am
    Permalink

    This man is amazing!

    Reply
  • January 14, 2020 at 12:15 pm
    Permalink

    For how long this video be available on youtube?

    Reply
  • January 16, 2020 at 2:21 pm
    Permalink

    Which book are they using? Business Statistics by?

    Reply
  • January 16, 2020 at 3:41 pm
    Permalink

    Thank you so much for uploading… great teaching

    Reply
  • January 16, 2020 at 4:02 pm
    Permalink

    I am glad to see Great learning sharing the tutorial in you tube from the great mentors. Interested to see more videos like this from GL. Good.

    Reply
  • January 17, 2020 at 6:08 am
    Permalink

    The linial model fractional coefficient log formula is incorrect on the binary log he is using. Adjustment would be required on the X2 unless full squared modules are minimized to the DIFx4.
    Please forgive me .
    Thsnk you

    Reply
  • January 17, 2020 at 11:08 am
    Permalink

    sir i am beginner to data science..though i am msc in eco hons. sir is it possible to be complete data science course upto the beginner level(with knowledge in python),,because we are in finnancial shortage ,,so i am looking to finnishing it in 1 month and want to apply for as a freshers if i study day and night for it.. for 1 month will it be sufficient ??

    Reply
  • January 17, 2020 at 2:45 pm
    Permalink

    I wish I could have had a statistics teacher like this.

    Reply
  • January 17, 2020 at 3:58 pm
    Permalink

    Just found this video. Reall awesome and great. You guys have online course offered for data science?

    Reply
  • January 18, 2020 at 2:49 am
    Permalink

    Wow

    Reply
  • January 18, 2020 at 8:58 pm
    Permalink

    Who is the author of this Business Statistics mentioned in this video?

    Reply
  • January 19, 2020 at 1:18 am
    Permalink

    When someone is intelligent the things are predictable. That's why humans buy cars computers etc.

    Reply
  • January 21, 2020 at 4:52 am
    Permalink

    I want to join Great Learning Course but the program fee is quite high , is there any way to get any scholarship.

    Reply
  • January 22, 2020 at 6:52 am
    Permalink

    nice sir.I learned new this from you.Thanks alot.

    Reply
  • January 22, 2020 at 8:33 am
    Permalink

    Like the videos

    Reply
  • January 22, 2020 at 8:39 am
    Permalink

    What concepts he's dropping, boss!

    Reply
  • January 22, 2020 at 3:41 pm
    Permalink

    Please do not delete this video plzzz….or atleast tell me when you'll remove it.

    Reply
  • January 22, 2020 at 4:00 pm
    Permalink

    great lecture. @1:18:00 which book the teacher referring to where half the book is about coding histogram?

    Reply
  • January 23, 2020 at 1:32 am
    Permalink

    for how much time the course is available ?

    Reply
  • January 23, 2020 at 11:37 am
    Permalink

    Its interesting but how are they able to concentrate for 7 hr straight.

    Reply
  • January 23, 2020 at 8:59 pm
    Permalink

    We are required to master the Advanced Statistics and this set of Videos will be indispensable.

    Reply
  • January 24, 2020 at 1:18 pm
    Permalink

    It will be a great help if you upload the next part of it✌️😔✌️

    Reply
  • January 24, 2020 at 2:47 pm
    Permalink

    Which Book sir has referred in the video (57.15), please do share the name of the book and author

    Reply
  • January 26, 2020 at 6:23 am
    Permalink

    Whoever is here, I need a big big help.

    I was a very bad student in math. Literally bad. Now I’m working for Amazon but in non technical team with 10 years of experience. Now I want to learn data science.

    How do I start learning statistics? What’s the basics I should learn? I’m not even aware of few basic terminologies used in statistics.

    Please help guys 🙏🙏

    Reply
  • January 27, 2020 at 7:33 am
    Permalink

    Great lecture. Can you please upload next level of statistics video? Also, it would be helpful if you could upload basic and advanced mathematics video which are required for Data Science, Machine Learning and Deep Learning. Looking forward for next set of videos 🙂

    Reply
  • January 27, 2020 at 10:22 am
    Permalink

    THE BEST till now ….

    Reply
  • January 29, 2020 at 5:31 am
    Permalink

    Where could I get the sample files from the video lecture?

    Reply
  • January 29, 2020 at 12:59 pm
    Permalink

    It was indeed a Great Learning experience. Many thanks to Dr. Sarkar for such a wonderful course. Many thanks to greatlearning team for reaching him and arranging the course.

    Reply
  • January 30, 2020 at 9:47 am
    Permalink

    This deserves more and more views

    Reply
  • January 30, 2020 at 9:57 am
    Permalink

    wow very good

    Reply
  • January 31, 2020 at 4:36 pm
    Permalink

    Great lecture

    Reply
  • February 1, 2020 at 10:42 am
    Permalink

    Thanks for sharing wisdom. ..!!

    Reply
  • February 1, 2020 at 12:27 pm
    Permalink

    These are the topics covered in this full course:

    1. Statistics vs Machine Learning – 2:22

    2. Types of Statistics [Descriptive, Prescriptive and Predictive] – 9:05

    3. Types of Data – 1:50:45

    4. Correlation – 2:46:02

    5. Covariance – 2:52:33

    6. Introduction to Probability – 4:26:55

    7. Conditional Probability with Bayes' Theorem – 5:24:00

    8. Binomial Distribution – 6:17:01

    9. Poisson Distribution – 6:36:02

    Reply
  • February 2, 2020 at 2:58 am
    Permalink

    Doing course is OK after completing course getting job is very tough without experience we r not getting calls from companys

    Reply
  • February 2, 2020 at 11:49 am
    Permalink

    Great Content . Cant say how thankful i/we can be 🙂

    Reply
  • February 3, 2020 at 10:49 am
    Permalink

    We have also uploaded a full course video on Hadoop: https://www.youtube.com/watch?v=rxVg3XQhnpI&t=200s

    Reply
  • February 3, 2020 at 6:00 pm
    Permalink

    Thank you so much sir for this vedio, providing us this much knowledge and salute to your energy truly 🙂🙂

    Reply
