Skip to content

Instantly share code, notes, and snippets.

@ymoslem
Last active February 11, 2022 16:35
Show Gist options
  • Save ymoslem/d4d9db34e2519d078a865faf27e0c10d to your computer and use it in GitHub Desktop.
Save ymoslem/d4d9db34e2519d078a865faf27e0c10d to your computer and use it in GitHub Desktop.
CourseraParallelCorpus Test Dataset (Human-validated) after removing duplicates - English
For example, if the cost of the ad campaign were really low, and it didn't really matter how much you made as long as you made something.
So the primary model may not be plausible, because it didn't incorporate the possibility of any sort of background trend, okay?
Because if we tell a patient that they don't have cancer, then they're not going to go for treatment.
We also talked though about the importance of positive emotions in maintaining our resilience.
At the same time the intent is for your course project to be a demonstration of the skills that you've gained in manipulating messy data into something of coherence.
Happy smarts is the ability to consistently make happiness enhancing decisions.
In this video I would like to just quickly summarize the main topics of this course and then say a few words at the end and that will wrap up the class.
And, by the way, I'm saying these things not to pat myself on the back, but to emphasize that there seems to be a great deal of hunger for the topic of happiness.
However, if we use the gamma model we find that only 7% of people would be willing to pay more than $30.
Now, there's no prediction algorithm that I'm aware of where a single set of tuning parameters works fine for all problems.
In this lecture, I will talk more about the kind of routine communications and the purposes of those routine communications as the data analysis project moves forward.
And we can estimate that from the data by just calculating the mean and the standard deviation in the usual way.
The only thing that worked was having a computer learn by itself how to fly this helicopter.
And so, often there's no parameters of interest.
Whether they were too high, too low, whatever it is depending on the problem you're working on.
And so the data might suggest certain hypotheses that you can further explore.
The last question you can ask yourself is really are you out of time?
I think the biggest challenge to global health is unequal access to health care.
They may be inside or outside of your organization.
Most algorithms have lots and lots of tuning parameters.
Now with a causal question we're often looking to determine how average changes in a set of features or in a given feature will change when we modify another feature.
And it's continuous, so it takes all the different values in between there.
We don't necessarily know that, but it helps us to simplify how we think about two different variables in the population.
So setting the right expectations and making them as sharp as possible is a really key element to this whole data analysis cycle.
So when you look at a plot, you get a sense of kinda how the variables are related to each other if you make a scatter plot.
You can think of this as like not having the disease and having the disease or any sort of binary class outcome like that.
So that's machine learning, and these are the main topics I hope to teach.
Now, you might think it doesn't look that good, actually, right?
And so if you have the wrong model, a couple of things may occur.
So revisiting the question is really important because often in a long and complicated data analysis, it's easy to go off course or kinda go off on a tangent while you're looking at interesting things in your data.
It may add variables, it may subtract variables and you may add different functional forms.
People, either in the literature or in your organization, that people already know about.
This lecture is about interpreting your results.
And very often your goal is to estimate them.
So, in this case, your primary model might be very simple, it might look something like this.
And whenever you publish data you're sending the message to your organization broker.
Just to summarize, inference is the process of making statements from data about things that you can't observe.
In particular, if you look at around 85 degrees in temperature, you look at where the blue line is around that point, you'll see that most of the data points happen to be above the line, okay?
Whereas a perfect F Score, so if precision equals one and recall equals 1, that will give you an F Score, that's equal to 1 times 1 over 2 times 2, so the F Score will be equal to 1, if you have perfect precision and perfect recall.
We'll also talk about which files are components of a data analysis.
People think that, you eat better, your health will be better in general, but it's not a very specific question, so another version of that question might be, if you eat five servings of fresh fruits and vegetables per day, does that lead to fewer respiratory tract infections or, you know, colds, things like that?
In the sense that any analysis that you do with the data might be difficult to interpret and may not really lead to strong evidence about anything in particular.
And the goal here is essentially to uncover a deterministic link between two sets of features, okay?
As before, this application initially displays a single button labeled load data, and as before, when I press that button the application will issue an HTTP get request to an external server and that server will respond with some complex text containing the requested earthquake data.
So, the normal model is based on the normal distribution, and it's the familiar bell curve that we've seen many, many times.
I won't get into the details of the model cuz it doesn't really matter at this point.
This differs from peer-to-peer communication, like email or SMS texting, where the sender sends a message to a specified receiver.
And we can make sure, that only reasonable amounts of fat, sugar and salt go into the food that our children are eating.
Nested inside that element is a series of earthquake elements, and each of the earthquake elements contains other elements that provide the data for one earthquake.
In this case, rather than setting higher probability threshold, we might instead take this value and instead set it to a lower value.
This Toolkit helps bring functionality that exists in R into the Python world.
There's gonna be many results that you generate, so you wanna think about the totality of the evidence.
Now the third model that we did which had this kind of fourth order polynomial was a relatively complex model to capture a trend.
The last step of course is revising your expectations.
There are a number of things that can affect the quality of inferences that we make from data.
So, the weight of an object is 9.8 Newtons per kilogram of its mass.
This lecture's about the ingredients you need to make useful inferences from data.
Let's begin at the beginning, see even if you have a great idea, success is far from assured.
Things like that. So that's kind of what you're trying to do with your data set here.
Now it's not necessarily going to be the final model, it's going to be the model that you start with to kind of structure your formal analysis.
The final challenge to making good inferences is sampling variability.
At the next level, you might be doing exploratory data analysis and you are looking at the data and you see something that's really unexpected, remember it's part of the EDA process.
If you don't adhere to these specification, you will find that you will not be able to connect to the IBM IoT platform.
JSON data is packaged in two kinds of data structures.
Now, the idea in any part of data analysis is that everything you do is gonna have some sort of consequence, whether it's collecting data, whether it's fitting a model, whether it's asking a question or making some sort of plot.
I like to make an analogy to learning to become a carpenter.
If they don't care about the problem your idea solves for them, It's game over.
Okay, so if you can generate evidence that suggest that your primary model is incorrect.
This is fairly common.
Now so as an example, I recently published a book called R Programming for Data Science.
Now the last thing you need is a model for the population.
And that is the idea of having a number that just tells you how well is your classifier doing.
So the first step I'm gonna do is just to try the easy solution.
So a hallmark of almost all prediction algorithms is tuning parameters.
Can one billion people be wrong?
So, make sure you know what type of question you're asking, and interpret your data analysis appropriately.
It could just be that the dataset you have only has an intrinsic amount of ability to predict whatever outcome you're interested in.
So the first few rows are helpful just to make sure you've got the right numbers there, the kind of things you were expecting to see.
Now when the interpretation does not match your expectations, that you're gonna have to figure out whether it's because your expectations were incorrect or because your interpretation is incorrect.
But then if you want to make an inference, you have to think a little bit carefully about what exactly are you inferring, or making an inference about?
People love to talk about data, but, so even if you don't want that discussion to happen necessarily, you may be better off in the end if it does happen.
Because this is the test data set, we actually know the truth, and so we can compare the truth with what our prediction says.
And so that would be a really bad outcome because they die because we told them that they don't have cancer.
Season plays a big role in explaining variability in both mortality and in air pollution okay?
Now the example application looks exactly the same as the one we showed for parsing json responses.
The last set is a very large class of variables, is all the potential confounders.
So that's basically it.
Now recall that we said previously that all models are wrong.
But those are kind of the three properties of any result that you're gonna wanna think about.
And then still your primary interest is on the coefficient beta which tells you how your total daily sales increases with the ad campaign in place.
Hi there, I'll be your professor in this course.
And so you'll be able to see if there are very large deviations that are perhaps unexpected.
So there are other pollutants that may affect health, and they may be correlated with particulate matter because they may share common sources.
Remember, positive emotions, they broaden our thinking and creativity.
But in this example, Algorithm 3 has a higher average value of precision recall than Algorithms 1 and 2.
And there's two possibilities.
And ask them how much they would be willing to pay.
So if you want people to have good discussion, to have informed discussion that may be useful to you, I think it's very useful to show the data.
One of the reasons machine learning has so pervaded is the growth of the web and the growth of automation All this means that we have much larger data sets than ever before.
And the reason is it's usually easier to tell a story about the data when you have fewer parameters and so this makes the model more useful.
We're looking at pm10 as our key predictor.
If you want to learn more about happy smarts, and how it's different from academic or career smarts I'd like to invite you to visit my website.
And by the way, these superscript 2, you know, what that means is that the z2 and this a2 as well, the superscript 2 in parentheses means that these are values associated with layer 2, that is with the hidden layer in the neural network.
So you can see that the data increased from the left to the right, like a line but with some scatter around it.
If we use that model, what you'll see is that we'll estimate beta, the increase in the daily sales due to the ad campaign to be $44.75, okay?
So alpha here is the intercept.
And by the intervention I mean an action that you can take given the answer to any data analysis you might do.
However, if your secondary models successfully challenge your primary model and maybe put some of your initial conclusions in doubt, then you may need to adjust or modify the primary model to better reflect all this additional evidence that you've generated via all the secondary models.
And the fundamental property of an inferential question is that you wanna make a statement about something outside the dataset.
I hope that many of you in this class will find ways to use machine learning to build cool systems and cool applications and cool products.
And so figuring that out is part of the iterative nature of interpretation.
That's why your first work is to search for the right business model.
So, it really depends on your perspective.
But most exploratory data analyses iterate for a couple of times as you go through and you revise the question or update the data or change your model.
So the data that we've got here is just daily levels of 24 hour average temperature and 24 hour average ozone levels for the year 1999 in New York.
The primary models kind of capture the basic relationship.
And so you may need to consider other things later on, but this is sometimes good as a primary model.
But the problem is that it seems like the increase started, actually before the campaign even started, the sales were kind of going up.
And on improperly weighting the evidence from one kind of analysis versus another kind of analysis.
Any parameters that you estimate from your model may end up being uninterpretable.
On the documentation, shows you how to organize that topic space.
And the fact that one penguin is wearing a turquoise hat doesn't really influence whether another penguin is gonna wear a purple hat or a turquoise hat.
In this week we'll discuss basic statistical tests and methods that ensure you have a solid grasp going forward into the next course.
Often, lots of machine learning types of techniques can just learn the model from the data without you having to be very specific about it.
But realistically, only a tiny fraction of that data can be used to answer any specific question that you might have in your mind.
It has to be in here somewhere.
While potentially trying to adjust for all these confounding factors.
And as you vary the threshold, if you want you can actually trace of a curve for your classifier to see the range of different values you can get for precision recall.
When we are going out, it's important that we realize that the choices we make when we go to the coffee shops are very important.
So the data set classifies individuals into good credit or bad creditworthiness.
In this specialization, we're focused on teaching applied skills using the Python programming language.
So the first question you wanna ask yourself is how does it compare to what we are expecting from the previous picture, okay.
But then also within the data set, there's going to be a certain numbers that you expect.
The underlying question here really that you want to ask yourself is do you actually need to make an inference at all?
From here where you go, it depends.
So in this video, we talked about the notion of trading off between precision and recall, and how we can vary the threshold that we use to decide whether to predict y=1 or y=0.
So that's a reasonable question.
But most questions that you're trying to answer don't necessarily have the big data component that necessitates the need of huge numbers of computers, although sometimes it does.
This lecture gives an example of how to think about formal modeling in a prediction setting.
If I put a function node in between them, I can very quickly create a simple web page where the function node is just going to return some text in the payload.
And particularly, if you think about something like pollution, it may have an inherently very small association with the outcome, but because everyone in New York City ostensibly breathes, and is exposed to this polluted air.
But it's tempting to want to iterate it forever.
And that's a very easy thing to check.
Next, the code sets the parser's input to be the XML document that was returned the body of the HTTP response.
So, using P and R to denote precision and recall, what you could do is just compute the average and look at what classifier has the highest average value.
We're a long way away from that goal, but many AI researchers believe that the best way to towards that goal is through learning algorithms that try to mimic how the human brain learns.
You need to be able to tell a story about how the data came from the population and ended up in your lap.
We will speak to people at Shape Ways, which provide you a service bureau to upload designs and get the prints shipped to you.
It turns out that the ball's weight is exactly proportional to the ball's mass.
On the other hand, sometimes we'll use a primary model that doesn't have any confounders.
So, how to prioritize how you spend your time when you're developing a machine learning system.
Now why are we talking about these six types of questions?
So just for the sake of example, let's use the normal model.
Imagine you're selling something and you're thinking you're buying ads on Facebook and you wanna know how effective those ads are gonna be.
So it's clearly not a perfect model.
But now, it uses the XML response handler class to process the response.
Typically real world data are more noisy.
If you imagine there's a stochastic process of earthquakes that randomly drops earthquakes into our world, or wildfires, or hurricanes, whatever it may be, when you fit a model and you make an inference from that model, what you're inferring are properties of this unobserved stochastic process.
And so, you have to go all the way from defining the questions to creating sort of reproducible code that you can share with other people.
And the dataset is taken from the UCI Machine Learning Repository, which is an excellent repository for all kinds of machine learning and prediction types of data sets.
For example, for things like drug development or whatever.
For me one of the reasons I'm excited is the AI dream of someday building machines as intelligent as you or me.
So what we can see here is that, from the blue line, there's a sharp increase in ozone after about 75 degrees, and then maybe around 85 degrees or so, there's some suggestion of leveling off of the relationship.
And then, once you've completed the third part and you've revised your expectations, you may go back, with these revised expectations and collect more data and try to match them again, and then this iteration continues, often for many different times in any given data analysis.
And there are three things that you wanna think about with your primary result.
On top is a definition of machine learning by Tom Mitchell.
It turns out that machine learning is a field that had grown out of the field of AI, or artificial intelligence.
As you know, machine learning is a technology that's having huge impact on science, technology and industry.
In the long run, a diet of mostly home cooked foods, prepared by someone who cares about the people eating that food, will almost always be healthier for the whole family.
We wanted to build intelligent machines and it turns out that there are a few basic things that we could program a machine to do such as how to find the shortest path from A to B.
Now the first thing you want to do, of course, is you have to read in the data.
The have to be able to see where the light is.
Of course you have to have a question about that population, but that's kind of secondary.
One way to differentiate a highly processed food from a less processed one, is to look at the number of ingredients listed on the label.
So just to summarize the five features of a good question are it has to be of interest to an audience.
If at the end of the day we aren't giving our students the knowledge to lead happy, fulfilling, meaningful lives, what's the purpose of that education?
Keiko from Tokyo wrote in, well , that she couldn't go to the bathroom.
Let's scroll down and take a look at that class.
You'll have a sampling process that you should be able to describe as part of your kind of data analysis process.
So these are 20 numbers from the survey that was put out on the website about my book, okay?
You just want to be able produce solid and high quality predictions, and so any variable that could play a role in that might be useful to you.
How do we decide which of these algorithms is best?
So thinking is really important.
So many prediction algorithms these days are very good at exploring the structure of complex data and making very good predictions, especially once you get the tuning parameters right.
And recall that this is the increase in sales due to the ad campaign.
Routine communication is an important tool for performing any good data analysis and it's important to realize that communication is both a tool and a product of data analysis.
That's not predicted by the volume, the normal distribution doesn't have a huge spike right there, and furthermore, there are no values that are either close to zero or negative, whereas the normal distribution has all these negative values in its functional form.
So you can't move forward.
The second type of question you can ask is an exploratory question.
And so maybe that's probably not the best model, but it may be still useful.
And the key goals for the modeling is to estimate this association and to make sure that you appropriately adjust for any other kinda factors.
So you've defined the population, you've described the sampling process, you've developed a reasonable model for the population.
A tree, and then the application does it's processing on this tree structure.
And so on the task of systems performance, on the performance measure P will improve after the experience E.
And we also want to be able to characterize our uncertainty about those statements that we make.
So any summary statistic or any model parameter that you can estimate from the data, you should always be asking for a measure of uncertainty that comes along with it, because there always will be uncertainty if you're sampling data from a population.
Drawing a fake picture I find to be terribly useful because it really helps to set expectations.
Hang on a second, I'll get out of here.
Sometimes it will look like this, sometimes it will look like that.
So when you collect data, you collect data in the most convenient way and it may just involve.
So for example, in a linear regression, the coefficients of the linear regression models are parameters.
So I'm actually going to spend a lot of time teaching you those sorts of best practices in machine learning and AI and how to get the stuff to work and how the best people do it in Silicon Valley and around the world.
If you didn't choose to use the MQTT protocol, it's important that you refer to the Internet of Things platform documentation.
And one of the aspects of random forest algorithm is that it allows for a summary statistic called variable importance.
Once you've identified the population, then you can talk coherently about what you're making inference about.
This is called the F Score and it uses that formula.
You might think of Apple as being a random draw from those other stocks in the stock market.
In the next two videos, I'm going to say exactly what these two types of learning are.
And so all of these were different tools for helping you to decide what to do next and how to spend your valuable time when you're developing a machine learning system.
And you wanna think about how the evidence that you generate will inform the next steps, or any decisions that might be made afterwards.
You often do sensitivity analyses to see how the association might change in the presence of other factors.
And so this is really a problem that isn't really solved by the data or affixing the data.
And the goal of this is really to start developing what I call a primary model.
Also, if our children are drinking sodas and other sugar-sweetened drinks on a regular basis, they're much more likely to be taking in too much sugar.
That's when I noticed two very interesting things.
Okay, so if we use this model, and we still try to estimate beta, what we get is that the, our estimate of beta is $39.86.
So the last part of the data analysis cycle is to think about what have we learned from the data, from our expectations, and their comparison.
So we usually think of this average of precision and recall as not a particularly good way to evaluate our learning algorithm.
It's important to know your audience and make sure you've got the right people in the room to get the right feedback.
And so I know how time-consuming it is to learn this stuff.
It probably looks very familiar to you.
And in that book too I cover a lot of the material that I'm going to share with you in this class.
And so you often, for example, you wanna know whether a relationship that you observe in the dataset holds somewhere else.
And then finally the result is returned back to the calling method.
XML is a markup language for creating XML documents.
In the next, I realized that some of the intuitions in this video of how, you know, other certain layers are computing complex features of the early layers.
And so, you see in the numerator here that the F Score takes a product of precision and recall.
So revisiting the question is really important for just setting your expectation for what you will find in your results.
The MQTT protocol is a lower level protocol than the Internet of Things, Platform API.
Another approach is to use very flexible models.
Next, the code summarizes the various pieces of earthquake data converting them to a single string and adding that string to a list called result.
This can serve web pages, but it can also serve a REST API.
This data is simulated, so I just want to show you so you can see what it looks like.
And so, but now it may be that if your model is not working very well, that you have to change to another algorithm or another procedure because different procedures can work well in different settings and different types of data structures and different types of data set up.
But sometimes your primary model will hold up and then you'll be able to stick with it.
Well, what determines the ball's weight?
So just make sure you got those dimensions right.
To observe the ball's weight in action, let go of the ball, and let it accelerate.
So these are things that often may need to be accounted for in some way or included in a formal model to properly examine the relationship between your key predictor and your outcome.
The six types of questions that you can ask in a data analysis are descriptive, exploratory, inferential, predictive, causal, and mechanistic.
This forward propagation view also helps us to understand what Neural Networks might be doing and why they might help us to learn interesting nonlinear hypotheses.
So we'll discuss these in module three of this course.
You'll also discover a wealth of resources on that website including many of my articles that have appeared on my popular Psychology Today blog, Sapient Nature.
And so it's not necessarily true that all, the algorithms are exchangeable, you may want to change the algorithm. However, if you try a few algorithms and they all seem to be producing kind of a similar quality of prediction, regardless of how well you tune them, it may be time to get more data or other data to help you predict the outcome.
So it's important when you use the http nodes that you allow the messages to flow from the request to the response.
So there's five characteristics that I just want to highlight about what makes for a good question.
So there's a variation there as we would expect.
You've got this data set, it's daily data for a year of Apple's closing price.
And what the population might be willing to do.
If the magnitude changes that may indicate you've got a potential confounding going on, or maybe you've got to think about what type of model is the best model for estimating this type of association.
I also wanted to let you know that this class has been great fun for me to teach.
So this is a very simple design, you might say look at the one week before the ad campaign, the one week during the ad campaign.
Is it meaningful to you?
I live in the US, I'm a citizen of the US, but I'm currently visiting India as a professor of marketing at the Indian School of Business in Hyderabad, which is this really beautiful city of old forts, palaces, and monuments.
So before you go out and spend the resources to conduct an experiment and do an analysis, try to think about your questions, think about whether you could explain how things would work if you get the answer that you expect.
And so we'll talk a little bit more about this in the lecture about interpreting data analysis.
With the result passed in as a parameter.
So that the data collection process might have been very skewed.
Okay, so this is the point where you ask yourself those three questions that we talked about in the beginning.
The first problem is that the population is not well-defined.
It is the things that I have learned from teaching the class that I would now like to share with you in this course.
So if there's gonna be a deviation from the line it's equally likely to be above or below it.
And so, usually the way that we write down these models is with mathematical notation or with computer codes so that we can get very specific about what we're trying to do.
The nice thing about the normal model is that it only requires two parameters to estimate.
So this is a big problem, because a vaguely defined population leads to vague inferences.
Finally, we will also consider the importance of intellectual property and have a chat with an expert in the field to talk about whether intellectual property is a big concern for 3D printing or not.
And it's useful to think about whether or not that sampling process accurately represents the population or whether there may be some bias in terms of who you selected into your dataset and who exactly they represent.
Seeing the start of an XML tag, seeing the end of an XML tag, and seeing element content, and when the event is a start event.
You can see that the histogram and the blue curve match very nicely with each other.
So in an email client like this, you might click the Spam button to report some email as spam but not other emails.
First, you have to ask yourself is the model's accuracy good enough for your purposes.
Because typically the outcome is typically going to be some binary outcome or some multi-class outcome where that can take either two values or just a handful of values.
For example, if you're trying to approximate an index of stocks, Apple would be a random element of that index, you might sample different stocks to approximate the index.
The students said that they'd be absolutely delighted to take such a class.
So, the next thing you wanna do is actually just look at your data.
So that's what we'll talk about in this lecture.
And then we'll compare the predictions on the test data to the truth that we know from the test data to see how well we did.
It's important to ensure that the people in the data science team or the analysts stay on task with these two basic communication goals.
We just wanna find a kind of a model form that predicts the outcome with a high accuracy and low error.
Previously we said that the sequence of steps that we need in order to compute the output of a hypotheses is these equations given on the left where we compute the activation values of the three hidden uses and then we use those to compute the final output of our hypotheses h of x.
But also it's important to provide sufficient information so that the audience can understand where you're coming from and where you're going.
And you want to check to see is the data formatted correctly?
So there's two possibilities.
But sometimes you wanna be able to predict the outcome with all of the available information that you have.
Before you observe the real thing.
That can be very useful when you don't have a good understanding about what the relationships of the population are.
At the same time, some of the more advanced ways to query and manipulate pandas' data frames like boolean masking and hierarchical indexing are different than in databases and require some careful discussion.
Using just words alone to describe a model, so that's why we usually have to resort to mathematics or computer code.
All the things that might go on between all the 300 or so million people in the United States, it's impossible to think about.
This lecture provides just a couple of ideas of how to make a good data analysis presentation.
So, when you look across all of the evidence that you generated, how is the directionality preserved, or not.
I'm Kevin Werbach, a professor at the Wharton School of the University of Pennsylvania.
They might require too many resources to try to answer a question.
Now there are many different possible shapes for the precision-recall curve, depending on the details of the classifier.
This is a sign of the range of problems that machine learning touches.
Now, here I've opened the application in the IDE, and now I'll open up the file that does the downloading and displaying.
The last question you want to ask is do you have an appropriate model for the population?
So here, you're summarizing kind of the features of a dataset and you're focusing on the data that you have on hand.
Consider the following neural network and let's say I cover up the left path of this picture for now.
And the term F Score, it doesn't really mean anything, so don't worry about why it's called F Score or F1 Score.
So it's important that we kind of set these expectations before we look at the data, so we know whether we are kind of right, or wrong, or far off, or pretty close to the mark.
So, this suggests that the coefficient for pm10 is quite a bit bigger than zero, maybe statistically significant.
And the important thing to realize, and this is a well worn saying in the field of statistics.
Now what we do with this information will depend on what the goal of the analysis is, who the stakeholders are, and what we might do with this information afterwards.
Well, the response that came back was actually formatted in JSON.
I'll give you a much more concrete and detailed definition later on.
Gamification is about learning from games.
The first thing you're probably gonna wanna do is to revisit your question, and just to kinda make sure that you haven't gone off course from your original question.
In the last video, we gave a mathematical definition of how to represent or how to compute the hypotheses used by Neural Network.
The first is the outcome.
If you have a question that's kinda vague in its nature, then there may be a lot of different patterns in the data that match those expectations that are raised by your question.
Can you place your results in a larger context or do these results make sense, and finally are you just simply out of time?
Because people like to talk about data when they see it.
So ultimately what you wanna do is sell more products and make more money from this.
You'll pick up some of the main machine learning terminology, and start to get a sense of what are the different algorithms, and when each one might be appropriate.
DOM stands for Document Object Model.
So, I'll skip the first two steps there, and I'll just show you, here's what the picture looks like with the data, and the gamma distribution that's fitted on top of it, okay?
And by varying the threshold, you can control a trade off between precision and recall.
So using this type of design and this kind of experiment what would you expect to see?
Which is that: It was maybe not so long ago, that I was a student myself. And even today, you know, I still try to take different courses when I have time to try to learn new things.
First, it's easy to learn.
There's quite a bit of noise because we wouldn't necessarily expect particulate matter to explain all the vulnerability in mortality.
Whatever the source, you need to be able to access that data.
I think the biggest problem in public health issues in the world the, is migration because of war or natural disaster.
It's often just a convenient sample.
And you can see that I highlighted the coefficient for pm10 here is actually quite a bit larger now, is 0.00149, and you can see that the standard error is quite a bit smaller relative to the estimate.
Finally, the implications of analysis are always important to consider.
So, big data for somebody without a computer might be 1,000 numbers, but data for, big data for somebody with access to Amazon EC2 might be enormous, much, much, much larger.
And this involved both trying to understand what is it that makes a machine learning algorithm work or not work.
So framing the question correctly is really important for developing an important modeling strategy, and for kinda drawing the right conclusions.
Or there are kind of values that, for example, negative values or maybe they're positive values that you weren't expecting that don't appear correct.
So the idea is it's not gonna be the final product, but it's gonna give you a sense of how things flow and how things work.
So the normal model told us there was gonna be a big hump kind of around 20.
And the models help us by imposing structure on the population, so we might for example assume that things are linearly related to each other.
So for example, if you have a primary model, it can be used to show results for say, a bunch of secondary models or to show confidence intervals for parameters.
And likely with real world data, there's gonna be lots of things going on in the background.
The main objective of the course was to be very simple, to give the students the opportunity to discuss this life's important question.
And see if there is any reasonable increase in the total sales.
So, I think doing that provides for a useful presentation and provides for a more important, a useful discussion about the meaning of your results and what we should do about them.
So, the other thing you wanna think about as you reconsider your question is check for potential bias in your results.
And by the way, the precision-recall curve can look like many different shapes.
And so, there in situations like that, we need to accumulate evidence through many different types of studies, and to develop a pattern that would suggest that a causal relationship exists.
And the basic idea here is on the x-axis we have our predicted probability of being good, a good credit quality.
And on the y-axis we have the actual truth of whether you're a good or bad credit, you have the good or bad credit.
But not any more than you would have gotten from just looking at a very simple picture or a table or whatever it is that you've tried to do it first.
So the next stage is collecting information.
And I would add to it, no matter how big the data are.
One that we saw in the very first module, when we used the sample application, was data coming in from a simulated device.
There are many different kind of populations that you can think about or construct when you're making inference from data.
If you find things that kind of make sense to you, it may be an indication that you actually need to sharpen your question a little bit, to make it a little bit more precise.
So the basic process that we'll go through here is we'll first split the data set into a training and test set.
Whether they were wrong.
And that liquid calories are important things to watch out for, because many times, when people are consuming liquids, they don't realize that they're consuming calories, and you're not listening anymore.
And so it's not necessarily going to be good for be predicting the outcome, but it may, nevertheless, have an important association with the outcome.
So the next type class of variable is what I call the key predictor.
And intermediate values between 0 and 1, this usually gives a reasonable rank ordering of different classifiers.
So that's what we'll talk about more when we talk about formal modeling.
The objective of this course was to expose the American MBA students to a totally different country and culture.
Now, the last feature of a good question is very important, and it's that a good question tends to be very specific.
So we don't have to make too many assumptions about the population in order to order that question.
Another way to ask this question really, is are you out of money, okay?
And so it's important to kinda either get back onto track with your original question.
So it's possible to tell just from this graph, without doing anything fancy, that the ad campaign seems to add about $100 per day to the total daily sales.
So, knowing who your audience is and knowing what questions are relevant to them is a really important part of asking the right question.
The second part involves collecting information and comparing your expectations to data, and the last part Involves reacting to data, and revising your expectations.
And this method begins by creating the PullParser object.
And then you can iterate through this process until you arrive at a model that you think reasonably summarizes your data and answers the question that you were looking to ask.
So ultimately, what we're leaning toward with setting your expectations in collecting data is called a change in behavior or an understanding of the mechanism you're trying to study.
As I mentioned, 4/4 time, the top number Indicates the amount of beats that are in any given measure.
So it makes it harder to analyze the data.
So sometimes with inferential questions where you're just looking for an association, if the association is of interest that meant you evolved into kind of asking a more causal question where you may need to design a different kind of study, maybe an intervention, or a controlled trial.
So auxiliary data about the population can be very useful for characterizing any sort of selection bias you might have in your sample.
And the problem with asking a question that's not really plausible is that it may lead you to collect some bad data.
For things like earthquakes ,hurricanes, or wildfires, things that occur in space, you often collect data on that and analyze it using models or whatnot.
So you may underestimate how certain you are about certain characteristics of the population.
Parameters usually represent characteristics of the population.
Keeping track of your time and your budgets for both money and time is an important part of managing the data analysis process, so that you kind of manage the continuation into the next phase and to argue for more resources if you need them.
Because if you go to a patient and you tell them that they have cancer, it's going to give them a huge shock. What we give is a seriously bad news, and they may end up going through a pretty painful treatment process and so on.
And analysis for decisions is often different from doing analysis for like supporting a research paper, or for shipping a product, or even if you're writing a technical report.
And they're a much simpler form than what might actually be going on in the population.
You always set your expectations for what the data should look like and then when you look at the data or you make a plot or a table and it doesn't look like what you expected, and you can't quite explain, it you may need to communicate that to a person or a group of people in order to get some feedback or answers about that.
And so, try to have a broad array of measures of uncertainty so that people can get the full picture of what's going on in your analysis.
A tool that makes it easier to stress test your ideas, track your progress, and avoid breaking the bank along the way.
Part of this is just this idea that you always wanna be checking to see that you haven't missed anything, you wanna be challenging your results all the time.
If I go and edit the function node, and change the way that I return the text, like this, so I'm no longer passing on the request object to the response object, you'll see when I now try and load the page that I don't get any response back.
In this video, I like show you how to actually carry out that computation efficiently, and that is show you a vector rise implementation.
On our planet one out of five people is Chinese.
So it's a very basic setup, very simple, and this is kinda what you would love to see in your data in an ideal world.
It may not be the last thing that you focus on, but it'll be what you initially focus on.
Our final estimate is one-third or .33 penguins with turquoise hats.
Both the interface that we're going to use for doing assignments, called Jupiter Notebooks, and the main libraries for the first two courses, Pandas and Matplotlib, are part of the SciPy stack, and provide an excellent basis for moving into machine learning, text mining and network analysis.
If every subject was measured three times, you wanna make sure that every subjects got actually three measurements associated with it, right?
And the reason why is it just provides for a richer discussion when you can incorporate the uncertainty into any predictions or any estimates that you make from the data.
Are there positive and negative values, are you expecting positive and negative values, things like that?
But they're limited to doing their processing in a single pass of the document.
This is 20 data points.
You're just trying to make a prediction using any combination of features, using an functional form of any type of model.
But as you saw, that data has a complex format.
Do you have enough evidence to make a decision?
So, the two extremes of either a very high threshold or a very low threshold, neither of that will give a particularly good classifier.
So, for example, if you know that the average level of some feature In your population is, on the order of plus or minus whatever, ten.
These questions will help you determine how to manage the data analysis process in general and whether you can stop the exploratory analysis process and move on to the next phase.
He was CEO of a big multinational company.
All right, let's take stock of where we are, we've covered so much.
And finally of course, included with the totality of the evidence, is the kind of any results they may already be existing that people have researched before.
Sometimes, these are called hypothesis generating types of analysis because you're looking at the dataset that you have in hand and looking for relationships that might be of interest.
If you can't do this, then it's usually a bad sign that it's not a great question.
And I've overlaid it with the blue curve, the normal distribution, that's fitted to the data.
Or how it might manifest in that way.
Here I've got the application open in the IDE, and now I'll open up the file that does the downloading and displaying.
And so and that may end up wasting a lot of time, a lot of money for people if you've collected bad data that question is not really plausible.
So you don't necessarily have to distinguish between, say, a key predictor and a set of other predictors, okay, you just want to use all the information.
And in that week, I found I was consuming way less calories and I lost a few pounds.
Around the Internet, And many web services now provide data in such formats.
Now here are the results that we get from a regression model of that nature.
Now if the results don't match what you're expecting, this may be an interesting thing, but it definitely needs to be followed up.
So there's three separate components there that all could be evaluated independently.
And then slowly add things to the mall to see how our results change.
There was a realization that the only way to do these things was to have a machine learn to do it by itself.
You heard a few of our PWC professionals talk about these different types of data, how they're used, and how technology is impacting data analytics.
And so, part of this is because season is very strongly related to both, but it's kind of positive, it's correlated in one way with mortality, but it's correlated in a different way with pollution.
And that will often result in problems in your interpretation.
So the first question we want to ask is, what do we expect to see?
Imagine you've got some box, and there's something inside you wanna get at.
So we talked about evaluation of learning algorithms, evaluation metrics like precision recall, F1 score as well as practical aspects of evaluation like the training, cross-validation and test sets.
And there aren't that many points above the line.
Because the type of question that you are asking can greatly influence the modeling strategy that you pursue in any data analysis.
The third type of data analysis question is an inferential question.
Over and above whatever background trends that might of been going on that you that you're not aware of.
But the point is that our estimate is .33 and the truth is .4.
Well, it's conventionally represented by the letter g, the lowercase letter g, and it carries a mysterious name, it's called the acceleration due to gravity. In the next video, we'll take a look at why that name is appropriate.
In the course of teaching this happiness class, I have learned more about happiness than I could have ever hoped.
Suppose you're developing a new product, and you want to know how much people would be willing to pay for this new product.
If you want more details about JSON, please take a look at this website.
So that's the third part.
Often in data files, there's some junk at the end, some comments maybe, just some notes that someone put in there, especially if they were exported from an Excel spreadsheet.
We can use the models to compute other quantities too, for example we might want to know how many people will be willing to pay more than $30.
The easiest solution is just a scatter plot of mortality and PM10 on the x-axis, and here you can see what the relationship looks like.
So, if we collect the data on people who have five servings of fruits and vegetables per day, and we collect data on how many respiratory tract infections that they have.
So after the trip to India got over, when I went back to Austin, I put together a syllabus.
So here's what that picture looks like.
So unlike the normal which has negative and positive values.
So you solve the summary statistics from this particular model, which is okay.
Everything you do there will be some sort of action, and the point is you wanna think about what that consequence is gonna be, before you do it.
The next thing to do is to try to take a stab at your solution.
On the x-axis here I've got a simulated predictor that ranges between -2 and 2, roughly.
By virtue of the fact that you are sampling data, you will introduce uncertainty, because you don't have all the data.
And when temperature is lower, ozone tends to be lower.
And so, collecting that information is key, because it will tell you whether or not your expectations were right.
So here at the very top is what's called a confusion matrix and it shows the number of predictions that are in the truth, bad or good, that's called the reference, and then what we predict to be bad or good.
The advanced Python isn't strictly necessary for the rest of the specialization, but many of these examples you might see on the web or broader data science topics like Big Data and real-time analytics, might require a knowledge of some of these more specialized features.
And gravity is a physical phenomenon that produces attractive forces between every pair of objects in the universe. This ball is exerting an attractive force on me by way of gravity, I'm exerting an attractive force on the ball by way of gravity, every possible pair.
Now each one of these questions has very different characteristics and imply different ways that you can interpret findings from any data analysis.
Can give you a little bit more confidence in the idea that your data set is properly formatted and it came to you in the right way.
So in many ways this is like with film-making, you want to make a rough cut of your movie, right?
So the question you have to ask yourself is how much money should you bring.
Try to always think carefully about whether or not you actually have to make an inference because, in many cases, you may not.
When you publish data, you use a topic.
And then if there's any other metadata that you might need, things like codebooks, things that describe what the variables are, make sure that comes with the data too, okay.
Here is what particulate matter data looks like.
Here, in Silicon Valley where I live, when I go visit different companies even at the top Silicon Valley companies, very often I see people trying to apply machine learning algorithms to some problem and sometimes they have been going at for six months.
So another question that we could ask, is what best predicts mortality in New York City, using the data that we have available, okay?
That's a very simple question and the solution is to really just do more thinking beforehand.
So, one common source in a city like New York is gonna be traffic.
And when you're finished, take a picture, do a reflection and send it to us.
Then again, if you look at the temperature 70 degrees and where the blue line is there, you'll see that most of the points are below the line, actually.
Inference is the task of making statements about things you can't observe.
And there may be multiple parallel purposes all at once.
And so one thing you might do is try to pilot a one week experiment where you buy Facebook ads for a week and see how it does.
Let's break it apart. First, the data comprises a single JSON object, and that object is a map, and that map has one key value pair.
This week, you spent a lot of time learning about the different types of data that can be used to solve problems.
So here's what we would expect the data to look like if they were drawn from, as representative samples from the population that was governed by a normal distribution.
The first module focuses on getting prerequisites in place and reviews some of the basics of the Python language.
And the goal is to generate evidence against your primary model.
But I am going to show you how some of the techniques that designers use in games like this one can be applied to problems in business, education, health, and other fields.
Now we have looked at two different types of models to tell us about our data and to tell us about the population, okay?
So you may pursue other important findings, that's okay, but it's important to just realize what question you're actually asking.
There are a couple of just core concepts that are important for any type of routine communication that you wanna think about.
I'm gonna talk a little bit more about populations and the forms in which they can come because populations can come in a variety of different forms that you may not necessarily expect.
And we want to know how that data matches our expectation, which is what this picture is giving us.
So, at this point you've done a lot of exploratory data analysis, and you probably have a reasonable sketch of the solution that you're looking for for your question.
This data comes from the National Morbidity and Mortality Air Pollution Study which is an Air Pollution Study that I was heavily involved with.
So, here is the results of fitting that model to the data.
We can compute z2 as theta 1 times x and that would give us this vector z2; and then a2 is g of z2 and just to be clear z2 here, This is a three-dimensional vector and a2 is also a three-dimensional vector and thus this activation g.
So here's the actually data for New York in 1999, with a fitted linear regression line overlaid on top, okay.
This may be something that you do without much thinking but actually, it's a very important part of the data analysis process.
The last part is incorrectly specifying the model for your population.
Do you care about those differences?
If the data are the population, one of the features of your analysis is that it will have no uncertainty because there's no sampling process that generates the uncertainty.
It's mature, and there's plenty of resources available from books to online courses.
But it's not, it doesnt exactly fit it perfectly, and still you have a bunch of, the curve, is kind of covering values where there's no data between the zero and five range.
So, cuz you wanna be able to start with something simple and then you can kind of get more complex a little bit later.
The sign that you have a vaguely defined population is that often the results are uninterpretable or difficult to interpret.
So do you need to collect more?
You just want to be able to generate a good prediction from a set of variables and you're not developing some detailed understanding of the relationships between all the features okay.
So it's possible to just kind of go on forever.
So the example I'm going to use here is going to be an advertising campaign for a new product.
And that way you'll know whether the data meets your expectations about the question.
There are also nodes to allow you to respond to messaging systems like MQTT and MQ.
Once you're outside that gray area, you have almost absolute certainty, because the outcome will either be 0 or 1.
And before wrapping up, there's just one last thing I wanted to say.
One is that, that cost meets your expectation. So suppose you thought it was gonna be $30, and it ended up being $30, then that's great.
Computational biology.
You estimated some parameter and that was the result, and maybe there's a competence level that you can present around that too.
So you wanna think about how sensitive your primary model is to various changes that are introduced via the secondary models.
But usually the number will be small, and very often the number is actually just one.
So look at the data and make a plot.
So doing a simple search and looking to see if your question has already been answered, either inside or outside your organization, can save you a lot of time and money and allow you to ask the real question that people want answered, opposed to one that's already been answered.
I find it's very useful to look at the last few rows just to make sure, for example, that the data set is read in properly, that you've got all of the data that you were expecting to get.
This is basically regressing mortality as the outcome and PM10 as our key predictor, without any other factors, just as kind of a baseline model.
Now you have a lot of tools at your disposal, but you've probably heard a lot of different songs that are really good.
If the way that you got this data said that you were expecting there to be 1,000 rows, then there should be 1,000 rows in the table for example.
The meal was $10 more than you actually thought it was gonna be.
And it's always important to set expectations for models, so I know it's very tempting to get right into the data and see what they look like, but you gotta be able to set your expectations.
The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.
Don't worry if these two terms don't make sense yet.
With associational analyses, the basic goal is to determine whether a key predictor and an outcome are associated with each other in the presence of many other potentially confounding factors.
You may have enough information as it is to kind of set prices or to figure out how your marketing campaign's gonna go.
Another word for this is sensitivity analysis.
You don't have to necessarily use a computer.
On the other hand, if the ad campaign were really expensive, so maybe it's $20 a day to run these ads.
And so, if you're interested in making these predictions and being accurate about them, you want to make sure you have a model that's reasonably a reasonable approximation of the population.
And so what this suggests is that over time as technology increases, big data will change.
Suppose I want to know how is air pollution in New York City related to mortality in New York City.
And so the key predictor explains some variation in the outcome and it's something that you're interested in.
Angry Birds and its various incarnations has been downloaded over one billion times.
Finally, it's just your attitude when it comes to communication, and it's important to have an open and collaborative attitude so the audience is fully engaged in giving you feedback and helping you out with your analysis.
Or, you know, what's the average level of air pollution in the city of Baltimore?
Okay, so that's the art of song writing.
In this module, we will discuss examples of how 3D printing enhances product customization, as well as the development of on-demand manufacturing.
So, as usual, we're going to predict 1, y equals 1, if h(x) is greater or equal to 0.5. And predict 0 if the hypothesis outputs a value less than 0.5.
But that's true, but it doesn't necessarily mean that pm10 is not associated with mortality, from an associational standpoint.
So the addition of this confounding factor dramatically changed the association we estimate between pm10 and mortality.
>> I am for global health and i think the largest single challenge for global health will be climate change.
When it's possible, if you're making a presentation and you're showing the data, if you're showing some summary or a statistic about the data, try to show a measure of uncertainty to go with it.
Suppose you wanna make an inference from this data set, what is the population there now that you're trying to make an inference too?
Because remember for every diamond in the rough, there are 99 pieces of coal.
In this video, we will try to define what it is and also try to give you a sense of when you want to use machine learning.
Before we go into programming fundamentals, though, we'll talk a bit more about what data science is, and why it's sweeping over the world.
Most of natural language processing and most of computer vision today is applied machine learning.
And so you get to an answer that you can use or can make a decision on.
Failing to do just one of these things, can mean death to your business, however, great your idea is.
Then that web service will return the earthquake data in XML format rather than in JSON format.
So there are linear relationships between them.
Just to say that again, what this neural network is doing is just like logistic regression, except that rather than using the original features x1, x2, x3, is using these new features a1, a2, a3.
And so that prima facie evidence is really this initial solution, is the simplest solution that you can think of at the moment.
It's very simple, there's a key predictor and there's only one confounder.
To the right of the grey area you'll notice that the outcome is always 1, and to the left of the grey area you'll see that the outcome's always 0.
Because one of the things that plots give you is they give you a summary plus a deviation.
Now, the next thing I like to do is to check the edges of the data set.
So here's the results of including temperature which is the tmpd variable and dew-point temperature which is the dptp variable.
And so, watching you label emails as spam or not spam, this would be the experience E and and the fraction of emails correctly classified, that might be a performance measure P.
Secondary to that, you might want to describe what type of question it is that you're trying to answer if it's not immediately obvious, okay.
So the data I'm gonna use comes from this national morbidity and mortality in air pollution study, and here's a picture of daily mortality in New York from 2001 to 2005.
The first part is setting expectations.
But the important thing to know here is that you should always be asking for a measure of uncertainty along with any estimate that's obtained from the data.
The next question you wanna ask yourself is, can you place your results in a larger context?
This lecture talks about the more routine kinds of communications that you do on a day to day basis.
And so, if we take a variable and deliberately make changes to it, on average, how will another feature or another characteristic be affected?
So put it this way, if all models are gonna be wrong, you might as well try to find something that's useful.
It is, and though the journey for these guys and for you won't be a walk in the park, all you'll need is your creativity, willingness to work and passion for your idea.
The http node in Node-RED allows you to set up a REST API.
So, the first question is really are you out of data?
So it's important to know when to stop the data analysis process or the exploratory data analysis process in order to prepare results for presentation and to argue for continuation of the project, if that's what you believe needs to be done.
So I got to meet some of my old batchmates from both schools.
Who are you at your best from a character perspective and how can you use what already strong and right in you to enhance your resilience and enhance your relationships,and have greater positive emotional sort of stuff?
So there is at least an association there, so that's good.
And now, the important part of this feature is that you have to be able to define who your audience is going to be.
So for example, the data for this might look something like this in this table here where you have 21 days.
Which is they want more people to come to their site, and they want people to do things on their site to interact with their products, to write product reviews, to watch videos, to find out more, to register products they've already bought.
So what we see here is that the model kind of captures the general trend of kind of increasing temperature and increasing ozone, but it tends to be biased within certain ranges of temperature.
Another thing you can do is to look at some measurements that you have in your data set.
And it may be too complex or more than we need to capture a very simple smooth trend in the background.
Now, don't worry, I'm not actually going to teach the class from inside a video game.
And so, different groups of people will interpret evidence differently, not because of their interpreting it incorrectly but rather their decision to act on evidence will based on a variety of factors that are really outside the dataset and are not really part of the analysis.
Okay? And the problem is the trivial model is not useful.
I asked them whether they'd be interested in taking a class that focused on one of life's most important questions, what are the determinants of a happy and fulfilling life?
When we cook, we get to decide what is going into our children's bodies, and unlike the processed food manufactures, we have a strong interest in the health of the people who are going to be eating the food that we make.
Now I've left this to the end for a reason, because sampling variability will exist even if you've done everything else right.
We're not going to be talking about those kind of models of course.
One-third of the penguins have turquoise hats so we'll assume that one-third of the population wears turquoise hats.
So, I've plotted here, the variable importance plot that rank orders all the variables in terms of how important they are to improving the prediction skill of the algorithm.
And particularly, you wouldn't want to send someone who didn't have the disease to a treatment that's going to be very painful and have a lot of side effects.
And even though they're called tuning parameters, they can often have a very big impact on the prediction quality of the algorithm depending on how you set them.
And again, this method identifies whether this data element is one that was being saved and if so it records that in addition if this is the end of the earthquake tag, then the result string for this piece of earthquake data is added to the result list.
Secondary analyses that I call them, that will try to test whether your primary analysis is appropriate or not.
We know from the science that strong relationships help us to stay resilient.
In this lecture, we're gonna talk about whether you have the right data to answer your question.
So, the first trap that we all fall into and by the way I am guilty of all this traps too and I'm always trying to work on them.
This format is intended to be light weight and resembles the data structures found in traditional programing languages.
But you may not be so sure about whether that solution will hold up to kind of challenges or will be sensitive to the little changes in the data or the model.
Because ultimately you're gonna want this model to be reproducible.
Teaching about learning algorithms is like giving a set of tools. And equally important or more important than giving you the tools as they teach you how to apply these tools.
Here's a slightly more recent definition by Tom Mitchell who's a friend of Carnegie Melon.
So in this lecture I was going to talk a little bit about what are the characteristics of a good question.
So maybe the health outcome is mortality.
So we had two different models kind of adjusting for the background trends.
In fact, this is an increasingly popular way to transport data.
Now consider a different example.
Small amounts of sugar are almost certainly not going to cause problems for a healthy child who's eating a balanced diet.
But plants, they can't do that, they can't escape.
So just for comparison, we can fit actually even more complex model.
We don't need to know, we don't need to worry about estimating associations, or adjusting for confounding factors.
So from the dataset we can estimate that it's one-third of the dataset.
HSK is the most important Chinese proficiency test in use today.
Under different sets of models, including different sets of confounding factors.
And this will kinda make up the totality of the evidence for your analysis.
We allow for this background trend.
One thing to note about this example is that it was easy to know whether your expectations were matched with the data or not.
And if you're gonna make an argument to someone, you're gonna argue for doing this versus that and you wanna build evidence to make your case, this is gonna be a basic sketch of how that argument's gonna work out.
So once you've gotten the check, you observed the reality of what the meal costs.
So the outcome can take either value in that grey area.
So the next item that I always try to think about when I'm looking at a new data set, I'm just getting involved in a data analysis is what I call ABC, always be checking your Ns, okay?
With the advent of automation, we now have electronic medical records, so if we can turn medical records into medical knowledge, then we can start to understand disease better.
Including DOM parsers.
And what this is doing is just logistic regression.
So these are the kinds of things that can go wrong with inferences and how you can try to protect yourself from these things.
This is the story of Beth, her friend Carl and an idea they believed could become a great business.
And it doesn't have an important association with the outcome.
In particular, particulate matter is high in the summer.
Most of the processed food that we find on a supermarket shelf today contains some form of sugar.
So many analyses are done to ultimately inform a decision.
But what the data doesn't tell you is why that relationship might be nonlinear.
Next the code uses a JSON tokener to parse the JSON response into a Java object, and then to return that top level object, which in this case is a map.
So this model is telling us something completely different about the population than the normal model was, right?
And for events data, there is a recommended format to put the JSON data in and that's the format that the QuickStart service uses. However, you are not required to use this.
But because your original expectation was so diffused, and so kind of general, you don't really learn that much from collecting the data given your very diffused expectation.
With that, I'd like to extend, once again, a very warm welcome to a life of happiness and fulfillment.
So, this is a linear model, and the parameter that we're trying to estimate here is beta.
So plots can very quickly reveal this kind of information in a way that often, tables cannot.
But that the association is inherently weak.
The fifth type of question that we're interested in is a causal question.
And here I've plotted the 95% confidence intervals for each of these associations.
I live in New Zealand and I am currently studying health science and psychology in University of Auckland.
It's one of the most fundamental components of being a data scientist.
So ultimately, the question you're asking is is it worth the risk of buying these ads. Given the evidence that you've seen from these different models showing you that there might be a range of say, $39 $49 increase in total sales during the period of the ad campaign.
And that method creates and sets a list adapter for the list view, passing in the result list, that was computed back in handle response.
The next thing, once you've solved the directionality issue is the magnitude.
And by doing so, we're saying that, you know what, if we think there's more than a 30% chance that they have cancer we better be more conservative and tell them that they may have cancer so that they can seek treatment if necessary.
That's the key, and often the more sophisticated approaches will give you a little bit more insight.
That's an example of gamification.
Okay?
So a couple of example of what a model might consist of is that we might assume, for example, that the units of the population are independent of each other.
Now this type of relationship is often very difficult to identify outside of highly controlled environments, for example in engineering processes.
So for any new data set that you bring into a prediction algorithm, you'll probably have to tune it a little bit, and that's okay.
And if you can't remember or you can't reproduce the tuning parameters, you'll never be able to reproduce the algorithm itself.
So if they truly have the disease, you'll want the algorithm to pick it up.
It's also different from career smarts, which has to do with the skills required to advance in your career.
And so it's useful to the audience sometimes to know if you're trying to ask an inferential question versus a causal one, versus a predictive one.
And, furthermore, if you choose any point on the blue line, there are roughly the same number of points above the line as there are below it, all right?
So as you can see as the predictor quality goes up you'll see that the number of actual good credit quality people increases.
And so this is a much more difficult type of question because now you're concerned about things that are outside the dataset.
You'll also see that there are input nodes to respond to various other inputs, such as a web request or a websocket, or simple tcp sockets.
Because It will save you a lot of time and a lot of money and a lot of frustration if you just think a little bit beforehand about what you want to do before you start digging into the data and digging into the details.
And so the result is that if you fit any models, if you estimate any characteristics from your data, those things that you estimate won't actually apply to the target population that you originally designated.
Okay.
So I wanted to give a brief example of how these different types of questions can lead to different conclusions, even on the same dataset.
So you can see from this picture that the fit's not perfect either, okay?
Each course progressively builds on your knowledge from previous courses to give you a well-rounded view of what Data Science is, while helping you to develop skills to practice data science.
Or it might be 5% and it's terrible.
We have beta, which is the change in y associated with a one unit increase in x, adjusted for z.
For starters, the data may not even be representative of the population.
This lecture just raises a couple of questions that you can ask yourself to figure out when it's time to stop.
This lecture's just another quick example of how to use statistical models to explore your data and to develop a sketch of your solution.
So Associational Analyses, there the aim is to look at the association between two or maybe more features while in the presence of many other potentially confounding factors, okay And so for the most part we're interested in looking at the association between two things, but there may be other things that we have to adjust for, or account for, and that we want to make sure we're accounting for properly.
So a simple example might be, you might wanna know whether a good diet leads to better health, okay?
Then there's a sharp increase in ozone with every additional degree of temperature.
So, if you implement it using these equations that we have on the right, these would give you an efficient way or both of the efficient way of computing h of x.
Or it maybe some information from another system where you have additional information that puts the sensor data into context.
What if there was a tool that could help with this challenge, simply and visually.
The first format we'll talk about is the JavaScript object notation JSON.
Because that's what the normal distribution based on this data set tells us about the population, okay?
So that would be a high precision, relatively low recall.
And check to see if they’re actually correlated, right?
Because we don't have any a priori information about might more or less important.
But gravity is a very weak phenomena, and as a result, the only object big enough to produce noticeable gravitational effects, that is once we, we are routinely aware of is our earth.
There will always be something more complicated.
So, for example, if you're an academic your audience might be collaborators, they might be a scientific community, they might be your funders or the general public for your problem.
So whether a model could be considered more or less plausible will depend on your knowledge of the subject.
Here at Stanford, the number of recruiters that contact me asking if I know any graduating machine learning students is far larger than the machine learning students we graduate each year.
That's food that's produced by companies who prioritize short-term profits over the long-term health of our children.
Plants have to know what their surrounding are. They have to feel the weather.
So these things that we can't observe, the things that we wanna make inferences about, these are usually referred to generally as the population.
So, let's say we're going to sample three penguins, and the way that we sample those three penguins is, we stand there in front of these ten penguins and we take the first three that walk up to us, okay?
You might not have the money.
But it was also designed to be very, very small in code footprint. And also use of power and network bandwidth.
On the y-axis I have the values of the outcome which are just 0 and 1, so it's a binary outcome.
So the goal of EDA is to get a chance to look at the data, see if you've got the right data, do you need to get more?
Again, you can be more specific about the differences between the two.
So this is a sign that you're out of data and you need to get some more.
So one of the pollutants that we'll look at is nitrogen dioxide.
And then, once you've collected that information, and compared it to your expectations you can react to it, and maybe change your behavior in some way.
So this brings us to an important point which is that it's important to have a very sharp expectation or a sharp hypothesis about what you're trying to investigate.
So it's important to always keep in mind the ultimate goal of the analysis, the decision that might need to be made, and the stakeholders who are involved in making those decisions and what kinds of factors they need to consider.
So a very specific question will tend to lead to a well defined intervention and that will give you, those kinds of results will allow you change behaviors, to change actions to actually make a difference in the world.
The point is not really to think about yourself as defending the analysis but rather to work to get their input so you can do your best work.
All the predictors might be equally important before you look at the data.
And so for example, if you're looking at a table, I like to check the top and the bottom.
Let me give you an example.
For example, in health research, there are a lot of questions that we have about how the body works and how does disease occur, but we simply cannot conduct the experiments that we might like to conduct because you might be putting people in danger, or it might be simply unethical to do such experiments.
So, in case what I'm saying here doesn't quite make sense, stick with me for the next two videos and hopefully out there working through those examples this explanation will make a little bit more sense.
Or it doesn't help you learn the process you're trying to study or in this case, the cost of the meal at this place.
The first, of course, is the audience.
What does that mean?
So it could be that you just made a simple mistake, and you need to check for this.
This method begins by passing the raw response through a basic response handler, which just returns the response body without the HTTP response headers.
Happy smarts is, by the way, very different from academic smarts, which has to do with IQ and test-taking ability and critical thinking ability, those kind of things.
So the basic example I'm gonna present here is going out to dinner with your friends.
So then we can just repeat all the steps that we just went right through.
This was a quote by John Toupee, one of the first data scientists.
And then anybody interested in receiving that data, subscribe to the topic.
So we have three models.
It's also useful to kind of remind yourself of what type of question you're trying to ask, whether it was descriptive, exploratory, inferential, predictive, causal, or mechanistic.
When XML is used to encode an HTTP response.
So, the first and I think the most important thing you want to have in a presentation is to state the question that you're trying to answer.
And yet he seemed far happier to me.
We now have two real numbers.
There are several different types of learning algorithms.
Your data set is the thing that's inside the box but you want to make sure, kind of see what's, everything's kind of the right shape and size.
Then you can actually end up with a very high precision with a very low recall.
This node allows you to create a web server within Node-RED.
And for each of the kind of results that you generate, you similarly wanna think about the direction, magnitude, and uncertainty of those results.
So, what are the expectations that we have about prediction problems?
Even some brands of the foods you might consider to be inherently healthy, can turn out to be heavily processed, like yogurt, or cheese, or even pasta sauce.
You'll need to build an infrastructure that won't collapse, as your business begins to grow.
So steady expectations involve deliberately thinking about what you're gonna do before you do it.
But in reality we're doing something more exploratory and so that's sometimes called data dredging.
Because the goal is to really produce a very good prediction of a given feature, given a set of other features, and the goal is not really to explain how things are working.
So what are we going to do about that?
And it should be, any actions that might be taken that are a result of the answer to your question.
You can go to the closet, get a sweater.
Those are the three parts of data analysis that you often will cycle through, many times, in the course of analyzing any given data set.
So we think of the line as being kind of unbiased, right?
The task T would be the task of playing checkers, and the performance measure P will be the probability that wins the next game of checkers against some new opponent.
Here's some data with a low-S smoother, which is a flexible smoother that can capture a lot of different kinds of smooth trends, okay.
But this model tells us that the hump's more like around seven and ten.
We just wanna know what's the proportion of penguins with turquoise hats.
And they include a variety of variables to help you to predict the creditworthiness of these people.
Again, like a descriptive question, typically with an exploratory question, you're not interested in things that are outside the dataset, but rather in summarizing and characterizing relationships within a dataset.
And then finally a simpler model would be more efficient because you're using more data to estimate fewer parameters and so that's generally good from a statistical standpoint.
Let's look at the source code to see how this works.
So imagine if you're trying to make a statement about the entire United States, everyone in the United States, okay.
But they represent population features and relationships, okay.
And an inferential question can often be the result of lots of exploratory and descriptive types of analyses.
It had a 70% accuracy, it had 2.6% specificity.
They might describe relationships between variables or certain features like means or standard deviations, and because they are characteristics of the population they're generally considered to be unknown.
So similarly it's important that you identify an important question that you wanna ask.
Sweetened drinks like sodas and energy drinks, and even fruit juices are very high in sugar.
Handwriting recognition.
So typically in a formal modelling, you'll have a primary model and maybe some secondary models.
Now whether your findings do fit the kind of existing literature or what people believe.
And so, the key to an inferential question is that you wanna make a statement about something that you don't observe.
Finally, Python has a significant set of data science libraries one can use.
And it's value is the longitude at which the earthquake was centered.
As we'll see, gamification is by no means limited to these kinds of contexts that you see here.
The fact that there's that relationship between the object's mass and its weight is convenient, but what to call that 9.8 Newtons per kilogram that, that, what's known as a constant of proportionality.
SAX parsers.
It is a number of topics, or a number of levels separated by the slash symbol.
This course is the first course out of five in a larger Python and Data Science Specialization.
Because often what you'll want to do is get another data set to replicate what you've done.
But it doesn't matter, at first, kind of what you choose, you want to be able to say craft the solution and then kind of hang on to it for a moment.
And so the key word here is sharpness.
You can see it asks you what actv verb you want to respond to, and also what the URL is for this endpoint.
And at 70 degree point mark the linear model is actually biased upwards.
So reproducible research is concerned with creating code and documents that will completely reproduce all of the analysis that you've done in a transparent way so that you can communicate that to other people.
And that is what actually sends the response back to the requester.
If we teach out children to see sweetened drinks as an occasional treat instead of as the norm it can really help to keep their sugar intake in check.
Always assess the sampling process and think about differences between the data that you collect and the population that you're trying to make inferences to.
Then you can kinda update your thinking, and try to come up with a different primary model.
So the first thing we need to talk about is what is a model, okay, and why do we need them, okay.
So, the general framework for doing formal modeling is very similar to kind of what we've been talking about all along in this course.
For anyone considering a career in business today, the ability to frame and solve problems using data analytics is really an essential skill.
And concretely, what the hypotheses is outputting is h of x is going to be equal to g which is my sigmoid activation function times theta 0 times a0 is equal to 1 plus theta 1 plus theta 2 times a2 plus theta 3 times a3 whether values a1, a2, a3 are those given by these three given units.
But in contrast this classifier will have lower recall because now we're going to make predictions, we're going to predict y = 1 on a smaller number of patients.
And so, separating these kinds of questions out, and the goals of these questions is very important because if you were to ask an inferential question, but then do an analysis that's really tuned for prediction you might be lead to come to the wrong conclusion that, oh, pm10 is not important.
To map this world of challenges, so that, it can be tamed and organized, allowing you to explore it, manage it, and plan for it.
I know that you're probably a busy person with many, many other things going on in your life.
And z3 is stated 2 times a2 and finally my hypotheses output h of x which is a3 that is the activation of my one and only unit in the output layer.
Okay. So just take this as a basic example.
This is just a website that was up there and anyone who just happened to come by could fill in their name and say what price they'd be willing to pay.
I can't tell you how many times I've been in a presentation and we've been ten minutes in and I still don't exactly know what question this person is trying to answer. Okay.
We talked about managing anxiety either through cognitive strategies, deliberate breathing or developing and sustaining a mindfulness practice.
These include things like breakfast cereal, granola bars, cookies, crackers and sweetened drinks.
It's important to be prepared, so that you can be focused and concise about what you're trying to say or what you're trying to communicate.
Exploratory data analysis is a highly iterative process.
And so if your model's incorrect, it doesn't correctly reflect the relationships in your data in your population, then your parameters that you estimate will not really apply to that population and may be difficult to interpret.
There's no actual data here.
Because you're usually not interested in developing a kind of detailed understanding of how the variables are related to each other, or how about they're related to the outcome.
There are many other tools that one can use in data science, such as specialized statistical analysis languages like R, or more general purpose programming languages like Java and C. We chose Python as the basis for this specialization for three reasons.
And we challenged you to develop your own gratitude practice, so that one of the lens through which you view the world is what am I receiving.
Some of you may not be familiar with the concept of gamification, so let me give a brief introduction here.
So something you might do is just put out a simple survey, you might survey 20 people, and that may be a representative of the the larger population of people that would be willing to buy this product.
Hopefully, that will inform the next steps of your data analysis.
For example, we often think we're asking a causal question, but the way that we've done the analysis or collected the data is really results in us asking an inferential question, and that is the classic correlation does not imply causation type of problem.
And similarly, if the uncertainty blows up when you add certain factors into your model, then that will indicate to you that maybe you don't have a very good estimate of an association or an effect in the population.
That's a very, that pattern is kind of repeated across the years.
And so the question you really wanna know is, did the ads cause an increase in sales.
As far as the style is concerned, I generally find it useful to avoid a lot of jargon because as soon as the audience grows to more than just a few people, it's likely that different people will have different expertise, and different skills.
That will afford you to have less uncertainty about those parameter estimates because you're able to use more data for each of the parameters.
And so in this example, here are the F Scores.
It's www.happysmarts.com.
Ultimately, you want to gather a single fact.
If you've got data, for example with dates, often looking at the top and the bottom can be useful because if they're sorted by date, you can see if the range is correct.
They have to be able to feel if something is touching them.
So the easiest way to do that is to draw a fake picture.
And again, the use of the tokens, whether it's for the application API key or the device or gateway token, is also required to be followed.
First, you need to be able to identify the population to which you're making these inferences.
So we're going to talk about in the reproducible research course, the structure of a data analysis, how do you set it up, how do you organize it and put it together.
Welcome to the final video of this Machine Learning class.
And before we do the analysis, we may not weight them any differently, we may weight them all equally.
And in particular that content comprises the response data.
Now any good prediction algorithm, will help you to determine which variables are useful for predicting the outcome and which aren't.
So what we're going to do in this class is actually spend a lot of the time talking about how if you're actually trying to develop a machine learning system, how to make those best practices type decisions about the way in which you build your system. So that when you're finally learning algorithim, you're less likely to end up one of those people who end up persuing something after six months that someone else could have figured out just a waste of time for six months.
A few months ago, a student showed me an article on the top twelve IT skills.
Who knows who these people were, who knows if they were even prospective customers, people who would actually buy the product?
I'm also a professor of marketing at the McCombs School of Business at the University of Texas at Austin and I have been there for over 15 years.
It may be that you have enough information or you have enough evidence to make a decision based on lots of external factors that are not really in the data set, okay.
And then we're also going to slow down a little bit and shine the light directly on relationships.
And based on which emails you mark as spam, say your email program learns better how to filter spam email.
But once you've done that, if you wanna move on, then you can start using exploratory data analysis techniques to kinda refine your primary model, and then move on later into a more formal model.
And sugar can have many different names.
But when we do include it, we see this kinda quite a bit stronger relationship.
At any rate, before you've gone to the restaurant and eat the meal, you can use any sort of opreory information to set up your expectations for what the cost is ultimately gonna be.
But where the features fed into logistic regression are these values computed by the hidden layer.
That's communication as a product but as you go through the whole process, you want there to be constant communication and dialog so that you can assess findings and challenge them constantly to make sure that you get the best results that you can.
The better and more clearly you can describe how the data came to you, the stronger your results will be.
For the list view passing in the result list that was computed back in handle response.
And finally, a good question is very specific so that it can lead to a well defined intervention in the world.
The startTag method is called, passing in the element that is being started.
So, even if you have gigantic data, it might not be big enough to be able to answer your question if it's not the right data.
There may be cases where you actually have all the data you ever wanna see and then you actually don't care about things that are outside the data set, which is really, fundamentally, what inference is about is that you're talking about things outside the data set.
And then taking some of those techniques, and thoughtfully applying them to other situations which are not themselves games.
So, an increase in the pollutant results in a five percent increase in mortality.
So for example, in biology often people will use mice as models for humans.
I felt that we weren't doing such a good job of achieving this goal.
In contrast, there's a different way for combining precision and recall.
With the prediction question, the goal is to develop a model that best predicts the outcome using whatever information you have available to you.
This data's a random sample from all the different years that you might look at.
Now data analysis is often done in support of some sort of decision making.
But the truth is there are a lot of good questions, really interesting questions, that are just simply not answerable.
In addition to those factors, we have a number of parameters that we wanna estimate.
So I hope that you also got a lot of out this class.
Throughout these videos, besides me trying to teach you stuff, I'll occasionally ask you a question to make sure you understand the content. Here's one.
So, you're not trying to estimate any parameters.
So, machine learning was developed as a new capability for computers and today it touches many segments of industry and basic science.
But when our children consume too much processed food, then they end up eating too much sugar as well.
The important thing to remember is that the input node adds some information to the message object that flows along your flow.
It's important to kind of classify the purpose of a given communication and to provide the appropriate environment and audience for that communication.
It's gonna help you tell the story better in the end after you analyze the data.
Just so you can kind of keep them in mind and think about them the next time you're making your next presentation.
We'll talk about how researches are using this to make progress towards the big AI dream.
And so maybe we want to tell someone that we think they have cancer only if they are very confident.
But there's no actual data in this picture.
But draw a fake picture of what you're expecting to see with the actual data okay?
The difference once again, they, they, they, third person, verses me, me, me.
And then after the campaign finishes you have an average of, again, about $200.
Hi, I'm Roger Peng and I'll be your instructor for this course.
Why is it that you thought it was $30 and the meal turned out to be $40.
And then revising your expectations based on what you've seen.
The short answer to that question is that it's pulled downward by its weight and it accelerates downward in response to that force.
So a primary model is the model of which you kind of base your other analysis around, right?
So that similarly we have a(2)3 equals g of z(2)3.
It's the reason why evidence doesn't change people's mind, but one Facebook post will.
So in module three, we talked about the role of emotions in resilience.
So, the a1 is vector, I can now take this x here and replace this with z2 equals theta1 times a1 just by defining a1 to be activations in my input layer.
And so we can include nitrogen dioxide in our model as a potential confounding factor, and see how the estimate for particulate matter changes when we do that.
But once you kind of get into a data analysis, if you discover these things later, it can be a real pain in the neck to deal with.
How do they do that?
So that the implication is kind of the last step of interpretation.
It's often helpful or tempting to just show summaries, or you just kind of just not even data summaries, but just kind of summaries in words of what the results are.
And well, I won't even say who wrote this but someone else wrote in that there would be a problem with sex.
So, given a certain type of question, you will have a population that question applies to.
And mortality is our outcome.
This is all very nice and ideal because it's simulated, okay?
In the next lectures I'll give examples of associational and prediction analysis and how the formal modeling framework can be used to work through these different kinds of analyses.
Especially, if you have an interdisciplinary type of audience, you wanna make sure that everyone's on the same page at the very beginning.
And what would be the impact on the population of having an effect of this size or if the association was of this size?
And then gamma is the change in y associated with a one-unit increase in z, adjusting for x.
And so the primary result will be kind of initially what you focus on.
The third step is building statistical models to summarize your data and to quantify the relationships.
And that's something that's called over fitting the data.
In fact, they all seemed to lament that they did not have the opportunity to take such a class.
They have to be able to know where the light is.
So the first trap, speaking in third person.
And the question you wanna ask is does that magnitude make sense?
This F Score formula is really maybe a, just one out of a much larger number of possibilities, but historically or traditionally this is what people in Machine Learning seem to use.
But let me show you a couple of examples of the ways that people have tried to define it.
Such as randomized controlled trials, or directly controlled kind of laboratory experiments.
So it's important to not kind of conflate the magnitude of the association with the ultimate effect on the population that you're interested in.
That's what we're going to be learning in this course.
But because it's so important, it deserves a separate discussion.
In the very first year I taught the course, I was offered the university-wide professor of the month award.
The data you send can be in any data format, typically it's in a JSON object format.
There are a couple of key components that I feel like are important to making a good presentation that involves data analysis, and so I thought I would just list a couple of them here.
But the models that we use are mathematical models in many cases.
Now, there are other parameters, gamma and alpha, that are in the model and we need to have them in the model for the model to work.
And I'm going to categorize them into what I call Associational Analyses and Prediction Analyses.
Now on the other hand if you do find something that is expected, and your findings kind of match what everyone is looking for.
But the problem is, so this is, there's no model to help us think about the data.
There's also a key called lng.
But rather to focus on developing a model that's actually useful to help you tell your story about the population.
Obviously there'll often be many more things that you're gonna wanna do.
Or you might want to go into more formal modeling.
The fourth feature is that the question has to be answerable okay?
Every setting is going to be a little bit different, and so you're not going to always focus on a given metric for every single application.
But it's a good example as a starting point of what we're talking about.
Even as a teenager, I would read all kinds of books on philosophy and psychology, on what it takes to be happy.
And if you just take the average of (P+R)/2 from this example, the average is actually highest for Algorithm 3, even though you can get that sort of performance by predicting y=1 all the time and that's just not a very good classifier, right?
There aren't many data points below the line, right around that neighborhood.
So one scenario might have very little cost or another scenario might have very high cost.
Now that I've given you some background information about myself, let me turn to a question that you may be wondering about.
And, if every time you try out a new algorithm you end up having to sit around and think, well, maybe 0.5/0.4 is better than 0.7/0.1, or maybe not, I don't know.
For me, I work on machine learning and in a typical week I might end up talking to helicopter pilots, biologists, a bunch of computer systems people (so my colleagues here at Stanford) and averaging two or three times a week I get email from people in industry from Silicon Valley contacting me who have an interest in applying learning algorithms to their own problems.
Okay because there's nothing to make inference to.
So this is life without a model, okay?
And so you do the survey and then someone comes at you and says, okay well what did the data say?
Completing those two bits of information is all you need to do.
It's seen significant adoption over the last five years.
So Tom defines machine learning by saying that a well-posed learning problem is defined as follows.
Often, the only way that the procedure can be specified, is through computer code, or through some algorithm.
So decision making can depend on many different inputs.
So this is what I call, drawing a fake picture, okay?
So the idea is that you wanna build, if you're kinda making a case for something, you wanna build some prima facie evidence.
Interpretation in the data analysis is a continuous part of the entire data analysis process.
Well, you got this huge spike in the histogram at around $10, okay?
Now, given what we've seen before with the theoretical normal curve, with the fake data and the fake picture that we showed, how does this picture compare to the fake picture, okay?
And we're going to focus on two critical topics, and these are ones that are near and dear to my heart.
So in this example, this was an associational analysis, and it focused on estimating the association between two features.
And then, do you need to get other data, if you don’t have the right data, to answer your question?
And I think that one of the biggest global health care issue at the moment is the increase in mobility of patients to consume health care abroad.
So, the important thing about having a very specific question is that it often can lead to a well defined intervention, okay?
We're going to talk about some of the components of reproducible research in terms of programming which include Markdown, LaTex, R Markdown.
Where a value up here, this would correspond to a very high value of the threshold, maybe threshold equals 0.99.
Then we'll make predictions based on the model, but on the test data.
So, the first thing to think about is prediction quality.
Now the question that we're asking here is very simple.
Now what this results in is that the sample data will not be representative of the population.
Python is now the language of choice for introducing university students to programming.
So far the examples in these lectures have generally illustrated one phase of the EDA iteration.
And so the basic question that we wanna ask is, what proportion of this population of penguins is wearing a turquoise hat?
So we may wanna revise our original linear model based on what we found here to think, to try to capture these trends and to learn more about what's going on underneath.
In this lecture I'll talk about how to use statistical models to explore your data.
The other possibility is that the expectations don't match of what the reality is.
We will meet with people like the founders of 3D Hubs, who make available a platform which gives you access to 30,000 printers.
They can't move.
So just to summarize briefly, always think about what the population's going to be.
The guys who had done really well for themselves in school weren't necessarily the ones who were doing well in their careers.
And the nice thing about formal modeling is that it allows you to specify which types of parameters you're interested in and are trying to estimate.
So that's kind of what this course is about, is about managing this entire process and the iteration that you go through to analyze data and to produce coherent results.
And I believe that the biggest global health challenge, we're facing currently is poverty.
So the term architecture refers to how the different neurons are connected to each other.
The final module of the course is dedicated to the course project where you'll take some datasets, merge and clean them, then process the data and answer some questions.
He says, a computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
So, what do we do with this?
So it could be a good thing if you find something that's unexpected.
Reproducible research is another one of or more unique courses.
But they'll generally follow this framework, and you're gonna wanna iterate between kinda fitting primary models and secondary models.
And so you have to factor all this in when you kind of ultimately draw your conclusions about what to do or what decisions to make.
For example, this model, the normal model, says that 68% of the population of readers would be willing to pay between $6 and 81 cents, and $27 and 59 cents, okay, how do we know that?
A second range of machinery applications is ones that we cannot program by hand.
Now the second problem you can have is an incorrectly specified sampling process.
So that way, you can get a sense of, okay well, whatever I’m measuring in my data set, it’s measuring the same thing that this other metric is looking at, and so there may be some kinda validity there.
So now I've changed the question to a prediction type of question, and we want to know what predicts the outcome best.
So, that's basically it.
And the basic idea here, is that you need to be able to have a reasonable explanation for how the world works, if the answer to your question is kinda what you expect it to be.
And second, and more importantly, I want to start giving you intuition about why these neural network representations might be a good idea and how they can help us to learn complex nonlinear hypotheses.
So, the interpretation might be, okay air pollution is bad for you, okay, and then the decision might be we need to lower air pollution levels in whatever environment we're thinking about.
Cuz that gray area is the area where you have the most uncertainty.
If you can't coherently describe the population, then you can't really make any inferences.
And lastly, always require a measure of uncertainty for any estimate that you draw from the data.
Okay, 4/4 time.
That's part of why we saw it when we didn't include season we saw no relationship.
So, most of what I'm gonna talk about in this lecture really pertains to inferential or kinda causal questions, where the interpretation can be very challenging.
Finally, another type of factor that may be related to both mortality and PM, particulate matter, is other pollutants, right?
So the first is can we characterize the population?
So that's useful to know.
The basic form of a model in an associational analysis will look something like this.
Learning algorithms are also widely used for self- customizing programs.
So now the question we should be asking right now is, what does a linear model look like?
For as long as I can remember, I've always been interested in happiness.
And then you have maybe some potential confounder that will call z.
Things like confidence intervals and other measures that can give you a sense of the range of possibilities and kinda be magnitude of the uncertainty in your estimate are useful at this phase, before you start making decisions about what to do or what not to do.
Then we'll fit the model to the training data.
And actually I've highlighted the coefficient for pm10, it's even bigger than it was before and the standard error is similar, so it's actually more statistically significant in some sense than in the previous models.
So there's three classes of variables in an associational type of analysis.
There are two kinds of communications that you want to engage in while you're working through a data analysis.
So some of the things that I like to check are, do you have the right number of rows and columns?
How does those change from model to model?
Something that can also be helpful is to compute a set of summary statistics about the prediction algorithm, and you can see that here.
This lecture discusses the goals of EDA and what to expect from it.
And for sure, you will incorrectly specify the model for the population.
So the directionality, magnitude, and uncertainty, are the key things to look at when you've got your primary model results.
Samsung Nation is something that Samsung has on its corporate website, and it's a system using what we call game elements or game mechanics to solve Samsung's business problem.
But I need one more value, which is I also want this a(0)2 and that corresponds to a bias unit in the hidden layer that goes to the output there.
But it's often glossed over as a sort of a side component of many other programs of data science.
The next thing to think about is the content.
So the effect size essentially is what is the value of beta that you estimate.
So for setting expectations, typically you want to have a primary model, which is your best sense of what the solution should be and what is the answer to your question.
I'm also writing a book right now tentatively titled " If You're So Smart Why Aren't You Happy?" Which will be coming out sometime in 2016.
I should say that there are many different possible formulas for combing precision and recall.
So here are some of the results from the model that we fit.
So just keep that in mind.
Traffic can produce particles, it can also produce other pollutants.
In the next video, we'll start to give a more formal definition of what is machine learning.
One of the most important things about data analysis doesn't actually have to do with the analysis itself, but it is really important to ask a good question.
So you don't necessarily care about explaining how a variable can predict a given outcome.
We have to eventually figure out what the answer's going to be and move onto the next phase.
So, at around 85 degrees, what we see is that the blue line, our model, the linear model, is kind of biased downwards.
If the population were truly coming from a normal distribution, what would that look like, okay?
When different models tell the same story, for example, if you don't care about the range of $39 to $49, then it's sometimes better to choose the simpler model or the model with fewer parameters.
And you don't want to just forget about them because they could affect how your sales go, and you may think your campaign is having a big effect.
So if you've answered the question and maybe it's matched your expectations, it's often important to try to replicate your findings in an independent data set just so you can be sure that the answer that you've come to is actually the truth.
So these z values are just a linear combination, a weighted linear combination, of the input values x0, x1, x2, x3 that go into a particular neuron.
Or if you divert it onto a different or perhaps more interesting question to recognize that fact that you're looking at something different.
And further more, what you do about it is further separate from what you think about it, how you interpret it, and what the evidence is.
We're tending to predict too high for what the ozone level actually is.
And then you try to pick away at it to see if it falls apart.
And so you're gonna wanna think about each of these primary and secondary models, in the context of the three's factors that we just talked about, which are directionality, magnitude, and uncertainty, kind of roughly in that order.
And you also saw how we can vectorize that computation.
So imagine that the different penguins in this population are wearing different hats.
So depending on who you are and where you work and what you're doing, your audience is going to be a different set of people.
So, how much data is there?
But the ultimate decision and how much evidence is needed to inform that decision will depend on the stakeholders of the analysis of who asked for the analysis, who is being informed by it, and kind of what their cost benefits and what their kind of values are.
So that's somewhat less than the beta that we estimated for the primary model.
In fact it's so common to mistake one type of question for another that we actually have little nicknames for when this type mistake occurs.
In particular, two formats that we'll talk about now are the JavaScript Object Notation, JSON, and the Extensible Markup Language, XML.
But ultimately, for a given data set, it takes a person to put it together, to ask the right question, to assemble the right methods to interpret the results, and to communicate them in a way that people really care about what you found, all right?
At the next level, you might want to get some feedback about results that are puzzling or unexpected as you were analyzing your data set.
It's often easy to kind of conflate all these things into one sentence or to one statement and I think it helps people to provide a useful discussion if you can separate them out.
Communication is fundamental to good data analysis.
If you were expecting a positive association, then you can see whether the result is positive.
And this is usually the predictor of interest that we wanna know, how does it vary?
The ideas of the models will stand in for the population.
The gamma distribution only allows positive values.
Lastly, I think it's important that when you present results from a data analysis, that you separate three things, they are the evidence, the interpretation and the decision. Okay?
So we sometimes call them nuisance parameters, because we have to estimate them, but we don't actually care about their value.
And, if you do this, then you're predicting someone has cancer only when you're more confident and so you end up with a classifier that has higher precision.
So that may be the decision that results from your analysis.
And so, for example, in a financial application like the dataset we just looked at with good and bad credit quality, there may be asymmetric costs associated with mistaking good credit for bad versus bad credit for good.
Often, this is called a convenient sample.
So when we put all these results together from the primary model which just had pm10 in mortality, and then the various secondary models we've shown, you can see that the primary model had a zero association effectively, and then the other three models had a relatively strong positive association with the outcome.
So for example, they give you the mean or the median.
Another intentional assumption that we might want to make is that certain features of the population are linearly related.
Instead of setting the threshold at 0.7, we can set this at 0.9. Now we'll predict y=1 only if we are more than 90% certain that the patient has cancer.
Now it's still quite a strong effect, relative to the standard error that we estimate but it's not estimate, that is not as strong as it was before we entered, we included no2 in the model.
And that be a pretty reasonable way to automatically choose the threshold for your classifier as well.
He was very successful, but it was also very clear that he wasn't happy.
And so you might wanna ask, well are there background trends that for some reason increase sales over a three week period?
Now, let's go back to our example application.
It was a slightly older article, but at the top of this list of the twelve most desirable IT skills was machine learning.
Now, I assume you've already formulated your question, but it's good here to always double check to make sure that it's as sharp as it can be.
So we're going to look at ways that Node-RED allows you to make your sensor data accessible to other systems, but also how to pull in other data into your Node-RED flow so you can then combine it with the sensor data to create your logic.
And instead of saying, better health, I'm saying fewer respiratory tract infections, which is something very specific that we can measure.
We can set expectations.
The second data format that we'll talk about is the Extensible Markup Language, XML.
And so there's no notion of let's say a key predictor and confounder.
So let's take our primary model, which is just gonna be a simple model with the outcome and indicator of the campaign.
And this gives you a sense of how important a variable is in increasing the skill of the algorithm in predicting the outcome.
With your study or your data analysis process.
So here is a picture of what kinda more realistic data might look like from an experiment like this.
So that quick check on directionality is a useful kind of quick check for seeing whether your results kinda match your expectations.
>> That's because that every great idea that you shape our world, there are thousands of seemingly great ideas that have flocked.
Okay, what is the most obvious thing that you would do in the kind of simplest of scenarios, right?
As they say, the best way to learn about a topic is to teach it.
And so a higher fraction of the patients that you predict have cancer will actually turn out to have cancer because making those predictions only if we're pretty confident.
And once everyone knows what the goal is, we could all be kind of on the same page and oriented towards achieving that goal.
And each one of these five steps you can imagine there's a little cycle that you can engage in which involves setting expectations, collecting information, and matching those expectations to the data that you collect.
So you have to ask yourself how does the data match up with this normal distribution, with this model, okay?
Just to remind yourself of the kind of nature of the question, and to make sure that the analysis that you did matched the question that you were asking.
Is that what you expect it to be?
What is machine learning?
So if you have auxiliary data about the population, maybe certain features of the population, then you can quantify the differences between the sample that you collect and the auxiliary data on the population to see how different they are.
And also, you may have to think about the plausibility of the various models that you've chosen.
The markup encodes a description of the document's storage layout and logical structure.
In a few seconds, the video will pause and when it does so, you can use your mouse to select one of these four radio buttons to let me know which of these four you think is the right answer to this question.
But what he did was he had to programmed maybe tens of thousands of games against himself, and by watching what sorts of board positions tended to lead to wins and what sort of board positions tended to lead to losses, the checkers playing program learned over time what are good board positions and what are bad board positions.
And notice that, there, z2 this is a three dimensional vector. We can now vectorize the computation of a(2)1, a(2)2, a(2)3 as follows.
And the cool thing about this, is that the features a1, a2, a3, they themselves are learned as functions of the input.
So you can look at the check to see if there is a problem with the data that you collected.
And then you'll be testing it in various different ways, just to see if it's going to work.
First of all, the development of ozone in the air depends a lot on the amount of sunlight that's available.
So here are the results from that model, and you can see that compared to the previous model the coefficient for pm10 drops a little bit, it goes down a little bit, so the effect weakens a little bit when we include nitrogen dioxide in the model.
Then you might care which end of the range you fall on when you run this ad campaign, because maybe $39 is not worth it, but $49 would be worth it.
But for the most part we just did not know how to write AI programs to do the more interesting things such as web search or photo tagging or email anti-spam.
So in this scenario where you thought it was gonna be $30 and it ended up being $40, well then the next time you might bring an extra $10.
Once your idea meets the world, it will meet customers.
Just pick one that seems reasonable from the data, and provides a, kind of, sensible solution.
Ultimately, when you've completed a data analysis project, you're gonna wanna communicate your findings and your results to an audience.
Okay. So that way you get a sense of kind of how bad things can be or how difficult it might be if you didn't ever use a model.
There may be multiple explanations for why the histogram from the data doesn't look like what we'd expect from a normal distribution.
So that you can kind of reduce the number of variables or the kind of fact features that you might want to look at, before you kind of go ahead, okay?
However, if the treatment for that disease is very painful or there are a lot of bad side effects you may want to be careful about exactly who you send out to treatment.
At the highest level, you may have reached a major milestone, or maybe you've developed a primary model, and you want to get some feedback on what you've done so far.
Now one thing that we can see is that temperature and ozone are in fact increasing together.
For an IoT solution, the value comes from not being able to send data from a device to the cloud, but in what you do with that data.
So once you've got your primary model, this is based on exploratory data analysis and kind of formal modeling, you're gonna have a primary result.
In other words, when you let go of the ball, its weight domainte its motion and it plummets. Well, weight is associated with gravity.
So it might be useful to start at this point, which I think it was to ask what's it like to have no model at all?
And compare it with measurements that are similar, if not exactly the same feature, but things that you would expect to be correlated with what’s there in your data set.
So it looks kind of like your primary model but it has variations to it.
In pop sub communication, there is a broker in the middle on the sender, publishes data to the broker against a topic.
So pretend I'm giving a lecture on how to be healthier and I tell you something like this.
So it's important that the http input and output nodes are used as a pair.
The next thing you need to be able to do is to describe the sampling process.
So what have we done?
And the IBM Internet of Things platform specifies how that topic space is organized.
And from that they found out that there were about 5.2 people in between the person that originally got the letter and the person that finally received the letter.
So I was personally very interested in teaching a class on happiness, but what I wasn't sure of is whether the business school would approve such a course.
These parsers read the XML document as a stream.
So we're often not, we're not often gonna be looking at mechanistic types of questions, but it is an important area to think about.
That's because data analysis is a highly verbal process and requires regular back and forth and discussion to move the process forward.
Every time you use a web search engine like Google or Bing to search the internet, one of the reasons that works so well is because a learning algorithm, one implemented by Google or Microsoft, has learned how to rank web pages.
You need a model for how either elements of the population interact with each other or how they're related to each other.
When I said that I expected the meal to be $30, it was very easy to know when my expectations were not met.
There may be many reasons why a question can't be answered.
We also spent some time talking about special applications or special topics like Recommender Systems and large scale machine learning systems including parallelized and rapid-use systems as well as some special applications like sliding windows object classification for computer vision.
So, what can we learn from Angry Birds?
You'll need to find reliable channels to reach and acquire your customers.
That can help to inform any discussion that may occur afterward in terms of improving your analysis or modifying it.
You may update it later based on any information that you collect, and you may change your primary model to be something different.
So you wanna make sure you've got everything kinda correct there.
The doInBackground method is similar to what we've seen before.
And if you don't find any mistakes the other thing that you might want to do is to try to replicate it in an independent data set.
It tends to be low in the winter in New York City.
Here you'll find details of how the platform uses MQTT and you must follow the directions on the brokening and the use of the topic space.
And concretely, if P=0 or R=0, then this gives you that the F Score = 0.
So this is what normal data looks like, if we see a histogram it kinda looks like this.
So this could all be done without actually looking at any numbers yet.
Now, if you want more details about XML, please take a look at this website.
So it's as if the neural network, instead of being constrained to feed the features x1, x2, x3 to logistic regression.
Every time you use Facebook or Apple's photo typing application and it recognizes your friends' photos, that's also machine learning.
For the first seven years after I joined UT Austin, I was teaching a very standard business school course, Consumer Behavior.
Every time you read your email and your spam filter saves you from having to wade through tons of spam email, that's also a learning algorithm.
And in engineering as well, in all fields of engineering, we have larger and larger, and larger and larger data sets, that we're trying to understand using learning algorithms.
But just as a warning though, this could be a double edged sword.
So, we have to estimate them from the data and often the goal of an analysis Is to estimate given types of parameters.
There are many cases where, for example, if you have data collecting on your organization that you won't have to worry about things that are outside the data set because you may not be worried about things that are outside your organization.
Python programs tend to have minimal templating that you've might have seen in other languages, and have more natural constructs for typical tasks you might need to accomplish.
It's not by any means the only kind of example of gamification.
But by switching to the precision recall metric we've actually lost that.
And I'm going to skip right to the HttpGetTask class.
Well if you haven't understood this, these are all problems that plants have, because actually, the biggest difference between plants and animals, are that plants are sessile organisms.
And so, similarly, cloudy days tend to have lower temperature and less ozone.
Then there's the context.
So the observed pattern from the data seems to suggest that the relationship between ozone and temperature is actually kind of flat up until about the point where you hit 70 degrees, 72, 73 degrees.
So here's the normal curve.
And can weigh the evidence, in terms of how they would react differently or how important that evidence is to whatever process they're involved in.
So we have a dataset that results from a sampling process that draws from a population.
So, this raises another interesting question which is, is there a way to choose this threshold automatically?
And so in general, for most classifiers there is going to be a trade off between precision and recall, and as you vary the value of this threshold that we join here, you can actually plot out some curve that trades off precision and recall.
But it's actually very useful to show the data.
You can see that the accuracy of the algorithm is about 70%.
Now models are very useful, because they can tell us a lot of different things about the population.
Okay, so the gamma distribution is another model and one of its key features is that it only allows for positive values.
So the basic problems that occur when we're making inference, usually result from a violation of the assumptions that we make about the population, the sampling process or the model for the population.
There is no fundamental reason why weight and mass should have any connection to one another, and yet they do. It turns out that, before observation, that every kilogram of mass here near the earth's surface weighs about 9.8 Newtons.
It has to not already been answered.
You know exactly, you brought the right amount of money, and then you can pay for the meal.
When the event is an end event, the endTag method is called.
Well that can be useful just so you know that roughly the mean of your, the distribution in your data set, is corresponds to kind of what you might expect in the population.
We're going to start by looking at the http node.
So why is big data such a big deal now?
The average American diet consists of about 70% processed food.
But it's a very realistic question.
So we'll talk about that in the next lecture.
Then the next 7 days, this is during the campaign, you have an average of about $300.
It's useful to figure out what situation you're in and for any given question, and to always define the population coherently, so that you know what you're making inferences about.
Statistical models can play many different roles in a data analysis.
So if you drag that onto the sheet, and then open up the properties, it's very simple.
Basic details about the packaging of a data set.
How does the outcome vary with the key predictor?
It's the difference between telling someone what to do versus saying this is what happened when I did this.
Appropriately, so that you know whether you're right or wrong in the end, okay?
And of course, you wanna join this with everything that's already known about the question you're asking, either from literature or from other colleagues in your organization.
We're just gonna put all the data that we have available to us and see how well that predicts the outcome mortality.
By taking this course, I'm confident that you will devil up something that I call happy smarts.
So it might be a reasonable thing to include as a potential confounding factor.
This may require a slightly larger audience, maybe a team meeting or a group meeting, where you present your results and get some feedback on what you've done so far.
But eventually you'll need to go out and collect more data on things that you haven't already measured.
But this usually gives you the effect that you want because if either a precision is zero or recall is zero, this gives you a very low F Score, and so to have a high F Score, you kind of need a precision or recall to be one.
You may change what the primary model is later.
You can have neural networks with other types of diagrams as well, and the way that neural networks are connected, that's called the architecture.
So in that sense, it kind of combines precision and recall, but for the F Score to be large, both precision and recall have to be pretty large.
So remember that we have the six types of questions that you can try to ask.
For example, when you're at a narrow level, you might want to say clarify the coding of a variable because maybe the metadata's unclear.
But the important thing is that the model connects the data that we observe to the population that we don't observe.
And you can see at the very top here is temperature, which is kind of ranked as the most important variable.
It's possible that they added up the check wrong, maybe they charged you for something for that you didn't actually eat.
Maybe if a variable is labeled zero or one, you may not know what it means by zero and what is meant by one.
For example, you might need an answer to a very focused and technical question, it's often just a clarification or about the details of a data set or a model.
So this is like saying, we'll tell someone they have cancer only if we think there's a greater than or equal to, 70% chance that they have cancer.
It may be easier in some circumstances to revise the model than to revise the data, especially at the data collection process.
So what we're gonna talk about in future lectures is how this little cycle, this three step cycle, applies to each one of the five stages of data analysis.
The fourth type of question in a data analysis is a predictive question.
So here we see Klaus from Germany wrote in that you know, he'd be hungry and thirsty and he couldn't get food or couldn't find water.
You can write it as a3 or as a(3)1 and that's g of z3.
But before you do that, you gotta figure out how much money to bring, and so you have to figure out well, what's your expectation for the cost of this meal.
In this class, I hope to teach you about various different types of learning algorithms.
Is that good enough for your purposes?
So what Samsung has done here is to build a site using simple elements that they've developed from games.
It has to plausible so that you can kind of explain how things would work if the answer to the question met your expectation.
Finally, it's important to think about the implications of your analysis.
Put it this way, suppose you're a song writer, and you're trying to write a hit song, okay?
And you're not really worried about things that are outside of the dataset yet.
Really, it's about thinking a little bit more clearly about what type of population you're trying to make an inference about.
Machine learning is one of the most exciting recent technologies.
So stating the question is very important from the get go in a presentation.
And the way we recognize that is by seeing that we end up with a very low precision or a very low recall.
So suppose you're going out to dinner and the restaurant you're going to is a cash only place.
And so you have to be careful of what types of methods and what types of approaches you use there.
Which has a quadratic trend for time, so this allows for kinda a little curvature, and allows for kind of rising and falling of the daily total sales.
And so, we may need to resort to some modeling, some formal modeling to see if there's any sort of association here.
There's a huge difference between people that know how to use these machine learning algorithms, versus people that don't know how to use these tools well.
I noticed that he had put on a lot of weight.
It seems like it's having some difficulty finding out, finding a good combination of features that can separate people with good and kind of bad credit risk.
This lecture is about the most important part of a data analysis, and you don't even need a computer to do it.
So, suppose that, when in doubt, we want to predict that y=1.
So, it's an ideal protocol for the Internet of Things.
Once you've set your expectations the next thing you wanna do is collect some data or collect some information that will allow you to compare those expectations to reality.
So you have some uncertainty there.
Because it doesn't provide any summary or any data reduction.
Now, the problem, of course, is that we can't collect data on the population, right?
This method creates and sets a list adapter.
And there are many other kinds of things that you might wanna assume about a population if your question's a little bit more complex.
So hopefully you got that this is the right answer, classifying emails is the task T.
There's going to be a number of columns, you should always check that end.
So now you may want to keep, continue to refine this, think about different models.
But one of the endpoints of exploratory data analysis is to develop this prima facie case to develop this primary model.
And then finally, pm10 and then dow, which is the day of the week.
So, thank you for that.
Concretely, the function mapping from layer 1 to layer 2, that is determined by some other set of parameters, theta 1.
If you did not follow this correctly, you'll find that you'll be unable to connect to your organization, using the MQTT client.
So the three different models represent a range of going from roughly $39 to $49.
Okay, when you analyze data, we have a lot of tools at our disposal.
Obviously this last one didn't really fit perfectly, so you might wanna either refine your model or you might want to do another survey to get more data, to get a better sense and so you kind of think about where you go from here. The point of this whole exercise is that you get a little sketch of where you're gonna go and kinda what your solution's gonna be.
So exploratory data analysis can have two broad goals.
And I'm thrilled that so many of you have signed up for this online course of the emerging field of gamification.
And then the one week after the ad campaign just to see how the sales numbers change while you're running the ads.
In terms of how do you know whether you've made a good inference or not.
So, hopefully from this video you've gotten a sense of how the feed forward propagation step in a neural network works where you start from the activations of the input layer and forward propagate that to the first hidden layer, then the second hidden layer, and then finally the output layer.
If the data suggests that this question is true, that more fruits and vegetables leads to fewer respiratory tract infections, then we already know how we might intervene in terms of modifying people's diets and improving their health.
I think that the biggest global health challenge that we have is to take health not as a service but as a right.
We just simulated it on the computer.
One of the things we talked about earlier is the importance of a single real number evaluation metric.
And so, you have to ask yourself then why do you have that mismatch.
That's the directionality, the magnitude, and the uncertainty.
Always try to diagnose your model as carefully as possible or use a very flexible method to kind of model your population.
And in the debug tab, I get an error saying that the response could not find the request.
The first way it's useful is for setting expectations about your data, okay?
Now, the fundamental feature of a descriptive question is that you're often looking to summarize a characteristic of a dataset, so often this involves taking the average or taking the proportion of some feature in your data.
So there may be multiple problems.
It may or may not be a good thing.
But there are some deviations from what we might expect from what we saw in the previous picture.
Samuel's claim to fame was that back in the 1950, he wrote a checkers playing program and the amazing thing about this checkers playing program was that Arthur Samuel himself wasn't a very good checkers player.
And very often, tables will only give you the summary.
To me, it seemed that the ultimate purpose of education is to give students the tools and the skill sets required to lead a happy and fulfilling life, and of course to also help other people do the same.
So that evening in Mumbai, I turned around to my MBA students from UT Austin.
But fundamentally, there are only three parts to it.
So obviously how you do this will depend on exactly what your problem is, what your question is and the data that you have at hand.
Two common ones that I wanna talk about are for data that vary over time or data that vary over space.
Another good example is when we think we're doing inferential, we're asking an inferential type of question.
Now one thing we know about pollution and mortality, just from the pictures that we just saw of the data is that they're highly seasonal.
So, as I mentioned, when you publish or send data to the broker, you're just going to send the data against the topic.
Our children are in trouble, because we've outsourced the job of feeding them.
And in this class, you learn about the state of the art and also gain practice implementing and deploying these algorithms yourself.
So even if your model for the population is correct and even if you have a sense of what your target population is if your sampling process is not well specified or well characterized or if you don't understand it properly, then the data that you collect and the inferences that you make will only apply to some other population that you haven't really well characterized yet.
And so, formal modeling is the process of kind of specifying your model very precisely so that you know what you're trying to estimate, and you know how you can challenge your findings in a rigorous framework.
And then we're going to talk a little bit about a unique idea called evidence-based data analysis, doing data analysis based on what's best practices in the field right now.
So the first characteristic is that it has to be of interest to your audience.
So we going to look at some of the nodes that can be useful in helping you implement the back-end solution.
Because in particularly you can see on the left-hand side there that there are negative values, okay?
But that's our estimate for the population level feature.
If you can't describe that process, then it is going to be very hard to understand how valid your inferences are from the data.
Cuz they're penguins after all, and we can't take care of all of them, so we need to draw a sample from this population.
So what's wrong with this picture?
Another mistake that's commonly made is when we think were trying to do prediction but we build our prediction models improperly and so it ends up being more of an exploratory type of analysis.
So, the data might not contain the answer.
So to make inferences from data, you need three simple ingredients.
So really what you need to know is kinda what are the questions that you need to be asking when you're involved in a data analysis process?
Did we match our expectations, did they not match, why or why not.
So, it's important, upfront, that you not assume that everyone is on the same page, that everyone has the same background, but to state the question clearly and succinctly so that everyone knows what the goal is.
So it seems like season is clearly related to both mortality and pollution.
When actually there's something just going on in the background that's moving your sales numbers.
So the idea, so one of the end products of exploratory data analysis is to really get a sense of whether the solution is out, whether the solution exists, and what it might look at.
At the end of six weeks you'll be able to understand and use simple Chinese phrases, meet basic needs for communication and possess the ability to further your Chinese language stuties.
We just did not know how to write a computer program to make this helicopter fly by itself.
So for example, you might ask how many people have visited this website in the last 24 hours?
Like I mentioned sometime back, people don't normally associate MBA's and business schools with the pursuit of happiness.
You know, that would really be a problem or you could go to the bathroom, but you're stuck in place.
So, for this example, I'm gonna use a dataset on the creditworthiness of a group of individuals.
There are a lot of tools out there that we have that have been developed by a lot of different people in this area.
So you can see that from this prediction model, particularly matter actually is second from the bottom in terms of improving the prediction skill of the algorithm.
So then we have our dataset of just three penguins.
And I hope that you find ways to use machine learning not only to make life better but maybe someday to use it to make many other people's life better as well.
This ball's gravitational force, due to the earth, is known as its weight.
So, a question that often comes up when you talk about data science is, what about big data?
This method identifies if this data element is one that needs to be saved and, if so, it records that by setting certain variables.
Finally learning algorithms are being used today to understand human learning and to understand the brain.
It pulls this ball downward with a gravitational force, it pulls me down with a gravitational force, it pulls you downward with a gravitational force, and we give a name to these individual gravitational forces.
And how is the uncertainty about your estimate preserved or not?
But we won't talk about that now.
And you can see that in that grey area that the outcome will take values 0 and 1 depending on the value predictor.
Next, the code iterates over the earthquake's list.
Does it look like the right numbers are in the right columns and the right numbers are in the right rows?
The main two types are what we call supervised learning and unsupervised learning.
Now before we actually get to the data, one of the things I just want to do is to show you, what would data look like if it came from a normal distribution, okay?
Because often the population is too complex to think about all at once.
And the basic question that we're interested in asking here is how our ambient temperature levels related to ambient ozone levels in New York City, okay?
And so we want to know if we can, if we change one measurement on one hand, can we, does it always result in a specific outcome on a different measurement?
Now there are other potential factors that we might wanna consider in terms of things that might both be related to mortality and to air pollution.
Another possibility is that there's something wrong with the data, for example.
Now, but eventually we'll look at the data.
So one of the things we can do is let's try the gamma distribution.
Because all of the patients that you're going to and saying, we think you have cancer, although those patients are now ones that you're pretty confident actually have cancer.
And so you might want to know, well if the population is roughly 50, 50 male, female, but your dataset is say, 70, 30 male, female, then you can characterize that difference between the data you collected and the population your trying to target.
The first set of nodes we're going to look at are the input nodes.
Next, the code extracts the value associated with the earthquake's key.
In the last level, you may not have any specific question whatsoever but you want to gather information about the overall process and get some first impressions and feedback about what you're doing, so you can refine your data analysis even further.
Sometimes those things can be shifted by one or two.
So instead of the primary model where we just had the key predictor and the outcome, we fit the following model.
And of course, if your goal is to automatically set that threshold to decide what's really y=1 and y=0, one pretty reasonable way to do that would also be to try a range of different values of thresholds. So you try a range of values of thresholds and evaluate these different thresholds on, say, your cross-validation set and then to pick whatever value of threshold gives you the highest F Score on your crossvalidation .
You can tell by the size of the standard error, which is much bigger than the estimate, that there's a lot of variability around this estimate and so it's effectively zero.
And so, it's basically a little cartoon of how the world works.
Now we use the data to draw the picture, because we use the data to calculate the mean and the standard deviation.
But sometimes when I look at what their doing, I say, I could have told them like, gee, I could have told you six months ago that you should be taking a learning algorithm and applying it in like the slightly modified way and your chance of success will have been much higher.
There may be one, or two, or even three key predictors that you're primarily interested in.
Big or small data that you need, regardless of the size of the data, you need the right data.
But it's not very complicated, or at least not overly complicated, to try to capture this trend.
And DOM parsers read the entire XML document and convert it into a Document Object Model structure.
They're also gonna keep in mind the way in which the ultimate results can be communicated to a wider audience and think about what's the best way to do that.
Because now, in order to answer this question, you may need to search the literature, or talk to people inside or outside your organization who are familiar with this area and to see if your findings fit with what they expect to see.
But often the simplest thing can be very revealing.
Now, just a couple things about how ozone works.
And so nitrogen dioxide tends to be correlated with particles.
So the sampling process is the manner in which the data come to you, and this gives you the dataset.
Well, we can learn that there's something really popular there.
Now the expectations for your interpretation will be based on things like exploratory data analysis and formal modelling.
That is a typical setup of a prediction problem.
The context is important because it really takes into account the totality of the evidence that you've developed through both primary and secondary models.
And you can use the data to help you see if that fits well.
If you're working at a start up your audience might be your boss or the company leadership or investors or maybe even potential customers.
One, two, three, four.
The situation is Samsung wants you to spend time and do stuff on their website, so that you will eventually buy more products.
Now, often with predictive types of questions, they lead you to solutions that don’t necessarily tell you how things work or explain the mechanism of what’s going on inside any given system.
So there are a couple of things to think about when you see the results of a prediction model like this one.
The fourth step is interpreting the results, and the fifth step is communicating those results to the appropriate audience.
Do you need to get other data, right?
And then X is just an indicator of whether a given day fell during the ad campaign or not.
Now, with what I've written so far I've now gotten myself the values for a1, a2, a3, and really I should put the superscripts there as well.
We've already seen the inject node is very useful when creating, testing and debugging a flow.
Don't worry, if you already have Python down and you want to be challenged, we have some advanced Python in here as well.
So there are a number of ways to evaluate your formal modeling and your examination of primary and secondary models.
The key method in this class is the handle response method.
And also time to talk about when you would use each of them.
So under this model, the ad campaign added an extra $49.10 per day to our sales.
And so, if there is no change in what you might think or what you might do based on the collection of the data and matching it with your expectations, then that's often a sign that either the evidence from your experiment is not very strong or the data analysis was not able to generate enough evidence, or there may be some other problem.
It turns out one of the reasons it's so inexpensive today to route a piece of mail across the countries, in the US and internationally, is that when you write an envelope like this, it turns out there's a learning algorithm that has learned how to read your handwriting so that it can automatically route this envelope on its way, and so it costs us a few cents to send this thing thousands of miles.
Now with a predictive question you wanna know whether you can take a set of features and use them to predict another feature on a given person or on a given unit of analysis, right.
In the next video, I'm going to define what is supervised learning and after that what is unsupervised learning.
The next year, I was nominated for the professor of the year award.
What did we learn, and what would you do differently next time?
Maybe you know, well the most expensive restaurant in this city costs this many dollars.
This inner term there is z3.
And this classifier may give us some value for precision and some value for recall.
So it's helpful often to discuss your question with experts in your area, or maybe experts in your organization who could help you to shape your question and to ask one that actually hasn't been answered before.
So now that you've checked the packaging, you've checked your ends, you looked at the top and the bottom of the data set.
But there's more to it than that.
And so this lecture is really about using statistical models to help you to summarize your data, and to eventually kind of make things to things like make inference, okay.
For the checkers playing examples, the experience E would be the experience of having the program play tens of thousands of games itself.
There's all kinds of knowledge that we've accumulated over the years about what makes a great song.
The reason is because knowing what type of question you're asking is really important to interpreting the results of a data analysis.
And so, no2 and pm10 might be sharing a lot of the same effect on mortality and it may be difficult to completely disentangle them.
How is the magnitude of the effect or the association preserved, or not.
For a week when I went to Starbucks, I stopped buying any specialty drink and I only drank Americanos.
And then finally, given what you see from the data, and given kinda what you make of everything that’s going on in there, do you have the right question?
Well this is only one example of that. So, there was an experiment run by Stanley Milgram where he took 296 individuals and he, what he tried to do was basically send them a letter and ask them to send a letter from someone they knew and so forth until they went a specific address.
And usually, every new answer that you get will raise more questions.
So directionality really pertains, I think, to inferential and causal types of questions where you try to estimate some association and the association can either be positive, negative, or zero.
So this is the kind of analysis that we're interested in.
So the algorithm just basically classifies everyone as good credit quality.
And so, keeping track of what these purposes are and why you're analyzing data will keep you focused on how much you can iterate your data analysis to get the appropriate answer.
And so, a large fraction of those patients will turn out to have cancer. And so this would be a higher precision classifier will have lower recall because we want to correctly detect that those patients have cancer.
If there are gonna be 60 features, there should be 60 columns, for example, in a table.
Another feature of prediction analyses is that, usually the model that you use cannot be written down in a convenient mathematical notation.
Now in particular, if your data match your expectations it may simply be an indication that you have a question that's not very sharp.
Welcome back, you're halfway to the end of this course.
It's a little hard to answer that question without knowing kind of what the context is and what your situation is.
And it doesn't seem plausible that people would be willing to pay negative dollars for this book.
But, assuming you can do that, the thing that I like to do is what I call check the packaging.
This lecture's gonna provide an overview of the cycle of data analysis.
And now, but the important thing is that we have a different model, and so a different model is gonna yield different predictions.
In the second module we're going to dig into the pandas Toolkit.
And then you have some independent random error that we call epsilon.
So big data is obviously different to different people.
It's important that data arrives at the response node because it's that data that joins the request and the response.
Okay, so here's what the data actually looked like, okay?
Let's talk about each of these, one at a time.
So because if your former modelling ends up answering a different question and perhaps the modeling approach doesn't exactly match what the question was, then you might have a problem with interpretation later on.
However, in many other situations, causal questions can really only be answered indirectly using observational data, so in situations where we can't control what the settings are or what the experiment or design a specific experiment to kind of to collect data, on directly on this question.
This is another one that might seem a little obvious.
So I'm gonna break down each one of these pieces to give you a little bit of a description of what they are, and I'll give you a little example of kinda how they might be applied in the real world.
And for each element for that list, it gets the data associated with a single earthquake, And this data is stored in maps.
And so, there's a lot of questions that simply have to go unanswered.
Python is a very general programming language with a lot of built-in libraries and excels at manipulating data, network programming, and databases.
It may have an important association with mortality.
So the question that keeps coming up is what about big data?
We have a lot of music theory that tells us what notes sound good together, what notes don't sound good together, what chords you should use, things like that.
For example, suppose you are looking at time series data.
I noticed that there was an even lower correlation between career success and what you might call life success.
And to me, the easiest way to look at the data to determine if there are any problems is to make a plot.
And then from there, if you wanna continue, you might go onto something more like formal modeling which we'll talk about later.
So that's okay, and so you don't have to worry too much about kind of what your primary model is going to be from the get-go.
And so if you can replicate your findings, and then regardless of whether they matched your original expectations, you can have a little bit more confidence in the fact that whatever you found could actually be the truth.
And so that's another thing that you might wanna check for.
And finally, most prediction analysis often are what you might call classification problems, so the outcome is really something that takes two different values and you're trying to predict one of those two different values.
So we're expecting this kind of positive correlation between temperature and ozone in New York City.
We have statistical methods, we have machine learning methods, we have all kinds of software packages that we can apply.
By the way, in a network like this, layer one, this is called an input layer. Layer four is still our output layer, and this network has two hidden layers.
For example, there's gonna be a total number of observations, or your sample size.
There's a companion node in the output section which is the response.
The force that's causing that acceleration is the ball's weight.
But presumably you've done this already.
The next set of analyses that you might do is a prediction analysis. So a prediction analysis differs from an associational type of analysis, because the goal here is to really to use all available information to predict an outcome, okay?
So it's the threshold that says, do we need to be at least 70% confident or 90% confident, or whatever before we predict y=1.
Okay, so now I'll press the Load Data button and there you can see the requested data summarized and presented in a list view.
He had dark circles under his eyes and bags under his eyes.
My name is Christina and I'm from the United States.
So one last thing I want to mention is about the availability of other data.
The http input node, however, does not send any data back to the requester.
And you thought you had the right data set, but when looking around in the data set and exploring what's in there, you realize that maybe the question is better asked a different way.
So thought it was $30 and it ended up being $40.
So, we've also spent a lot of time developing exercises for you to implement each of these algorithms and see how they work fot yourself.
And so, 64 of these such chains came back, so 64 out of 296.
I really loved teaching that course and I was very happy doing it.
It's not going to be a perfect model, but you hope it's a reasonable approximation.
Then you could compare the total sales for the three weeks during, the three weeks before and the three weeks after.
If your question was originally, how much are people willing to pay for this product, you have a better sense now in terms of what the shape of that distribution might look like.
So given that you have a primary model, you want to develop a series of what I call secondary models to test and challenge your solution.
And you wanna know how your results, along with kind of the evidence that you've generated, are consistent or not with existing results.
So I know it's not gonna cost this more than that, so I'll just bring that to kind of serve as an upper bound on how much money I might end up spending at this restaurant.
So that you know that the data you got are at least kinda within the realm of reality.
What would we do differently next time.
This can be kind of addressed to a thorough exploratory data analysis or by using things, for example, things like robust methods for inference when you're fitting statistical models.
And then if we were to fit that model, this gives us a beta of $49.10 per day.
That's really intended for machine processing.
He defined machine learning as the field of study that gives computers the ability to learn without being explicitly learned.
This lecture talks about when to stop the process and move on to the next phase of data analysis.
Now here the do in background method is similar to what we've seen before but this time it uses the JSON Response Handler class to process the response.
The advantages of the coupling, the sender and receiver means that a message can be sent to multiple recipients without the sender needing to know who those recipients are.
I actually think he came out with this definition just to make it rhyme.
You can think of it like a tree or directory structure on a computer file system.
Is very expensive, okay?
Well, it depends on what your purposes are.
And so this suggests that maybe the relationship is non linear.
And so what the result of this is that you'll have a what's called selection bias, so the people in your study will represent a selection of your population, but not really the entire population.
So, just to summarize, when it comes to interpreting results, you wanna always revisit the question, make sure you know what question you're asking and what type of question it is, so that you can figure out whether the analysis matched the question.
So we talked about things like bias and variance, and how regularization can help with some variance problems. And we also spent a little bit of time talking about this question of how to decide what to work on next.
And so the way that we can check that is by using a non-linear smoother.
Now, specificity is useful because it often leads to simpler data collection and simpler kind of experiments and data analysis.
It's because this is the range of the predictor where the outcome could actually take both values.
And so what ended up happening is, is that people nowadays can collect much more data than they could before and much more cheaply.
Now the bottom line is, that when we can, we should choose foods that are less processed for our children.
Because then you can send them into treatment, and then kinda send them down the road to recovery.
So ultimately, it's not clear that this prediction algorithm is particularly good.
So we'll talk a little bit more specifically about what this means when we talk about our examples okay?
So we can fit a secondary model to the data, which includes pm10 as our key predictor, and then maybe we'll include the season of the year as a potential confounding factor, so the season will just be, you know, there'll be four seasons, and we'll have a categorical value with a category for each season.
And so you might wanna ask yourself is that a big range?
And, in a few short weeks, that number is going to be several orders of magnitude higher, thanks to Coursera.
I can create that in my head in like a second and then I listen to you." What does P2P mean?
So, if our application gets XML data from the internet, it will need to parse the XML document so it can create the list you display that we saw earlier.
Conversely, if you have a classifier that predicts y equals zero, almost all the time, that is that it predicts y=1 very sparingly, this corresponds to setting a very high threshold using the notation of the previous y.
Another thing that surprises some people is the amount of sugar in a typical sports drink.
This looks a lot like logistic regression where what we're doing is we're using that note, that's just the logistic regression unit and we're using that to make a prediction h of x.
And the data that we're gonna use is ozone and temperature data from New York City for the year 1999.
You can't do an exploratory data analysis without reading in the data.
And so really the goal is not to get the correct model, but rather to get a model that's a reasonable approximation for what the population looks like.
XML documents contain markup and content.
Now, I have to confess, that I had a vested interest in teaching this kind of a class.
The next question that you might want to ask yourself is do you have enough data to make a decision?
So, keeping in mind the purpose of a data analysis is very important.
You know, how would we deal with changes in temperature if we couldn't move?
Let me tell you how to do that and also show you some even more effective ways to use precision and recall as an evaluation metric for learning algorithms.
If you originally thought it was gonna be between 0 and $1,000 then the cost ended up being $40, it's not clear that you would change anything about your behaviour based on this data.
And so, now the model for the population describes how the features of the units in the population are related to each other, okay?
So the goal of inference is to be able to make statements about things we can't really observe directly.
I remember this as if it happened just yesterday.
But one thing you'll notice is that the prediction scores were all kind of on the high end, they're all basically greater than a half, and so there isn't a lot of range there.
So that's the kind of summary of what formal modeling is used for and in what context is may be used.
In the five years that I've taught this class, I have had over 1,000 students from all around the world.
Cuz it could be that, maybe, you thought you had the right question.
I'll tell you a little bit about that too in this class.
The take home message is that it's now possible to collect much more data much more cheaply than it was before and to analyze it.
And you have to kind of get other data that will have better predictive power.
This can happen either in the exploratory data analysis process or in the formal modeling phase.
We will look at how individuals, entrepreneurs, and businesses can leverage the ecosystem, to turn ideas into objects.
For example, if you have a number of subjects, you wanna count the number of subjects or units in your analysis.
So in the next series of lectures, I'll start to unpack what exactly gamification means which will then allow us to start to understand how to do it effectively, and what are some of the challenges in applying these techniques.
You'll also need to insure that things like the device ID follow the specified format.
They will apply to some other population that you may not be able to accurately define.
If you would like to communicate with 1.4 billion people, if you're curious about the fascinating culture, please join us in Chinese for HSK.
The other kind of phenomenon that often, where the population can be difficult to define, is when you're looking at natural phenomenons.
The third feature of a good question is that it's plausible.
Another way to ask this question is really, do the results make any sense?
This is something that I feel pretty strongly about. And exactly something that I don't know if any other university teachers.
The question itself has to be answerable in the real world given all the constraints that you might have on you.
And after that, the code gets the first parser event and then begins to iterate over the XML document.
In the next six weeks, I'm going to teach you about what gamification means, and how you can apply it to solve real world problems.
You don't have to look at any data to figure that out.
And we'll begin to talk about the main types of machine learning problems and algorithms.
It's in here somewhere, because this is the data.
Now on occasion, a predictive question can lead you to an explanation about what's going on, but the key point here is that it's not the ultimate goal.
So there's two basic types of situations In which we often use formal modeling.
Okay. So that's kind of what our basic first cut analysis tells us.
Recognizing whether you are asking an inferential question or a prediction question is really important.
Just a few quick examples.
And, again, I'm going to skip right to the HTTPGetTask class.
In the last video, we talked about precision and recall as an evaluation metric for classification problems with skewed constants.
And so if either precision is 0 or recall is equal to 0, the F Score will be equal to 0.
And before it was published, on the website, you could ask people to put their names, their email addresses and ask them how much they'd be willing to pay for this book before it goes on sale.
This lecture gives an example of an associational analysis and how to use formal modeling to challenge your findings.
One is that your expectations were wrong.
And so a couple things you'll notice here, first of all you'll notice it doesn't quite look like that picture where I simulated the data.
Is it bigger than a bread box?
There's a couple of different candidates that you can think about.
And because they're liquids it's easy for our children to overdo it.
As long as you make something back, then the ad campaign's worth it.
Maybe you know, well in this city, the typical meal costs this many dollars, and so I'll just bring that much money, cuz this is an average kind of restaurant.
There is usually a very small number, or even a single key predictor that we're interested in and it's relationship to the outcome.
Because data analysis is an iterative process, interpretation is actually something that happens continuously throughout the whole process.
We also talked about the F Score, which takes precision and recall, and again, gives you a single real number evaluation metric.
And the basic activity you're gonna engage in is eating a meal, and you're gonna check for the bill, and you're gonna have to pay, money to pay for the meal.
Not that there is no role for tables in data analysis, but a plot has a unique ability, in my opinion, to show you both what to expect and what not to expect in the sense of what the deviations are from that expectation.
And then your prediction would be right about 70% of the time.
And so to this extent, I find plots are better than tables, because plots show people, you know, a summary of the data, but they also show people deviations from what might be expected, and so plots are very useful for kind of producing discussion and kind of encouraging people to think about the data.
Whereas in unsupervised learning, we're going to let it learn by itself.
In fact, this definition defines a task T performance measure P and some experience E.
Or you might Google the restaurant and maybe look up the menu to see what the meal typically costs there.
So that's saying, predict y=1 only if we're more than 99% confident, at least 99% probability this one.
Whether you will take the exam or not, please follow us to discover a new language and a new world.
So this example looks at fitting linear models to some data.
If you were told that certain variables were gonna be included in the data set, just check to see that they are in fact included in the data set, right?
And so if the secondary models are largely consistent with your primary model and exactly what consistent means depends on the context and your application and the question that you are asking. And then that's great and you can either move on to the next phase or maybe you're finished and you can just kind of record your results.
The best option, is almost always the kind of food that we prepare, and serve to our own families in our homes.
One: Maps, which are essentially sets of key and value pairs, And two: Ordered lists.
And let's say we're trained in logistic regression classifier which outputs probability between 0 and 1.
But it's the kinda focal point, your initial focal point for your analysis, and then you'll try all kinds of other analyses.
And you can't get at it yet, and so you kinda shake it around, maybe measure it.
It's important that the communication works between team members, between managers and team members and that it's a continuous process to get feedback and improve the quality of data analysis.
So here I'm gonna use a random force algorithm to make predictions of mortality in New York City.
Another possibility, for example, is that you could've said well the meal being between 0 and $1,000.
This got me wondering about what the purpose of education is.
So if you truly have bad credit the algorithm will have a difficult time picking that up.
But the question is, is how much of that data is useful for answering the question that you're sort of involved, you're involved in.
That's what tells us how our outcome changes along with our key predictor.
That's one assumption that we'll make as part of our model.
Every time you go to Amazon or Netflix or iTunes Genius, and it recommends the movies or products and music to you, that's a learning algorithm.
Okay. So we can revise our expectations for what the relationship should look like to be this kind of non-linear relationship, rather than our original expectation of a linear relationship.
So remember we wanna set expectations before we start looking at the data.
And because of this, all of plant biology is quite complex to allow them to change their own development, their growth, in order to survive stuck in one place.
So making a plot is useful in two ways.
Now this kind of parser requires more memory, but does allow the application to do things like multipass processing of the document.
We can draw a fake picture and then we can compare our expectations to the data.
And so basically is close to zero.
So every aspect of your data set is gonna have some kind of count or number associated with it.
So, for example, let me show you a service called Samsung Nation.
So, I'm kinda talk about each one of these things and kinda what happens when there is some sort of violation.
Often with natural processes, it's useful to think of an unobserved stochastic process that lies in the background and drops these events on the ground.
And so, when the data actually comes in and you see the check is $40 then it actually matches your expectation which is that it's between 0 and $1,000.
Again, temperature and dew point are strong factors that are related to both mortality and pollution.
And then there was this other guy, who had practically dropped out of the rat race.
So make sure that you can accurately characterize your population.
This course is about managing the data analysis process.
And so any kind of positive findings that you find that match your expectations may require some follow up before it ultimately decision is made.
And so you should be careful about how you set them and how you change them around.
So I think there is a vast, unfulfilled demand for this skill set, and this is a great time to be learning about machine learning, and I hope to teach you a lot about machine learning in this class.
However, even with no2 in the model we still see a reasonably strong association between pm10 and mortality in New York City.
And it's important to be aware of these budgets, even if you don't directly control them, because they will help you to manage the data analysis process.
So once you're here, you may need to slightly revise a few things with your data and with your question.
And so the question is well how do we know if 0.3 or 0.33 is a good estimate?
So, now if the analysis is successful and you've answered the question that you've set out to ask.
So how can we get a single real number evaluation metric?
So we can't do experiments on humans so we'll use, we sometimes people do experiments on mice to use, to kind of give us sense of what might happen in a human being.
So the importance of using models, different types of models is that they tell you very different things about the population, and they result in very different predictions.
Data analysis is a complex process that can involve many pieces and many different tools.
But about point 6 they are mostly in the good credit qualities category.
Or it maybe from some analytics engine where you want to compare the current data with a previously created model.
How come a business school professor is teaching a course on happiness, of all things?
So, one way to solve the big data problem is to just sort of wait until the hardware catches up with the size of the data.
So there's all kinds of just ends that you can check within your data set and kind of around your data set to make sure that everything is kind of in structured in place, okay?
So one of the things that we could do is fit a complex prediction algorithm, all right?
Is that, all models are wrong, but some are useful.
And lastly, we sometimes think that we're doing an inferential type of analysis, but it's really more descriptive, particularly if you have very small sample sizes.
But if my expectation was very diffused and not sharp at all, like between 0 and 1,000, then, collecting the data doesn't really help you.
So just doing a little check to see that your data matches with something that's kind of independent and outside your data set.
In order to succeed, a great business idea needs a great business model, and they're not the same thing.
And any good prediction algorithm will tell you which predictors or which variables are more or less important for predicting the outcome.
Then maybe you don't care whether it's $39 or $49.
All right, so I think a plot is very important to make.
The first type of question you can ask is a descriptive question.
The three basic steps are setting expectations, collecting information, and then revising those expectations based on what you see.
Welcome to this free online class on machine learning.
And then, another goal of exploratory data analysis, assuming you pass that first part, is to think about how can you develop a sketch of kinda what the answer to your question might be.
For example, in many medical applications where the outcome is the presence of a disease you may want that test or that algorithm to have a very high sensitivity.
Maybe, you could argue it's a little bit better, you've got a little hump wherever that spike at ten is.
It turns out that in supervised learning, the idea is we're going to teach the computer how to do something.
So I guess that makes sense because most of the individuals in this data set have good credit quality.
So once you've set your expectations, you can figure out how much money to bring.
Apple's been around for many years.
But this is kind of an indirect relationship.
However, what's really great about a business school too, is that we believe in a free market economy.
If I then deploy that, and then launch a browser on the URL, and the URL is the same URL for Node-RED. But where I have red for the Node-RED editor, I simply replace it with a property I entered in the input node.
So you'll notice that in this picture there does appear to be an increase in sales during the campaign period.
And finally, to compute the actual value output of our hypotheses, we then simply need to compute z3.
So you can test the sensitivity of your assumptions, of your expectations to various features.
And the three things that you wanna think about are the directionality of the result, the magnitude, and the uncertainty.
They're literally rooted in place, just like you would be if you were cemented.
The form that a routine communication takes depends on the goal or what kind of information needs to be collected.
Now for most prediction analysis this really isn't a distinction between the key predictor and a bunch of other potential confounders.
So for example, if your expectation was the meal would cost $30, and then it actually cost $40, you know immediately that your expectations were not right.
And even though this is one of the last steps of any data analysis, you may need to go back to the exploratory data analysis or the formal modeling stages if the things that you've discover in your interpretation challenge what you found previously.
And we use them to help tell, to kind of help us describe the population that we're talking about.
So the, these investigators took an instant messaging network and they looked at 30 billion conversations between 240 million people.
Try to design your figures and tables so that people with a broad background will be able to understand what's going on in those figures and tables with just a little bit of explanation.
So the first thing we're going to do is to take that scatter plot and just fit a simple linear aggression to the data in that scatter plot.
So, days with lots of sunlight have higher temperatures than otherwise similar days with lower temperatures.
But here's a plot that you might make.
Of course, in reality, you will never see data like this.
Now remember that an, a Newton, 1 Newton is the SI unit of force, and it's about the force that a small apple, conveniently enough given that it's Newton a small apple exerts on your hand when you hold the apple there steady.
When temperature is higher, ozone does tend to be higher.
Everything from the sort of a raw data to the exploratory figures to sort the final analysis that you'll be performing.
We're going to ask you to think about not what's wrong with you and how to fix it, but what's right with you.
Using this observation we're going to be able to vectorize this computation of the neural network.
And you wanna combine this evidence that you've generated with existing knowledge.
Now, these smoothers like low S are very useful for capturing these kinds of null in your trends but they usually don't tell you much about how things work underneath.
And then there may be all these other variables.
The first is, this data came from a single year's worth of data.
In fact, in most data analyses, you should probably do more thinking than doing of the data analysis.
The last thing that I'll talk about when it comes to inference is actually not about inference at all, it's actually a special case of using the data as a population.
And they found that the average path length was actually 6.6, so they sort of upgraded the six points, the six degrees of separation to seven degrees of separation.
This method identifies which tag is currently being parsed and then saves the content for later use, and as before, after the do in background method finishes.
While the secondary models are used to kind of adjust for different factors in different ways.
In 1974, Donald Knuth, a computer scientist at Stanford, wrote an essay where he talked about the difference between an art and a science.
But ultimately the song writer has to inject this creative spark that puts it all together and makes a piece of music that people actually want to listen to.
For example, before we said that 11% of people would be willing to pay more than $30.
Where you have Y, is the outcome, that would be the daily sales.
I reckon he greatest challenge to global health will be obesity.
If the food has a long list of ingredients, the food is most likely highly processed.
And when you look at the kind of totality of the evidence, you wanna look separately at the direction, magnitude, uncertainty of all the different models that you fit.
So this is the context that we're talking about.
MQTT is a pub-sub system.
If you can't do that you may need to refine your question a little bit.
We could have a fourth order polynomial to model the trend, just in case there might be something more complicated going on in the background.
For example, if you're studying people in an observational study or in an experimental study, you just kinda take whatever people come your way and enroll them in the study.
And we'll talk more about this in a second.
And I was told that if a sufficient number of students enrolled in the class, then I could teach it.
So that's kind of hallmark of prediction analysis.
And I'm going to share with you the science of how to communicate in a way that strengthens the bonds that you have with the people you care about.
The basic purpose of a routine communication is to gather information, you gathered communication by communicating your results and by taking in the responses from your audience.
So the goal of most prediction algorithms is essentially to minimize the size of that grey area.
So for example in the example about the ozone and temperature relationship, one of the things that we discovered was that the relationship appeared to be nonlinear.
The first is more basic where it's you determine whether you have the right data to answer your question.
Think about your primary result, and for that result, you wanna consider the direction, the magnitude, and the uncertainty.
One way to do this would be to modify the algorithm, so that instead of setting this threshold at 0.5, we might instead say that we will predict that y is equal to 1 only if h(x) is greater or equal to 0.7.
You can maybe collect auxiliary data about the population characteristics.
And you can see now.
Now, when we talk about formal models, very often we'll talk about parameters, and parameters play an important role in statistical modelling.
The next useful thing that I like to do with data sets is to try to validate it with at least one external data source.
And so every time you draw a sample, it'll be slightly different and so anything you estimate about the population will be slightly different every time you sample data.
For many applications, we'll want to somehow control the trade-off between precision and recall.
And try to do some secondary analyses around it to see if your initial solution holds up.
So that's a much more specific question, because I'm, instead of saying good diet, I'm saying five servings of fruits and vegetables per day.
So, that's it. Those were the topics of this class and if you worked all the way through this course you should now consider yourself an expert in machine learning.
At this phase it's important that you focus on continuous measures of uncertainty and not kind of binary measures of yes, no's or statistical significance or not.
We know that it's important for children to stay hydrated, especially when they're active or when they're spending time in hot weather.
Android provides several different types of XML parsers.
As we go through the rest of this course, we'll be using sensor data from the Raspberry Pi and the Sense HAT, and that will arrive within your Node-RED flow using the IBM Watson IoT service running on Bluemix.
So, and furthermore it doesn't matter kind of if the variables are related to the outcome in some sort of causal or mechanistic way, but if they carry any information at all about the outcome, they may be useful in a prediction setting, and you might want to use them.
And part of exploratory analysis is to determine do you need to get more data, do you need to change datasets, or do you need to change your question or refine your question.
Remember mortality was high in the winter and low in the summer, and pollution was high in the summer and low in the winter.
So the data analysis process for the most part involves five different steps, and these steps are basically stating and refining your question, and exploring the data, and more importantly, determining if the data that you have is appropriate for answering the question that you asked in the first part.
The base of these is called the SciPy Ecosystem, and it even has its own conference series.
This is a basic sketch of kind of routine communication as you're going through the data analysis process.
Now in addition to primary model results, you'll have a lot of secondary model sensitivity analyses, other things that you tried.
And even just a single number can be useful.
And then they performed a similar sort of experiment to try to identify how far apart people were, and they looked at the sort of to analyze the same question that was looked at with just 64 email chains before.
So, this is an info graphic that says that there in 2001 there will be 1.8 zetabytes that were created, which is a gigantic amount of data.
So here's a very simple example of a single predictor on a binary outcome that produces very good separation, okay?
The first step for any data analysis is pretty simple, you have to think.
Suppose we want to avoid missing too many actual cases of cancer, so we want to avoid false negatives.
And so, model one tells us that our estimate of beta is $44.75.
And you're now well qualified to use these tools of machine learning to great effect.
Are you expecting a certain number of columns?
If you don't have the right model, then it's pretty straightforward to see that your predictions just may be poor and they may have a high error rate.
So for example, if you're looking at, let's say, an air pollutant and some health outcome, you might find that the increase in the air pollutant results, you estimate that an increase in the air pollutant results in a 5% increase in the health outcome.
And you have seven days without the campaign, seven days with the campaign, and seven days without.
When I began my career, data gathering was manual, pen to paper.
So when you create an organization, in effect you are creating a broker for your organization.
So that data will come in using the IoT node shown here.
So, now one thing that we can see from the fake picture is that the normal distribution probably isn't going to be perfect from the get-go.
Especially, if those ingredients are not easily recognizable to you.
And they generally serve to kinda confuse the association between your key predictor and your outcome.
So for example, I've worked on autonomous helicopters for many years.
So here's a data set that's not real, but it kind of represents the ideal scenario for what you might see in an experiment like this.
So just to recap, with inferential questions the goal is to estimate the association between an outcome and a key predictor.
What we're really doing is we're using temperature as a proxy for the amount of sunlight that's available, cuz we don't really actually have any data on sunlight, okay?
And so model two seems kind of reasonable.
So this is the picture that we expect to see when we look at the data.
Or more generally, if we have a few different algorithms or a few different ideas for algorithms, how do we compare different precision recall numbers?
It's used in eight out of 10 of the US's top computer science programs.
So if you were to make a prediction it would be easiest just to say you have good credit.
One of which may be your analysis.
In my humble opinion the biggest, or rather one of the biggest problems that we have today in the world is mental health problems.
You really just want to summarize the numbers that you have.
And in fact if you've seen the fields of natural language processing or computer vision, these are the fields of AI pertaining to understanding language or understanding images.
Any good data analyst is gonna be engaged in regular informal communication multiple times through the data analysis process.
There may be ethical reasons where you can't collect the data.
Now every project has a time budget and a monetary budget.
You might be looking at the closing value of Apple stock every day for year.
But a plot will allow you to visualize both the mean and the deviations from the mean.
So there are a couple of basic principles to keep in mind when interpreting results, particularly in the later stages of a data analysis.
Chinese civilization has a long history, and China has made great changes in recent years.
If they don't fit, for example, this may lead you down a path of new discovery.
So you have to use a specified format for your topic space.
And you want to be able to separate the two classes using a set of features that you collect, and a model that you develop, okay?
Throughout my career, the data that's been available to use to solve problems has changed significantly.
So I'm looking forward to working with you in this course and to bring you through this data analysis process.
But it turns out just knowing the algorithms and knowing the math isn't that much good if you don't also know how to actually get this stuff to work on problems that you care about.
That greatest amount of challenge we face in this part of the world is maternal and child mortality.
Pull parsers, like sax parsers, read the document as a stream but pull parsers use an iterator based approach where the application rather than the parser decides when to move the parsing process along.
So there's a suggestion of a classic kind of S-shaped curve here, where there's a kind of roughly flat relationship, then there's a sharp, increasing one and then kind of leveling off.
And as we saw in the previous module, you can register your own message handlers.
So this is what I might consider to be a primary model.
There may be a specific person that you can talk to to answer that question and the communication will be very brief and to the point and the answer will just be a single fact, essentially.
That there always might be this possibility that you have to bring in additional data to improve your predictions.
So typically you don't put any weight on one predictor over another.
There is autonomous robotics, computational biology, tons of things in Silicon Valley that machine learning is having an impact on.
And there are a bunch of other keys as well, and together all of these values provide the data for one earthquake.
But it's nice to be able to check certain aspects of your data set, that they match something outside.
So this is at a large scale about essentially looking at correlations between lots of features in a data set.
So that's the evidence, that is the result of your analysis.
And the first trap is speaking in third person.
But when our children are drinking juice at every meal, or as a thirst quencher during the day, then the sugar really starts to add up.
You know, see how big it is.
As the processed food industry expands, many other parts of the world are also beginning to follow this trend.
So, separating the evidence, the interpretation, and the decision can help people think about the different components.
Today, with the digital capabilities around us and the amount of data being captured, our enhanced abilities to analyze data and draw conclusions from it allow us to provide our clients with real time insights that impact their business decisions faster than ever before and on a more informed basis.
Using prediction algorithms for prediction questions, and associational analyses for associational questions, is really important so that you can draw the right conclusions from your data, and not mistake the results of one question for another.
You know, if the earliest and latest dates are correct.
The second important part about a good question that hasn't already been answered already, so this may seem a little bit obvious, but with the wealth of data out there and literature and the availability of the Internet, it's quite possible that someone,somewhere out there, has either at least studied your question or maybe even answered it.
So Hadoop is another of these buzzwords you frequently hear around big data, and it is an incredibly powerful and useful technique, if your data is very, very large.
Usually all predictors are considered equally cuz they may contribute information to predicting the outcome.
So who would be affected is an important question when you kind of think about the magnitude and whether it's meaningful to you or not.
Or maybe you need to just slightly refine the question, make it more specific, or maybe focus on specific variables.
The second thing that I noticed was actually even more interesting and impactful on me.
Remember, that a ball's weight is a force exerted by gravity, whereas the ball's mass is its resistance to accelerations. How hard is it to make it change velocities?
Why we might want to have higher precision or higher recall and the story actually seems to work both ways.
Two words?
So if you were to look at this analysis and then ask the original question, and say, how is pm10 related to mortality?
Do you have the right data?
The only way to have software give these customized recommendations is to become learn by itself to customize itself to your preferences.
This might take the form or an informal meeting with a few people to present some preliminary tables or figures, you may need to show the data in terms of visualizations.
Given a set of evidence, you're interpretation might be that, oh if it's only 5% maybe air pollution's not so bad for you.
Now if you make a box plot, you can look at the distributions of the variables to see whether they're skewed.
For instance. There's a key called eqid and it's value is an earthquake id.
So the basic idea of a secondary model is that it's slightly different from your primary model.
So that's what can result from kind of incorrectly specifying your sampling process.
So given our little data set here of three penguins, we want to estimate what's the proportion with turquoise hats in the population.
Model two tells us it's $39.86, and model three tells us it's $49.10.
And so here's what the data looked like.
The answer to what did the data tell us.
All along the range of the x-axis you'll see there are both values of bad and good and so the separation isn't necessarily so good.
So you usually don't want to read that kind of stuff in, so looking at the bottom of the data set can, so to speak, can kind of help you to see if there's any of that junk down there.
Maybe you've dined at this restaurant all the time, so you know exactly how much it's gonna cost.
This applies the sigmoid function element-wise to each of the z2's elements.
So I just wanna give you a quick example of how you can use these three components in a kind of generic or kind of commonplace setting.
As you remember that application made a request to a web service for some data about earthquakes.
These types of questions can often be directly addressed via experiment.
Even among machine learning practitioners, there isn't a well accepted definition of what is and what isn't machine learning.
So I want to show this application to you now. Instead, let's look at this application's source code.
So anything that's not an input layer or an output layer is called a hidden layer.
And so the fact that you still found the time or took the time to watch these videos and, you know, many of these videos just went on for hours, right? And the fact many of you took the time to go through the review questions and that many of you took the time to work through the programming exercises.
So, it doesn't quite match our expectations of having this very good separation, right?
And is it big or is it small, or how do they compare to each other from the different models?
So one of the most common ones is the Eclipse Paho project.
You'd like to have a question that's very sharp.
Before you engage in the inference process, just take a little bit of extra time to think about what the population is and maybe engage in experts in this area.
Rather than have no model at all that's almost certainly not going to be useful. All right?
In this module, we're going to be looking at MQTT, MQTT is an open source protocol that was designed to solve the problems of machine to machine communication.
So, it's important not to get hung up on finding the right model.
And so, and finally, the purpose of your analysis may change over time.
And so, just looking at the very edges of the data set can be very useful to flag a number of basic problems that can very often occur and are usually easy to fix.
And so you wanna think about, given the outcomes of your decision what types of metrics you want, whether sensitivity or specificity, or all these other kinds of metrics, which ones are going to be most important to you in your setting.
And that way you set the expectations for yourself, and you can determine whether the reality kind of meets that expectation.
Now inside the while loop there are 3 events that this code checks for.
You might think that this data set represents a sample from all the future or past years, for example, of data that might exist.
So we're gonna talk about the characteristics of each of these types of questions and how they can be used or misused in any data analysis.
This may not involve a sampling process that represents your population.
So we need a model to help us simplify that, to allow us to think about it in a kind of a reasonable way.
For example, you might have data on the sex of people in your study.
And so the three basic questions related to the different kinds of violations that you might encounter when doing inference.
I hope to make you one of the best people in knowing how to design and build serious machine learning and AI systems.
Many scientists are starting to think that the large amount of sugar in the average American diet is actually one of the main reasons why we see so many people with obesity and diabetes.
We might think okay a normal distribution is a pretty reasonable approximation for the dataset okay?
In this kind of setting, both informal and formal communications are very important and you want to be able to coordinate communication between members or with yourself to make sure that the process is constantly moving forward.
Much of the thinking behind pandas is similar to relational theory.
Now, it may not actually respond to the key predictor in a causal sense, but the idea is that when the key predictor changes, the outcome changes along with it, whether it is causal or not.
Remember that no model is going to be right, but it may actually still be useful for helping us summarize the data.
Now, now let's suppose we add a background trend into our model.
This first course is broken into four modules.
Almost everything you find on the supermarket shelf that has been packaged, canned or bottled, falls into the category of processed foods.
In this module we're going to be looking at some additional nodes available within the pallet.
In particular, if a patient actually has cancer, but we fail to tell them that they have cancer then that can be really bad.
The point I want to make is the kind of the analysis that you do for an associational type of analysis is very much along these lines.
So there, you would make sure that if someone, you would want to make sure that if someone does not have the disease, that you pick that the algorithm picks that up.
Again, we'll talk about this in more detail later on.
And each of those maps again contains key value pairs.
You won't have to worry about confidence intervals or standard errors or P values because you'll have all the certainty in the world that you'll need, assuming that you don't have any problems with that data set.
And the basic goal here is you want to look at trends or relationships between variables in your data set.
So weather is associated with mortality and it's also highly associated with various air pollutants and so we can characterize the weather with something like temperature or dew point temperature to think to just capture a piece of kind of what weather is and so we can include that into our model.
And then, when data is published against that topic, you're going to receive a copy of the message.
Now, so positive or negative is what we're talking about when we talk about directionality.
You'll have an outcome, y and you'll have a key predictor, x.
The fact that fruit juice is on this list often surprises parents, because many of us have grown up thinking that fruit juice is healthy and it can be, in the right amounts.
And eventually learn to play checkers better than the Arthur Samuel himself was able to.
Now the simplest model that we can use to describe this relationship that we expect is a linear model.
Tom, from Pittsburgh came in, they said he be cold, hot, you know, he couldn't get a sweater.
You might ask your friends, if they've been their before, how much does this place cost.
And finally we also spent a lot of time talking about different aspects of, sort of, advice on building a machine learning system.
So really actually, no matter what, no matter how you answer the question of do these results make sense, the thing that you need to do is check what you've done, check it over to make sure there's no simple mistakes.
And you'll notice immediately that most of the predictions are just of good.
Since the ball's weight is the force that the earth's gravity exerts on the ball, the only way to observe it directly is to let go of the ball and watch it accelerate.
So we can use the normal distribution to say that 11% of the population would be willing to pay more than 30.
Exploratory data analysis is an important part of any data analysis process.
And if you're curious about what the things are on the bookcases behind me, well, you'll just have to watch the rest of the videos to find out.
The F Score, which is also called the F1 Score, is usually written F1 Score that I have here, but often people will just say F Score, either term is used. Is a little bit like taking the average of precision and recall, but it gives the lower value of precision and recall, whichever it is, it gives it a higher weight.
So when we build a prediction model the thing that we want is to be able to find a feature or a set of features that can produce good separation in the outcome.
So what I'm gonna talk about in this lecture is really how do we develop a sketch of a solution to your question, assume your question is appropriate and is answerable, we want to start sketching out a solution.
And then you can start the process again with a new primary model, and then perhaps a new set of secondary models to challenge that primary model.
Formal modeling is the process of precisely specifying statistical models for your data, and for providing a rigorous framework for challenging and testing those models.
So you kinda have a sense of what makes it good, how long it should be, what the structure should be in terms of the verse and the chorus and the bridge.
So that's the first part of the data analysis cycle.
Plain water is by far the best way to quench their thirst and to keep them hydrated.
And so there are different metrics that you want to favor over each other, depending on the kind of decision that will be made and the consequences of those decisions.
Okay, and then to make inference, what we do is we collect a sample from the population.
As a reminder, here are the definitions of precision and recall from the previous video.
And you can see that the coefficient for PM10 here is 0.00004, etc.
With the advent of cloud, REST APIs are becoming the de facto way for systems to communicate, using http verbs, post, get, delete, and JSON as the data format to exchange information.
Finally, we usually don't care about the mechanism or the specific details of the relationships between the various predictors or variables.
But in reality, events are going to come in automatically, so there are a series of nodes to allow you to capture events and respond to those events.
So a small glass of freshly squeezed orange juice can be a great source of vitamin C, and fiber, and a really nice way to start the day.
The last thing of course, you wanna think about is the uncertainty.
So you usually don't care about the mechanism, or how things work when you're trying to do a prediction analysis.
Okay, what did they tell us?
So the magnitude is really the size of the association or the size of the effect that you estimate.
So, models are generally speaking are just constructs that we build to help us understand the real world, okay.
And that's a remarkable result.
If you have programming experience, but not Python-specific experience, you can pick up Python very quickly.
I've got a histogram of all of the data points that were from the survey.
Last thing you wanna think about is the implications, so you wanna consider the implications of your analysis and how they might determine what actions to take.
Then there's the special case of using the data as a population, in which case you're not making any inferences at all but you actually have all the information you need.
But these ingredients are all really just variations of sugar.
So if you have a background in databases, you'll find the pandas environment fairly natural to work in.
Now the nice thing about most software packages now is that we could just simulate the error from a normal distribution and see what it looks like.
Like machine learning methods that are useful for making very accurate predictions without you having to specify very carefully what the model is.
And you can see that it has highly seasonal components, the mortality tends to be higher in the winter and lower in the summer, and it's very specific to kind of to pattern across every year.
And we want to see how that association changes under different scenarios.
Now that list. Has several objects inside it, and each of those objects is itself a map.
Followed by dew-point temperature, followed by the date, and then no2, ozone which is o3. Season.
So if there's a population out there that we're trying to make inferences to, or to describe in some way, we use a model to kind of help us do that.
Okay, so just look at the first few rows and then maybe look at the last few rows.
It turns out what are the other things to spend a lot of time on in this class is practical advice for applying learning algorithms.
So eventually, as you iterate through the exploratory data analysis, you may be led to a place where you simply don't have any more data to answer the questions.
One natural thing that you might try is to look at the average precision and recall.
It's often very common to mistake one type of question for another.
So let's open up that class and see how it works.
So, I went up to my dean and I asked him if I could be allowed to offer a course on happiness.
so, created by IBM and so it's it was able to contain much less data than you could even store right now on your computer or even on your cell phone.
We're not going to cover the exact programming model, but I will show you some of the clients available.
So, that's sometimes called an n of one analysis.
Now as before, the key method in this class is the handle response method.
One of the most common things you may need to do when you create your IoT solution is augment your sensor data with data from other sources.
It's a very simple check.
So this is what I would call the trivial model, meaning that there's no model.
And in this case, that's an ordered list.
This was a remarkable result.
Now, the one thing about this picture that you have to just remember is that there is no data in this picture.
And so think about that as you're building prediction algorithms and you're seeing the results.
The first thing that I noticed was that there was very little correlation between academic success and career success.
So it's possible that we would've seen higher sales in the product, even without the ad campaign, just because of these background trends.
So the model is telling us something very different about what the population is willing to pay for this product.
Another thing that I like to do that I think is very useful is to show the data.
Cuz if you don't keep in mind the exact purpose of why you're analyzing the data, you may over or under invest resources into your project.
And for the Internet of Thing platform, the broker is your organization.
You could turn on, the air conditioner or you could catch a plane down to Florida in the middle of the winter to, to escape the cold.
Because they are used as part of the authentication mechanism.
The weight pulls down on the ball and according to Newton's second law, the ball accelerates downward in response.
Those are all the secondary models that you want to look at.
So on a very basic biological level, plants are often more complex than animals are.
It can be called corn syrup, glucose, dextrose brown rice syrup, or even evaporated cane juice.
The outcome is the factor or the feature that we think varies along with what I call a key predictor.
The second question you wanna ask is do we understand the sampling process?
First, there may be uncertainty in your estimates from the data that are not accounted for.
Why does a dropped ball fall downward?
This is the leading candidate based on all the current information you have and based on any kind of exploratory analysis that you've done.
Now, but the real question is when do you stop, because none of us live forever.
I can see that the daily sales change as you go in and out of the campaign.
So that's bad.
It's the value of y when x and z are zero.
But beforehand, we don't necessarily divide them into different groups based on importance.
So the goal is to minimize the size of this gray area using some set of features that you can collect.
And many times you'll find that actually your primary analysis was not right and you'll focus on a different model and you make that your primary model.
And importantly, you may need to eventually argue for more resources or more time, and you're going to have to persuade other people in those cases.
Have a fantastic journey of Chinese learning.
So in the first seven days you have an average about 200 dollars per day.
Okay.
What if it was possible, to design and test business models, in a way, that gives your idea a better shot at rising to the level of the great businesses that are shaping our world.
If the directionality changes then that may call into question whether you've got a reasonable modeling strategy or if even that association actually exists in the population.
Here's a definition of what is machine learning as due to Arthur Samuel.
Either in another dataset or a different population of kind of data points.
So these are things that tend to be, they are associated with your key predictor and they're also associated with their outcome.
The on post execute method is called, and this provided the result as its parameter and is you can see.
There are examples available, showing you how to use MQTT protocol and clients against the IBM Internet of Things platform.
One, two, three.
Just get an answer.
It's the story of how they succeeded and if you're a entrepreneur or anyone with a great idea it could easily be your story too.
Which means that if you actually have bad credit, the probability that the algorithm will classify you as such is only about 2.6%, so it's very low.
So what conclusion you make ultimately may depend on outside factors like cost, or kind of timing issues.
Concretely, suppose we have three different learning algorithms. So actually, maybe these are three different learning algorithms, maybe these are the same algorithm but just with different values for the threshold.
That's another indicator that you're probably out of data.
And now I'll launch the networking Android http client JSON application.
So why is machine learning so prevalent today?
But if you go out on to the Internet, you'll find a myriad of MQTT implementations.
There are six types of questions you can ask in a data analysis, and in this lecture, we'll talk about each of them and what they mean for interpreting your results.
This time however, the data will be summarized and presented in a list view.
It's a force exerted on the ball by the earth's gravity and it, the ball, the ball itself responds to that weight by accelerating, by falling.
It's a polynomial model.
And in addition to having the tools of machine learning at your disposal so knowing the tools of machine learning like supervised learning and unsupervised learning and so on, I hope that you now not only have the tools, but that you know how to apply these tools really well to build powerful machine learning systems.
But your interpretation can be made separate from the evidence that's provided.
And likewise, if you were expecting something negative, you can see whether the result was negative.
And in this case what we would have is going to be a higher recall classifier, because we're going to be correctly flagging a higher fraction of all of the patients that actually do have cancer.
And so that's okay, but the problem is that the algorithm's specificity is very poor.
The pandas Toolkit is a fundamental in Python data science, and provides a data structure for thinking about data in a tabular form.
Welcome to an introduction to Data Science with Python.
And this model can be more or less complex depending on the type of question that we're trying to answer.
There are other trends in the background that are kind of messing up your relationship.
To the extent that you might be able to control the audience that you present to, you wanna make sure that you have the right people or the right person, so that you can get the most efficient answers to your question.
So defining the population is far and away the most important task.
So far, our example application has requested data, And then just displayed that data in a text view.
And so for the same time period you can see this has also a seasonal structure to it.
And so, some of these steps are covered by different parts of the class, and so reproducible research synthesizes where those parts go together.
In this class you learn about state-of-the-art machine learning algorithms.
The key is called earthquakes, and the value is an ordered list.
And I wanted to say: Thank you very much for having been a student in this class.
I'll start first with an inferential question.
Another possibility is that, of course, Apple is on a stock market, and there's other stocks that trade on a stock market.
And so they will help you sort that out.
Hi, welcome to what a plant knows. If by chance you clicked on the wrong course, go back and get to the course you want because this is what a plant knows.
And do you have the right question, or does it need to be refined a little bit?
But for a healthy child, it's usually best to stick with plain water from a clean, reliable source.
Now, so that's a very, that's a physical type of model.
You just basically have to look at the metadata, things like the variable names and the number of rows, for example.
And so, you can make that conclusion very quickly.
And if you look in your data set, and you have a measurement on that same feature, and it looks like the average is around ten.
In other words, you should be able to explain the mechanism for how things work.
What is the task T in this setting?
How would that affect your life?
The second thing you wanna think about is your primary results.
Then you're analysis will be greatly simplified because you won't have to worry about making statements about things that you don't actually observe because you actually observe everything.
And I won't get into that very much right now, but my only point to make here is that you should try the easy solution first.
Now you will notice there is a big clump of points in the range of the x-axis.
I'm just kidding, I'm Indian by origin but I do consider myself to be a bit of a world citizen.
And we also spent a lot of time talking about debugging learning algorithms and making sure the learning algorithm is working.
So just as a very quick, simple running example, here's a basic population of ten penguins. Okay, and each of these penguins has a turquoise hat or a purple hat.
So you thought that the restaurant was cheaper than it actually was.
Some of the key questions that you wanna think about before engaging in an exploratory data analysis process, are basically, do you have the right data to answer the question that you’re interested in?
Isn't business the opposite of happiness?
So this lecture we'll talk about figuring our whether you've got the right data for the job.
Total sales and the ad campaign, while adjusting for other confounding factors, like this potential background trend.
So comparable to what we saw with the JSON format there's an element called eqid and it's value is an earthquake id.
And the sampling process that we use results in our dataset, okay.
And just like with other aspects of data analysis, everything you do is based on setting expectations and kind of matching the data that you collect with the information that you collect to those expectations.
You're not gonna want to use very technical jargon that will only be understood by a small subset of people, try to use language that'll be understandable to a broader range of people.
The sixth type of question that we are interested in is a mechanistic type of question.
But now, suppose we want to predict that the patient has cancer only if we're very confident that they really do.
The first thing you want to look at is the effect size.
So they're all kind of independent of each other.
This is, you know, this looks awfully like the standard logistic regression model, except that I now have a capital theta instead of lower case theta.
So the basic idea is that we expect days with higher temperature to have higher ozone levels.
But the most important thing, excuse me, is that you're gonna want to make sure you keep track of all the tuning parameters you set and the process through which you set these tuning parameters.
So once you've gone through this process, you've looked at the data, you've checked to see that everything's valid, you wanna be able to follow up.
Thank you so much for taking the course, and I look forward to getting to know you better in the coming six weeks.
So this picture shows two simulated variables with a linear relationship.
The association between the two variables is zero.
Let's take a look at an example application that gets this data from the internet, and then processes it to create a more human readable display.
And sometimes its okay to even literally just draw it with your hand.
You've made maybe a simple plot just to kind of visualize the data, and you've validated with one external data source.
But even if customers love your value proposition, you can fail if your business model is not scalable and financially sustainable.
And that has an implementation of an MQTT client and again it's in a number of languages.
In the next lesson, we'll be looking into this node in a lot more detail.
And I know that many of you have worked hard on this class and that many of you have put a lot of time into this class, that many of you have put a lot of yourselves into this class.
And here you can see I have the page running.
If I make slides with all the information on it, no one is going to be listening to me, because they're thinking, "Oh, I'll get the slides later." But if I can make slides that use the lazy rule, people are going to be like, "Wait.
So, for example tons of Silicon Valley companies are today collecting web click data, also called clickstream data, and are trying to use machine learning algorithms to mine this data to understand the users better and to serve the users better, that's a huge segment of Silicon Valley right now.
The actual collecting of the data involves going to the restaurant and getting the check.
Or let's say, look at it from a more biological point of view, how could we reproduce, if we couldn't move.
But we're actually not interested in those parameters.
So, you often think, see things like, you know, big data and cloud management and cloud and big data tend to go together and that's because for some data sets they're so big that you can't analyze them on your local laptop computer.
And you can see that there's a gray area here that I've highlighted in the plot that's kind of near the middle.
One primary model and two secondary models.
You've probably use a learning algorithm dozens of times a day without knowing it.
The specialization is of an intermediate level or difficulty, and we expect that you have studied some basic programming and statistics in the past.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment