# Data Collection

10.04.2018

Episode #1 of the course Introduction to statistics by Polina Durneva

Welcome to the course!

Recalling their high school or college years, people often cringe when they hear words like “stats” or “math.” In this course, you will see how enjoyable and entertaining statistics can be! I will explain the basics of statistics in the simplest terms and use fun examples to do so.

We will start the course with an introduction to data. No statistics can be done without data, so let’s discuss the structure of data and the importance of sampling from a population.

Structure of Data

It is important to understand that data consists of cases and variables. When we gather information about anything, we collect cases. For instance, you might want to collect information about the academic degrees that people in Austria hold. Each person about whom you collect data is a separate case. Once we have our cases, we describe them using variables. Each variable is a separate characteristic of a case. In the example above, the type of academic degree a person holds is a variable.

The two most common types of variables are categorical variables and quantitative variables. As you may guess from the names, categorical variables divide cases into categories or groups, and quantitative variables give you a numeric measurement that describes a case. For instance, if you collect data about high school students, you might want to categorize them by year: freshman, sophomore, junior, or senior. Student year is a categorical variable. If you also collect these students’ test scores or GPA, either of those is a quantitative variable.
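To make cases and variables concrete, here is a minimal Python sketch. The student names, years, and GPAs are made up purely for illustration; each dictionary stands for one case, and each key stands for one variable.

```python
from collections import Counter
from statistics import mean

# Each dictionary is one case (one student); each key is a variable.
# All names and values are invented for illustration.
students = [
    {"name": "Ana", "year": "freshman", "gpa": 3.4},
    {"name": "Ben", "year": "senior", "gpa": 3.0},
    {"name": "Cy", "year": "freshman", "gpa": 3.8},
    {"name": "Dee", "year": "junior", "gpa": 2.6},
]

# Categorical variable: we count how many cases fall into each category.
year_counts = Counter(student["year"] for student in students)

# Quantitative variable: arithmetic such as an average makes sense.
average_gpa = mean(student["gpa"] for student in students)

print(year_counts)
print(round(average_gpa, 2))
```

Notice that counting is the natural thing to do with the categorical variable, while averaging only makes sense for the quantitative one.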

Sampling from a Population

Let’s proceed to another important topic: sampling from a population. We collect data primarily to draw inferences from it. Most of the time, however, data collection is a tough, challenging, and costly process. For example, let’s say you are in a huge fast food restaurant, and you want to find out how many people in the restaurant are eating burgers with fries and how many are eating chicken wings with buffalo sauce right now. If you want the exact numbers, you will have to go to each person, ask your question, record the answer, and move on to the next person. Obviously, such a process is incredibly tedious and tiring: You will have to walk a lot and keep track of people coming in and out so that you do not count anyone twice.

Statisticians have a great solution for such a problem. You can take a subset of the population, called a sample, and make inferences about the whole population using just that sample. In our fast food restaurant, you would randomly select some of the customers and ask them what they are eating. Choosing observations at random is important because you want to avoid bias.
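Here is a small Python sketch of that idea. The restaurant of 200 customers, the 120/80 split, and the sample size of 40 are all assumptions chosen for illustration, not real data.

```python
import random

random.seed(42)  # fixed seed so the sketch gives the same result each run

# Hypothetical population: 200 customers, 120 eating burgers, 80 eating wings.
population = ["burger"] * 120 + ["wings"] * 80

# Ask 40 randomly chosen customers instead of all 200.
# random.sample picks without replacement, so nobody is asked twice.
sample = random.sample(population, k=40)

true_share = population.count("burger") / len(population)
estimated_share = sample.count("burger") / len(sample)

print(f"True share of burger eaters: {true_share:.2f}")
print(f"Estimate from the sample:    {estimated_share:.2f}")
```

The estimate will usually land close to the true share, though not exactly on it, and you only had to talk to a fifth of the customers.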

But what is bias? If, for instance, you approach only one group of people in our fast food restaurant and ask only them, your sample is biased. This group might consist of people who always eat burgers and fries, and you would then draw the inaccurate inference that everyone in the restaurant eats only burgers and fries.
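The same effect can be shown with a tiny Python sketch. The three tables and what their guests ordered are hypothetical; the point is only to compare asking one group with looking at everyone.

```python
# Hypothetical seating: table 1 holds regulars who all order burgers.
tables = {
    "table_1": ["burger"] * 10,
    "table_2": ["wings"] * 6 + ["burger"] * 4,
    "table_3": ["wings"] * 7 + ["burger"] * 3,
}

# The whole population: every meal at every table.
everyone = [meal for meals in tables.values() for meal in meals]

# Biased sample: we only ask the people at table 1.
biased_sample = tables["table_1"]

biased_share = biased_sample.count("burger") / len(biased_sample)
true_share = everyone.count("burger") / len(everyone)

print(f"Burger share at table 1 only: {biased_share:.2f}")
print(f"Burger share in the restaurant: {true_share:.2f}")
```

Asking only table 1 suggests that everybody eats burgers, while in this made-up restaurant only 17 of the 30 customers do.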

That’s it for today! Tomorrow, we will discuss different ways to evaluate and visualize quantitative variables.

See you soon,

Polina

Recommended book

The Signal and the Noise: Why So Many Predictions Fail-But Some Don’t by Nate Silver
