Data Sets
A data set is a collection of data, normally presented in tabular form. Every column describes a particular variable, and each row corresponds to a given member of the data set, as per the given question. Working with data sets is a part of data management.
A data set records the value of each variable, such as the height, weight, temperature or volume of an object, or the values of random numbers. Each individual value in the set is known as a datum. The data set consists of the data of one or more members, with each member corresponding to a row.
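The row-and-column layout described above can be sketched in Python. This is only an illustration: the names, heights and weights are made-up values, and `column` is a hypothetical helper, not part of any library.

```python
# Hypothetical data set: each dict is one row (a member of the data set),
# and each key is one column (a variable).
data_set = [
    {"name": "A", "height_cm": 150, "weight_kg": 45},
    {"name": "B", "height_cm": 160, "weight_kg": 55},
    {"name": "C", "height_cm": 170, "weight_kg": 65},
]


def column(rows, variable):
    """Return every value of one variable, i.e. one column of the table."""
    return [row[variable] for row in rows]


heights = column(data_set, "height_cm")
print(heights)        # [150, 160, 170]
print(len(data_set))  # 3 members (rows)
```

Each number in `heights` is a single datum; the whole list is one variable of the data set.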
Data Sets Meaning
A data set is an ordered collection of data. In data handling, a data set can be a group of tables, schemas and other objects. The data are organized according to a model that helps to process the needed information. A data set is any permanently saved collection of information, and it usually contains either case-level data, gathered data, or statistical guidance level data.
Types of Data Sets
In Statistics, we have different types of data sets available for different types of information. They are:
- Numerical data sets
- Bivariate data sets
- Multivariate data sets
- Categorical data sets
- Correlation data sets
Let us discuss all these data sets with examples.
Numerical Data Sets
A numerical data set is a data set in which the data are expressed in numbers rather than natural language. Numerical data is sometimes called quantitative data, and the set of all the quantitative/numerical data is called the numerical data set. Because numerical data is always in number form, we can perform arithmetic operations on it.
Example:
- Weight and height of a person
- The count of RBC in a medical report
- Number of pages present in a book
Bivariate Data Sets
A data set that has two variables is called a bivariate data set. It deals with the relationship between the two variables, so a bivariate data set usually contains two types of related data.
Example:
- The percentage score and age of the students in a class. Here score and age are the two variables.
- The sales of ice cream versus the temperature on that day. Here the two variables are ice cream sales and temperature.
(Note: If you have only one variable, say temperature, then the data set is called a univariate data set.)
Multivariate Data Sets
A multivariate data set is a data set with multiple variables: when the data set contains three or more variables, it is called a multivariate data set. In other words, a multivariate data set consists of individual measurements that are acquired as a function of three or more variables.
Example: If we have to measure the length, width, height and volume of a rectangular box, we have to use multiple variables to distinguish between those quantities.
Categorical Data Sets
Categorical data sets represent features or characteristics of a person or an object. A categorical data set consists of categorical variables, also called qualitative variables, whose values are categories rather than numbers. A categorical variable that can take exactly two values is termed a dichotomous variable. Categorical variables with more than two possible values are called polytomous variables. Qualitative/categorical variables are often assumed to be polytomous unless otherwise specified.
Example:
- A person’s gender (male or female)
- Marital status (married/unmarried)
Correlation Data Sets
A set of values that demonstrate some relationship with each other is called a correlation data set. Here the values are found to be dependent on each other.
Generally, correlation is defined as a statistical relationship between two entities/variables. In some scenarios, you might have to predict the correlation between two quantities, so it is essential to understand how correlation works. Correlation is classified into three types. They are:
- Positive correlation – The two variables move in the same direction (either both go up or both go down)
- Negative correlation – The two variables move in opposite directions (when one variable goes up, the other goes down, and vice versa)
- No or zero correlation – There is no relationship between the two variables.
Example: A tall person tends to be heavier than a short person. So here the weight and height variables are dependent on each other, and the correlation is positive.
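The three types of correlation can be illustrated with a small sketch. The text above does not name a specific measure, so this example assumes Pearson's correlation coefficient, one common choice: it is positive when the variables move together, negative when they move in opposite directions, and zero when there is no linear relationship. All data values and the helper names `pearson_r` and `correlation_type` are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return covariation / math.sqrt(spread_x * spread_y)


def correlation_type(r, tolerance=1e-9):
    """Classify the sign of r; the tolerance absorbs floating-point noise."""
    if r > tolerance:
        return "positive"
    if r < -tolerance:
        return "negative"
    return "zero"


# Made-up heights (cm) and weights (kg): taller members are heavier.
heights = [150, 160, 170, 180]
weights = [50, 58, 66, 74]
print(correlation_type(pearson_r(heights, weights)))             # positive

# Made-up values where one variable falls as the other rises.
print(correlation_type(pearson_r([1, 2, 3, 4], [8, 6, 4, 2])))   # negative

# Made-up values with no linear relationship.
print(correlation_type(pearson_r([-1, 0, 1], [1, 0, 1])))        # zero
```

In real data the coefficient is rarely exactly zero, so in practice a value close to zero is read as little or no linear correlation.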
Mean, Median, Mode and Range of Data-Sets
Mean, median, mode and range are major topics in statistics. Let us go through them here with respect to data sets.
Mean of a data set is the average of all the observations present in the table. It is the ratio of the sum of the observations to the total number of elements present in the data set. The formula for the mean is:
Mean = Sum of Observations / Total Number of Elements in Data Set
Median of a data set is the middle value of the collection of data when arranged in ascending or descending order. If the number of observations is even, the median is the average of the two middle values.
Mode of a data set is the value that is repeated the maximum number of times in the set.
Range of a data set is the difference between the maximum value and minimum value.
Range = Maximum Value – Minimum Value
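The four definitions above can be written as short Python functions. This is a minimal sketch, not a reference implementation: the function names and the sample values are hypothetical, and the data set is assumed to be a non-empty list of numbers. `modes` returns a list because more than one value can share the highest frequency.

```python
from collections import Counter


def mean(values):
    """Sum of observations divided by the number of observations."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the sorted data (average of the two middle values if even)."""
    ordered = sorted(values)
    middle = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


def modes(values):
    """All values that occur the maximum number of times."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]


def value_range(values):
    """Maximum value minus minimum value."""
    return max(values) - min(values)


sample = [3, 7, 7, 9, 14]  # made-up observations
print(mean(sample))         # 8.0
print(median(sample))       # 7
print(modes(sample))        # [7]
print(value_range(sample))  # 11
print(median([1, 2, 3, 4])) # 2.5 (even number of observations)
```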
Properties of Dataset
Before performing any statistical analysis, it is essential to understand the nature of the data. We can use different Exploratory Data Analysis (EDA) techniques to identify the properties of the data, so that the appropriate statistical methods can be applied to it. With the help of EDA techniques, we can check the following properties of the data set.
- Centre of data
- Skewness of data
- Spread among the data members
- Presence of outliers
- Correlation among the data
- Type of probability distribution that the data follows
Data Sets Example
Example 1:
Find the mean, mode, median and range of the given data set.
{2, 4, 6, 8, 2, 10, 12}
Solution:
Given, {2, 4, 6, 8, 2, 10, 12} is a set of data.
Mean = (2+4+6+8+2+10+12)/7 = 44/7 ≈ 6.29
To find the median, we first have to arrange the given data in ascending or descending order.
So, {2, 2, 4, 6, 8, 10, 12}. The middle (4th) value of these 7 values is 6. Thus,
Median = 6
Mode = 2
Range = 12-2 = 10
Example 2:
Find the mode for the given data set: 2, 3, 3, 4, 6, 7
Solution:
Given data set: 2, 3, 3, 4, 6, 7
We know that the mode is the most frequently repeated value in the data set.
From the given data set, it is observed that the value “3” appears twice, while every other value appears only once.
Hence, the mode for the given data set is 3.
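The answers to the two worked examples above can be checked with Python's standard `statistics` module. This is just a verification sketch; the variable names are illustrative. `Fraction` is used so that the mean of Example 1 is shown exactly as 44/7 rather than as a rounded decimal.

```python
from fractions import Fraction
import statistics

example_1 = [2, 4, 6, 8, 2, 10, 12]
example_2 = [2, 3, 3, 4, 6, 7]

# Example 1: mean, median, mode and range
mean_1 = Fraction(sum(example_1), len(example_1))  # exact value of the mean
median_1 = statistics.median(example_1)            # sorts the data internally
modes_1 = statistics.multimode(example_1)          # list of most frequent values
range_1 = max(example_1) - min(example_1)

print(mean_1, median_1, modes_1, range_1)  # 44/7 6 [2] 10

# Example 2: mode only
modes_2 = statistics.multimode(example_2)
print(modes_2)  # [3]
```

`statistics.multimode` returns a list, so a data set in which several values tie for the highest frequency yields more than one mode.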
Practice Problems
Solve the following problems:
- Find the mean for the dataset: 5, 3, 1, 6, 8, 9.
- Find the median for the dataset: 6, 2, 4, 5, 7.
- Find the mode and range for the following dataset: 3, 9, 12, 23, 7, 16, 5.
Frequently Asked Questions on Data Set
What is meant by dataset?
The set or the collection of data is called a dataset. In other words, the dataset is the ordered collection of data.
What are the different characteristics used to measure the dataset?
In statistics, the different characteristics used to measure the dataset are mean, median, mode, range, and so on.
How to calculate the range of the given dataset?
The range of the given data set is the difference between the maximum and minimum value of the data set.
What are the different types of dataset?
The different types of dataset are:
Numerical dataset
Bivariate dataset
Multivariate dataset
Categorical dataset
Correlation dataset
What is the median of the dataset?
The median is the middle value of the dataset when the data are arranged in ascending or descending order.