Introduction to data analysis

With the data pre-processed, the next step is data analysis. The purpose of data analysis is to extract value from data. As mentioned, data is a burden if we lack the capability to get value out of it. Data analysis yields insights that can be used in later decision making and other applications, so it is a crucial step in the data chain.

While both data analysis and data modelling can extract value from data, data analysis is more about getting useful results from the data, whereas data modelling is more about building the function that maps the data to its result. Consider $y=f(x)$, in which $x$ is the data, $y$ is the result, and $f()$ is the function: data analysis cares more about $y$, and data modelling cares more about building $f()$.

A related concept that needs mentioning is data mining. Although there are some minor differences between mining and analysis (e.g., some believe mining is more about exploring the data), in most cases the two terms are interchangeable. Therefore, in this course, we treat them as the same ([1] gives a comparison between data analysis and data mining, but I disagree with many of its conclusions).

Data analysis can refer to many different types of methods. A classic classification has four types: descriptive analytics, diagnostic analytics, predictive analytics, and prescriptive analytics [2]. We mainly introduce methods that belong to the first two types, which are about getting information from data on what happened in the past. The last two types, I believe, are more about data modelling, which we will discuss later.

Methods

The data analysis methods introduced here are mostly statistical methods. I group them into two categories: statistical analysis for one variable, and statistical analysis for multiple variables.

Statistical analysis for one variable

Statistical summarization

Statistical summarization summarizes the information in the data. Common methods include:

Measures of replication: the count $n$ of data $X$.
Measures of centrality: arithmetic mean, geometric mean, trimmed mean, median, and mode, as described before.
Measures of dispersion: The dispersion of data refers to how spread out the values are around the average. As shown in the figure below, if the values are close to the average, then the data has low dispersion. If the values are widely scattered about the average, the data has high dispersion.

  The most common measure of dispersion is variance:

$var(X) = \sigma^2 = \frac{1}{n}\sum_{i=1}^n(x_i - \mu)^2$

where $\mu$ is the mean of the data $X$, which has $n$ values. $\sigma^2$ is called the variance, and $\sigma$ is called the standard deviation.
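As a minimal sketch of the variance formula above, the following Python snippet computes $\sigma^2$ and $\sigma$ with the $1/n$ normalisation. The sample values and the function name `population_variance` are made up for illustration only.

```python
import math
import statistics


def population_variance(values):
    """Variance with the 1/n normalisation: mean squared deviation from the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


# Hypothetical sample, chosen only so the arithmetic is easy to follow.
data = [2, 4, 4, 4, 5, 5, 7, 9]

var = population_variance(data)   # sigma^2
std = math.sqrt(var)              # sigma

# Cross-check against the standard library, which uses the same 1/n convention.
assert math.isclose(var, statistics.pvariance(data))

print(var, std)
```

Note that `statistics.pvariance` divides by $n$, as in the formula here, while `statistics.variance` divides by $n-1$ and is meant for estimating the variance from a sample.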

If the distribution of $X$ is known, the variance is then:

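$\sigma^2 = \sum_{i=1}^N P(x_i)(x_i-\mu)^2$, for a discrete distribution with $N$ possible values, where $P(x_i)$ is the probability of value $x_i$.

$\sigma^2 = \int P(x)(x-\mu)^2 dx$, for a continuous distribution, where $P(x)$ is the probability density.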
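The two distribution-based formulas can be sketched in Python as follows. This is only an illustration under assumptions of my own: the fair die and the uniform distribution on $[0, 1]$ are example distributions, the function names are hypothetical, and the integral is approximated with a simple midpoint sum rather than a proper integration routine.

```python
from fractions import Fraction


def discrete_variance(pmf):
    """pmf maps each possible value x_i to its probability P(x_i)."""
    mu = sum(p * x for x, p in pmf.items())
    return sum(p * (x - mu) ** 2 for x, p in pmf.items())


def continuous_variance(pdf, lo, hi, steps=100_000):
    """Approximate both integrals with a midpoint Riemann sum on [lo, hi]."""
    dx = (hi - lo) / steps
    xs = [lo + (i + 0.5) * dx for i in range(steps)]
    mu = sum(x * pdf(x) for x in xs) * dx
    return sum(pdf(x) * (x - mu) ** 2 for x in xs) * dx


# Example 1 (assumed): a fair six-sided die, each face with probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}
die_var = discrete_variance(die)

# Example 2 (assumed): uniform density P(x) = 1 on [0, 1].
uniform_var = continuous_variance(lambda x: 1.0, 0.0, 1.0)

print(die_var, uniform_var)
```

Using `Fraction` for the die keeps the discrete result exact, while the continuous result is only a numerical approximation whose accuracy depends on `steps`.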

 A good way to visualize the dispersion of data is the box plot, for which you can find [a short introduction here](<https://www.statisticshowto.com/probability-and-statistics/descriptive-statistics/box-plot/>).

Measures of range: the difference between the highest value and the lowest value in a data set.

(Image source: https://www.statisticshowto.com/probability-and-statistics/statistics-definitions/range-statistics/)
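The summary measures listed above can be computed with Python's standard library, as in the rough sketch below. The sample values are made up, and `trimmed_mean` is a hypothetical helper that assumes one common definition of the trimmed mean (drop the `k` smallest and `k` largest values before averaging), which may differ from the variant described earlier in the course.

```python
import statistics

# Hypothetical sample with one unusually large value.
data = [2, 3, 3, 5, 8, 13, 40]

# Measure of replication: the count n.
n = len(data)

# Measures of centrality.
mean = statistics.mean(data)
geo_mean = statistics.geometric_mean(data)  # requires Python 3.8+ and positive values
median = statistics.median(data)
mode = statistics.mode(data)


def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values (assumed definition)."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


trimmed = trimmed_mean(data, 1)

# Measure of range: highest value minus lowest value.
value_range = max(data) - min(data)

print(n, mean, geo_mean, median, mode, trimmed, value_range)
```

In this sample the single large value pulls the arithmetic mean well above the median, while the trimmed mean stays closer to the bulk of the data.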