Data assets and data economy

With efficient modeling, we can extract value from data. The "data chain" is a value-added chain, transforming data from raw material into products and services. People have recognized the potential value of data and draw analogies between data and raw materials, commodities, or assets. Let's look at these analogies:

Data as raw material

Data is commonly viewed as a new type of raw material and resource. This is natural because, as explained in the "data chain", data needs to be processed to make it usable and to mine the value within it. The so-called "data economy" is built on this mined value. Some even believe data is becoming the most important resource of the future, since it is the fuel of AI, and AI will be a key driver of productivity growth.

However, data has some special characteristics compared with traditional raw materials like oil. First, using data does not exhaust it. Second, since it is not exhaustible, it can be shared by many users rather than used exclusively. Third, data is highly heterogeneous: each piece of data can have its own unique value.

Data as commodity

Some people view data as a kind of commodity, since we must expend effort to create or produce it. However, I argue that data itself is not a commodity. The most important reason is that data cannot be used directly by end users: data is typically "consumed" by models, which in turn provide services and products. A model built on data can be viewed as a commodity, but the same cannot be said of the standalone data.

In addition, there are other features of data that distinguish it from commodities. First, the replication cost of data is nearly zero. Second, the marginal utility of data is unstable.

Data as asset

Data can be viewed as a new type of asset, since 1) it can create profit for its owner; and 2) it can be controlled by its owner. Data assets are becoming essential to the modern economy, enabling a range of new business models. However, compared with traditional assets, data has its own unique characteristics: 1) the expected revenue from data is difficult to predict in many cases, since we can only understand its value after building models with it; and 2) the ownership of data is difficult to protect and transfer.

These characteristics make it challenging to price and trade data. Since its value can hardly be understood before modeling, buyers will be reluctant to pay a high price. And because data can be replicated at no cost, once it is traded, its ownership is not actually transferred to the new owner; the set of owners merely grows.

I argue that these issues around data pricing and trading must be resolved to build a sustainable data economy. Otherwise, "data economy" is just a buzzword.

Privacy protection and data pricing

Since data is now an important raw material, or even an asset, its ownership should be seriously protected. In recent years, data privacy has gained intense public attention, and I argue it will become an even more central problem for society. As we enter the AI era, more wealth will be created by AI and by the data assets behind it. Therefore, the current paradigm, in which personal data is collected, utilized, and monetized by big companies, will create great wealth imbalance and polarization. This is actually happening at the moment.

Therefore, it is important to establish a new paradigm in which personal data is protected and reasonably priced, so that its owners can share in the revenue of AI. At the same time, such a paradigm must not prevent the use of data in AI.

To achieve this, there are several technologies we can adopt:

  1. Federated learning [1, 2]: Federated learning is a modeling paradigm in which training is distributed to participating nodes, and the trained local models are aggregated to create a global model. In federated learning, data is kept locally at the participating nodes instead of being collected by a central server for modeling; therefore, it largely eliminates the data privacy issue. Federated learning has become a very active research area in the past couple of years, driven by intense attention to user privacy and the introduction of regulations like GDPR. I identify some key challenges to be solved in current federated learning, including the bottleneck of the central server, non-IID data, etc., which are the main research topics I'm now working on.
  2. Blockchain: Blockchain is a remarkable technology (it's actually a subtle combination of existing technologies). The distributed ledger of blockchain enables collaboration and organization in a decentralized way. More introductions to blockchain (and smart contracts) can be found in further reading. I believe blockchain can facilitate decentralized processing and modeling of users' data, and we are now working on a GitHub project to build a federated learning framework on blockchain.
  3. Data usage pricing: A fair pricing mechanism is crucial to incentivize parties to participate in a collaborative modeling effort. Some methods have emerged from a game-theoretic viewpoint [3, 4, 5]. A practical approach is to price data's usage instead of its ownership. We believe the value of data varies with the application, so the model's utility should be priced first. The main idea is to treat collaborative modeling as a cooperative game, so that the contributions of the parties can be estimated with the Shapley Value (SV) [6].
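The federated learning idea in point 1 can be sketched in a few lines. The following is a minimal single-process simulation of federated averaging, not a real distributed system: the linear model, learning rate, and synthetic client datasets are all illustrative assumptions, but the core loop is faithful to the paradigm, in that only model weights, never raw data, leave each client.

```python
import numpy as np

def local_train(w_global, X, y, lr=0.1, epochs=5):
    """One client's local training: a few gradient steps on its private data.
    A linear regression model is used here purely for illustration."""
    w = w_global.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # MSE gradient
        w -= lr * grad
    return w

def fed_avg(client_data, rounds=20, dim=2):
    """Federated averaging: each round, clients train locally and the
    server averages the returned weights, weighted by sample counts."""
    w_global = np.zeros(dim)
    for _ in range(rounds):
        local_ws, sizes = [], []
        for X, y in client_data:   # raw data never leaves the client
            local_ws.append(local_train(w_global, X, y))
            sizes.append(len(y))
        w_global = np.average(local_ws, axis=0, weights=sizes)
    return w_global

# Synthetic example: three clients whose data share the relation y = 3*x0 - 2*x1.
rng = np.random.default_rng(0)
true_w = np.array([3.0, -2.0])
clients = []
for n in (50, 80, 120):            # non-uniform client sizes
    X = rng.normal(size=(n, 2))
    clients.append((X, X @ true_w))

w = fed_avg(clients)
print(w)  # recovers weights close to [3, -2]
```

In a real deployment the `local_train` step runs on each participant's device and only the weight vectors travel over the network, which is exactly why the central server never sees the raw data.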
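The Shapley Value idea in point 3 can be made concrete with a toy cooperative game. In this sketch, the "utility" of a coalition is the accuracy of a model trained on the combined data of its members; the accuracy numbers below are invented for illustration, not real measurements.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, utility):
    """Exact Shapley value: each player's marginal contribution to the
    utility, averaged over all orders in which the coalition can form."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[p] += weight * (utility(s | {p}) - utility(s))
    return phi

# Hypothetical utility: accuracy of a model trained on each coalition of
# data owners A, B, C (values are illustrative only).
acc = {
    frozenset(): 0.0,
    frozenset("A"): 0.60, frozenset("B"): 0.55, frozenset("C"): 0.50,
    frozenset("AB"): 0.80, frozenset("AC"): 0.75, frozenset("BC"): 0.70,
    frozenset("ABC"): 0.90,
}
phi = shapley_values(["A", "B", "C"], lambda s: acc[frozenset(s)])
print(phi)  # φ_A = 0.35, φ_B = 0.30, φ_C = 0.25 (up to floating point)
```

Note that the three values sum to 0.90, the utility of the full coalition, so the model's value is divided among the data owners in proportion to their average marginal contribution, which is the fairness property that makes SV attractive for data pricing.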

Further reading