What is a data chain?

With increasing amounts of data and more advanced computation and modeling techniques, research and practice in PA (and other social sciences) will become more "data-driven". In a data-driven approach, data becomes the central factor: the work is organized around it, and I identify this new workflow as the "data chain". As data flows through the data chain, it is transformed from a raw resource into applications with utility; it is also a value-adding process for the data. I will introduce some useful methods for each step of this data chain.

Please note that although I call it a chain, it is actually an iterative process, with 5 possible places for inner loops, marked A-E in the figure.

Figure: The data chain

Data collection

Data collection is the starting point of the data chain: a data-driven project needs data first. Although some believe data is abundant nowadays, it still takes some (or even great) effort to collect the data for a data-driven application or piece of research. Below are some common ways of collecting data:

  1. User survey and questionnaire: This is the most common way of collecting data in PA and other social sciences. The advantages are obvious: it is direct (you can ask for exactly what you want to know), easy to conduct, and subjective (so you can measure people's opinions, which is difficult with other methods). However, this kind of data collection is easily biased by the sampling method, the way questions are phrased, etc. You need to try to avoid these biases before applying a user survey or questionnaire. I also suggest using new forms of user survey for specific cases and contexts, such as the case studies I introduced in my lecture.
  2. Open data: The volume of open data from governments, organizations, and companies is increasing rapidly. It is easy to use and, in most cases, free of charge. Good examples include the London Data Store, World Bank Open Data, and Kaggle. The typical problems with open data include inconsistent data quality, lack of maintenance, etc. And I believe a big issue with open data is that it covers only data that is free to distribute, while a lot of valuable and sensitive data cannot be shared this way.
  3. Existing data: On many occasions, the data you need already exists. Therefore, instead of collecting new data, you had better try to find this existing data first. For example, a lot of data has already been accumulated inside governments (especially on platforms like "City Brain") and enterprises.
  4. Sensors: Sensors are physical devices that measure the state of an observable target. There are various types of sensors nowadays, such as cameras on the streets and the sensors in your mobile phone. Although sensors are a straightforward way to get data, they have obvious constraints. Firstly, sensors cost money (even if a single sensor is inexpensive, you normally need to purchase many of them to form a network). Secondly, sensors, whether deployed in public or private places, need to be approved by either the government or the users, and we need to make sure they do not violate privacy. Therefore, as I recommend in my lecture, it is often better to simulate "virtual sensors" to observe your target, without physical investment and deployment.
  5. Crowdsourcing: When it is too time-consuming to collect all the data yourself, you can try asking crowd users to contribute their data to you. A good example is PatientsLikeMe. Normally, crowdsourcing has a cost (e.g. Amazon Mechanical Turk). However, you can design game mechanisms that encourage users to contribute their data without being paid.
  6. Web: The web can be viewed as the biggest database in the world. However, this database is heterogeneous and decentralized. It is not difficult to crawl web content with tools like Scrapy (a minimal spider sketch follows this list), but you might spend much more time on data cleaning, structuring, fusion, and other tedious work. Another challenge is getting data from social networking sites, such as Twitter and Facebook, which are becoming more closed these days (although you can get some data from them through their APIs).
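To make the crawling point concrete, here is a minimal Scrapy spider sketch. The listing URL, CSS selectors, and field names are hypothetical placeholders rather than a real site's structure; adapt them to whatever you actually want to crawl.

```python
# A minimal Scrapy spider sketch. The URL and CSS selectors below are
# hypothetical placeholders; adapt them to the site you want to crawl.
import scrapy

class DatasetSpider(scrapy.Spider):
    name = "datasets"
    start_urls = ["https://example.com/datasets"]  # hypothetical listing page

    def parse(self, response):
        # Yield one record per dataset entry found on the page.
        for entry in response.css("div.dataset"):
            yield {
                "title": entry.css("h2::text").get(),
                "url": entry.css("a::attr(href)").get(),
            }
        # Follow the pagination link, if any, and parse it the same way.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Running it with `scrapy runspider spider.py -o datasets.json` would dump the scraped records to a JSON file; the real effort, as noted above, usually goes into cleaning and structuring what comes back.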

Data management

With the data collected, the next step is to store and manage it. To store data, we use files and databases. Files are flexible and are suitable for storing multimedia data (images, audio, video, etc.) and unstructured data (logs, emails, etc.). If the data is structured, it is better to use a database: a database is normally more flexible and reliable for managing structured data than an Excel file.

A database is a system (or a "container") for organizing data. Traditionally, database systems have been dominated by SQL databases, which use SQL (Structured Query Language) to manipulate the data (insert, update, delete, and query) in the database. Another characteristic of a SQL database is its conceptual structure of data collections, called the "data schema". You can read this tutorial for an introduction to SQL databases.
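As a minimal sketch of these operations, the snippet below uses Python's built-in sqlite3 module; the "players" table and its contents are made-up examples, not data from the lecture.

```python
# SQL basics demonstrated with Python's built-in sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")  # a throwaway in-memory database
cur = conn.cursor()

# The schema must be designed before any data is inserted.
cur.execute("CREATE TABLE players (name TEXT, club TEXT, goals INTEGER)")

# Insert, update, and query data with SQL statements.
cur.execute("INSERT INTO players VALUES (?, ?, ?)", ("A. Player", "FC Example", 12))
cur.execute("UPDATE players SET goals = goals + 1 WHERE name = ?", ("A. Player",))
for row in cur.execute("SELECT name, goals FROM players"):
    print(row)  # -> ('A. Player', 13)

conn.close()
```

Note that every row has to fit the schema declared in CREATE TABLE; that rigidity is exactly the limitation discussed next.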

Although SQL databases are good at organizing structured data, they have their limitations. I think the most serious one is that they do not support heterogeneous data very well, which big data analysis and modeling require. A SQL database needs a well-designed data schema before data collection begins, and a traditional table in a SQL database cannot store records with different schemas.

To overcome these limitations, a group of new database systems (such as MongoDB, HBase, and CouchDB) has emerged under the name "NoSQL databases". These databases are more flexible (with fewer constraints on the data schema, or even no schema at all) and are more suitable for certain applications (e.g. Neo4j is designed for graph data).

I recommend trying these NoSQL databases if your data is heterogeneous. One popular choice is MongoDB, an easy-to-use and flexible document-based database.

In MongoDB, a piece of data is stored as a JSON document. For example, a JSON document like the following could represent a football player:
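```json
{
    "name": "A. Player",
    "club": "FC Example",
    "position": "forward",
    "goals": 13,
    "honours": ["League Cup 2023"]
}
```

(The field names and values above are invented for illustration.) A minimal sketch of storing and retrieving such a document with the pymongo driver, assuming a local MongoDB server and arbitrary database and collection names, might look like this:

```python
# A minimal pymongo sketch; assumes a MongoDB server on localhost and
# uses arbitrary database/collection names.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
players = client["demo"]["players"]

# Documents in the same collection may have different schemas.
players.insert_one({"name": "A. Player", "club": "FC Example", "goals": 13})
players.insert_one({"name": "B. Player", "position": "goalkeeper"})

print(players.find_one({"name": "A. Player"}))
```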