Data Lake | Karnataka Data Lake

This component processes datasets and performs operations such as data cleaning, semantic resolution, and canonicalization. A set of scripts ingests a wide variety of data (structured, semi-structured, unstructured, and binary) into a data lake from disparate open data sources. The lake has a flat architecture, in which every data element is given a unique identifier and tagged with a set of metadata.
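The flat architecture described above can be illustrated with a minimal in-memory sketch. It is not the project's implementation: the class names `DataElement` and `FlatLake`, the tag keys `source` and `format`, and the sample payloads are all hypothetical, chosen only to show how a unique identifier plus metadata tags can replace a folder hierarchy.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class DataElement:
    """One item in the lake: a raw payload plus descriptive metadata tags."""
    payload: bytes
    metadata: dict
    # Assumption: a random UUID is used as the unique identifier.
    element_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class FlatLake:
    """Flat store: no hierarchy, elements are found by id or by metadata tag."""

    def __init__(self):
        self._elements = {}

    def put(self, payload, **metadata):
        """Store a payload with its tags and return the new identifier."""
        element = DataElement(payload=payload, metadata=dict(metadata))
        self._elements[element.element_id] = element
        return element.element_id

    def get(self, element_id):
        return self._elements[element_id]

    def find(self, **tags):
        """Return ids of elements whose metadata matches every given tag."""
        return [
            eid for eid, el in self._elements.items()
            if all(el.metadata.get(k) == v for k, v in tags.items())
        ]


lake = FlatLake()
# Made-up sample payloads, for illustration only.
csv_id = lake.put(b"district,value\nA,1\n", source="OGD", format="csv")
pdf_id = lake.put(b"%PDF-1.4 ...", source="OGD", format="pdf")
print(lake.find(format="csv") == [csv_id])
```

Because lookup goes through metadata rather than a path, structured and binary items can sit side by side in the same store.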

The DAS comprises three operational layers, which are described as follows.
The ingestion layer stores raw data entering the system. Data is ingested via connectors from various open data sources, such as the Open Government Data (OGD) Platform India (https://data.gov.in), the e-National Agriculture Market (ENAM: https://enam.gov.in/), the Central Control Room for Air Quality Management (NAQI: https://app.cpcbccr.com/AQI_India/), and the United Nations Statistics Portal (http://data.un.org/). Several ingestion approaches are supported, including batch uploads, real-time subscriptions, and one-time loading of datasets. The layer can also apply schema or metadata information to the incoming data.
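As a rough sketch of the ingestion behaviour described above, the following shows batch ingestion of CSV text and record-at-a-time ingestion (standing in for a real-time subscription), with an optional schema applied on the way in. The class `IngestionLayer`, the helper `apply_schema`, the field names, and the sample values are assumptions made for illustration; the actual connectors are not described in the source.

```python
import csv
import io


def apply_schema(row, schema):
    """Cast each field named in the schema; other fields stay raw strings."""
    return {key: schema.get(key, str)(value) for key, value in row.items()}


class IngestionLayer:
    """Hypothetical raw store that records where and how each record arrived."""

    def __init__(self):
        self.raw = []

    def ingest_record(self, source, row, schema=None, mode="real-time"):
        """Store a single record, optionally typed by a schema."""
        record = apply_schema(row, schema) if schema else dict(row)
        self.raw.append({"source": source, "mode": mode, "record": record})

    def ingest_batch(self, source, csv_text, schema=None):
        """Store every row of a CSV document and return the row count."""
        count = 0
        for row in csv.DictReader(io.StringIO(csv_text)):
            self.ingest_record(source, row, schema, mode="batch")
            count += 1
        return count


layer = IngestionLayer()
# Made-up sample rows; the schema maps a column name to a casting function.
n = layer.ingest_batch(
    "ENAM", "commodity,price\nragi,3200\nmaize,2100\n", schema={"price": int}
)
layer.ingest_record("NAQI", {"station": "S1", "aqi": "87"})
print(n, len(layer.raw))
```

Keeping the schema optional mirrors the text: raw data can be stored untouched and typed later, or typed at ingestion when the structure is already known.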

The caching layer temporarily or permanently stores processed or pre-processed data and materialized views. The data in this layer is either ready for visualization and consumption by external systems or prepared for further processing. Applications residing in the processing layer take data from the ingestion layer, process and structure it, and store it back in the data lake.

The processing layer, or consumption layer, offers one or more platforms for distributed processing and analysis of large data sets. It can access data stored in both the ingestion and caching layers. The pre-processed data is then pushed to later stages for data-driven, model-driven, and knowledge-driven analysis.

One of the main functions of the DAS is to curate the data so that it is accessible to downstream operations. Incoming data comes in many varieties, including tabular data, natural language text, social media posts, tweets, audio recordings, and videos. First, all multimedia datasets are converted to text by semi-automatic transcription using available tools. Once all forms of data have been converted to text, they are cleaned and represented in one of two forms: a collection of tables or a knowledge graph.

A set of scripts for semantic resolution is also used to identify semantically meaningful entities in the data by resolving them against ontologies from the Linked Open Data (LOD) cloud (https://lod-cloud.net/).
An application-wide knowledge graph is also built by connecting entities to other entities through labeled, directed edges and characterizing each entity by its attributes. This process is called entity twinning. Each node in the knowledge graph represents an entity of interest, such as a district, village, crop, or industry, and is characterized by the attributes found in the ingested data. Each entity is also associated with a set of tables in which it appears, a set of models in which it participates, and a set of data stories about it.
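The knowledge graph structure described above can be sketched as follows. This is only one plausible reading of it: the names `Entity` and `KnowledgeGraph`, the edge label `grows`, the table name `crop_production`, and the sample entities are hypothetical and do not come from the project.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A node: an entity of interest with attributes and linked artefacts."""
    name: str
    kind: str  # e.g. "district", "village", "crop", "industry"
    attributes: dict = field(default_factory=dict)
    tables: set = field(default_factory=set)   # tables where it appears
    models: set = field(default_factory=set)   # models it participates in
    stories: set = field(default_factory=set)  # data stories about it


class KnowledgeGraph:
    def __init__(self):
        self.entities = {}
        self.edges = set()  # labeled, directed edges: (subject, label, object)

    def add_entity(self, name, kind, **attributes):
        """Create the entity if needed and merge in any new attributes."""
        entity = self.entities.setdefault(name, Entity(name, kind))
        entity.attributes.update(attributes)
        return entity

    def connect(self, subject, label, obj):
        """Add a directed edge between two entities that already exist."""
        if subject not in self.entities or obj not in self.entities:
            raise KeyError("both endpoints must be added before connecting")
        self.edges.add((subject, label, obj))

    def neighbours(self, subject, label=None):
        """Names reachable from subject, optionally filtered by edge label."""
        return sorted(
            o for s, l, o in self.edges
            if s == subject and (label is None or l == label)
        )


kg = KnowledgeGraph()
# Illustrative entities and links only, not taken from the ingested data.
mandya = kg.add_entity("Mandya", "district", state="Karnataka")
kg.add_entity("Sugarcane", "crop")
kg.connect("Mandya", "grows", "Sugarcane")
mandya.tables.add("crop_production")
print(kg.neighbours("Mandya", "grows"))
```

Edges are stored as triples so that direction is preserved: following `grows` from the district reaches the crop, but not the reverse.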