Analytics From Anonymous Profiles
MOSIP Analytics - Populate Metrics
MOSIP currently has an Analytics Framework, described at https://github.com/mosip/reporting/tree/develop.
Only the real-time data processing is in the scope of this project; we will enhance that part of the framework.
MOSIP pumps anonymous profile data into a PostgreSQL database, and Debezium streams the resulting change events into a Kafka broker.
A custom Spark Streaming job needs to be developed to process this stream of profile data, compute various metrics, and push them into Elasticsearch.
Part 1 - Set Up the Infrastructure
1.1 Set up the following Dockerized components and configure them:
PostgreSQL
Enable WAL (logical decoding), which Debezium requires to capture changes
Create a table for storing anonymous profiles as a JSON string (a minimal sketch is given after this list)
Debezium Kafka Connector
Configure it to listen to data changes in the anonymous profiles table
Publish the change events to a new topic (Profiles); a registration sketch is given under 1.2 below
Apache Kafka
Elasticsearch
Python 3.x
Spark Streaming cluster
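For reference, a minimal sketch of the PostgreSQL preparation, assuming a local instance and hypothetical names (database mosip_analytics, table anonymous_profile with a TEXT profile column); the actual deployment will differ:

import psycopg2

conn = psycopg2.connect(host="localhost", port=5432, dbname="mosip_analytics",
                        user="postgres", password="postgres")
conn.autocommit = True
with conn.cursor() as cur:
    # Debezium's logical decoding needs wal_level = logical; setting it is done in
    # postgresql.conf (or via ALTER SYSTEM) and needs a restart, so we only verify here.
    cur.execute("SHOW wal_level;")
    print("wal_level =", cur.fetchone()[0])

    # Table holding each anonymous profile as a JSON string (Annexure A format).
    cur.execute("""
        CREATE TABLE IF NOT EXISTS anonymous_profile (
            id         BIGSERIAL PRIMARY KEY,
            profile    TEXT NOT NULL,
            cr_dtimes  TIMESTAMP DEFAULT now()
        );
    """)
conn.close()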
1.2 Prepare a script that automates the setup of all the required components and configuration
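As part of this setup script, the Debezium connector from 1.1 can be registered through the Kafka Connect REST API once Kafka Connect is up. A hedged sketch, assuming Debezium 1.x configuration keys (2.x uses topic.prefix instead of database.server.name) and illustrative hostnames and credentials:

import requests

connector = {
    "name": "anonymous-profile-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "postgres",
        "database.password": "postgres",
        "database.dbname": "mosip_analytics",
        "database.server.name": "mosip",
        "table.include.list": "public.anonymous_profile",
        # Re-route change events from the default mosip.public.anonymous_profile
        # topic to a single "Profiles" topic.
        "transforms": "route",
        "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
        "transforms.route.regex": ".*anonymous_profile",
        "transforms.route.replacement": "Profiles",
    },
}

resp = requests.post("http://localhost:8083/connectors", json=connector, timeout=30)
resp.raise_for_status()
print(resp.json())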
1.3 Create a few data records in the JSON format specified in Annexure A and insert them into the above table
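A minimal sketch of inserting one sample record, assuming the anonymous_profile table above; the sample only loosely follows the Annexure A structure shown in 2.1 (an update record would also carry the "old" block):

import json
import psycopg2

sample = {
    "processName": "new",
    "new": {
        "yearOfBirth": "1994",
        "gender": "Female",
        "location": ["Region-A", "Province-B", "City-C"],
        "preferredLanguages": ["eng"],
        "channel": [{"hashedchannel": "<sha256-of-normalized-phone>", "name": "phone"}],
        "exceptions": [],
        "verified": ["phone"],
    },
}

conn = psycopg2.connect(host="localhost", port=5432, dbname="mosip_analytics",
                        user="postgres", password="postgres")
with conn, conn.cursor() as cur:
    cur.execute("INSERT INTO anonymous_profile (profile) VALUES (%s)",
                (json.dumps(sample),))
conn.close()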
Part 2 - Build Spark Jobs in Python to Process Metrics from Stream Data
Identify possible analytical metrics that could be derived from the given profile data schema
Develop Spark jobs to populate these metrics as indices in Elasticsearch
Also develop a job to populate the raw data as a separate index (a minimal sketch follows this list)
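A minimal sketch of the raw-data job, assuming Debezium publishes plain JSON envelopes (no schemas) to the Profiles topic so that payload.after.profile holds the Annexure A JSON string; the index name anonymous_profile_raw and all host names are illustrative:

# Run with the Kafka source package, e.g.:
#   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:<spark-version> raw_job.py
import json
from elasticsearch import Elasticsearch, helpers
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, get_json_object

spark = SparkSession.builder.appName("anonymous-profile-raw").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "Profiles")
       .option("startingOffsets", "earliest")
       .load()
       # Extract the profile JSON string from the Debezium envelope.
       .select(get_json_object(col("value").cast("string"),
                               "$.payload.after.profile").alias("profile"))
       .filter(col("profile").isNotNull()))

def index_batch(batch_df, batch_id):
    # Runs on the driver for every micro-batch; fine for a sketch, a production
    # job would use the es-hadoop Spark connector instead of collect().
    docs = [row["profile"] for row in batch_df.collect()]
    if docs:
        es = Elasticsearch("http://localhost:9200")
        helpers.bulk(es, ({"_index": "anonymous_profile_raw",
                           "_source": json.loads(d)} for d in docs))

(raw.writeStream
 .foreachBatch(index_batch)
 .option("checkpointLocation", "/tmp/chk/anonymous_profile_raw")
 .start()
 .awaitTermination())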
2.1 Identified metrics
"AnonymousStoredProfile": {
"processName": "", //process as new or update. Correction is not included here
"old": {
"yearOfBirth": "", //Only the year of birth is kept.
"gender": "", // Confidential, Female, Male, Transgender, ...
"location": [""], //hiearchy maintained as per the array. JSON array remembers the order
"preferredLanguages": [""], // list of preferred languages
"channel":[
{
"hashedchannel": "hashed phone or email ",//Please note all values should be hashed after normalization
"name": "channel name eg: phone"
}
"exceptions": [""], // list of exceptions
"verified":[""] // list of all the verified id schema atribute names
},
"new": {
"yearOfBirth": "",
"gender": "", // Confidential, Female, Male, Transgender, ...
"location": [""], //hiearchy maintained as per the array. JSON array remembers the order
"preferredLanguages": [""], // list of preferred languages
"channel":[
{
"hashedchannel": "hashed phone or email ",//Please note all values should be hashed after normalization
"name": "channel name eg: phone"
}
] , // Used for computing how many have this number
"exceptions": [""], // list of exceptions
"verified": [""] // list of all the verified id schema atribute names
}
}
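Given the schema above, one possible metric is the count of profiles by process, gender and year of birth. A hedged sketch of such a job, reusing the assumptions of the raw-data sketch and writing to a hypothetical profile_metrics index:

from elasticsearch import Elasticsearch, helpers
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, get_json_object

spark = SparkSession.builder.appName("anonymous-profile-metrics").getOrCreate()

profiles = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", "Profiles")
            .load()
            .select(get_json_object(col("value").cast("string"),
                                    "$.payload.after.profile").alias("profile"))
            .filter(col("profile").isNotNull())
            # Adjust the JSON paths (e.g. prefix with $.AnonymousStoredProfile)
            # to match how the profile is actually stored.
            .select(
                get_json_object(col("profile"), "$.processName").alias("process"),
                get_json_object(col("profile"), "$.new.gender").alias("gender"),
                get_json_object(col("profile"), "$.new.yearOfBirth").alias("yearOfBirth")))

counts = profiles.groupBy("process", "gender", "yearOfBirth").count()

def write_metrics(batch_df, batch_id):
    es = Elasticsearch("http://localhost:9200")
    actions = ({"_index": "profile_metrics",
                "_id": f'{r["process"]}-{r["gender"]}-{r["yearOfBirth"]}',
                "_source": r.asDict()} for r in batch_df.collect())
    helpers.bulk(es, actions)

(counts.writeStream
 .outputMode("complete")  # emit the full, updated counts on every micro-batch
 .foreachBatch(write_metrics)
 .option("checkpointLocation", "/tmp/chk/profile_metrics")
 .start()
 .awaitTermination())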