Analytics From Anonymous Profiles
MOSIP Analytics - Populate Metrics
MOSIP currently has an Analytics Framework, described at https://github.com/mosip/reporting/tree/develop.
Only the real-time data processing is in the scope of this project; we will enhance that part of the framework.
MOSIP pumps anonymous profile data into a PostgreSQL database, and Debezium streams the resulting change events into a Kafka broker.
A custom Spark Streaming job needs to be developed to process this stream of profile data, compute various metrics, and push them into Elasticsearch.
Part 1 - Set Up the Infrastructure
1.1 Set up the following Dockerized components and configure them:
PostgreSQL
Enable WAL (logical decoding), which Debezium requires to capture changes
Create a table for storing anonymous profiles as a JSON string (a minimal sketch is given after this list)
Debezium Kafka Connector
Configure it to listen to data changes in the anonymous profiles table
Publish the change events to a new topic (Profiles); a registration sketch is given under 1.2 below
Apache Kafka
Elasticsearch
Python 3.x
Spark Streaming cluster
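For reference, a minimal sketch of the PostgreSQL preparation, assuming a local instance and hypothetical names (database mosip_analytics, table anonymous_profile with a TEXT profile column); the actual deployment will differ:

import psycopg2

conn = psycopg2.connect(host="localhost", port=5432, dbname="mosip_analytics",
                        user="postgres", password="postgres")
conn.autocommit = True
with conn.cursor() as cur:
    # Debezium's logical decoding needs wal_level = logical; setting it is done in
    # postgresql.conf (or via ALTER SYSTEM) and needs a restart, so we only verify here.
    cur.execute("SHOW wal_level;")
    print("wal_level =", cur.fetchone()[0])

    # Table holding each anonymous profile as a JSON string (Annexure A format).
    cur.execute("""
        CREATE TABLE IF NOT EXISTS anonymous_profile (
            id         BIGSERIAL PRIMARY KEY,
            profile    TEXT NOT NULL,
            cr_dtimes  TIMESTAMP DEFAULT now()
        );
    """)
conn.close()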
1.2 Prepare a script that automates the setup of all the required components and configuration
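As part of this setup script, the Debezium connector from 1.1 can be registered through the Kafka Connect REST API once Kafka Connect is up. A hedged sketch, assuming Debezium 1.x configuration keys (2.x uses topic.prefix instead of database.server.name) and illustrative hostnames and credentials:

import requests

connector = {
    "name": "anonymous-profile-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "postgres",
        "database.password": "postgres",
        "database.dbname": "mosip_analytics",
        "database.server.name": "mosip",
        "table.include.list": "public.anonymous_profile",
        # Re-route change events from the default mosip.public.anonymous_profile
        # topic to a single "Profiles" topic.
        "transforms": "route",
        "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
        "transforms.route.regex": ".*anonymous_profile",
        "transforms.route.replacement": "Profiles",
    },
}

resp = requests.post("http://localhost:8083/connectors", json=connector, timeout=30)
resp.raise_for_status()
print(resp.json())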
1.3 Create a few data records in the JSON format specified in Annexure A and insert them into the above table
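A minimal sketch of inserting one sample record, assuming the anonymous_profile table above; the sample only loosely follows the Annexure A structure shown in 2.1 (an update record would also carry the "old" block):

import json
import psycopg2

sample = {
    "processName": "new",
    "new": {
        "yearOfBirth": "1994",
        "gender": "Female",
        "location": ["Region-A", "Province-B", "City-C"],
        "preferredLanguages": ["eng"],
        "channel": [{"hashedchannel": "<sha256-of-normalized-phone>", "name": "phone"}],
        "exceptions": [],
        "verified": ["phone"],
    },
}

conn = psycopg2.connect(host="localhost", port=5432, dbname="mosip_analytics",
                        user="postgres", password="postgres")
with conn, conn.cursor() as cur:
    cur.execute("INSERT INTO anonymous_profile (profile) VALUES (%s)",
                (json.dumps(sample),))
conn.close()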
Part 2 - Build Spark Jobs in Python to Process Metrics from Stream Data
Identify possible analytical metrics that could be derived from the given profile data schema
Develop Spark jobs to populate these metrics as indices in Elasticsearch
Also develop a job to populate the raw data as a separate index (a minimal sketch follows this list)
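A minimal sketch of the raw-data job, assuming Debezium publishes plain JSON envelopes (no schemas) to the Profiles topic so that payload.after.profile holds the Annexure A JSON string; the index name anonymous_profile_raw and all host names are illustrative:

# Run with the Kafka source package, e.g.:
#   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:<spark-version> raw_job.py
import json
from elasticsearch import Elasticsearch, helpers
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, get_json_object

spark = SparkSession.builder.appName("anonymous-profile-raw").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "Profiles")
       .option("startingOffsets", "earliest")
       .load()
       # Extract the profile JSON string from the Debezium envelope.
       .select(get_json_object(col("value").cast("string"),
                               "$.payload.after.profile").alias("profile"))
       .filter(col("profile").isNotNull()))

def index_batch(batch_df, batch_id):
    # Runs on the driver for every micro-batch; fine for a sketch, a production
    # job would use the es-hadoop Spark connector instead of collect().
    docs = [row["profile"] for row in batch_df.collect()]
    if docs:
        es = Elasticsearch("http://localhost:9200")
        helpers.bulk(es, ({"_index": "anonymous_profile_raw",
                           "_source": json.loads(d)} for d in docs))

(raw.writeStream
 .foreachBatch(index_batch)
 .option("checkpointLocation", "/tmp/chk/anonymous_profile_raw")
 .start()
 .awaitTermination())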
2.1 Identified metrics
"AnonymousStoredProfile": {
"processName": "", //process as new or update. Correction is not included here
"old": {
"yearOfBirth": "", //Only the year of birth is kept.
"gender": "", // Confidential, Female, Male, Transgender, ...
"location": [""], //hiearchy maintained as per the array. JSON array remembers the order
"preferredLanguages": [""], // list of preferred languages
"channel":[
{
"hashedchannel": "hashed phone or email ",//Please note all values should be hashed after normalization
"name": "channel name eg: phone"
}
"exceptions": [""], // list of exceptions
"verified":[""] // list of all the verified id schema atribute names
},
"new": {
"yearOfBirth": "",
"gender": "", // Confidential, Female, Male, Transgender, ...
"location": [""], //hiearchy maintained as per the array. JSON array remembers the order
"preferredLanguages": [""], // list of preferred languages
"channel":[
{
"hashedchannel": "hashed phone or email ",//Please note all values should be hashed after normalization
"name": "channel name eg: phone"
}
] , // Used for computing how many have this number
"exceptions": [""], // list of exceptions
"verified": [""] // list of all the verified id schema atribute names
}
}
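Given the schema above, one possible metric is the count of profiles by process, gender and year of birth. A hedged sketch of such a job, reusing the assumptions of the raw-data sketch and writing to a hypothetical profile_metrics index:

from elasticsearch import Elasticsearch, helpers
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, get_json_object

spark = SparkSession.builder.appName("anonymous-profile-metrics").getOrCreate()

profiles = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", "Profiles")
            .load()
            .select(get_json_object(col("value").cast("string"),
                                    "$.payload.after.profile").alias("profile"))
            .filter(col("profile").isNotNull())
            # Adjust the JSON paths (e.g. prefix with $.AnonymousStoredProfile)
            # to match how the profile is actually stored.
            .select(
                get_json_object(col("profile"), "$.processName").alias("process"),
                get_json_object(col("profile"), "$.new.gender").alias("gender"),
                get_json_object(col("profile"), "$.new.yearOfBirth").alias("yearOfBirth")))

counts = profiles.groupBy("process", "gender", "yearOfBirth").count()

def write_metrics(batch_df, batch_id):
    es = Elasticsearch("http://localhost:9200")
    actions = ({"_index": "profile_metrics",
                "_id": f'{r["process"]}-{r["gender"]}-{r["yearOfBirth"]}',
                "_source": r.asDict()} for r in batch_df.collect())
    helpers.bulk(es, actions)

(counts.writeStream
 .outputMode("complete")  # emit the full, updated counts on every micro-batch
 .foreachBatch(write_metrics)
 .option("checkpointLocation", "/tmp/chk/profile_metrics")
 .start()
 .awaitTermination())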