Analytics From Anonymous Profiles


MOSIP Analytics - Populate Metrics

MOSIP currently has an Analytics Framework as described in https://github.com/mosip/reporting/tree/develop

 

Only the real-time data processing is in scope for this project; we will enhance that part of the framework.

MOSIP will push anonymous profile data into a PostgreSQL database, and Debezium will stream these changes into a Kafka broker.

A custom Spark Streaming job needs to be developed to process this stream of profile data, generate various metrics, and push them into Elasticsearch.

 

Part 1 - Setup Infrastructure

 

1.1 Set up the following dockerized components and configure them

 

  • PostgreSQL

    • Enable WAL with logical decoding (wal_level=logical), which Debezium requires

    • Create a table for storing anonymous profiles as a JSON string (see the sketch after this list)

  • Debezium Kafka-Connector

    • Configure it to listen for data changes in the anonymous profiles table

    • Publish to a new topic (Profiles)

  • Apache Kafka

  • Elasticsearch

  • Python 3.x

  • Spark Streaming cluster
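
For illustration, a minimal sketch of the PostgreSQL and Debezium pieces in Python is given below. The database name, table name, credentials, and the Kafka Connect REST endpoint (localhost:8083) are assumptions for a local docker-compose environment and must be adjusted to the actual deployment; the connector property names shown are for Debezium 2.x (older releases use database.server.name instead of topic.prefix).

import json
import psycopg2
import requests

# Table that will hold each anonymous profile as a JSON string.
# Table and column names here are illustrative, not mandated by MOSIP.
ddl = """
CREATE TABLE IF NOT EXISTS anonymous_profile (
    id         BIGSERIAL PRIMARY KEY,
    profile    JSONB NOT NULL,
    cr_dtimes  TIMESTAMP DEFAULT now()
);
"""

conn = psycopg2.connect(host="localhost", dbname="mosip_analytics",
                        user="postgres", password="postgres")
with conn, conn.cursor() as cur:
    # Debezium needs logical decoding; this only verifies the setting.
    # wal_level itself is set in postgresql.conf and requires a restart.
    cur.execute("SHOW wal_level;")
    print("wal_level =", cur.fetchone()[0])   # expect 'logical'
    cur.execute(ddl)

# Register a Debezium PostgreSQL connector that captures changes on the
# profile table and publishes them to a Kafka topic. database.hostname is
# the service name seen from inside the Kafka Connect container.
connector = {
    "name": "profiles-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "postgres",
        "database.password": "postgres",
        "database.dbname": "mosip_analytics",
        "table.include.list": "public.anonymous_profile",
        "topic.prefix": "profiles"
    }
}
resp = requests.post("http://localhost:8083/connectors",
                     headers={"Content-Type": "application/json"},
                     data=json.dumps(connector))
print(resp.status_code, resp.text)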

 

 

1.2 Prepare a script that automatically sets up all of these required components and configurations
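
One possible shape for this script, sketched below, is a small Python wrapper that brings the containers up with docker compose, waits for the services to respond, and then applies the DDL and connector registration from 1.1. The compose file contents and the two health-check URLs are assumptions about the local environment.

import subprocess
import time
import requests

def wait_for(url, timeout=120):
    """Poll an HTTP endpoint until it responds or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(url).status_code < 500:
                return
        except requests.ConnectionError:
            pass
        time.sleep(3)
    raise RuntimeError(f"{url} did not come up in time")

# Bring up PostgreSQL, Kafka, Kafka Connect (Debezium) and Elasticsearch;
# docker-compose.yml is assumed to define these services.
subprocess.run(["docker", "compose", "up", "-d"], check=True)

wait_for("http://localhost:8083/connectors")   # Kafka Connect REST API
wait_for("http://localhost:9200")              # Elasticsearch

# Then run the table creation and Debezium connector registration from 1.1.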

 

 

 

1.3 Create a few data records in the JSON format specified in Annexure A and insert them into the above table
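
A minimal sketch of such an insert is shown below, assuming the anonymous_profile table from 1.1. The record values are made up purely for testing, and the channel value is assumed to be already hashed after normalization, as Annexure A requires.

import json
import psycopg2

# Illustrative record following the Annexure A structure ("new" registration,
# so the "old" block is empty).
sample_profile = {
    "processName": "new",
    "old": {},
    "new": {
        "yearOfBirth": "1988",
        "gender": "Female",
        "location": ["Region-1", "Province-2", "City-3"],
        "preferredLanguages": ["eng"],
        "channel": [{"hashedchannel": "b1946ac9...", "name": "phone"}],
        "exceptions": [],
        "verified": ["phone"]
    }
}

conn = psycopg2.connect(host="localhost", dbname="mosip_analytics",
                        user="postgres", password="postgres")
with conn, conn.cursor() as cur:
    cur.execute("INSERT INTO anonymous_profile (profile) VALUES (%s)",
                (json.dumps(sample_profile),))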

 

Part 2 - Build Spark jobs in Python to process metrics from stream data

  • Identify possible analytical metrics which could be derived from the given profile data schema

  • Develop Spark jobs to populate these metrics as indexes in Elasticsearch

  • Also develop a job to populate the raw data as a separate index (a sketch follows this list)
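
As a starting point, the sketch below shows a minimal PySpark Structured Streaming job that reads the Debezium change events from Kafka, parses the profile JSON per Annexure A, and populates a raw-data index in Elasticsearch. The topic name (Debezium's default <topic.prefix>.<schema>.<table> convention), the profile column name, the index name, and the presence of the elasticsearch-spark connector on the Spark classpath are all assumptions about the local setup.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

spark = (SparkSession.builder
         .appName("anonymous-profile-metrics")
         .getOrCreate())

# Shape of one side ("old"/"new") of the anonymous profile from Annexure A.
side = StructType([
    StructField("yearOfBirth", StringType()),
    StructField("gender", StringType()),
    StructField("location", ArrayType(StringType())),
    StructField("preferredLanguages", ArrayType(StringType())),
    StructField("exceptions", ArrayType(StringType())),
    StructField("verified", ArrayType(StringType())),
])
profile_schema = StructType([
    StructField("processName", StringType()),
    StructField("old", side),
    StructField("new", side),
])

# Read the Debezium change events from Kafka.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "profiles.public.anonymous_profile")
       .option("startingOffsets", "earliest")
       .load())

# With the default JSON converter the row image sits under payload.after;
# the 'profile' column holds the anonymous profile as a JSON string.
envelope = StructType([
    StructField("payload", StructType([
        StructField("after", StructType([
            StructField("profile", StringType()),
        ])),
    ])),
])

profiles = (raw
            .select(F.from_json(F.col("value").cast("string"), envelope).alias("e"))
            .select(F.from_json("e.payload.after.profile", profile_schema).alias("p"))
            .select("p.processName", "p.new.*"))

# Populate the raw (flattened) profile data as its own Elasticsearch index.
query = (profiles.writeStream
         .format("org.elasticsearch.spark.sql")
         .option("es.nodes", "localhost")
         .option("es.port", "9200")
         .option("checkpointLocation", "/tmp/chk/anonymous-profile-raw")
         .start("anonymous-profile-raw"))

query.awaitTermination()

The metric jobs (2.1) can follow the same pattern, aggregating the parsed stream (for example counts by gender, year of birth, or location level) before writing each result to its own index.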

 

2.1 Identified metrics

 

 

 



 

3. Annexure A

"AnonymousStoredProfile": {

"processName": "", //process as new or update. Correction is not included here

"old": {

"yearOfBirth": "", //Only the year of birth is kept.

"gender": "", // Confidential, Female, Male, Transgender, ...

"location": [""], //hiearchy maintained as per the array. JSON array remembers the order

"preferredLanguages": [""], // list of preferred languages

"channel":[

{

"hashedchannel": "hashed phone or email ",//Please note all values should be hashed after normalization

"name": "channel name eg: phone"

}

"exceptions": [""], // list of exceptions

"verified":[""] // list of all the verified id schema atribute names

},

"new": {

"yearOfBirth": "",

"gender": "", // Confidential, Female, Male, Transgender, ...

"location": [""], //hiearchy maintained as per the array. JSON array remembers the order

"preferredLanguages": [""], // list of preferred languages

"channel":[

{

"hashedchannel": "hashed phone or email ",//Please note all values should be hashed after normalization

"name": "channel name eg: phone"

}

] , // Used for computing how many have this number

"exceptions": [""], // list of exceptions

"verified": [""] // list of all the verified id schema atribute names

}

}