Performance Workstream

Wednesday, March 11, 2020

Performance Goals:

  • Current HW system achieving a stable 1k TPS, a peak of 5k TPS, and proven horizontal scalability

    1. More instances = more performance in almost linear fashion.

    2. Validate the minimum infrastructure needed to do 1K TPS (financial TPS)

    3. Determine the entry level configuration and cost (AWS and on-premise)

POCs:

Test the impact of a direct replacement of the MySQL DB with a shared in-memory network service like Redis (using the Redlock algorithm if locks are required)
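
A minimal sketch of the single-node Redis locking pattern that Redlock generalizes to multiple nodes, assuming ioredis as the client; the key names, TTLs, and the position-update example are illustrative only, not an agreed design.

```ts
import Redis from "ioredis";
import { randomUUID } from "crypto";

const redis = new Redis(); // assumes a local Redis instance

// Acquire a lock on a resource: NX ensures a single holder,
// PX gives the lock a TTL so a crashed handler cannot block others forever.
async function acquireLock(resource: string, ttlMs: number): Promise<string | null> {
  const token = randomUUID();
  const ok = await redis.set(`lock:${resource}`, token, "PX", ttlMs, "NX");
  return ok === "OK" ? token : null;
}

// Release only if we still own the lock (compare-and-delete via Lua).
const releaseScript = `
  if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
  else
    return 0
  end`;

async function releaseLock(resource: string, token: string): Promise<void> {
  await redis.eval(releaseScript, 1, `lock:${resource}`, token);
}

// Example: serialize position updates for one DFSP without the shared MySQL DB.
async function updatePosition(dfspId: string, amount: number): Promise<void> {
  const token = await acquireLock(`position:${dfspId}`, 5000);
  if (!token) throw new Error(`position for ${dfspId} is locked by another handler`);
  try {
    const current = Number((await redis.get(`position:${dfspId}`)) ?? "0");
    await redis.set(`position:${dfspId}`, String(current + amount));
  } finally {
    await releaseLock(`position:${dfspId}`, token);
  }
}
```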

Test a different method of sharing state, using a lightweight version of event-driven architecture with some CQRS
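
A rough illustration of that direction: commands on the write side only append domain events, and a separate projection folds those events into a read model (here, per-DFSP positions). The event shapes and function names are hypothetical, not taken from the current codebase.

```ts
// Hypothetical domain events for the transfer flow.
type TransferEvent =
  | { type: "TransferPrepared"; transferId: string; payerDfsp: string; amount: number }
  | { type: "TransferFulfilled"; transferId: string; payeeDfsp: string; amount: number };

// Write side: commands validate and append events to a log (Kafka in practice).
const eventLog: TransferEvent[] = [];

function prepareTransfer(transferId: string, payerDfsp: string, amount: number): void {
  // business checks (limits, duplicate detection, expiry) would go here
  eventLog.push({ type: "TransferPrepared", transferId, payerDfsp, amount });
}

function fulfilTransfer(transferId: string, payeeDfsp: string, amount: number): void {
  eventLog.push({ type: "TransferFulfilled", transferId, payeeDfsp, amount });
}

// Read side: a projection rebuilds per-DFSP positions from the events,
// so queries never contend with the write path on a shared DB.
function projectPositions(events: TransferEvent[]): Map<string, number> {
  const positions = new Map<string, number>();
  for (const e of events) {
    if (e.type === "TransferPrepared") {
      positions.set(e.payerDfsp, (positions.get(e.payerDfsp) ?? 0) + e.amount);
    } else {
      positions.set(e.payeeDfsp, (positions.get(e.payeeDfsp) ?? 0) - e.amount);
    }
  }
  return positions;
}

prepareTransfer("t1", "dfsp-a", 100);
fulfilTransfer("t1", "dfsp-b", 100);
console.log(projectPositions(eventLog)); // Map { 'dfsp-a' => 100, 'dfsp-b' => -100 }
```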

Resources:

Action/Follow-up Items:

  • What Kafka metrics (client & server side) should we be reviewing? - Confluent to assist

  • Explore Locking and position settlement - Sybrin to assist

    1. Review Redlock - pessimistic locking vs automatic locking

    2. Remove the shared DB in the middle (automatic locking on Redis)

  • Combine prepare/position handler w/ distributed DB

  • Review the Node.js client and how it impacts Kafka, configuration of Node and, ultimately, the Kafka client - Nakul

  • Turn tracing back on to see how latency and the applications are behaving

  • Ensure the call counts have been rationalized (at a deeper level)

  • Validate the processing times on the handlers and that we are hitting the cache

  • Async patterns in Node (see the sketch after this list)

    1. Missing someone who is excellent on MySQL and Percona

    2. Are we leveraging async correctly?

  • What cache layer are we using (in-memory)?

  • Review the event modeling implementation - identify the domain events

  • Node.js/Kubernetes -

  • Focus on application issues rather than architecture issues

  • Review how we are doing async technology (Node.js - a larger issue); threading models need to be optimized - Nakul
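
A small sketch of the kind of async usage worth reviewing in the items above: awaiting independent I/O calls one by one serializes them, while Promise.all lets the event loop overlap them. fetchParticipant is a placeholder for any independent DB/Redis/HTTP lookup.

```ts
// Placeholder for any independent I/O call (DB read, Redis GET, HTTP lookup).
async function fetchParticipant(id: string): Promise<{ id: string }> {
  return new Promise<{ id: string }>((resolve) => setTimeout(() => resolve({ id }), 50));
}

// Serial: total latency is the sum of the individual calls.
async function loadSerial(ids: string[]): Promise<{ id: string }[]> {
  const results: { id: string }[] = [];
  for (const id of ids) {
    results.push(await fetchParticipant(id)); // each await blocks the next call
  }
  return results;
}

// Concurrent: independent calls are started together and awaited once.
async function loadConcurrent(ids: string[]): Promise<{ id: string }[]> {
  return Promise.all(ids.map((id) => fetchParticipant(id)));
}

async function main(): Promise<void> {
  const ids = ["dfsp-1", "dfsp-2", "dfsp-3"];
  console.time("serial");
  await loadSerial(ids);
  console.timeEnd("serial");     // roughly 3 x 50 ms
  console.time("concurrent");
  await loadConcurrent(ids);
  console.timeEnd("concurrent"); // roughly 50 ms
}

main();
```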

Meeting Notes/Details

History

  1. Technology was put in place with the hope that the design solves an enterprise problem

  2. The community effort did not prioritize making the slices of the system enterprise-grade or cheap to run

  3. OSS technology choices

Goals

  1. Optimize current system

  2. Make it cheaper to run

  3. Make it scalable to 5K TPS

  4. Ensure value added services can effectively and securely access transaction data

Testing Constraints

  1. Only done the golden transfer - transfer leg

  2. Flow of transfer

  3. Simulators (legacy and advanced) - using the legacy one for continuity

  4. Disabled the timeout handler

  5. 8 DFSPs (participant organizations); with more DFSPs we would be able to scale

Process

  1. JMeter initiates the payer request

  2. Legacy simulator receives the fulfil notify callback

  3. Legacy simulator handles payee processing, initiates the fulfilment callback

  4. Record in the positions table for each DFSP

    • a. Partial algorithm where locking is done to reserve the funds, do the calculations, and do the final commits (a sketch of this reserve/commit flow follows this list)

    • b. The position handler is processing one record at a time

  5. A future algorithm would do this in bulk

  • One transfer is handled by one position handler

    • Transfers are all pre-funded

  1. Reduced settlement costs

  2. Can control how fast DFSPs respond to the fulfill request (complete the transfers committed first before handling new requests)
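
Below is a self-contained sketch of the reserve/commit flow described under step 4, with the pre-funded liquidity check happening before the reservation; the in-memory store, the netDebitCap field, and the function names are illustrative assumptions, whereas the real handler works against the positions table.

```ts
// Illustrative in-memory stand-in for one row of the positions table.
interface PositionRow {
  position: number;    // committed net position
  reserved: number;    // value reserved by in-flight transfers
  netDebitCap: number; // pre-funded limit for the DFSP
}

const positions = new Map<string, PositionRow>([
  ["payer-dfsp", { position: 0, reserved: 0, netDebitCap: 1000 }],
]);

// Reserve: lock the payer row and reserve funds only if the pre-funded cap allows it.
function reserve(dfspId: string, amount: number): boolean {
  const row = positions.get(dfspId);
  if (!row) throw new Error(`unknown DFSP ${dfspId}`);
  if (row.position + row.reserved + amount > row.netDebitCap) {
    return false; // would exceed pre-funded liquidity, abort the transfer
  }
  row.reserved += amount;
  return true;
}

// Commit: on fulfilment, move the reserved value into the committed position.
function commit(dfspId: string, amount: number): void {
  const row = positions.get(dfspId)!;
  row.reserved -= amount;
  row.position += amount;
}

// Rollback: on timeout or abort, release the reservation instead of committing.
function rollback(dfspId: string, amount: number): void {
  positions.get(dfspId)!.reserved -= amount;
}

// One transfer handled by one position handler, one record at a time.
if (reserve("payer-dfsp", 100)) {
  commit("payer-dfsp", 100);
}
```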

  • System needs to time out transfers that take longer than 30 seconds

    • Any redesign of the DBs

    • Test Cases

  • Financial transaction

    • End-to-end

    • Prepare-only

    • Fulfil only

  • Individual Mojaloop Characterization

    • Services & Handlers

    • Streaming Arch & Libraries

    • Database

    • What changed: 150 to 300 TPS?

  • How we process the messages

  • Position handler (run in mixed mode, random)

    • Latency Measurement

  1. 5 sec for DB to process, X sec for Kafka to process

  2. How to measure this?
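
One possible answer to "how to measure this" at the handler level: wrap each stage in a monotonic timer and keep simple per-stage statistics so the DB vs Kafka split becomes visible. The stage names and the timed helper below are placeholders, not existing code.

```ts
// Per-stage latency accumulator (count, total, max), keyed by stage name.
const stats = new Map<string, { count: number; totalMs: number; maxMs: number }>();

async function timed<T>(stage: string, fn: () => Promise<T>): Promise<T> {
  const start = process.hrtime.bigint(); // monotonic clock, unaffected by NTP
  try {
    return await fn();
  } finally {
    const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
    const s = stats.get(stage) ?? { count: 0, totalMs: 0, maxMs: 0 };
    s.count += 1;
    s.totalMs += elapsedMs;
    s.maxMs = Math.max(s.maxMs, elapsedMs);
    stats.set(stage, s);
  }
}

// Usage: time the DB write and the Kafka produce separately inside a handler.
async function handlePrepare(): Promise<void> {
  await timed("db-insert", async () => { /* insert into the transfers table */ });
  await timed("kafka-produce", async () => { /* produce the position message */ });
}

// Periodically dump the per-stage averages and maxima.
setInterval(() => {
  for (const [stage, s] of stats) {
    console.log(`${stage}: avg=${(s.totalMs / s.count).toFixed(1)}ms max=${s.maxMs.toFixed(1)}ms n=${s.count}`);
  }
}, 10_000);
```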

Targets

  1. High enough that the system has to function well

  2. Crank the system up to add scale (adding x DFSPs)

  3. Suspicious cases for investigation

  4. Observing contention around the DB

  5. Shared DB, 600 ms without any errors

  • Contention is fully on the DB

  • Bottleneck is the DB (distribute systems so they run independently)

  • 16 databases run end to end

  • GSMA - 500 TPS

  • What is the optimal design?

Contentions

  1. System handler contention

    • Where the system can be scaled

  2. If there are architecture changes that we need to make, we can explore this

    • Consistency for each DFSP

    • Threading of info flows - open question

  3. Skewed results from a single DB for all DFSPs

  4. Challenge is where we get to with additional HW

    • What are the limits of the application design

  5. Financial transfers (in and out of the system)

    • Audit systems

    • Settlement activity

    • Grouping into the DB solves some issues

    • Confluent feedback

  6. Shared DB issues, multiple DBs

  7. Application design level issues

  8. Seen situations where we ran a bunch of simulators/sandboxes

    • Need to rely on tracers and scans once this gets into production

    • Miguel states we should disable tracing for now

Known Issues

  1. Low CPU utilization on the boxes (Node waiting around) - re-optimize code

  2. Processing times increase over time

Optimization

  1. Distributed monolith - PRISM - getting rid of redundant reads

  2. Combine the handlers - Prepare+Position & Fulfil+Position
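
A rough sketch of what the combined Prepare+Position handler could look like: one consumed message drives both the prepare validation and the position adjustment, removing the intermediate topic hop. The message shape and function names are illustrative only, not the current handler code.

```ts
// Illustrative shape of a prepare request coming off Kafka.
interface PrepareMessage {
  transferId: string;
  payerDfsp: string;
  payeeDfsp: string;
  amount: number;
}

// Stand-ins for the existing prepare validation and position update steps.
async function validatePrepare(msg: PrepareMessage): Promise<void> {
  // duplicate check, participant check, expiry check, ...
}

async function adjustPosition(dfspId: string, amount: number): Promise<void> {
  // reserve funds against the DFSP's position / pre-funded cap
}

async function publish(topic: string, payload: unknown): Promise<void> {
  // produce the outcome event to Kafka
}

// Combined handler: one consumed message, both steps, one produced event.
// Today this is two handlers with an extra Kafka topic hop in between.
async function onPrepareMessage(msg: PrepareMessage): Promise<void> {
  await validatePrepare(msg);
  await adjustPosition(msg.payerDfsp, msg.amount);
  await publish("transfer-events", { type: "TransferReserved", transferId: msg.transferId });
}
```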

What are we trying to fix?

  1. Can we scale the system?

  2. What does it cost to do this? (scale unit cost)

  3. Need to understand how to do this at both small and large scale

  4. Optimize the resources

  5. 2.5 sprints

  6. Need to scale horizontally

  7. Add audit and repeatability -

Attendees:

  • Don, Joran (newly hired perf expert) - Coil

  • Sam, Miguel, Roman, Valentine, Warren, Bryan, Rajiv - ModusBox

  • Pedro - Crosslake

  • Rhys, Nakul Mishra - Confluent

  • Miller - Gates Foundation

  • In-person: Lewis (CL), Rob (MB), Roland (Sybrin), Greg (Sybrin), Megan (V), Simeon (V), Kim (CL)
