Performance Workstream
Wednesday, March 11, 2020
Performance Goals:
Current HW system achieving stable 1K TPS, peak 5K, with proven horizontal scalability
More instances = more performance in an almost linear fashion
Validate the minimum infrastructure needed for 1K TPS (financial TPS)
Determine the entry level configuration and cost (AWS and on-premise)
POCs:
Test the impact of a direct replacement of the MySQL DB with a shared-memory network service like Redis (using the Redlock algorithm if locks are required); see the lock sketch after this list
Test a different method of sharing state, using a lightweight event-driven approach with some CQRS
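Illustrative sketch for the Redis locking POC above: the single-node SET NX PX pattern that the Redlock algorithm generalizes across several independent Redis nodes. The ioredis client, key naming, and TTL below are assumptions for illustration, not the actual implementation.

import Redis from "ioredis";
import { randomUUID } from "crypto";

const redis = new Redis(); // assumed local Redis instance for the POC

// Acquire a short-lived lock on a participant's position, run the critical
// section (reserve funds, recalculate, commit), then release the lock only
// if we still own it. Redlock repeats this acquire against a majority of
// independent Redis nodes; this shows the single-node building block.
async function withPositionLock<T>(dfspId: string, ttlMs: number, critical: () => Promise<T>): Promise<T> {
  const key = `lock:position:${dfspId}`; // hypothetical key scheme
  const token = randomUUID();            // so we only ever delete our own lock

  const ok = await redis.set(key, token, "PX", ttlMs, "NX");
  if (ok !== "OK") throw new Error(`position for ${dfspId} is locked`);

  try {
    return await critical();
  } finally {
    // Compare-and-delete via Lua so an expired or stolen lock is never released by us.
    await redis.eval(
      'if redis.call("get", KEYS[1]) == ARGV[1] then return redis.call("del", KEYS[1]) end',
      1, key, token
    );
  }
}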
Resources:
Slack Channel:
#perf-engineering
Action/Follow-up Items:
What Kafka metrics (client & server side) should we be reviewing? - Confluent to assist
Explore Locking and position settlement - Sybrin to assist
Review RedLock - pessimistic locking vs automatic locking
Remove the shared DB in the middle (automatic locking on Redis)
Combine prepare/position handler w/ distributed DB
Review the Node.js client and how it impacts Kafka; configuration of Node and the underlying Kafka client - Nakul
Turn back on tracing to see how latency and applications are behaving
Ensure the call counts have been rationalized (at a deeper level)
Validate the processing times on the handlers and that we are hitting the cache
Async patterns in Node (see the sketch after this list)
Missing someone who is excellent on MySQL and Percona
Are we leveraging this correctly?
What cache layer are we using (in-memory)?
Review the event modeling implementation - identify the domain events
Node.js/Kubernetes:
Focus on application issues, not so much on architecture issues
Review how we are doing async processing (Node.js - larger issue); threading models need to be optimized - Nakul
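Sketch for the async-patterns and Node/Kafka client items above: contrasting sequential awaits with per-key concurrency when a handler drains a batch of Kafka messages. processTransfer() and the message shape are hypothetical; the point is that Node's single JS thread only delivers throughput if independent I/O (DB calls, cache lookups) is kept in flight concurrently.

declare function processTransfer(msg: { key: string; value: string }): Promise<void>; // hypothetical handler

// Pattern A: one message at a time - each DB round-trip blocks the next one.
async function handleBatchSequential(batch: { key: string; value: string }[]): Promise<void> {
  for (const msg of batch) {
    await processTransfer(msg);
  }
}

// Pattern B: messages for different keys (e.g. different DFSPs) run concurrently,
// while messages for the same key stay strictly ordered.
async function handleBatchConcurrent(batch: { key: string; value: string }[]): Promise<void> {
  const byKey = new Map<string, { key: string; value: string }[]>();
  for (const msg of batch) {
    const list = byKey.get(msg.key) ?? [];
    list.push(msg);
    byKey.set(msg.key, list);
  }
  await Promise.all(
    [...byKey.values()].map(async (msgs) => {
      for (const msg of msgs) await processTransfer(msg); // ordered within a key
    })
  );
}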
Meeting Notes/Details
History
Technology has been put in place with the hope that the design solves an enterprise problem
Community effort did not prioritize making the slices of the system enterprise-grade or cheap to run
OSS technology choices
Goals
Optimize current system
Make it cheaper to run
Make it scalable to 5K TPS
Ensure value added services can effectively and securely access transaction data
Testing Constraints
Only the golden transfer has been done - the transfer leg
Flow of transfer
Simulators (legacy and advanced) - using the legacy one for continuity
Disabled the timeout handler
8 DFSPs (participant organizations); with more DFSPs we would be able to scale
Process
Jmeter initiates payer request
Legacy simulator receives the fulfil notify callback
Legacy simulator handles payee processing, initiates the fulfilment callback
Record in the positions table for each DFSP
a. Partial algorithm where locking is done to reserve the funds, do the calculations and do the final commits (see the sketch below)
b. Position handler is processing one record at a time
Future algorithm would do a bulk
One transfer is handled by one position handler
Transfers are all pre-funded
Reduced settlement costs
Can control how fast DFSPs respond to the fulfill request (complete the transfers committed first before handling new requests)
System needs to time out transfers that go longer than 30 seconds
Any redesign of the DBs
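Rough sketch of step 4 above (lock the payer's position row, reserve the funds, commit, one record at a time). The table and column names are simplified illustrations, not the actual central-ledger schema, and the mysql2/promise client is assumed.

import { Pool } from "mysql2/promise";

// Reserve a transfer amount against one DFSP's position inside a DB transaction.
// SELECT ... FOR UPDATE takes the pessimistic row lock that the contention
// discussion centres on; the fulfil/position handler would later finalise or
// roll back the reservation.
async function reservePosition(pool: Pool, dfspId: string, amount: number): Promise<void> {
  const conn = await pool.getConnection();
  try {
    await conn.beginTransaction();

    const [rows] = await conn.execute(
      "SELECT position_value, ndc_limit FROM participant_position WHERE dfsp_id = ? FOR UPDATE",
      [dfspId]
    );
    const { position_value, ndc_limit } = (rows as any[])[0];

    if (position_value + amount > ndc_limit) {
      throw new Error("net debit cap exceeded"); // abort, nothing written
    }

    await conn.execute(
      "UPDATE participant_position SET position_value = position_value + ? WHERE dfsp_id = ?",
      [amount, dfspId]
    );

    await conn.commit();
  } catch (err) {
    await conn.rollback();
    throw err;
  } finally {
    conn.release();
  }
}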
Test Cases
Financial transaction
End-to-end
Prepare-only
Fulfil only
Individual Mojaloop Characterization
Services & Handlers
Streaming Arch & Libraries
Database
What changed: 150 to 300 TPS?
How we process the messages
Position handler (run in mixed mode, random)
Latency Measurement
5 sec for DB to process, X sec for Kafka to process
How to measure this?
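One possible way to answer "how to measure this": wrap each stage (Kafka produce/consume, DB write, callback) in a high-resolution timer and export the samples. The stage names and the recordLatency() sink below are hypothetical; in practice this would feed the tracing/metrics stack discussed elsewhere in these notes.

import { performance } from "perf_hooks";

const samples = new Map<string, number[]>(); // stage name -> latency samples in ms

function recordLatency(stage: string, ms: number): void {
  const list = samples.get(stage) ?? [];
  list.push(ms);
  samples.set(stage, list);
}

// Time any async stage and record how long it took, whether it succeeded or failed.
async function timed<T>(stage: string, fn: () => Promise<T>): Promise<T> {
  const start = performance.now();
  try {
    return await fn();
  } finally {
    recordLatency(stage, performance.now() - start);
  }
}

// Usage (stage names illustrative):
//   await timed("db.positionUpdate", () => reservePosition(pool, dfspId, amount));
//   await timed("kafka.produce", () => producer.send(message));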
Targets
High enough that the system has to function well
Crank the system up to add scale (addition of x DFSPs)
Suspicious cases for investigations
Observing contentions around the DB
Shared DB, 600 ms without any errors
Contention is fully on the DB
Bottleneck is the DB (distribute the systems so they run independently)
16 databases run end to end
GSMA - 500 TPS
What is the optimal design?
Contentions
System handler contention
Where the system can be scaled
If there are arch changes that we need to make we can explore this
Consistency for each DFSP
Threading of info flows - open question
Skewed results from a single DB for all DFSPs
Challenge is where we get to with additional HW
What are the limits of the application design
Financial transfers (in and out of the system)
Audit systems
Settlement activity
Grouping these into a DB solves some issues
Confluent feedback
Shared DB issues, multiple DBs
Application design level issues
Seen situations where we ran a bunch of simulators/sandboxes
Need to rely on tracers and scans once this gets into production
Miguel states we should disable tracing for now
Known Issues
Low CPU utilization on boxes (Node waiting around) - re-optimize code
Processing times increase over time
Optimization
Distributed monolith - PRISM - getting rid of redundant reads
Combine the handlers - Prepare+Position & Fulfil+Position
What are we trying to fix?
Can we scale the system?
What does it cost to do this? (scale unit cost)
Need to understand - how to do this from a small and large scale
Optimized the resources
2.5 sprints
Need to scale horizontally
Add audit and repeatability
Attendees:
Don, Joran (newly hired perf expert) - Coil
Sam, Miguel, Roman, Valentine, Warren, Bryan, Rajiv - ModusBox
Pedro - Crosslake
Rhys, Nakul Mishra - Confluent
Miller - Gates Foundation
In-person: Lewis (CL), Rob (MB), Roland (Sybrin), Greg (Sybrin), Megan (V), Simeon (V), Kim (CL)