Around 18 months ago, I was living in Iran and working for Snapp (SNAPP.ir), a ride-hailing company similar to Uber or Lyft. Our pricing application received a high volume of requests, and at peak times we suffered from high response times (around 2 seconds per request).
Let me give some background on how things were in the old days: one monolithic Laravel application held most of the business logic, with layers of code written by hundreds of engineers. The business had grown very fast in the early days, so it was somewhat understandable to me why the code looked the way it did, and I was also hired to fix those issues.
We were using an APM to understand how the existing code behaved, and we found that once our pricing endpoints hit a certain number of concurrent requests, their response times started to grow exponentially. We could also see that we were sending a large number of queries to our database instance. Based on the APM reports across different situations, the way we dealt with the database was the main issue. We monitored the endpoints for some time to make sure the data we were gathering was not polluted. (The way people used the app to request cars, or even a national holiday, could have unwanted effects on our numbers, so we had to be sure about those things.)
The first step for us was to understand the code and the calculation process, which wasn't easy. We asked a colleague who had been with the company since the early days and knew a lot about the system to help us understand the flows and save us some time. We also had to discuss things with the team that used our software and defined how prices should be calculated, to understand the process better.
Based on those discussions, we got a much better picture of how things should look in the new system. Given the team's background, we decided to use Lumen for the new application, and we started discussing the architecture and how to implement things so the new service would be simpler, yet flexible enough for future changes and features.
We spent the next couple of weeks getting the new code in place, and with the help of our QA team we started running over a hundred test cases; we had to make sure the new code did the calculation exactly like the old one, only much faster.
To give some numbers: we found some nasty loops inside the old code, and with a small change to the indexing of the related database table, we managed to reduce the number of queries from 70 to 2 per request. We were thrilled with this result.
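The pattern behind that 70-to-2 drop is the classic N+1 query problem: a loop that fires one query per item instead of one query for all items. Here is a minimal sketch in Python with SQLite (the real service was PHP/Laravel, and the table and function names here are hypothetical, purely for illustration):

```python
# Illustrative sketch of the query-reduction idea, not the real Snapp code.
# Table name, columns, and data are made up for this example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE price_rules (city_id INTEGER, service TEXT, rate REAL);
    -- The composite index lets the batched query stay fast.
    CREATE INDEX idx_rules_city_service ON price_rules (city_id, service);
    INSERT INTO price_rules VALUES
        (1, 'eco', 1.0), (1, 'plus', 1.5),
        (2, 'eco', 1.2), (2, 'plus', 1.8);
""")

def rates_one_by_one(pairs):
    # Old style: one query per (city, service) pair -> N queries per request.
    out = {}
    for city_id, service in pairs:
        row = conn.execute(
            "SELECT rate FROM price_rules WHERE city_id = ? AND service = ?",
            (city_id, service),
        ).fetchone()
        out[(city_id, service)] = row[0]
    return out

def rates_batched(pairs):
    # New style: a single query fetches every rate the request needs.
    placeholders = " OR ".join("(city_id = ? AND service = ?)" for _ in pairs)
    params = [value for pair in pairs for value in pair]
    rows = conn.execute(
        f"SELECT city_id, service, rate FROM price_rules WHERE {placeholders}",
        params,
    ).fetchall()
    return {(c, s): r for c, s, r in rows}

pairs = [(1, "eco"), (1, "plus"), (2, "eco")]
assert rates_one_by_one(pairs) == rates_batched(pairs)
```

Both functions return the same rates; the difference is only in how many round trips to the database each one costs.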
We spent some more time with our QA team testing all possible cases, and we also found some edge cases to cover in the new application. Then, after around three months of hard work, we had something to test in the production environment.
Although we had tested many things, we couldn't just swap the old pricing endpoints for this application. Before releasing the app to the public, we first needed to try and test it with production data.
To do so, we changed the new application to log everything from top to bottom: how the numbers were processed, the data coming from users, and the final result, so we could later compare them against the old app. We also changed the old code to forward the payload to our new application. For every request coming to the old app, a copy of the request also went to the new application, and everything was logged for us to compare later.
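This shadow-traffic setup can be sketched roughly as follows. All names here are hypothetical (the real system was PHP, and its logs went to proper storage rather than an in-memory list); the point is only the shape of the flow: the old app keeps serving the user, while a copy of each request is run through the new app and both results are logged.

```python
# Hedged sketch of request shadowing; handlers and payloads are made up.
import json

shadow_log = []  # stand-in for persistent log storage

def old_pricing_handler(payload):
    # Placeholder for the legacy calculation.
    return {"price": payload["distance_km"] * 1000}

def new_pricing_handler(payload):
    # Placeholder for the rewritten calculation; it must match the old one.
    return {"price": payload["distance_km"] * 1000}

def handle_request(payload):
    # The old app still produces the response the user sees...
    response = old_pricing_handler(payload)
    # ...while a copy of the request is fed to the new app, and both
    # results are logged for offline comparison.
    try:
        shadow_response = new_pricing_handler(payload)
        shadow_log.append(json.dumps({
            "payload": payload,
            "old": response,
            "new": shadow_response,
        }))
    except Exception:
        # A failure in the shadow path must never affect the real response.
        pass
    return response
```

The key design choice is that the shadow path is strictly fire-and-log: it can crash or disagree without any user-visible effect, which is what makes it safe to run against production traffic.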
From the results, we could see some differences, and we had expected them (shit always happens, and that was exactly why we couldn't just replace those endpoints with the new app). We worked on those cases, fixed them, and continued the testing for some time. By the time we were confident enough to go live, we started to replace the old endpoints with the new application gradually: for the first few days the new application handled 10% of the requests, and then we increased the percentage until it had completely replaced the old endpoints.
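The post doesn't say how the percentage split was implemented, but one common approach is deterministic hash-based bucketing on a stable identifier, so the same user always lands on the same backend while the rollout percentage is fixed. A minimal sketch under that assumption:

```python
# Hypothetical rollout router; the source doesn't specify the mechanism.
import hashlib

def routes_to_new_app(user_id: str, rollout_percent: int) -> bool:
    # Hash the id into a stable bucket in 0-99; buckets below the
    # rollout percentage go to the new application.
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent

# At 0% nobody hits the new app; at 100% everybody does.
assert not any(routes_to_new_app(str(i), 0) for i in range(1000))
assert all(routes_to_new_app(str(i), 100) for i in range(1000))
```

Raising the rollout from 10% to 100% is then just a config change, and because the bucketing is deterministic, no user flips back and forth between backends mid-rollout.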
To give some more numbers: on the first release, with the reduced number of database queries and some other performance optimizations we made, we brought the response time down from 2 seconds at peak on the old application to ~300 ms. After a while, with some more optimization, this number decreased to 60~70 ms. Going from 2 seconds to 60~70 ms was a win for us because we could serve more customers.
This image is from the day we sent all the traffic to our new application. (The numbers kept improving step by step after that day…)
It was my first experience doing optimization at this scale, and I enjoyed every bit of it.