Celebrating 2020 Days of Razorpay

Today is an interesting day for us. Razorpay turns 2,020 days young on 29 June 2020!

This means that over 5 years have passed since we began powering your financial systems so that you could continue challenging the status quo with new ideas, products and experiences. 

We have helped businesses across industries create an impact. From helping collect donations worth over Rs 237 million for Kerala floods and COVID-19 relief measures, to providing capital support to over 2,500 startups, the journey has been a remarkable one. 

The infographic below captures this journey in a beautiful way.

When we think of this time in terms of two thousand and twenty days, it seems like a long period of time. But it is time that has flown by us. The last couple of years have gone by at a particularly fast pace, because Razorpay has also evolved at a very fast pace during this time.

From a payments solution provider to a neobank.

We are now present at every stage of a business’s money management processes, powering their finances to help them grow their business. 

So what does day 2,021 and beyond look like for Razorpay? Well, stay tuned. Cliched as it may sound, there are actually miles to go before we sleep.

A big thank you to the Razorpay team for this wonderful effort: Isha, Anuj and Anupriya from the Design team, Khushali and Sreya from the Content and Social Media teams, and Sharath and Shreyas from the Product Analytics team for putting this together.

The Day of the RDS Multi-AZ Failover

On a fateful Friday evening in December 2019, when a few of us were looking forward to packing our bags and going home, we got an alert from the internal monitoring tool that the system had started throwing an unusually high number of 5xx errors.

The SRE team quickly realized that one of our main applications (called “API”) was not able to connect to its RDS (MySQL) database. By the time we could make any sense of the issue, the application came back up automatically and the alerts stopped.

Looking at the RDS logs, we realized that the master instance had gone through a Multi-AZ failover.

According to the SLAs provided by AWS, whenever an instance marked as Multi-AZ goes through a failure (whether a network failure, disk failure, etc.), AWS automatically shifts the traffic to its standby running in a separate AZ in the same AWS region. The failover can take anywhere between 60 and 120 seconds, which is why our master instance automatically came back up after around 110 seconds and the application started working without any manual intervention.

Replication failure

The API master instance had a set of 5 replica instances, which were used to serve different read workloads across various applications.

While the application stopped throwing errors and started working, we received another set of alerts stating that the replication on all the replicas had failed.

All the replicas displayed a duplicate-key error message. We immediately shifted all the traffic going to these replica instances to the master instance so that the application would not serve stale or incorrect data to users.

The drawback of moving all the traffic to the master was that all the heavier selects were also moved to the master, and the CPU load on the master instance increased by 50%. Hence, our immediate move was to recreate all 5 replicas so that we could move the replica load back as soon as possible.

The new replica creation process internally creates a snapshot of the current data from the master instance and then starts a DB instance from that snapshot. The very first snapshot from a particular machine captures the entire data set, but subsequent snapshots are incremental in nature. In the past, we had noticed that these incremental snapshots take around 15-20 minutes for the API database.
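
For readers who want to see what this looks like in practice, here is a minimal sketch of kicking off a new read replica with boto3 and waiting for it to become available. It is an illustration rather than our exact tooling; the instance identifiers, instance class and region are hypothetical.

import boto3

rds = boto3.client("rds", region_name="ap-south-1")

# Kick off a new read replica; RDS snapshots the source internally, so the call
# returns while the replica is still in the "creating" state.
rds.create_db_instance_read_replica(
    DBInstanceIdentifier="api-replica-1",        # hypothetical replica name
    SourceDBInstanceIdentifier="api-master",     # hypothetical source instance
    DBInstanceClass="db.r5.4xlarge",
)

# Wait until the replica is available before moving read traffic back to it.
waiter = rds.get_waiter("db_instance_available")
waiter.wait(DBInstanceIdentifier="api-replica-1")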

While taking the snapshot, we got another reality check: the process was taking longer than usual that day. After an hour or so, when the snapshot creation was still in progress, we contacted AWS tech support to understand why the whole process was taking so much longer.

AWS tech support informed us that since the master instance had gone through a Multi-AZ failover, they had replaced the old master with a new machine, which is routine. Since this was the very first snapshot being taken from the new machine, RDS would take a full snapshot of the data.

So, we had no option but to wait for the snapshot to finish and keep monitoring the master instance in the meantime. We waited six hours for the snapshot to complete, and only then were we able to create the replicas and redirect the traffic back to them.

Once the replicas were in place, we assumed that the worst was over, and finally called it a night.

Data loss

The next day, on our follow-up calls with RDS tech support, we were told that replication does not usually crash during a Multi-AZ master failover, and that there must be more to the incident than meets the eye.

This is when we started looking into why the replication had crashed. After matching the database and the application trace logs for the time around the incident, we found that a few records were present in the trace logs but not in the database. This is when we realised that we had lost some data at the time of the failover.

Being a fintech company, losing transactional data effectively means losing money and the trust of our customers. We began digging through the binary logs for that time frame and matching them with the data in the store. We finally figured out that the RDS database was missing 5 seconds of data. Right after these 5 seconds, we had started receiving 5xx errors in our application logs.

Luckily, we could extract the exact queries from the binary logs and go through the sequence of events from the application trace logs, and after an 8-hour marathon meeting, we were able to correct the data stored in the RDS.
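
For illustration, here is a hedged sketch of how such a narrow window can be pulled out of a binlog with mysqlbinlog so it can be reconciled against trace logs. These are not the exact commands we ran; the endpoint, credentials, timestamps and binlog file name are all hypothetical.

import subprocess

cmd = [
    "mysqlbinlog",
    "--read-from-remote-server",
    "--host", "api-master.example.rds.amazonaws.com",   # hypothetical endpoint
    "--user", "repl_user",
    "--password=REDACTED",
    "--start-datetime", "2019-12-06 18:42:10",          # hypothetical 5-second window
    "--stop-datetime", "2019-12-06 18:42:15",
    "--base64-output=decode-rows",
    "--verbose",
    "mysql-bin-changelog.000123",                       # hypothetical binlog file
]

# The decoded row events (### INSERT/UPDATE/DELETE pseudo-SQL) are written to a
# file so they can be reconciled manually against the application trace logs.
with open("missing_window.sql", "w") as out:
    subprocess.run(cmd, stdout=out, check=True)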

How Multi-AZ replication works

It was time for us to investigate why we even fell into this situation in the first place. To solve the puzzle, we had to find the answer to the following questions:

  • How does the RDS Multi-AZ replication work?
  • What steps does RDS take at the time of a multi-AZ failover?
  • Why was the data missing in the database?
  • Why did the replication crash?

We got on tens of calls with a number of RDS solution architects over the next week, and were finally able to connect all the dots.

 

In a Multi-AZ setup, the RDS endpoint points to a primary instance. Another machine in a separate AZ is reserved for the standby, in case the master instance goes down. 

MySQL on the standby instance is in shutdown mode, and the replication happens between the two EBS volumes, i.e., as soon as data is written to the primary EBS volume, it is duplicated to the standby EBS volume synchronously. This way, RDS ensures that any data written to the master EBS volume is always present on the standby EBS volume, and hence there will be no data loss in the case of a failover.

At the time of a failover, the RDS goes through a number of steps to ensure that the traffic is moved over to the standby machine in a sane manner. These steps are listed below:

  1. Primary MySQL instance goes through a cut-over (networking stopped). Client application goes down.
  2. MySQL on the standby machine is started.
  3. RDS endpoint switches to the standby machine (new primary).
  4. Application starts connecting to the standby machine. Client application is up now.
  5. Old primary machine goes through a host-change (hard-reboot).
  6. EBS sync starts from the new primary instance to the new standby instance.

Looking at the failover process, it seems pretty foolproof; the replicas should never have run into any duplicate-key errors, and we should never have had any data loss.

So, what went wrong?

Incorrect configuration variable

We found a MySQL configuration parameter, innodb_flush_log_at_trx_commit, which is critical for a seamless failover.

InnoDB data changes are always committed to a transaction log that resides in memory. This data is flushed to the EBS disk based on the setting of innodb_flush_log_at_trx_commit.

  • If the variable is set to 0, logs are written and flushed to the disk once every second. Transactions for which logs have not been flushed to the disk can be lost in case of a crash.
  • If the variable is set to 1, logs are written and flushed to the disk after every transaction commit. This is the default RDS setting and is required for full ACID compliance.
  • If the variable is set to 2, logs are written after each transaction commit, but flushed to disk only about once every second. Transactions for which logs have not been flushed to the disk can be lost in case of a crash.

For full ACID compliance, we must set the variable to 1. However, in our case, we had set it to 2. This meant that even though the logs were written after every commit, they were not being flushed to the disk immediately.

After learning about this variable, everything suddenly became crystal clear. Since we had set it to 2, the data was committed on the master instance but not flushed to the primary EBS volume. Hence, the standby (new primary) never received this data, which is why we could not find it on the master instance after the failover.

But, why did the replicas fail? And, why was the data found in the binary logs?

There is another variable, sync_binlog, which when set to 1 flushes the binary log to disk on every commit. As we had set it to 1 (which is correct), the data made it into the binary logs and the replicas were able to read it. Once the data was read, the replicas applied those DML statements and stayed in sync with the old master.

 

Let’s say, the auto-increment value of one of the tables was X. Application inserted a new row which got auto-increment-id as X+1. This value X+1 reached the replica, but not the standby machine. So, when the application failed over to the standby machine, it again entered a new row with auto-increment-id as X+1. This insert, on reaching the replica, threw the duplicate-key error and crashed the replication.

We went back to our old snapshots (incidentally, we had kept the snapshots of the old replicas before deleting them); and were able to prove that the lost data was present in the replicas.

Once our theory was proven, we immediately changed the value of innodb_flush_log_at_trx_commit on the master instance from 2 to 1, and closed the final loop.
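
On RDS, this kind of change is applied through the instance's DB parameter group rather than a SET GLOBAL on the server. Here is a minimal boto3 sketch of that step; the parameter group name is hypothetical.

import boto3

rds = boto3.client("rds", region_name="ap-south-1")

rds.modify_db_parameter_group(
    DBParameterGroupName="api-mysql-params",   # hypothetical parameter group
    Parameters=[{
        "ParameterName": "innodb_flush_log_at_trx_commit",
        "ParameterValue": "1",        # full ACID: flush the redo log on every commit
        "ApplyMethod": "immediate",   # dynamic parameter, no reboot required
    }],
)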

Final thoughts

In retrospect, we're glad that we dug deeper into the incident and were able to reach the root of the problem. The incident showed us that we had been vulnerable to data loss all along because of an incorrect setting of a single configuration variable.

The silver lining, however, is that we learnt a lot about how RDS manages the Multi-AZ setup and its failovers. And, of course, we gained an interesting tale to tell you all!

Business Continuity in Wake of COVID-19

The situation with COVID-19 is changing quickly. Many of you are looking to Razorpay to uphold your business continuity plan. Others are looking to us as a potential solution—you’ve found your business unexpectedly disrupted and are wondering if Razorpay can help you restore business continuity. 

We are taking this opportunity to provide more visibility and clarity into Razorpay’s business continuity strategy so you can be confident that we will be available throughout this disruption. 

Razorpay has had a business continuity plan in place, and it is regularly reviewed and updated, both on a periodic cadence and as needed to address significant changes.

Razorpay has implemented the following precautionary measures from our BCP, effective as of Monday, March 16, 2020.

  1. System availability: Our entire infrastructure, deployed on AWS, scales automatically based on demand and does not entail any human based server management
  2. Data protection: Razorpay has audited access management and approval workflows in place to ensure that your data is protected irrespective of our workforce being remote or not. Razorpay continues to be an ISO and PCI certified organisation and our BCP ensures that we stay that way
  3. Employee availability
    • We have enforced a work from home policy for all our employees to limit the spread of infection among our workforce
    • Virtual meetings and remote work-enabled collaboration tools have always been a part of our workforce’s DNA. So, we are confident that 100% remote work will not hamper our productivity in any way whatsoever
    • We have also started cross-training workers and established backup arrangements up to level 3, in case critical resources fall sick, to minimize disruptions

There is no mistaking the challenges of these times. We do not yet know with certainty when the greatest risks will be behind us. But we can assure you that we are prepared to ensure that your business continuity is not affected.

At the same time, we are taking proactive steps to ensure not only our services continue uninterrupted but our teams can continue to deliver newer products and features to supercharge your business.

For any further information regarding this or any help that Razorpay can provide in these testing times, please feel free to reach out to Razorpay Support.

Do visit our website for information on how our products can help you restore business continuity.

Data Engineering at Scale – Building a Real-time Data Highway

At Razorpay, data comes into our systems at an extremely high scale and from a variety of sources. For the company to operate with data at its core, enabling data democratization has become essential. This means we need systems in place that can capture, enrich and disseminate data in the right fashion to the right stakeholders.

This is the first part of our journey into data engineering systems at scale, and here, we focus on our scalable real-time data highway.

In the subsequent articles, we will be sharing our ideas and implementation details around our near real-time pipelines, optimizations around our storage layer, data warehousing and our overall design around what we call a “data lake”.  

Understanding data classification

The idea of building a data platform is to collate all data relevant to Razorpay, in every way possible, in a consolidated place in its native format, so that it can later be processed to serve different consumption patterns.

And, in this regard, the data platform needs to handle a variety of concerns (including but not limited to) data governance, data provenance, data availability, security and integrity, among additional platform capabilities.

In order to do any of the above, we need to understand the nature of the data. At a bird's eye level, data within Razorpay broadly falls into 2 categories:

  1. Entities: Capturing changes to our payments, refunds, settlements, etc. happens at an entity level, where we maintain the latest state (or even all the states) of each entity in our storage, in multiple manifestations that can serve all kinds of consumers later
  2. Events: Applications (internal, external, third party) send messages to the data platform as part of their business processing. By this, we broadly mean any and every system that interacts with the Razorpay ecosystem, which can potentially end up sending data to the data platform. While the respective databases can only answer what the final state is, the events help us understand how each system/service reached that final state

Evolution and need for a scalable real-time data highway

To understand why we need to build a real-time data highway: with the explosive growth we have seen, we have constantly been on a quest to answer questions such as the following:

  • What has been the result of the experiments we run?
  • What is the success rate of different gateways, payment methods, merchants, etc.?
  • How do we make our internal business metrics available to all the respective business and product owners?
  • How can our support, operations, SRE and other teams monitor and set up alerts around the key metrics across different products and services?
  • How do we slice and dice all our products across 100s of different dimensions and KPIs?

What was before?

Before we jump into solving some of the above asks, let us briefly look at what we used to have to answer some of these questions. We had built a traditional ETL pipeline that queries our application database (MySQL) on a batch interval and updates an internal elasticsearch cluster.

Not only did this power our customer facing analytics dashboard, it was also fronted by an authenticated kibana dashboard for doing all the above activities. For a certain set of business folks, the data was piped into tableau over s3/athena. For managing the ETL pipeline, we had written a framework on top of apache beam to pull the respective tables, with the joins and transformations, in a composable ETL pipeline. This meant that creating a new pipeline was simply a matter of updating a few configurations.

At a very high level, the architecture of such a system, looks like the following: 

[Architecture diagram: batch ETL pipeline from MySQL to elasticsearch/kibana]

  1. Input data is read from MySQL over a window period to make a PCollection of payments, with payment ID and details as a <K-V> pair
  2. In the next transform, we fetch the merchant keys and use the payments formatter to get the output data PCollection
  3. In the final step, we write the PCollection to elasticsearch
  4. Kibana is used as a BI tool to monitor payment success rates and other dashboards (a simplified sketch of this pipeline follows)
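
To make the composable-pipeline idea concrete, here is a heavily simplified Beam sketch of the transforms above. It is an illustration rather than our internal framework: the windowed MySQL read is stubbed out with in-memory rows, the Elasticsearch write is a plain DoFn, and the cluster, index and field names are hypothetical.

import apache_beam as beam
from elasticsearch import Elasticsearch

class IndexToES(beam.DoFn):
    """Writes each (payment_id, document) pair into a hypothetical payments index."""
    def setup(self):
        self.es = Elasticsearch(["http://localhost:9200"])   # hypothetical cluster

    def process(self, payment):
        payment_id, doc = payment
        self.es.index(index="payments", id=payment_id, body=doc)

def enrich_and_format(payment):
    # In the real pipeline this transform joins against the merchants table;
    # the enrichment here is hard-coded purely for illustration.
    payment_id, details = payment
    details["merchant_name"] = "acme"
    return payment_id, details

with beam.Pipeline() as p:
    (p
     | "ReadPaymentsWindow" >> beam.Create([("pay_1", {"amount": 100, "merchant_id": "m_1"})])
     | "EnrichAndFormat" >> beam.Map(enrich_and_format)
     | "WriteToES" >> beam.ParDo(IndexToES()))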

And to serve our customer facing analytics dashboard, we wrote an internal python framework and an API layer that translates an SQL query into an elasticsearch query. As of today, elasticsearch versions 7 and above support built-in SQL queries. We have, however, been running this framework successfully in production for over 2 years (much before such a feature was available in elasticsearch), and it serves all our merchant analytics using the above.

Even with the recent versions of elasticsearch, some of our aggregations cannot be directly translated into the elasticsearch SQL query format. So, in essence, the merchant/customer dashboard queries our internal analytics API using a REST endpoint with the SQL-like query, which is converted internally into an elasticsearch query; the respective aggregations are run and the results presented back to the front end layer for building the visualizations.
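
To give a flavour of the translation (a simplified illustration, not our internal framework), a SQL-like request such as a group-by on payment method maps to an elasticsearch terms aggregation. The cluster, index and field names below are hypothetical.

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])   # hypothetical cluster

# Rough equivalent of:
#   SELECT method, COUNT(*) FROM payments
#   WHERE merchant_id = 'XYZ' AND created_at >= now() - 1 day
#   GROUP BY method
query = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {"term": {"merchant_id": "XYZ"}},
                {"range": {"created_at": {"gte": "now-1d"}}},
            ]
        }
    },
    "aggs": {"by_method": {"terms": {"field": "method"}}},
}

resp = es.search(index="payments", body=query)
for bucket in resp["aggregations"]["by_method"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])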

This only addressed the need to capture database-level changes. In addition to the above, our applications also emitted events specific to different use cases.

To get this working initially, after trying several expensive tools, we settled on using newrelic insights to power all our events use cases. We had been using newrelic for all our APM use cases, and we ended up powering our events and other metrics using insights.

While it worked for over 2 years, it started becoming quite expensive. In addition, detailed funneling and long-term queries became extremely difficult. Above all, it couldn't easily be correlated with our database changes, primarily because the events were real time while the data capture was in batch mode. Also, joining visualizations across newrelic and kibana was turning out to be painful. In essence, the architecture for this system looked like the below.

[Architecture diagram: application events pipeline into newrelic insights]

The following were some of the additional issues we saw with newrelic:

  • Data is locked with newrelic, not easily exportable, data retention for 3 months only (retention is calculated based on usage)
  • Some of our funneling queries, produce incorrect results for old data
  • Subqueries are not possible
  • The number of results capped at 1000 rows max
  • Negative funnels are not possible
  • Reference to a property from a top-level query in a bottom-level query for the funnel is not possible
  • Order of events is not regarded in funnels. While creating funnels, if your funnel says A -> B -> C, even those sessions will be counted for the funnel when the actual order of events was C -> A -> B
  • Since newrelic is an external system, any data enrichment (e.g. key mappings, customer mappings, etc.) cannot be applied on the fly. Data enrichment cannot be done post facto. This poses a heavy challenge when multiple services want to enrich a request that spans different services
  • In addition, we cannot maintain any specific lookup tables (if needed) to enable custom enrichments (e.g. geo IP lookup, mobile device mapping, user agent mapping, etc.)

What was the problem with the above?

While the above system has been serving all the above needs, it presented us with the following challenges:

  • As it is a traditional batch system, we have delays in terms of being able to slice and dice in real time
  • Scaling elasticsearch for heavy business queries was challenging. As a result, we had to set up multiple elasticsearch clusters (for internal and customer facing use cases). In addition, tuning elasticsearch for our needs became a constant challenge
  • Data governance: We had to build a whole lot of access control mechanisms on top of kibana to ensure role-based access control. Elasticsearch only supported Search Guard, which came with its own performance issues
  • Joins: Some of our dashboards required us to join across a variety of databases and tables. Elasticsearch inherently does not support joins. So, we had to make constant modifications to our ETL pipelines to keep our indexes up to date with these ever-growing needs
  • Schema evolution: In addition to the above, our internal application schema is constantly evolving, and for every such evolution, we had to rely on elasticsearch index versioning and aliasing strategies to ensure data correctness. This also required us to backport data across different indexes
  • Cross-joining events with DB changes: As mentioned above, we couldn't easily do causation-correlation analysis at any given point. We had to export reports from each of the above systems (newrelic, tableau, elasticsearch), and manual intervention was needed to understand any issues at hand
  • Availability: We also wanted all of this data to be available, in some fashion, to our data scientists, and that too was turning out to be cumbersome. This again needed multiple different kinds of exports. In addition, the data governance rules became even harder to deal with in all these situations

In addition to the above, we had multiple BI solutions being used internally for different stakeholders:

  • Engineering wanted to query through SQL like interface
  • Product Analysts preferred custom dashboards
  • Business analysts wanted richer visualizations
  • Marketing wanted other integrations around Hubspot, Google Analytics etc

In essence, there was a strong need to converge all our BI use cases into a single unified platform. The above issues were inhibiting us from exploring and analysing the data within the entire ecosystem. Earlier this year, our product team settled on a single BI tool to which all data would be tied.

Evolving to a real-time data pipeline

Sometime early this year, post the decision on unifying the BI tool, the data engineering team was given the task of building a real time pipeline, served through the unified BI tool for handling the above issues. 

The data engineering team was already building a scalable data lake for resolving some of the above issues. However, with the need to handle some of our peak load transactions and improve our operational excellence, the product team prioritized having a real time capability that needed to be exposed to all our internal stakeholders, within the lake. 

The long-term idea is to expose these capabilities to our customers, on a real time basis, thereby eliminating our older version of the analytics dashboard. The data engineering team started having a close look at the scale of the problem to be handled. Here is a high level summary of our findings:

  • We do several million transactions per day (~100M)
  • With just a small fraction of our application stack integrated into the data engineering platform, we are generating close to 0.5 billion events a day
  • The compressed size of our data within the lake, at this point, was close to 100+ TB

All of the above, just within a few months of building the data lake!

Let's understand the above in a little more detail before we present the solution:

  • We have a variety of microservices that run as part of our core payment gateway system to handle a single successful payment
  • Post a successful payment, there are a variety of other services that handle different post-payment processing activities like refunds, settlements, etc.
  • In addition to the above, we have other products that directly and indirectly use the core payment gateway services, like subscriptions, invoices, etc.
  • Our front end and mobile SDKs emit a variety of events into our system. We cannot use third party systems like google analytics, owing to PCI norms and CORS restrictions. So, all these events have to be piped into the lake
  • Over and above these, our internal microservices also emit events during different stages of their processing lifecycle

To solve all the above issues, we divide our discussion into real time entities and real time events. 

Real time entities

Writing to a database is easy, but getting the data out again is surprisingly hard. If you just want to query the database and get some results, that’s fine. But what if you want a copy of your database content in some other system like data lake for real-time analytics?

If your data never changed, it would be easy. You could just take a snapshot of the database (a full dump, e.g. a backup), copy it over, and load it into the data lake. This poses 2 different kinds of problems:

  1. Most of the data goes through a state machine and hence, the state of the data changes rapidly
  2. Getting the up-to-date view of this data is challenging in real time.

Even if you take a snapshot once a day, you still have one-day-old data in the downstream system, and on a large database, those snapshots and bulk loads can become very expensive, which is not great.

So, what does the above mean?

  • We will need to incrementally load data into a real time streaming pipeline that directly manifests into the lake
  • We cannot expose our internal primary database to our BI tool as it stores a lot of sensitive information
  • We want our real time stream to be as performant as possible
  • We do not want to keep the data in our real time stream for eternity, as its primary use case is around instantaneous monitoring, visualization and alerting

Keeping the above in mind, the data team had made the following assumptions:

  • We do not need all of this data for eternity, unlike in our traditional OLTP store. So, we decided to store the data as a rolling window over seven days (1 week)
  • We still want to maintain some basic governing facts here for eternity (e.g. merchants, customers, card bins, etc.)
  • We want this system to be extremely performant and to be able to answer queries as fast as possible
  • Some of the rolling aggregations are fairly complex and need to be computed with as much data as possible to achieve the desired latency
  • We want the change data to be captured here as soon as possible
  • In essence, all operations on this store will only be upsert operations, as we do not want to keep a copy of any older/stale data (a minimal sketch of this follows)
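
Here is a minimal sketch of what the upsert-only, rolling-window behaviour looks like against a Postgres-compatible store, assuming a hypothetical payments table with a unique constraint on id. The retention step is shown as a plain DELETE; on a TimescaleDB hypertable this would typically be drop_chunks or a retention policy instead.

import psycopg2

# Hypothetical connection string, table and columns.
conn = psycopg2.connect("host=realtime-store dbname=realtime user=analytics password=REDACTED")
cur = conn.cursor()

# Upsert: keep only the latest state of each entity, never a stale copy.
cur.execute(
    """
    INSERT INTO payments (id, status, amount, updated_at)
    VALUES (%s, %s, %s, %s)
    ON CONFLICT (id) DO UPDATE
       SET status = EXCLUDED.status,
           amount = EXCLUDED.amount,
           updated_at = EXCLUDED.updated_at
    """,
    ("pay_123", "captured", 50000, "2020-06-29 10:00:00"),
)

# Rolling 7-day window: purge anything older than a week.
cur.execute("DELETE FROM payments WHERE updated_at < now() - INTERVAL '7 days'")

conn.commit()
cur.close()
conn.close()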

At a very high level, our architecture for solving this problem looks like the following:

[Architecture diagram: real-time entities pipeline, MySQL read replica to Maxwell to Kafka to Spark to the real-time data store]

 The flow of data will look something like this:

  • A MySQL read replica instance is used to pull the data
  • We use maxwell to handle the CDC (change data capture) and also ensure that sensitive information read from the binlog is filtered out before it is published
  • The Maxwell daemon detects data changes on this DB and pushes them to a Kafka topic
  • A spark consumer keeps reading from the kafka stream and batches updates every few seconds (note: the minimum batch duration available in spark is 100 ms)
  • Finally, the change data is pushed to the real-time data store, from where queries can be executed by the BI tool (a minimal sketch of this leg follows)
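
A minimal sketch of the Kafka-to-store leg of this flow using Spark structured streaming is shown below. The topic name, brokers, credentials, checkpoint path and target table are hypothetical, the Postgres JDBC driver is assumed to be on the classpath, and a real deployment would apply ordered upserts rather than the plain append shown here.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.appName("cdc-to-realtime-store").getOrCreate()

# Maxwell wraps each row change as JSON with top-level database/table/type/ts fields.
meta_schema = StructType([
    StructField("database", StringType()),
    StructField("table", StringType()),
    StructField("type", StringType()),   # insert / update / delete
    StructField("ts", LongType()),
])

changes = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")   # hypothetical brokers
    .option("subscribe", "maxwell")                    # hypothetical CDC topic
    .load()
    .select(col("value").cast("string").alias("raw"))
    .select(from_json(col("raw"), meta_schema).alias("meta"), col("raw"))
    .select(
        col("meta.database").alias("db_name"),
        col("meta.table").alias("table_name"),
        col("meta.type").alias("change_type"),
        col("meta.ts").alias("changed_at"),
        col("raw").alias("payload"),
    )
)

def write_batch(batch_df, batch_id):
    # A production pipeline would apply these as ordered upserts; a plain
    # append is shown here only to keep the sketch short.
    (batch_df.write.format("jdbc")
        .option("url", "jdbc:postgresql://realtime-store:5432/realtime")  # hypothetical
        .option("dbtable", "cdc_events")
        .option("user", "analytics")
        .option("password", "REDACTED")
        .mode("append")
        .save())

(changes.writeStream
    .foreachBatch(write_batch)
    .trigger(processingTime="5 seconds")   # batch updates every few seconds
    .option("checkpointLocation", "s3a://datalake/checkpoints/cdc")   # hypothetical path
    .start()
    .awaitTermination())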

Choice of real-time data store

We did a variety of evaluations on some of the existing data stores for the real-time use case. In essence, we wanted SQL capabilities to be used by the unified BI tool. Most folks within the organization are comfortable with SQL and hence, we wanted something that fits the bill. 

After evaluating a bunch of OLAP engines, we arrived at timescaledb as the choice of engine. Timescaledb is postgres under the hood, with a time-series extension. This means we do not have to compromise on SQL capabilities, and it also gives us some advantages for rolling aggregate computation, etc. In addition, we wanted the operational cost to be as low as possible, with self-healing and auto-scaling abilities where possible.

We didn't want to spend large amounts of money on a paid solution like memsql to solve these problems. Considering all the above, TimescaleDB seemed like a reasonable place to start: simple enough to set up and maintain, and it meets all the respective criteria.

Real time events

As mentioned above, as of today, only a small fraction of all our workloads (front end systems, mobile SDKs and a few core transactional apps) are pushing events into the data lake. Despite this, the data lake is receiving close to 0.5B events per day.

As you would've guessed, with all the existing services pushing events, this number is only going to grow significantly. For a long while, we have had an internal event ingestion pipeline (codename: lumberjack), written in go, which primarily relays incoming events from multiple producers to the desired targets.

In essence, for any app to tie its events into the lake, all that is needed is to register itself through a configuration. The reason for choosing go over java or others is to achieve an extremely high level of concurrency with a minimal operating footprint (cpu, memory, etc.). In addition, this was designed as a heavily I/O-bound application, as most of the work is simply processing: doing minimal validation/enrichment and transmitting events.

We already discussed some of the challenges we had with events being pushed to newrelic. So, we wanted to move all of the events, into a central store, from where we could query using our unified BI tool. 

We started making minor architectural changes to our event ingestion pipeline to arrive at the following:

[Architecture diagram: real-time events pipeline, lumberjack to Kafka to Spark streaming to S3, queried via Presto]

Lumberjack workers: We were originally pushing to aws SQS. We wanted streaming capabilities, and SQS only supports long polling. So, we decided to move this to Kafka. Kafka gave us the ability to replay and manage offsets effectively.

Consumer: We completely removed the task of pushing events to newrelic. This way, we got rid of the Kafka consumer that was running on the lumberjack side. We moved this operation to a spark streaming job, which reads messages from kafka in order of appearance and streams them to an S3 bucket.

Sink – S3: The spark streaming job sinks data on every micro-batch interval, which is configurable. Currently, we have set it to 1 min. Every micro-batch is accumulated in memory, so we can configure the sink interval based on data size. Again, the minimum micro-batch interval supported by spark is 100 ms.
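
A minimal sketch of this sink is shown below, assuming a hypothetical events topic and S3 bucket, with the 1-minute micro-batch trigger and the daily created_date partitioning covered in the next point.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("events-to-s3").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")   # hypothetical brokers
    .option("subscribe", "events")                     # hypothetical events topic
    .load()
    .select(col("value").cast("string").alias("payload"), col("timestamp"))
    .withColumn("created_date", to_date(col("timestamp")))
)

(events.writeStream
    .format("parquet")
    .option("path", "s3a://datalake/events/")                          # hypothetical bucket
    .option("checkpointLocation", "s3a://datalake/checkpoints/events")
    .partitionBy("created_date")            # daily partitions, queried later by Presto
    .trigger(processingTime="1 minute")     # the configurable micro-batch sink interval
    .start()
    .awaitTermination())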

Query execution: We are using presto for query execution. The advantage we get here is sub second responses for a few million records. 

S3 – Partitioning: In order to further speed up query execution for events across multiple days, we create daily partitions (registered via msck repair) so that users can query using created_date as the primary partition key. This has been configured into our BI tool.
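
For completeness, here is a sketch of how such a partitioned table might be queried through Presto with partition pruning on created_date; the coordinator host, catalog, schema and table names are hypothetical.

from pyhive import presto

conn = presto.connect(host="presto.internal", port=8080,   # hypothetical coordinator
                      catalog="hive", schema="events")
cur = conn.cursor()

# Filtering on created_date lets Presto prune to a single daily partition.
cur.execute("""
    SELECT event_name, count(*) AS cnt
    FROM checkout_events
    WHERE created_date = DATE '2020-06-29'
    GROUP BY event_name
    ORDER BY cnt DESC
    LIMIT 10
""")
for row in cur.fetchall():
    print(row)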

Infrastructure setup

Our entire infrastructure for all of Razorpay has been deployed and operated via kubernetes. In essence, except for the spark setup, we run and manage all the other aspects via kubernetes. 

So, in essence, maxwell runs as a deployment, kafka runs as a kubernetes daemonset exposed to the spark pipelines, and timescaledb has also been set up as a kubernetes daemonset backed by a remote AWS EBS volume. Connectivity from the BI tool to timescaleDB is enabled over an NLB, and the AWS security group associated with timescaledb ensures the setup stays secure.

 The above aside, the spark cluster has been exposed to our BI tool, controlled again via AWS security group and only allows presto queries to be executed. We use prometheus for all our internal metrics. 

Currently, since spark doesn't support pushing metrics to prometheus out of the box, we funnel the metrics from spark to lumberjack, which is directly scraped by prometheus and exposed on our dashboards.

Databricks has an upstream patch on spark for pushing prometheus metrics to a push gateway, but that's not yet merged into spark core.

The major challenges

Real-time entities challenges:

  1. Since the pipeline has to handle both DDL and DML logs, the order of committing statements to the data lake is crucial, and pushing data in the same order in which it was generated was a major challenge. We have implemented custom logic to derive the order from the binlog file name and the offset within that file (see the sketch after this list). We also have an internal schema registry, again deployed on kubernetes, to manage the schemas. This allows us to track schema evolution over a period of time and also ensures we can keep multiple copies of the data on the lake
  2. Kafka slowed down periodically due to limited partitions. This led to a lag in the data lake, which was fixed by partitioning on unique IDs
  3. Dashboard query performance was poor, so we implemented a custom user-defined function that aggregates the data over a rolling time window and caches the older aggregate data
  4. Because the volume of transactions is very high for humongous tables such as payments, orders, etc. and much lower for small tables like merchants, we cannot distribute the load uniformly across partitions. This leads to write-performance skew
  5. MySQL GTIDs also cannot be used for sequencing in certain cases, so we have built custom sort and de-duplication mechanics to handle out-of-order events
  6. Replication delays: In order to avoid AWS inter-AZ data transfer costs, and to avoid pressure on the primary databases, we have designed maxwell to read from the replica. As a result, at peak times, if there is a replication lag, our real-time pipelines see the same delay in processing and transmission
  7. Scaling challenges around timescaledb: At the moment, timescaledb doesn't inherently support clustering options. We plan to move this layer into a clustered mode using kubedb, or perhaps use other mechanisms to ensure we have better clustering / MPP-style execution
  8. In addition, we can cut down the spark processing time by moving this pipeline to flink, which can stream directly from kafka to the timescaledb endpoint
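
As a small illustration of point 1 above, ordering and de-duplicating change events by (binlog file, offset) can be sketched as follows; the event field names are hypothetical.

def binlog_key(event):
    # e.g. file "mysql-bin-changelog.000123" -> sequence number 123
    seq = int(event["binlog_file"].rsplit(".", 1)[1])
    return (seq, event["binlog_position"])

def order_and_dedupe(events):
    seen = set()
    for event in sorted(events, key=binlog_key):
        key = binlog_key(event)
        if key in seen:              # drop replays of the same binlog entry
            continue
        seen.add(key)
        yield event

# Applied within a micro-batch before committing it to the lake.
batch = [
    {"binlog_file": "mysql-bin-changelog.000124", "binlog_position": 10, "op": "update"},
    {"binlog_file": "mysql-bin-changelog.000123", "binlog_position": 97, "op": "insert"},
    {"binlog_file": "mysql-bin-changelog.000123", "binlog_position": 97, "op": "insert"},
]
ordered = list(order_and_dedupe(batch))   # insert first, duplicate dropped, then update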

Real-time entities challenges:

  1. Since the events are pushed in small micro-batches, this leads to a lot of cost overhead on S3. In addition, during query execution, we were bitten by hadoop's small-file problem. We are still balancing the right micro-batch interval
  2. In addition, we wanted to have a unified way of keeping this data. So, we plan to move the immediate events into the real-time data store and eventually sync them into the partitioned tables on a daily basis
  3. With the above change, we can quite simply move the spark processing to flink, where the flink jobs can stream directly to the timescaledb endpoint and spark processes the daily batches with partitioning

Learnings and pitfalls

  1. To replicate MySQL DB transactions in the correct order on a non-MySQL datastore, a combination of GTID, XID and event types (commit start and end) needs to be used to order the transactions and replay the events
  2. Spark streaming has a lot of overhead and doesn't play well with small batch sizes (at the millisecond level, which is why we moved to second-level batches)
  3. Running SQL queries from spark carries a lot of overhead. We need to instrument the right metrics, analyze queries in a timely fashion and enable the right kind of caching to optimize the queries
  4. A large portion of our data lake is built on aws s3. This comes at a significant cost if not tuned well. For instance, the s3 data transfer cost bit us quite badly a few months back. As a result, we had to go through significant infra optimization and enable vpc endpoints, among other things. Cost optimization continues to be an ongoing exercise
  5. Optimizing S3 by itself has posed enough challenges for us. As we mentioned earlier, in the subsequent posts we shall enlist our learnings, observations and the work we have done to optimize these

The road ahead

As much as we have been able to build some of these things at an extremely efficient scale and operationalize it, our journey doesn’t stop here. 

It has in fact, just begun. 

In the subsequent posts, we shall talk about the journey of our data platform, data lake, non-real-time use cases and the optimization techniques we adopted, among a variety of subjects.

Our journey thus far on the data side hasn't really been that smooth. We have failed, learnt and recovered. On the other side, some of the most challenging problems we have faced have been a lot of fun to solve too. We wish to keep learning and share our learnings through these posts.

If you are interested in working with us or in solving some exciting problems, please reach out to hiring@razorpay.com or visit our careers page.

Authors: Birendra Kumar (Head of Data Engineering, Razorpay) and Venkat Vaidhyanathan (Architect, Razorpay)

2019 – The Year That Was for Razorpay


As 2019 draws to a close, it’s a good time to sit back and relive how the year has been for us. In many ways, 2019 was a big year for not only Razorpay, but the fintech industry as well. 

We have captured how the year was for us in the infographic below, but allow me to talk about some highlights here as well.

  • We served 960k plus businesses on our platform this year
  • IFTA announced us as the most innovative payments startup and we made it to Y-Combinator’s list of top 100 companies as well
  • Our co-founders, Harshil Mathur and Shashank Kumar, brought home the Young Alumnus Award 2019 from their alma mater, IIT Roorkee
  • Razorpay acquired two businesses this year – Thirdwatch & Opfin 
  • We processed payouts volume worth $3 billion on RazorpayX

Our product portfolio also grew by leaps and bounds, notably:

  • Payment Pages allows businesses to accept payments without a website or app
  • Support for freelancers, consultants and unregistered businesses
  • Current accounts and corporate credit cards on RazorpayX 

Personally, I love the fact that we were able to save transaction time worth nearly 6 years through our card saving feature! And of course, we pulled off the country’s biggest fintech event in FTX 2019.

So yes, 2019 has been an outstanding year for Razorpay. As we set foot into the new decade, we have big plans laid out to help businesses #OutgrowOrdinary. Through our acquisitions, we’ll help e-commerce businesses fight fraud and streamline their payroll processes as well. 

But until then, thank you for your support and faith in us. Here’s wishing you and your business a very happy new year, filled with boundless growth opportunities.

[Infographic: 2019, the year that was for Razorpay, in numbers]

Razorpay Announces its First Acquisition – Thirdwatch


It gives us great pride to announce the acquisition of Thirdwatch – an AI-powered company specializing in big data to reduce return-to-origin and fraud orders for e-commerce businesses.

According to industry estimates, the Indian e-commerce industry is expected to reach $150 billion by FY20. Fraud and RTO can result in losses of over $5 billion for these companies, since 4-5% of all orders fall under these categories. 

Thirdwatch’s AI-driven solution can empower e-commerce businesses to fight fraud at scale and reduce RTO by a significant margin. Our goal has always been to actively solve business problems through innovative technological solutions that can transform the Indian economy.

The Thirdwatch acquisition will further help us in building core competencies in big data and AI to solve unique business problems across the industry. 

Here’s what Razorpay CEO and my Co-founder, Harshil Mathur, has to say about the acquisition.

“This acquisition is a perfect fit. Our war is against cash, hence we want to address all problems surrounding it through new integrated data science technologies. Fraud has been the albatross around ecommerce companies’ necks for the longest time and we believe through this acquisition we will empower ecommerce businesses to digitally transform and disrupt, by improving their response and redressal mechanisms of combating fraud.”

He added, “The team at Thirdwatch comes with an exceptional understanding and expertise in AI, machine learning and data sciences and together we envision a future where AI will help e-commerce firms not just combat fraud but maintain a competitive advantage and significantly improve profitability. Together, I believe we can help reduce frauds by 30-40% by next year.” 

At the same time, Shashank Agarwal, Founder of Thirdwatch said, “We’ve always believed in developing technologies that will not just limit e-commerce transactions to be secure and seamless but also make the systems intelligent with real-time insights through AI. A similar commitment was echoed by the Razorpay team and that’s what impressed us and brought us together.”

He added, “Fraud has been one of the largest and longest concerns for e-commerce companies. Most of their systems frequently fail while identifying fraudulent patterns and therefore lack the capability of differentiating between genuine customers and fraudsters. There is a dire need for a data-driven solution to help identify these patterns and reduce losses of any kind, to help the marketplace function at an optimal level. We are really excited to be part of the Razorpay family, and together, we hope to make AI accessible and helpful to every business, amplifying human resourcefulness with intelligent technology” 

In India, digital transactions are growing at a fast pace of over 50% every year. While this spells good news for the e-commerce industry, it also means that the threat of fraud and RTO becomes even greater. Indian e-commerce businesses need to nip this problem in the bud and together with Thirdwatch, Razorpay aims to aid them to do just that. 

RTX Bangalore – Titbits on User Acquisition and Retention


Our 2nd RTX in Bangalore, and the 4th overall, began on a very interesting note.

The primary topics up for discussion were “Customer Acquisition” and “Customer Retention,” but we broke the ice by asking the panel about that one product and one service that they had been recommending to everyone around them.

Some of the names that came out were the usual suspects like Dunzo and CRED, but the really interesting one was Google Photos.

Anshul Agrawal of Urban Ladder and Manu Prasad of Scripbox both said they recommend Google Photos to the people they know because the app’s tagging system makes it easy for them to find the exact picture they’re looking for.

Aravindh Radhakrishnan of Zoomcar was the first to recommend a product from the offline world–Puma. “Puma has carved a niche for itself, especially among the community of football fans,” he said.

Another product from the “real world” to be recommended was Milano Ice Cream, by Lizzie Chapman of ZestMoney. Thanks to her, all of us were craving some right away.

But the ice cream was soon forgotten when we got down to the business at hand. This RTX event was a gathering of product leaders from companies like Urban Ladder, Myntra, HealthifyMe, Scripbox, Zoomcar, CRED, and ZestMoney.

The idea was for everyone to share their insights on how they acquire and retain users–insights that the others could then take back with them and maybe, implement the very next day.

Anjan Bhojarajan of HealthifyMe began by talking about how user testimonials helped them acquire new users.

When someone sees real-life examples of how a service has benefitted others, they are more likely to try that service out, Anjan said.

Another important point he talked about was reducing pricing as a barrier for new users.

HealthifyMe has diet plans that could be expensive for many users, but this price is justified by the fact that the plans are specially customized by an expert for every individual user.

But because we had so much data, we were able to build AI-powered diet plans for specific use cases, said Anjan.

These plans were cheaper and helped HealthifyMe remove pricing as a friction point for new users.


For Urban Ladder, pricing is not a barrier because their customers are the ones who are willing to make big-ticket purchases.

When customers are buying expensive items, they are fine with the purchase process being offline as well, said Anshul, citing the example of their tie-up with Bajaj Finserv. He also said that referral programs helped them acquire users. “But a good referral program can get abused by one person using different numbers and email addresses,” he cautioned.


Talking about programs, Sudhakar Pandey of Myntra said that Try & Buy has been a great customer acquisition lever for them. But at the same time, bringing down returns remains a high priority.

More loyal customers tend to return more because they are more likely to try more categories and be more adventurous in their choice of products, brands, etc, said Sudhakar. And COD customers paradoxically return less often than prepaid customers. However, they are more likely to refuse to accept a shipment.


This is because customers in India seek value and the best product. “They have zero loyalty to marketplaces that compete only on price,” opined Lizzie Chapman of ZestMoney. While that opened up an entirely new debate, Lizzie also said that for ZestMoney, targeted ads on social media were more effective than mass SMSes. “Our customers have fewer concerns about data privacy compared to other markets,” she said. “This is a concern, so we are careful not to take data from them without their consent and education.”


Giving their customers something useful is a big part of CRED’s user acquisition as well.

As Rahul Harkisanka of CRED said, We have built CRED with a premium and tech-savvy customer in mind, and we make sure that customers feel that the support representatives understand, and appreciate the nuance of their problems.

Full-page ads in leading English dailies worked well for CRED to acquire new users.


Even I downloaded the CRED app after seeing that ad in The Economic Times, said Manu of Scripbox.

For Scripbox, giving their customers the facility of investing through UPI has been a game-changer.

We expect 50% of all transactions to happen through UPI in the coming year, said his colleague, Ashish Malhotra.


With that, the discussion moved towards customer retention. For any business, retaining customers is probably more important than acquiring them. And one bad experience can lead to a customer going away for their lifetime.

Both Anshul and Aravindh vouched for this. Service, then, becomes a big differentiator for all kinds of businesses. The panel was unanimously nodding heads on that one.


The panel also agreed that customer service becomes a major focal point for retaining customers.

For Zoomcar, said Aravindh, customer service is important because every car rental is tied to a life event like family gatherings or a friend’s wedding.

What he meant was that when there is a service disruption, it’s important for product leaders to understand the impact it has on the users’ lives. And build customer care to empathise with it.

By that time, the beers and pizzas were out and everyone’s attention shifted towards more casual conversations. As is the case with such events, networking becomes an important part.

The idea behind RTX is to put forth a forum where industry peers can interact with one another and help each other with insights and ideas they can use. And this RTX had no shortage of it either.

Razorpay 2.0 – A Year In Numbers

A little over a year ago, we launched our suite of converged payment solutions – which we called ‘Razorpay 2.0‘.

The products came into being because of a single insight – that businesses in India needed a complete payment solution to handle all aspects related to the flow of money, right from the moment when a payment is initiated to the point it is fully reconciled and disbursed to the final destination.

One year later, the five new products we launched in our 2.0 suite have all seen huge growth and increased adoption.

And we’re just getting started. The future of payments has only just begun!

[Infographics: Razorpay 2.0 growth numbers and statistics]

Razorpay Checkout: How We Design for Faster Payments


If there’s something every Indian e-commerce shopper hates, it is complications at the final step of the purchase. A great user experience at the checkout stage is a problem that many payment players in the industry have been trying to solve, for a very long time.

Why you ask? Because the checkout is the final confirmation of your customer’s intent. We know that CoD orders have a high chance of being cancelled. The aim for every e-commerce business should then be to nudge their customer towards making an online payment.

Sales, offers, cash-backs – these are some ways of ensuring an online payment. A good checkout design, however, is still paramount.

But it’s not easy designing checkouts for the Indian e-commerce universe!

This is because of the unique characteristics of the Indian payments space. Hundreds of payment methods, various modes of authentication, the involvement of multiple entities, elaborate policies and regulations, users who are new to online shopping and are still discovering its nuances; and to top it all, slow internet speeds!

Jokes aside, the Indian payments industry is as challenging as the Apollo mission was. E-wallets, for instance, are more than a decade old, and at last count, there were about 80 e-wallets in the country. Compare this to our next-door neighbour China where the wallet industry is primarily binary (AliPay and WeChat), and you get the difference!


But here's the fact of the matter, as I see it. Digital payments are complex, yes. But they should be complex for payment providers like us at Razorpay, and not for the end users. For someone looking to order their nightly grub, an online transaction should be as simple as wiping out the universe in the blink of an eye.

At Razorpay, it has been the joy and pride of my team of designers and me to work on simplifying the user journey from desire to purchase. This is what our hard work, and design genius, has achieved.

Razorpay Checkout: The Evolution of Simplified Online Payments

Razorpay Checkout is the magical form which converts your money into a late night dessert fix. Also, the form which helps you pay for that Ginger Chai on Chai Point, within 30 seconds. In essence, the checkout is a single line of Javascript code in the clients’ website or app that has been simplifying payments for well over the last three years.

In its first avatar, the form offered support for Card, NetBanking and Wallets, but did not have an option for EMI, or to save cards.

Its USP: it would open up on the merchant’s native website, without redirecting to another page. This lowered payment failure rates by removing one hop in the process, and saved precious moments during checkout – and we all know how even a split-second delay can change a user’s mind.

The ‘Retry’ button on our checkout was also aimed at reducing drop-offs. Unlike other payment gateways at that time, the Razorpay gateway did not erase the data entered during checkout, and customers could retry a failed transaction with just the press of a button instead of re-entering every piece of information.

This version quickly got old, though. And we went on to create newer versions with updated UI elements and the latest online payment modes.

[Image: the evolution of the Razorpay Checkout design]

So, what’s special about this?

Glad you asked! A checkout form looks really easy to design, but the devil – as always – lies in the details. Allow me to take you through a user’s payment journey in detail, and explain how the Razorpay checkout makes every step easy for the customer.

Step 1: The checkout form opens.

Think of this scenario. You’re on Chai Point. Everything is their dark green colour. Suddenly, you see a blue-coloured Razorpay checkout while paying. Odd, right?

A customer picks their favourite products from an e-commerce website and is then directed to a new site for payment. This redirection on the checkout page creates confusion, lowers the customer’s trust, and leads to drop-offs. The solution: a checkout form that opens natively on the website.

What if the checkout form looks closer to Chai Point’s actual website? What if the design makes the user feel that the payment process is part of the website they are buying from?

That’s exactly what the colour logic we devised in-house does. It takes the merchant’s colours, adds them to the checkout form, and makes sure everything looks and feels like one cohesive information flow. Since our form uses iFrame, it opens up within the website the user is on; without any redirects.


Step 2: Filling your payment details on our checkout form.

The first rule of good checkout design is: No One Likes to Fill Endless Details! And Razorpay’s intuitive checkout design handles multiple user touch points without ever making the experience feel overwhelming.

So, while the user has a plethora of payment options to choose from, each of these options has its own UX tweaks to make the experience better. Let me explain these in detail here.

Cards: Making a payment online via cards is simple, right? You fill in the card information, authenticate using an OTP or password and money is out of your pockets, right?

Well, it sounds simple but when it comes to the experience, we have done a lot to make it much simpler than it usually is:

  • We make the input fields auto-focus right after you’re done filling in the previous field. Seems obvious and trivial but this little tweak, when overlooked, creates a lot of annoyance.
  • Maestro cards with no CVV? No worries! A card with an option to authenticate via the ATM PIN? Bring ‘em. Every edge case is meticulously handled by our intuitive design so that the pain points for every user are minimized.
  • We’ve come up with a Flash Checkout feature to help customers save their card details on Razorpay. This has reduced the total time for every transaction by as much as 60%. The best part, you save your cards once on Razorpay and enjoy this feature across all of our 80k+ merchant base. The data is encrypted and kept safe, but it does help lessen transaction times.
  • Features like OTP auto-fill allow us to reduce the transaction times even further. We hate waiting for those OTP messages too 😅 So, we auto read them and make the user’s life even easier.

Power Wallets: When a user makes a transaction via a wallet, the normal payment flow takes them to the wallet or app’s page, where they enter their OTP/login authentication to complete the transaction. To shorten the transaction time and lower drop-offs, we’ve teamed up with wallets like Mobikwik, OLA Money and Freecharge to come up with ‘Power Wallet’, a hassle-free no-redirection payment flow for wallets.

UPI: When UPI was a new player in the market, there was a lot of confusion about it among users. But not with Razorpay! Our intent-based UPI solution is so straightforward that users don’t need to remember their UPI ID at all. Just one click on your favourite UPI app and you are one step closer to completing your payment.

Curious how we do this? You can read more on UPI intent and how we leverage it in our design. Also, before I forget, we were the first in the industry to help users with UPI and its complexities 😉 *not bragging at all*
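For illustration, here is roughly what such an intent link looks like. The function below is a hypothetical sketch using the public UPI deep-link convention (pa for the payee VPA, pn for the payee name, am for the amount, and so on); it is not Razorpay’s actual implementation, and all the values are made up.

```ts
// Rough sketch of a UPI "intent" deep link. Opening such a link on a phone lets
// the OS hand the payment over to an installed UPI app, so the user never has
// to remember or type their UPI ID.
function buildUpiIntentLink(params: {
  payeeVpa: string;   // e.g. "merchant@bank" (illustrative)
  payeeName: string;  // e.g. "Chai Point"
  amount: string;     // in rupees, as a string, e.g. "150.00"
  reference: string;  // merchant's transaction reference
}): string {
  const query = new URLSearchParams({
    pa: params.payeeVpa,
    pn: params.payeeName,
    am: params.amount,
    tr: params.reference,
    cu: "INR",
  });
  return `upi://pay?${query.toString()}`;
}

// Example:
// buildUpiIntentLink({ payeeVpa: "merchant@bank", payeeName: "Chai Point",
//                      amount: "150.00", reference: "order_123" })
```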

Step 3: Complete the payment. Get your ginger chai!

What’s next?

“Digital design is like painting, except the paint never dries,” said Neville Brody.

Designing a checkout form feels like wet paint at times, because of the ever-changing landscape of the Indian payments industry.

But therein lies the fun: having to redesign something you thought was already perfect can be a great teacher. What I have learnt over the years doing this very thing calls for a blog of its own!

The constant, though, has been this: as more customers connect to the online world than ever before, they are increasingly spoilt for choice and expect quality, personalized experiences from the fintech industry.

The best way to keep track of your audience’s wants is to look into your consumer data and find trends and patterns that help you determine intent and expectations. And use these insights to create UX design awesomeness!

At Razorpay, we do this every day.

Not only do we keep up with the latest trends in B2B design and the new payment modes to add to our checkout; we also leverage data, always, to design for our end customers’ specific needs and solve industry pain points.

And this, at the end of the day, is what good UX design does: it solves, informs, captivates, and allows users to buy into your brand (or in this case, buy from your brand 😛).

[Image: We’re hiring designers at Razorpay]

Inspired by what you read? Interested in solving payment problems with design? We’re hiring designers across levels! Shoot a mail to hiring.design@razorpay.com

Want to Make Your Company a Great Place to Work? Here Are 4 Powerful Secrets!

Razorpay is now GPTW certified.

There is a wonderful blog on the Harvard Business Review that begins by stating that workplace culture begins once the CEO walks out of the room. (The title is hilariously apt, and haven’t we all been in a similar situation?) At Razorpay though, we have always tried to subvert the norm.

Workplace culture, as a popular phrase goes, is what people do when no one is watching, both instinctively and of their own choice. For us, it has always been about creating and nurturing an environment where the company’s desire to be a market leader is also a personal goal for every employee.

This belief has been the cornerstone of all things Razorpay, and it is why being recognized by GPTW as one of the great places to work in India means so much to us!

The GPTW award proves that our version of ‘workplace culture’, with its unique mix of transparency, autonomy, and a people-first approach, is a winning formula after all.

Before I tell you how we built this culture at Razorpay, let us take a look at the GPTW Institute’s process of choosing awardees.  

Let’s understand what this award truly means…

The Great Place to Work Institute is a world leader in workplace culture assessment. For over thirty years now, this institute has conducted seminal research into what makes great office culture; and helped businesses succeed by creating stimulating and cooperative workplaces.

How? The GPTW award is a culmination of a series of surveys and employee interviews that are anonymous, and which the executives of the company have no participation in. The institute’s thorough, perfected-over-years process of assessing employee satisfaction is a unique measurement of workplace health.

What the GPTW has learnt over three decades is that one of the core drivers of company culture is trust. And the surveys and phone calls and interviews the institute conducts are time-honed processes that probe into these very feelings.

And the process validated the tenets we follow at work…

It felt both reassuring and exhilarating that the questions asked by the GPTW panel were questions we ask of ourselves every day. And while there is no ‘secret sauce’ to creating good workplace culture, here are four ingredients from Razorpay’s treasure trove that I know will add to your existing recipe:

1. Focus on your ‘tribe’

As an organisation, our prime focus has always been to create an environment and a workplace culture that nurtures every individual who walks in. The ‘employee handbook’ for us is not just a set of rules and guidelines, but also the Holy Bible of our company culture. And the basic directive is simply this: ‘People First’.

By this, I mean that as a company, it is imperative that you remain interested in your employees’ growth. Trust and transparency help. So does having a workplace culture where talking about the elephants in the room is not considered taboo. And here, as always, the devil lies in the details.

Small things matter, and at Razorpay we do a lot of these small things, like sponsoring employee attendance at events they would like to be a part of, and having open-door policies for all our meetings and even our Slack conversations.

2. Hire right

Culture is easy when you are a group of friends working together. It gets harder with each new person you bring to the table. At Razorpay, we are extremely selective about who we invite to the tribe! Our interview process is very thorough, and we encourage cross-team interviews so every stakeholder contributes towards the hiring process.   

At the same time, we try our best to make the interviewee comfortable. I firmly believe that it is when people feel comfortable that they show you what they are really made of.

Everyone prepares for a hiring interview, but does everyone prepare to let their human side shine through? No! And if you can get through to them and understand how they are more than just a sum of their skills and experiences, you can add the best team members to your tribe.

3. Find your buzzwords. And then make them real

We foster ideation and creativity at Razorpay. Yes, they seem like buzzwords; but for us, they are our daily to-dos. We encourage people to try new things, make their mistakes and learn in the process. This creates a platform which allows everyone to play to their strengths and gives them the freedom to be their best selves at work.

When you take away the fear of failing; you create space for innovation and creativity – our prime goal from day zero!

4. Don’t delay defining your ‘culture’

Some startups start thinking about company culture only when their employee strength reaches triple digits. I came to Razorpay when it was just a bunch of ten fresh-out-of-IIT grads trying to create a product they believed in, and yet they already knew what their workplace would be like.

And that, I believe, has played an important role in the way our culture has evolved and stayed true to its genesis.

From then until now, little has changed. The ‘People Ops’ team (we don’t call ourselves HR here!) works day in and out to ensure everyone on the floor has a fulfilling time at work.

Be it the TGIFs, company-wide support practice, WhatsUp sessions, or the open ‘All Hands’, our brief has always been to keep everyone happy. Because happy employees are employees who are invested in your company’s growth and give you the best of themselves, willingly.

So, that is how you go about creating a winning workplace culture…

As we scale up (and we have done a lot of that this year), the need to grow without changing our ideals becomes that much more important. Our goals are big, but we want to achieve them without compromising on the things that make us, us. And that includes our workplace culture.

And being recognized by the Great Place to Work Institute just tells us that we have been successful in doing what we set out to do – create groundbreaking products, with a fantabulous team of achievers and dreamers in tow.

Did I also tell you that we are currently adding to this glorious bunch of doers? If you’re interested, check this out and contact us right away! We also love our coffees, and sharing them with like-minded people 🙂