The Day of the RDS Multi-AZ Failover

On a fateful Friday evening in December 2019, when a few of us were looking forward to packing our bags and heading home, we got an alert from the internal monitoring tool that the system had started throwing an unusually high number of 5xx errors.

The SRE team quickly realized that one of our main applications (called “API”) was not able to connect to its RDS (MySQL) database. By the time we could make any sense of the issue, the application came back up automatically and the alerts stopped.

Looking at the RDS logs, we realized that the master instance had gone through a Multi-AZ failover.

According to the SLAs provided by AWS, whenever an instance marked as Multi-AZ goes through a failure (whether a network failure, disk failure, etc.), AWS automatically shifts the traffic to its standby running in a separate AZ in the same AWS region. The failover can take anywhere between 60 and 120 seconds, which is why our master instance automatically came back up after around 110 seconds and the application started working without any manual intervention.
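An application can ride out such a 60-120 second window with connection retries and backoff. Here is a minimal sketch (illustrative Python, not our production code; `with_retries` and its parameters are our own invention):

```python
import time

def with_retries(op, attempts=5, base_delay=2.0, max_delay=30.0):
    """Retry a database operation with exponential backoff.

    With the defaults, the total wait can exceed two minutes, which is
    enough to outlast a 60-120 second Multi-AZ failover window.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return op()
        except ConnectionError:
            if attempt == attempts:
                raise                       # give up after the last attempt
            time.sleep(min(delay, max_delay))
            delay *= 2                      # exponential backoff
```

Wrapping queries this way is what lets an application "come back up automatically" once the RDS endpoint points at the new primary.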

Replication failure

The API master instance has a set of 5 replica instances, which serve different read workloads across various applications.

While the application stopped throwing errors and started working, we received another set of alerts stating that the replication on all the replicas had failed.

All the replicas displayed a duplicate-key error message. We immediately shifted all the traffic going to these replica instances to the master instance so that the application would not serve stale or incorrect data to users.

The drawback of moving all the traffic to the master was that all the heavier selects moved to the master as well, and the CPU load on the master instance increased by 50%. Hence, our immediate move was to recreate all 5 replicas so that we could shift the replica load back as soon as possible.

The replica creation process internally creates a snapshot of the current data from the master instance and then starts a DB instance from that snapshot. The very first snapshot of a particular machine captures the entire data set, but subsequent snapshots are incremental in nature. In the past, we had noticed that these incremental snapshots take around 15-20 minutes for the API database.

While taking the snapshot, we got another reality check: the process was taking much longer than usual that day. After an hour or so, with the snapshot creation still in progress, we were forced to contact AWS tech support to check why the whole process was taking so long.

AWS tech support informed us that since the master instance had gone through a Multi-AZ failover, they had replaced the old master with a new machine, which is routine. Since this was the very first snapshot from the new machine, RDS was taking a full snapshot of the data.

So, we had no option but to wait for the snapshot to finish and keep monitoring the master instance in the meantime. We waited six hours for the snapshot to complete, and only then were we able to create the replicas and redirect the traffic back to them.

Once the replicas were in place, we assumed that the worst was over, and finally called it a night.

Data loss

The next day, on our follow-up calls with RDS tech support, we were told that replication does not usually crash during a Multi-AZ master failover, and that there must be more to the incident than meets the eye.

This is when we started looking into why the replication crashed. After matching the database and the application trace logs for the time around the incident, we found that a few records were present in the trace logs but not in the database. This is when we realised that we had lost some data at the time of the failover.

Being a fintech company, losing transactional data actually meant losing money and the trust of our customers. We began digging through the binary logs for that time frame and matching them with the data in the store. We finally figured out that the RDS database was missing 5 seconds of data. Right after these 5 seconds, the 5xx errors had started appearing in our application logs.

Luckily, we could extract the exact queries from the binary logs and go through the sequence of events from the application trace logs, and after an 8-hour marathon meeting, we were able to correct the data stored in RDS.

How Multi-AZ replication works

It was time for us to investigate why we fell into this situation in the first place. To solve the puzzle, we had to answer the following questions:

  • How does the RDS Multi-AZ replication work?
  • What steps does RDS take at the time of a multi-AZ failover?
  • Why was the data missing in the database?
  • Why did the replication crash?

We got on dozens of calls with RDS solution architects over the next week, and were finally able to connect all the dots.


In a Multi-AZ setup, the RDS endpoint points to a primary instance. Another machine in a separate AZ is reserved as the standby, in case the master instance goes down.

The MySQL installation on the standby instance is kept shut down; the replication happens between the two EBS volumes. That is, as soon as data is written to the primary EBS volume, it is duplicated to the standby EBS volume synchronously. This way, RDS ensures that any data written to the master EBS volume is always present on the standby EBS volume, and hence there will be no data loss in the case of a failover.

At the time of a failover, RDS goes through a number of steps to ensure that the traffic is moved over to the standby machine in a sane manner. These steps are listed below:

  1. Primary MySQL instance goes through a cut-over (networking stopped). Client application goes down.
  2. MySQL on the standby machine is started.
  3. RDS endpoint switches to the standby machine (new primary).
  4. Application starts connecting to the standby machine. Client application is up now.
  5. Old primary machine goes through a host-change (hard-reboot).
  6. EBS sync starts from the new primary instance to the new standby instance.

Looking at the process of the failover, it seems pretty foolproof; the replicas should never have hit any duplicate-key errors, and we should never have had any data loss.

So, what went wrong?

Incorrect configuration variable

We found a MySQL configuration parameter, innodb_flush_log_at_trx_commit, which is critical to a seamless failover.

InnoDB data changes are always written to a transaction log buffer that resides in memory. This data is flushed to the EBS disk based on the setting of innodb_flush_log_at_trx_commit.

  • If the variable is set to 0, logs are written and flushed to the disk once every second. Transactions for which logs have not been flushed to the disk can be lost in case of a crash.
  • If the variable is set to 1, logs are written and flushed to the disk after every transaction commit. This is the default RDS setting and is required for full ACID compliance.
  • If the variable is set to 2, logs are written after each transaction commit but flushed to disk once every second. Transactions for which logs have not been flushed to the disk can be lost in case of a crash.

For full ACID compliance, we must set the variable to 1. However, in our case, we had set it to 2. This means that even though the logs were written after every commit, they were not being flushed to the disk immediately.

After learning about this variable, everything suddenly became crystal clear. Since we had set it to 2, the data was committed on the master instance but not flushed to the primary EBS volume. Hence, the standby (new primary) never received this data, which is why we could not find it in the master instance after the failover.
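The loss window created by setting 2 can be illustrated with a toy model (a deliberate simplification: real InnoDB flushes are not aligned to whole seconds, and redo logging has more moving parts):

```python
def durable_after_crash(setting, commits, crash_at):
    """Toy model of which committed transactions survive a crash.

    `commits` is a list of commit timestamps in seconds. With setting 1,
    every commit is flushed to disk and survives. With setting 0 or 2,
    flushes happen once per second, so commits in the final partial
    second before `crash_at` are lost.
    """
    if setting == 1:
        return [t for t in commits if t <= crash_at]
    last_flush = int(crash_at)      # toy assumption: flush at each whole second
    return [t for t in commits if t <= last_flush]

commits = [0.2, 0.9, 1.1, 1.4]
durable_after_crash(1, commits, crash_at=1.5)   # all four commits survive
durable_after_crash(2, commits, crash_at=1.5)   # the 1.1 and 1.4 commits are lost
```

In our incident, those last unflushed seconds are exactly the 5 seconds of data that never made it to the standby.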

But why did the replicas fail? And why was the data found in the binary logs?

There is another variable, sync_binlog, which, when set to 1, flushes writes to the binary log immediately. As we had set it to 1 (which is correct), the data got written to the binary logs and the replicas were able to read it. Once the data was read, the replicas applied those DML statements and became in sync with the old master.


Let’s say the auto-increment value of one of the tables was X. The application inserted a new row, which got the auto-increment id X+1. This value X+1 reached the replicas, but not the standby machine. So, when the application failed over to the standby machine, it again inserted a new row with the auto-increment id X+1. This insert, on reaching the replicas, threw the duplicate-key error and crashed the replication.
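The collision can be reproduced with a toy in-memory table (illustrative Python, not MySQL internals; the `Table` class and transaction names are invented for this sketch):

```python
class Table:
    """Minimal table with an auto-increment primary key."""
    def __init__(self):
        self.rows = {}
        self.next_id = 1

    def insert(self, value, _id=None):
        rid = self.next_id if _id is None else _id
        if rid in self.rows:
            raise KeyError(f"duplicate key {rid}")   # replication crashes here
        self.rows[rid] = value
        self.next_id = max(self.next_id, rid) + 1

master, standby, replica = Table(), Table(), Table()

# Row X+1 is committed on the master and reaches the replica via the
# binlog, but the flush to EBS (and hence the standby) never happens.
master.insert("txn-A")             # gets id 1 on the master
replica.insert("txn-A", _id=1)     # replica applies the binlog event

# Failover: the standby never saw id 1, so it hands out id 1 again.
standby.insert("txn-B")            # gets id 1 on the new primary
try:
    replica.insert("txn-B", _id=1)  # replica applies the new binlog event
except KeyError as e:
    print(e)                        # duplicate-key error, replication stops
```

This is exactly the sequence that broke all five replicas at once: they had all already applied the row the standby never received.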

We went back to our old snapshots (luckily, we had kept snapshots of the old replicas before deleting them) and were able to prove that the lost data was present in the replicas.

Once our theory was proved, we immediately went to the master instance, changed the value of innodb_flush_log_at_trx_commit from 2 to 1, and closed the final loop.

Final thoughts

In retrospect, we’re glad that we dug deeper into the incident and reached the root of the problem. The incident showed us that we had always been vulnerable to data loss because of one incorrectly set configuration variable.

The silver lining, however, is that we learnt a lot about how RDS manages the Multi-AZ setup and its failovers. And, of course, we gained an interesting tale to tell you all!

The “Hype” is Real in Luxury Mobility


Fast, flashy and chic, luxury cars are, among all other machines, a part of our genetic make-up, which makes them hard to resist.

I mean, who wouldn’t love to drop a gear and disappear, right? Most of us would, but the option of renting a luxury car was available to only a few.

Dealing with unknown vendors, non-transparent pricing, safety issues, and poor customer experience was a perpetual problem, says Raghav Belavadi, CEO, Hype – Luxury Mobility.

Hype – Luxury Mobility was founded in 2017 to eradicate the above problems faced by people. The company offers self-drive luxury car rentals, chauffeur-driven and long-term rental services.

With over 2,800 luxury cars in 7 cities of India, Hype provides the most extensive fleet size with the broadest range of selection to premium customers.

As a tech company, selecting a competent tech-driven partner was our primary focus.

Razorpay offered excellent technological innovation, a strong pricing model, security for Hype’s high-value transactions, and customer management.

The multi-level payment process, updated GitHub code, and the ability to choose from functions like security deposit, wallet partners, EMI, and Pay Later immensely helped Hype’s customers.

Later, the company piloted its high-value transactions on Razorpay for 30 days and observed 100% system efficiency and 0% downtime on its national and international transactions.


The integration was flawless and boosted our transaction speed by 40%, POD efficiency by 23% and reduced the pricing overhead by 72%.

Hype has grown over 6X in 2019 and expects to expand its operations in two more countries by 2022. The company aims to become the largest luxury mobility company on land, water and air globally. 

Q&A with Raghav Belavadi, CEO, Hype – Luxury Mobility

What was the idea behind your company’s inception? 

Hype was founded in 2017 in Bengaluru to address the perpetual problems associated with luxury car rental. People who wanted luxury cars for specific occasions got tangled in various issues like renting from unknown vendors, non-transparent pricing and zero customer service.

Luxury cars have always been eye candy, but not everybody can own one. Also, there was no option for luxury car rental or self-drive.

Tell us about Hype’s offerings.

Hype offers car rentals in self-drive, chauffeur driven and long-term rental options. With over 2,800 luxury cars in seven cities of India, Hype provides the broadest range of selection to premium customers.

We have started a new luxury yacht rental service in Goa and plan to expand to other regions like Mumbai and Kerala. One hundred twenty thousand hours of rentals, 465 crores of assets driven, and zero incidents reflect the supreme technology-driven operations management of Hype.

Why did you choose Razorpay as your payment service provider?

We are a technology company at the core, and selecting a competent and tech-driven partner is always the focus of our operations. 

There’s a small story behind it, too. One fine day, Harshil Mathur, CEO and Co-founder, Razorpay rented a sports car from Hype. And as a happy customer, he talked to me about the advantages of partnering with Razorpay. 

The timing was right! Excellent technology innovation, price model, customer management and easy integration made Razorpay our natural selection. 

Power your payments with Razorpay Payment Gateway

Why Are Payment Gateway Settlements Not Instant?

Every business or enterprise expects its customers’ transactions to reflect in its account immediately. But even today, in the era of digital-first platforms, these settlements are not instant. To give you more clarity on the payment gateway settlement process, let us show you what happens behind the scenes.

Settlement process: A business’s perspective

Settlement is the process through which a business receives the amount paid by its end-users via online transactions for a particular product or service.

In this case, an individual pays via a payment gateway on the business’s website or app, and that amount is transferred to the business’s account by the payment gateway. 

Here’s what a typical settlement process looks like:

Step 1: The cardholder inputs their bank account or card details on the Razorpay checkout form to pay for a product or service 

Payment Checkout

Step 2: After successful authentication via OTP or 3D secure, the money is debited from the cardholder’s account, and the individual receives a confirmation notification for the same


Step 3: The transaction amount is routed via the card networks to Razorpay’s acquiring banking partners


Step 4: Once Razorpay receives the amount, it is settled to the seller’s bank account after the deduction of a specific fee


Why don’t payment gateways settle instantly? 

Payment settlements may look like a very straightforward process, but they are not. The primary reason settlements are not instant is that money movement at every step of the process is not immediate.

Also, the underlying complexity of reconciliation is an additional challenge that stands in the way of instant settlements.

Well, reconciling transactions can be a nightmare for accountants, and here’s why:  

  • Every bank has or offers a different settlement cycle. Hence, the time frame for the acquirers can vary. For example, Bank A may usually settle within a day, whereas Bank B may take two days to pay. Keeping track of these timelines and consolidating them all in a single sheet takes time
  • Refunds and chargebacks bring along a whole new level of complexity. In the case of refunds, a customer expects them immediately. But refunding the money without reconciliation puts the business at risk
  • The business owner can end up paying back an amount that was never received. Also, it would be a disaster if the bank statements and reconciliation documents failed to show a clear picture
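At its core, reconciliation is a matching problem: captured payments on the gateway side must be matched against what the bank actually credited. A minimal sketch (illustrative Python; real reconciliation files and field names look nothing this clean):

```python
def reconcile(gateway_txns, bank_credits):
    """Match captured payments against amounts the bank actually settled.

    Both inputs map transaction id -> amount (in paise). Returns ids
    that are settled, still pending at the bank, and mismatched in amount.
    """
    settled, pending, mismatched = [], [], []
    for txn_id, amount in gateway_txns.items():
        if txn_id not in bank_credits:
            pending.append(txn_id)          # bank hasn't credited yet
        elif bank_credits[txn_id] != amount:
            mismatched.append(txn_id)       # needs manual investigation
        else:
            settled.append(txn_id)
    return settled, pending, mismatched
```

A refund issued against a transaction still in the `pending` bucket is exactly the "paying back an amount that was never received" risk described above.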

When does Razorpay settle a business?

For every business on the Razorpay Payment Gateway, we define a settlement schedule: the time from the date of payment capture to when the business should receive the due amount.

As per the schedule established for the business, the settlement is created only for captured payments and refunds requested for the captured payments. 

Earlier, the complete process used to consume T+3 business days for domestic transactions (T being the date of payment capture).

Today, we are glad to announce that customers can unlock faster settlements at no additional cost. Yes, the settlement cycle has been upgraded from T+3 to T+2 by default.

Note: Razorpay doesn’t schedule settlements on weekends.

Here’s the Razorpay T+2 settlement chart:

Razorpay Settlements

Note: Maximum settlement time – four days
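The T+2 arithmetic, skipping weekends as noted above, can be sketched as follows (bank holidays, which also shift settlements, are ignored in this simplification):

```python
from datetime import date, timedelta

def settlement_date(capture_date, t_plus=2):
    """Add `t_plus` business days to the capture date, skipping weekends."""
    d = capture_date
    remaining = t_plus
    while remaining > 0:
        d += timedelta(days=1)
        if d.weekday() < 5:          # Monday-Friday only
            remaining -= 1
    return d

settlement_date(date(2020, 3, 5))   # Thursday capture -> settled Monday, Mar 9
```

A Thursday capture skipping the weekend lands on Monday, which is also why the maximum settlement time stretches to four calendar days.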

To settle this

Settlements usually depend on multiple intermediary hops. In the case of cards, the acquiring bank, card network and issuing bank are involved.

In the case of a payment method like UPI, along with the acquiring and issuing banks, NPCI also plays a role. This is why the time taken to process settlements also depends on the mode of payment used for the transaction.

The entire money movement process in online payments happens via nodal accounts. This means that payment gateways cannot earn interest from the money they hold or move on behalf of their customers.

At Razorpay, we are genuinely devoted to providing the best payment experience, not just for the businesses that work with us, but also for their end customers.

We truly believe that if online payments have to replace cash, then it has to provide the same ease of use, which is what we strive to achieve. 

To make settlements even smoother, and to help businesses manage their working capital requirements efficiently, we also provide 24x7x365 On-Demand Instant Settlements. Read about that here.

Click here to accept payments with Razorpay Payment Gateway.

Will Your Salary Structure Change this April Due to the New Tax Regime?


I have to admit that when our Finance Minister announced the new personal income tax slabs in Budget 2020, I felt like it was a very good initiative to reduce tax outgo of the middle class and thereby, increase their disposable incomes.

However, after having taken time to compare the new income tax slabs with the current ones, I believe that the new tax regime might not be such a great idea. We also crunched some numbers from our Opfin payroll management solution to check if taxpayers will be able to save tax under the new regime. Unfortunately, data shows they won’t.

This is the first of the three reasons why we believe that most taxpayers should continue with the current regime.

No decrease in income tax outgo

Razorpay acquired a payroll startup called Opfin last year, which helps startups and enterprises manage tax filing and compliance for their employees. We looked at the data of salaried employees on this platform, anonymously, of course, to compute their income tax liability under the current and new regimes.

Here’s what we found:

  • 78% of these employees will get taxed more by an average of Rs 25,000
  • 85% of employees with an annual income of less than Rs 10 lakh will have to pay higher taxes, averaging Rs 13,750
  • 60% of the people in the income range of Rs 10 lakh to Rs 20 lakh, will pay an average of Rs 47,677 more tax
  • Among those earning more than Rs 20 lakh, 58% will have to shell out an average Rs 89,208 more

These calculations were carried out on taxpayers earning different levels of incomes and availing different types of tax exemptions. We believe this sample set is an accurate representation of Indian taxpayers, and data shows that the new regime will make them pay substantially more taxes.

Tax computation will become complicated

In her Budget speech, Nirmala Sitharaman mentioned that the new income tax regime has been introduced to ease the process of income tax filing for individual taxpayers. However, it seems like the new regime will only further complicate things.

Firstly, the current regime has 4 income tax slabs, while the new regime will have 7. This itself will require a higher number of calculations. Taxpayers will also need to compute their tax outgo in the current as well as new regimes to determine which one to select at the time of filing tax returns.
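To illustrate the extra arithmetic, here is a quick comparison of the two slab structures (slab rates as announced in Budget 2020; cess, surcharge and the Section 87A rebate are ignored, and the taxpayer in the example is invented, so the figures are purely illustrative):

```python
# (upper bound of slab in rupees, marginal rate)
OLD_SLABS = [(250000, 0.0), (500000, 0.05), (1000000, 0.20), (float("inf"), 0.30)]
NEW_SLABS = [(250000, 0.0), (500000, 0.05), (750000, 0.10), (1000000, 0.15),
             (1250000, 0.20), (1500000, 0.25), (float("inf"), 0.30)]

def tax(income, slabs):
    """Apply marginal rates slab by slab to the taxable income."""
    total, lower = 0.0, 0
    for upper, rate in slabs:
        if income > lower:
            total += (min(income, upper) - lower) * rate
        lower = upper
    return total

# A hypothetical taxpayer earning Rs 10 lakh who claims Rs 2 lakh of
# exemptions (so old-regime taxable income is Rs 8 lakh):
tax(800000, OLD_SLABS)    # about Rs 72,500 under the old regime
tax(1000000, NEW_SLABS)   # about Rs 75,000 under the new regime, exemptions forgone
```

Even this toy comparison requires computing liability under both structures, which is exactly the extra work every taxpayer now has to do before filing.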

I don’t see how this will make tax computation any easier. Income tax filing is anyway something that taxpayers dread because of the various slabs, exemptions and forms involved in the process. Instead of incentivizing more people to file tax returns, the new regime might just end up deterring them. CAs will be happy, though.

Long-term savings will get discouraged 

Young India doesn’t save. Definitely not as much as it should. In fact, the younger generation is borrowing more than it can afford to pay back. Data shows that household borrowing is at an all-time high, while household savings are going in the opposite direction.

At a time when the government should promote long-term savings, it is discouraging taxpayers by giving them the option to forgo tax saving exemptions under the new regime. By design, tax saving investments encourage long-term investing because of the lock-in periods they come with. Taxpayers can also diversify their portfolio by using different tax-saving investments like ELSS funds, PPF, NPS, etc.

The forced nature of investing and staying invested through tax-saving investments is actually good for taxpayers over the long-term. But under the new regime, taxpayers won’t be incentivised to make tax-saving investments because they can’t avail exemptions. This will only hurt their long-term financial health.

Overall, we believe that the government needs to rethink the new income tax regime. With the economy doing poorly, it makes sense to increase disposable incomes. But that shouldn’t come at the cost of a decrease in long-term savings.

A version of this article was first published in YourStory.

Data Engineering at Scale – Building a Real-time Data Highway

At Razorpay, we have data coming into our systems at an extremely high scale and from a variety of sources. For the company to operate with data at its core, enabling data democratization has become essential. This means we need systems in place that can capture, enrich and disseminate data in the right fashion to the right stakeholders.

This is the first part of our journey into data engineering systems at scale; here, we focus on our scalable real-time data highway.

In subsequent articles, we will share our ideas and implementation details around our near real-time pipelines, optimizations in our storage layer, data warehousing and our overall design of what we call a “data lake”.

Understanding data classification

The idea of building a data platform is to collate all data relevant to Razorpay, in every way possible, into a consolidated place in its native format, which can later be processed to serve different consumption patterns.

In this regard, the data platform needs to handle a variety of concerns (including but not limited to) data governance, data provenance, data availability, security and integrity, among other platform capabilities.

In order to do any of the above, we need to understand the nature of the data. At a bird’s-eye level, data within Razorpay broadly falls into 2 categories:

  1. Entities: Capturing changes to our payments, refunds, settlements, etc happens at an entity level, where we maintain the latest state (or even store all the states) of each entity in our storage, in multiple manifestations that can serve all kinds of consumers later
  2. Events: Applications (internal, external, third party) send messages to the data platform as part of any business processing. By this, we broadly mean any and every system that ever interacts with the Razorpay ecosystem, which can potentially end up sending data to the data platform. While the respective databases can only answer for the final state, the events help us understand how each system/service reached that final state

Evolution and need for a scalable real-time data highway

With the explosive growth we have seen, we have constantly been on a quest to answer some of the following questions:

  • What has been the result of the experiments we run?
  • What is the success rate of different gateways, payment methods, merchants etc?
  • How do we make our internal business metrics available to all the respective business and product owners?
  • How can our support, operations, SRE and other teams monitor and set up alerts around the key metrics across different products and services?
  • How do we slice and dice all our products across hundreds of different dimensions and KPIs?

What was before?

Before we jump into solving the above asks, let us briefly look at what we used to have. We had built a traditional ETL pipeline that queries our application database (MySQL) on a batch interval and updates an internal elasticsearch cluster.

Not only did this power our customer-facing analytics dashboard, it was also fronted by an authenticated kibana dashboard for the activities above. For a certain set of business folks, the data was piped into tableau over s3/athena. For managing the ETL pipeline, we had written a framework on top of apache beam to pull the respective tables, with the joins and transformations, as a composable ETL pipeline. Creating a new pipeline was simply a matter of updating a few configurations.

At a very high level, the architecture of such a system looks like the following:

data engineering razorpay

  1. Input data is read from MySQL in a window period to make a PCollection of payments, with payment ID and details as a <K-V> pair
  2. In the next transform, we fetch the merchant keys and use a payments formatter to get the output PCollection
  3. In the final step, we write the PCollection to elasticsearch
  4. Kibana is used as a BI tool to monitor payment success rates and dashboards
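The transform chain above can be sketched in plain Python (a deliberate simplification of the actual beam pipeline; all function and field names here are illustrative, not our real schema):

```python
def read_window(rows):
    """Step 1: a window of payment rows -> {payment_id: details} pairs."""
    return {r["id"]: r for r in rows}

def enrich_with_merchants(payments, merchants):
    """Step 2: join each payment with its merchant and format the output."""
    return [
        {"payment_id": pid, "amount": p["amount"],
         "merchant_name": merchants[p["merchant_id"]]}
        for pid, p in payments.items()
    ]

def write_to_es(docs, index):
    """Step 3: bulk-index the formatted documents (stubbed out here)."""
    return [(index, d["payment_id"], d) for d in docs]

rows = [{"id": "pay_1", "amount": 500, "merchant_id": "m_1"}]
docs = enrich_with_merchants(read_window(rows), {"m_1": "Acme"})
write_to_es(docs, "payments")
```

In the real pipeline, each of these functions corresponds to a Beam transform producing a PCollection, which is what makes new pipelines a matter of configuration.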

To serve our customer-facing analytics dashboard, we wrote an internal python framework and an API layer that translates an SQL query into an elasticsearch query. As of today, elasticsearch versions 7 and above support SQL queries natively. However, we have been running this framework successfully in production for over 2 years (much before such a feature was available in elasticsearch), and it serves all our merchant analytics directly.

Even with the recent versions of elasticsearch, some of our aggregations cannot be directly translated into elasticsearch’s SQL query format. So, in essence, the merchant/customer dashboard queries our internal analytics API over a REST endpoint with the SQL-like query, which is converted internally into an elasticsearch query; the respective aggregations are run and the results presented back to the front-end layer for building the visualizations.
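The real framework parses full SQL; purely to illustrate the shape of that translation, a toy translator for a single equality filter plus a terms aggregation might look like this (the function and its parameters are our invention, not the actual API):

```python
def to_es_query(field, value, agg_field=None):
    """Translate `SELECT ... WHERE field = value [GROUP BY agg_field]`
    into an elasticsearch query body (toy version, equality filter only)."""
    body = {"query": {"bool": {"filter": [{"term": {field: value}}]}}}
    if agg_field:
        # GROUP BY maps onto a terms aggregation; size 0 skips raw hits
        body["aggs"] = {agg_field: {"terms": {"field": agg_field}}}
        body["size"] = 0
    return body

to_es_query("merchant_id", "m_1", agg_field="status")
```

The production translator additionally handles projections, ranges, date histograms and nested aggregations, which is where most of the complexity lives.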

This only solved the need for tracking physical database changes. In addition, our applications also emitted events specific to different use cases.

To get this working initially, after trying several expensive tools, we settled on using newrelic insights to power all our events use cases. We had been using newrelic for all our APM use cases, and we ended up powering our events and other metrics using insights.

While it worked for over 2 years, it started becoming quite expensive. In addition, detailed funneling and long-term queries became extremely difficult. Above all, it couldn’t easily be correlated with our database changes, primarily because the events were real time while the data capture was in batch mode. Also, joining visualizations across newrelic and kibana was turning out to be painful. In essence, the architecture of this system looked like the below.

data engineering razorpay

The following were some of the additional issues we saw with newrelic:

  • Data is locked in with newrelic and not easily exportable; data retention is 3 months only (retention is calculated based on usage)
  • Some of our funneling queries produce incorrect results for old data
  • Subqueries are not possible
  • The number of results is capped at 1000 rows max
  • Negative funnels are not possible
  • Referring to a property from a top-level query in a bottom-level query of a funnel is not possible
  • Order of events is not regarded in funnels. If your funnel says A -> B -> C, sessions are counted even when the actual order of events was C -> A -> B
  • Since newrelic is an external system, data enrichment (e.g. key mappings, customer mappings etc) cannot be applied on the fly, nor can it be done post facto. This poses a heavy challenge when multiple services want to enrich a request that spans different services
  • In addition, we cannot maintain any specific lookup tables (if needed) to enable custom enrichments (e.g. geo-ip lookup, mobile device mapping, user agent mapping etc)

What was the problem with the above?

While the above system served all these needs, it presented us with the following challenges:

  • As it is a traditional batch system, there are delays in our ability to slice and dice in real time
  • Scaling elasticsearch for heavy business queries was challenging. As a result, we had to set up multiple elasticsearch clusters (for internal and customer-facing use cases). In addition, tuning elasticsearch for our needs became a constant challenge
  • Data governance: we had to build a whole lot of access-control mechanisms on top of kibana to ensure role-based access control. Elasticsearch only supported search guard, which came with its own performance issues
  • Joins: some of our dashboards required us to join across a variety of databases and tables. Elasticsearch inherently does not support joins, which meant we had to make constant modifications to our ETL pipelines to keep our indexes up to date with these ever-growing needs
  • Schema evolution: our internal application schema is constantly evolving, and for every such evolution we had to rely on elasticsearch index versioning and aliasing strategies to ensure data correctness. This also required us to backport data across different indexes
  • Cross-joining events with DB changes: as mentioned above, we couldn’t easily do causation-correlation analysis at any given point. We had to export reports from each of the above systems (newrelic, tableau, elasticsearch) and needed manual intervention to understand any issue at hand
  • Availability: we also wanted all of this data, in some fashion, to be available to our data scientists, and that was turning out to be cumbersome, needing multiple different kinds of exports. In addition, the data governance rules became even harder to deal with in these situations

In addition to the above, we had multiple BI solutions being used internally for different stakeholders:

  • Engineering wanted to query through SQL like interface
  • Product Analysts preferred custom dashboards
  • Business analysts wanted richer visualizations
  • Marketing wanted other integrations around Hubspot, Google Analytics etc

In essence, there was a strong need to converge all our BI use cases into a single unified platform. The above issues were inhibiting us from exploring and analysing the data within the entire ecosystem. Earlier this year, our product team settled on a single BI tool to which all data will be tied.

Evolving to a real-time data pipeline

Sometime early this year, after the decision to unify the BI tool, the data engineering team was given the task of building a real-time pipeline, served through the unified BI tool, to handle the above issues.

The data engineering team was already building a scalable data lake to resolve some of these issues. However, with the need to handle our peak transaction loads and improve our operational excellence, the product team prioritized a real-time capability within the lake, exposed to all our internal stakeholders.

The long-term idea is to expose these capabilities to our customers in real time, thereby replacing our older version of the analytics dashboard. The data engineering team started by taking a close look at the scale of the problem to be handled. Here is a high-level summary of our findings:

  • We process several million transactions per day (~100M)
  • With just a small fraction of our application stack integrated into the data engineering platform, we are generating close to 0.5 billion events a day
  • The compressed size of our data within the lake, at this point, was close to 100+ TB

All of the above, just within a few months of building the data lake!

Let’s understand the above in a little more detail before we present the solution:

  • We have a variety of microservices that run as part of our core payment gateway system to handle a single successful payment
  • After a successful payment, a variety of other services handle post-payment processing activities like refunds, settlements, etc.
  • In addition, we have other products that directly and indirectly use the core payment gateway services, like subscriptions, invoices, etc.
  • Our front-end and mobile SDKs emit a variety of events into our system. We cannot use third-party systems like Google Analytics due to PCI norms and CORS issues, so all these events have to be piped into the lake
  • Over and above these, our internal microservices also emit events during different stages of their processing lifecycle

To address all of the above, we divide our discussion into real-time entities and real-time events.

Real time entities

Writing to a database is easy, but getting the data out again is surprisingly hard. If you just want to query the database and get some results, that’s fine. But what if you want a copy of your database content in some other system like data lake for real-time analytics?

If your data never changed, it would be easy. You could just take a snapshot of the database (a full dump, e.g. a backup), copy it over, and load it into the data lake. In practice, though, the data does change, and the snapshot approach poses two different kinds of problems:

  1. Most of the data goes through a state machine and hence, the state of the data changes rapidly
  2. Getting the up-to-date view of this data is challenging in real time.

Even if you take a snapshot once a day, you still have one-day-old data in the downstream system, and on a large database, those snapshots and bulk loads can become very expensive, which is not great.

So, what does the above mean?

  • We need to incrementally load data into a real-time streaming pipeline that feeds directly into the lake
  • We cannot expose our internal primary database to our BI tool, as it stores a lot of sensitive information
  • We want our real-time stream to be as performant as possible
  • We do not want to keep the data in our real-time stream forever, as its primary use case is instantaneous monitoring, visualization and alerting

Keeping the above in mind, the data team made the following assumptions:

  • We do not need all of this data forever, unlike in our traditional OLTP store. So, we decided to store the data as a rolling window over seven days (1 week)
  • We still want to maintain some basic governing facts here indefinitely (e.g. merchants, customers, card BINs, etc.)
  • We want this system to be extremely performant, able to serve queries as fast as possible
  • Some of the rolling aggregations are fairly complex and need to be computed with as much data as possible to achieve the desired latency
  • We want the change data to be captured here as soon as possible
  • In essence, all operations on this store will be upsert operations, as we do not want to keep a copy of any older/stale data
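Since every write is an upsert, the store never keeps a stale copy of a row. As a minimal sketch of what applying a change event looks like (table and column names here are hypothetical; in Postgres/TimescaleDB an upsert is expressed as INSERT ... ON CONFLICT):

```python
def build_upsert(table, row, key_cols):
    """Build a Postgres/TimescaleDB upsert statement for one change event.

    Non-key columns are overwritten from EXCLUDED (the incoming row), so a
    later change event always replaces the older version of the row."""
    cols = list(row)
    placeholders = ", ".join(f"%({c})s" for c in cols)
    updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in cols if c not in key_cols)
    return (
        f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({placeholders}) "
        f"ON CONFLICT ({', '.join(key_cols)}) DO UPDATE SET {updates}"
    )

# Hypothetical payments row keyed on id:
sql = build_upsert("payments", {"id": 1, "status": "captured", "amount": 500}, ["id"])
```

The generated statement can then be executed through any Postgres driver with the row dict as parameters.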

At a very high level, our architecture for solving this problem looks like the following:

[Architecture diagram: real-time entity pipeline]

The flow of data looks something like this:

  • A MySQL read replica instance is used to pull the data
  • We use Maxwell to handle change data capture (CDC) and also ensure that sensitive information is filtered out while reading the bin log
  • The Maxwell daemon detects changes to this DB and pushes them to a Kafka topic
  • A Spark consumer keeps reading from the Kafka stream and batches updates every few seconds (note: the minimum batch duration available in Spark is 100 ms)
  • Finally, change data is pushed to the real-time data store, where queries can be executed from the BI tool
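To make the filtering step concrete, here is a minimal sketch of scrubbing sensitive fields from a Maxwell-style JSON change event before forwarding it (the column names in the deny-list are illustrative; in practice this is configuration-driven):

```python
import json

# Illustrative deny-list: columns that must never leave the database boundary.
SENSITIVE_COLUMNS = {"card_number", "cvv", "customer_email"}

def scrub_maxwell_event(raw: str) -> dict:
    """Parse a Maxwell CDC event and strip sensitive columns from the row image.

    Maxwell emits one JSON object per row change, carrying `database`,
    `table`, `type` (insert/update/delete) and the changed row under `data`."""
    event = json.loads(raw)
    event["data"] = {k: v for k, v in event.get("data", {}).items()
                     if k not in SENSITIVE_COLUMNS}
    return event

raw = json.dumps({
    "database": "api", "table": "payments", "type": "update",
    "data": {"id": 42, "status": "captured", "card_number": "4111..."},
})
clean = scrub_maxwell_event(raw)
```

Only the scrubbed event is then produced to the Kafka topic.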

Choice of real-time data store

We evaluated a variety of existing data stores for the real-time use case. In essence, we wanted SQL capabilities that could be used by the unified BI tool. Most folks within the organization are comfortable with SQL, so we wanted something that fits the bill.

After evaluating a bunch of OLAP engines, we arrived at TimescaleDB as our choice. TimescaleDB is a time-series extension on top of a Postgres engine. This lets us avoid compromising on SQL capabilities while also giving us advantages such as rolling aggregate computation. In addition, we wanted the operational cost to be extremely low, with self-healing and auto-scaling abilities where possible.

We didn’t want to spend large amounts of money on a paid solution like MemSQL to solve these problems. Considering all of the above, TimescaleDB seemed like a reasonable place to start: simple enough to set up and maintain, and meeting all the respective criteria.

Real time events

As mentioned above, as of today only a small fraction of our workloads (front-end systems, mobile SDKs and a few core transactional apps) push events into the data lake. Despite this, the data lake is receiving close to 0.5B events per day.

As you would’ve guessed, with all the existing services pushing events, this number is only going to grow significantly. For a long while, we have had an internal event ingestion pipeline (codename: lumberjack), written in Go, which primarily relays incoming events from multiple producers to the desired targets.

In essence, for any app to tie its events into the lake, all it needs to do is register itself through a configuration. The reason for choosing Go over Java and others was to achieve an extremely high level of concurrency with a minimal resource footprint (CPU, memory, etc.). In addition, lumberjack was designed as a heavily I/O-bound application, as most of its work is simply processing: doing minimal validation/enrichment and transmitting events.

We already discussed some of the challenges we had with events being pushed to New Relic. So, we wanted to move all of the events into a central store, from where we could query using our unified BI tool.

We started making minor architectural changes to our event ingestion pipeline to arrive at the following:

[Architecture diagram: real-time event ingestion pipeline]

Lumberjack workers: We were originally pushing to AWS SQS. We wanted streaming capabilities, and SQS only supports long polling, so we decided to move to Kafka. Kafka gave us the ability to replay and manage offsets effectively.

Consumer: We completely removed the task of pushing events to New Relic. This way, we got rid of the Kafka consumer that was running on the lumberjack side. We moved this operation to a Spark streaming job, which reads messages from Kafka in order of appearance and streams them to an S3 bucket.

Sink (S3): The Spark streaming job sinks data at every micro-batch interval, which is configurable; currently, we have set it to 1 minute. Every micro-batch is accumulated in memory, so we can tune the sink interval based on data size. Again, the minimum micro-batch interval supported by Spark is 100 ms.
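The batching trade-off can be sketched independently of Spark. This toy sink (class and parameter names are made up for illustration) accumulates events in memory and flushes either when the configured interval elapses or when the buffer hits a size cap, which is essentially the knob we tune:

```python
import time

class MicroBatchSink:
    """Accumulate events in memory and flush them in micro-batches.

    `flush_fn` stands in for the S3 write; a flush happens when either
    `interval` seconds have elapsed or `max_size` events are buffered."""
    def __init__(self, flush_fn, interval=60.0, max_size=10000, clock=time.monotonic):
        self.flush_fn = flush_fn
        self.interval = interval
        self.max_size = max_size
        self.clock = clock
        self.buffer = []
        self.last_flush = clock()

    def add(self, event):
        self.buffer.append(event)
        now = self.clock()
        if len(self.buffer) >= self.max_size or now - self.last_flush >= self.interval:
            self.flush_fn(self.buffer)   # in production: write one object to S3
            self.buffer = []
            self.last_flush = now

# Flush after two events for demonstration; production uses a time-based
# interval (e.g. 1 minute) instead.
flushed = []
sink = MicroBatchSink(flushed.append, interval=60.0, max_size=2)
sink.add({"event": "payment.created"})
sink.add({"event": "payment.captured"})
```

A larger interval means fewer, bigger S3 objects (cheaper, friendlier to Hadoop) at the cost of freshness.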

Query execution: We use Presto for query execution. The advantage we get here is sub-second responses over a few million records.

S3 partitioning: In order to further speed up query execution over events spanning multiple days, we create daily partitions (registered via MSCK REPAIR) so that users can query using created_date as the primary partition key. This has been configured into our BI tool.
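The daily layout can be illustrated with a small helper (the bucket name and prefix are made up for the example): each event lands under a created_date=YYYY-MM-DD prefix, which Presto uses to prune partitions, and new prefixes are registered with the metastore via MSCK REPAIR TABLE:

```python
from datetime import datetime, timezone

def partition_key(event_ts: float) -> str:
    """Return the S3 prefix for an event, partitioned by created_date (UTC).

    One directory per day lets the query engine skip every partition that a
    created_date filter rules out."""
    day = datetime.fromtimestamp(event_ts, tz=timezone.utc).strftime("%Y-%m-%d")
    return f"s3://data-lake/events/created_date={day}/"

# After new daily prefixes appear, register them with the metastore so the
# query engine can see them:
REPAIR_SQL = "MSCK REPAIR TABLE events"

# 1564617600 is 2019-08-01 00:00:00 UTC
prefix = partition_key(1564617600)
```

A query filtered on created_date then touches only the matching daily prefixes instead of the whole bucket.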

Infrastructure setup

Our entire infrastructure at Razorpay is deployed and operated via Kubernetes. In essence, except for the Spark setup, we run and manage everything else on Kubernetes.

So, Maxwell runs as a deployment, Kafka runs as a Kubernetes daemonset exposed to the Spark pipelines, and TimescaleDB has also been set up as a Kubernetes daemonset backed by a remote AWS EBS volume. Connectivity from the BI tool to TimescaleDB is enabled over an NLB, and the AWS security group associated with TimescaleDB ensures the setup remains secure.

That aside, the Spark cluster is exposed to our BI tool, controlled again via an AWS security group, and only allows Presto queries to be executed. We use Prometheus for all our internal metrics.

Currently, since Spark doesn’t support injecting metrics into Prometheus out of the box, we funnel the metrics from Spark through lumberjack, which is scraped directly by Prometheus and exposed on our dashboards.

Databricks has an upstream patch on Spark for pushing Prometheus metrics to a push gateway, but it has not yet been merged into Spark core.

The major challenges

Real-time data challenges:

  1. Since the pipeline has to handle both DDL and DML logs, the order in which statements are committed to the data lake is crucial, and pushing data in the same order it was generated was a major challenge. We implemented custom logic that orders events by bin log file name and the offset within that file. We also have an internal schema registry, again deployed on Kubernetes, to manage this. It allows us to track schema evolution over a period of time and also ensures we can keep multiple copies of the data on the lake
  2. Kafka periodically slowed down due to limited partitions. This led to a lag in the data lake, which was fixed by partitioning on unique IDs
  3. Dashboard query performance was poor, so we implemented a custom user-defined function that aggregates the data in a rolling time window and caches the older aggregate data
  4. Because huge tables such as payments and orders see a high volume of transactions while small tables such as merchants see very few, we cannot distribute load uniformly across partitions. This leads to data-write performance skew
  5. MySQL GTIDs also cannot be used for sequencing in certain cases, so we have built custom sort and de-duplication mechanics to handle out-of-order events
  6. Replication delays: In order to avoid AWS inter-AZ data transfer costs, and to avoid pressure on the primary databases, we have designed Maxwell to read from a replica. As a result, at peak times, if there is replication lag, our real-time pipelines see the same delay in processing and transmission
  7. Scaling challenges around TimescaleDB: At the moment, TimescaleDB doesn’t inherently support clustering options. We plan to move this layer into a clustered mode using KubeDB, or perhaps use other mechanisms to get better clustering / MPP-style execution
  8. In addition, we could cut down the Spark processing time by moving this pipeline to Flink, which can stream directly from Kafka to the TimescaleDB endpoint
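The ordering and de-duplication idea above can be sketched simply (field names are illustrative): events get a total order from the (bin log file, offset) pair, since MySQL zero-pads bin log file names so lexicographic comparison works, and the latest event per primary key wins:

```python
def order_and_dedupe(events):
    """Restore commit order for CDC events that may arrive out of order.

    Sorting on (binlog file name, offset within the file) gives a total
    order; keeping only the last event per (table, primary key) discards
    stale intermediate states before the upsert."""
    ordered = sorted(events, key=lambda e: (e["binlog_file"], e["offset"]))
    latest = {}
    for e in ordered:
        latest[(e["table"], e["pk"])] = e   # later positions overwrite earlier ones
    return list(latest.values())

events = [
    {"binlog_file": "mysql-bin.000102", "offset": 10,
     "table": "payments", "pk": 7, "status": "captured"},
    {"binlog_file": "mysql-bin.000101", "offset": 500,
     "table": "payments", "pk": 7, "status": "authorized"},
]
final = order_and_dedupe(events)
```

Here the "authorized" event, though it arrived later, is recognised as older and dropped in favour of "captured".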

Real-time entities challenges:

  1. Since the events are pushed in small micro-batches, this leads to a lot of cost overhead on S3. In addition, during query execution, we were bitten by Hadoop’s small-file problem. We are still balancing the right micro-batch interval
  2. In addition, we want a unified way of keeping this data. So, we plan to move the immediate events into the real-time data store and sync them into the partitioned tables on a daily basis
  3. With the above change, we could quite simply move the Spark processing to Flink, where the Flink jobs stream directly to the TimescaleDB endpoint and Spark processes the daily batches with partitioning

Learnings and pitfalls

  1. To replicate MySQL DB transactions in the correct order on a non-MySQL datastore, a combination of GTID, XID and event types (commit start and end) needs to be used to order the transactions and replay the events
  2. Spark streaming has a lot of overhead and doesn’t play well with small batch sizes (at the millisecond level; that’s why we moved to second-level batches)
  3. Running SQL queries from Spark carries a lot of overhead. We need to instrument the right metrics, analyze queries in a timely fashion and enable the right kind of caching to optimize the queries
  4. A large portion of our data lake is built on AWS S3. This comes at a significant cost if not tuned well. For instance, the S3 data transfer cost bit us quite badly a few months back. As a result, we had to go through significant infra optimization, enabling VPC endpoints among other measures. Cost optimization continues to be an ongoing exercise
  5. Optimizing S3 by itself has posed enough challenges for us. In subsequent posts, we shall share our learnings, observations and the work we have done to optimize these

The road ahead

As much as we have been able to build some of these things at an extremely efficient scale and operationalize it, our journey doesn’t stop here. 

It has in fact, just begun. 

In subsequent posts, we shall talk about the journey of our data platform, our data lake, non-real-time use cases and the optimization techniques we adopted, among a variety of subjects.

Our journey thus far on the data side hasn’t really been smooth. We have failed, learnt and recovered. On the other hand, some of the most challenging problems we have faced have also been a lot of fun to solve. We wish to share these learnings along the way.

If you are interested in working with us or solving some exciting problems, please reach out to us or visit our careers page.

Authors: Birendra Kumar (Head of Data Engineering, Razorpay) and Venkat Vaidhyanathan (Architect, Razorpay)

Installing Razorpay Thirdwatch for WooCommerce in 5 Simple Steps

WooCommerce is one of the biggest platforms in the world for setting up an online store, and rightfully so, owing to its seamless functionality and ease of use. Thirdwatch from Razorpay is a plugin designed to detect fraudulent orders and reduce RTO for e-commerce businesses. If you aren’t aware of Razorpay’s entry into the e-commerce industry, allow us to explain what we’ve been up to and how to install Razorpay Thirdwatch in 5 simple steps.

What is Razorpay Thirdwatch?

Razorpay Thirdwatch is a first-of-its-kind fraud prevention solution for e-commerce businesses. Thirdwatch is an AI-powered platform that enables online sellers to prevent Return-To-Origin (RTO) orders and reduce losses by up to 30 percent. Thirdwatch’s AI engine evaluates every order in real-time and provides actionable results to weed out orders likely to result in RTO.

One of the small, yet significant components of Thirdwatch is Buyer Action, a feature that automates confirmation from customers. This can significantly reduce manual intervention while keeping fraud at bay. Read more about Buyer Action and how it impacts business here.

How does Thirdwatch’s AI-engine work?

Once integrated, the solution captures 200+ parameters from your online store analytics. It leverages an ensemble of AI algorithms and graph algorithms to flag an order with a high risk of RTO and enables the seller to either cancel or take corrective actions.

What happens to the processed orders?

The processed orders transition into one of the following two states:

  • Red: If the order is marked red, then the seller can either decline the order or take corrective actions like updating the address or getting a confirmation from the customer on order quantity, etc
  • Green: If the order is flagged green, then the sellers can go ahead with the usual flow and ship the order

What is the basis of screening orders?

There are a variety of parameters used to judge whether an order is risky or not. Following are the key parameters that play a critical role in screening the orders:

  • Shipping Address Profile
  • Device Fingerprint
  • IP Address Profile
  • Buyer’s History
  • Buyer’s Navigation Behaviour
  • Network Effects

Are there any customization options available?

Razorpay Thirdwatch comes with a host of options for easy customization. You can also customise the Thirdwatch plugin at the time of integration by accessing the open-source project, available here.

What are the steps to install Razorpay Thirdwatch for WooCommerce?

To make it easier than ever for merchants to install Thirdwatch, we’ve made a step-by-step guide to make your installation process quick, easy and hassle-free. Let’s get started!

Type 1: Direct Installation

Step 1: Download WooCommerce plugin from WordPress store using this link.

Step 2: On your WordPress dashboard, click on “Plugins” on the left tab, and search for “Thirdwatch” on the search bar on the right side.

Step 3: Install the Thirdwatch plugin and click on “Activate”. Once you’ve activated it, register your business account on the Thirdwatch Dashboard from here. If you’ve already created an account on Thirdwatch, log in using your email address and password from here.

Step 4: On the Thirdwatch dashboard, click on “Settings” to get your API Key. To generate an API key, enter your online store’s URL. 

razorpay thirdwatch woocommerce free installation

razorpay thirdwatch woocommerce free plugin install

Step 5: Head over to WordPress dashboard → Thirdwatch, enter your API key, check “Enable Thirdwatch Validation” and click on “Save Changes” (details of the API key are given below as well)

Type 2: Custom Installation

Step 1: Download the Razorpay Thirdwatch plugin from the WordPress Store, unzip the package and place the folder in the wp-content/plugins directory

Step 2: Now, click on the Plug-ins option in the left-hand bar on the WordPress dashboard. Under the Thirdwatch tab, click on activate.

Step 3: After successful installation of the plugin, click on the Settings button and check on Enable Thirdwatch Validation.

Step 4: To enter your API Key, you can sign up on the Thirdwatch dashboard for free. Upon signing up, you can find the API key in the Settings tab. Here’s a guide to fill the following details:

  • 🏁 Approve Status (Change order status when an order has been approved by Thirdwatch)
  • 🚩 Review Status (Change order status when an order is flagged by Thirdwatch)
  • ⛔️ Reject Status (Change order status when an order is rejected by Thirdwatch)
  • 💬 Fraud Message (Choose a custom message to be sent to the customer if their order has failed validation)

Step 5: Head back to WordPress dashboard → Thirdwatch. Click on “Save Changes”, and you’re good to go!

Yes, it’s that easy to install Razorpay Thirdwatch! With all-new features like Buyer Action on Thirdwatch, it’s easier than ever to keep a check on fraud and the losses that come with it. 

Install Thirdwatch for WooCommerce today and supercharge your business like never before. Start saving money by optimizing your e-commerce operations with Thirdwatch. If you have any questions, make sure to get in touch with us here, and we’ll be happy to help.

Everything You Should Know About Facebook’s Libra


With so much buzz about Facebook’s Libra all over the internet, we wanted to make things easier for you. We did a ton of research on this newest innovation, so you don’t have to!

Let’s jump right in.

Apparently, a couple of years ago, Mark Zuckerberg subtly expressed his interest in cryptocurrencies in an interview. It seems he was earnest about exploring opportunities in the financial services industry.

Facebook revealed Libra, a cryptocurrency, along with a consortium of 27 partners and associations earlier. Libra was conceptualised around a mission – to enable a simple global currency infrastructure that empowers billions of people. 

Let’s slow down and understand what the deal is all about.

What is Libra, exactly?

Facebook’s Libra is not just a cryptocurrency.  It’s a “reliable digital currency” all about delivering “the Internet of money” through an efficient infrastructure. The cryptocurrency is intended to be sent to any part of the world with a bare minimum of a fee. 

How is Libra any different from all the other cryptocurrencies out there? 

We thought you’d wonder. 

Since we’re all familiar with Bitcoin, let’s understand the whats and hows of Libra by painting a contrast.

Although built on the same fundamental axiom as Bitcoin, Libra aims to have a stable value, as it is backed by a number of currencies.

And, unlike Bitcoin, which is open in nature, Libra is not going to be so. A bank does not issue Bitcoin or manage it either. You can pretty much download a crypto wallet and get going. But Libra is more like digital money with traits of fiat money, if that makes sense.

Libra is also not as decentralised as other cryptocurrencies. In the case of Bitcoin, you can download the open-source code for free. There are speculations about Libra having multiple central nodes instead of one centralised node, controlled by a legion of stakeholders.

We all know how secure Bitcoin is because of its decentralised nature. It’s next to impossible to hack, being one among the most secure computer networks ever. We’ve already talked about Libra having multiple centralised nodes. This may create a few loopholes concerning security.

So, what’s the idea behind Libra anyway?

Libra is all about making financial services accessible for everyone, irrespective of their geographical location or financial background. The Libra case study talks about how people with less money end up spending way too much money on financial services. And, this is something that should not go unaddressed. 

With the belief that low-cost money movement will create better economic opportunities, Libra will charge a very insignificant fee for transactions.

Libra is also conceptualised to place an emphasis on advanced financial inclusion, ethical factors, and the integrity of the ecosystem.

How does Libra work?

Let’s talk a little about the flow of events. 

Imagine you buy Libra. What happens next?

The money goes into a bank account and stays there. It won’t budge because the idea is to match the value of a Dollar or Euro. When Libra’s value is that of a currency, it’s immediately backed by a Dollar or a Euro in the bank.

Why so?

Because the account holding Libra in a bank will generate interest based on its value, which can be used to repay the initial investors in the cryptocurrency.

Again, comparing with Bitcoin: Libra can be created without a limit on supply, unlike Bitcoin, which is said to have an upper limit of 21 million. And creating Libra is also not as laborious as mining Bitcoin, which consumes a lot of electricity.

If Libra works the way it’s told to work, we should all be able to send it to any business on a global scale. 

The best part is, you can also convert Libra back to your preferred currency. Calibra, Facebook’s wallet, will convert Libra at the current conversion rate and help transfer the money into a bank account.

What can you use Libra for?

Libra is built in a way that any organisation can accept a coin and make a wallet on top of it. So, Libra is not just limited to Facebook. The cryptocurrency is intended to be for all of Facebook’s users (about 2.7 billion), including Messenger. 

Libra can be used for multiple purposes. Since the partnership is branched out all the way to Uber, Spotify, and more, it’s expected that one can buy services on the partnered businesses through Libra. You can also run Facebook ads using Libra.

Facebook also went about setting up Calibra, a subsidiary. This is going to make Libra accessible to all users. The idea is to expand and build more financial services as a layer on Libra. 

You can set up a Libra wallet from any part of the world by providing identity proof. The only setback will be faced by regions that have limitations on the use of social networks. 

Libra is said to work all over the world from the year 2020.

How do things look for Libra in India?

We all know about the crypto ban in the country and the draft law that proposes a 10-year jail term for holding, selling or dealing in cryptocurrency. This could also mean that Libra may never make an entry into the Indian financial services landscape.

Considering how India is also moving towards a fintech revolution, Libra could do more good than bad if it were to set itself to work in the country. This is particularly promising for Facebook, since Indians are heavy users of Facebook and WhatsApp.

There is a lot of hush-hush about reputation, though. Facebook carries an underlying negative perception, since it hasn’t safeguarded its users’ information to the best of its ability. Concerns about security, since it’s money we’re dealing with, simply cannot be sidelined.

Speaking of security, Libra is not as decentralised as a cryptocurrency usually is. This can pose a threat, or help Libra represent itself as “not a cryptocurrency but is like one” and find its way into the Indian market. 

Let’s say Libra does enter Indian fintech. What could possibly happen? If you think about it, Libra could compete with our favourite, UPI. Since UPI, digital payments have become way easier than ever, mobile payments have had a breakthrough, and the country has moved a step ahead in its fintech journey. Libra could give UPI a run for its money.

Libra could also become one of the prominent methods of online transactions since payment solution companies will also come forth to support the same.

We’ve talked about UPI contributions from various cities and states of the country. Now, let’s talk about rural areas. From our previous report, we know where tier-3 cities stand with UPI. But can we all agree upon the fact that tier-3 cities have WhatsApp and Facebook users? Of course.

If people from these areas aren’t really catching on to UPI, could it be possible that since they already have WhatsApp and Facebook, or either one among the two, they’re a step closer to financial inclusion? 



Online Payment Fraud: What Is It and How Razorpay Prevents It


This is the second blog in our series on online security and fraud prevention. To understand more about online safety (how to distinguish between a secure and non-secure website, how to ensure you are making a secure payment) read the first part here. To understand how online payment fraud occurs and the steps to prevent it, read on!

There is a reason why banks put up disclaimers announcing that their employees do not ask you for sensitive data, or that you should never reveal details like your OTP to an unknown person.

Online payment fraud is a reality of the internet age we live in, and the numbers are only set to increase with increasing digital adoption in India. According to a study by the credit information company Experian and the International Data Corp (IDC), the fraud risk in India is currently pegged at 8.1 points; second only to Indonesia (8.7 points) and significantly higher than the average of 5.5 points in the Asia Pacific region.

A 2016 consumer study conducted by ACI Worldwide places India in fifth position in terms of total card fraud rates, behind Mexico, Brazil, the United States, and Australia.

As they say, the best weapon against any problem is education; so let’s begin by understanding the different types of payment frauds that occur in India and how online sites and payment gateways like Razorpay prevent it.

Online Payment Fraud: The Different Types

The most common types of online fraud occur via phishing or spoofing, data theft, and chargeback or friendly fraud. We have explained these in detail below.

Online Phishing or Spoofing

Phishing is the process of accessing one’s personal information through fraudulent e-mails or websites that claim to be legitimate. The information gathered this way can include usernames, passwords, credit card numbers, or bank account numbers.

The most widely used method for phishing is to redirect an online user (from an email or SMS) to an “official” website where they are asked to update their personal information. You are thereby tricked into revealing personal information that you would ideally not reveal to anyone else.

Phishing can also occur via other electronic means such as SMS, instant messaging, and on email. You can be redirected to make a payment on a website that looks legitimate, but which is created to capture your card details so they can be used later.

According to reports, India is the third-most targeted country for phishing attacks, after the US and Russia.

Data Theft

Sometimes, dishonest employees or partners can steal credit card data from businesses and use this for committing fraud. Most online sites take stringent measures to ensure that such privacy breaches do not occur.

Instead of storing credit card details as is, for instance, websites and payment gateways use methods like tokenization and encryption to keep the data secure.

Razorpay takes data security very seriously. We are a certified ISO-27001 compliant organization, which means we undergo stringent audits on our data privacy processes.

Chargeback Fraud or Friendly Fraud

Let’s say a customer makes an online purchase. Later, they claim that the purchase was made fraudulently and ask for a chargeback – even though they made the purchase themselves! (A chargeback – in the simplest of terms – is an order from a bank to business, asking it to return the amount paid for a possibly fraudulent purchase.)

This is known as chargeback fraud or friendly fraud: a business processes a transaction since it seems legitimate, only to be issued a chargeback later on.

Chargeback frauds cause GMV losses and are a hassle for any business. We have a Razorpay Chargeback Guide that will help you understand why chargebacks happen and take steps against fraudulent charges.

The Effect of Payment Fraud on Businesses

As per the current terms and conditions, a credit card issuer (i.e., the bank) does not consider the cardholder liable for any fraudulent activity; for both card-present and card-not-present frauds.

Therefore, payment frauds involving credit cards have a significant effect on the business community and a significant impact on a merchant’s bottom line. Every time a customer issues a chargeback, it leads to loss of both inventory and GMV. This is especially true for retail establishments, where the profit margins are usually small.

As for industries, the subscription industry continues to have the highest rate of fraud, for two main reasons:

  • Subscriptions are essentially a card-dependent service; wherein the USP of the service is that the customer does not have to make manual payments. It is easy to claim that one’s card was used without knowledge in such a scenario.
  • Fraudsters and hackers use subscription services to ‘test’ cards. Online subscription services usually provide a one-month free trial, but one needs a credit card to initiate the trial period. Since the value is negligible, such payments usually go unnoticed by a card owner. If the card details are incorrect, the subscription business shares a detailed authorization error; thus making it easy for the hacker to modify their strategy and continue using the cards.

Razorpay: How We Help Businesses Reduce Fraud and Mitigate Risk

Apart from the mandatory protocols, Razorpay has its processes (developed in-house by our tech whizkids) to detect and prevent fraud and mitigate risk. As a payment gateway and a converged payments solution company, we take data security very seriously.

By delving into our data and analyzing patterns, we have been able to institute processes that ably discern between a ‘normal’ and a ‘suspicious’ transaction with credible accuracy. These systems are divided into two types:

a) Systems for detecting ‘Merchant Fraud’

Merchant fraud occurs when someone creates a fake or bogus company with no intention of selling any product to the customer. The business appears legitimate; but since it offers no actual goods or services, all users who make an online purchase only end up losing their money.

As a payment gateway, Razorpay has strict processes in place to vet every company which uses our gateway for processing payments. Some of the ways we check for merchant fraud include:

KYC checks: Adhering to strict KYC norms even before we onboard a business is an integral part of fraud mitigation. We have an in-house ‘Risk and Activation’ team that runs background checks on new businesses and vets them before they are ‘live’ on our payment gateway.

At Razorpay, we take this check one level higher by monitoring all suspicious and potentially fraudulent businesses, and the transactions that originate from them.

Transaction monitoring: Razorpay Payment Gateway has an inbuilt ‘Risk’ logic which can sniff out a possible fraud faster than a K9 squad. Let’s say a merchant who gets 3-4 online orders in a day suddenly starts to get 300 daily orders.

A sudden spike in transaction velocity (number of transactions per minute/hour/day), volume (amount transacted for), or pattern (international orders for a local brand) is an indicator of fraud and our systems immediately flag such transactions for further investigations.

Our ‘Risk’ logic also has 72-odd rules for monitoring the thousands of transactions on our payment gateway every day. This logic is configured per merchant, and our rule pathway can easily differentiate between standard day-to-day transactions and those that carry a high probability of risk.
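A minimal sketch of how a velocity rule like the one above might work. This is not Razorpay’s actual logic: the `VelocityRule` class, the 24-hour sliding window, and the spike factor are all illustrative assumptions.

```python
from collections import deque

class VelocityRule:
    """Flag a merchant when the transaction count in a sliding 24-hour
    window exceeds a multiple of their historical daily baseline."""

    def __init__(self, baseline_per_day, spike_factor=10):
        self.baseline = baseline_per_day
        self.spike_factor = spike_factor
        self.window = deque()          # event timestamps (seconds)

    def record(self, ts):
        """Record one transaction; return True if it should be flagged."""
        self.window.append(ts)
        # drop events older than 24 hours
        while self.window and ts - self.window[0] > 86_400:
            self.window.popleft()
        return len(self.window) > self.baseline * self.spike_factor

# A merchant averaging 4 orders/day suddenly receiving 300 in one day:
rule = VelocityRule(baseline_per_day=4, spike_factor=10)
flags = [rule.record(ts) for ts in range(300)]
# early orders pass; once volume exceeds 10x baseline, the rest are flagged
```

In a real system the flag would route the transaction to manual review rather than block it outright, since legitimate merchants do have genuine spikes (sales, festivals).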

b) Systems for detecting ‘Customer Fraud’

Customer fraud occurs when a stolen or lost card is used for suspicious activities. It can also occur for other payment modes. Not only does this affect the user, but it is also detrimental to e-commerce websites as it increases cases of refunds and chargebacks, and leads to loss of GMV.

At Razorpay, we strive to protect both our merchants and our customers, which is why we conduct extensive transaction monitoring on this side as well. How do we do it? Here’s a peek:

Checking for hotlisted cards: Every time a card is used for payment, our gateway connects with the card provider to check if the card has been hotlisted. (Hotlisting means that the card has been blocked temporarily or permanently for use). This is done in real-time so that a verified transaction is still completed within seconds, while the suspicious ones get flagged.
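A toy sketch of the hotlist check described above. The `HOTLISTED` set, the `authorize` function, and the card numbers are all hypothetical; a real gateway queries the issuer or card network in real time rather than a local set.

```python
# Hypothetical in-memory hotlist; 4111... is a standard test card number.
HOTLISTED = {"4111111111111111"}   # cards blocked by the issuer

def authorize(card_number, amount):
    """Reject payment attempts on hotlisted cards before processing."""
    if card_number in HOTLISTED:
        return {"status": "flagged", "reason": "card hotlisted"}
    return {"status": "approved", "amount": amount}
```

The key property the sketch illustrates is ordering: the hotlist lookup happens before any money moves, so a verified transaction still completes in seconds while a blocked card never reaches authorization.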

Pattern-based transaction monitoring: We also use geographical and pattern-based transaction monitoring (as for detecting merchant frauds) to identify suspect transactions. This helps us in preempting and preventing chargeback frauds and other types of customer frauds. We have a hit ratio of being able to identify 85% of fraudulent cases in advance.

Online Fraud Prevention: The Future

Online fraud will remain a contentious issue in the days to come. The more we connect and transact online, the bigger the threat. And since online fraud cannot be eliminated entirely, the only way to contain it is through constant vigilance and regulation.

A good example here is the 3D Secure (3DS) protocol that VISA had developed to keep its customers safe, and which has since been adopted by other card companies like American Express, MasterCard, and JCB International.

A similar process is the 2FA used in India, which is mandatory for all cardholders and card-issuing banks. The RBI has also mandated online alerts for all card transactions – even those where the cardholder physically swipes their card at a PoS system.

For all transactions considered suspicious, cardholders have the option to issue a ‘de-activation request’ immediately and hotlist their cards.

The Indian government’s decision to appoint a nodal agency for dealing with phone frauds – called the FCORD initiative – is another praiseworthy step. We at Razorpay are also in touch with the MHA, which has designated the FCORD as the Nodal Agency for reporting and preventing Cyber Crime frauds in India, regarding the same.

While a zero-fraud system will take some time to achieve, we are constantly building new processes to minimize fraud risk for all consumers.

The bottom line though remains this: If you are building an e-commerce website, remember to follow all the protocols mentioned above and minimize the risk of fraud. Alternatively, find a payment gateway (hello there!) that has stringent security protocols already in place. We’re just a click of a button away!

How Secure Are Your Online Payments?


At Razorpay we strive to make every transaction done via our payment gateway a secure payment. We’re a technology-first online payments company and online payment security is in our DNA. We employ a ‘no stone unturned’ approach to safeguarding the interests of both the online businesses who use our products, as well as their consumers.

We also understand the assurance of secure payments is one of the primary drivers behind the choice of a payment gateway.

With the growing number of e-commerce users and transactions in India, it is important that we are all aware of the mandatory security protocols for e-commerce websites, so that we can avoid fraudulent situations. As the saying goes, prevention is better than cure.

In this article, let me walk you through the security protocols and processes followed at Razorpay, and which you should look for, too, every time you transact online.

[Image: online payment security architecture and information flow]

1. TLS Encryption

Data security on e-commerce websites or an online payment system begins the moment a user lands on the site. The TLS Certificate tells users that the data transmitted between the web server and their browser is safe.

As a payment provider, Razorpay uses the highest-assurance SSL certificate on its website: the EV SSL (Extended Validation SSL) certificate.

Without TLS Encryption in place, all data sent over the Internet is unencrypted and is visible to anyone with the means and intent to intercept it. An easy way to check if the e-commerce websites you frequent are SSL certified is to look at the URL and see if it uses ‘http://’ or ‘https://’ protocol.

The additional ‘s’ signifies a secure e-payment system. You can also look for the padlock icon at the beginning of the URL. Modern web browsers, in their race to make the web secure by default, now follow the opposite paradigm and mark plain HTTP sites as “insecure”.
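The scheme check described above can be sketched in a few lines. Note that this only inspects the URL string; it does not validate the site’s certificate the way a browser’s padlock check does.

```python
from urllib.parse import urlparse

def looks_secure(url):
    """Return True only if the URL uses the HTTPS scheme.
    A browser additionally validates the server's certificate,
    which this sketch does not attempt."""
    return urlparse(url).scheme == "https"

print(looks_secure("https://razorpay.com"))  # True
print(looks_secure("http://example.com"))    # False
```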

2. PCI-DSS Compliance

The PCI Security Standards Council is a global organization that maintains and promotes compliance rules for managing cardholder data for all e-commerce websites and online payment systems.

The Payment Card Industry Data Security Standards (PCI-DSS) is in effect a set of policies that govern how sensitive cardholder information should be handled.

Fact: The PCI Security Standards Council was created in 2006 as a joint initiative by the major card networks: American Express, Discover, JCB International, MasterCard, and Visa. Over the years, the PCI-DSS standard has become the guiding principle for online security across the globe.

For an e-commerce website or an online payment system to be PCI-DSS compliant, it has to follow certain directives:

Maintain a secure network to process payments: This involves using robust firewalls which can protect against malicious security threats. Further, the website or payment gateway should not use default credentials like manufacturer-provided PINs and passwords, and must allow customers to change this data as needed.

Ensure all data is encrypted during transmission: When cardholder data is transmitted online, it is imperative that it be encrypted. Razorpay encrypts all information you share using checkout via TLS (Transport Layer Security). This prevents data interception during transmission from your system to Razorpay.

Fact: On the Razorpay Payment Gateway, all the details entered by a user like their name, address, and credit/debit card information are used only to process and complete the order. Razorpay never stores sensitive information like CVV numbers, PINs etc.

Keep infrastructure secure: This directive involves keeping abreast of new PCI-DSS mandates, using updated software and anti-malware tools to protect against known vulnerabilities, and running regular system and software scans to ensure maximum data protection.

Restrict information access: An important part of securing online payments on e-commerce websites is restricting access to confidential information so that only authorized personnel will have access to cardholder data. Cardholder data must be protected at all times – both electronically and physically.

3. Tokenization

Tokenization is a process by which a 16-digit card number gets replaced by a digital identifier known as a ‘token’. This is done to ensure the safety of the original data while allowing payment gateways to securely access the cardholder data and initiate a secure payment.

Fact: Even if a website is breached and the stored tokens are stolen, it is immensely difficult to recover the actual card number from a token. Tokens are random references whose mapping to card numbers lives only in a secure token vault, which is not publicly accessible.

Credit card tokenization helps e-commerce websites improve security, as it eliminates the need for storing credit card data, and reduces security breaches. For more on how tokenization works and impacts online payments, you can read our in-depth blog.
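A toy sketch of a token vault under the assumptions above. The `TokenVault` class and its methods are illustrative; real vaults live inside PCI-scoped infrastructure, and merchants store only the token, never the card number.

```python
import secrets

class TokenVault:
    """Toy token vault: maps random tokens to card numbers (PANs)."""

    def __init__(self):
        self._vault = {}               # token -> PAN, kept server-side only

    def tokenize(self, pan):
        token = secrets.token_hex(8)   # random: nothing derivable from it
        self._vault[token] = pan
        return token                   # the merchant stores only this

    def detokenize(self, token):
        return self._vault[token]      # only the vault can map back

vault = TokenVault()
token = vault.tokenize("4111111111111111")  # standard test card number
```

Because the token is generated randomly rather than derived from the card number, an attacker holding only tokens has nothing to reverse-engineer; the mapping exists solely inside the vault.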

4. Two-Factor Authentication

Two Factor Authentication, aka 2FA, or two-step verification is an extra layer of security added by e-commerce websites to ensure a secure payment for a customer.

This is a customer-facing authentication process, mandated by regulatory bodies like the RBI: the transaction is processed only after the user provides a detail that only they could know, or have at hand (like a physical token or a security key). Many banks and other e-payment gateways also use 2FA for their own payment modes.

Fact: 2FA is not a newly-minted technology, but it has recently become the de-facto method of authentication in the digital age. In 2011, Google announced 2FA to heighten online security for its services. MSN and Yahoo followed suit.

When you use Net Banking for a transaction, you are first asked to enter your username and password. As a final confirmation, the bank sends you an OTP on your registered mobile number. This process, mandated by the RBI, is divided into two levels of authentication:

What the user knows: In this step, users fill in their card/Net Banking details such as username and password. This helps the payment gateway recognize which bank the card belongs to.

What the user (and only the user) has: This step is known as ‘Authorization’ and is done through the OTP/PIN/CVV. The bank (and the payment gateway) can then confirm that the request for payment was initiated by the rightful user.
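The second step can be sketched as follows. This is a simplified illustration, not a bank’s actual implementation: the OTP is generated and verified locally here, whereas a real flow delivers it out-of-band via SMS.

```python
import hmac
import secrets

def issue_otp():
    """Step 2 ('what the user has'): the bank generates a one-time
    password and sends it to the registered phone number."""
    return f"{secrets.randbelow(10**6):06d}"   # random 6-digit code

def verify(expected_otp, submitted_otp):
    """Constant-time comparison avoids leaking information through
    response-timing differences."""
    return hmac.compare_digest(expected_otp, submitted_otp)

otp = issue_otp()       # sent out-of-band in a real system
```

The design point worth noting is `hmac.compare_digest`: a naive `==` on secrets can, in principle, leak how many leading characters matched via timing.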

5. Fraud Prevention

Apart from these mandatory protocols, most e-commerce websites and payment gateways have their own fraud and risk prevention systems. Big data analytics and machine learning play a huge role in devising these risk prevention and mitigation systems.

By delving into our customers’ data and analyzing patterns, we at Razorpay can discern between a ‘normal’ and a ‘suspicious’ transaction with credible accuracy. Apart from this, there is a lot that you as a customer can do to reduce the risk of fraud.

Always remember that:  

– No legitimate institution will ever ask for your card data or passwords up front. Banks and financial service providers have safe protocols to gain admin access to an account if the need ever arises.

– Passwords are safer when you don’t write them down. Keep strong passwords that you can remember, change them frequently, and refrain from writing them down somewhere.

– You have the right to dispute suspicious charges on your card or accounts. Raise a chargeback request for any unidentified transaction on your card. You have a legal right to a resolution.

If you are building an e-commerce website, remember that fraud prevention requires that you follow all the above-mentioned protocols. Or find a payment gateway (hello there!) that has stringent security protocols already in place. We’re just a click of a button away!

UPI 2.0 – New Features, Missing Links, and the Effect on Indian Businesses


When UPI was first launched in 2016, it was rightly heralded as a game changer. We couldn’t agree more, because the inherent structure of the NPCI’s flagship offering is designed to become the sole platform for seamless interoperability of PSPs (Payment Service Providers) in the country.

But for it to become a truly universal payment option, the existing UPI solution needs to be more than a peer-to-peer platform. And the UPI 2.0 is just that! The upcoming launch is a much better, revamped version of UPI that will go a long way in increasing digital adoption in the business sphere, and for peer-to-merchant transactions.

UPI 2.0 – The New ‘For Merchant’ Features

We are all aware that the upgraded UPI will have many new features, such as an increased transaction limit of INR 2 lakh. However, it is the ‘for merchant’ features – the ones that will directly impact P2M transactions – that are of importance.

Use of overdraft accounts

Until now, UPI payments could be made only from savings accounts. But with overdraft accounts coming into play, merchants will be able to withdraw money even when there is a cash deficit in their account. Business, therefore, does not have to stop because of a temporary cash crunch.

Capture and hold facility

The facility to block a certain amount on a user’s card was already present due to a feature called ‘key auth’. Now, merchants accepting payments via UPI will also be able to do the same. Essentially, they will be able to block a certain amount of money on their users’ cards and debit or refund it at a later date.

With this feature, UPI will become useful for a variety of business verticals where it may not have been as popular before. Hotels, e-commerce companies, and cab-booking services can block amounts on their guests’ cards as an advance or as a security deposit.

Businesses can then refund the same once the booking is completed. This will also be of use when buying stocks or IPOs and other such transactions.
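The block/debit/refund lifecycle described above can be sketched as a small state machine. The class and field names are illustrative assumptions, not the UPI 2.0 API.

```python
class Hold:
    """Toy capture-and-hold lifecycle: block an amount, then either
    capture (debit) it or release (refund) it."""

    def __init__(self, amount):
        self.amount = amount
        self.state = "held"            # funds blocked, not yet debited

    def capture(self):
        """Debit the blocked amount, e.g. the guest incurs charges."""
        assert self.state == "held", "can only capture a live hold"
        self.state = "captured"

    def release(self):
        """Lift the block without debiting, e.g. booking completed."""
        assert self.state == "held", "can only release a live hold"
        self.state = "released"

hold = Hold(5000)      # hotel blocks INR 5,000 as a security deposit
hold.release()         # stay ends without issue; nothing is debited
```

The useful property is that a hold can end in exactly one of two terminal states, which is what lets a hotel or marketplace promise “we won’t charge you unless…” up front.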

Support for invoicing

Invoices, bills, or any other supporting documentation is not a necessity when making a peer-to-peer payment. A confirmation of receipt via mail or SMS is what most businesses look for.

However, in the P2M payment space, invoices are mandatory. So far, a merchant could only attach a description to a payment request. The support for invoices in UPI 2.0 means that businesses can use a single platform for sending invoices and receiving payments, instead of using separate mediums for the same.

Easy resolution of refunds

Another reason why UPI had not permeated deeply into the business sector was that refunds were not a part of the initial core spec: if a merchant needed to refund money to a customer, they had to issue a fresh, unrelated transaction. In UPI 2.0, refunds are mapped to the original payment, so that users and merchants have clarity on the refunds made.
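Refund mapping can be illustrated with a toy ledger in which every refund references the original transaction id. All names here are hypothetical; the point is the linkage, not the schema.

```python
# Toy ledger: one payment, tracked with its cumulative refunds.
payments = {"txn_001": {"amount": 750, "refunded": 0}}

def refund(txn_id, amount):
    """Issue a refund tied to the original transaction, rather than
    a fresh unrelated payment in the opposite direction."""
    txn = payments[txn_id]
    if txn["refunded"] + amount > txn["amount"]:
        raise ValueError("refund exceeds original payment")
    txn["refunded"] += amount
    return {"refund_of": txn_id, "amount": amount}

r = refund("txn_001", 750)   # full refund, linked to txn_001
```

Because the refund carries a `refund_of` reference, both sides can reconcile it against the original payment, and over-refunding is structurally impossible, which a fresh reverse transaction cannot guarantee.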

How Will the Upgrade Affect the Industry?

The increased popularity of UPI will also shrink the market for Wallets. Recent data shows that transactions on prepaid instruments like cards and Wallets declined by 14% between March 2017 and March 2018.

With UPI, this number will reduce further. Customers will want to use a platform that will allow direct bank transfer of money rather than uploading money into their Wallets.

[Image: UPI 2.0 adoption statistics]

There is no doubt that the added features will open more use cases in the business sector, and allow for greater permeation. The capture and hold feature, by itself, can create hundreds of new use cases in the e-commerce industry.

Support for invoices will convert UPI from just a transactional medium to an informational medium as well. Refund mapping will solve a crucial industry pain point which will also translate into better user experience and more transparency.

What’s Missing From UPI 2.0?

As a comprehensive payment firm, Razorpay has been ready for UPI 2.0 for a while. We follow mandatory KYC procedures for all businesses, making acceptance of bank payments a breeze. We also have the industry knowledge and tech required to build on the new use cases offered by UPI.

However, there is one feature missing from the new launch, which is ‘mandates’. Mandates or standing instructions mean that UPI can become the go-to payment option for all recurring payments like SIPs, Mutual Fund payments, monthly subscriptions etc.

Also, the withdrawal of biometric Aadhaar-based payment feature will render it unusable by those who do not own a smartphone. The smartphone penetration rate in India will reach 28% by the end of 2018. This means leaving roughly 72% of the population deprived of the choice to use UPI, or any other digital payment solution.

Looking Ahead: What the Payments Industry Needs UPI to Be

Customers always prefer simpler, ubiquitous payment solutions that can reduce friction during online transactions. UPI transactions are direct and easier; as compared to loading money into, and withdrawing from, a prepaid instrument.

UPI can easily replace PoS solutions and make accounting and reconciliation easier for merchants.

For it to become the leader in digital transactions, it has to offer ubiquity. And it must drive large-scale adoption of digital payments. NPCI also needs to build a large merchant-acceptance network; both online and offline, because that is where the real push for ‘Digital India’ will come from.

Originally published in Inc42.