How to provide auto-suggest and fuzzy search for a web application without breaking the bank on DynamoDB throughput

This is the fourth article in a series about implementing an AWS Serverless Web Application for global clients of a large enterprise. If you are lost, please read the first article.

Search is ubiquitous, thanks to Google and others. Customers expect to find what they are looking for quickly and easily. They bring this expectation to your IT organization, and we must deliver search just as quickly and easily. Search solutions are well documented and proven. While not trivial, this is a battle that has been won in the on-premises IT infrastructure world. Of course, there is a catch. On premises, providing fast, easy search across multiple data sets means provisioning and managing a cluster of servers running SOLR, Elasticsearch or some other software. These clusters can be run like any other system, but keeping them within cost and performance targets takes expertise that not every team has.

Enter “AWS Cloudsearch”. AWS Cloudsearch is a managed SOLR offering from AWS. This means that, as an IT manager, one just needs to provision capacity, declare the intent (auto-suggest or fuzzy search) and use the system. AWS takes over running the system and provides a nice API for searching as well as for monitoring the solution. In our case, we were able to provision search within a couple of days, with a pipeline to propagate changes through to production. The key input factors are the size of the data and the intent. Both can be adjusted as time goes by.

Today, I want to write very brief notes about how we implemented search for our needs. I have refrained from discussing business use cases for search, as these are well covered elsewhere. Just as examples, Cloudsearch enables Faceted Search (think Zappos), Fuzzy Search (“did you mean …?”) and other ways your customers might like to find data.
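To make the fuzzy, faceted and auto-suggest styles a little more concrete, here is a minimal Python sketch of how the request parameters might be built. It is not our production code. The field name `brand` and the suggester name `title_suggester` are hypothetical, and the parameter names assume the boto3 `cloudsearchdomain` client with the `simple` query parser, where `term~1` asks for matches within one edit of the term.

```python
import json


def fuzzy_query(text, max_edits=1):
    """Append the simple-parser fuzzy operator (~N) to every search term."""
    return " ".join(f"{term}~{max_edits}" for term in text.split())


def search_params(text, facet_field=None, size=10):
    """Build keyword arguments for a fuzzy (and optionally faceted) search call."""
    params = {"query": fuzzy_query(text), "queryParser": "simple", "size": size}
    if facet_field:
        # The facet option is passed as a JSON string keyed by field name.
        params["facet"] = json.dumps({facet_field: {"sort": "count", "size": 5}})
    return params


def suggest_params(prefix, suggester="title_suggester", size=5):
    """Build keyword arguments for an auto-suggest call (suggester name is hypothetical)."""
    return {"query": prefix, "suggester": suggester, "size": size}


# Assumed usage, with a domain-specific search endpoint:
#   client = boto3.client("cloudsearchdomain", endpoint_url=SEARCH_ENDPOINT)
#   client.search(**search_params("zapos", facet_field="brand"))
#   client.suggest(**suggest_params("zap"))
```

Keeping the parameter building separate from the network call makes it easy to unit test the query logic without touching a live search domain.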

Pipelines & Continuous Delivery

AWS Cloudsearch is one of the few products that are NOT supported in AWS Cloudformation. This means automation is a little unusual. We still use AWS CodeCommit to manage Infrastructure as Code (IaC), but this time the code is not a Cloudformation template. Instead, it is a JSON file that can be used directly as input to the AWS CLI. This file flows through the stages of a pipeline and is deployed, with environment-specific variations, to multiple environments.

Figure: Pipelines and Continuous Delivery for Search
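The exact layout of our JSON file is not shown in this article, so the following is only an illustrative sketch of what such CLI input could look like. It builds payloads shaped like the Cloudsearch `DefineIndexField` and `DefineSuggester` requests, which the AWS CLI can read through `--cli-input-json`. The domain name `products` and the field and suggester names are hypothetical.

```python
import json


def index_field_input(domain, name, field_type="text"):
    """Payload for `aws cloudsearch define-index-field --cli-input-json ...`."""
    return {
        "DomainName": domain,
        "IndexField": {"IndexFieldName": name, "IndexFieldType": field_type},
    }


def suggester_input(domain, name, source_field, fuzzy="low"):
    """Payload for `aws cloudsearch define-suggester --cli-input-json ...`."""
    if fuzzy not in ("none", "low", "high"):
        raise ValueError("fuzzy must be 'none', 'low' or 'high'")
    return {
        "DomainName": domain,
        "Suggester": {
            "SuggesterName": name,
            "DocumentSuggesterOptions": {
                "SourceField": source_field,
                "FuzzyMatching": fuzzy,
            },
        },
    }


def render(payload):
    """Serialize deterministically so the file diffs cleanly in source control."""
    return json.dumps(payload, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical domain and field names, for illustration only.
    print(render(index_field_input("products", "title")))
    print(render(suggester_input("products", "title_suggester", "title")))
```

A pipeline stage could then run something like `aws cloudsearch define-suggester --cli-input-json file://suggester.json`, followed by `aws cloudsearch index-documents --domain-name products` so that the new configuration takes effect.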

Usage in the application

Using Cloudsearch in the application has two aspects, just as with any other data store. The first is the write path: when users create or update data, the data first goes into a DynamoDB table and is then streamed to a Lambda function, which makes the corresponding change in the Cloudsearch SOLR index. The second is the read path: when a user wants to search data, one could make API calls straight into Cloudsearch or go through an API Gateway and Lambda solution (as described in other articles in this series).

Figure: Update the index from the web app

Failures and Inconsistencies

The weakest link in our design is the case where a user changes data through the application but the change never makes it to our search index. In these cases, users are unable to search for data they just inserted. This matters all the more because freshly changed data (aka “hot data”) is normally expected in search results more often than old data. To reduce failures and react to errors, the Lambda functions processing the DynamoDB Streams must be retry capable and must be configured with a dead letter queue. When messages arrive in the dead letter queue, an AWS Cloudwatch alarm should trigger a notification or other automated processes. This is a must when designing enterprise grade systems.

Figure: Handling Cloudsearch index failures
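One way to make the stream-processing function retry capable is sketched below. It retries each record with exponential backoff and, if a record still fails, reports its sequence number back to Lambda so that the stream resumes from that point. This assumes partial batch responses (`ReportBatchItemFailures`) are enabled on the event source mapping. The `upload` callable is hypothetical and stands in for whatever indexes a single record.

```python
import time


def upload_with_retry(upload, record, attempts=3, base_delay=0.2, sleep=time.sleep):
    """Try to index one record, backing off exponentially between attempts."""
    for attempt in range(attempts):
        try:
            upload(record)
            return True
        except Exception:
            # Sleep only if another attempt is still to come.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return False


def handler(event, upload, **retry_kwargs):
    """Process records in stream order and report the first one that keeps failing."""
    failures = []
    for record in event["Records"]:
        if not upload_with_retry(upload, record, **retry_kwargs):
            sequence = record["dynamodb"]["SequenceNumber"]
            failures.append({"itemIdentifier": sequence})
            # Stop here: Lambda retries from this record onward, preserving order.
            break
    return {"batchItemFailures": failures}
```

Once Lambda's own retry limit on the event source mapping is exhausted, the failing batch details can be routed to a queue, and that queue is the place to attach the Cloudwatch alarm.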

Performance & Cost

Performance depends on the size of the provisioned Cloudsearch service as well as on the nature of the data, so it will vary in each case. We were able to performance test our solution and measured latency (TTFB, Time to First Byte) ranging from 10 ms to 120 ms, with an average around 40 ms. That is way faster than our on-premises SOLR cluster (if and when it is feeling happy). There is a performance penalty when we re-index our data, so if we need to re-index during high-traffic periods in production, we plan to scale up first.

Amazon Cloudsearch is priced on the size of the SOLR cluster as well as the number of documents submitted for indexing. In general, if you run a single average-size production cluster of 2 servers for a month, it will set you back around $270. That, of course, is peanuts compared to the on-premises cost when you include everything (such as electricity, cooling, personnel, operations).

There is one MAJOR caveat to pricing as written here. If users change data very often (several times an hour) and also expect the latest data to be available immediately, then the solution must make many frequent uploads to Cloudsearch. This can get very expensive very fast. To get around it, you must batch requests into Cloudsearch and allow for a delay of a few minutes before users can search the latest data.
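Batching can be sketched as grouping pending document operations into as few upload requests as possible while staying under the Cloudsearch batch size limit of 5 MB. The function below is a minimal illustration, not our production code. How the pending operations are buffered for a few minutes is outside the sketch (one option is a batching window on the stream's event source mapping).

```python
import json

MAX_BATCH_BYTES = 5 * 1024 * 1024  # Cloudsearch limit per document batch


def make_batches(docs, max_bytes=MAX_BATCH_BYTES):
    """Pack document operations into JSON array strings no larger than max_bytes.

    A single document larger than max_bytes still gets its own batch; the
    caller is expected to reject or trim such documents beforehand.
    """
    batches, current, size = [], [], 2  # 2 bytes for the enclosing "[" and "]"
    for doc in docs:
        encoded = json.dumps(doc)
        doc_size = len(encoded.encode("utf-8")) + 1  # +1 for the separating comma
        if current and size + doc_size > max_bytes:
            batches.append("[" + ",".join(current) + "]")
            current, size = [], 2
        current.append(encoded)
        size += doc_size
    if current:
        batches.append("[" + ",".join(current) + "]")
    return batches
```

Each returned string is one upload request, so a thousand small changes collected over a few minutes become a handful of requests instead of a thousand.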

Lot more to it

Once again, I leave you, the reader, with very little of the details. That is intentional. Amazon Web Services is very flexible, and I guarantee that there are five other ways for every one way that I could prescribe as “the” best practice. If there is one specific thing that you must do, it is IAM. AWS IAM allows fine-grained access control to the Cloudsearch API and hence to the data within it. Use IAM.
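As a small illustration of what fine-grained access could look like, the sketch below builds an IAM policy document that allows searching and suggesting against one domain but not uploading documents. The region, account ID and domain name are placeholders, and the policy is only an example shape, not the one we deployed.

```python
import json


def search_only_policy(region, account_id, domain):
    """IAM policy allowing search and suggest, but not document uploads, on one domain."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["cloudsearch:search", "cloudsearch:suggest"],
                "Resource": f"arn:aws:cloudsearch:{region}:{account_id}:domain/{domain}",
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    print(json.dumps(search_only_policy("us-east-1", "123456789012", "products"), indent=2))
```

A policy like this could be attached to the role of the search-side Lambda function, while only the indexing function's role is granted `cloudsearch:document`.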