Designing a URL shortener is no easy task. It requires a good understanding of NoSQL stores, Apache Solr/Lucene, application and web servers such as Tomcat and Nginx, CDNs (Content Delivery Networks), and caching techniques for both web pages and database lookups. Some knowledge of Lucene index formats, such as the postings format and how it works, can be handy too.
URL shortening has several unique components that need to be designed right to ensure the system does not buckle under load. These include:
-The key component that lets users generate very short URLs is the hashing function used to map long links to shorter ones. When designing a URL shortening system, keep in mind that the hash function should have the following properties:
-A short hash is easier for humans to memorize and also easy to generate. Designing a good hashing method takes time and expertise, since it involves finding a function that can be computed quickly but whose output appears random, so that collisions are rare. Collisions can never be ruled out entirely once the output is truncated to a few characters, so the system still has to detect and handle them. Hash functions used in URL shortening services include SHA-1, SHA-256 and MD5; SHA-2 is generally favored over MD5 because it is more collision-resistant, although it is not faster. MurmurHash3, of which Lucene ships an implementation, is another popular choice: it is a non-cryptographic hash that is much cheaper to compute than MD5 or SHA-2, so it uses less CPU.
-Hashing cost grows with the length of the input, so longer URLs consume more CPU. Each time we shorten a link that has been seen before with the same hashing function, that CPU cost can be avoided by caching the hashed output. A service can also derive several values from different hash functions, for example an 'S' value from SHA-256 and an 'I' value from MD5. If those values are already cached in memory, one or even two hashes can be resolved without reading from or writing to disk.
-The other components that impact URL shortening system design include:
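To make the hashing discussion above concrete, here is a minimal sketch of turning a long URL into a short code. It is an illustration under stated assumptions, not how any particular service works: the names `BASE62`, `base62_encode` and `short_code`, the 7-character code length and the cache size are all hypothetical choices. It hashes the URL with SHA-256, encodes part of the digest in base 62, and caches results in memory so a repeated URL is not hashed again.

```python
import hashlib
from functools import lru_cache

# Hypothetical alphabet: digits, lowercase, uppercase (62 symbols).
BASE62 = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def base62_encode(n: int) -> str:
    """Encode a non-negative integer using the BASE62 alphabet."""
    if n == 0:
        return BASE62[0]
    digits = []
    while n > 0:
        n, rem = divmod(n, 62)
        digits.append(BASE62[rem])
    return "".join(reversed(digits))


@lru_cache(maxsize=100_000)  # assumed cache size; keeps hot URLs in memory
def short_code(long_url: str, length: int = 7) -> str:
    """Derive a fixed-length short code from a long URL."""
    digest = hashlib.sha256(long_url.encode("utf-8")).digest()
    # Use the first 8 bytes (64 bits) of the digest as an integer.
    value = int.from_bytes(digest[:8], "big")
    code = base62_encode(value)
    # Truncate to the requested length, padding on the left if needed.
    return code[:length].rjust(length, BASE62[0])
```

Because the code is truncated to a few characters, two different URLs can map to the same code. A real system would check the data store before accepting a new code and fall back to another strategy (for example, re-hashing with a salt) when it finds a collision.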
Shortened URLs have many uses. For example, they can be embedded in emails or tweets that link back to original articles on websites where the full content is available. However, shortened URLs have a bad reputation for several reasons, including:
URL shorteners, when designed correctly with scale in mind, can be a very powerful tool for users and companies alike. However, they should be used sparingly, and only where the pros outweigh the cons. URL shortening services have been built by several companies, such as Bitly, TinyURL and Goo.gl (Google's service). Each of these services works well and efficiently for its specific use case, offering different features and supporting multiple APIs. Here we will discuss how we could implement our own URL shortener system that takes care of these concerns and scales horizontally.
The diagram below illustrates how our proposed URL shortener system works:
The system will contain several application servers connected to one or more Solr/Lucene clusters. A front-end load balancer is responsible for directing each request to the appropriate application server. Each application server has a configuration file that specifies which shard of the index it should handle.
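The routing described above can be sketched roughly as follows. This is only an illustration under assumptions: the shard-to-server mapping `SHARD_SERVERS`, the server names, the `Router` class and the use of CRC32 to pick a shard are all hypothetical and stand in for whatever the load balancer and configuration files actually do.

```python
import zlib
from itertools import cycle

# Hypothetical configuration: which app servers handle which index shard.
SHARD_SERVERS = {
    0: ["app-1", "app-2", "app-3"],
    1: ["app-4", "app-5"],
    2: ["app-6", "app-7"],
}


def shard_for(code: str, num_shards: int = len(SHARD_SERVERS)) -> int:
    """Map a short code to a shard number.

    CRC32 is used because it is stable across processes, unlike
    Python's built-in hash(), which is salted per process for strings.
    """
    return zlib.crc32(code.encode("utf-8")) % num_shards


class Router:
    """Toy load balancer: pick the shard, then round-robin its servers."""

    def __init__(self, shard_servers):
        self._pools = {
            shard: cycle(servers) for shard, servers in shard_servers.items()
        }

    def route(self, code: str):
        shard = shard_for(code, len(self._pools))
        return shard, next(self._pools[shard])
```

A call such as `Router(SHARD_SERVERS).route("aZ3kQ9x")` returns a `(shard, server)` pair. Repeated calls for the same code stay on the same shard but rotate through that shard's servers.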
The reason we use multiple shards is to increase performance and allow the URL shortener system to scale horizontally. For example, if we had 3 shards, then each time a user shortened a link, that request would be sent to only about one third of the servers in our cluster (assuming that there are 7 app servers). This way instead of 100% of users sending their requests directly towards one node, only 1/7