/dev/random

Software obsoleting faster than Hardware

2021-09-28T20:55:00.001+05:30

When I started my career, I was lucky to work on a legendary operating system named Novell Netware. I got my first job because of a hacking adventure that my roommate Arvind and I had in college on top of an unpatched Netware installation. The reliability and robustness of Netware may make you believe in magic. But it was just a well-engineered, old-school product. One of the most popular instances of its robustness was the epic uptime of 16 years, as covered in Ars Technica.

(Image courtesy: Ars Technica)

This was not a one-off situation either. We had multiple customers with years of uptime. In one of the academic institutes, the uptime ran well into decades: multiple sysadmins came and went, but the Netware box tirelessly worked on. At some point, nobody knew where the server was physically located, because everything worked fine and nobody had a reason to look at it.

In almost all the cases, the hardware failed before the software. The software was engineered so well that it would have run forever on superior hardware (albeit without using modern hardware to its full potential). In those days, even the hardware was built to last for decades. Those were the good old times before planned obsolescence.

Fast forward to today, 2021. I have a Redmi 4 Android phone, built by the mass manufacturer Xiaomi. I bought it on May 30th 2017 and still use it every day. I always purchase things for the long term. I believe in BuyItForLife principles. I maintain my hardware properly (fully discharge and then recharge, handle with care, etc.). Even my prior phone, a Motorola E398, lasted me about a decade before the charger gave up.

Today during the lunch break, I went out in search of a home. My ever busy teammate Seshachalam sent me a message on Slack at that time. I got a notification in the Android pull-down notifications. I tried to open Slack to see the message and got this error message:

It seems Slack will no longer work on Android phones that are just about 4 years old. Slack was recently acquired for 27.7 billion USD. They must have a lot of good engineers. Xiaomi is a cheap, Chinese mass manufacturer that sells hardware at a fraction of what an iPhone costs. If Slack, the software, is becoming obsolete faster than hardware optimised for cost, we engineers, as a [pseudo-]species, are doing something terribly regressive and getting worse. Netware was an entire operating system, with file and print sharing, and it could provide an uptime of decades. Slack is a messaging app, which should not have problems running on hardware that is less than 5 years old.

Slack has a responsibility to keep running on old hardware, if for nothing else, then for our planet. Even if 0.5% of the total number of Slack users (which should be comfortably in the tens of millions) have to replace their perfectly functioning smartphones just because Slack won't work, imagine the e-waste that would be generated. Imagine the water cost of these smartphones. All this e-waste will then be dumped in poor third world nations, causing more harm to them.

Being green and environmentally responsible does not mean just optimising the server-side cost. Companies like Slack, Zoom, etc., which have become indispensable in the post-covid world, need to do better and maintain better longevity for their software.

If nothing else, Slack could just provide a lite version that shows only plain text messages, with no emoticons, rickrolling, etc. Even Gmail has a basic HTML view.

Repairability

2020-12-27T13:01:00.000+05:30

My Macbook Pro

I have a 15-inch Macbook Pro Retina that I bought in 2016. A few days back, the battery started bulging and the laptop has totally stopped working. The battery has grown so big now that I cannot keep the laptop on a flat surface; it almost rocks like a see-saw. The touchpad panel is also feeling the bulge. The Macbook Pro is not even switching on now, presumably to safeguard against battery explosions.

I bought the laptop 4 years back, for about 200,000 INR (~2700 USD / 2200 EUR). Electronics are very costly in India :( This is the pre-touchbar Macbook pro. I did not like the new keyboard in the then new Macbooks (the first edition with the touchbar). Luckily, I did not purchase the touchbar version which probably is one of the worst electronic devices ever manufactured.

I took my Macbook Pro today to a nearby Apple service center and was told that the battery replacement would cost about 40,000 INR (~550 USD / 450 EUR). For such a costly laptop, it has terrible battery longevity. I have no interest in paying this much for just a battery, for an old laptop which is anyway not fun to work on anymore. I could purchase a new laptop with this money. For this money I could even set up a private cloud of a few Raspberry Pis and launch kubernetes on them for fun.

I thought that I could purchase a battery offline and replace it myself. But to replace the battery, you need to disassemble almost everything (hard disk, speakers, CPU, fans, etc.) in a Macbook and use chemicals (acetone). Thankfully Apple is not yet making cars; otherwise, to change the engine oil, we might have to disassemble the headlamps, engine, transmission, differential, etc.

What is more evil is that the Macbook Pro will not work without a battery, even if it is plugged into a power supply. I run my old HP laptop (for parents+kids) that way, as its battery is long gone. The magsafe charger of the Macbook is another well-known disaster. I believe that Apple gets a lot of undeserved praise for their hardware. They have a shiny aluminium body, a good screen and the best touchpad, but their thermal management, longevity and repairability are all abysmally bad. I have replaced the charger three times in ~3 years because the plastic covering near the charging point goes bad in daily usage and the internal wire gets damaged. The wires also become very yellow and dirty in the Indian climate, for some reason. Even people far more connected and influential than me could not change Apple's behavior.

Personally, I am never buying any Apple device or Macbook Pro again for personal use, simply because I do not like their greed and exploitation of vulnerable customers.

Thinkpads

The only reason I bought the Macbook Pro in 2016 was because I had to do some iOS app development. Prior to that, in my $DAYJOBs I have almost exclusively used Thinkpads, right from the days they were owned by IBM (and had an extraordinary keyboard) until they were sold off to Lenovo (and got this chiclet nonsense). The old Thinkpads were a delight to have. We could replace the batteries, replace the fans, replace individual keys, etc. We could also add RAM or disk whenever we needed, and as much as we needed. All without needing anything more than a normal screwdriver set.

Good things are not meant to last. Just like how Macbooks have gotten worse, Thinkpads too have gotten worse. In my current $DAYJOB I use a Thinkpad E series (the cheapest version) and it is terrible. The management of Lenovo is either dumb and does not understand what its loyal customers want, or just plain evil (or capitalistic extremists) and has embraced planned obsolescence.

The Thinkpad now comes in many series (L, T, X, P, E, etc.) and almost none of them have external batteries. Almost all of them have the RAM soldered on, so it cannot be replaced. If the soldered RAM goes bad, we need to throw away the laptop. Thinkpads were supposed to be the most developer-friendly laptops. But even in 2020, we cannot get a single 32GB RAM in any of the medium cost Thinkpad ranges. If you want anything more than 32GB RAM, you must shell out a lot of money and go for a ridiculously high cost series with a 4k screen or some such luxury that I do not want. And their fingerprint readers never seem to work reliably on Linux for some reason, despite most kernel developers using Thinkpads.

E Waste and Green Earth

When I was in school, I lived in a house with no electricity. I then grew up and lived in Indian towns where power cuts of 8-12 hours per day were not unheard of. Luckily I now live in a big Indian city where power cuts are just an occasional, few-hours-a-week affair. If it were not for the power cuts, I would happily purchase a desktop instead of a laptop. At least until now, desktops (not those integrated all-in-one pieces) age better than laptops. But power cuts are a part of life where I live and I need battery backup.

Mobile Phones

Laptops and desktops are only a small part of the story. Now, with mobile phones everywhere, the amount of e-waste getting generated is exploding (literally, in some sense). Android is a bigger culprit than Apple here. Even Google (which does not have "Don't be evil" as a motto anymore) is refusing to push updates for Pixel phones that are just 3 years old.

Once electric cars become more available, it is going to be worse for third world nations which import e-waste. Rich billionaires and millionaires will claim to be more green by switching to electric cars and will send off the batteries to electronic graveyards on the other side of the Earth.

By making it difficult to replace/recycle batteries in laptops and phones, OEMs are making the world generate a lot of e-waste. The first world nations worry about e-waste polluting their water and land, so what do they do ? They simply dump it on third world nations, like India, Vietnam, etc. As if we, the people of these nations, did not have enough things of our own to worry about, now we have to accommodate tonnes of this e-waste, which spoils our water and pollutes our air.

We cannot even protest against these e-waste processing units that import world's junk, because most of the third world nations do not even have healthy democracies where citizens can opine against the Government/rulers, unlike the west.

What to do ?

The European Union at least is trying to do something, while the rest of the world seems not to be bothered. The USA especially has a lot of responsibility, because most of the OEMs like Apple, HP, Dell, etc. are walking the evil path of denying repairability and increasing sales, only to please Wall Street and their $SHARE_SYMBOLs in the American stock market. It is pointless to spend millions on green-earth initiatives if you do not produce recyclable / repairable electronic gadgets.

What can we Engineers do to combat such planned obsolescence and promote right-to-repair and recycling ?

Honestly, I do not have an answer. Maybe we could exert influence in our small circles of hardware purchase. When your $EMPLOYER is looking to update the company hardware and get laptops for everyone, insist that they buy only hardware which can be easily repaired.

If you work for an e-commerce giant (like Amazon, Walmart, eBay, etc.), push your employer to provide a "Repairability score" as a filter condition in the product pages (similar to 3*, 4*, 5*, etc.) of electronic devices. Maybe if enough companies/customers start demanding these, the OEMs will have a financial motivation to do the right thing.

Are there any other steps that you believe that we as individuals could do to bring a change ? If so, please comment.

What laptops do you like using that have a good repairability score, even in 2020/2021 ?

Thanks for coming to my TED talk ;-)

PS: If you know any good third party battery (which wouldn't explode in hot Indian weather) for Macbook Pro please let me know. If you refer a mechanic/shop who does the Macbook battery replacement in Chennai/Bangalore, India, that would be even better.

CLI Tools

2020-03-04T22:36:00.000+05:30

I spend a lot of time in the terminal. I prefer using CLI tools. Even though I like using Goland or Visual Studio Code (IDEs) for coding (instead of vim/emacs), I prefer to do my non-coding activities from a terminal using CLI tools.

I am quite happy with my zsh and its various plugins (git, kubernetes, docker, etc). Since in $DAYJOB I work with kubernetes a lot, I heavily use kubectl, kubectx, kubens etc. in combination with grep, jq, pipes, etc. and prefer these CLI tools always over clicking buttons or scrolling long pages in browser.

All these got me thinking: if I were to write a command line application today (March 2020), which language / frameworks should I use ? Some self-imposed constraints:

The CLI application will be short-lived and will be invoked multiple times every day by developers/users (like grep, ls, cat, sed, etc.); it is not a daemon or a long-running process. It won't matter if it leaks memory either ;-)
It needs to be fast. Lightning speed.
The tool development may also be split into two parts, a library and a binary, if that could help in developing parallel client implementations (maybe a GUI tool in the future).
The tool is going to be a FOSS tool and would need some community presence in the future.
The tool needs to be cross-platform (Mac, Linux, Windows).

This post is a summary of various candidates and their current strengths and weaknesses in my perspective.

C

+ Probably the language that will guarantee the fastest tool.
+ Easy to write wrapper libraries/bindings for any language
+ There are libraries like glib which might help with achieving better platform portability than the default language.
- Manual memory management, crash prone
- Most young programmers of today may not bother contributing, even if they use the tool. Heck, most colleges do not even teach this language anymore.
- If the tool has to work with network services, JSON etc., library support is not going to be easily available.
- i18n, Unicode support etc. may not be great out of the box

Note: C++ is just a more complicated and painful C in my personal experience. Even though I have heard nice things about SmartPointers etc., I am not really convinced that learning C++ is going to be beneficial in the long run and never really bothered to master it, after learning it at college. This is an intentional miss.

Java

+ Rich and mature libraries and ecosystem
+ Complete platform independence and is guaranteed to run on Linux, Macs or Windows alike. Datatypes, Files etc. behave properly everywhere.
- Perception of slowness. There may be some JVM tweaking etc. needed and that is an extra effort. IIUC this would be a non-issue for long running processes.
- Unclear licensing: may require paying money to Oracle

Special mention: Kotlin is a humongously better language than the Java 8 (with lambdas) that I recently used. It has been mentioned that future releases of Java would incorporate sexy aspects of Kotlin. Related talk. (Ignore the clickbait title, excellent talk really, strongly recommended).

Python

+ Good library support and hopefully all will migrate to Python 3
- Not a fan of the language for various reasons. Whitespace for scope identification, Lack of static typing out of the box, type unsafety for variables, etc.

Javascript (With Typescript/Electron, etc)

+ The language with the most number of developers in today's scenario (biased as per my sample set)
+ Excellent support for multiple natural languages, glyphs, diacritics, etc. Mature frameworks, processes for i18n, l10n, etc.
- Slow and hungry for memory/cpu
- Using nodejs for CLI applications, though not unheard of, is not popular either (compared to the adoption of Electron for GUI applications)

D / Haskell / Rust / C#

These languages may be great but these are niche and are not widely used by a large number of programmers in my circle.

Although Rust has been claimed to be gaining momentum for a long time now, I cannot recollect any popular application that I use every day that is written in Rust. I have stopped using Firefox (in favor of Brave), so I am not really sure what the current performance status of Rust is. I will, however, keep a watch on Rust and will try to learn it at some point in the near future.

Golang

+ Almost every new CLI tool that I have used fresh in the last 5 years or so, is done in Golang (docker, kube*, hugo, helm, etc)
+ Excellent libraries: Cobra and Viper (Thanks to spf13)
+ Super simple to distribute. Static binaries.
+ Automatic memory management and highly performant.
+ Strong community presence
- Lack of generics is a pain. It is not a problem when developing HTTP servers, REST services etc. but definitely irritates when writing libraries
- Platform independence may be questionable. If the CLI tool is for something mission critical (a large number of users, lives depend on it, etc.), some of the discrepancies mentioned here may be dangerous.

Conclusion

I will choose Golang if it is for dayjob. I would however choose Kotlin if the development is for a hobby/pet project to learn.

What will you choose, What merits/demerits do you see for these (or other) languages ? Also what libraries/frameworks will you choose for your language(s) of choice ?

HTTP Query Params 101

2019-02-07T15:03:00.001+05:30

Target Audience: Beginners / Novice

Summary

A long time ago, we had simpler lives with our monolithic apps talking to relational databases. SQL supported myriad conditions in the WHERE clause. As time progressed, every application became a webapp and we started developing HTTP services talking JSON, consumed by a variety of client applications such as mobile clients, browser apps, etc. So, some of the filtering that we were doing via SQL WHERE clauses now needed a way to be represented via HTTP query parameters. This blog post tries to explain the use of HTTP Query Parameters for newbie programmers, via some examples. This is NOT a post on how to cleanly define/structure your REST APIs. The aim is to just give an introduction to HTTP Query Parameters.

Action

Let us build an ebooks online store. For each book in our database, let us have the following data:

BookID - String - Uniquely identifies a book
Title - String
Authors - String Array
Content - String - Base64 encoded content of the book
PublishedOn - Date
ISBN - String
Pages - Integer - Number of pages in the book

Let there be an API to get a list of books. It would be something like:

GET https://api.example.com/books

The above API will return all the information about each of the books in our system except the Content. Though this would work for a small book shop, if you have, say, a billion books, this puts unnecessary stress on the server, the client and the network bandwidth to hold all the book data, when the user is probably not bothered to see more than, say, 10 titles in most cases. So our API now needs a way to return only N titles. The position from which we return the N titles, say the Mth position, also needs to be specified. These fields are usually called limit and offset. So our API becomes:

GET https://api.example.com/books?offset=5&limit=10

Here we have added two fields, offset and limit to our API. However there are two things that are unclear in this API definition.

The first ambiguity is: we do not know which field will be used for ordering the books. Is it the BookID ? Is it the PublishedOn date ? The former is a string, so how do we sort it to find the order (alphabetically, in a case-sensitive or case-insensitive way) ? The latter is a date field and there can be multiple books which have the same published date. So, how do we ensure that a book will always be in the same position in the sort order between two different HTTP requests ?

The second ambiguity is: What if these fields are not specified or if specified with invalid values ? How does the API handle it ?

To solve both of these ambiguities, our API docs need to become more precise. One possible solution (out of many) to address the first ambiguity is: we will always generate BookIDs as lower case strings and will always apply a toLower conversion to incoming values. We will always sort by BookID, in ascending order. Our BookID field will always be monotonically increasing; IOW, once we have given a BookID of "abc" to a Book, we would never generate a BookID of "aba" afterwards.

Instead of the String unique ID, we could also use numeric fields, which could directly map to a database AUTO_INCREMENT or BIGSERIAL field and ORMs can intelligently map the offset, limit fields automatically.
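As a rough illustration of how offset and limit could map onto such a database, here is a minimal sketch. The function name buildBooksQuery, the table name books and the column names are all assumptions made up for this example, and the $1/$2 placeholders follow the PostgreSQL style of parameterised queries.

```javascript
// Hypothetical helper: turn already-validated offset/limit values into a
// parameterised SQL query. Table and column names are assumptions.
function buildBooksQuery(offset, limit) {
  return {
    // ORDER BY keeps every book at a stable position across requests.
    text: "SELECT book_id, title FROM books ORDER BY book_id ASC LIMIT $1 OFFSET $2",
    // Values are passed separately so they are never spliced into the SQL.
    values: [limit, offset],
  };
}

// GET /books?offset=5&limit=10 would translate to:
const query = buildBooksQuery(5, 10);
console.log(query.text, query.values);
```

Passing the numbers as bound values rather than concatenating them into the SQL string keeps the query safe even if validation is missed somewhere upstream.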

Note that it is not uncommon to add "sort_by" and/or "order_by" requirements, where the clients can choose to change the sort field (Publication Date instead of BookID) and also the sorting order (ascending or descending). There are multiple ways to represent this via the query parameters. Some examples are:

Sort by Title (default ascending):

GET https://api.example.com/books?sort_by=title

Sort by Title (explicitly ascending):

GET https://api.example.com/books?sort_by=asc(title)

GET https://api.example.com/books?sort_by=+(title)

GET https://api.example.com/books?sort_by=title.asc

Sort by Title (explicitly descending):

GET https://api.example.com/books?sort_by=desc(title)

GET https://api.example.com/books?sort_by=-(title)

GET https://api.example.com/books?sort_by=title.desc

Sort by Multiple Fields:

GET https://api.example.com/books?sort_by=asc(title),desc(published_on)

GET https://api.example.com/books?sort_by=title,-published_on

GET https://api.example.com/books?sort_by=+title,-published_on
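Whichever notation you pick, the server has to turn it into a list of (field, direction) pairs. Below is a minimal sketch for the comma-separated "title,-published_on" style, using only the URL class built into Node. The function parseSortBy, the set of sortable fields and the default of sorting by book_id are assumptions for illustration. One thing worth knowing: an unescaped "+" in a query string is decoded as a space, so a client that really wants to send "+" should encode it as %2B; the sketch simply treats a leading space like a "+".

```javascript
// Fields that clients may sort by; an assumption for this sketch.
const SORTABLE_FIELDS = new Set(["book_id", "title", "published_on"]);

// Parse "title,-published_on" into [{ field, direction }, ...].
// Throws on unknown fields so that the caller can answer with HTTP 400.
function parseSortBy(raw) {
  if (raw === null || raw === "") {
    return [{ field: "book_id", direction: "asc" }]; // assumed default
  }
  return raw.split(",").map((token) => {
    const descending = token.startsWith("-");
    // Strip one leading "-", "+" or space (a decoded, unescaped "+").
    const field = token.replace(/^[-+ ]/, "");
    if (!SORTABLE_FIELDS.has(field)) {
      throw new RangeError(`cannot sort by "${field}"`);
    }
    return { field, direction: descending ? "desc" : "asc" };
  });
}

const sortUrl = new URL("https://api.example.com/books?sort_by=title,-published_on");
console.log(parseSortBy(sortUrl.searchParams.get("sort_by")));
```

Restricting sorting to a known set of fields also stops clients from sorting on columns that have no index.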

To resolve the second ambiguity that we saw earlier, the safest solution is to make our HTTP APIs return 400 in case we come across invalid data, for example, if an invalid starting offset is given. We also need to explicitly document the default values that apply when these query parameters are not specified.
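A minimal sketch of that validation, using Node's built-in URLSearchParams, could look like the following. The default values (offset 0, limit 10), the cap of 100 and the shape of the returned object are assumptions picked for this example, not a recommendation.

```javascript
const DEFAULT_OFFSET = 0; // assumed default, must be documented
const DEFAULT_LIMIT = 10; // assumed default, must be documented
const MAX_LIMIT = 100;    // assumed upper bound

// Returns the fallback when the parameter is absent, NaN when it is
// present but not a plain non-negative integer ("", "-1", "abc", "1.5").
function parseNonNegativeInt(raw, fallback) {
  if (raw === null) return fallback;
  if (!/^\d+$/.test(raw)) return NaN;
  return Number(raw);
}

// Maps the query parameters to either a 200 result or a 400 error.
function parsePagination(searchParams) {
  const offset = parseNonNegativeInt(searchParams.get("offset"), DEFAULT_OFFSET);
  const limit = parseNonNegativeInt(searchParams.get("limit"), DEFAULT_LIMIT);
  if (Number.isNaN(offset)) {
    return { status: 400, error: "offset must be a non-negative integer" };
  }
  if (Number.isNaN(limit) || limit < 1 || limit > MAX_LIMIT) {
    return { status: 400, error: `limit must be between 1 and ${MAX_LIMIT}` };
  }
  return { status: 200, offset, limit };
}

const pageUrl = new URL("https://api.example.com/books?offset=5&limit=10");
console.log(parsePagination(pageUrl.searchParams));
```

Rejecting bad input outright, instead of silently falling back to the defaults, makes client bugs visible early.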

Filters

We have used the GET above to get all the Books information and limit how many results are returned. However, there may be a need for other filters. For example, we may want to get the books by only a particular author. So we could add more parameters, such as:

GET https://api.example.com/books?author=crichton

Here the author is a query parameter which takes a string as an argument. This API will return any book whose author name matches "crichton". Also note that, these individual filters could be then combined with other filters, for example:

GET https://api.example.com/books?author=crichton&offset=0&limit=5

will return the first five books of author "crichton". So the API implementation in the backend should apply the "limit" and "offset" after applying the author="crichton" filter. The API docs need to convey unambiguously which positions the "offset" refers to when there are other filter conditions. The other choice is to return only those books by "crichton" that appear in the first five results of the list of all books. You can choose either of the practices, as long as all your APIs are consistent, even though I prefer the former.
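To make the "filter first, then paginate" choice concrete, here is a small in-memory sketch. The listBooks function and the sample records are made up for illustration; a real service would push both steps down into the database query.

```javascript
// Sample data for the sketch; IDs and author strings are assumptions.
const BOOKS = [
  { bookId: "b001", title: "Foundation", authors: ["asimov"] },
  { bookId: "b002", title: "Jurassic Park", authors: ["crichton"] },
  { bookId: "b003", title: "I, Robot", authors: ["asimov"] },
  { bookId: "b004", title: "Sphere", authors: ["crichton"] },
  { bookId: "b005", title: "Timeline", authors: ["crichton"] },
];

// Apply the author filter first, then offset/limit on the filtered list.
function listBooks(allBooks, { author, offset = 0, limit = 10 }) {
  const matching =
    author === undefined
      ? allBooks
      : allBooks.filter((book) => book.authors.includes(author));
  return matching.slice(offset, offset + limit);
}

// Equivalent of GET /books?author=crichton&offset=1&limit=2
const page = listBooks(BOOKS, { author: "crichton", offset: 1, limit: 2 });
console.log(page.map((book) => book.title));
```

With this ordering, offset=1 skips the first book by "crichton", not the first book in the whole catalogue.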

More Filter conditions

In the above API definition, we were returning the books whose author was "crichton" exactly. However, it may not be always possible to give an Exact Equals condition for our API. Our API may require to accept query parameters which should be loosely applied. For example, the author may be stored in our system as "Michael Crichton" so applying "crichton" may not be sufficient. Similarly, we may need to get a list of all books published after 2005 but before 2015.

Our query parameters, in addition to "equal-to" may need to support, less-than, less-than-or-equal-to, greater-than, greater-than-or-equal-to, not-equal-to, contains (for string matches), not-contains and so on.

Our query parameters need to pass these operators too, in addition to the parameter name and the desired value(s). One possible approach for this could be to add these operator names to the query parameter. For example:

GET https://api.example.com/books?published_on[gte]=2005-01-01&published_on[lte]=2015-12-31

GET https://api.example.com/books?published_on.gte=2005-01-01&published_on.lte=2015-12-31

In the above two examples, we have added "lte" and "gte" to denote less-than-or-equal-to and greater-than-or-equal-to respectively, to the query parameter name. We use a standard separator to identify the operator from the parameter name. A [] in the first case and a "." in the second case. There are libraries in most programming languages, to automatically parse these field names conveniently. For example, from the popular qs library for node,

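const assert = require('assert');
const qs = require('qs');

// Brackets in the parameter name become nested object keys.
assert.deepEqual(qs.parse('published_on[lte]=2015-12-31'), {
    published_on: {
        lte: '2015-12-31'
    }
});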

Note that I am using YYYY-MM-DD as the date format. It is strongly recommended to use a single date format for all your APIs, whichever format you choose. Similarly, while working with time, choose a single timezone, preferably UTC.
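For the dot-separated variant, a library is not strictly needed. The sketch below uses Node's built-in URL class to split each parameter name into a field and an operator, and accepts only strict YYYY-MM-DD values, interpreted as midnight UTC. The helper names parseIsoDate and parseDateFilters and the operator list are assumptions for this example.

```javascript
const OPERATORS = new Set(["eq", "ne", "lt", "lte", "gt", "gte"]);

// Parse a strict YYYY-MM-DD string as midnight UTC; null if invalid.
function parseIsoDate(text) {
  if (!/^\d{4}-\d{2}-\d{2}$/.test(text)) return null;
  const date = new Date(`${text}T00:00:00Z`);
  // Reject impossible dates such as 2015-02-30, whether the engine
  // reports them as invalid or silently rolls them over into March.
  if (Number.isNaN(date.getTime()) || date.toISOString().slice(0, 10) !== text) {
    return null;
  }
  return date;
}

// Collect "field.operator=value" parameters for one date field.
function parseDateFilters(searchParams, field) {
  const filters = {};
  for (const [key, value] of searchParams) {
    const [name, operator = "eq"] = key.split(".");
    if (name !== field) continue; // some other parameter
    const date = parseIsoDate(value);
    if (!OPERATORS.has(operator) || date === null) {
      throw new RangeError(`invalid filter ${key}=${value}`); // map to HTTP 400
    }
    filters[operator] = date;
  }
  return filters;
}

const dateUrl = new URL(
  "https://api.example.com/books?published_on.gte=2005-01-01&published_on.lte=2015-12-31"
);
const range = parseDateFilters(dateUrl.searchParams, "published_on");
console.log(range.gte.toISOString(), range.lte.toISOString());
```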

Instead of changing the parameter names, we can add the operator to the operand on the RHS of the equal sign too. For example:

GET https://api.example.com/books?published_on=gte:2005-01-01&published_on=lte:2015-12-31

Here we are using "gte" and "lte" to denote the operator but specify it in the RHS of the equal symbol.
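With this style the same parameter name appears more than once, so the server has to read all of its values and split each one into an operator and an operand. A minimal sketch using the built-in URLSearchParams follows; the function name parseRhsFilters and the choice of "eq" as the operator for a bare value are assumptions.

```javascript
// Split every "operator:value" occurrence of one query parameter.
// A value without a colon is treated as an equality check (assumed).
// Caveat: a bare value that itself contains a colon (say, a time like
// 10:30) would be misread, so such values need an explicit operator.
function parseRhsFilters(searchParams, field) {
  return searchParams.getAll(field).map((raw) => {
    const separator = raw.indexOf(":");
    if (separator === -1) {
      return { operator: "eq", value: raw };
    }
    return {
      operator: raw.slice(0, separator),
      value: raw.slice(separator + 1), // split at the first colon only
    };
  });
}

const rhsParams = new URLSearchParams(
  "published_on=gte:2005-01-01&published_on=lte:2015-12-31"
);
console.log(parseRhsFilters(rhsParams, "published_on"));
```

The operator names would still need to be checked against an allowed list before being used in a query.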

If you have a long list of filters and operators, you should probably avoid using HTTP Query Parameters. An API that receives these complex queries, perhaps expressed in your own DSL (as JSON or any other serialisable format), in an HTTP request body would make code maintenance simpler. Elasticsearch uses this method.
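To give a feel for what such a request body could look like, here is a tiny made-up DSL: a JSON document with a "filters" array of field/op/value triples. This shape, the operator names and the sample records are all assumptions for illustration, and it is not how Elasticsearch's own query language looks.

```javascript
// Hypothetical operators for a JSON filter body.
const COMPARATORS = new Map([
  ["eq", (actual, expected) => actual === expected],
  ["ne", (actual, expected) => actual !== expected],
  ["lt", (actual, expected) => actual < expected],
  ["lte", (actual, expected) => actual <= expected],
  ["gt", (actual, expected) => actual > expected],
  ["gte", (actual, expected) => actual >= expected],
  ["contains", (actual, expected) =>
    String(actual).toLowerCase().includes(String(expected).toLowerCase())],
]);

// True when the record satisfies every filter in the list.
function matchesAll(record, filters) {
  return filters.every(({ field, op, value }) => {
    const compare = COMPARATORS.get(op);
    if (compare === undefined) {
      throw new RangeError(`unknown operator "${op}"`); // map to HTTP 400
    }
    return compare(record[field], value);
  });
}

// Body that a client might POST instead of using query parameters.
const requestBody = JSON.parse(`{
  "filters": [
    { "field": "author", "op": "contains", "value": "crichton" },
    { "field": "published_on", "op": "gte", "value": "2005-01-01" }
  ]
}`);

// Made-up records; YYYY-MM-DD strings compare correctly as plain strings.
const sampleBooks = [
  { title: "Sample Book A", author: "Michael Crichton", published_on: "2006-11-28" },
  { title: "Sample Book B", author: "Michael Crichton", published_on: "1990-11-20" },
  { title: "Sample Book C", author: "Someone Else", published_on: "2010-01-01" },
];
const hits = sampleBooks.filter((book) => matchesAll(book, requestBody.filters));
console.log(hits.map((book) => book.title));
```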

Conclusion

I have been trying to write this for some time now but kept on deferring for months now. So I decided to just type it out today in a stretch. The post is not as fine as I wanted it to be, but it is better to at least write an unrefined post than not writing anything at all. I hope you have found this useful. Let me know if you have any comments, feedback or corrections in this post. Thanks if you have read till here.

Containers 101

2017-09-11T13:53:00.001+05:30

The term "containers" became popular in recent times, thanks to Docker. However, the idea of containers has been around for long, through things like Solaris Zones, Linux Containers, etc. (even though the underlying implementations are different). In this post, I try to give a small overview of the containers ecosystem (as it stands in 2017), from my perspective.

This post is written in response to a question by, hacker extraordinaire, Varun on what one should know about Containers as of today. Though the document is mostly generic, some lines of it are India specific, which I have highlighted clearly. Please mention in comments, if there is anything else that should have been covered, or if I have made any mistakes or if you have any opinions.

So, What exactly are Containers ?

Containers are a unit of packaging and deployment that guarantees repeatability and isolation. Let us see what each part of that sentence means.

Containers are a packaging tool like RPMs or EARs in the sense that they offer you a way to bundle up your binaries (or sources in case of interpreted languages). But instead of merely archiving your sources, Containers provide a way to even deploy your archive, repeatably too.

Anyone who has done packaging knows how much pain dependency-hell can cause. For example, an application A needs a library L of version 0.1, whereas another application B needs the same library L but of version 0.3. Just to screw up the life of packagers, the versions 0.1 and 0.3 may conflict with each other and may not co-exist in a system, even in different installation paths. Containerising your application puts each of these applications A and B into its own bundle, with its own library dependencies. However, the real power of containerising is that each of your applications, A and B, gets an isolated view in which it appears to be running in a private environment, so L 0.1 and L 0.3 never share any runtime data.

One may be reminded of Virtual Machines (VMs) while reading the above text. VMs also solve the above isolation problem, but they are very heavy. The fundamental difference between a VM and a Container is: a VM virtualizes/abstracts a hardware/operating-system and gives you a machine abstraction, while a Container virtualizes/abstracts an application of your choice. Containers are thus very lightweight and far more approachable.

The Ecosystem

Docker is the most used container technology today. There are other container runtimes such as rkt too. There is an Open Container Initiative to create standards for container runtimes. All these container runtimes make use of Linux kernel features, especially cgroups and namespaces, to limit and isolate processes. Microsoft has, for quite some time now, been making a lot of effort to support containers natively in the Windows kernel as part of their Azure cloud offering.

Container Orchestration is a way of deploying different containers on a bunch of machines. While Docker is arguably the champion of container runtimes, Kubernetes is unarguably the King/Queen of container orchestration. Google has been using containers in production since long before it became fashionable. In fact, the first patch for cgroups support in the Linux kernel was submitted to LKML by Google as far back as 2006. Google had/has a large scale cluster management system named Borg which deployed containers (not Docker containers) across the humongous Google cloud farm. Kubernetes is an open source evolution of Borg, supporting Docker containers natively. Docker-Swarm is an attempt by Docker (the company behind the Docker project) to achieve container orchestration across machines, but it is simply no competition for Kubernetes in terms of quality, documentation or feature coverage (in my limited experience).

In addition to these, there are some poorly implemented, company-specific tools that try to emulate Kubernetes, but these are mostly technical debt and it is wise (imho) for companies to ditch such efforts and move to open projects backed by companies like Google, Red Hat and Microsoft. A distinguished engineer once told me, "There is no compression algorithm for experience", and there is no need for us to repeat the mistakes made by these companies decades ago. If you are a startup focussing on solving a user problem, you should focus on your business problem, and container orchestration software should be the last thing that you need to implement.

Kubernetes, though initially a Google project, has now attracted a lot of contributors from a variety of companies such as Red Hat, Microsoft, etc. Red Hat have built OpenShift, a platform that provides a lot of useful features such as Pipelines, Blue-Green deployments, etc. on top of Kubernetes. They even offer a hosted version. Tectonic (on top of Kubernetes) by CoreOS is also a big (at least in terms of developer mindshare) player in this ecosystem.

SUSE has come up recently with the Kubic project for containers (even though I have not played with it myself).

Microsoft have hired some high profile names in the container ecosystem (including people like Brendan Burns, Jess Frazelle, etc.) for working on the Kubernetes + Azure cloud. Azure is definitely way ahead of Google in India, when it comes to cloud business. Their pricing page is localised for India, while Google does not even support Indian currency yet and charges in USD (leading to jokes like the oil/dollar conspiracy, among the Indian startup ecosystem ;) ). AWS and Azure definitely have a bigger developer mindshare in India than Google Cloud Platform (as of 2017).

The founding team of Kubernetes (Xooglers) have started a company named Heptio. While I have no doubts about their engineering prowess, I suspect that relying on these companies may be risky for startups in India (lack of same-timezone support, etc.). If you are in the west, these options (and others such as Rancher) may be interesting.

Kubernetes Basics

In Kubernetes, the unit of deployment is a Pod. A Pod is merely a collection of Docker containers which will always be deployed together. For example, if your application is an API server that makes use of a Redis cache before hitting the database for each request, you create a Pod with two containers, an API server container and a Redis container, and you deploy them together.
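As a rough sketch, the two-container Pod described above could be written as the following manifest. It is built here as a plain JavaScript object and printed as JSON, which kubectl accepts just like YAML. The Pod name, the API server image name (registry.example.com/api-server:1.0) and its port are placeholders made up for this example.

```javascript
// A Pod manifest with two containers that are always scheduled together.
// Names, the API server image and port 8080 are hypothetical placeholders.
const pod = {
  apiVersion: "v1",
  kind: "Pod",
  metadata: { name: "api-with-cache" },
  spec: {
    containers: [
      {
        name: "api-server",
        image: "registry.example.com/api-server:1.0",
        ports: [{ containerPort: 8080 }],
      },
      {
        name: "redis",
        image: "redis:7",
        ports: [{ containerPort: 6379 }],
      },
    ],
  },
};

// Print the manifest so that it can be saved to a file and applied.
console.log(JSON.stringify(pod, null, 2));
```

Saving the output to a file (say pod.json) and running kubectl apply -f pod.json would create the Pod. Since the containers of a Pod share a network namespace, the API server in this sketch could reach Redis on localhost:6379.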

Kubernetes refers to an umbrella of projects that run on a cloud, to manage a cloud. It has various components, such as an API server to interact with the kubernetes system, an agent software named kubelet that runs on each machine in the cloud, a fluentd type of daemon to accumulate logs from various containers and provide a single point of access, a web dashboard, a CLI tool named kubectl to perform various options, etc. In addition to these kubernetes specific components, there are also other services, such as the distributed hashstore etcd (originally from coreos) that you need to setup a basic kubernetes cluster. However, If you are a small company, It'll be wise to make use of GKE or Azure hosting or OpenShift hosting instead of deploying your own kubernetes system managed by your own admins. It is not worth the hassle.

If you want to play with kubernetes in your development laptop (unless you can afford to treat production as your test box), there is a tool named minikube to help you with that. If you are an application developer and considering to dockerizing and deploying your application, then minikube is definitely the best place to start.

There are quite a few meetups happening for kubernetes all around the world. Visiting some of these may be enlightening. The webinar series by Janakiram was good, but it is a little too long to my taste and I lost interest halfway. The persistent ones among you may find it very useful.

Docker Compose

One of the tools from the Docker project that I love a lot is the handy Docker Compose. It is a tool to work with multiple containers. In a sense it is somewhat like your Kubernetes Pods, but without having to install / manage the heavyweight Kubernetes ecosystem. I use Docker Compose extensively in CI, where it is the perfect fit for doing end-to-end testing of a webstack, if your sources are in a monolithic repository. In your CI system, you can bring up all your components (say, an API server, a database, a front end node server) and perform end-to-end testing (say, via selenium). In fact, I cannot fathom how I was doing CI earlier without docker-compose (just like how I cannot fathom how I used cvs before git, etc.).

AWS

No blog post on cloud technologies would be complete without mentioning the 800 pound gorilla, Amazon Web Services. Amazon supports containers natively. You can deploy either a single container or multi-container images natively, via Amazon Beanstalk. It is very similar to Google AppEngine (if you have used it). Beanstalk is a PaaS offering; it takes a container image and scales it automagically depending on various factors (such as CPU usage, HTTP usage, etc.). I've run Beanstalk and am very satisfied with it (perhaps not as much as with AppEngine though). It is very reliable, performant and scales well (tested for a few hundred users in my limited experience).

For larger workloads and for those who want more control, Amazon offers Elastic Container Service. You can create a bunch of EC2 instances and a bunch of containers, and ask ECS to run these containers on these VMs in a way that you prefer. This, however, locks you to the AWS platform (unlike k8s).

Neither Beanstalk nor ECS costs anything extra beyond the price of the VMs, which you already pay.

I, however, wish that Amazon would start supporting Kubernetes natively. There are other ways to make use of Kubernetes in AWS. The most enterprisey is probably Tectonic by CoreOS, but we also have projects like kube-aws and kops.

Conclusion

If you have actually read until this point, thanks a lot :-) I could have written in a little more detail about the nuts and bolts of container technology, but I believe that this post, as is, will be good material for a 101-type introduction. Also, there are people with far more working knowledge than me, who are better equipped to write on the details. So, I have left it as an exercise to the readers to find such talks, blogs or books :)

golang range Tickers

2017-06-14T11:01:00.000+05:30

Update: Please use the playground/gist urls for reading code. Blogger's code formatting is terrible and does not support embedding gists either.

Yesterday Praveen sent me an interesting piece of golang code. Read the following code and tell what the output will be:

===
package main

import (
	"fmt"
	"runtime"
	"time"
)

type LED struct {
	state  bool
	ticker *time.Ticker
}

func toggle(led *LED) {
	led.state = !led.state
}

// looper toggles the LED every time the ticker fires.
func looper(led *LED) {
	for range led.ticker.C {
		toggle(led)
	}
}

func main() {
	fmt.Println("Initial number of GoRoutines: ", runtime.NumGoroutine())

	led := &LED{state: true, ticker: time.NewTicker(time.Millisecond * 500)}
	go looper(led)
	fmt.Println("Number of GoRoutines after a call to looper: ", runtime.NumGoroutine())

	time.Sleep(2 * time.Second)

	led.ticker.Stop()
	fmt.Println("Number of GoRoutines after stopping the ticker: ", runtime.NumGoroutine())

	runtime.GC()
	fmt.Println("Number of GoRoutines after gc: ", runtime.NumGoroutine())
}
===

Golang playground URL: https://play.golang.org/p/1as5QN1r2c
Gist URL: https://gist.github.com/psankar/8af76ba183b0203ec141bca8156f5955

I will explain roughly what the code is doing.

There is an LED struct which has a Ticker and a state variable. While creating an instance of the LED struct, we initialise the state and the Ticker. There is a looper function that toggles the state whenever the Ticker fires an event.

Now when the program is launched, there will be one goroutine (the main goroutine). After we call looper in a goroutine, the goroutine count will be 2. Now comes the tricky part. We stop the Ticker after a particular amount of time. We even call the GC.

Praveen observed that this piece of code was leaking goroutines: the number of goroutines never went down, in spite of the Ticker getting stopped.

The leak happens because the "range" loop never exits. If the range loop were on a channel that we own, we could "close" it to end the loop. The ticker.C channel, however, is a receive-only channel that we cannot close, and Stop does not close it either.

How do we fix this, so that no goroutine leaks? If you have watched the talks Go Concurrency Patterns by Rob Pike and Advanced Go Concurrency Patterns by Sameer Ajmani, you will realise that it is quite easy to add another parameter to the looper function, a channel that tells the loop to exit. So the updated code will be:

===
package main

import (
	"fmt"
	"runtime"
	"time"
)

type LED struct {
	state  bool
	ticker *time.Ticker
}

func toggle(led *LED) {
	led.state = !led.state
}

func looper(led *LED) {
	for range led.ticker.C {
		toggle(led)
	}
}

// looper2 toggles the LED on every tick and returns when a value arrives on q.
func looper2(led *LED, q chan bool) {
	for {
		select {
		case <-led.ticker.C:
			toggle(led)
		case <-q:
			fmt.Println("Exiting the goroutine")
			return
		}
	}
}

func main() {
	fmt.Println("Initial number of GoRoutines: ", runtime.NumGoroutine())

	led := &LED{state: true, ticker: time.NewTicker(time.Millisecond * 500)}
	q := make(chan bool)
	go looper2(led, q)
	// go looper(led)
	fmt.Println("Number of GoRoutines after a call to looper: ", runtime.NumGoroutine())

	time.Sleep(2 * time.Second)

	led.ticker.Stop()
	fmt.Println("Number of GoRoutines after stopping the ticker: ", runtime.NumGoroutine())

	q <- true
	// Note: looper2 may not have returned yet when the count below is read.
	fmt.Println("Number of GoRoutines after sending a message on the quit channel: ", runtime.NumGoroutine())
}
===

Playground URL: https://play.golang.org/p/NlWbyHLHvA
Gist URL: https://gist.github.com/psankar/4e5b2e563038ce3e9c17eb208c76168a
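One caveat with the quit channel: the send on q only tells us that looper2 has received the message, not that it has returned, so the last count can occasionally still read 2. Below is a minimal sketch of one way to wait for the exit. The names looper3, done, settled and runAndStop are my own additions for illustration and are not part of Praveen's code; the polling in settled is just a cautious assumption that the runtime may need a moment to retire a finished goroutine.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

type LED struct {
	state  bool
	ticker *time.Ticker
}

// looper3 toggles the LED on every tick until quit is closed.
// It closes done on its way out, so that the caller can wait for it.
func looper3(led *LED, quit <-chan struct{}, done chan<- struct{}) {
	defer close(done)
	for {
		select {
		case <-led.ticker.C:
			led.state = !led.state
		case <-quit:
			return
		}
	}
}

// settled polls for up to about a second until the goroutine count is
// back to want.
func settled(want int) bool {
	for i := 0; i < 100; i++ {
		if runtime.NumGoroutine() <= want {
			return true
		}
		time.Sleep(10 * time.Millisecond)
	}
	return false
}

// runAndStop starts the looper, stops it, and reports whether it went away.
func runAndStop() bool {
	before := runtime.NumGoroutine()
	led := &LED{state: true, ticker: time.NewTicker(50 * time.Millisecond)}
	quit := make(chan struct{})
	done := make(chan struct{})
	go looper3(led, quit, done)

	time.Sleep(200 * time.Millisecond)
	led.ticker.Stop()
	close(quit)
	<-done // looper3 has left its loop
	return settled(before)
}

func main() {
	fmt.Println("looper goroutine exited:", runAndStop())
}
```

Closing quit (instead of sending a value on it) also has the nice property that any number of goroutines can listen on the same channel.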

Let me know if you have any comments.

Conversations with self while "Learning Reactjs"

2016-11-11T14:29:00.000+05:30

I implemented a bunch of APIs in Go. Took a few hours in the night. Let me add a web client to these. Maybe I will learn to build a SPA. Which poison to choose? React, Angular 1.x, Angular 2.x, Vue?

> Go with react. That is what all the cool kids are using. Also, something to do with: Angular 2 continues to put “JS” into HTML. React puts “HTML” into JS. sounds geeky and logical.

Okay. Let me start with this react. Where do I even begin ? Seems very complex.

> Alright. There is this create-react-app which is introduced by Facebook to make it easy to begin, so that you do not have to break your head about gulp, grunt, node, etc. and their magical version incompatibilities

I started with this create-react-app and went a little further. I can create various components and render them, but how do I get various views/components to interact, to form a workflow (say such as sharing a session string or so) ?

> This is where state management comes in. You need to maintain state in a nice way centrally. You need to use the Flux architecture, introduced by Facebook.

Cool. So I just use the flux library from Facebook and things will all fall in place ?

> Actually flux is a standard, but everyone uses Redux, which is an implementation of this standard. Oh, btw there are a lot of other implementations such as alt. The creator of redux seems to be an active guy who helps in the community often, writes long stackoverflow posts, etc. How can someone who writes long posts be wrong?

Hm. Okay. Let me start with this redux. What should I understand ?

> It is simple. If you understand: Global store, Reducers, Actions, Dispatch, Containers, you have understood redux. Just follow these egghead tutorials.

Okay. I tried following these. They are really beginner unfriendly. Actually this series on youtube is better, though a bit out-dated and non-standard. I have now done a simple redux toy app.

> Try a complex app, with multiple pages and talk to that API that you implemented.

Good idea. I will start with it. Oh, wait. My component has a lot of buttons, text boxes, etc. I need a way to do something rudimentary, like getting the values from the username and password input boxes when a "Login" button is clicked. Do I need to make a mess of global states and private, component-specific states? That is so contrary to what we learnt so far.

> Think again. Is there any alternative for this?

Maybe I can have Actions, ActionCreators and State variables for each field in each component, centrally maintained in the global store? That will be a looooot of boilerplate.

> Ahem. Maybe you should start using the redux-form library. It will minimize your workload and reduce the boilerplate.

Is it well maintained? It has plenty of github stars, but what if the bus factor is low and the primary author loses interest when he gets a dayjob somewhere else? Also, it is already at version 6. Wasn't react itself announced just three years ago? Why are there 6 major versions of this library already? Will this change again if I depend on it?

> Hrm. How long has it been since you began the exercise ?

It has been about a month or so, learning only during late nights (after the dayjob and getting the kid to sleep, etc.) and occasionally on weekends. Already I am tired. Maybe this javascript-fatigue is real.

> Now that you have learnt React and Redux, and experienced first-hand how much time it takes to identify the quintessential combination of libraries, you should not attempt to build anything in your free time, and you should be careful in choosing these technologies for a proper dayjob.

If all these javascript fatigue posts are to be believed, the alternatives are equally bad if not worse. Angular 2 broke APIs in the RC stage and, it seems, does not guarantee that APIs will not break even after release. Anyways, I started this project to learn about react and I can say I know my way around react. It is a different question whether I want to choose UI programming as a full time profession at all. The current flux (not to be confused with the architecture) of things makes it extremely painful. I think people choose mobile first development not because of product requirements but because of javascript fatigue.

PS: A lot of things where I had to take a detour and wasted a lot of time are trimmed from the post, as they are anyway not directly related to ReactJS.

[Help Needed] FOSS License, CLA Query

2016-04-27T16:32:00.003+05:30

I want to start a FOSS project. FOSS Licenses are a grey area. I am trying to seek some public opinion here, to choose a license and a Contributor License Agreement (CLA). The project details are:

The project is a database (say, like mongodb, Cassandra etc.). It will have a server piece that users can deploy for storing data. Though it is a hobby personal project as of now, I may offer the database as a paid, hosted solution in future.
There are some client libraries too, for providing the ability to connect to the above mentioned server, from a variety of programming languages.
The client libraries will all be in Creative Commons Zero License / Public Domain. Basically anyone can do anything with the client library sources. The server license is where I have difficulty choosing.
Anyone who contributes any source to the server software should re-assign their copyrights and ownership of the code, to me. By "me", I refer to myself as an individual and not any company. I should reserve the right to transfer the ownership in future to anyone / any company. I may relicense the software in future to public domain or sell it off to a company like: SUSE, Red Hat, Canonical, (or) a company like: Amazon, Google, Microsoft etc.
Anyone who contributes code to my project should make sure that [s]he has all the necessary copyrights to submit the changes to me and to re-assign the copyrights to me. I should not be liable for someone's contribution. If a contributor's employer has a sudden evil plan and wants to take my personal project to court (unlikely to happen, nevertheless), it should not be possible.
Neither I nor the users of the software should be sued for patent infringement over code that is contributed by someone else. If a patent holder wants to sue me for code that I have written in the software, that is fine. I will find a way around.
Anyone should be free to take the server sources, modify it and deploy it in his/her own hardware/cloud, for their personal and/or commercial needs, without paying me or any of the contributors any money/royalty/acknowledgement.
If they choose to either sell the server software or host it and sell it as a service (basically commercial reasons), they must be required to open source their changes in the public domain, unless they have written permission from me, at my discretion. For instance, if coursera wants to use my database source, after modifications, it is fine with me; but I will not want, say, Oracle to modify my software and sell the modified software / service without opensourcing their changes. If someone is hosting and selling a service of my software with modified sources, there is no easy way for me to prove their modification, but I would still like to have that legal protection.

The best license model that I could come up with for the above is: dual license the source code under AGPLv3 and a proprietary license. Enforce a CLA to accept all contributions only after a copyright reassignment to me, with a guarantee that I have the right to change the license at a future time.

What is not clear to me however, is the patent infringement and ownership violation related constraints and AGPL's protection on such disputes. Another option is: Mozilla Public License 2.0 but that does not seem to cover the hosting-as-a-service-and-selling-the-service aspect clearly imho.

Do you, readers of the internet, have any better solution?

Are you aware of any other project using any other license, CLA model that may suit my needs and/or is similar ?

What other things should I be reading to understand more ?

Or, should I lose all faith in licenses and keep the sources private and release the binary as freeware, instead of open sourcing ? That would suck.

Or should I just not bother about someone making proprietary modifications and selling the software/service, by releasing the software to public domain ?

Note: Of course, all of this is assuming that my 1 hour a month hobby project would make it big, be useful to others and someone may sue. In reality, the software may not be tried by even a dozen people, but I'm just romanticizing.

Programmers guide to Microservices/SOA

2016-03-02T17:26:00.003+05:30

Introduction

SOA, or Service Oriented Architecture, has been one of the buzzwords among architects, senior developers and job descriptions for the last few years. However, most of the definitions of SOA online are riddled with formal words, such as the one from OASIS, which says: "A paradigm for organizing and utilizing distributed capabilities that may be under the control of different ownership domains. It provides a uniform means to offer, discover, interact with and use capabilities to produce desired effects consistent with measurable preconditions and expectations."

The above definition, though precise, is too abstract for a developer. This post tries to explain what constitutes a [micro-]service oriented architecture and how it differs from a traditional monolithic approach that a programmer may be accustomed to. This post is introductory material aimed at beginners to SOA who have already done some monolithic projects. Experts in SOA could validate the facts mentioned and suggest alternatives.

Microservices

A very informal way to understand microservices is this: if we split up every class of our design into an HTTP-accessible webservice of its own, we would end up with a bunch of services, which together constitute a microservices-based architecture. The difference between SOA and microservices is just the level of granularity to which you decompose your classes (in a monolithic application) into independent HTTP services. The more minimal the functionality of each service implementation, the closer it is to being called a microservice.

Splitting a single application into multiple services imposes a few restrictions on our coding, but in turn gives us a lot of flexibility and power in scaling. Let us look at some of the coding/design constraints.

Stateless Systems

The fundamental difference from a monolithic design is in maintaining state information. All the individual classes which earlier interacted via global variables (for locks, mutexes, config variables, etc.) can no longer rely on them.

Let us take a simple example: we are building a rudimentary Shopping application with just one type of item stored. The Shopping application has two parts, the Inventory part that adds new items and the Sales part that removes items. Let us consider the following pseudo-code:

package shop

import "sync"

type Inventory struct{}
type Sales struct{}

// mu guards stockItemCount
var mu = &sync.Mutex{}
var stockItemCount = 10

func (i *Inventory) AddToStock(n int) {
    mu.Lock()
    stockItemCount += n
    mu.Unlock()
}

// UpdateStock removes n items and reports whether enough stock was available.
func (s *Sales) UpdateStock(n int) bool {
    mu.Lock()
    defer mu.Unlock()
    if stockItemCount >= n {
        stockItemCount -= n
        return true
    }
    return false
}

In the above code snippet (trivialized for brevity), we have a global variable stockItemCount, which is protected by a mutex mu. The AddToStock function of the Inventory class/type, adds to this global variable whereas the UpdateStock function of the Sales class/type, removes from the global variable. The mu lock synchronizes the access, such that the functions have exclusive access to the global variable on execution.

In a SOA, the Inventory and the Sales classes will become their own individual HTTP webservices. These new individual classes, viz., SalesService and InventoryService, may now run on different machines.

Inter-Service Co-ordination

So how do these different services potentially running on different machines, share and synchronize access to common data ? The solution is simple. We move away from the globalVariable+mutex pattern and implement a publish-subscribe or queueing pattern. What does that mean ?

We move the stockItemCount management into a separate StockService which is accessible by both the InventoryService and SalesService (earlier considered classes/type). Let us take a look at a sample pseudo-code:

// Stock Service on Machine A
var count = 10

func (s *StockService) ProcessQ() {
    for {
        op := Q.Read() // blocks until an Operation is available
        if op.Type == "Add" {
            count += op.Value
        } else if op.Type == "Remove" {
            if count >= op.Value {
                count -= op.Value
                http.Post(op.CallbackURL, "text/plain", strings.NewReader("success"))
            } else {
                http.Post(op.CallbackURL, "text/plain", strings.NewReader("failure"))
            }
        }
    }
}

// Operation is the message that travels on the Q
type Operation struct {
    Type        string // "Add" or "Remove"
    Value       int
    CallbackURL string // empty when no reply is needed
}

// Inventory Service on Machine B
func (i *InventoryService) AddToStock(n int) {
    Q.Write(Operation{"Add", n, ""})
}

// Sales Service on Machine C
func (s *SalesService) UpdateStock(n int) {
    Q.Write(Operation{"Remove", n, callbackURL})
}

func (s *SalesService) POSTHandler(w http.ResponseWriter, r *http.Request) {
    s.Notify(r.Body) // success or failure
}

As seen above, we have two classes which are converted into services (SalesService and InventoryService) and a new third service named StockService. We also have a Q (a distributed queue infrastructure) that we use. We have an Operation class/type, with a Type string, whose instances we will be adding to the Q. The AddToStock function in the InventoryService adds a new Operation item of type "Add" to the queue, whereas the UpdateStock function of the SalesService adds a new Operation item of type "Remove". The StockService has a ProcessQ function which runs an infinite loop to fetch items from the Q and, based on the Type of the operation, performs either an addition to or a removal from the count.

It should be clear now that the SalesService and the InventoryService are now totally stateless. They just make use of the Q to communicate with the StockService. The meticulous among the blog-readers would have observed that the StockService is still stateful. We maintain the count variable still as a global variable. In any large scale system, there may be some components, which may not be completely stateless. We will have some drawbacks because of having such stateful parts. We will discuss them more in a future section.
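To make the flow concrete, here is a minimal single-process sketch of the same design. It rests on two loud assumptions that are not part of the architecture above: a Go channel stands in for the distributed Q, and a Reply channel stands in for the HTTP callback URL. It only illustrates who owns the state; it is not how you would deploy this across machines.

```go
package main

import "fmt"

// Operation is one message on the queue.
type Operation struct {
	Type  string    // "Add" or "Remove"
	Value int       // number of items
	Reply chan bool // stand-in for the callback URL; nil when no reply is needed
}

// StockService is the only owner of count, so no mutex is needed.
type StockService struct {
	count int
}

// ProcessQ handles operations one at a time until the queue is closed.
func (s *StockService) ProcessQ(q <-chan Operation) {
	for op := range q {
		switch op.Type {
		case "Add":
			s.count += op.Value
		case "Remove":
			ok := s.count >= op.Value
			if ok {
				s.count -= op.Value
			}
			if op.Reply != nil {
				op.Reply <- ok
			}
		}
	}
}

// InventoryService is stateless: it only knows how to reach the queue.
type InventoryService struct{ q chan<- Operation }

func (i *InventoryService) AddToStock(n int) {
	i.q <- Operation{Type: "Add", Value: n}
}

// SalesService is stateless as well.
type SalesService struct{ q chan<- Operation }

// UpdateStock asks for n items and waits for the success/failure reply.
func (s *SalesService) UpdateStock(n int) bool {
	reply := make(chan bool)
	s.q <- Operation{Type: "Remove", Value: n, Reply: reply}
	return <-reply
}

func main() {
	q := make(chan Operation)
	stock := &StockService{count: 10}
	go stock.ProcessQ(q)

	inventory := &InventoryService{q: q}
	sales := &SalesService{q: q}

	fmt.Println(sales.UpdateStock(15)) // false: only 10 in stock
	inventory.AddToStock(10)
	fmt.Println(sales.UpdateStock(15)) // true: 20 in stock, 5 left
	close(q)
}
```

Because every operation goes through the queue and is processed by one loop, the ordering that the mutex used to give us now comes from the queue itself.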

The Q forms a very central part of the above architecture. The Q can be implemented by the programmer manually, and could potentially be deployed on a totally different set of machines from A, B or C. However, there are some stable queue implementations that we could use, instead of reinventing the wheel. Apache Kafka and RabbitMQ are some popular opensource systems. If you want a hosted solution, Amazon SQS is offered by AWS and Cloud PubSub by Google. These systems could be called Messaging Middleware.

There are projects where a massive amount of data will be generated (say from sensors instead of humans) and we will need realtime processing of streaming data. We could use specialized streaming middleware such as Apache Storm or a hosted solution such as Amazon Kinesis.

Benefits of SOA

As we just saw above, what was a simple single process with two classes became three different classes and a queueing system, with four different processes across four (or more) different machines, to accommodate SOA. Why should a programmer put up with so much complexity? What do we get in return? We will see some of the benefits in this section.

Horizontal Scalability

Suppose we have a server with 4GB RAM serving 100k requests per second for our Shopping site above and, because of an upcoming holiday season, we estimate that the visitor count will increase and we will have to serve 400k requests per second. We could do one of two things. (1) We could buy more expensive hardware, say a 16GB RAM machine, move our site deployment to this bigger machine until the holiday season is over and get back to the old system later. (2) We could launch another three 4GB RAM machines and handle the increased load. The former is called Vertical Scaling and the latter is called Horizontal Scaling.

Vertical scaling is appealing for small workloads but is costlier, as we have to provision huge machines. Even if you could rent high-end VMs in the cloud, the pricing is not too friendly. Horizontal scaling is easier on your wallet, provides more throughput and allows for more dynamism.
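The arithmetic behind option (2) is a ceiling division. The small sketch below assumes that throughput scales linearly with the number of identical machines, which is an idealization; the function name instancesNeeded is made up for this example.

```go
package main

import "fmt"

// instancesNeeded returns how many identical machines are needed to serve
// targetRPS when one machine can serve perInstanceRPS (rounded up).
func instancesNeeded(targetRPS, perInstanceRPS int) int {
	return (targetRPS + perInstanceRPS - 1) / perInstanceRPS
}

func main() {
	total := instancesNeeded(400000, 100000)
	// We already run one machine, so we launch total-1 more.
	fmt.Printf("total machines: %d, extra machines: %d\n", total, total-1)
}
```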

Auto-Scaling

In our Shopping application, we saw that the Sales and the Inventory Services are stateless. So we could horizontally scale them individually. For example, we could launch 3 new instances of the SalesService to handle a holiday-traffic while maintaining the single machine for the Inventory service. This kind of flexibility would not have been possible with our earlier monolithic design. However, note that the Stock Service that we had was stateful and so it could not be horizontally scaled. This is the drawback of having stateful components in your architecture.

Once we know that the systems could be horizontally scaled, the next logical progression is to make the scaling automatic. There are systems like Amazon Beanstalk and Google AppEngine (to a certain extent, and with vendor lock-in) that allow your application code to scale horizontally and automatically, by launching new instances whenever the demand is higher. The new instances will be automatically shut down when the burst of traffic is reduced. This greatly reduces IT administration overheads. We could have such nice features only because our application architecture was composed of stateless services.

Serverless Systems

The next step in the evolution of auto-scaling is to have code that automatically decides the number of servers on which it should run, instead of having to provision anything. To quote Dr. Werner Vogels, CTO of Amazon, "No server is easier to manage than no server". We are clearly moving in this direction with serverless webapps. Amazon Lambda brings to life this functional programming dream. Google is not far behind and has recently launched Cloud Functions (but not as rich as AWS Lambda yet imho). We have frameworks to build an entire suite of applications without servers, using these services.

Polyglot Development

As we are deploying each service independently, we could use different programming languages, frameworks and technologies for each of the services. For example, any CPU intensive service could be written in a performant language like Go while a bunch of front end code could be written in parallel in React or nodejs.

Mobile First

Since we have developed proper HTTP APIs for our application, in addition to the webclient, any mobile client too could use our webservices. These days, most companies start with a mobile-first or mobile-only strategy and do not require a webclient. Some pro-monolithic engineers tend to argue that the first iteration of development should be in a monolithic model and that we could re-engineer for SOA at a later stage of development, as development speed is faster in a monolithic design. Personally, I disagree with this. If we start with SOA in mind from scratch, with our modern day development stack, we could easily plumb existing things together instead of reinventing the wheel and could do projects faster. There are frameworks and techniques to auto-generate a lot of code, once we have finalized the APIs. I have had experience building web applications both as a monolith and in SOA from scratch, and I have felt happier with SOA code every time. YMMV.

Auxiliary Parts

If we are building a SOA based system, we need to have a lot more auxiliary support systems. If we do not have these auxiliary parts in place, it will be very difficult to measure, debug or optimize. Different companies implement different parts below, based on their business needs and deadlines.

Performance Metrics

The most important auxiliary aspect of SOA is to have precise performance metrics for each of the services. If we have SOA without performance/metrics measurement, it will be as ineffective as trying to do bodybuilding or weight loss without observing what we eat. Without measurement, we will not be able to rate limit requests, prevent DoS attacks or understand the health of the service. The performance measurement can be done in two ways: (1) measure the performance and show metrics by realtime event monitoring; (2) log various events, errors, response times, etc., aggregate these logs and batch process them later, to understand the health of various components. We will need a combination of both approaches for any large scale system.

Luckily there are plenty of tools, services and libraries available for this. AWS API Gateway is perhaps the easiest way to register your APIs and monitor the endpoints. However we may need more fine-grained measurements too (such as how long the calls to the database take, which user is causing more load, at what times the loads are high, etc.). There are various tools that we could use, such as statsd, ganglia, nagios, etc., and various companies that offer hosted solutions too, such as sematext, signalfx, newrelic, etc.

Distributed Tracing

Tracing is a concept that is supplementary to metrics and performance measurement. When a new request comes to a service, it may in turn make use of 3-4 other services to serve the original request. Those 3-4 other services may in turn call 3-4 other services. Tracing helps us find out, on a per-request basis, the map of which services are used to serve it, how long it took at each point, where the request got stuck if it could not be serviced, etc.

We could achieve tracing by giving a unique id / context object to each new incoming request in the outermost API which receives the request, and passing it along as we make further API requests, until the final response is finished. This context could be passed along as a parameter in the webservice calls. The monitoring of the tracing events could again be realtime or deduced from log aggregation.

Dapper is a paper released by Google summarizing how tracing is done in Google. Twitter has released Zipkin, a FOSS implementation of the above Dapper paper, that is in production.
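A minimal sketch of that idea in Go follows. The header name X-Request-ID, the helper names (withTrace, forward, demo) and the URL http://inner.internal/stock are all assumptions chosen for this illustration; real tracing systems carry richer context (span ids, timings) than a single id. To keep the sketch runnable without a network, the "downstream call" invokes the inner handler in-process instead of going through an http.Client.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// traceHeader is a hypothetical header name used to carry the trace id.
const traceHeader = "X-Request-ID"

func newTraceID() string {
	b := make([]byte, 8)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

// withTrace makes sure every request carries a trace id: it keeps the id
// sent by an upstream service, or mints one at the outermost API.
func withTrace(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get(traceHeader)
		if id == "" {
			id = newTraceID()
			r.Header.Set(traceHeader, id)
		}
		w.Header().Set(traceHeader, id)
		next.ServeHTTP(w, r)
	})
}

// forward builds the outgoing request for a downstream call and copies the
// trace id of the incoming request onto it.
func forward(in *http.Request, url string) (*http.Request, error) {
	out, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	out.Header.Set(traceHeader, in.Header.Get(traceHeader))
	return out, nil
}

// demo sends one request through an outer and an inner service and returns
// the trace id that each of them saw.
func demo() (outerID, innerID string) {
	inner := withTrace(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, r.Header.Get(traceHeader)) // echo the id we received
	}))
	outer := withTrace(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		out, err := forward(r, "http://inner.internal/stock")
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		// A real service would send out with an http.Client here.
		inner.ServeHTTP(w, out)
	}))

	rec := httptest.NewRecorder()
	outer.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/items", nil))
	return rec.Result().Header.Get(traceHeader), rec.Body.String()
}

func main() {
	outerID, innerID := demo()
	fmt.Println("same id at both services:", outerID != "" && outerID == innerID)
}
```

If every service logs this id with each log line, the log aggregation mentioned above can stitch the path of a single request back together.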

Pagination

Assume that we are exposing an API in our StockService to list all the items that we have, along with their RetailPrice. If we have, say, a billion products, the response to the API will be huge. Not just the response: the system resources needed to build that response on the server side will be tremendous. If we are fetching the billion items from the database, the caches will be thrashed, the network will be clogged, etc. To avoid all these issues, any API that could potentially list a lot of items should consider paginating its response by a page number, i.e., an API call should take a page number as a parameter and should return only M items in a page. The value of M could be decided based on the size of each item in the response. We can optionally take the number of results that the user wants as an HTTP parameter too.

For example:

http://127.0.0.1:8080/posts/label/tech - Returns the first 10 blog posts with label "tech"
http://127.0.0.1:8080/posts/label/tech/1 - Same as above
http://127.0.0.1:8080/posts/label/tech/2 - Returns blog posts 11 -> 20 with label "tech"
http://127.0.0.1:8080/posts/label/tech/?limit=5 - Returns the first 5 blog posts with label "tech"
http://127.0.0.1:8080/posts/label/tech/?start=15&limit=5 - Returns 5 blog posts starting from post 15 (posts 15 to 19) with label "tech"

API Versioning

If software never changed, we software engineers would be out of jobs. It is good that software evolves. However, we need some contracts/APIs, so that the changes are smooth and do not bring down the entire ecosystem when a change happens. Once we have exposed an API outside our developer team, it is wiser to never change its request/response parameters.

In our StockService example (that we discussed a few paragraphs ago), we could have the following API:

http://stockservice/items/ - Returns all the items.

Later someone figured out that, it is not wise to return all the items always and decides to change the behavior to return only the first 10 items. This change will break all the existing clients, who will all assume that there are only 10 items in total while in reality we may have a billion more items waiting to be paginated.

The easiest way to regulate API changes is by adding a version to the APIs. For example, if the original API to return all the items had a version param, we could just increment it like:

http://stockservice/V1/items/ - Returns all the items
http://stockservice/V2/items - Returns the top 10 items

The version need not always be part of the URL. We could also take the version as an extra HTTP header, instead of creating a new URL endpoint. It is a matter of taste and each approach has its own pros and cons.
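
Both styles of versioning can share one dispatch table. The following is a minimal sketch, not a real framework: the handler names, the in-memory ITEMS list and the "X-API-Version" header name are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional

ITEMS = [f"item-{n}" for n in range(1, 26)]  # stand-in for the real item store


def list_items_v1() -> List[str]:
    """V1 contract: return every item (kept unchanged for existing clients)."""
    return list(ITEMS)


def list_items_v2() -> List[str]:
    """V2 contract: return only the first 10 items."""
    return ITEMS[:10]


HANDLERS: Dict[str, Callable[[], List[str]]] = {
    "V1": list_items_v1,
    "V2": list_items_v2,
}


def handle(path: str, headers: Optional[Dict[str, str]] = None) -> List[str]:
    """Pick a handler from the URL prefix, else from a (hypothetical) header."""
    parts = [part for part in path.split("/") if part]
    if parts and parts[0].upper() in HANDLERS:
        version = parts[0].upper()
    else:
        # "X-API-Version" is an assumed header name; unversioned calls get V1.
        version = (headers or {}).get("X-API-Version", "V1").upper()
    if version not in HANDLERS:
        raise ValueError(f"unsupported API version: {version}")
    return HANDLERS[version]()
```

Here handle("/V2/items") and handle("/items/", {"X-API-Version": "V2"}) both return the first 10 items, while old clients calling "/V1/items/" keep getting the behaviour they were written against.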

CircuitBreaking

Once we have multiple components in a system, there is a high chance that some part of the system may be down at any given time, for updates or otherwise. When that happens, a service that knows its dependency is failing could choose to wait for some time before attempting a retry, instead of hammering it with requests. Martin Fowler has written in detail about this, which is a good read.
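
A bare-bones version of the idea can be sketched as below. This is an illustrative toy, not Fowler's code or any particular library: the class name, the thresholds and the injectable clock are assumptions, and a production breaker would also need thread safety and metrics.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call a dependency known to be failing."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after   # seconds to wait before trying again
        self.clock = clock               # injectable so that tests need not sleep
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], object]) -> object:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked as down; not retrying yet")
            # Cool-down elapsed: allow one trial call (the "half-open" state).
            self.opened_at = None
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```

A caller wraps each remote call, e.g. breaker.call(lambda: fetch_price(item_id)) with fetch_price being whatever function talks to the other service, and treats CircuitOpenError as "fail fast, serve a fallback".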

Service Discovery

In a large scale system architected with a microservices based design, you will have plenty of services. Each of these services may want to know the location (URL, ip-address+port, etc.) of the services on which it depends. So we need some kind of centralized service registry where all this information is stored and maintained.

The easiest and probably most used way to identify these services is through DNS. However, there are plenty of other tools available for this purpose too. ZooKeeper from Apache and etcd from CoreOS are strongly consistent, distributed datastores which could be used for service discovery. Consul from HashiCorp and Eureka from Netflix are dedicated service discovery software. All of the above are FOSS projects as well. If your application has fewer than a dozen services, it probably makes sense to just read from a file shared across these services, instead of deploying a complex suite of software. But keep in mind that it won't scale as you grow, so it is better to start with good practices as a habit from the beginning.
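
For the small-scale case, the shared-file approach can be as simple as the sketch below. The JSON layout (service name mapped to base URL) and the function names are assumptions for illustration; the point is that callers go through locate(), so the file can later be swapped for DNS, Consul or etcd without touching them.

```python
import json
from typing import Dict


def load_registry(path: str) -> Dict[str, str]:
    """Read the shared file that maps service names to base URLs."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def locate(registry: Dict[str, str], service: str) -> str:
    """Return the address of a service, failing loudly if it is not registered."""
    try:
        return registry[service]
    except KeyError:
        raise LookupError(f"service not registered: {service}") from None
```

A registry file for this sketch would contain something like {"stockservice": "http://10.0.0.5:8080"}, where the address is of course made up.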

SDKs

A new TCP connection takes time to establish because of the initial handshake delay, so it would be foolish not to reuse connections. HTTP calls also inherently need to be retried a few times on failure before giving up. Some programmers do not like writing HTTP client code all the time either. It is therefore often recommended to release SDKs for the APIs that we expose, so that programmers can consume them easily. For example, a Python programmer can merely import our SDK's classes to add an item to our StockService, instead of having to write HTTP retry code.
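
What such an SDK hides from its users could be sketched as follows. Everything here is hypothetical: StockServiceClient, add_item, the "/items/" path and the transport callable are invented for the example. The transport stands in for a real HTTP layer that would keep its TCP connection open between calls, and a real SDK would also have to think about whether retrying a POST is safe (idempotency).

```python
import time
from typing import Callable, Dict, Optional, Tuple

# A transport takes (method, path, body) and returns (status_code, body).
Transport = Callable[[str, str, Dict[str, object]], Tuple[int, Dict[str, object]]]


class StockServiceClient:
    """Hypothetical SDK class; method and path names are illustrative only."""

    def __init__(self, transport: Transport, retries: int = 3, backoff: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.transport = transport
        self.retries = max(1, retries)  # always make at least one attempt
        self.backoff = backoff
        self.sleep = sleep

    def add_item(self, name: str, retail_price: float) -> Dict[str, object]:
        body = {"name": name, "RetailPrice": retail_price}
        return self._request("POST", "/items/", body)

    def _request(self, method: str, path: str,
                 body: Dict[str, object]) -> Dict[str, object]:
        last_error: Optional[Exception] = None
        for attempt in range(self.retries):
            try:
                status, payload = self.transport(method, path, body)
                if status < 500:
                    return payload
                last_error = RuntimeError(f"server error {status}")
            except ConnectionError as exc:
                last_error = exc
            if attempt < self.retries - 1:
                # Exponential backoff: 0.5s, 1s, 2s, ... between attempts.
                self.sleep(self.backoff * (2 ** attempt))
        assert last_error is not None
        raise last_error
```

The consumer then writes client.add_item("pen", 10.0) and never sees the retry loop.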

In the past we have had technologies like DCOM, CORBA, RMI etc. that aimed at doing distributed computing within walled gardens of technology. They lost out in market share due to the simplicity of REST services where HTTP verbs (GET, PUT, POST, DELETE) could perform remote operations, without the need for complex and mostly platform-specific stubs/skeletons etc.

There is a common middle ground where the best of both worlds could be used. The most notable framework for this is gRPC. It is an open source project, started by Google and adopted by many companies (most recently CoreOS), that helps in providing a web API where client SDK generation is also made easy. It supports HTTP/2 as well. If I were starting a new project today, I would give this a serious thought.

Further Information

A very good read on the need for SOA is Steve Yegge's Platforms rant.
Read the techblog of companies who are moving to SOA (not just those who have moved already)
Talk to engineers from Netflix or Amazon Web Services, if you know someone. Sadly, neither company has a presence in India (as of 2016), even though both services are available here.
Follow Netflix techblog http://techblog.netflix.com/
Watch AWS reInvent videos and if you have a chance attend that event (instead of events like Google I/O which are more business driven)

Other Notes:

If you like this post, share it with your friends.
Please send any comments / feedback regarding the language or content used. I am planning to use this for teaching material in a college, for a 1 hour talk, shortly. Should some other topics be covered ?
All opinions expressed are purely personal.

2015 Learning Retrospective

2016-01-01T01:36:00.001+05:30

The year began well. Started working on keeri with an aim to implement a distributed database, thereby learning the distributed systems concepts and leveraging my storage / filesystem experience.
Took Coursera's cloud computing concepts course to understand the fundamentals that will help in implementing keeri.
As part of the database implementation, needed to implement a SQL parser which will convert given SQL statements into a decision tree. Took a coursera course on compilers. Implemented a decent recursive-descent (note the wordplay) parser that will process SQL queries with parentheses, Logical operators and Relational operators.
Having already tired of the non-core aspects of the "distributed" database, abandoned the project temporarily.
Need to learn more about NewSQL technologies. Especially in the areas around how it helps for better tooling (for IDEs and the like) and also for parallelism.
Studied a bit of database literature around ARIES, Voltdb etc.
Attempted to read part-time parliament but lost interest midway because of reading raft, which serves a similar purpose but is a lot simpler to read and follow. Did a paper reading session together with Sureshkumar Thangavel for this.
Played around with Continuous Integration systems (travis, jenkins, etc.) out of interest, which later helped in projects in two different dayjobs.
Wasted a lot of time, pretending to be preparing for interviews but did not do anything more than chatting with job change aspirants. But no complaints as time enjoyed is not time wasted.
Learnt a little bit in more detail about queueing systems (Amazon SQS, rabbitmq to be specific)
Wrote some test / tutorial programs for the Amazon Go SDK
Did a few prototypes using Go for the API backend, Angular and React as the web frontends for few project ideas. Bothered about the fatigue induced by the constant reinvention in the frontend JS technologies. The future looks potentially even more heavily fragmented with no sanity in the horizon.
Learnt to create docker images. Did some non-trivial dockerization for a legacy product with then employer. Wanted to checkout kubernetes, rocket and potentially provide patches. But lost interest.
Wrote a bunch of long blog posts which triggered some nice private discussions. 1 2 3
Worked a little bit on ithavi - the book on operating systems in tamil, but shamefully minuscule progress. Should do more next year at least.
The year began well but lost steam midway, probably due to the decision to change the dayjob after 10+ years with SUSE/Novell. It led to distraction, lack of interest and some sentimental times leading to lesser productivity towards hobby projects. Hopefully the next year will be better, but with a job in a startup that works fast, I am not sure how much bandwidth I may have.
Still not convinced if I should work on any of these system software anymore or if I should focus on some other paradigm that is in its infancy. Ken Thompson and Dennis Ritchie worked on Unix when OSes were not mature. Leslie Lamport worked on distributed systems papers which became valuable after more than two decades. Go is now using a paper on garbage collection by Dijkstra and Lamport written in the 70s. So, I am wondering if I should focus on some problems / technologies whose time has not come yet, to feel that excitement of walking on uncharted territories. There are a few options like quantum cryptography etc. which have good theorists who need programmers. If I could collaborate with such intelligent people and synergistically add some value, it will be satisfying. I briefly discussed with some researchers in India (IIT Madras, TIFR etc.) about doing a PhD or helping as an assistant. But there has not been any progress, and nothing sounds too promising if the work has to be done from India, thanks to our country's brain drain and the Government of India's focus on doing research on loony vedic technologies instead of on useful things. That is enough rant for the year :)

This is a series of blog posts, that I write every year to document and reflect on the learnings, that I have had outside the dayjob. Previous editions: 2014, 2013

FOSS System Software in 2015

2015-11-29T15:26:00.001+05:30

Prelude

About a year ago, I was playing around with Cassandra for a quick prototype in my then dayjob. It opened up the world of distributed systems to me and I was piqued. Audaciously, I decided to implement a simple distributed database, Keeri, to have a grasp of the fundamentals of the implementation of distributed databases. In the past, I have implemented a simple filesystem, which helped me immensely when I was working as a filesystem engineer. Also, I have always been fascinated by the theory behind database internals right from college days, but did not get my hands dirty.

After a few weeks of work, I was able to implement a recursive-descent SQL parser, which analysed the incoming SELECT queries, made a tree with the subqueries properly branched as sub-trees. I made a simple columnar store that appends data (via an API as opposed to SQL) but without any atomicity guarantees. In short, it was a rudimentary, in-memory system that functions decently. However, I was nowhere near the initial goal of implementing a distributed database.
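
To give a flavour of what a recursive-descent parser for query conditions looks like, here is a small sketch. It is not Keeri's code: it handles only an assumed, simplified WHERE-style grammar (comparisons joined by AND/OR, with parentheses) and builds nested tuples as the tree. Each grammar rule becomes one method, which is the defining trait of the technique.

```python
import re
from typing import List, Optional

TOKEN = re.compile(r"\s*(<=|>=|!=|[()=<>]|[A-Za-z_]\w*|\d+|'[^']*')")
RELATIONAL = {"=", "!=", "<", ">", "<=", ">="}


def tokenize(text: str) -> List[str]:
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected input at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, tokens: List[str]) -> None:
        self.tokens = tokens
        self.pos = 0

    def peek(self) -> Optional[str]:
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self) -> str:
        token = self.peek()
        if token is None:
            raise SyntaxError("unexpected end of input")
        self.pos += 1
        return token

    def parse(self) -> tuple:
        tree = self.expr()
        if self.peek() is not None:
            raise SyntaxError(f"unexpected token {self.peek()!r}")
        return tree

    def expr(self) -> tuple:      # expr := term ("OR" term)*
        node = self.term()
        while (self.peek() or "").upper() == "OR":
            self.take()
            node = ("OR", node, self.term())
        return node

    def term(self) -> tuple:      # term := factor ("AND" factor)*
        node = self.factor()
        while (self.peek() or "").upper() == "AND":
            self.take()
            node = ("AND", node, self.factor())
        return node

    def factor(self) -> tuple:    # factor := "(" expr ")" | column op value
        if self.peek() == "(":
            self.take()
            node = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return node
        column, op, value = self.take(), self.take(), self.take()
        if op not in RELATIONAL:
            raise SyntaxError(f"expected a relational operator, got {op!r}")
        return (op, column, value)


tree = Parser(tokenize("a = 1 AND (b > 2 OR c != 'x')")).parse()
```

Because term() is called from expr(), AND binds tighter than OR without any explicit precedence table; the parentheses rule simply recurses back into expr().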

I realised that there were plenty of design choices in a distributed database implementation, right from architecture, replication, membership, consensus, CAP, etc. I even took a coursera course that helped me understand the basics in detail. As of today, I have enough confidence in my skills and knowledge to implement a distributed database which could serve as good teaching material, if not as production software. However, I have not added a single line of code to the project in the past seven months. Abandoning the project (at least temporarily) hurts.

Yesterday, my daughter decided to wake me up after midnight. I spent the rest of the night wide awake while she slept, thinking about why I have not made progress in keeri. I realised that I have been overwhelmed by the amount of things to do that are not core to the system. For example, once I decided that the database had to be NewSQL based, a SQL parser became imperative. But there are a dozen parsing techniques and tools (LL, LR, Recursive Descent, ANTLR etc.). Understanding the pros and cons of each type and finding the most suitable candidate is a non-trivial task. SQL parsing is just one component of the system. There are other components, such as the choice of datastructures (based on read/write ratio, type of load etc.). One approach is to proceed with the simplest choice for each component, with well-defined borders, so that individual components can be replaced later. By the time I completed the SQL-query-to-decision-tree code, I felt exhausted, even before I began the core database and distributed systems functionality.

Observations

From my past open source experience, I have known the synergic boost that developers experience in FOSS communities. It is always good to work in a like minded team of developers rather than individually when working on big problems. I started thinking what other FOSS projects exist for distributed databases (or any other large scale system software) that were created in the last 5-6 years. A few things that came to my mind were:

Hadoop: An umbrella of projects, initially started by Yahoo, including core projects such as HDFS, HBase and a laundry list of supporting projects. Most of the code is now under the Apache project with plenty of companies sponsoring the development and using the projects.

Docker: The coolest kid in the town. Initially started by Solomon Hykes funded by dotCloud as a side project. Arguably the most active project to date used heavily by almost all tech companies worth their salt. This spawned off a series of other projects too.

CoreOS: Linux re-thought for being a Cloud focussed distro. Backed by a company with the same name. Founded by ex SUSE, Rackspace people, collaborating with Greg KH himself.

Redis: Started by antirez, funded by vmware, pivotal and most recently redis labs. Probably the most used k-v database, probably challenged only by the older memcached.

Cassandra: Initially started by Facebook and later became an Apache project. Heavily used by companies like Netflix, Twitter, Apple etc. even after Facebook has moved away.

Kafka: Initially started by Linkedin and later became an Apache project. Used by linkedin and almost every company today. Most of the original team that created the project have now jumped off to a new company named Confluent working on the project full time.

CockroachDB: A project that claims to be the open source equivalent for Google's Spanner. Started by ex-googlers. Development funded and managed by Cockroachlabs.

As I kept thinking about these (and a few other) projects in a state of semi-sleep, I had a eureka moment when I realised that all these projects, even though they are open source, began with funding from a company or investor money. This is in complete contrast to the FOSS projects of the previous generation like GNU, Linux, GNOME etc. It is a welcome change that developers are now not scared of becoming C*Os and spending time in management or bootstrapping a company. Perhaps it is only today that we have companies like Zenefits (disclaimer: employer ;) ) making it easy to start a company, and so more developers find it easier to do so. VCs being in a bullish mindset also helps.

However, I have one concern. Unless the project usage explodes and gains contributors from multiple companies, there is a high chance that the projects may compromise on quality to accommodate a business need. For example, if Torvalds was working for Google, wakelocks *might* have merged into kernel much earlier to suit a Google release cycle. Torvalds being a neutral outsider, without having any commercial interests in any company (directly) has helped Linux immensely. If a FOSS project is started with a backing company in place, from day one, how high will the company's benefits influence the design / features / review processes of the FOSS project ?

As I thought more, I realised that most of these new system software are developed to address the pains of "as a service" providers. So unless there is a business case, it is perhaps difficult to create a new modern system software, as the era of one size fits all is over. This also makes the previous concern, about the chief maintainer being company neutral, irrelevant. Only when a software is made with the backing of a company, with real use cases and customers instead of theoretical / intellectual curiosities, will we get live data. Personal pet projects may have to be satisfied with machine generated data, which may not be the best testsuite for data intensive software like databases.

Questions

To sum up, I wonder if developers (students) are still interested in, or will be able to develop, FOSS projects in their own hobby time that could grow as big as Linux, without having corporate backing, for the first few years at least. What do you think ?

Also, if you think it is possible, any recommendations for developers with family and personal needs to spend time, off the regular day job, to persist with pet FOSS projects without exhaustion ? Are there any statistics available on contributor details for popular FOSS projects (similar to Kernel stats prepared by Greg KH and LWN) ?

Any other aspects that I have missed ?

P.S: I was sharing the gist of this post with a friend who shrugged off saying, "Get a job in the company which works on a project which appeals to you". However, it is not that simple, considering most of these young companies do not even have an office anywhere outside the developed world. Also after a certain point in life, switching jobs is not trivial and depends on various other factors.

AWStruck

2015-11-22T14:44:00.002+05:30

tldr:

A long post about my experience with implementing a quiz software in my college, a decade ago and wondering how easy things have become now due to AWS.

Prelude

In 2002 (iirc) (thirteen years ago, as of composing this post) when I was in college, we had an inter-collegiate technical symposium, where Online Quiz was one of the events. A Microsoft Visual Basic 6.0 (which I personally consider to be one of the best software ever developed) application was developed in-house and installed on about 50 computers, where various contestants from different colleges could come and take the test. However, as Murphy predicted, due to various virus issues, the software failed spectacularly. Some answers/responses got corrupted, accumulation of responses from different machines proved faulty, the scoring went awry in some corner cases, etc. Overall, the application turned out to be total chaos. However, since India is populous, we were able to throw more people at the problem and finish the event, with a lot of manual effort, in spite of a few unhappy participants.

In the planning phase for the subsequent edition of the symposium two years later, a software development committee was formed. It would do all the software for the entire event (like creating a website, developing flash/swish videos, software for the individual events, etc.). The quiz event had two rounds: a preliminary round where all the appearing colleges contested and a final round for which six (or probably more) top colleges from the previous round were selected. An eloquent person was made in charge of the quiz event. I proposed to the person that we do the software for the preliminary rounds ourselves, instead of depending on the committee. The committee was already swamped with work and they were happy to get rid of a piece that had more chances of failure. Some adventurous people (like Antony) expressed their interest in joining the project. Thus it all began.

The Adventure

Much to the amusement of my roommate Bala, I started with planning the architecture and design on paper (complete with UML diagrams, etc.), instead of starting with coding as is the norm for us those days. Much later I came across an interesting quote by Alan Kay, "At scale, architecture dominates material". Having learnt from the mistakes of the previous years, I made some decisions.

* The software should follow the web (client-server) model, that is getting popular. At least this is an excuse to learn some new (then) technologies, like JSP, Javascript, Tomcat etc.
* The server machine becomes a single point of failure for the entire system. It could prove to be a performance bottleneck too, as our machines all had a humongous 32 MB of RAM. There was one machine with 64 MB of RAM in our lab, which I planned to use as the server. In our hostel, some had a machine with a luxurious 128 MB of RAM, which I was planning to borrow if the need came.
* The single point of failure, the server should not be susceptible to virus attacks. So we should experiment installing Solaris or this thing called Linux (There was no Ubuntu then).
* Internet was a luxury and for the entire college we had access to it in about three computers, only in the evenings. So anything that requires too much internet access for development is automatically rejected.
* The software should scale at any cost, for at least 200 parallel connections
* We should regularly back up the sources on different machines, in case the development boxes get a virus attack. We had no idea of version control systems then.
* We will be using Mysql/oracle or some real database instead of writing to files. MS Access was ruled out automatically as Visual Basic was eliminated already. In hindsight, sqlite would have been an excellent choice.
* The quiz webpage when saved on the client browser should save the file along with the answers chosen/typed.
* Each quiz session will last for about 30 minutes. There will be username/passwords generated for each unique participant.

We developed the JSP webapp running in Tomcat in a few weeks. We used the generous help of my classmates to thoroughly test the correctness of our scoring system. As with any manual system, it was prone to errors. A tester made a mistake in scoring and we broke our heads for a few hours trying to find a non-existent bug in our code. This testing also helped us get the load numbers for the current system, with about 30 concurrent users. We had some performance monitoring hooks written in our code for this.

We survived multiple virus attacks during the development, because of the distributed source backup techniques that we had employed. At one stage, we even burnt our sources onto a CD when the administrators decided to Norton Ghost all the hard disks in our lab with a fresh Windows XP image, to minimise the virus effects.

I learnt the magical world of performance monitoring, database indexes, high availability, connection pools etc. during this project. I learnt much more in this single experiment than the almost half a dozen papers we had on software engineering, process management, quality assurance etc. taught by lecturers with no real world knowledge and questionable scoring practices. Some of the fascination that I acquired with database engines has still not subsided.

Having finished the coding one week prior to the event, we focussed more on scaling and testing. I prepared a backup server, another high-RAM machine in case our main server went kaput. Much to the jovial criticisms of my friend Sangeeth, we tested our system, the night before the event, for 2000 parallel users and it worked well without breaking a sweat. This is such a silly number in today's figures, but we were easily satisfied then with low numbers in both server performance (and salary). Almost all the front end code was handwritten Javascript with no frameworks / libraries (as mostly none existed or we were not aware). I was satisfied with what we have done, irrespective of however the results may turn out to be the next day.

Having lost a good sleep due to the stress testing the previous night, I woke up late and missed the delicious Pongal in our hostel for breakfast. Ruing about the missed breakfast caused a weird stare from the rest of the team. I rushed for the preliminary quiz event ahead of time and two among us did a final test on the last day. We planned on using some Rational test suites for automated testing but could never get to that, thanks to all the virus related frequent re-installs of the base operating system.

The participants came in numbers, attended the event. Surprisingly for us, a lot of people did not use the half-an-hour duration and finished much earlier, even with negative questions. The event chief too had a moment of doubt, if we have prepared the questions easy. But looking at the instant results in the server and the high percentage of low marks taught the lesson that many people have come to the event to have fun and not to seriously compete or win.

Before we could ruminate on that philosophical thought, a participant had a problem. Her network cable broke and she could not submit her quiz. I felt that I should have implemented auto-save for responses as soon as people make a choice. I had intentionally avoided that to reduce load on the server. I was about to ask if the person could take the test from a different machine. But the inimitable event-head had the presence of mind to ask the participant to save the quiz page on the same computer, so that we could evaluate it offline. Antony did the scoring and that particular person turned out to be a topper. This particular event taught me a lot about presence of mind and how we should always plan for failure in computer systems, however thoroughly we test. The scalability, as expected, was never found to be a problem.

After the event finished, our lab admin, Marshal, joked that we should start a company with this quiz software, as we had built it as a generic survey software where questions can be added. We laughed at the suggestion and moved on. The event was successful. Some of the software developed by the committee for some other events was affected by the recurring virus problem. But I went and slept like a log on a temporary bed made of three office chairs, next to my classmate Saktheesh who was working on a closing video for the event.

The Present

The long story above is not just to narrate my/our work, but also to highlight how much more approachable the programming / technology landscape has become. A quiz / questionnaire software can be implemented today (2015) in probably a few hours, thanks to the large number of frameworks (such as Rails, Django, etc.). In fact, most of the tutorials have better code that you can merely copy/paste than what we implemented a decade ago. The most striking thing today, however, is not the story of coding, but the story of deployment.

Anyone with an internet connection, a basic course on programming and decent googling skills can program any service easily today. What is even more fascinating is that such a software can be very easily deployed on the internet, served to the whole of the world, complete with a domain name, auto-scaling, DoS prevention etc. in just a few clicks. This is all made possible through Amazon Web Services. There are other players like Google, Heroku etc. but it is AWS that is way ahead of any other player and provides more services. The reach of AWS is what made me choose the title of this blog post.

AWS has done a lot to spur the startup ecosystem. The social impact of AWS is much higher than what Google did for online Ads or Microsoft did for PCs. Disruptive companies like airbnb, slack, netflix (which was just an online video rental service 7 years ago) can exist today only because their devops, installation and maintenance of machines could be outsourced to AWS. They could not have grown into such 800-pound gorillas in such a short time if the AWS infrastructure was not available. Sure, there are some companies like Uber, Whatsapp that do not use AWS, but they would not have got funded easily if not for the startup scenario, which was formed with AWS as the backbone.

The Future

I have been visiting various buildings in Bangalore trying to find an office space for Zenefits India, as I am the first engineer here. All the places have a Server Room, which is not used by any of the startups. Almost all the startups use a Mac for developer machine and have their deployment servers in AWS (or some public cloud). The office spaces of Bangalore have not caught on with the trend. We are hiring btw, so if you consider yourself an extremely good engineer and one of the best at what you do in your job, do apply.

Most of the new services offered by Amazon, such as Amazon Lambda, DynamoDB etc., and also things like containerisation, have made development of scalable applications easier. Developers need not worry about failover systems, HA systems, clusters etc. any more. I wonder what kind of an impact this will have on the job market. I wonder how long it might take for job positions like mysql admins, sysadmins, devops engineers and DBAs to become as old/obsolete as, say, mainframe programmers are considered today (2015). Perhaps it may not be soon, but it is very much possible. Ubiquitous applications like SAP, Office etc. are also now cloud first and will only become more cloud focussed in the future.

I wonder how much of system software research will be affected in the long term. Many of the modern day young bright minds (students from prestigious colleges and universities) are working in webapps, joining startups and doing their own companies, instead of working on projects with high entry barriers like the Linux Kernel, LLVM etc. (at least in India). Perhaps, we would have started the quiz project that we did as a company (somewhat like surveymonkey) if we had enough exposure then. I may not have done that but some students smart with a business acumen, would have.

There are very interesting research problems in distributed systems that include both Databases and OSes. Most of the present day systems are just distributed systems constructed over Linux / POSIX systems. However, there is a potential for a DOSIX (along the lines of POSIX) API purely designed for large-scale, cross-geo distributed systems. It will be interesting to see what kind of research happens in this direction. In the recent past, we got a new distributed consensus algorithm, Raft, after decades of using Paxos. More such re-inventions are bound to happen soon, maybe on novel things like non-blocking, distributed garbage collection etc.

Online Programming Competitions are Overrated

2015-03-02T13:37:00.000+05:30

The title is not merely a clickbait, but my current opinion, after attending a programming competition for the first time. This post expresses my opinions on the hiring processes of [some of] the new age companies through programming competitions and algorithms-focused interviews.

I believe that the assessment for a senior/architect level programmer, should be done by finding how co-operative [s]he is with others to create interesting products and their history than by assessing how competitive [s]he is in a contest.

Algorithms

In my lone programming competition experience (on hackerrank), the focus of the challenges was on Algorithms (discrete math, combinatorics etc.).

Usage of standard, simple algorithms, instead of fancy, non-standard algorithms is a better idea in real life, where the products have to last for a long time, oblivious to changing programmers. Fancy algorithms are usually untested, harder to understand for a maintenance programmer.

Often, it is efficient to use the APIs provided by the standard library or ubiquitously popular libraries (say jquery). Unless you are working on specific areas (say compilers, memory management etc.), in-depth knowledge of a wide range of algorithms may not be very beneficial (imo) in day-to-day work, as elaborated in the next section.

Runtime Costs

There are various factors that decide the runtime performance, such as: Disk accesses, Caches, Scalable designs, Pluggable architectures, Points of Failures, etc.

Algorithms optimize mostly one aspect, CPU cycles. There are other aspects (say choice of Data structures, databases, frameworks, memory maps, indexes, How much to cache etc.) which have a bigger impact on the overall performance. CPU cycles are comparatively cheap and we can afford to waste them, instead of doing bad I/O or a non-scalable design.

Most of the time, if you choose proper datastructures and get your API design correct, you can plug in the most efficient algorithm later without affecting the other parts of the system, iff your algorithm really proves to be a bottleneck. A good example is the evolution of filesystems and schedulers in the Linux Kernel. Remember that the Intelligent Design school of software development is a myth.

In my decade of experience, I have seen more performance problems due to poor choice of datastructures or unnecessary I/O, than due to poor selection of algorithms. Remember, Ken Thompson said: When in doubt, Use Brute Force. It is not important to get the right algorithm on the first try. Getting the skeleton right is more important. The individual algorithms can be changed, after profiling.
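
As a tiny illustration of "get the skeleton right, swap the internals after profiling", consider the two functions below. They are made-up examples, not code from any project mentioned here. Both have the same signature and produce the same output; the first is the brute-force version with a list, the second only changes the datastructure to a set, which turns each membership check from a scan into a hash lookup.

```python
from typing import Iterable, List


def find_duplicates_bruteforce(values: Iterable[str]) -> List[str]:
    """First version: plain lists, so every membership check scans the list."""
    seen: List[str] = []
    duplicates: List[str] = []
    for value in values:
        if value in seen and value not in duplicates:
            duplicates.append(value)
        seen.append(value)
    return duplicates


def find_duplicates_hashed(values: Iterable[str]) -> List[str]:
    """Drop-in replacement: same signature and output, but set lookups."""
    seen = set()
    reported = set()
    duplicates: List[str] = []
    for value in values:
        if value in seen and value not in reported:
            duplicates.append(value)
            reported.add(value)
        seen.add(value)
    return duplicates
```

Callers never notice the swap, which is the point: the brute-force version is fine until a profiler says otherwise.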

At the same time, this should not be misconstrued as an argument to use bubblesort.

The 10,000 hour rule

Doing well in online programming competitions is mostly the 10,000 hour rule in action. You spend time in enough competitions and solve enough problems, you will quickly know which algorithm or programming technique (say dynamic programming, greedy) to employ if you see a problem.

Being an expert at online programming competitions does not guarantee that the expert can be trusted with building or maintaining a large scale system that has to run for long, with code that lives for years (say on the scale of filesystems, databases, etc.). In a competition, you solve a small problem at a microscopic level. In a large scale system, the effects of your code are systemic. Remember the fdisk, sqlite, firefox fiasco?!

In addition to programming skills, there are other skills needed such as build systems, dependency management (unless you are working on the kernel), SCM, Versioning, Library design aspects, automated testing, continuous integration etc. These skills cannot be assessed in online programming competitions.

Hardware

In my competition, I was asked to solve problems in a machine that is constrained to run in a single thread. I do not know if it is a limitation on hackerrank, or if all online competitions enforce this.

If it is the practice in all online programming competitions, then it is a very bad idea. Although I can understand the infrastructure constraints for these sites, with the availability of multi-core machines these days, your program is almost guaranteed to run on multiple cores. You miss out on a slew of evaluation options if the candidate is forced to think of a single-threaded design.

With the arrival of cloud VMs and the Google appengine elasticity, it is acceptable to throw more CPUs or machines at a program on-demand, without incurring high cost. It is okay to make use of a simpler, cleaner algorithm that is more readable and maintenance friendly (than a complex, performant algorithm), if it will scale better on increased number of CPUs or machines. The whole map-reduce model is built around a similar logic.
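As a rough illustration of that trade-off, here is a minimal sketch of a map-reduce style word count in Go. The function names are made up for this example, and a real map-reduce job would spread the chunks across machines rather than goroutines. The per-chunk logic stays simple and readable; scaling comes from running more copies of it.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// countWords is the "map" step: a plain, readable count for one chunk.
func countWords(chunk string) map[string]int {
	counts := make(map[string]int)
	for _, w := range strings.Fields(chunk) {
		counts[strings.ToLower(w)]++
	}
	return counts
}

// parallelWordCount runs the map step on every chunk in its own goroutine
// and then merges ("reduces") the partial results in a single goroutine.
func parallelWordCount(chunks []string) map[string]int {
	partials := make([]map[string]int, len(chunks))

	var wg sync.WaitGroup
	for i, chunk := range chunks {
		wg.Add(1)
		go func(i int, chunk string) {
			defer wg.Done()
			partials[i] = countWords(chunk) // each goroutine writes only its own slot
		}(i, chunk)
	}
	wg.Wait()

	total := make(map[string]int)
	for _, p := range partials {
		for w, n := range p {
			total[w] += n
		}
	}
	return total
}

func main() {
	chunks := []string{"go is simple", "simple is good", "Go scales"}
	counts := parallelWordCount(chunks)
	fmt.Println(counts["go"], counts["simple"], counts["is"])
}
```

Adding more chunks (or more cores) needs no change to `countWords`, which is the part a maintainer has to read most often.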

I don't claim that concurrency/parallelism/cloud is a panacea for all performance problems, but it is too big a thing to ignore while assessing a programmer. A somewhat detailed explanation is at the Redis creator's blog (strongly recommended to subscribe).

AHA Algorithms

I first heard of the concept of AHA Algorithms in the excellent book Programming Pearls by Jon Bentley. These are the algorithms which make very complicated, seemingly impossible problems look trivial, once you know the algorithm. It is impossible for a person to solve such problems within the span of the competition/interview if the candidate is not aware of the algorithm earlier and/or does not get that AHA moment. Levenshtein Distance, Bitmap algorithms etc. fall in this category. It may not be wise to evaluate a person based on such problems.
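For reference, here is a minimal sketch of the Levenshtein distance in Go, using the textbook dynamic programming formulation with two rolling rows. The function names are my own. Once you have seen the recurrence, the code is a dozen lines; deriving that recurrence from scratch in a timed contest is the hard part.

```go
package main

import "fmt"

// levenshtein returns the minimum number of single-character insertions,
// deletions and substitutions needed to turn a into b. It compares Unicode
// code points (runes), not bytes.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)

	// prev[j] holds the distance between the first i-1 runes of a and the
	// first j runes of b; curr is the row being filled in for i runes of a.
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // turning "" into b[:j] takes j insertions
	}

	for i := 1; i <= len(ra); i++ {
		curr[0] = i // turning a[:i] into "" takes i deletions
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(
				prev[j]+1,      // deletion
				curr[j-1]+1,    // insertion
				prev[j-1]+cost, // substitution (or match)
			)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

func min3(a, b, c int) int {
	m := a
	if b < m {
		m = b
	}
	if c < m {
		m = c
	}
	return m
}

func main() {
	fmt.Println(levenshtein("kitten", "sitting"))
}
```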

Conclusion:

Candidizing (is that a word ?) long-term FOSS contributors for hiring may be an interesting alternative to hiring via online programming competitions or technical interviews. Both the interviews and contests have extrapolation errors when the person starts working on a job, especially on large scale systems.

I see that a lot of the new age companies are asking for a github profile in resumes, which is good. But I would prefer a more thorough analysis of long-standing projects and not merely personal pet projects that may not be very large-scale or popular. The fact that not every person works on a FOSS project in their free time is also a deterrent to such an approach.

Online programming competition websites could limit the number of participants in advance and give the participants an infrastructure that matches real-world development, instead of an input-output comparison engine with a 4-second timeout.

Having said all this, these online programming contests are a nice way to improve one's skills and to think faster. I will be participating in a few more to make myself fitter. There may be other programming challenges which are better and test all aspects of an engineer. I should write about my view after a year or so.

One more thing: Zenefits & Riptide I/O

In other news, my classmates' companies Zenefits and Riptide I/O are hiring. Have a look if you are interested in a job change (or even otherwise). They are at an excellent (imo) stage where they are still a startup in engineering culture, but have enormous funding to work on the next level of products. It should be an exciting opportunity for any curious engineer. Zenefits works on web technologies and delivers their SaaS. I would have joined Zenefits if they had an office in Bangalore. Riptide I/O works on IoT and has some high profile customers.

Naming Policy - Deactivated Quora Account

2015-01-14T12:11:00.001+05:30

I just deactivated my Quora account, as they have a policy of mandatory lastname. They stated that they will not accept Initials as well. My account was put on hold due to the lastname not satisfying their requirements.

Tamils do not have a last name due to various political reasons. Avoiding the lastname is considered to be good for ending caste discrimination as well. (Ironically a quora link)

Companies like Google had the sense to relax their real name policy after their initial debacle with G+.

Sadly, Quora does not want to learn from their own or others' mistakes. I wonder if they will even ban names in non-English letters later. So if they cannot be inclusive, I feel that they deserve to lose business.

2014 Learning Retrospective

2015-01-03T22:30:00.001+05:30

The learning in the previous year (2013) was a bit shallow but on a wide variety of topics. The year 2014 turned out to be not bad. I went to some great depths in a small number of areas.

Worked on Korkai - A corpus builder for Tamil. It extracts unique Tamil words from blogger, wordpress and wikipedia dumps. Learnt a lot about XML processing and golang.
Started working on Vaiyakani, an auto-completing, dictionary-based, self-learning, transliterating text-editor for Tamil, after getting unsatisfied with the lack of offline Tamil typing software in Linux. Learnt a great deal about Tries, Prefix Datastructures, Sqlite database engine performance, Datastructures used in the implementation of maps in various libraries and programming languages etc.
The quest for implementing a perfect text editor led to a brief phase of disappointment where I complained on every layer wondering why the below layer is bad and briefly attempting to improve it. (The application is bad, The toolkit is bad, The compiler is bad, The operating system is bad, The hardware is the root of all evil etc.) Thankfully shepherded back into proper line of thinking by the helpful Evan Martin.
Started writing a book on Operating Systems in Tamil. The project is kind of stalled for a while now due to some copyright related issues with the dayjob employer. Hopefully should resume working on it by this month end.
Started tinkering around big data applications and large scale distributed systems. Started feeling the joy of building largescale systems using Golang. Did some prototypes for a new product idea in office and all these prior experiences helped to be very productive.
Explored a lot of databases, especially Cassandra. Built it from source. Started using it in a system with millions of queries load. Explored the gocql (Golang's Cassandra driver) sources.
Played around with a lot of key value datastores (like leveldb, bolt, lmdb etc.) Started with a lot of hope on these and slowly started feeling that k-v stores are probably over-rated (for large scale systems). As if to prove my point, came to know that Spanner the successor to Bigtable is multi-columnar.
Learnt a bit about Docker, Containers, Kubernetes etc. Should explore them more deeply this year.
All this work on database engines, distributed systems, distributed databases etc. led me to read the underlying research papers of such systems. Realized that understanding these systems needs a richer knowledge of some discrete math and of the related literature.
Learnt about Paxos, Raft and other distributed consensus ideas. Humbled to know of a few brilliant minds like Leslie Lamport, Jim Gray, etc. Got inspired to know of a few more interesting people and ideas as well.
Thanks to a DE in Novell, got the opportunity to play with various Amazon cloud services free of cost (there is a free tier for the curious).
Trying to understand how things work behind the scenes in Amazon (the cloud company, not just the online shop), came across the interesting Dynamo paper and came to know of a few interesting/inspiring people (they don't know me yet, though ;) ) like James Hamilton, Swaminathan Sivasubramanian, Werner Vogels etc.
Got too tempted to leave the dayjob and do some real research / Ph.D but considering the financial constraints, will probably stick to the dayjob.
Gave a couple of talks about Golang. One for a startup and another for a bunch of engineers. Attended a talk on Google cloud technologies and slashn.
Came across Pig, Hive and Qubole but did not explore deeply.
Learnt a lot about markup languages, LaTeX etc.
Started some work on a paper on distributed systems, only to get overwhelmed on how to proceed, considering the vastness of the topic that I have chosen for writing. Hopefully should get it into shape in the next few months or just throw it away and proceed with the daily grind.
Wrote a blog post about technology catchup for the last decade and a bunch of other long posts, which got some unexpected accolades.
Started daydreaming if I should choose a nascent area (such as Quantum Computing) to work on in my free time. It will throw up fewer instances of a paper published in the 1960s or a patch from the 1990s for an idea that I so enthusiastically assumed I had invented (in current operating systems/storage etc.) until I searched for it.
Switched to openSUSE Factory and loving it.

Overall, a satisfactory year with relatively deeper learning. Felt that the learning could have been even richer if I had teamed up with more like-minded, technology-driven, small teams. Should experiment with that for a while next year.

Decade of Experience and some [un]wise words

2014-11-05T23:33:00.001+05:30

This month, I complete working 10 years (6 months as an intern and 9.5 years as an employee) for Novell / SUSE / Attachmate / NetIQ India. During this time: I had some very good managers and some very bad managers; Worked with teams from multiple geographies (US, Germany, UK, Czech, Australia and of course India); Worked across multiple age groups (people who finished their PhDs before I was born to people who were born after the Matrix movie was released). The diversity was mainly due to the opensource nature of the work.

Here are some things that I have learned in the last 10 years. Some of them may apply to you. Some of them may not. Choose at your own will. If you are interested in becoming a [product|project] manager, the following may not be helpful, but if you intend to stay a developer, it may be useful.

In big companies, it is easier to do things and ask for forgiveness than to wait for permission, when trying radical changes or new things. There will always be people in the hierarchy to stop you from trying anything drastic, due to risk avoidance. Think how much performance benefit read-ahead gives; professional life is not much different. Do things without worrying about whether your work will be released/approved.
Prototype a lot. What you cannot play around with in production/selling codebases, you can, in your own prototypes. Only when you play around enough, you will understand the nuances of the design.
Modern technologies become obsolete faster. People who knew just C programming can still survive. But people who knew Angular 1.0 may not survive even Angular 2.0. Keep updating yourself if you want to be in the technology line for long.
Do not become a blind fan of any technology / programming language / framework. There is no Panacea in software.
When one of my colleagues once asked a senior person for advice, he suggested: "God gave you two ears and one mouth, use them proportionately". I second it, with an addendum, "Use your two eyes to read" also.
Grow the ability to zoom in and out to high/low levels as the situation demands. For example, you should know to choose between columnar or row based storage AND know about CPU branch prediction AND have the ability to switch to both of these depths at will, as the situation demands. Having a non-theoretical, working knowledge of all layers will make you a better programmer. There is even a fancy title for this quality, full-stack programmer.
The best way to know if you have learnt something well is to teach it. Work with a *small* group of people with whom you can give techtalks and discuss your research interests. Remember the African proverb: If you want to go fast, go alone. If you want to go far, go together. Having a study group helps a lot. But remember, talk is cheap, so don't get sucked into becoming a theoretician. If you are interested in becoming one, get a job as a lecturer and work on hard problems. Look to Andy Tanenbaum or Eric Brewer for inspiration.
Keep a diary or blog of what you have learned. You can assess yourself on a yearly basis and improve yourself. Create and use your github account heavily.
Try different things and fail often. Failure is better than not-trying and being idle. When Rob Pike says, "I am not used to success", he is not merely being humble. It takes years of dedication, work, luck and a lot of failures to become successful and have a large / industry level impact.
Do not be driven too much by money or promotion or job titles. The world does not remember Alan Turing or Edsger Dijkstra or Dennis Ritchie by their bank balance or positions. There are probably a thousand software architects in your locality if you dig linkedin. Try to do good work. Also learn to think like an author.
Good work invariably will get appreciated, even if the appreciation may be delayed in many cases. Sloppy work will be noticed in the long term, even if it is missed in the short term. The higher you grow, the more visibility sloppy work may get.
There will be people smarter and more talented than you, always. Try to learn from them. Sometimes they may be younger than you. Don't let age stop you.
Work in an open source project with a very active community. Communication skills are very important even for an engineer. The best way for an engineer to improve them is to work on an open source project. Ideally, see through the full release of a linux distro. It will take you through all activities like packaging, programming, release management, marketing etc., tasks in which you may not be able to participate at your day job. I recommend openSUSE if you are looking for a suggestion ;)
Except for Mathematics and Music, there is no other field with prodigies. Understand the myth of genius programmer.
There are a lot of bad managers (at least in India). Most of these managers are bad at management, because they were lousy engineers in the first place and so decided to do an MBA and become a people manager, after which they don't have to code (at least in India). If you get one of them, do not try to fight them. Work with them on a monthly basis with a record of objectives and progresses. The sooner you get a bad manager, the better and faster you will appreciate good managers.
Last but not least, identify a very large, audacious problem and throw yourself at it fully. All the knowledge that you have accumulated over the years with constant prototyping and reading will come in handy while solving it. In addition, you will learn a thousand other things, which you could not have learned by lazily building knowledge by reading alone. But the goal that you need to work on has to be audacious (like the goals that gave us Google filesystem or AWS etc.) and solve a very big problem. However, you should start this only after a few years of building a lot of small things and have a full quiver. To become a good system-side programmer, you should have been a good userspace programmer. To become a good distributed systems developer, you should have used a distributed system, etc.

Maybe one more point that I could add is: Try to write short and crisp (unlike this blog post).

code churning and golang

2014-11-02T13:31:00.000+05:30

Recently, we were doing a new prototype at the dayjob. I had the freedom to choose the technology stack for this idea. I wrote a lot of golang code to compare a few aspects across a few technologies (say streaming writes perf stats for Cassandra vs MariaDB etc.) to evaluate some of these technologies for our needs.

The whole activity spanned roughly 12 weeks and we were able to build a very good evolutionary prototype. I was looking at the gitlab stats at the end of 12 weeks and found my personal log to be:

9752 lines added
7119 lines deleted

Even if we assume a 6 day workweek (72 working days in 12 weeks), it translates to about 135 lines of new go code added per day by a developer on average. There have been very productive days where I was able to add more than 400 lines of non-copy-paste code in a single day, after which I ended up having to take rest the next day to recover.

In the past, I have written a lot of C code. I have never felt this productive in C, largely due to the manual memory management (and the ensuing problems like double free, leaks, valgrinding etc.) and difficult concurrency (pthreads, locks, etc.)

It is kind of obvious that Go will naturally feel more productive, due to automatic memory management and concurrency friendly features (goroutines, channels, etc.), resulting in very little non-business code.
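To give a feel for how little non-business code that means, here is a minimal sketch of a worker pool built with goroutines and channels. The function name and the squaring "work" are placeholders invented for this example. There are no explicit locks and no manual thread bookkeeping; the channels carry both the data and the synchronization.

```go
package main

import (
	"fmt"
	"sync"
)

// sumOfSquares fans the inputs out to `workers` goroutines over one channel
// and collects their results over another. It expects workers >= 1.
func sumOfSquares(inputs []int, workers int) int {
	jobs := make(chan int)
	results := make(chan int)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for n := range jobs { // loop ends when jobs is closed
				results <- n * n
			}
		}()
	}

	// Feed the jobs, then close the channel so the workers exit their loops.
	go func() {
		for _, n := range inputs {
			jobs <- n
		}
		close(jobs)
	}()

	// Close results once every worker is done, so the range below terminates.
	go func() {
		wg.Wait()
		close(results)
	}()

	sum := 0
	for r := range results {
		sum += r
	}
	return sum
}

func main() {
	fmt.Println(sumOfSquares([]int{1, 2, 3, 4}, 3))
}
```

The equivalent with pthreads would need a queue, a mutex, a condition variable and careful shutdown logic before any business code appears.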

However, I observed that there are two other non-intuitive reasons why Go was very productive (for me). These reasons do not appear big on their own. But in the overall development time, they were a big influence on my productivity. They are:

1) Static Binary Creation without complex, external build tools

Thanks to my openSUSE packaging experience, I have always taken up the responsibility of keeping the sources of the project where I work as an engineer in a proper, packager-friendly build system. I like build-friendly sources to such an extent that, about a year ago, one of the first tasks that I did when I moved to a new team was to port an old packaging system (hand-written makefiles and obsolete build systems, with sources spread across tens of thousands of files and maintained for about two decades) to CMake. IOW, I know about Linux packaging and its pains.

With go, I was able to easily build the sources and get all the dependencies via a single `go get` command. Installing the binary on a test cluster, or in the AWS was merely a single command away. There was no need to wait for any complicated build setup, setting up dependencies or even waiting hours for a build to finish. There is no need to write complicated Makefiles, CMakeFiles, Configure files, build scripts etc.

Usage of the `go get` tool mandates that developers follow a certain discipline regarding installation/inclusion of libraries and binaries. Static binary generation helps avoid a tonne of deployment hassles. All these minor things, when we do a dozen or more builds in a day, add up to a very big productivity boost. It is not even that uncommon to do a dozen builds in a day, to aid testers, in the prototyping stage. Because of the elegance and simplicity of `go get`, the testers did not even have to wait on dedicated packagers or on developers to get the testbuilds. Even if you don't have dedicated testers, static binary generation cuts down your test setup time.

2) Composition instead of Inheritance

This point is very difficult to explain, as it is more abstract, but it is more influential than the previous one. In the beginning, I was struggling to get Composition right. I ended up trying to organize my files based on an inheritance model (much like the [fs/<files>.c, fs/<ext>/, fs/<btrfs>/<files>.c] layout in the linux kernel), trying to get a base class to delegate things to a derived class manually, based on a derived class identifier in the object etc. I struggled and did not feel productive in coding.

I had to pause, unlearn a few things and think in a fresh perspective again to understand it. Composition is like Cycling. Once you get the hang of it, there is no falling down. I felt that the Composition based model has helped a lot more than any other feature of golang to improve my productivity.

With composition, the amount of code change needed when you refactor code (which is very common in most freshly written code) is far smaller than in code designed for inheritance. It is very hard to explain how this helps in simple English words. But I recommend you write code for yourself and appreciate this. In addition to easy refactoring, Composition tends to reduce boilerplate code substantially and makes the diamond problem obsolete.

The way the Transport object is composed into Go's http Client object (as a field holding a RoundTripper, rather than through inheritance) helped me understand Composition a lot more clearly than any tutorial or book.
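Here is a minimal sketch of that idea. `http.Client`, its `Transport` field and the `http.RoundTripper` interface are from the standard library; `cannedTransport`, `countingTransport`, `newCountingClient` and `fetch` are hypothetical names made up for this example. The client is composed with a transport, and the transport itself is composed by embedding another one, so behavior is added by wrapping rather than by subclassing.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// cannedTransport is a hypothetical RoundTripper that answers every request
// with a fixed body, so the example never touches the network.
type cannedTransport struct{ body string }

func (t cannedTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	return &http.Response{
		StatusCode: http.StatusOK,
		Header:     make(http.Header),
		Body:       io.NopCloser(strings.NewReader(t.body)),
		Request:    req,
	}, nil
}

// countingTransport embeds another RoundTripper. The embedded value does the
// real work; the wrapper only adds a call counter around it.
type countingTransport struct {
	http.RoundTripper
	calls int
}

func (c *countingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	c.calls++
	return c.RoundTripper.RoundTrip(req)
}

// newCountingClient composes the pieces: client -> counter -> canned reply.
func newCountingClient(body string) (*http.Client, *countingTransport) {
	counter := &countingTransport{RoundTripper: cannedTransport{body: body}}
	return &http.Client{Transport: counter}, counter
}

// fetch is ordinary caller code; it neither knows nor cares which transport
// the client was composed with.
func fetch(client *http.Client, url string) (string, error) {
	resp, err := client.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	client, counter := newCountingClient("hello")
	body, err := fetch(client, "http://example.invalid/")
	fmt.Println(body, err, counter.calls)
}
```

Swapping `cannedTransport` for `http.DefaultTransport` would turn this into a real client with request counting, and `fetch` would not need to change.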

Conclusion

Because I was able to write a lot of code fast, I was not too scared to shed code and start from scratch when needed. This explains the roughly 7k deleted lines of code.

goimports and vim-go also helped a lot to get some IDE like features, all of which should thank gofmt in return.

Have you felt any other reason that made you feel a high-level of code churning can be achieved in Go ?

Technology Catchup

2014-10-13T13:31:00.002+05:30

Coincidentally three different people asked me in the last month, to write about new technologies that they should be knowing, to make them more eligible to get a job in a startup. All these people have been C/C++ programmers, in big established companies, for about a decade now. Some of them have had only glimpses of any modern technologies.

I have tried a little bit (with moderate success) to work in all layers of programming with most of the popular modern technologies, by writing little-more-than-trivial programs (long before I heard of the fancy title "full stack developer"). So here I am writing a "technology catchup" post, hoping that it may be useful for some people, who want to know what has happened in the technologies in the last decade or so.

Disclaimer 1: The opinions expressed are totally biased as per my opinion. You should work with the individual technologies to know their true merits.

Disclaimer 2: Instead of learning everything, I personally recommend people to pick whatever they feel they are connected to. I, for example, could not feel connected to node-js even after toying with it for a while, but fell in love with Go. Tastes differ and nothing is inferior. So give everything a good try and pick your choice. Also remember what Richard Feynman said, "There is a difference between knowing the name of something and knowing something". So learn deeply.

Disclaimer 3: From whatever I have observed, getting hired in a startup is more about being in the right circles of connection, than being a technology expert. A surprisingly large number of startups start with a familiar technology rather than the right technology, and then change their technology once the company is established.

Disclaimer 4: This is actually not a complete list of things one should know. These are just things that I have come across and experimented with at least a little bit. There are a lot more interesting things that I would have missed. If you feel something should have been in the list, please comment :-)

With those disclaimers away, let us cut to the chase.

Version Control Systems

The most prominent change in the open source arena, in the last decade or so, is the invention of Git. It is a version control system initially designed for keeping the Linux kernel sources and has since then become the de-facto VCS for most modern companies and projects.

Github is a website that allows people to host their open source projects. Often startups recruit people based on their github profile. Even big companies like microsoft, google, facebook, twitter, dropbox etc. have their own github accounts. I personally have received more job queries through my github projects than via my linkedin profile in the last year.

bitbucket is another site that allows people to host code, and it offers even private repos. A lot of the startups that I know of use this, along with the jira project management software. Jira is your equivalent of MS Project in some sense.

I have observed that most of the startups founded by people who come from Banking or Finance companies use Subversion. Git is the choice for people from tech companies though. Mercurial is another open source, distributed VCS which has lost a lot of limelight in recent times, due to Git. Fossil is another VCS, from the author of sqlite, Dr. Richard Hipp. If you can learn only one VCS for now, start with Git.

Programming Languages & Frameworks

Javascript has evolved to be a leading programming language of the last decade. It is even referred to as the X86 of the web. From its humble beginnings as a client-side scripting language to validate if the user has typed a number or text, it has grown into a behemoth and entered even server-side programming through the node-js framework. For incorporating the ModelViewController pattern, javascript has gained the AngularJS framework. JS is a dynamically typed language; coffeescript offers a cleaner syntax on top of it, and to bring in some of the goodness of statically typed languages, there is TypeScript too.

Python is another dynamically typed, interpreted programming language. Personally, I felt that it is a lot more tasteful than Javascript. It is easy on the eyes too. It helps in rapid application development and is available by default in almost all the Linux distros and Mac machines. Django is a web framework that is built on python to make it easy to develop web applications. In addition to being used in a lot of startups, python is used even in big companies like Google and Dropbox. There are variants of the Python runtime such that you can run it in the JVM using Jython or in the .NET CLR using IronPython. I have personally found this language to be lacking in performance though, which is elaborated more in a subsequent section.

Ruby is an old programming language that shot to fame in recent years through the popular web application framework Ruby on Rails, often called just Rails. I have learnt a lot of engineering philosophies such as DRY, CoC etc. while learning RoR.

All these above languages and frameworks use a package manager such as npm, Bower, pip, gems etc. to install libraries easily.

Go is my personal favorite in the new languages to learn. I see Go becoming as vital and prominent a programming language as C, C++ or Java in the next decade. It is developed in Google for creating large scale systems. It is a statically-typed, automatic-memory-managed language that generates native-machine-code and helps writing concurrent-code easily.

Go is the default language that I have used for any programming task in the last year or so. It is amazingly fast even though (just because?) it is still in the 1.X series. In my dayjob we did a prototype in both go and python, and for a highly concurrent workflow on the same hardware, Go trounced Python in performance (20 seconds vs 5 minutes). I won't be surprised if a lot of the python and ruby code gets converted to golang in their next edition of rewrites. Personally, I have found the quality of go libraries to be much higher compared to Ruby or nodejs as well, probably because not everyone has adopted this language yet. However, this could be just my personal biased opinion.

If you like to get fancy with functional programming, then you can learn Scala (on top of the JVM), F# (on top of .NET), Haskell, Erlang, etc. The last two are very old btw, but are in use even today. Most recently, Whatsapp was known to use Erlang. D is also seen in the news, mostly thanks to Facebook. Dart is another language from Google, but it is yet to receive any wide deployment afaik, even with Google's massive marketing machinery behind it. It has been compared to VBScript and is, as of now, chrome-only. Dart has received criticism from Mozilla, Webkit (the rendering engine that powers Safari (and chrome earlier)) and Microsoft IE as well. Dart is done by Lars Bak et al. (the people who gave us V8, chrome's Javascript engine).

Rust is another programming language that is aimed at high-performance concurrent systems. But I have not played around with it, as they don't maintain a stable API and they are not 1.0 yet. Julia is another programming language aimed at doing distributed systems, about which I have heard a lot of praise, but it still remains an exotic language afaik. R is another language which I have seen in a lot of corporate demos where the presenters wanted to show statistics and charts. Learning this may be useful even if you are not a programmer but work with numbers (like a project manager).

There is a Swift programming language from Apple to write iOS apps. I have not tried Swift yet, but from my experience of using Objective C, it cannot be worse.

Bootstrap is a nice front-end framework from twitter, which provides various GUI elements that you can incorporate into your application, to rapidly prototype beautiful applications that stay fluid even when viewed on mobile.

jquery is a popular javascript library that is ubiquitous. Cascading Style Sheets (CSS for short) is a style sheet language that helps configure the style of the web page UI elements. CSS is becoming mature to the extent of showing animations too. You should ideally spend a few weeks to learn about HTML5 and CSS.

Text Editors

Sublimetext is what the cool kids use these days as the editor. I have found the tutorial on tutsplus to be extraordinarily good at explaining sublime. It is a free (as in beer) software and not open source.

Atom is a text-editor from github built using nodejs and chromium. I did not find a linux binary and so did not bother to investigate it. But I have heard that it is better for Javascript programmers than for others, as the editor can be extended in javascript itself.

Brackets is another editor that I have heard good things about. Lime is an editor that is developed in Go, aimed to be an open-source replacement for the sublimetext.

Personally, after trying various text editors, I have always come back to using vim. There are a few good plugins for vim in recent times. Vundle and Pathogen are nice plugin managers for vim that ease the installation of plugins. YouCompleteMe is a nice plugin for auto-completion. vim-spf13 is a nice distro of vim, where various plugins and colorschemes are pre-packaged.

Distributed Computing

In the modern day of computing, most programs have been driven by a Service Oriented Architecture (shortly SOA). Webservices are the preferred way of communication among servers as well. While we are talking about services, please read this nice piece by Steve Yegge.

memcached is a distributed (across multiple machines) caching system which can be used in front of your database. It was initially developed by Brad Fitzpatrick while he was the head of LiveJournal; he is now (2014) a member of the Go team at Google. While at Google, he started GroupCache which, as the project page says, is a replacement for memcached in many cases.
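To show what "a cache in front of your database" means in code, here is a minimal sketch of the cache-aside pattern in Go. Everything in it is a hypothetical stand-in: the cache is an in-process map rather than a memcached cluster, `slowDB` fakes the database, and there is no expiry or eviction, which a real cache would need.

```go
package main

import (
	"fmt"
	"sync"
)

// Store is a hypothetical stand-in for the real database API.
type Store interface {
	Load(key string) (string, bool)
}

// slowDB pretends to be the database and counts how often it is hit.
type slowDB struct {
	rows map[string]string
	hits int
}

func (db *slowDB) Load(key string) (string, bool) {
	db.hits++
	v, ok := db.rows[key]
	return v, ok
}

// cachedStore implements cache-aside: look in the cache first, fall back to
// the database on a miss, and remember the answer for next time.
type cachedStore struct {
	mu    sync.Mutex
	cache map[string]string
	db    Store
}

func newCachedStore(db Store) *cachedStore {
	return &cachedStore{cache: make(map[string]string), db: db}
}

func (c *cachedStore) Load(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if v, ok := c.cache[key]; ok {
		return v, true // cache hit: the database is not touched
	}
	v, ok := c.db.Load(key) // cache miss: go to the database
	if ok {
		c.cache[key] = v
	}
	return v, ok
}

func main() {
	db := &slowDB{rows: map[string]string{"user:1": "alice"}}
	store := newCachedStore(db)

	for i := 0; i < 3; i++ {
		v, _ := store.Load("user:1")
		fmt.Println(v)
	}
	fmt.Println("database hits:", db.hits)
}
```

Three reads cost only one database hit here. With memcached the map would live in a separate set of machines shared by all application servers, but the read path would look much the same.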

GoogleFileSystem (GFS) is a seminal paper on how Google created a filesystem to suit their large needs of data processing. There is a database built on top of this filesystem named BigTable which powered Google's infrastructure. Apache Hadoop is an open source implementation of these concepts, which was originally started in Yahoo and now a top-level apache project. HDFS is the equivalent of GFS for the Hadoop. Hive and Pig are technologies to query and analyze data from the Hadoop.

As with the evolution of any software, GFS has evolved into a Colossus filesystem and BigTable has evolved into a Spanner distributed database. I recommend you to read these papers even if you are not going to do any distributed computing development.

Cassandra is another distributed database, which was started in Facebook initially but is now used in many companies such as Netflix and Twitter. I have used Cassandra more than any other distributed project and actually like it a lot. It uses a SQL-like query language called CQL - Cassandra Query Language. Its distribution design is modelled after the Dynamo paper from Amazon. I am very tempted to write an alternative to this in Go, just to get the experience of writing a large scale distributed system instead of just using one as a client, but I have not got around to finding a good dataset or use case with which I can test it.

MongoDB is a document oriented database, which I tried using for a pet project of mine. I don't remember exactly, but there were some problems with respect to unicode handling. The project was done before Go reached 1.0, so the problem could have been at either end.

Most of the new age databases are called NoSQL databases, but what that really means is that the database skips a lot of functions (such as datatype validation, stored procedures, etc.) and tries to grow by scaling out instead of scaling up.

Cloud

OpenStack is a suite of open source projects that help you create a private cloud. DeltaCloud is a project which was initially started by Red Hat, and is now an Apache top-level project, as a way to provide a single API layer that works across any cloud in the backend. This project is written in Ruby. I was initially interested in participating in its development, until I got introduced to Go and went off on a different tangent.

Starting a software company is a very easy task in today's world. The public clouds are becoming cheaper and cheaper every day and their capacity can be provisioned instantly.

Amazon Web Services provides an umbrella of various public cloud offerings. I have used Amazon EC2, which is a way to create a Linux (or Windows) VM that runs in Amazon's datacenters. The machines come in various sizes. Amazon S3 is a cloud offering that provides you a way to store data in buckets. This is used heavily by Dropbox for storing all your data. There are various other services too. In some of our prototyping, we found the performance of Amazon EC2 to be mostly consistent, even in the free tier.

Google is not lagging behind with their cloud offerings either. When Google Reader was shut down, I used Google App Engine to deploy an alternative FOSS product and I was blown away by the simplicity of creating applications on top of it. Google Compute Engine is the way to get VMs running on the Google Cloud. As with Amazon, there are plenty of other services too.

There are plenty of other players like Microsoft Azure, Heroku etc. but I do not have any experience with their applications. While we are talking about Cloud, you should probably read about Orchestration and know about at least Zookeeper.

In-Process Databases

These are databases which you can embed into your application, without needing a dedicated server. They run in your process space.

SQLite is arguably the world's most widely deployed database engine, and it competes with fopen to become the default way to store data for your desktop applications (if you are still writing them ;) ). A new branch is also in the works that uses the latest rage in storage datastructures, a log-structured merge tree.
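
For readers who have not met a log-structured merge tree before, the rough shape is: writes go to a small mutable in-memory table, full tables are frozen into sorted immutable runs, reads check the newest data first, and a background merge combines runs. The Go sketch below is a toy, in-memory illustration of that shape only. It is not how SQLite or leveldb implement it, and the `LSM` type and its `limit` knob are hypothetical names for this example.

```go
package main

import (
	"fmt"
	"sort"
)

type entry struct{ key, val string }

// LSM is a toy log-structured merge store kept entirely in memory.
type LSM struct {
	mem   map[string]string // mutable memtable that absorbs writes
	runs  [][]entry         // immutable sorted runs, oldest first
	limit int               // memtable size that triggers a flush (made-up knob)
}

func NewLSM(limit int) *LSM {
	return &LSM{mem: map[string]string{}, limit: limit}
}

func (l *LSM) Put(k, v string) {
	l.mem[k] = v
	if len(l.mem) >= l.limit {
		l.flush()
	}
}

// flush freezes the memtable into a new sorted run.
func (l *LSM) flush() {
	l.runs = append(l.runs, sortedRun(l.mem))
	l.mem = map[string]string{}
}

func sortedRun(m map[string]string) []entry {
	run := make([]entry, 0, len(m))
	for k, v := range m {
		run = append(run, entry{k, v})
	}
	sort.Slice(run, func(i, j int) bool { return run[i].key < run[j].key })
	return run
}

// Get checks the memtable first, then the runs from newest to oldest.
func (l *LSM) Get(k string) (string, bool) {
	if v, ok := l.mem[k]; ok {
		return v, true
	}
	for i := len(l.runs) - 1; i >= 0; i-- {
		run := l.runs[i]
		j := sort.Search(len(run), func(j int) bool { return run[j].key >= k })
		if j < len(run) && run[j].key == k {
			return run[j].val, true
		}
	}
	return "", false
}

// Compact merges all runs into one; newer runs override older ones.
func (l *LSM) Compact() {
	merged := map[string]string{}
	for _, run := range l.runs { // oldest first, so later values win
		for _, e := range run {
			merged[e.key] = e.val
		}
	}
	l.runs = [][]entry{sortedRun(merged)}
}

func main() {
	db := NewLSM(2)
	db.Put("a", "1")
	db.Put("b", "2") // memtable full: flushed as run 1
	db.Put("a", "3")
	db.Put("c", "4") // memtable full: flushed as run 2

	v, ok := db.Get("a")
	fmt.Println(v, ok, "runs:", len(db.runs))

	db.Compact()
	v, ok = db.Get("a")
	fmt.Println(v, ok, "runs:", len(db.runs))
}
```

A real implementation writes the runs to disk sequentially, which is where the write performance comes from, and also has to handle deletes (tombstones) and crash recovery. None of that is shown here.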

leveldb is a database written by the eminent Googlers (and trendsetters of technology in the last decade or so) Jeff Dean and Sanjay Ghemawat, who gave us MapReduce, GFS etc. It has been forked by Facebook into RocksDB as well.

KyotoCabinet and LMDB are other projects in this space.

Linux Filesystems

Since we have covered GFS, HDFS, etc. earlier, we will look at other popular filesystems here.

btrfs is a copy-on-write filesystem in Linux. It is intended to be the de facto Linux filesystem in the future, possibly obsoleting the ext series in the longer run.

XFS is a filesystem that initially came from SGI to Linux. This is my personal favorite and I have been using it on all my linux machines. In addition to good performance, this offers robustness and comes with a load of features that are useful to me, like defragmentation.

We also have ZFS, the big daddy of filesystems, on Linux.

Ceph is another interesting distributed filesystem that works in kernel space and has been part of the Linux kernel sources for a long time now. GlusterFS is another distributed filesystem, which works in userspace. Both of these filesystems focus on scaling out instead of scaling up.

Conclusion

Pick any of these technologies that you like and start writing a toy application on it, maybe something as simple as a ToDo application, and learn through all the stages. This approach has helped me. It may help you also.

I have written this post from a Thinkpad T430 running openSUSE Factory and GNOME Shell with a bunch of KDE tools. I like this machine. However, in the past few months I have realized that, in today's world, if you are a developer, it is best if you run Linux on your server and Mac on your laptop.

Kernel Development Beginner

2014-09-11T10:48:00.000+05:30

Yesterday Vignesh asked me if I could give some guidance to a college junior of mine who wants to start with Kernel programming. Being a filesystem developer at Novell for a while now, I thought I could share some things that I have learned. I wrote a somewhat long reply which I am reproducing below (with minor edits for clarity) in the hope that it may be useful to someone.

Since it was originally intended to be a mail, it is a little more verbose than a blog post. My advice is based on the situation in my college when I studied a decade ago. Things have probably changed and the recommendations may need tweaking based on the context.

---

The most important quality that you need to inculcate if you want to do any kernel space programming is "Patience" (or persistence if you will). Though it is a good quality for any large scale project, it is a fundamental requirement for kernel programming. It is very easy to see progress and make an impact on userspace projects, but even simple changes in the kernel core will take a lot of time to get accepted, and will often require multiple rewrites. But fear not, as there are plenty of people who have conquered this mountain and it is not something to be worried about.

The starting steps will be:

1) Try to understand how to use git. We were (are ?) not taught to use a version control system in our college and it is such a fundamental thing. So start using git for college assignments and get the hang of it.

2) Start writing a lot of C programs and get experienced with pointers, memory allocation, threading. You can start implementing things like Stack, Queue, Trees etc. (whatever you study in datastructures) in a simple, thread-safe way. Do not focus on how you can visualize these datastructures but how you can effectively implement their functionality and thread safety. Use pthreads for threading. Do not use any library (like Glib) for giving you convenient datastructures (like Strings). Implement each of the things on your own. (But when you are writing code for a product, use a standard library always instead of re-inventing the wheel)

Write these C programs on Linux and compile using gcc. In our college days we were using Turbo C on Windows and I hope things have changed. Use a Linux distro (Fedora, Debian, openSUSE, Gentoo etc.) exclusively; do not use Windows (at least for a while), so that you become familiar with the sysadmin and shell-scripting parts of Linux, which will come in handy.

3) Grab a (any) book on Operating Systems theory and read it. The dinosaur book by Silberschatz et al. is a good start.

4) Without hesitation, buy Robert Love's Linux Kernel Development book. It is one of the best beginner materials; start reading it in parallel with the OS book. It is easier to read and more practical than the OS book, while the OS book is more theoretical and adds more value. Handle (3) and (4) in parallel without blocking on any of the other activities.

5) After you are done with (1) and (2), and feel sufficiently confident with C and pointers, grab the linux kernel sources from http://git.kernel.org/ and try to build the sources yourself. http://kernelnewbies.org/KernelBuild should help. Learn how to install and boot with the kernel that you have built.

6.1) Subscribe to Kernel Newbies mailing list http://kernelnewbies.org/MailingList and read every mail, *even* if you do not understand most of it.

6.2) Watch: https://www.youtube.com/watch?v=LLBrBBImJt4

6.3) Subscribe to http://lwn.net RSS feeds.

After this, you should be able to fix and send trivial patches, such as documentation or staging fixes. Once you have done this and got the hang of the process, you will know how to send patches for any part of the kernel.

By this time, you would have found your areas of interest in kernel (filesystems, memory management, io scheduler, CPU scheduling etc.). You will then have to dig deeper in those particular areas, by:
a) subscribing to the individual mailing lists (such as linux-fsdevel, etc.)
b) reading about the bug reports for the individual component
c) finding the literature that is relevant for your subsystem (the Linux memory management book by Mel Gorman, etc.).

Three other non-technical things that I would recommend are:

1) Create a new email address and use that for all your open source activities. That way you do not miss any important updates from your friends.

2) Kernel programming will not give you big money in the short and medium term (at least in India). If your motivation is not excellence in engineering, but becoming popular or rich (it is not wrong btw) then you should focus on some other areas of programming (developing apps, websites, solving user problems, making meaning etc.).

It will often take months (or even years) before you make a significant contribution that is not merely a memory leak or bug fix. Be prepared for that. But since you have age, energy, time (once you get married and/or have kids you will understand) on your side, it is not that difficult.

Many people try kernel programming and then quit because they do not have the patience and perseverance. It may also happen that they have found a more interesting technology at its nascent stage (like Distributed Computing, Artificial Intelligence, Containers, NLP etc.) It is not wrong to quit midway :) Any little time spent on kernel programming will immensely benefit you as a programmer even when you are doing user space programming. This holds good for not just kernel programming but any large-code-base/system programming (like Compilers, glibc, webkit, chrome, firefox etc.)

3) Be more aware of the GSoC community in colleges around you.

All the best.

2013 Learning Retrospective

2014-01-17T13:03:00.000+05:30

My open source work + learning in 2013 was not bad. Considering the lack of personal time due to changes in family and other reasons, I think I did well. Some of the things done are:

Learnt a lot about kernel space filesystems. Tried to write a beginner level introductory tutorial and then abandoned it. Instead started writing a simple kernel space filesystem from the scratch that can help as a teaching material - https://github.com/psankar/simplefs
Implemented B Trees in Go. Researched a bit about B Trees, Log structured merge trees and gave a talk on how B Trees are used in filesystems https://github.com/psankar/btree-go
Released a new version of my chrome extension with support for searching and highlighting multiple strings in Chrome https://github.com/psankar/Find-Many-Strings
Learnt a bit about ext4 filesystem (de)fragmentation implementation. Started working on some tools to simulate filesystem ageing and filesystem fragmentation. Work in progress. This is part of the dayjob though and not done in the freetime
Learnt more about the Go programming language. Almost all the projects last year were done on Go. Understood how to deploy my application in the Google cloud engine. Did some minor comparisons across various online appengines, such as heroku, amazon, google etc.
Gave SublimeText an honest try and, just as it happens with every other editor, came back to Vim, although this time with a few new plugins: https://github.com/Valloric/YouCompleteMe & https://github.com/tpope/vim-pathogen specifically
Inspired by gotour, implemented an HTML5 based slideshow tool where the slides can be written in markup, and slide contents can talk to any compiler and show the output in the webpages. This will come in handy in teaching programming languages https://github.com/psankar/kuvalai
Learned about bootstrap and angularjs. Did some non-trivial websites using these technologies
Learned a bit about the architecture of various distributed filesystems
As an extra unplanned activity in the dayjob, did a big buildcleanup by migrating to CMake from a proprietary buildsystem and was able to shed about 30k lines of buildfiles. Became well-versed in CMake, RPM generation as a side effect. Tried to play with ninja, tup too and learnt about their architecture
Read a few nice papers such as The Ubiquitous B-Tree by Douglas Comer, Vnodes: An Architecture for Multiple File System Types in Sun UNIX, Build System Rules and Algorithms, Analysis of six distributed filesystems etc.
Learnt a bit about linear time sorting by reading papers on sorting. Could not get to implementation sadly

Things that I wanted to do but could not do

Take some coursera courses on Security, Compilers, Hardware software interface and Algorithms
Learn Rust, Haskell and Dart programming languages
Research more about Logstructured merge tree, implementing it and write a blogpost (or get more understanding) on how the changing hardware (SSDs etc.) affect our datastructures, performance etc. from a storage perspective
Get a technical paper published :(
Spend more time in the openSUSE community and mailing lists

Introducing Find Many Strings v2 - A chrome extension

2013-09-25T16:36:00.000+05:30

Some of you might remember my chrome extension to search and highlight multiple strings simultaneously. I made an update to it recently and gave the ability to input multiple strings in one click. Here is an introductory video of the extension in action. Recommended to be watched in full screen.

The video, along with the full subtitle support (so as to help a11y), was made by Tharkuri. A big thanks for this strenuous job. If you are looking for someone to do [online] marketing / professional writing work in India, I highly recommend Tharkuri.

You can get the extension from the chrome store and the sources from the github repository. Please report any bugs / features to the github page and any feedback in the extension page or by mail.

Introducing simplefs - A ridiculously simple filesystem

2013-08-16T11:35:00.000+05:30

In the last few days I have been trying to implement a filesystem from scratch. It has reached a level of maturity where I can release it to the public without feeling too embarrassed about the quality of the work. Here you go, simplefs. It is meant to be used as a tool for teaching filesystem basics and not for production use.

I recently released the version 1.0 of the filesystem with support for:

Creation of files and nested directories
Enumerating files in a directory
Reading of files
Writing of files

In the next release, I plan to implement support for extents, and in the release after that, support for journaling. Your comments, feedback etc. are welcome.

I wish I had implemented a filesystem from scratch a long time ago, maybe during my college days. This activity gives you a good test bed for evaluating almost all aspects of your computer science knowledge, like Operating systems, Data structures, Algorithms, Locking semantics (granularity, ordering etc.), Cache coherency, Programming expertise, Cost hierarchy (read-time across memory, disk etc.) etc. If you are interested in being a programmer for a long time, you should definitely try implementing (or at least designing) a filesystem from scratch. In a world (or maybe just the Indian IT sector) where a programmer's role is getting restricted to being a [javascript] library plumber, such designing + programming tasks will give you the N-Ach satisfaction.

What I need from a WM/Desktop

2012-10-29T13:54:00.001+05:30

Any suggestions are welcome for the WindowManager/Desktop needs that I have. I am open to trying out prototype systems too.

My requirements are:

win + left , win + right keys should align windows in the left and right halves of the screen, max-vertically, respectively
win + top, win + bottom should align windows in the top and bottom halves of the screen, max-horizontally, respectively
win + f or win + enter should fullscreenize the current window
Should support 3x3 workspaces. My main development workspace will be in center. All other things like mail, browser, IRC, IM should be just one hop away in any of the neighboring four workspaces
The currently focused window should be brighter than the rest of the windows in the background
If I click on a window in the background, it should just get focused and the click should not be passed on to the window (say if I click on a link in the browser in a bg window, it should not change the location but just bring the window to fg)
There should be a tile option to arrange all windows in the current workspace, into quadrilaterals of equal width and height
There should be an expose like option where I should be able to see all the open windows (may be triggered on win + f7 key or three-finger-scroll in the touchpad). This should just arrange the currently open windows
There should be no always-present bar in any edge of the screen (unlike gnome3). Intelli-hide panel on the bottom is good (like gnome-do docky) etc.
There should be no animations at all when the windows are resized/tiled/etc.
There should be no animations at all when we switch workspaces.
ALT + TAB should alternate between open windows (not applications) in the current workspace ONLY.
The title bar of the windows should be as thin as possible, such that it will accommodate three buttons for minimize, maximize and close. But they should not be too big like in GNOME 3 wasting a lot of space. Does not matter if it is configurable/themable or not.
Notifications should popup in some fixed location (say top-right corner) and stay until dismissed (I do not use notifications for individual chat message receiving etc. so this is good for me. I need to be notified only for a new chat and not messages in an already open chat window etc.)
Should allow changing of control key position to either alt key (to help when working along with Mac) or capslock key
Should have an option to have one workspace dedicated for an additional monitor and a keyboard shortcut to switch to that workspace
Things like multimedia keys, password store etc. are not exactly a big need. If they exist, it is good but if they don't they are not deal breakers. My main requirements are for the WM aspects and not really these features.

Are there any other WM needs that should be added to the above list to make my programming environment better ?

Are there any recommendations or does anyone have a configured setup (for xmonad, pekwm etc.) that is somewhat closer to the above requirements ?

Please share your comments/helps/config-files. Thanks.

Offended

2012-07-20T16:19:00.000+05:30

As some of you might know, I am a Vegan and I saw the error codes http://en.wikipedia.org/wiki/0xDEADBEEF#Magic_debug_values and was shocked to see code like 0xDEADBEEF

I am offended by this. Cow is a holy god in our Hindu/Jain religions and killing of it is banned in our faith. Such constants make the code more offensive to a nation of a billion people (India). So, I am going to ask everyone to remove such constants.

Also, I noticed that Christopher Blizzard's website is http://www.0xdeadbeef.com/ This is indeed offensive and I am going to ask him to change his website. His blog's feed is in many planets and I don't want to see that word when I hover over a link.

Also, I should ask to get the planet Uranus renamed.

Mentioning beef may sound like a simple thing to you. But such things make Vegans feel insecure, wondering if the whole development activity is for meat-eaters only.

#satire #reductioadabsurdum?

---

Will I write such a code ?

Is it in good taste ?

Probably Not, for the majority of people.

Will I get offended on seeing such a code ?

If I do, I will have to lose my sleep over everything.

Calling "[Microsoft] managed to make the kernel more offensive to half the population" is too much of an over-reaction, to, what is just a lame joke of a programmer trying to be funny. There is no need to drag Microsoft's name here, for the same reason, why we don't accuse The Linux Foundation of propagating male supremacy ideas, when Linus Torvalds says "Do you pine for the days when men were men and wrote their own device drivers?"

Inspiration for the post: http://mjg59.dreamwidth.org/14955.html

I stumbled onto the comments section and found that someone had posted the exact same thing that I had in mind for this blogpost. The world is indeed small. Sadly, I have no idea who made the comment.

Disclaimer: All opinions expressed are purely personal and do not represent my employer.

Smartphones - Acquired Necessity

2011-11-10T16:50:00.001+05:30

Smartphones - An acquired necessity

I have used a Motorola E398 mobile phone for the last 6 years. For the last one month, I tried using a Samsung Galaxy S2 smartphone. I have concluded that smartphones are an acquired necessity and are not needed for most of my workflows. I will not buy a smartphone in the foreseeable future.

Smartphones give a remarkable improvement for one workflow - Email. If your work involves time-sensitive emailing, a smartphone is a crucial tool. However, if it doesn't, then you are better off buying a good desktop/laptop and a normal mobile phone, imho. Here is a shortlog of the things I observed in this one month:

Observations:

Vibration: Smartphones are too thin and don't have enough vibration, if you are used to thick phones (Not a big problem)
Poor signal reception: A biiiig problem. In the quest to slim the phone, the signal reception abilities of the phone are heavily hampered. While we were travelling in a car (from Bangalore to Chennai), a cheap Samsung GURU E1081 consistently beat the Galaxy S2 in terms of signal strength. Most of the time the Galaxy S2 was showing only "Emergency calls only". Maybe a phone from a real phone-maker like Motorola/Nokia would not have this problem. We had to keep the phone upright near the car window too.
Typing: Even with the on-screen-keyboard, swype etc. the phone is totally unsuitable for typing long text. We can do only twitter/facebook updates and not do any serious document editing or long blogpost(s). The requirement for Siri (Voice Control) is just a natural demand. I wonder how the folks at Google missed this big requirement. They should have introduced this with a big bang and not play catch-up to Siri. Probably they missed it as they were busy tomato-saucing Google+ to all their applications ;-)
Screen Estate: The screen size is totally unsuitable for reading books, blogs. It is okay for occasional blog reading, but it is straining. There are people who read blogs using their phone primarily. But I am spoiled by my employer who gave me an iPad, a Samsung GalaxyTab and a Thinkpad to play with, for some mobile related coding. I did not prefer the smartphone even once when I had my good old Thinkpad. Tablets and laptops provide far better reading experience and are equally handy. The GalaxyTab can act as a phone too for all your needs.
Fragile: Smartphones need extreme care. They are not ideal for rough use, unless you are using a rugged phone like the Motorola Defy+ (which has its own set of problems). A friend once dropped a Google phone from his hands while taking it out of his pant pocket (~3 feet) and the glass shattered into pieces. I have thrown my E398 numerous times (at least a few dozen times from ~5.6 feet) and nothing has happened except an automatic restart.
A Patentable Idea: The unlock pattern (of Android) leaves fingerprints and so your phone is not really secure. If you look at the phone, by keeping the screen flat in front of your nose, you can easily detect the unlock pattern finger prints. This may be fixed soon with the advent of touchscreens that do not leave a finger print. I wonder why can't we just authenticate based on the fingerprint in a smartphone !? That may be cool. I should probably patent it, as it seems in mobile phones any stupid thing can be patented. Think: SYSTEM and METHOD for RECORDING and AUTHENTICATION of users to mobile phones via fingerprints, by letting them swipe on either the surface/camera/etc.
Battery Life: Even after switching off the wireless, due to the rich display, the battery life of all the smartphones is very short. My phone battery did not last more than 3 days. With wireless on and just the GMail app running, the battery lasted for just about 1 day. Some of the non-smart phones these days have close to 2 weeks of battery life. If battery life is your criterion or you travel a lot, you must order an extra battery if you are buying a smartphone.
Muscle Memory: After lying down in bed to sleep, I have many times taken my Motorola phone, unlocked it, launched the alarm application, set the alarm for a specific time, exited the application and locked the phone again. I do all this while keeping my eyes closed. The normal phones with a keypad are easily operable with one hand. I can take a call with one hand, while the other hand is balancing my body in a bus moving through the high-traffic streets of India. Contrary to what you hear, smartphones require both hands and do not lend themselves easily to muscle memory. For a basic operation like calling a recently called number, you will take more time on a smartphone than on a phone with a dedicated CALL button.
COST: The single biggest reason why I won't buy a smartphone is cost. Smartphones from any decent hardware maker are very costly. Personally, with my Indian mentality that takes pride in being cheap, I find it stupid to spend 30,000 INR (600 USD) on a phone which will be valued at 1,000 INR (20 USD) after maybe 3 years. This inference is based on the Motorola RAZR handset pricing in India. Compare this to a new Samsung GURU phone with color display, USB charging etc. that costs a mere 1,000 INR (20 USD) and has a battery life of about 1 week with normal usage. However, there is a big demand for cheap smartphones in markets like India. If and when Nokia releases their cheap smartphones in India, they are sure to repeat their success story in India, just like they did with their Torchlight series phones.

Samsung-specific-observations:

The default alarm application does not have snooze option. There is no excuse for this.
Samsung Kies - Ahem :/
Also, it is not available on Linux. Good news is that with the recent versions, the software update can be done within the phone itself without Kies.
The Indic support patch is not upstreamed yet. But kudos to Samsung, as they are the only Android handset maker that supports Indic fonts natively as of today, afaik. A few of my Tamil friends bought Samsung phones just for this reason.
OLED - Amazingly rich screen, especially while displaying black color.
The front camera in S2 is just a joke. Totally useless for my needs.

Android:

"Android" brand has a better image than "Linux" in the consumer market. It is not without reasons. There are a lot of positive things about Android. I have not mentioned any of them because you can find them easily.
However, to be honest, the usability of Android phones is *not* jaw-droppingly-awesome, imho. They are just as normal (good ?) as say Meego UI or GNOME 3. But consumers love them. If enough money is spent on marketing, pigs really can fly. I wish some of the earlier projects like openmoko/maemo/meego had rich companies that were as committed as Google is to Android.
I hope ChromeOS opens a door for Linux on Desktops, just as how Android made Linux the most dominant operating system on mobiles. That may help PC OEM vendors to think a little instead of their current act of blindly worshipping Microsoft.
The biggest positive impact of Android imho is: Android made companies which usually don't bother about Linux users (like Evernote) to write applications for Linux.

Conclusion:

I will happily use my Motorola E398 as long as it lasts and will then buy a normal non-smart mobile phone when it can no longer run. Even though I may take up a job with a mobile phone company, I don't think I will buy a smartphone for my needs.