Pinu planeet

December 06, 2016

TransferWise Tech Blog: Product Engineering Principles @ TransferWise

At TransferWise, we have seen phenomenal growth in the last few years - growth in users, transaction volumes, and team. Our engineering team has grown from 50 to 120 in the last 18 months. With this kind of growth, it’s easy for the company culture to evolve and degrade very quickly unless we keep restating, and staying mindful of, the key principles and values we operate on.

I am often asked by engineers at conferences, potential hires, startup founders and others in the industry what the key principles are that we organize around while building the TransferWise product. I thought I'd pen down some thoughts on this.

Before we hit the main principles we follow, here’s a quick primer on how our teams are organized. As of today, TransferWise has about 25 teams, each of which is autonomous and independent and focuses on customer-centric KPIs that eventually drive our growth. A mouthful, but let’s break this down.

Teams are autonomous: Teams own their decisions. They decide how to evolve the product by talking to customers and by looking at data. The teams seek input and are challenged by others in the company on their decisions but eventually they are the final decision makers.

Teams are independent: Teams own their destiny. We try to keep cross team dependencies to a minimum. While there are some cases where a team may need help from another team to get something done, we try and avoid this as much as possible.

Teams focus on customer-centric KPIs: Teams solve customer problems. They are organized around a specific customer (team organized around a specific region, say the US) or a specific problem all customers face (making payments instant across the world). Given teams are autonomous and independent, they can pick what they want to work on but everything they work on has to drive a customer metric. Our product moves money around the world. Our customers tell us constantly that they care about their money moving super fast, conveniently, for cheap. So everything a team does looks to optimize these factors. The team should be able to stand up in front of the entire company and explain how their work impacts the customer and which metric it moves.

Now that we’ve got the team setup out of the way, let’s talk about how Product Engineering works at TransferWise. Here are the key principles we follow.

Hire smart people and empower them

Our product is a function of the people who build it. That means how we hire and who we hire has a massive impact on what ends up becoming our live product. For product engineering, our hiring process includes the following steps:

  • Take home test
  • Technical interview with 2 engineers
  • Optional follow-up technical interview
  • Product interview with a product manager and an engineer
  • Final interview with our VP of engineering or co-founder

While this interview loop may seem familiar, most candidates comment that the product interview is a unique experience. In the product interview, we focus on your ability to put yourself in the shoes of a customer: understanding which customer problem you are solving, how you would build something to validate an idea, and how you would then iterate on it to deliver a stellar customer experience. Read Mihkel’s post on demystifying product interviews for more details on what we cover and why.

Once hired, product engineers are empowered to move mountains. Engineers choose which problem to solve, why, what the customer impact will be and the prioritization of their tasks. Of course, this should be in line with team goals and not solely based on individual goals.

Weak code ownership

As mentioned above, we believe in teams being independent. A big part of this is that teams don’t have dependencies on other teams. But how does this work in a large organization with an ever evolving and growing product?

Let’s take an example. As our product expands across the world, every country has different rules on what data we are required to verify about our customers. Let’s say that as we launch in Australia, there is a new regulatory requirement to check some additional data on Australian customers. This requires Team Australia - an autonomous and independent team focused on Australian customers - to make a change to our verification engine. But the verification engine is owned by the Verification team. In a lot of organizations, Team Australia would ask the Verification team to pick up this change on their roadmap. But the Verification team gets a lot of such requests from other teams. They also have their own projects to improve the core verification engine to support all our different regions. So what usually ends up happening in other organizations is that Team Australia can’t move as fast as it desires because it is dependent on the Verification team and their priorities.

This is why we follow the weak code ownership principle. In this setup, every part of the code is owned by a team, but a team is allowed to change any part of another team's code. It sounds like chaos, but there are some basic rules that keep it in check: the owning team sets the rules that other teams have to follow to play in their codebase.

In the above example, instead of the Verification team making the requested change, Team Australia is empowered to make the change in the verification codebase. But they have to follow the rules set by the Verification team to commit to its codebase. These rules are up to the owning team to decide on; they could look something like this:

  • Before taking on any major change, the team making the change must go through a design discussion on said changes with the owning team.
  • The team making the change has to follow certain design patterns
  • No code will be accepted without adequate test coverage
  • All changes have to go through peer-reviewed pull requests

This setup keeps product engineering teams independent, removes dependencies between teams, and lets teams iterate at the pace they desire.

We compare this setup to an internal open source project. Owners define the rules of engagement and own the codebase, and others can commit and make changes as long as they follow those rules. As an additional benefit, owning teams are incentivized to make their code more modular and add relevant tests so that another team cannot easily break things. This leads to more readable, higher-quality code.

Product engineers focus on customer impact

In a lot of companies engineers never talk to real customers. Some engineers we talk to during our interview process don’t really know who they are building the product for.

In a lot of companies, information flows from the customer through several layers of people before it reaches the engineer.

Information is lost with every person introduced along the way.

At TransferWise, being a product engineer means you get close to customers. You talk directly to customers, look at data to make decisions and understand how the features you build are evolving and being used by different customers in production. We use Looker and Mixpanel for analytics, and both are available to everyone in the company. Anyone can run queries and slice and dice the data the way they desire.

Product engineers also take customer calls, chats and respond directly to customer emails. Here’s an example of a challenge our co-founder Kristo set to inspire engineers to take more calls and get closer to our customers.

[image: Kristo's challenge]

The resulting picture speaks for itself. :-)

[image: the results]

No one else can build your product but you

Given how involved engineers are in analyzing the data, talking to customers, understanding the reason to make a change, and how fast our iteration cycles are, we believe that we cannot just write down our specifications and have someone outside our company build the product. We don’t do offshoring, outsourcing or use consultants to build our product. This doesn’t mean we don’t have independent workers (i.e. non-salaried employees who work in TransferWise engineering). We do. Some of them have been with us for a long time and are critical contributors. But they are embedded within our teams and operate the same way any other employee does. They get close to our customers and take part in all decisions.

Some rules, more common-sense

We have a few rules that are standard across the entire product engineering organization. We believe teams should be able to pick the tools to get the job done, within certain limits (more on limits below). All our teams run sprints but it’s up to them to define their sprint cadence. It has just happened that most teams run a one-week sprint, but now we are seeing some teams looking to move to a two-week sprint as their projects get more complex. Similarly, some teams follow scrum by the book, while some do kanban and others run their own variation on scrum.

That said, we have a few common sense rules:

  • Product engineers own their code in production. This means managing your own releases, monitoring your code in production, getting alerts when something goes wrong and being present if your system is having an issue. We believe this incentivizes the right behavior. When you know you will be alerted at 2AM when something goes wrong in production, the quality of the code that gets shipped tends to be better.
  • We have weekly technical exchange sessions called “TeX”. It’s a forum where product engineers share knowledge on various technical topics. These range from design discussions to changes made to a specific part of our system to new technologies we should be investigating.
  • We are a JVM shop. We are open to other languages and have some PHP and Node running around, but our main stack has always been the JVM, with our monolith application written in Groovy on Grails and our microservices written in Java 8 on Spring Boot. We believe language wars are good conversations over beers but try to avoid them at work and get on with building product.
  • If you want to introduce a new language or that shiny new technology to our system, it’s simple! Do a TeX and invite your fellow engineers. Explain to them the specific benefits of introducing this technology. Do an honest pro and con analysis and explain why it’s worth it for the rest of the engineers to go through the learning curve of picking this technology up. This is crucial! As we have weak code ownership, people need to be able to make changes to parts of the system they don’t own. So a newly introduced technology not only impacts the specific team owning the service but also impacts other engineering teams.

Honest blameless postmortems

This one is probably our favorite principle. Everyone makes mistakes and when you move fast, things break. The key is how we recover from these mistakes, what we learn and how we prevent them in the future.

In most companies, an individual isn’t really incentivized to ship fast and learn, because of the fear of breaking things. One is rarely rewarded for taking massive risks to improve something tenfold, as the risk of breaking something is much higher. People tend to get dinged on their annual reviews when they break something that leads to a production outage. So what ends up happening is people triple check their work and put in more and more safeguards for that one possible failure that can happen.

We want to build a culture where we aren’t afraid to fail, but are accountable for our failures and make the cost of a failure smaller and smaller.

One major tool we use to reflect, learn and be accountable for our failures is public honest blameless postmortems.

Vincent wrote a great post on what makes a good postmortem. The intent of the postmortem is to go deep into why something happened, how many customers were impacted, what our learnings were and what measures we put in place to make sure this doesn’t happen again. People challenge postmortems publicly if a postmortem isn’t deep enough or doesn’t have real learnings.
Culturally this is one of the most empowering and powerful tools we have in the company. We started this in product engineering, but it has evolved to the point where most teams across the company do public postmortems.

Challenges

Like any model of organizing, this model has its challenges too. Below are a few challenges we have encountered along the way:

  • Duplication of effort: With autonomous, independent teams, we can have some overlap in the work done by different teams. We try to counter this by having a few people in the organization who spend time across teams and have a view on what different teams are building. This includes engineering leads who spend time with different teams and get an understanding of the successes and challenges each team has. So when a team starts building a new service with similarities to a service another team is already working on, we try to get both teams on the same page and consolidate the work so effort isn’t duplicated.
  • Collective decision making: Sometimes it’s just hard to get the whole team to align on a decision while taking varied opinions into consideration. We counter some of this by running small teams, so there are fewer people who need to get on the same page. Also, when teams get stuck they seek out help from others in the organization who have been in a similar situation before or could help them break a gridlock.
  • Singularity in vision: Given we have devolved decision making to teams, there’s no one person who calls all the shots. We have a company mission but teams can decide their own way to achieve the mission. This can be especially unnerving to some folks given they can't just go over to one person and ask for direction or say "I am doing this as the CEO wants it."
  • Communication: With teams being independent and working on their specific problems, we run the spectrum from teams that over-communicate to make sure others know what they are working on to teams that under-communicate. TransferWise runs primarily on Slack. We have specific channels for sharing things cross-team. We also have horizontal virtual teams called guilds where engineers get together to work on a problem that cuts across the organization. For example, we have a performance guild with representatives from different teams. This is a group of individuals who are interested in building tooling, standards, and practices to help all our teams improve the performance of our systems. They focus on building the required monitoring and alerting for everyone to use. That said, we are still learning how to improve communication across teams as our organization grows.

Why do we operate this way?

As a startup we have a major weapon - speed! When the people closest to the customers are making the decisions, we can run more tests and iterate quicker compared to a setup where teams rely on managers to make decisions. In the latter, managers become bottlenecks slowing down decision making. Additionally, they usually aren’t as close to the day-to-day operations and the customer feedback loop to make an informed decision. In our setup, we believe we get more fail and pass signals faster.

We fundamentally believe companies that iterate faster, fail faster and learn faster will succeed in the long run. That means that to learn faster than others, we need to optimize for speed, with checks in place for building a high-quality experience that our customers love. This is the main reason for our setup.

We realize that running this way has the drawbacks listed above, but we believe we can address them while we optimize for speed.

This is, of course, something that has worked for us so far and we will have more learnings as our product and company evolves. We will share those learnings as we go along. We would love to hear your thoughts on what you optimize for, how you do it, and scenarios where you think the above setup doesn’t work.

Thanks to Harald, Jordan, Martin Magar, Taras, Vincent for their input.

December 05, 2016

Anton Arhipov: Twitterfeed #2

So this is the second issue of my Twitterfeed - news that I noticed on Twitter. It is a bit more sophisticated than the first post, but there is still no structure and no definite periodicity.


Articles:


Java Annotated Monthly - December 2016. Nice collection of articles about Java 9, Java 8, libraries and frameworks, etc. With this, my Twitterfeed is now officially meta! 😃

RebelLabs published a Java Generics cheat sheet. Print it out and put it on the wall in your office!

Server side rendering with Spring and React. Interesting approach to UI rendering with React. Some parts of the UI are rendered at the server side, and some data is then rendered at the client side.

One year as a Developer Advocate. Vlad Mihalcea reflects on his achievements from the first year in the role of a Developer Advocate for Hibernate. Well done!

IDEA 2016.2 Icon Pack. The IDEA 2016.3 update came with new icons and some people don’t really like them. There is now a plugin to replace the new icons with the old ones. Enjoy!

Oh, and talking about IntelliJ IDEA, there is another great blog post related to the 2016.3 release. Alasdair Nottingham writes about Liberty loose applications support in IDEA: Faster application development with IntelliJ IDEA 2016.3

Reactive programming vs Reactive systems. Jonas Boner and Viktor Klang make clear what the difference between the two is. "Messages have a clear (single) destination, while events are facts for others to observe".

Good Programmers Write Bug-Free Code, Don’t They? Yegor Bugayenko has a good point about the relation of good programming to bug-free code.

Cyclops Java by Example: N-Queens. A coding kata for N-Queens problem using "cyclop's for-comprehensions".

Zero downtime deployment with the database. The name says it all.

RxJava 2.0 - an interview with David Karnok about the major release. Here comes support for the Reactive Streams specification!

Reactor by Example. Reactor is very similar to RxJava, but it is also at the core of Spring Framework 5.0’s reactive programming model.

An explanation of the different types of performance testing. I think it is quite important to know the difference.

Videos:


Spec-ulation by Rich Hickey. As usual, must watch!

Microservices evolution: how to break your monolithic database. Microservices are becoming mainstream, it seems, so we need best practices for building microservices-based systems.

November 29, 2016

TransferWise Tech Blog: Why Over-Reusing is Bad

One of the Holy Grails of software development has always been reuse. In the following post I want to focus on the reuse of application logic in its different forms. I will not cover reuse on more abstract levels like ideas or patterns. This will be a two-part series where I explore the topic from both the backend and the frontend development perspective.

I believe in the following idea:

It does not matter how fast you can build the first version of your product. It only matters how fast you can change it later.

On the backend the predominant paradigm right now is microservices. I believe that one of the reasons why this approach is so successful is that it gets reuse quite right. Microservices are really good blocks for reuse as they align very well with the idea of splitting a big system into multiple smaller bounded contexts. As per the concept of a bounded context, reusing any part of the internal logic between different microservices is considered an anti-pattern.

Changing Shared Code is Inherently Hard

But what is so bad about sharing code? Isn't it something good that we should always strive for? Before answering, let's take a step back. Why do we want to reuse stuff? Because we want to be able to change fast. Now here comes the paradox: by reusing code that has already been written we save some coding time, but everything that is shared inherently becomes harder to change itself. This is so because once our reusable thingy is out there, we need to keep all the consumers in mind when changing it. As the number of consumers grows, it becomes harder and harder to juggle the different requirements and the nuances of each context. Essentially, the risk of reusing the wrong abstraction grows over time. It is just so easy to introduce these tiny additional parameters that enable reusing maybe not all but perhaps something like 80% of the original logic, or 60%, or 40%.
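To make the "tiny additional parameters" trap concrete, here is a deliberately contrived Java sketch (entirely made up, not taken from any real codebase) of a shared helper that has grown one flag per consumer:

    import java.math.BigDecimal;
    import java.math.RoundingMode;

    public final class FeeCalculator {

        // Originally this was just calculate(amount); every new consumer
        // added another knob instead of writing its own calculation.
        public static BigDecimal calculate(BigDecimal amount,
                                           boolean isBusinessCustomer,  // added for consumer A
                                           boolean applyPromoDiscount,  // added for consumer B
                                           boolean legacyRounding) {    // added for consumer C
            BigDecimal fee = amount.multiply(new BigDecimal("0.005"));
            if (isBusinessCustomer) {
                fee = fee.multiply(new BigDecimal("0.8"));
            }
            if (applyPromoDiscount) {
                fee = fee.subtract(new BigDecimal("1.00")).max(BigDecimal.ZERO);
            }
            // Changing any branch now requires thinking about every caller.
            return legacyRounding
                    ? fee.setScale(0, RoundingMode.HALF_UP)
                    : fee.setScale(2, RoundingMode.HALF_UP);
        }
    }

Each flag saved one consumer a bit of coding time, but the function as a whole is now owned by everyone and safely changeable by no one - which is exactly the paradox described above.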

The Knowledge Cap

The knowledge cap is another thing to keep in mind. As software development is about building knowledge, any piece that is built by someone else means we will have a potential cap in our team's knowledge. This happens even when this someone else is another team in the same organisation. Often this loss of knowledge is quite OK and totally fine - we don't want every team to re-implement their own versions of AND/OR gates. However, ideally all the assets that are at the core of what the team is doing should be developed by the team itself.

Frequency of Change

In general we can say that reuse makes more sense for peripheral/infrastructure things like accessing a database or doing HTTP calls. However, if some part of our infrastructure code needs to be changed very frequently, then it might still make sense to roll our own technical solution. Ideally, a high frequency of change means that the piece is somehow tied to the unique value proposition of our product, so the extra implementation effort makes sense anyway. So frequency of change should be at least as important a factor (if not more so) in deciding whether to reuse or build ourselves.

Clear Boundaries

In case we need to reuse something, the best thing we can do is to make the boundaries between our code and the reused code as clear as possible. This is the reason why microservices offer a superior form of reuse compared to components running in the same process. It requires much more discipline to keep the connection points few and explicit when something resides in the same process as opposed to something that lives on the other side of the network.

So reuse by itself is not bad. However, reuse on the wrong level of granularity or forcefully trying to reuse everything that looks similar at first sight can be much more harmful than duplication. Reuse has a cost as well.

November 28, 2016

TransferWise Tech Blog: The TransferWise Stack - heartbeat of our little revolution

Like any tech startup that's passed its five-year mark, TransferWise has come quite a way from the first lines of code that powered it. Our product is a living organism, with changing needs. What was right yesterday isn't necessarily so today. Nor might our current technology choices withstand the test of time. We hope they do - but maybe they don't. And that's okay.

At conferences and meetups people often walk up to our engineers to ask what languages, tools and technologies we're using. Fairly so, as we haven't done a stellar job of telling our tech-savvier customers and fellow developers much about that. Hopefully we can rectify that a bit by taking time now to reflect in writing.

We'd love to hear back from you about the decisions you would have made differently.

Brief history

Once upon a time, in 2010, there was a founder who wanted to solve a problem. He knew a bit of Java and wanted to get moving quickly. At that time, Groovy and Grails seemed to have brought some of the Ruby on Rails flair to the JVM world. Boom, here's a quick way to bootstrap! By the end of 2013, about a dozen engineers were working on the codebase.

In early 2014, the existing Grails-powered system wasn't cutting it anymore for some workloads. It had been quick and easy to deliver new features, but the team had taken some system design shortcuts along the way. The time had come to extract some critical batch processing into a separate component. We've been following the path of moving code out ever since.

By late 2016, TransferWise has about 120 engineers working in two dozen teams. Measured by lines of code, more than half our business logic lives in separate services. We're looking at ways to scale the system across data centers and to cope with the next 80 engineers joining during 2017. Let's get into the details of how we intend to enable these moves.

Microservices - Spring Boot and Spring Cloud

A few contenders made it to the starting line when we picked the groundwork for our service stack. We were sure we would stay on the JVM. We wanted something that would promise good support for future "cloud" concerns, including service discovery, centralized config and transaction tracing.

Many people in the team trust the quality of thinking going into the Spring ecosystem, so Spring Boot quickly gained popularity. It provides a good, extensible platform for building and integrating services. We like its annotation-based autowiring of dependencies, YAML configuration files and reasonable conventions. We use a custom Spring Initializr to bootstrap new services. That helps us to make sure all the needed bootstrap config is in place and nobody gets a headache trying to manually bring in the right dependencies. It's all running on Java 8.
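For readers less familiar with the stack, here is a minimal sketch of what such a Spring Boot service entry point looks like; the class and endpoint names are invented for illustration and are not TransferWise code:

    import org.springframework.boot.SpringApplication;
    import org.springframework.boot.autoconfigure.SpringBootApplication;
    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.RestController;

    // @SpringBootApplication turns on auto-configuration, component scanning
    // and an embedded servlet container; settings live in application.yml.
    @SpringBootApplication
    public class QuoteServiceApplication {

        public static void main(String[] args) {
            SpringApplication.run(QuoteServiceApplication.class, args);
        }
    }

    // Dependencies are autowired into components like this controller.
    @RestController
    class PingController {

        @GetMapping("/ping")
        String ping() {
            return "pong";
        }
    }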

Spring Cloud adds integration for many useful tools in a service-oriented environment. Some are Spring projects, like the Config Server. Some leverage battle tested open source components like Netflix Eureka, Zuul and Hystrix.

We are actively adopting Config Server and Eureka, and use Hystrix and Zuul in appropriate places.
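As an illustration of where Hystrix fits in, a call to a downstream service can be wrapped in a command with a fallback. The sketch below shows generic Hystrix usage against an imaginary quote service; it is not TransferWise's actual code:

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;

    // Isolates the remote call so that failures and timeouts trip a circuit
    // breaker instead of cascading through the calling service.
    public class FetchQuoteCommand extends HystrixCommand<String> {

        private final String currencyPair;

        public FetchQuoteCommand(String currencyPair) {
            super(HystrixCommandGroupKey.Factory.asKey("QuoteService"));
            this.currencyPair = currencyPair;
        }

        @Override
        protected String run() {
            // The real HTTP call to the downstream service would go here.
            return callQuoteService(currencyPair);
        }

        @Override
        protected String getFallback() {
            // Returned when run() fails, times out, or the circuit is open.
            return "quote-unavailable";
        }

        private String callQuoteService(String pair) {
            return "1.0843"; // placeholder for the actual remote call
        }
    }

Calling new FetchQuoteCommand("EUR/USD").execute() then runs the request through the circuit breaker.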

Grails and Groovy

Grails and Groovy currently power all our business logic that's not extracted into microservices. That makes up a bit under half of our lines of code. We've taken a clear direction to deprecate this codebase by the end of next year.

When Groovy came along, it brought with it a chance to write nice, succinct code and leverage the whole Java ecosystem. It continues to be a good language for DSLs, for instance. Groovy used to have more benefits over Java, like functional constructs, easy collection processing and other boilerplate-reducing tidbits. Java gives us better compile-time checking and less runtime overhead. Hence we've gone with Java in our microservices.

Nor is Grails to blame for anything. It allowed the early team to quickly iterate on a product in its infancy. The convenience features of Grails worked against it over the years. Grails hides complexity away from the developer, which makes it easier to shoot oneself in the foot when trying to deliver value. Had we decided to focus on the scalability of the codebase sooner, we could have postponed the migration by another year or so. Yet, in our opinion, Grails makes sense as a platform for a single moderately-sized web app rather than for dozens of microservices. This made the transition inevitable in any case.

It's worth noting that the latest Grails version, 3.x, is also built on top of Spring Boot. As we're quite happy with plain Spring Boot, we are not currently considering it.

Persistence layer - MySQL, PostgreSQL

We're a financial services company. That instructs us to always prioritise consistency over availability. BASE is fun and eventual consistency sounds like a cool topic to wrap one's head around, but we want our persistence to meet the ACID criteria - Atomic, Consistent, Isolated and Durable.

TransferWise started with MySQL. The widespread adoption, ease of setup and loads of people with some basic experience made it a good choice. With growth, more questions have come up about our DB engine, like:

  • does it support our analytical workloads?
  • is it easy to build High Availability?

MySQL still holds most of our data in production. Yet, migrating our analytical datastore to PostgreSQL is already underway, as our tests show it to be a better fit for our models and tooling. Our financial crime team relies on many nifty PostgreSQL features. We foresee Postgres also being the platform of choice for our future service datastores, driven by the mature availability and resilience features it offers out of the box. We like Postgres.

It's likely that we'll be adopting NoSQL for some use cases down the road, where referential integrity doesn't add value.

Messaging & Kafka

A big part of our business logic swirls around the state changes of a transfer. There are many different things that need to happen around those changes - like fraud checks and analytics updates. Earlier, these were all done synchronously together with the state change.

As the business has grown in complexity, that obvious and simple approach doesn't scale so well. A good rule of thumb in designing data flows is that we can make every signal that doesn't need an immediate response asynchronous. If you can take a document to another department and not wait around for them to process it, you can process it asynchronously in the code as well. We want to unbundle the reactions to an event from the transactional changes of an event. That makes it easier to scale the different parts and isolate potential problems.

In the first iteration of our payout handling, we experimented with Artemis as a messaging platform. We didn't become confident about running Artemis in an HA setup in production, and now most of our messaging has moved to Apache Kafka.
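A minimal sketch of what publishing such a state-change event to Kafka can look like with the plain kafka-clients producer API; the topic name and payload here are illustrative assumptions, not our actual schema:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    import java.util.Properties;

    public class TransferEventPublisher {

        private final Producer<String, String> producer;

        public TransferEventPublisher(String bootstrapServers) {
            Properties props = new Properties();
            props.put("bootstrap.servers", bootstrapServers);
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("acks", "all"); // wait for in-sync replicas: consistency over raw speed
            this.producer = new KafkaProducer<>(props);
        }

        // Consumers such as fraud checks and analytics react to this asynchronously.
        public void publishStateChange(String transferId, String newState) {
            // Keying by transfer id keeps events for one transfer ordered within a partition.
            producer.send(new ProducerRecord<>("transfer-state-changes", transferId, newState));
        }

        public void close() {
            producer.close();
        }
    }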

The front end

That's all fine, but Kafka doesn't let you create smooth user experiences!

In 2014 TransferWise web UIs still used good old jQuery. We were not happy with the testability of that codebase. We also knew we'd need to modularize in a while due to team growth. In September that year, we launched our revamped transfer setup page built using AngularJS. We've now adopted Angular for almost all parts of our website.

We use Bower and NPM for dependency management. Grunt automates the builds. Jasmine and Protractor power the tests, which Karma + PhantomJS run. There's also webpack and Babel doing their part. Some parts of the code already use TypeScript. Whew.

Front-end, of course, isn't only code. Practices matter more than using the latest-and-greatest tools.

The teams in TransferWise are quite independent in how they evolve the product in their areas of ownership. This has, at times, meant trouble governing the visual identity and making sure things look neat, clean and consistent across the site.

To help with that we've taken a battle tested approach of implementing our style guide on top of Bootstrap. And, for the record, we think that LESS is more.

We believe in TDD, which is particularly important in places with less compile-time safety - like Javascript. Ürgo has blogged before about some of the thinking we apply. There's more to share. Stay tuned for future posts covering different aspects of our experience in detail.

Running in production

So, how do we keep it together? How do we make sure our systems are serving our customers well?

We want to keep our feedback loops short and tight, to keep signal-to-noise ratio high. It means that people building the systems must also operate them. If you own a part of the codebase, you also own the service it provides. This means being responsible for functional and non-functional characteristics of a feature. Ownership applies from feature creation to deprecation. To enable that, we need a shared toolkit for troubleshooting, monitoring and alerting.

Infrastructure monitoring and some operational metrics run on Zabbix. We are adopting Graphite/Grafana to keep an eye on business operations aspects of the system.

We use New Relic to check the HTTP endpoints. It's not cheap but works well, thanks to its instrumentation capabilities. We've defined performance and error alerts for our higher volume endpoints. Respective teams get alerted via VictorOps rotations if performance degrades. New Relic also provides some tracing features to see the time spent on different steps of request processing.

When an exception happens in production, it gets sent to Rollbar. Rollbar groups and counts the occurrences and sends alerts for spikes or new items. In general it allows us to spot glitches in the system and estimate how many customers they affect.

To analyze a specific problem or identify patterns in logs, we use the Elastic stack. The stack consists of LogStash, ElasticSearch and Kibana. They are, respectively, used to parse, index and search/visualize logs.

Slack helps us to channel alerts and comms into one place.

There's a video and slides covering some of this from Baltic DevOps 2016 where we were honored to speak.

Vincent has written a great post on our way of doing post-mortems. We use them to learn from customer-affecting production issues.

Bits & pieces

Vahur wrapped up some of the other tooling that we use.

PHP - There are a few tools in the company built in PHP, some legacy, some purposefully kept in PHP.

Spark - While being a pretty buzzword-averse bunch of engineers, we do like to adopt the right tech for the right task. In our case the prime problem domain to benefit from machine learning has been our financial crime prevention subsystem. We're building up machine learning models on Apache Spark.

Ansible & Terraform - Our infrastructure is growing and humans make mistakes. That makes infrastructure a prime automation target. We're adopting Terraform for declarative infrastructure management. We use Ansible to configure the instances.

Context

We build TransferWise for our current and future customers. They don't care much about which libraries, frameworks and components we bring in to move their money. It's our job as engineers to pick the tools and practices that allow us to deliver value quickly and sustainably.

Often we've chosen the simplest tool to get the job done, instead of the most powerful. At other times we've gone with a framework that's not the newest kid on the block but has wide production adoption.

We're all proper geeks. We love new tech. We love to play and learn, pilot and break stuff. We do so in our hobby projects and geek-out sessions over Coke and beer. When building for our customers, we optimize for their happiness over the chance of adopting the next left-pad 0.0.3 ;) And you're right - it's as boring as building the HyperLoop of financial networks for millions of people could ever be.

TransferWise stack on StackShare

Thanks to Peep, Vincent and Ürgo for their help.

November 23, 2016

Four Years Remaining: What is the Covariance Matrix?

Basic linear algebra, introductory statistics and some familiarity with core machine learning concepts (such as PCA and linear models) are the prerequisites of this post. Otherwise it will probably make no sense. An abridged version of this text is also posted on Quora.

Most textbooks on statistics cover covariance right in their first chapters. It is defined as a useful "measure of dependency" between two random variables:

    \[\mathrm{cov}(X,Y) = E[(X - E[X])(Y - E[Y])].\]

The textbook would usually provide some intuition on why it is defined as it is, prove a couple of properties, such as bilinearity, define the covariance matrix for multiple variables as {\bf\Sigma}_{i,j} = \mathrm{cov}(X_i, X_j), and stop there. Later on the covariance matrix would pop up here and there in seemingly random ways. In one place you would have to take its inverse, in another - compute the eigenvectors, or multiply a vector by it, or do something else for no apparent reason apart from "that's the solution we came up with by solving an optimization task".

In reality, though, there are some very good and quite intuitive reasons for why the covariance matrix appears in various techniques in one or another way. This post aims to show that, illustrating some curious corners of linear algebra in the process.

Meet the normal distribution

The best way to truly understand the covariance matrix is to forget the textbook definitions completely and depart from a different point instead. Namely, from the definition of the multivariate Gaussian distribution:

We say that the vector \bf x has a normal (or Gaussian) distribution with mean \bf \mu and covariance \bf \Sigma if:

    \[\Pr({\bf x}) =|2\pi{\bf\Sigma}|^{-1/2} \exp\left(-\frac{1}{2}({\bf x} - {\bf\mu})^T{\bf\Sigma}^{-1}({\bf x} - {\bf \mu})\right).\]

To simplify the math a bit, we will limit ourselves to the centered distribution (i.e. {\bf\mu} = {\bf 0}) and refrain from writing out the normalizing constant |2\pi{\bf\Sigma}|^{-1/2}. Now, the definition of the (centered) multivariate Gaussian looks as follows:

    \[\Pr({\bf x}) \propto \exp\left(-0.5{\bf x}^T{\bf\Sigma}^{-1}{\bf x}\right).\]

Much simpler, isn't it? Finally, let us define the covariance matrix as nothing else but the parameter of the Gaussian distribution. That's it. You will see where it will lead us in a moment.

Transforming the symmetric Gaussian

Consider a symmetric Gaussian distribution, i.e. the one with {\bf \Sigma = \bf I} (the identity matrix). Let us take a sample from it, which will of course be a symmetric, round cloud of points:

We know from above that the likelihood of each point in this sample is

(1)   \[P({\bf x}) \propto \exp(-0.5 {\bf x}^T {\bf x}).\]

Now let us apply a linear transformation {\bf A} to the points, i.e. let {\bf y} ={\bf Ax}. Suppose that, for the sake of this example, {\bf A} scales the vertical axis by 0.5 and then rotates everything by 30 degrees. We will get the following new cloud of points {\bf y}:

What is the distribution of {\bf y}? Just substitute {\bf x}={\bf A}^{-1}{\bf y} into (1), to get:

(2)   \begin{align*} P({\bf y}) &\propto \exp(-0.5 ({\bf A}^{-1}{\bf y})^T({\bf A}^{-1}{\bf y}))\\ &=\exp(-0.5{\bf y}^T({\bf AA}^T)^{-1}{\bf y}) \end{align*}

This is exactly the Gaussian distribution with covariance {\bf \Sigma} = {\bf AA}^T. The logic works both ways: if we have a Gaussian distribution with covariance \bf \Sigma, we can regard it as a distribution which was obtained by transforming the symmetric Gaussian by some {\bf A}, and we are given {\bf AA}^T.

More generally, if we have any data, then, when we compute its covariance to be \bf\Sigma, we can say that if our data were Gaussian, then it could have been obtained from a symmetric cloud using some transformation \bf A, and we just estimated the matrix {\bf AA}^T, corresponding to this transformation.

Note that we do not know the actual \bf A, and it is mathematically totally fair. There can be many different transformations of the symmetric Gaussian which result in the same distribution shape. For example, if \bf A is just a rotation by some angle, the transformation does not affect the shape of the distribution at all. Correspondingly, {\bf AA}^T = {\bf I} for all rotation matrices. When we see a unit covariance matrix we really do not know whether it is the “originally symmetric” distribution or a “rotated symmetric” distribution. And we should not really care - those two are identical.

There is a theorem in linear algebra, which says that any symmetric matrix \bf \Sigma can be represented as:

(3)   \[{\bf \Sigma} = {\bf VDV}^T,\]

where {\bf V} is orthogonal (i.e. a rotation) and {\bf D} is diagonal (i.e. a coordinate-wise scaling). If we rewrite it slightly, we will get:

(4)   \[{\bf \Sigma} = ({\bf VD}^{1/2})({\bf VD}^{1/2})^T = {\bf AA}^T,\]

where {\bf A} = {\bf VD}^{1/2}. This, in simple words, means that any covariance matrix \bf \Sigma could have been the result of transforming the data using a coordinate-wise scaling {\bf D}^{1/2} followed by a rotation \bf V. Just like in our example with \bf x and \bf y above.
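As a small worked example (my own numbers, not from the original text), take

    \[{\bf \Sigma} = \begin{pmatrix} 2 & 1\\ 1 & 2\end{pmatrix} = {\bf VDV}^T, \qquad {\bf V} = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & -1\\ 1 & 1\end{pmatrix}, \qquad {\bf D} = \begin{pmatrix} 3 & 0\\ 0 & 1\end{pmatrix},\]

so that

    \[{\bf A} = {\bf VD}^{1/2} = \frac{1}{\sqrt{2}}\begin{pmatrix} \sqrt{3} & -1\\ \sqrt{3} & 1\end{pmatrix}, \qquad {\bf AA}^T = \begin{pmatrix} 2 & 1\\ 1 & 2\end{pmatrix} = {\bf \Sigma}.\]

In words: stretch the symmetric cloud by \sqrt{3} along the first axis, leave the second axis as is, then rotate everything by 45 degrees (which is what \bf V does), and you get data with exactly this covariance.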

Principal component analysis

Given the above intuition, PCA already becomes a very obvious technique. Suppose we are given some data. Let us assume (or “pretend”) it came from a normal distribution, and let us ask the following questions:

  1. What could have been the rotation \bf V and scaling {\bf D}^{1/2}, which produced our data from a symmetric cloud?
  2. What were the original, “symmetric-cloud” coordinates \bf x before this transformation was applied?
  3. Which original coordinates were scaled the most by \bf D and thus contribute most to the spread of the data now? Can we keep only those and throw the rest out?

All of those questions can be answered in a straightforward manner if we just decompose \bf \Sigma into \bf V and \bf D according to (3). But (3) is exactly the eigenvalue decomposition of \bf\Sigma. I’ll leave you to think for just a bit and you’ll see how this observation lets you derive everything there is about PCA and more.

The metric tensor

Bear with me for just a bit more. One way to summarize the observations above is to say that we can (and should) regard {\bf\Sigma}^{-1} as a metric tensor. A metric tensor is just a fancy formal name for a matrix, which summarizes the deformation of space. However, rather than claiming that it in some sense determines a particular transformation \bf A (which it does not, as we saw), we shall say that it affects the way we compute angles and distances in our transformed space.

Namely, let us redefine, for any two vectors \bf v and \bf w, their inner product as:

(5)   \[\langle {\bf v}, {\bf w}\rangle_{\Sigma^{-1}} = {\bf v}^T{\bf \Sigma}^{-1}{\bf w}.\]

To stay consistent we will also need to redefine the norm of any vector as

(6)   \[|{\bf v}|_{\Sigma^{-1}} = \sqrt{{\bf v}^T{\bf \Sigma}^{-1}{\bf v}},\]

and the distance between any two vectors as

(7)   \[|{\bf v}-{\bf w}|_{\Sigma^{-1}} = \sqrt{({\bf v}-{\bf w})^T{\bf \Sigma}^{-1}({\bf v}-{\bf w})}.\]

Those definitions now describe a kind of a “skewed world” of points. For example, a unit circle (a set of points with “skewed distance” 1 to the center) in this world might look as follows:

And here is an example of two vectors, which are considered “orthogonal”, a.k.a. “perpendicular” in this strange world:

Although it may look weird at first, note that the new inner product we defined is actually just the dot product of the “untransformed” originals of the vectors:

(8)   \[{\bf v}^T{\bf \Sigma}^{-1}{\bf w} = {\bf v}^T({\bf AA}^T)^{-1}{\bf w}=({\bf A}^{-1}{\bf v})^T({\bf A}^{-1}{\bf w}),\]

The following illustration might shed light on what is actually happening in this \Sigma-“skewed” world. Somehow “deep down inside”, the ellipse thinks of itself as a circle and the two vectors behave as if they were (2,2) and (-2,2).

Getting back to our example with the transformed points, we could now say that the point-cloud \bf y is actually a perfectly round and symmetric cloud “deep down inside”, it just happens to live in a skewed space. The deformation of this space is described by the tensor {\bf\Sigma}^{-1} (which is, as we know, equal to ({\bf AA}^T)^{-1}). The PCA now becomes a method for analyzing the deformation of space, how cool is that.

The dual space

We are not done yet. There’s one interesting property of “skewed” spaces worth knowing about. Namely, the elements of their dual space have a particular form. No worries, I’ll explain in a second.

Let us forget the whole skewed space story for a moment, and get back to the usual inner product {\bf w}^T{\bf v}. Think of this inner product as a function f_{\bf w}({\bf v}), which takes a vector \bf v and maps it to a real number, the dot product of \bf v and \bf w. Regard the \bf w here as the parameter (“weight vector”) of the function. If you have done any machine learning at all, you have certainly come across such linear functionals over and over, sometimes in disguise. Now, the set of all possible linear functionals f_{\bf w} is known as the dual space to your “data space”.

Note that each linear functional is determined uniquely by the parameter vector \bf w, which has the same dimensionality as \bf v, so apparently the dual space is in some sense equivalent to your data space - just the interpretation is different. An element \bf v of your “data space” denotes, well, a data point. An element \bf w of the dual space denotes a function f_{\bf w}, which projects your data points on the direction \bf w (recall that if \bf w is unit-length, {\bf w}^T{\bf v} is exactly the length of the perpendicular projection of \bf v upon the direction \bf w). So, in some sense, if \bf v-s are “vectors”, \bf w-s are “directions, perpendicular to these vectors”. Another way to understand the difference is to note that if, say, the elements of your data points numerically correspond to amounts in kilograms, the elements of \bf w would have to correspond to “units per kilogram”. Still with me?

Let us now get back to the skewed space. If \bf v are elements of a skewed Euclidean space with the metric tensor {\bf\Sigma}^{-1}, is a function f_{\bf w}({\bf v}) = {\bf w}^T{\bf v} an element of a dual space? Yes, it is, because, after all, it is a linear functional. However, the parameterization of this function is inconvenient, because, due to the skewed tensor, we cannot interpret it as projecting vectors upon \bf w nor can we say that \bf w is an “orthogonal direction” (to a separating hyperplane of a classifier, for example). Because, remember, in the skewed space it is not true that orthogonal vectors satisfy {\bf w}^T{\bf v}=0. Instead, they satisfy {\bf w}^T{\bf \Sigma}^{-1}{\bf v} = 0. Things would therefore look much better if we parameterized our dual space differently. Namely, by considering linear functionals of the form f^{\Sigma^{-1}}_{\bf z}({\bf v}) = {\bf z}^T{\bf \Sigma}^{-1}{\bf v}. The new parameters \bf z could now indeed be interpreted as an “orthogonal direction” and things overall would make more sense.

However when we work with actual machine learning models, we still prefer to have our functions in the simple form of a dot product, i.e. f_{\bf w}, without any ugly \bf\Sigma-s inside. What happens if we turn a “skewed space” linear functional from its natural representation into a simple inner product?

(9)   \[f^{\Sigma^{-1}}_{\bf z}({\bf v}) = {\bf z}^T{\bf\Sigma}^{-1}{\bf v} = ({\bf \Sigma}^{-1}{\bf z})^T{\bf v} = f_{\bf w}({\bf v}),\]

where {\bf w} = {\bf \Sigma}^{-1}{\bf z}. (Note that we can lose the transpose because \bf \Sigma is symmetric).

What it means, in simple terms, is that when you fit linear models in a skewed space, your resulting weight vectors will always be of the form {\bf \Sigma}^{-1}{\bf z}. Or, in other words, {\bf\Sigma}^{-1} is a transformation, which maps the dual space representations from the “skewed world” to our “straight world”. Let me show you what this means visually.

Consider again the two “orthogonal” vectors from the skewed world example above:

Let us interpret the blue vector as an element of the dual space. That is, it is the \bf z vector in a linear functional {\bf z}^T{\bf\Sigma}^{-1}{\bf v}. The red vector is an element of the “data space”, which would be mapped to 0 by this functional (because the two vectors are “orthogonal”, remember).

For example, if the blue vector was meant to be a linear classifier, it would have its separating line along the red vector, just like that:

If we now wanted to use this classifier, we could, of course, work in the “skewed space” and use the expression {\bf z}^T{\bf\Sigma}^{-1}{\bf v} to evaluate the functional. However, why don’t we find the actual normal \bf w to that red separating line so that we wouldn’t need to do an extra matrix multiplication every time we use the function?

It is not too hard to see that {\bf w}={\bf\Sigma}^{-1}{\bf z} will give us that normal. Here it is, the black arrow:

Therefore, next time, whenever you see expressions like {\bf w}^T{\bf\Sigma}^{-1}{\bf v} or ({\bf v}-{\bf w})^T{\bf\Sigma}^{-1}({\bf v}-{\bf w}), remember that those are simply inner products and (squared) distances in a skewed space, while {\bf \Sigma}^{-1}{\bf z} is a conversion from a skewed dual space. Also remember that the “skew” was estimated by pretending that the data were normally-distributed.

Once you see it, the role of the covariance matrix in some methods like the Fisher’s discriminant or Canonical correlation analysis might become much more obvious.

The dual space metric tensor

“But wait”, you should say here. “You have been talking about expressions like {\bf w}^T{\bf\Sigma}^{-1}{\bf v} all the time, while things like {\bf w}^T{\bf\Sigma}{\bf v} are also quite common in practice. What about those?”

Hopefully you know enough now to suspect that {\bf w}^T{\bf\Sigma}{\bf v} is again an inner product or a squared norm in some deformed space, just not the “internal data metric space”, that we considered so far. Which space is it? It turns out it is the “internal dual metric space”. That is, whilst the expression {\bf w}^T{\bf\Sigma}^{-1}{\bf v} denoted the “new inner product” between the points, the expression {\bf w}^T{\bf\Sigma}{\bf v} denotes the “new inner product” between the parameter vectors. Let us see why it is so.

Consider an example again. Suppose that our space transformation \bf A scaled all points by 2 along the x axis. The point (1,0) became (2,0), the point (3, 1) became (6, 1), etc. Think of it as changing the units of measurement - before we measured the x axis in kilograms, and now we measure it in pounds. Consequently, the norm of the point (2,0) according to the new metric, |(2,0)|_{\Sigma^{-1}} will be 1, because 2 pounds is still just 1 kilogram “deep down inside”.

What should happen to the parameter ("direction") vectors due to this transformation? Can we say that the parameter vector (1,0) also got scaled to (2,0) and that the norm of the parameter vector (2,0) is now therefore also 1? No! Recall that if our initial data denoted kilograms, our dual vectors must have denoted “units per kilogram”. After the transformation they will be denoting “units per pound”, correspondingly. To stay consistent we must therefore convert the parameter vector (”1 unit per kilogram”, 0) to its equivalent (“0.5 units per pound”,0). Consequently, the norm of the parameter vector (0.5,0) in the new metric will be 1 and, by the same logic, the norm of the dual vector (2,0) in the new metric must be 4. You see, the “importance of a parameter/direction” gets scaled inversely to the “importance of data” along that parameter or direction.

More formally, if {\bf x}'={\bf Ax}, then

(10)   \begin{align*} f_{\bf w}({\bf x}) &= {\bf w}^T{\bf x} = {\bf w}^T{\bf A}^{-1}{\bf x}'\\ & =(({\bf A}^{-1})^T{\bf w})^T{\bf x}'=f_{({\bf A}^{-1})^T{\bf w}}({\bf x}'). \end{align*}

This means, that the transformation \bf A of the data points implies the transformation {\bf B}:=({\bf A}^{-1})^T of the dual vectors. The metric tensor for the dual space must thus be:

(11)   \[({\bf BB}^T)^{-1}=(({\bf A}^{-1})^T{\bf A}^{-1})^{-1}={\bf AA}^T={\bf \Sigma}.\]

Remember the illustration of the “unit circle” in the {\bf \Sigma}^{-1} metric? This is how the unit circle looks in the corresponding \bf\Sigma metric. It is rotated by the same angle, but it is stretched in the direction where it was squished before.

Intuitively, the norm (“importance”) of the dual vectors along the directions in which the data was stretched by \bf A becomes proportionally larger (note that the “unit circle” is, on the contrary, “squished” along those directions).

But the “stretch” of the space deformation in any direction can be measured by the variance of the data. It is therefore not a coincidence that {\bf w}^T{\bf \Sigma}{\bf w} is exactly the variance of the data along the direction \bf w (assuming |{\bf w}|=1).
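A quick check of that last claim (for centered data \bf x with E[{\bf x}{\bf x}^T] = {\bf\Sigma}; not spelled out in the original):

    \[\mathrm{Var}({\bf w}^T{\bf x}) = E[({\bf w}^T{\bf x})^2] = E[{\bf w}^T{\bf x}\,{\bf x}^T{\bf w}] = {\bf w}^T E[{\bf x}{\bf x}^T]{\bf w} = {\bf w}^T{\bf\Sigma}{\bf w}.\]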

The covariance estimate

Once we start viewing the covariance matrix as a transformation-driven metric tensor, many things become clearer, but one thing becomes extremely puzzling: why is the inverse covariance of the data a good estimate for that metric tensor? After all, it is not obvious that {\bf X}^T{\bf X}/n (where \bf X is the data matrix) must be related to the \bf\Sigma in the distribution equation \exp(-0.5{\bf x}^T{\bf\Sigma}^{-1}{\bf x}).

Here is one possible way to see the connection. Firstly, let us take it for granted that if \bf X is sampled from a symmetric Gaussian, then {\bf X}^T{\bf X}/n is, on average, a unit matrix. This has nothing to do with transformations, but just a consequence of pairwise independence of variables in the symmetric Gaussian.
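Spelled out (a small addition, not in the original), the fact we take for granted is just the coordinate-wise statement that for the symmetric Gaussian each coordinate has unit variance and distinct coordinates are uncorrelated:

    \[E[x_i x_j] = \begin{cases} 1, & i = j,\\ 0, & i \neq j,\end{cases} \qquad\text{i.e.}\qquad E[{\bf x}{\bf x}^T] = {\bf I},\]

and the sample average {\bf X}^T{\bf X}/n approaches this expectation as n grows.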

Now, consider the transformed data, {\bf Y}={\bf XA}^T (vectors in the data matrix are row-wise, hence the multiplication on the right with a transpose). What is the covariance estimate of \bf Y?

(12)   \[{\bf Y}^T{\bf Y}/n=({\bf XA}^T)^T{\bf XA}^T/n={\bf A}({\bf X}^T{\bf X}){\bf A}^T/n\approx {\bf AA}^T,\]

the familiar tensor.

This is a place where one could see that a covariance matrix may make sense outside the context of a Gaussian distribution, after all. Indeed, if you assume that your data was generated from any distribution P with uncorrelated variables of unit variance and then transformed using some matrix \bf A, the expression {\bf X}^T{\bf X}/n will still be an estimate of {\bf AA}^T, the metric tensor for the corresponding (dual) space deformation.

However, note that out of all possible initial distributions P, the normal distribution is exactly the one with the maximum entropy, i.e. the “most generic”. Thus, if you base your analysis on the mean and the covariance matrix (which is what you do with PCA, for example), you could just as well assume your data to be normally distributed. In fact, a good rule of thumb is to remember, that whenever you even mention the word "covariance matrix", you are implicitly fitting a Gaussian distribution to your data.

November 22, 2016

Anton Arhipov: Twitterfeed #1


Twitterfeed is a collection of news that I find via Twitter. I have no particular system or method for how I pick the news, nor do I have a predefined period for grouping them. It is neither daily nor weekly nor monthly - it is all just random. Enjoy! :)


Java 10 is now officially a project
IntelliJ IDEA 2016.3, my favourite IDE, was released!!! Yay!
CLion 2016.3, an IDE for C/C++ development, was released
Rider, a new IDE for .NET, is now publicly available via EAP
Akka 2.4.14 was released
Ceylon 1.3.1 was released
Fedora 25 was released

Heinz Kabutz teaches how to implement our own ArrayList in less than 10 minutes
Martin Kleppmann talks about conflict resolution for eventual consistency
Yegor Bugayenko rants about software architects
Roland Kuhn writes about understanding distribution


November 17, 2016

Four Years Remaining: The Secrets of Spring Motion

Mass on a spring

Imagine a weight hanging on a spring. Let us pull the weight a bit and release it into motion. What will its motion look like? If you remember some of your high-school physics, you should probably answer that the resulting motion is a simple harmonic oscillation, best described by a sinewave. Although this is a fair answer, it actually misses an interesting property of real-life springs. A property most people don't think much about, because it goes a bit beyond the high school curriculum. This property is best illustrated by

The Slinky Drop

The "slinky drop" is a fun little experiment which has got its share of internet fame.

[video: The Slinky Drop]

When the top end of a suspended slinky is released, the bottom seems to patiently wait for the top to arrive before starting to fall as well. This looks rather unexpected. After all, we know that things fall down according to a parabola, and we know that springs collapse according to a sinewave; however, neither of the two rules seems to apply here. If you browse around, you will see lots of awesome videos demonstrating or explaining this effect. There are news articles, forum discussions, blog posts and even research papers dedicated to the magical slinky. However, most of them are either too sketchy or too complex, and none seem to mention the important general implications, so let me give a shot at another explanation here.

The Slinky Drop Explained Once More

Let us start with the classical, "high school" model of a spring. The spring has some length L in the relaxed state, and if we stretch it, making it longer by \Delta L, the two ends of the spring exert a contracting force of k\Delta L. Assume we hold the top of the spring at the vertical coordinate y_{\mathrm{top}}=0 and have it balance out. The lower end will then position at the coordinate y_{\mathrm{bot}} = -(L+mg/k), where the gravity force mg is balanced out exactly by the spring force.

How would the two ends of the spring behave if we let go of the top now? Here's how:

The falling spring, version 1

The horizontal axis here denotes time, the vertical axis the vertical position. The blue curve is the trajectory of the top end of the spring, the green curve the trajectory of the bottom end. The dotted blue line is offset from the blue line by exactly L - the length of the spring in the relaxed state.

Observe that the lower end (the green curve), similarly to the slinky, "waits" for quite a long time for the top to approach before starting to move with discernible velocity. Why is this the case? The trajectory of the lower point can be decomposed into two separate movements. Firstly, the point is trying to fall down due to gravity, following a parabola. Secondly, the point is being affected by spring tension and thus follows a cosine trajectory. Here's how the two trajectories look separately:

They are surprisingly similar at the start, aren't they? And indeed, the cosine function does resemble a parabola up to o(x^3). Recall the corresponding Taylor expansion:

    \[\cos(x) = 1 - \frac{x^2}{2} + \frac{x^4}{24} + \dots \approx 1 - \frac{x^2}{2}.\]

If we align the two curves above, we can see how well they match up at the beginning:

Consequently, the two forces happen to "cancel" each other long enough to leave an impression that the lower end "waits" for the upper for some time. This effect is, however, much more pronounced in the slinky. Why so?

Because, of course, a single spring is not a good model for the slinky. It is more correct to regard a slinky as a chain of springs. Observe what happens if we model the slinky as a chain of just three simple springs:

Each curve here is the trajectory of one of the nodes in between the three individual springs. We can see that the top two curves behave just like a single spring did - the green node waits a bit for the blue and then starts moving. The red one, however, has to wait longer, until the green node moves sufficiently far away. The bottom, in turn, will only start moving observably when the red node approaches it close enough, which means it has to wait even longer - by that time the top has already arrived. If we consider a more detailed model, say a slinky composed of 9 basic springs, the effect of the intermediate nodes "waiting" becomes even more pronounced:
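If you want to see the waiting effect without the plots, here is a minimal numerical sketch (my own, with arbitrarily chosen constants) of the chain-of-springs model just described: point masses joined by ideal Hooke springs, hanging in equilibrium, with the top released at t = 0. The lower nodes barely move until the disturbance reaches them.

import numpy as np

N, m, k, g = 10, 0.05, 20.0, 9.81           # number of masses, mass, stiffness, gravity
L0 = 0.0                                    # rest length of each spring (slinky-like)

# Hanging equilibrium (y measured downward, top node at y = 0):
# the spring between nodes i and i+1 is stretched by (N-1-i)*m*g/k.
ext = (N - 1 - np.arange(N - 1)) * m * g / k
y = np.concatenate(([0.0], np.cumsum(L0 + ext)))
v = np.zeros(N)
y0 = y.copy()

def accel(y):
    e = np.diff(y) - L0                     # extension of each spring
    f = np.full(N, m * g)                   # gravity acts on every node
    f[:-1] += k * e                         # the spring below pulls a node down
    f[1:] -= k * e                          # the spring above pulls a node up
    return f / m

dt = 1e-4
for step in range(2001):                    # semi-implicit Euler, top released at t = 0
    if step % 500 == 0:
        print(f"t={step*dt:.2f}s  node displacements:", np.round(y - y0, 3))
    v += accel(y) * dt
    y += v * dt

The printout shows the top node falling immediately while the displacements of the last few nodes stay at essentially zero for the whole simulated interval.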

To make a "mathematically perfect" model of a slinky we have to go to the limit of having infinitely many infinitely small springs. Let's briefly take a look at what that solution looks like.

The Continuous Slinky

Let x denote the coordinate of a point on a "relaxed" slinky. For example, in the two discrete models above the slinky had 4 and 10 points, numbered 1,\dots, 4 and 1,\dots, 10 respectively. The continuous slinky will have infinitely many points, indexed by the interval [0,1].

Let h(x,t) denote the vertical coordinate of a point x at time t. The acceleration of point x at time t is then, by definition, \frac{\partial^2 h(x,t)}{\partial t^2}, and there are two components affecting it: the gravitational pull -g and the force of the spring.

The spring force acting on a point x is proportional to the stretch of the spring at that point, \frac{\partial h(x,t)}{\partial x}. As each point is affected by the stretch from above and below, we have to consider the difference of the "top" and "bottom" stretches, which is the derivative of the stretch, i.e. \frac{\partial^2 h(x,t)}{\partial x^2}. Consequently, the dynamics of the slinky can be described by the equation:

    \[\frac{\partial^2 h(x,t)}{\partial t^2} = a\frac{\partial^2 h(x,t)}{\partial x^2} - g,\]

where a is some positive constant. Let us denote the second derivatives by h_{tt} and h_{xx}, replace a with v^2 and rearrange to get:

(1)   \[h_{tt} - v^2 h_{xx} = -g,\]

which is known as the wave equation. The name stems from the fact that solutions to this equation always resemble "waves" propagating at a constant speed v through some medium. In our case the medium is the slinky itself. Now it becomes apparent that, indeed, the lower end of the slinky should not move before the wave of disturbance, unleashed by releasing the top end, reaches it. Most explanations of the slinky drop refer to that fact. However, when it is stated alone, without the context of the wave-equation model, it is at best a rather incomplete explanation.

Given how famous the equation is, it is not too hard to solve it. We'll need to do it twice - first to find the initial configuration of a suspended slinky, then to compute its dynamics when the top is released.

In the beginning the slinky must satisfy h_t(x, t) = 0 (because it is not moving at all), h(0, t) = 0 (because the top end is located at coordinate 0), and h_x(1, t) = 0 (because there is no stretch at the bottom). Combining this with (1) and searching for a polynomial solution, we get:

    \[h(x, t) = h_0(x) = \frac{g}{2v^2}x(x-2).\]
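For completeness, here is the quick check (a step I am adding, not spelled out in the original) that this h_0 satisfies the static version of (1) as well as both boundary conditions:

    \[h_0''(x) = \frac{g}{v^2} \;\Rightarrow\; -v^2 h_0''(x) = -g, \qquad h_0(0) = 0, \qquad h_0'(x) = \frac{g}{v^2}(x-1) \;\Rightarrow\; h_0'(1) = 0.\]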

Next, we release the slinky, hence the conditions h_t(x,t)=0 and h(0,t)=0 disappear, and we may use d'Alembert's formula with reflected boundaries to get the solution:

    \[h(x,t) = \frac{1}{2}(\phi(x-vt) + \phi(x+vt)) - \frac{gt^2}{2},\]

    \[\text{ where }\phi(x) = h_0(\mathrm{mod}(x, 2)).\]

Here's how the solution looks visually:

Notice how the part of the slinky that the wave has not yet reached stays completely fixed in place. Here are the trajectories of 4 equally-spaced points on the slinky:

Note how, quite surprisingly, all points of the slinky are actually moving with a constant speed, changing it abruptly at certain moments. Somewhat magically, the mean of all these piecewise-linear trajectories (i.e. the trajectory of the center of mass of the slinky) is still a smooth parabola, just as it should be:
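The closed-form solution is also easy to evaluate directly. Here is a small sketch (my own, with g and v set to 1 for simplicity) confirming that a point x of the slinky stays at its initial height h_0(x) until the wave front reaches it at t = x/v, and only then starts to fall.

import numpy as np

g, v = 1.0, 1.0                             # constants set to 1 for the demo

def h0(x):                                  # initial shape of the suspended slinky
    return g / (2 * v**2) * x * (x - 2)

def phi(x):                                 # periodic extension of h0 with period 2
    return h0(np.mod(x, 2))

def h(x, t):                                # the d'Alembert solution after release
    return 0.5 * (phi(x - v * t) + phi(x + v * t)) - g * t**2 / 2

for t in [0.0, 0.3, 0.6, 0.9, 1.1, 1.5]:
    print(f"t={t:.1f}  bottom end h(1,t)={h(1.0, t):+.4f}   (initially {h0(1.0):+.4f})")
# The printed value stays at h0(1) for all t < 1/v and only then starts to decrease.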

The Secret of Spring Motion

Now let us come back to where we started. Imagine a weight on a spring. What will its motion be like? Obviously, any real-life spring is, just like the slinky, best modeled not as a simple Hooke spring, but rather via the wave equation. Which means that when you let go of the weight, the weight will send out a deformation wave, which will move back and forth along the spring, affecting the pure sinewave movement you might be expecting from the simple Hooke's law. Watch closely:

Here is how the movement of the individual nodes looks:

The fat red line is the trajectory of the weight, and it is certainly not a sinewave. It is a curve in between the piecewise-linear "sawtooth" (which is the limit case when the weight is zero) and the true sinusoid (which is the limit case when the mass of the spring is zero). Here's how the zero-weight case looks:

And this is the other extreme - the massless spring:

These observations can be summarized into the following obvious-sounding conclusion: the basic Hooke's law applies exactly only to the massless spring. Any real spring has mass and thus forms an oscillation wave traveling back and forth along its length, which interferes with the weight's simple harmonic oscillation, making it "less simple and harmonic". Luckily, if the mass of the weight is large enough, this interference is negligible.

And that is, in my opinion, one of the interesting, yet often overlooked aspects of spring motion.

November 14, 2016

Raivo LaanemetsThe infrastructure setup

I run a number of services to make my work easier when providing services as a developer. Some of these are just for myself (like mail) but others are used by my clients who do not want to maintain their own development infrastructure. The services are:

  • Mail
  • Code repositories
  • Project tracker
  • Test server
  • Logging and monitoring
  • System backups
  • Other things

Mail

Mail is my main communication channel with many clients. I have run my own mail server for the last 6 years. It is based on Debian's standard Postfix installation. For the last 3 years I have been using WebMail as the webmail frontend. I do not recommend running your own mail server, as a lot of additional software is needed and there are many pitfalls that may get your mail stuck in recipients' spam folders.

Code repositories

Git is currently the most popular code version control system. Such a system holds source code and other files that need to be tracked for changes over time, often together with a project tracker. A central code repository makes distributing code to servers easier. I host central git repositories through Gitolite.

Project tracker

I use the Redmine issue tracker for managing projects. It is a generic project tracker with built-in time tracking, multi-project support, multi-team support, a repository viewer and lots of other useful features. I have used it since 2010 and have not seen anything better, despite having used almost all popular trackers.

Test server

A new project gets its first deployment as soon as possible. This is how I work. There is a huge communication boost with the client when all work can be deployed immediately for quick feedback. The test server runs at my home. The server has a second role as a NAS to keep larger files and my desktop backups.

Logging and monitoring

Monitoring makes sure that all deployments are up and that there are no upcoming issues such as expiring domains and expiring SSL certificates. Logging records the application activity and sends it to a central location. Error logs are monitored for new errors. The monitoring system sends mail alerts when something is down, about to expire, or running with errors. I'm using a setup based on rsyslog; it is described here in more detail. The system uses a custom frontend to browse the logs and implement alerts.

System backups

The backup system makes file and database backups of every important deployment. This has saved us a couple of times when files have been accidentally modified or removed from production, or when the whole environment has been destroyed. Such events are rare, but it is better to be prepared. The backup system is a collection of bash scripts and I have described it here and here. I usually run backups for free during development and later, in the maintenance phase, for a fee.

Other things

Besides the important stuff above and the client systems I also run:

  • My tech blog.
  • The "watcher". This periodically monitors a number of online forums and Facebook groups for possible job leads. I have made interesting contacts through it, but the actual jobs have come from other channels so far.
  • My company homepage. Promoting it some time ago worked well, but it tends to bring potential clients with projects that are too big. So it has been kept quite minimal for now.
  • An instance of feeds. I read lots of blogs and they are my main source of tech news.
  • An instance of an online accounting software.

Hardware

The infrastructure is currently running on 3 separate machines:

  1. A Linode VPS instance (mail, code repositories, project tracker - important stuff).
  2. An Amazon EC2 instance (backups)
  3. Home server (test server, logging and monitoring)

The servers run Debian or a Debian-based distro. The current setup costs me about 100-200 EUR per month to run (without considering maintenance time). So far I think it's been worth it. It has been built over the last 6 years but might still get some changes in the future.

November 13, 2016

Raivo LaanemetsNow, 2016-11

This is an update on things related to this blog and my work.

Blogging

Upgrades to the blog code:

  • List of last posts under each post page. If it turns out to waste too much space I will remove it.
  • All posts page now has index by year.
  • I have optimized the article print view. It now contains less noise.
  • Flatter style: no border on inline code and code blocks. This improves readability.

I also did some minor edits to existing articles: fixed capitalization and normalized tags.

Work

  • I have updated my generic terms, covering some previously unregulated cases concerning the services I provide and when I provide them.
  • I am back on the Scale Laser project. Over the last 2 weeks I have been studying SVG in more depth for the drawing editor component.

Router upgrade

My home network now has a better router. I have replaced the old router with a hardware-NAT Asus RT-AC53. The old router caused disconnects for the main desktop and WiFi devices. The test server running at my home and hosting many projects was also affected by this. The router upgrade has made the issues disappear.

Open Source

  • Fast-feed supports Node.js 7.x. No actual changes were needed; I only updated the Travis configuration. The readme now contains instructions on how to install it on Windows.
  • Released MySQL-syslog-viewer. It was a weekend project that ended up taking more than one weekend.
  • Released Session Viewer. It was another weekend project.
  • Blog-Core leaked identify processes. Fixed in release 1.1.1.
  • Minor Sublime EJS highlighter improvement.

Besides these I have 2 more weekend projects to announce. The code for them is already finished.

Other things

Docker issues

I ran into issues with Docker again. The idea of containers and automation is good, but current Linux distros and software are not built for it and require too much manual interaction. This makes some things harder than they have to be.

Vue.js

I have looked at Vue.js as a modern JavaScript framework. I can see that it leads to cleaner code than KnockoutJS. I would like to evaluate it on a future project. I have already worked through some small examples and I like it a lot.

Apartment association

Much-needed building renovations were delayed as apartment owners voted them down. There were 51 votes for and 12 against. We have 102 apartments and the required number of votes for the loan was 50%+1, meaning we missed it by 1 vote. We have already scheduled a new vote and hope to pass it. This decision has a high impact on my schedule in 2017, as I won't be able to work in my home office anymore due to construction work.

To relieve the air quality problems before the reconstruction I have bought an LB88 air humidifier. After the renovations it should become unnecessary to use any air quality control equipment, as the ventilation system will be completely rebuilt. At the moment we have slight overheating to provide livable temperatures in all apartments. In my apartment with old windows this causes the air humidity to drop below 20% during winters, which results in breathing issues. I'm using the humidifier to keep it in the 40-50% range.

November 11, 2016

Four Years RemainingThe Meaning of the Gaussian Kernel

A question on Quora reminded me of an explanation that I have wanted to post here every time I got a chance to teach SVMs and kernel methods, but never found the time for. The post expects basic knowledge of those topics from the reader.

Introductory Background

The concept of kernel methods is probably one of the coolest tricks in machine learning. With most machine learning research nowadays centered around neural networks, kernel methods have gone somewhat out of fashion recently, but I suspect they will strike back one day in some way or another.

The idea of a kernel method starts with the curious observation that if you take a dot product of two vectors, x^T y, and square it, the result can be regarded as a dot product of two "feature vectors", where the features are all pairwise products of the original inputs:

    \begin{align*} &k(x,y) = (x^Ty)^2 = (\sum_i x_iy_i)^2 = \sum_{i,j} x_ix_jy_iy_j \\ & = \langle (x_1x_1, x_1x_2,\dots,x_ix_j,\dots,x_nx_n), (y_1y_1,\dots,y_iy_j,\dots,y_ny_n)\rangle \end{align*}

Analogously, if you raise x^Ty to the third power, you are essentially computing a dot product within a space of all possible three-way products of your inputs, and so on, without ever actually having to see those features explicitly.
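To make that observation concrete, here is a tiny numerical check (my own, not from the post) that (x^Ty)^2 really equals the dot product of the two "all pairwise products" feature vectors:

import numpy as np

rng = np.random.default_rng(1)
x, y = rng.standard_normal(4), rng.standard_normal(4)

def pairwise_features(v):
    # (v_1 v_1, v_1 v_2, ..., v_i v_j, ..., v_n v_n) flattened into one vector
    return np.outer(v, v).ravel()

print((x @ y) ** 2)                                  # the kernel value
print(pairwise_features(x) @ pairwise_features(y))   # dot product in feature space; the same number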

If you now take any linear model (e.g. linear regression, linear classification, PCA, etc.) it turns out you can replace the "real" dot product in its formulation with such a kernel function, and this will magically convert your model into a linear model with nonlinear features (e.g. pairwise or triple products). As those features are never explicitly computed, it is no problem if there are millions or billions of them.

Consider, for example, plain old linear regression: f(x) = w^Tx + b. We can "kernelize" it by first representing w as a linear combination of the data points (this is called a dual representation):

    \[f(x) = \left(\sum_i \alpha_i x_i\right)^T x + b = \sum_i \alpha_i (x_i^T x) + b,\]

and then swapping all the dot products x_i^T x with a custom kernel function:

    \[f(x) = \sum_i \alpha_i k(x_i,x) + b.\]

If we now substitute k(x,y) = (x^T y)^2 here, our model becomes a second degree polynomial regression. If k(x,y) = (x^T y)^5, it is a fifth degree polynomial regression, etc. It's like magic: you plug in different functions and things just work.
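The text leaves out how the coefficients \alpha_i are actually found. One common way to obtain them (my addition here, not something the post commits to) is kernel ridge regression, where \alpha = (K + \lambda I)^{-1} y; the bias b is dropped for brevity:

import numpy as np

def poly_kernel(X, Y, degree=2):
    return (X @ Y.T) ** degree

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(50, 1))                 # 1D inputs
y = X[:, 0] ** 2 + 0.05 * rng.standard_normal(50)    # a noisy quadratic target

lam = 1e-3                                           # ridge regularization strength
K = poly_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

X_test = np.linspace(-1, 1, 5).reshape(-1, 1)
y_pred = poly_kernel(X_test, X) @ alpha              # f(x) = sum_i alpha_i k(x_i, x)
print(np.c_[X_test[:, 0] ** 2, y_pred])              # predictions track x^2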

It turns out that there are lots of valid choices for the kernel function k, and, of course, the Gaussian function is one of these choices:

    \[k(x, y) = \exp\left(-\frac{|x - y|^2}{2\sigma^2}\right).\]

It is not too surprising - the Gaussian function tends to pop up everywhere, after all - but it is not obvious what "implicit features" it should represent when viewed as a kernel function. Most textbooks do not seem to cover this question in sufficient detail, so let me do it here.

The Gaussian Kernel

To see the meaning of the Gaussian kernel we need to understand a couple of ways in which kernel functions can be combined. We saw before that raising a linear kernel to the power i makes a kernel whose feature space includes all i-wise products. Now let us examine what happens if we add two or more kernel functions. Consider k(x,y) = x^Ty + (x^Ty)^2, for example. It is not hard to see that it corresponds to an inner product of feature vectors of the form

    \[(x_1, x_2, \dots, x_n, \quad x_1x_1, x_1x_2,\dots,x_ix_j,\dots, x_n x_n),\]

i.e. the concatenation of degree-1 (untransformed) features, and degree-2 (pairwise product) features.

Multiplying a kernel function with a constant c is also meaningful. It corresponds to scaling the corresponding features by \sqrt{c}. For example, k(x,y) = 4x^Ty = (2x)^T(2y).

Still with me? Great, now let us combine the tricks above and consider the following kernel:

    \[k(x,y) = 1 + x^Ty + \frac{(x^Ty)^2}{2} + \frac{(x^Ty)^3}{6}.\]

It is a kernel that corresponds to a feature mapping which concatenates a constant feature, all original features, all pairwise products scaled down by \sqrt{2} and all triple products scaled down by \sqrt{6}.

Looks impressive, right? Let us continue and add more terms to this kernel, so that it contains all four-wise, five-wise, and so on up to infinity-wise products of the input features. We shall choose the scaling coefficients for each term carefully, so that the resulting infinite sum resembles a familiar expression:

    \[\sum_{i=0}^\infty \frac{(x^Ty)^i}{i!} = \exp(x^Ty).\]

We can conclude that k(x,y) = \exp(x^Ty) is a valid kernel function, corresponding to a feature space that includes products of the input features of any degree, up to infinity.

But we are not done yet. Suppose that we decide to normalize the inputs before applying our linear model. That is, we want to convert each vector x to \frac{x}{|x|} before feeding it to the model. This is quite often a smart idea, which improves generalization. It turns out we can do this “data normalization” without really touching the data points themselves, but by only tuning the kernel instead.

Consider again the linear kernel k(x,y) = x^Ty. If we apply it to normalized vectors, we get

    \[k\left(\frac{x}{|x|}, \frac{y}{|y|}\right) = \left(\frac{x}{|x|}\right)^T\left(\frac{y}{|y|}\right) = \frac{x^Ty}{|x||y|} = \frac{k(x,y)}{\sqrt{k(x,x)k(y,y)}}.\]

With some reflection you will see that the latter expression would normalize the features for any kernel.
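Expressed in code, this normalization is a one-line wrapper that works for any kernel function (a sketch of the formula above, my own; for the linear kernel it reduces to plain cosine similarity):

import numpy as np

def normalized(k):
    # turn any kernel k into its normalized version k(x,y)/sqrt(k(x,x)k(y,y))
    return lambda x, y: k(x, y) / np.sqrt(k(x, x) * k(y, y))

linear = lambda x, y: x @ y
x, y = np.array([3.0, 4.0]), np.array([1.0, 2.0])

print(normalized(linear)(x, y))                              # 0.9838...
print((x / np.linalg.norm(x)) @ (y / np.linalg.norm(y)))     # the same cosine similarity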

Let us see what happens if we apply this kernel normalization to the “infinite polynomial” (i.e. exponential) kernel we just derived:

    \begin{align*} \frac{\exp(x^Ty)}{\sqrt{\exp(x^Tx)\exp(y^Ty)}} &= \frac{\exp(x^Ty)}{\exp(|x|^2/2)\exp(|y|^2/2)}  \\ &= \exp\left(-\frac{1}{2}(|x|^2+|y|^2-2x^Ty)\right) \\ &= \exp\left(-\frac{|x-y|^2}{2}\right). \end{align*}

Voilà, the Gaussian kernel. Well, it still lacks \sigma^2 in the denominator, but by now you hopefully see that adding it is equivalent to scaling the inputs by 1/\sigma.

To conclude: the Gaussian kernel is a normalized polynomial kernel of infinite degree (where feature products of i-th degree are scaled down by \sqrt{i!} before normalization). Simple, right?

An Example

The derivations above may look somewhat theoretical, if not "magical", so let us work through a couple of numeric examples. Suppose our original vectors are one-dimensional (that is, real numbers), and let x = 1, y = 2. The value of the Gaussian kernel k(x, y) for these inputs is:

    \[k(x, y) = \exp(-0.5|1-2|^2) \approx 0.6065306597...\]

Let us see whether we can obtain the same value as a simple dot product of normalized polynomial feature vectors of a high degree. For that, we first need to compute the corresponding unnormalized feature representation:

    \[\phi'(x) = \left(1, x, \frac{x^2}{\sqrt{2!}}, \frac{x^3}{\sqrt{3!}}, \frac{x^4}{\sqrt{4!}}, \frac{x^5}{\sqrt{5!}}\dots\right).\]

As our inputs are rather small in magnitude, we can hope that the feature sequence quickly approaches zero, so we don't really have to work with infinite vectors. Indeed, here is how the feature sequences look:

    \phi'(1) = (1, 1, 0.707, 0.408, 0.204, 0.091, 0.037, 0.014, 0.005, 0.002, 0.001, 0.000, 0.000, ...)

    \phi'(2) = (1, 2, 2.828, 3.266, 3.266, 2.921, 2.385, 1.803, 1.275, 0.850, 0.538, 0.324, 0.187, 0.104, 0.055, 0.029, 0.014, 0.007, 0.003, 0.002, 0.001, ...)

Let us limit ourselves to just these first 21 features. To obtain the final Gaussian kernel feature representations \phi we just need to normalize:

    \phi(1) = \frac{\phi'(1)}{|\phi'(1)|} = \frac{\phi'(1)}{1.649} = (0.607, 0.607, 0.429, 0.248, 0.124, 0.055, 0.023, 0.009, 0.003, 0.001, 0.000, ...)

    \phi(2) = \frac{\phi'(2)}{|\phi'(2)|} = \frac{\phi'(2)}{7.389} = (0.135, 0.271, 0.383, 0.442, 0.442, 0.395, 0.323, 0.244, 0.173, 0.115, 0.073, 0.044, 0.025, 0.014, 0.008, 0.004, 0.002, 0.001, 0.000, ...)

Finally, we compute the simple dot product of these two vectors:

    \[\scriptstyle\phi(1)^T\phi(2) = 0.607\cdot 0.135 + 0.607\cdot 0.271 + \dots = {\bf 0.6065306}602....\]

In boldface are the decimal digits that match the value of \exp(-0.5|1-2|^2). The discrepancy is probably due more to the limits of floating-point precision than to our truncation of the feature vectors.
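The whole 1D computation above fits in a few lines of code; here is a sketch (mine) that reproduces it, with the same truncation at 21 features:

import numpy as np
from math import exp, factorial

def phi(x, n_features=21):
    raw = np.array([x**i / np.sqrt(factorial(i)) for i in range(n_features)])
    return raw / np.linalg.norm(raw)         # normalization divides by roughly exp(x^2/2)

print(phi(1.0) @ phi(2.0))                   # ~0.6065306...
print(exp(-0.5 * (1.0 - 2.0) ** 2))          # the exact Gaussian kernel value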

A 2D Example

The one-dimensional example might have seemed somewhat too simplistic, so let us also go through a two-dimensional case. Here our unnormalized feature representation is the following:

    \begin{align*} \scriptscriptstyle\phi'(&\scriptscriptstyle x_1, x_2) = \left(1, x_1, \frac{x_1x_1}{\sqrt{2!}}, \frac{x_1x_2}{\sqrt{2!}}, \frac{x_2x_1}{\sqrt{2!}}, \frac{x_2x_2}{\sqrt{2!}}, \right. \\ &\scriptscriptstyle \left.\frac{x_1x_1x_1}{\sqrt{3!}}, \frac{x_1x_1x_2}{\sqrt{3!}}, \frac{x_1x_2x_1}{\sqrt{3!}}, \frac{x_1x_2x_2}{\sqrt{3!}}, \frac{x_2x_1x_2}{\sqrt{3!}},  \frac{x_2x_2x_1}{\sqrt{3!}}, \dots\right).\\ \end{align*}

This looks pretty heavy, and we didn't even finish writing out the third degree products. If we wanted to continue all the way up to degree 20, we would end up with a vector with 2097151 elements!

Note, however, that many products are repeated (e.g. x_1x_2 = x_2x_1), hence these are not really all different features. Let us try to pack them more efficiently. As you'll see in a moment, this will open up a much nicer perspective on the feature vector in general.

Basic combinatorics tells us that each feature of the form \frac{x_1^a x_2^b}{\sqrt{n!}} must be repeated exactly \frac{n!}{a!b!} times in our current feature vector. Thus, instead of repeating it, we could replace it with a single feature, scaled by \sqrt{\frac{n!}{a!b!}}. "Why the square root?" you might ask here. Because when combining a repeated feature we must preserve the overall vector norm. Consider a vector (v, v, v), for example. Its norm is \sqrt{v^2 + v^2 + v^2} = \sqrt{3}|v|, exactly the same as the norm of the single-element vector (\sqrt{3}v).

As we do this scaling, each feature gets converted to a nice symmetric form:

    \[\sqrt{\frac{n!}{a!b!}}\frac{x_1^a x_2^b}{\sqrt{n!}} = \frac{x_1^a x_2^b}{\sqrt{a!b!}} = \frac{x_1^a}{\sqrt{a!}}\frac{x_2^b}{\sqrt{b!}}.\]

This means that we can compute the 2-dimensional feature vector by first expanding each coordinate into a vector of powers, like we did in the previous example, and then taking all their pairwise products. This way, if we wanted to limit ourselves to maximum degree 20, we would only have to deal with at most 21^2 = 441 features (just 231 of them if we keep only products of total degree at most 20) instead of 2097151. Nice!
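Here is a sketch (my own code) of that packing idea: expand each coordinate into its power features u^a/\sqrt{a!}, take all pairwise products, normalize, and compare the resulting dot product with the exact Gaussian kernel value. This simple outer-product version keeps all 441 products (total degree up to 40), slightly more than strictly needed, and the inputs are arbitrary small test vectors:

import numpy as np
from math import exp, factorial

def powers(u, max_degree=20):
    # per-coordinate features u^a / sqrt(a!) for a = 0 .. max_degree
    return np.array([u**a / np.sqrt(factorial(a)) for a in range(max_degree + 1)])

def packed_features(x):
    raw = np.outer(powers(x[0]), powers(x[1])).ravel()   # all 21*21 = 441 pairwise products
    return raw / np.linalg.norm(raw)

x, y = np.array([0.25, -0.3]), np.array([0.1, 0.2])
print(packed_features(x) @ packed_features(y))           # dot product of packed features
print(exp(-0.5 * np.sum((x - y) ** 2)))                  # the exact Gaussian kernel value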

Here is a new view of the unnormalized feature vector up to degree 3:

    \[\phi'_3(x_1, x_2) = \scriptstyle\left(1, x_1, x_2, \frac{x_1^2}{\sqrt{2!}}, \frac{x_1x_2}{\sqrt{1!1!}}, \frac{x_2^2}{\sqrt{2!}}, \frac{x_1^3}{\sqrt{3!}}, \frac{x_1^2x_2}{\sqrt{2!1!}}, \frac{x_1x_2^2}{\sqrt{1!2!}}, \frac{x_2^3}{\sqrt{3!}}\right).\]

Let us limit ourselves to this degree-3 example and let x = (0.7, 0.2), y = (0.1, 0.4) (if we picked larger values, we would need to expand our feature vectors to a higher degree to get a reasonable approximation of the Gaussian kernel). Now:

    \phi'_3(0.7, 0.2) = (1, 0.7, 0.2, 0.346, 0.140, 0.028, 0.140, 0.069, 0.020, 0.003),

    \phi'_3(0.1, 0.4) = (1, 0.1, 0.4, 0.007, 0.040, 0.113, 0.000, 0.003, 0.011, 0.026).

After normalization:

    \phi_3(0.7, 0.2) = (0.768, 0.538, 0.154, 0.266, 0.108, 0.022, 0.108, 0.053, 0.015, 0.003),

    \phi_3(0.1, 0.4) = (0.919, 0.092, 0.367, 0.006, 0.037, 0.104, 0.000, 0.003, 0.010, 0.024).

The dot product of these vectors is 0.8196\dots; what about the exact Gaussian kernel value?

    \[\exp(-0.5|x-y|^2) = 0.{\bf 81}87\dots.\]

Close enough. Of course, this could be done for any number of dimensions, so let us conclude the post with the new observation we made:

The features of the unnormalized d-dimensional Gaussian kernel are:

    \[\phi({\bf x})_{\bf a} = \prod_{i = 1}^d \frac{x_i^{a_i}}{\sqrt{a_i!}},\]

where {\bf a} = (a_1, \dots, a_d) \in \mathbb{N}_0^d.

The Gaussian kernel is just the normalized version of that, and we know that the norm to divide by is \sqrt{\exp({\bf x}^T{\bf x})}. Thus we may also state the following:

The features of the d-dimensional Gaussian kernel are:

    \[\phi({\bf x})_{\bf a} = \exp(-0.5|{\bf x}|^2)\prod_{i = 1}^d \frac{x_i^{a_i}}{\sqrt{a_i!}},\]

where {\bf a} = (a_1, \dots, a_d) \in \mathbb{N}_0^d.

That's it, now you have seen the soul of the Gaussian kernel.

November 10, 2016

TransferWise Tech BlogWhen to Throw an Exception

The simplest rule to figure out whether we need an exception is to ask whether having a stacktrace will give any value. For example, do we care about the stacktrace when some business rule is violated?

One widely known principle is that exceptions should be used for programming errors. But what is a programming error? Maybe it is easier to understand if we say that exceptions should be used when something happens that should have been avoided by using common knowledge. Common knowledge is something that we believe all well-behaving consumers of our API should have.

For example, we can assume everybody knows that there are 31 days in December, so the following is a valid case for throwing an exception:

newYearsEve = Date.of(32, 12, 2016)  

Now let's say we have a library in our application that provides currency exchange rates. In this example the library only supports rates with EUR as the source currency. When a client calls this API with GBP-USD, we could throw an exception. However, this is not a programming error, as we cannot expect everybody to know about this constraint in the library. We can solve this by exposing the list of supported routes as well and having the following API:

Rate getRate(Route route)  
boolean isSupported(Route route)  

We are now exporting the knowledge about supported routes to the client. This knowledge is certainly not part of common knowledge, but by adding isSupported we have made it possible for the client to easily acquire it. Hence here we could again opt for throwing an exception when getRate gets called with an unsupported route.

Read more about this topic: http://c2.com/cgi/wiki?AvoidExceptionsWheneverPossible

Raivo LaanemetsAnnouncement: Session Viewer

Some time ago I put together a web site visitor logger. At first I did it as an experiment to see how many of my blog visitors are bots. I also wanted to see if I could use data from it to implement a better anti-bot solution for Google Analytics (GA).

I'm storing session data (a session in my app has the lifespan of a session cookie) in a MySQL database. The data includes easy-to-obtain numerical and boolean values. Besides running analytics I can now also browse individual sessions, which was not possible with GA. At the moment that feature is where I see the most value of the app. Referrer spam has also disappeared, but that is because the solution uses a custom protocol.

Bot detection

For bot detection I implemented the following techniques:

  1. Check for the specific JavaScript global properties described in this StackOverflow thread.
  2. Number of mouse/scroll events.
  3. Recording page view/session duration.

Results

I have put together some statistics based on the October data from this blog.

Total number of sessions: 1396
Total number of pageviews: 2148
Total number of distinct IP addresses: 1202

Global variables

By the number of sessions.

window.callPhantom: 3 (0.21%)
window.Buffer: 0 (0.00%)
window.emit: 0 (0.00%)
window.spawn: 0 (0.00%)
window.webdriver: 0 (0.00%)
window.domAutomation: 0 (0.00%)

Mouse and scroll events

Sessions with mouse events: 935 (66.98%)
Sessions with scroll events: 899 (64.40%)
Sessions with mouse or scroll events: 1074 (76.93%)

Session length

Sessions longer than 5 seconds: 1267 (90.67%)
Sessions longer than 15 seconds: 1096 (78.51%)
Sessions longer than 30 seconds: 885 (63.40%)
Sessions longer than 60 seconds: 849 (60.81%)
Sessions longer than 120 seconds: 619 (44.34%)
Sessions longer than 240 seconds: 592 (42.41%)

Results summary

Detection of bots through JavaScript global variables does not work well for generic bots. A number of bots, including Google's and Baidu's indexers, will pass the check. They execute JavaScript and stay on a page for 5-10 seconds. I checked this by randomly picking short sessions and studying their IP locations (the app has built-in support for this through the IP-Lookup service from WhatIsMyIPAddress.com).

I did find some referral spam through manual inspection. There were 1-2 sessions per week with a referral to free-share-buttons spam. The bot executed JavaScript, but none of the globals above were set. The session length was less than 10 seconds and there were no mouse or keyboard events.

In the end I decided to filter sessions at 30 seconds. This seems to keep the statistics clean enough without losing too much information. Most of the interesting data for me is in longer multi-page sessions anyway.

Session Viewer

I added a frontend with a couple of tables and charts once I had enough useful data in the database. It contains very few analytical features: only the last month's top entry pages and top pages are shown. The goal was not to replace GA, which is much more powerful with all its predefined and custom views, but to complement it with the ability to browse individual sessions. This feature was very simple to add to the app. The session page just lists all visited pages by the view start time.

Screenshot: list of sessions

Session Viewer - List of sessions

Screenshot: single session

Session Viewer - Single session

Source code and documentation can be found in the GitHub repository.

November 05, 2016

Kiibi lühiuudiste blogiSpaceX apparently solved the fueling explosion

The accident of SpaceX's Falcon 9 rocket, which most recently blew up in a great display of fireworks during fueling, may have been caused by the oxidizer, liquid oxygen, cooling too much and solidifying. How exactly this in turn caused the explosion is not explained, but the excessive cooling was caused by liquid helium, which is stored close by at an even lower temperature.

Read more at Engadget.


November 02, 2016

Raivo LaanemetsBlog-Core 1.1.1

Today I have released Blog-Core version 1.1.1. The update contains a fix for leaking identify processes. The identify binary is a part of ImageMagick and is used for detecting image dimensions when "Insert as image" is clicked in the administrative UI. The process leak was a result of not closing its output stream/pipe.

The update contains no other changes and is fully compatible with the previous versions. To upgrade through swipl:

?- pack_install(blog_core, [upgrade(true)]).

October 30, 2016

Raivo LaanemetsAnnouncement: MySQL-syslog-viewer

Today I finished open-sourcing my log monitoring app. It started as a weekend project to find an alternative to real-time logging services and other apps. There are lots of alternatives, but many of them did not satisfy my requirements or were a bit expensive to use. I discussed these topics in my previous article about logging.

The source code and documentation of the app can be found on GitHub. There is not much code. All the heavy lifting is done by rsyslog and MySQL. I just put together a simple web app with some nice Node.js packages for querying the database and sending mail. How it relates to the system architecture is shown in the following diagram.

MySQL-syslog-viewer

At the moment I have this setup running for 8 machines and about 30 apps. Through the remote rsyslog instances it is monitoring about 200 different log sources (files, scripts, apps) in total. The performance is OK; log insertion and querying are fast (5000 lines/sec is not an issue on a low-end dual-core CPU with 5400 rpm Western Digital Red drives). However, my original plan to purge logs by selective queries turned out to be too slow. This is the reason why the final version of the code only supports table purge by log line count and has no selective purge filters.

Screenshot from the working app. Source code and documentation are on GitHub.

October 26, 2016

Ingmar TammeväliI rediscovered the Opera web browser (no-ads rocks)

operabenchmark

Since ads are stuffed into pages senselessly these days, loading web pages takes time even on a fast connection, and the computer's CPU takes a serious beating.

Opera download link

Now AdBlock and the others have also gone down the marketing road, where some companies can …buy themselves a so-called safe-ads service. In other words, these ad blockers are no longer trustworthy.

What I liked most was Opera's native ad block (i.e. the browser can hide ads right out of the box); it is also cool that it shows statistics on how long a page takes to load with ads and without…


Raivo LaanemetsDoing some Python

For the last three weeks I have been doing some Python. I got a desktop automation project where the main codebase ran on a Raspberry Pi computer. I did not choose Python myself, but a part of the codebase was already developed in it. The project was in an extreme hurry and I did not have time to switch it to a language I knew better.

I had used Python before in 2008, in scientific computing classes and in programming teaching practice classes during my MSc studies. The former case used Python as a wrapper around some native numerical vector computation libraries and the latter case dealt with extremely basic language constructs like loops. These were both very academic use cases.

The Pi application used Python 3. I started by copy-pasting together code examples from Google and testing thoroughly. The code processed Flic.io button clicks and made some TCP requests. The buttons part was simple, but the TCP part turned out moderately hard:

  • Flic.io loop was tied to the main thread.
  • TCP connection has to be open and handled in other threads.
  • Sockets are not threadsafe.

This led to a solution with non-blocking IO and a complex queue- and lock-based approach. In the end I got it working properly, but the code turned out pretty long and complicated compared to similar code I have written in Node.js, where IO is asynchronous by default and where you cannot have race conditions. The server-side daemon coordinating our Pi setups was indeed written in Node.js.
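For the record, here is a minimal illustration (my own sketch, not the project's actual code) of the queue-based idea: the TCP socket lives in a single worker thread, and other threads, such as the Flic button callbacks, only put messages on a thread-safe queue.Queue, so the socket itself is never shared between threads. The host, port and message are hypothetical.

import queue
import socket
import threading

send_queue = queue.Queue()                   # thread-safe handoff point

def tcp_worker(host, port):
    # The only thread that ever touches the socket.
    with socket.create_connection((host, port)) as sock:
        while True:
            message = send_queue.get()       # blocks until an event arrives
            if message is None:              # sentinel value shuts the worker down
                break
            sock.sendall(message)

# hypothetical host and port, for illustration only
worker = threading.Thread(target=tcp_worker, args=("127.0.0.1", 9000), daemon=True)
worker.start()

def on_button_click():                       # e.g. called from the Flic event loop
    send_queue.put(b"button pressed\n")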

In a later stage the project also required 2 GUI applications running on Windows. It came as a nice surprise that Python has a built-in GUI toolkit. The alternatives considered were Qt (C++, too slow to develop in, requires a messy setup with compilers and IDEs), Node.js/Electron (Node.js has no cross-platform GUI toolkit at all and Electron has miserable performance on low-power devices), and SWI-Prolog (it has XPCE and I know the language pretty well, but I wanted to avoid dealing with concurrent threaded IO in another language).

I managed to build the apps with Python and tkinter. Another good surprise was support for .pyw bundles that allow starting applications on Windows by clicking the file icon. This also allowed me to put together a trivial NSIS installer that places a working launch icon on the user's desktop.

Things I like about Python

  • Built-in cross-platform GUI toolkit tkinter.
  • Lots of other good stuff in standard library.
  • Bundling to pyz/pyw files.
  • Installed on Linux distros by default.
  • Relatively low memory usage.

and not like

  • Documentation sucks (how does non-blocking socket.recv behave?).
  • Too many IO approaches (blocking, coroutines, async, etc).
  • Lots of obsolete Python 2 example code that blows up under Python 3.
  • Weird strings (binary stuff should be separated into Buffer-like blobs).
  • No application-based package manager like NPM (Pip is system-wide).

This will likely be my only Python project for now. I generally do not develop GUI applications or Raspberry Pi applications. However, if I had to, I might use Python again.

October 19, 2016

Kuido tehnokajamRaporti andmeallika akna taastamine

The Ctrl+Alt+D key combination brings back the report's Report Data (data sources) window if it has gone missing.

October 11, 2016

Andrei SolntsevMigration to Mockito 2.1

Recently we got some great news.

They released Mockito 2.1.0 - after so many years of hopeless waiting!

There are a lot of cool updates in Mockito 2, including:

  • Support for Java 8
  • Migration from CGLIB to ByteBuddy
  • Mocking final classes and methods

Yes! I must use it! - I decided.

What a huge disappointment it was for me…

Epic fail

I tried to upgrade my working project to Mockito 2.1.0, and … oops.

I got over 100 red tests (out of ~6000).

My first emotion was - what the hell! Mockito 2 sucks!

But further investigation showed that Mockito 2 is cool; it's some of my tests that were bad. The new Mockito found a whole bunch of problems in our tests.

Let’s discover them together.

What you need to do

Below you can find my tutorial for migration to Mockito 2.

Let’s start!

First, you will need to replace imports

A lot of imports. In many files. Replace these lines:

import static org.mockito.Matchers.any;
import static org.mockito.Matchers.anyLong;
import static org.mockito.Matchers.anyVararg; // not needed - replaced by any()

by these lines:

import static org.mockito.ArgumentMatchers.any;
import static org.mockito.ArgumentMatchers.anyLong;

Second, you will update the API of doAnswer

Was:

when(userDeviceService.save(any(UserDevice.class)))
    .then(invocation -> invocation.getArgumentAt(0, UserDevice.class));

Now:

when(userDeviceService.save(any()))
    .then(invocationOnMock -> invocationOnMock.getArgument(0));

(getArgumentAt -> getArgument, remove the class)

Third, you will update the API of isMock

Was:

import org.mockito.internal.util.MockUtil;

assertFalse(new MockUtil().isMock(expected));

Now:

import static org.mockito.internal.util.MockUtil.isMock;

assertFalse(isMock(expected));

The new version is really better, don’t you think? No need to create an unnecessary object.

What problems you will experience

Some of your tests will break when you upgrade to Mockito 2.1. Below you will find some of the reasons.

1) Matcher any() does not match null anymore

This code worked before:

doNothing().when(service).calculateAmounts(any(BigDecimal.class), anyString());

But it doesn’t work anymore if any of the parameters is null. Yes, my friend! You will need to rework all your tests that pass null instead of a realistic value.

And it’s great, because the new code is really better:

doNothing().when(service).calculateAmounts(order.amount, order.currency);

Or this way:

doNothing().when(service).calculateAmounts(new BigDecimal("100.00"), "EEK");

If you really need to pass a null parameter, you can still do it. Just match null explicitly using the isNull() matcher:

doNothing().when(service).calculateAmounts(any(BigDecimal.class), isNull());

2) Matcher anyInt() does not match long anymore

This code worked with Mockito 1.x, but fails with Mockito 2.x:

when(mqService.send(anyString(), anyInt())).thenReturn("transfer-ref");

You need to replace anyInt() by anyLong() for Mockito 2.1:

 when(mqService.send(anyString(), anyLong())).thenReturn("transfer-ref");

Yes, my friend. You will need to rework all your tests that pass int instead of long etc. But it’s great, don’t you think? Because these tests were inaccurate.

3) You will discover bad tests

Yes, my friend. You will discover a lot of bad lines among your tests. Just bad. Like this one:

@Test
public void checkMoratoriumRunsSilentlyWhenNoMoratorium() {
  doReturn("false").when(service).parseMoratoriumMessage(any(Mandate.class), any(LoanApplication.class));
  ...
  service.checkForMoratorium(any(Mandate.class), any(LoanApplication.class)); // WHAAAAAAAT?
  ...
}

This code worked with Mockito 1.x, surprisingly. But it fails with Mockito 2.1 - and that’s great, don’t you think?

Obviously, we should use mock instead of any in the second line:

service.checkForMoratorium(mock(Mandate.class), mock(LoanApplication.class));

By the way, an even better solution is to avoid both mock and any and just create plain objects:

service.checkForMoratorium(new Mandate(), new LoanApplication());

4) You will discover a lot of sloppy tests

… that check only some parameters and don’t discover that others are null.

For example:

doReturn(user).when(loginService).tokenLogin(eq("bob"), eq("login-key"), anyString());
    
security.login("bob", "login-key", null);

As you can see, the test passes a null parameter in the second line. And only null! I discovered that there was no test that would pass anything other than null there.

Mockito 2.1 will fail with such a sloppy test, because the anyString() matcher doesn’t match null anymore.

The newer test that works with Mockito 2.1 is actually more precise:

    request.remoteAddress = "127.0.0.2";
    doReturn(user).when(loginService).tokenLogin("bob", "login-key", "127.0.0.2");
    ...

You see? You don’t need all these obscure eq and anyString calls. Much better!

5) You will discover mystical red tests

You will find some red tests for which it’s really hard to understand why they fail.

For example:

@Test
public void requestByReferenceNumberNeverCreatesSubscription() {
  RequestByReferenceNumber requestByReferenceNumber = new RequestByReferenceNumber(user, "12345678901234567890");
  when(gisgmpService.request(any(RequestByDrivingLicense.class))).thenReturn(requestByReferenceNumber);

  GISGMP.requestCharges("12345678901234567890");
  ...

It took quite a long time to solve this issue. I could not find out why this test started to fail with Mockito 2.

Please note the second line. Obviously the authors wanted to write any(RequestByReferenceNumber.class) instead of any(RequestByDrivingLicense.class) (they both inherit from the same superclass).

It seems to have been a bug in Mockito 1: it allowed using any(AWrongClass.class), and this incorrect test had been green for several years. :(

6) You will find out that anyList() and anyCollection() are now different things

For example, this code worked with Mockito 1:

  @Test
  public void domesticPaymentInForeignCurrencyCanBeEnabled() {
    doCallRealMethod().when(Payments.accountService).forOperation(anyList(), eq(DOMESTIC));

    Collection<Account> accounts = ...
    
    return accountService.forOperation(accounts, DOMESTIC);

Note that the mock in the first line uses anyList(), but the code actually passes the variable accounts of type Collection (though at runtime it is actually a List).

Mockito 2 doesn’t allow such sloppiness anymore. You need to match more precisely using anyCollection().

And you will be happy

To summarize: there will be a lot of pain, but you will end up happy. The tests got better, and the world got brighter.

Peace 2.1.0!

Andrei Solntsev

asolntsev.github.io