Pinu planeet

January 22, 2017

Anton Arhipov: Twitterfeed #4

Welcome to the fourth issue of my Twitterfeed. I'm still quite irregular about posting the links, but here are some interesting articles that I think are worth sharing.

News, announcements and releases


Atlassian acquired Trello. OMG! I mean... happy for the Trello founders. I just hope that the product remains as good as it was.

Docker 1.13 was released. Using compose-files to deploy swarm mode services is really cool! The new monitoring and build improvements are handy. Also Docker is now AWS and Azure-ready, which is awesome!

Kotlin 1.1 beta was published with a number of interesting new features. I have mixed feelings, however. For instance, I really find type aliases an awesome feature, but the definition keyword, "typealias", feels too verbose. Just "alias" would have been much nicer.
Meanwhile, Kotlin support was announced for Spring 5. I think this is great - Kotlin support in the major frameworks will definitely help the adoption.

Is there anyone using Eclipse? [trollface] Buildship 2.0 for Eclipse is available, go grab it! :)

Resonating articles


RethinkDB: Why we failed. Probably the best post-mortem that I have ever read. You will notice a strange kvetch at first about the tough market and how no one wants to pay. But reading on, the author honestly lists what really went wrong. Sad that it didn't take off - it was a great project.

The Dark Path - probably the most controversial blog post I've read recently. Robert Martin gives his take on Swift and Kotlin. A lot of people, the proponents of strong typing, reacted to this blog post immediately. "Types are tests!", they said. However, I felt like Uncle Bob wrote this article just to repeat his point about tests: it doesn't matter whether your programming language is strongly typed or not, you should write tests. No one would disagree with that statement, I believe. However, the follow-up article was just strange: "I consider the static typing of Swift and Kotlin to have swung too far in the statically type-checked direction." OMG, really!? Has Robert seen Scala or Haskell? Or Idris? IMO, Swift and Kotlin hit the sweet spot with a type system that actually _helps_ developers without getting in the way. Quite a disappointing read, I have to say.

Java 9


JDK 9 is feature complete. That is great news. Now, it will be interesting to see how the ecosystem survives all the issues related to reflective access. Workarounds exist, but there should be a proper solution without such hacks. Jigsaw has caused a lot of concern here and there, but the bet is that in the long run the benefits will outweigh the inconveniences.

Misc


The JVM is not that heavy
15 tricks for every web dev
Synchronized decorators
Code review as a gateway
How to build a minimal JVM container with Docker and Alpine Linux
Lagom, the monolith killer
Reactive Streams and the weird case of backpressure
Closures don’t mean mutability.
How do I keep my git fork up to date?

Predictions for 2017


Since it is the beginning of 2017, it is trendy to make predictions for the upcoming year. Here are some predictions by industry thought leaders:

Adam Bien’s 2017 predictions
Simon Ritter’s 2017 predictions
Ted Neward’s 2017 predictions

January 04, 2017

TransferWise Tech Blog: Effective Reuse on Frontend

In my previous post I discussed the cost of reuse and some strategies for dealing with it on the backend. What about the frontend? In terms of reuse the two are very similar. When we have more than just a few teams regularly contributing to the frontend, we need to start thinking about how we approach reuse across different contexts/teams.

Exposing some API of our microservice to other teams makes it a published interface. Once this is done, we cannot change it that easily anymore. The same happens on the frontend when a team decides to "publish" some frontend component to be reused by other teams. The API (as well as the look) of this component becomes part of the contract exposed to the outside world.

Hence I believe that:

We should split the web frontend into smaller pieces — microapps — much the same way as we split the backend into microservices. Development and deployment of these microapps should be as independent of each other as possible.

This aligns quite well with the ideas of Martin Fowler, James Lewis and Udi Dahan, who suggest that "microservice" is not a backend-only concept. Instead of process boundaries, it should be defined by business capabilities and include its own UI if necessary.

As with microservices, we want to promote reuse within each microapp while being careful with reuse across different microapps/teams.

January 02, 2017

Raivo Laanemets: Now, 2017-01, summary of 2016 and plans for 2017

This is an update on things related to this blog and my work.

Last month

Blogging

  • Added a UX improvement: external links now have target="_blank" so that they open in a new tab. The justification can be found in this article. It is implemented using a small piece of script in the footer (see the sketch after this list).
  • Updated the list of projects to include work done in 2016.
  • Updated the visual style for better readability. The article page puts more focus on the content and less on the related things.
  • Updated the CV.
  • Found and fixed some non-valid HTML markup on some pages.
  • Wrote announcements for my latest Open Source projects: DOM-EEE and Dataline.
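For illustration, here is a minimal sketch of what such a footer script could look like (my own reconstruction, not the exact script used on this blog):

(function () {
  // Mark off-site links so that they open in a new tab.
  var links = document.querySelectorAll('a[href]');
  for (var i = 0; i < links.length; i++) {
    var link = links[i];
    // A link is external when it points to a different host than the current page.
    if (link.hostname && link.hostname !== window.location.hostname) {
      link.target = '_blank';
      link.rel = 'noopener'; // do not give the new tab access to window.opener
    }
  }
})();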

I also discovered that mail notifications were not working. The configuration was broken for some time and I had disabled alerts on the blog engine standard error stream. I have fixed the mail configuration and monitor the error log for mail sending errors.

Work

I built an Electron-based desktop app. I usually do not build desktop applications and consider them a huge pain to build and maintain. This was a small project taking 2 weeks, and I also used it as a chance to evaluate the Vue.js framework. Vue.js works very well with Electron and was very easy to pick up thanks to the similarities with the KnockoutJS library. I plan to write about both in separate articles.

The second part of my work included a DXF file exporter. DXF is a vector drawing format used by AutoCAD and industrial machines. My job was to convert and combine SVG paths from an online CAD editor into a single DXF file for a laser cutter.

While filing my annual report I was positively surprised that I needed to file very little paperwork. It only required a balance sheet, a profit/loss statement and 3 small trivial additional reports. In previous years I had to file the much more comprehensive report that is now required only from mid-size (on the Estonian scale) companies with about 250 employees.

Infrastructure

I have made some changes to my setup:

  • Logging and monitoring was moved to an OVH VPS.
  • Everything else important has been moved away from the home server. Some client systems are still waiting to be moved.

The changes were necessary as I might travel a bit in 2017 and it won't be possible to fix my own server at home when a hardware failure occurs. I admit that running my own server hardware was one of my stupidest decisions.

Besides these changes:

  • infdot.com now redirects to rlaanemets.com. I am not maintaining a separate company homepage anymore. This gives me more free time for the important things.
  • Rolled out SSL to every site/app of mine where I enter passwords. All the new certs are from Let's Encrypt and are renewed automatically.
  • I am now monitoring my top priority web servers through UptimeRobot.
  • The blog frontend is monitored by Sentry.

Other things

The apartment building's full-scale renovation was finally accepted by the other owners and the contract has been signed with the building company. Construction starts ASAP. I have been looking for places to rent a quiet office space, as the construction noise will likely make working in the home office impossible.

Yearly summary and plans

2016 was an incredibly busy and frustrating year for me. A project at the beginning of the year was left partially unpaid after it turned out to be financially unsuccessful for the client. The project did not have a solid contract, and legal action against the client would have been very difficult. This put me into a tight situation where I took on more work than I could handle to compensate for my financial situation. As the work accumulated:

  • I was not able to keep up with some projects. Deadlines slipped.
  • I was not able to accept better and higher-paying work due to the existing work.
  • Increasing workload caused health issues: arm pains, insomnia.

At the end of the year I had to drop some projects, as there was no other way to decrease the workload. The last 2 weeks were finally pretty OK.

In 2017 I want to avoid such situations. Financially I'm already in a much better position. I will be requiring a bit stricter contracts from my clients and selecting projects more carefully.

As for technology, I do not see 2017 bringing many changes. My preferred development platforms are still JavaScript (browsers, Node.js, Electron, PhantomJS) and SWI-Prolog.

December 28, 2016

Anton Arhipov: Twitterfeed #3

Welcome to the third issue of my Twitterfeed. In the two weeks since the last post I've accumulated a good share of links to news and blog posts, so it is a good time to "flush the buffer".


Let's start with something more fundamental than just the news about frameworks and programming languages. "A tale of four memory caches" is a nice explanation of how browser caching works. Awesome read, nice visuals, useful takeaways. Go read it!

Machine Learning seems to be getting more and more popular. So here's a nicely structured knowledge base for your convenience: "Top-down learning path: Machine Learning for Software Engineers".

Next, let's see what's new in all the reactive buzz. The trend is highly popular, so I've collected a few links to blog posts about RxJava and related topics.

First, "RxJava for easy concurrency and backpressure" is my own writeup about the beauty of the RxJava for a complex problem like backpressure combined with concurrent task scheduling.

Dávid Karnok published benchmark results for the different reactive libraries.

"Refactoring to Reactive - Anatomy of a JDBC migration" explains how reactive approach can be introduced incrementally into the legacy applications.

The reactive approach is also suitable for the Internet of Things area. So here's an article about Vert.x being used in the IoT world.

IoT is actually not only about the devices but also about the cloud. Arun Gupta published a nice write up about using the AWS IoT Button with AWS Lambda and Couchbase. Looks pretty cool!

Now onto the news related to my favourite programming tool, IntelliJ IDEA!

IntelliJ IDEA 2017.1 EAP has started! Nice, but I'm not amused. Who needs those emojis anyway?! I hope IDEA developers will find something more useful in the bug tracker to fix and improve.

Andrey Cheptsov experiments with code folding in IntelliJ IDEA. The Advanced Expressions Folding plugin is available for download - give it a try!

Claus Ibsen announced that the work has started on Apache Camel IntelliJ plugin.

Since we are at the news about IntelliJ IDEA, I think it makes sense to see what's up with Kotlin as well. Kotlin 1.0.6 has been released - a new bugfix and tooling update. It seems Kotlin is gaining popularity, and people are trying it in conjunction with popular frameworks like Spring Boot and Vaadin.

Looks like too many links already so I'll stop here. I should start posting those more often :)

December 22, 2016

Raivo Laanemets: Announcement: Dataline chart library

Some time ago I built a small library to draw some line charts using the HTML5 canvas. I have been using it in some projects requiring simple responsive line charts. It can do this:

  • Draws min/max/zero line.
  • Draws min/max labels.
  • Single line.
  • Width-responsive.

Multiple lines, ticks, x-axis labels etc. are not supported. There are other libraries that support all of these. It has no dependencies but requires ES5, canvas and requestAnimationFrame support. The library is extremely lightweight and uses very few resources.

Example

This is the HTML code containing the canvas and input data. The data is embedded directly by using the data-values attribute:

<canvas class="chart" id="chart"
  data-values="1,2,3,-1,-3,0,1,2"></canvas>        
<script src="dataline.js"></script>
<script>Dataline.draw('chart');</script>

And the CSS code to set the chart size:

.chart { width: 100%; height: 200px; }

Live rendering output: a live demo with the same markup (loading dataline.min.js and calling Dataline.draw('chart')) is embedded in the original post.

The source code of the library, documentation, and the installation instructions can be found in the project repository.

December 21, 2016

Kuido tehnokajam: Getting rid of the checkbox when grouping a CheckBoxList

With the ASP.NET CheckBoxList component you sometimes need to group the list for various reasons. The whole trick is based on using CSS3. Since it could cause confusion within the project, it makes sense to use "inline CSS", for example inside the user control that needs it:

<style>
    #CheckBoxListOtsinguMajad input:disabled {
        display: none;
    }
</style>

This limits the effect of the rule to CheckBoxListOtsinguMajad.

December 17, 2016

Raivo Laanemets: Announcement: DOM-EEE

DOM-EEE is a library to extract structured JSON data from DOM trees. The EEE part in the name means Extraction Expression Evaluator. The library takes a specification in the form of a JSON document containing CSS selectors and extracts data from the page DOM tree. The output is also a JSON document.

I started developing the library while dealing with many web scraping projects. There have been huge differences in navigation logic, page fetch strategies, automatic proxying, and runtimes (Node.js, PhantomJS, browser userscripts), but the data extraction code has been similar. I tried to capture these similarities in this library while making it work in the following environments:

  • Browsers (including userscripts)
  • PhantomJS
  • Cheerio (Node.js)
  • jsdom (Node.js)
  • ES5 and ES6 runtimes

The library is a single file that is easy to inject into any of these environments. As the extraction expressions are kept in JSON format, and the output is a JSON document, any programming platform supporting JSON and HTTP can be coupled to PhantomJS, a headless web browser with a built-in server, to drive the scraping process.

Example usage

This example uses cheerio, a jQuery implementation for Node.js:

var cheerio = require('cheerio');
var eee = require('eee');
var html = '<ul><li>item1</li><li>item2 <span>with span</span></li></ul>';
var $ = cheerio.load(html);
var result = eee($.root(),
    {
        items: {
            selector: 'li',
            type: 'collection',
            extract: { text: { selector: ':self' } },
            filter: { exists: 'span' }
        }
    },
    { env: 'cheerio', cheerio: $ });
console.log(result);

This code will print:

{ items: [ { text: 'item2 with span' } ] }
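The same extraction can also be driven from PhantomJS, as mentioned above. Here is a rough sketch of how the library might be injected into a page; the file name dom-eee.js and the browser-side entry point eee(document, spec, { env: 'browser' }) are my assumptions, mirroring the cheerio example, so check the project repository for the exact API:

var page = require('webpage').create();

page.open('http://example.com/', function (status) {
  if (status !== 'success') {
    console.error('Failed to load the page');
    phantom.exit(1);
    return;
  }

  // Make the single-file library available inside the page context.
  page.injectJs('dom-eee.js'); // assumed file name of the single-file build

  var result = page.evaluate(function (spec) {
    // Assumed browser entry point, mirroring the cheerio example above.
    return eee(document, spec, { env: 'browser' });
  }, {
    items: {
      selector: 'li',
      type: 'collection',
      extract: { text: { selector: ':self' } }
    }
  });

  console.log(JSON.stringify(result)); // plain JSON, easy to hand over HTTP to any platform
  phantom.exit();
});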

Alternatives

There are a number of similar projects. Most of them assume a specific runtime environment or try to do too much to be portable. Some examples:

  • artoo.js (client side).
  • noodle (Node.js, not portable enough).
  • x-ray (not portable, coupled with HTTP and pagination and 100 other things).

Documentation

Full documentation of the JSON-based expression language and further examples can be found in the project's code repository.

December 09, 2016

Raivo Laanemets: Hello world from DXF

Last week I worked on code to convert SVG to DXF. SVG (Scalable Vector Graphics) is a vector graphics format supported by most browsers. DXF (Drawing Exchange Format) is another vector format, mostly used by CAD applications. Our CAD editor at Scale Laser, a startup focusing on model railroad builders, is an SVG-based drawing editor running in the browser, but the software controlling the actual cutting hardware uses DXF. There were no usable generic SVG to DXF converters available, so we had to write our own. We only deal with SVG <path> elements and do not have to support other SVG elements.

DXF is fairly well specified in a 270-page PDF file here. The low-level data serialization format feels ancient compared to more structured XML and JSON. Also, it is quite hard to put together a minimal DXF file that can be opened by most programs claiming DXF compatibility, or by AutoCAD itself. AutoCAD is the original program to use the DXF format.

I have put together a minimal file by trial-and-error. I kept adding stuff until I got the file loading in AutoCAD. The file follows and I explain the parts of it.

Sections

A DXF file consists of sections. The most important section is ENTITIES, which contains graphical objects. Another important section is HEADER:

  0
SECTION
  2
HEADER
  9
$ACADVER
  1
AC1009
  0
ENDSEC

All sections are made up of group code-value pairs. Such a pair is formatted like:

  <code>
<value>

The group code specifies either the type of the value (string, float, etc), its semantic meaning (X coordinate), or both. A section begins with the SECTION keyword and the section's name. A section ends with the ENDSEC keyword.

I found it was necessary to specify the file/AutoCAD version in the header. Without it, some tools, including AutoCAD, would give errors upon opening the file. This is accomplished by two code-value pairs:

  9
$ACADVER
  1
AC1009

This corresponds to the versions R11 and R12.

Lines

After the header comes the actual content section ENTITIES. It contains a rectangle made up of 4 lines (snippet truncated to show a single line only):

  0
SECTION
  2
ENTITIES
  0
LINE
  8
0
  62
8
  10
169.50
  20
42.33
  11
169.50
  21
94.19
...
  0
ENDSEC
  0

Graphical objects are specified one after another, without any further structure. A line starts with the LINE keyword and ends with the start of another object or with the section end. The line object here has the following properties.

The layer index (group code 8, value 0). I was not able to make the file display in most viewers without it:

  8
0

The line color (group code 62, value 8 - gray). Nothing was visible in some viewers without setting it:

  62
8

After that come the start and end coordinates of the line (X1, Y1, X2, Y2 as 10, 20, 11, 21 respectively):

  10
169.50
  20
42.33
  11
169.50
  21
94.19

DXF coordinates have no units such as pixels, mm etc. Interpretation of units seems to be implicit and application-specific. For example, our laser software assumes mm as the unit.
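To tie the pieces together, here is a rough JavaScript sketch of how such group code-value pairs could be emitted for the minimal file above. The helper names are illustrative; this is not the converter we actually wrote:

function pair(code, value) {
  // Each pair is two lines: the group code (indented) and its value.
  return '  ' + code + '\n' + value + '\n';
}

function line(x1, y1, x2, y2) {
  return pair(0, 'LINE') +
    pair(8, 0) +                                          // layer index
    pair(62, 8) +                                         // color 8 (gray)
    pair(10, x1.toFixed(2)) + pair(20, y1.toFixed(2)) +   // start point
    pair(11, x2.toFixed(2)) + pair(21, y2.toFixed(2));    // end point
}

function dxf(segments) {
  return pair(0, 'SECTION') + pair(2, 'HEADER') +
    pair(9, '$ACADVER') + pair(1, 'AC1009') +             // file version R11/R12
    pair(0, 'ENDSEC') +
    pair(0, 'SECTION') + pair(2, 'ENTITIES') +
    segments.map(function (s) { return line.apply(null, s); }).join('') +
    pair(0, 'ENDSEC') +
    pair(0, 'EOF');                                       // end-of-file marker
}

// A rectangle as four line segments: [x1, y1, x2, y2]
console.log(dxf([
  [0, 0, 60, 0],
  [60, 0, 60, 40],
  [60, 40, 0, 40],
  [0, 40, 0, 0]
]));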

Rendering output

This is the rendering output in de-caff, a simple Java-based DXF viewer:

Minimal DXF rectangle in de-caff

This is the rendering output in AutoCAD 2017:

Minimal DXF rectangle in AutoCAD

The full file containing the rectangle and the header section can be downloaded from here.

December 06, 2016

TransferWise Tech Blog: Product Engineering Principles @ TransferWise


At TransferWise, we have seen phenomenal growth in the last few years - growth in users, transaction volumes, and team. Our engineering team has grown from 50 to 120 in the last 18 months. With this kind of growth, it’s easy for the company culture to evolve and degrade very quickly unless we reiterate and stay mindful of the key principles and values we operate on.

I am often asked by engineers at conferences, potential hires, startup founders and others in the industry about the key principles we organize around while building the TransferWise product. I thought I'd pen down some thoughts on this.

Before we hit the main principles we follow, here’s a quick primer on how our teams are organized. As of today, TransferWise has about 25 teams, each of which is autonomous and independent and focuses on customer-centric KPIs which eventually drive our growth. A mouthful, but let’s break this down.

Teams are autonomous: Teams own their decisions. They decide how to evolve the product by talking to customers and by looking at data. The teams seek input and are challenged by others in the company on their decisions but eventually they are the final decision makers.

Teams are independent: Teams own their destiny. We try to keep cross team dependencies to a minimum. While there are some cases where a team may need help from another team to get something done, we try and avoid this as much as possible.

Teams focus on customer-centric KPIs: Teams solve customer problems. They are organized around a specific customer (team organized around a specific region, say the US) or a specific problem all customers face (making payments instant across the world). Given teams are autonomous and independent, they can pick what they want to work on but everything they work on has to drive a customer metric. Our product moves money around the world. Our customers tell us constantly that they care about their money moving super fast, conveniently, for cheap. So everything a team does looks to optimize these factors. The team should be able to stand up in front of the entire company and explain how their work impacts the customer and which metric it moves.

Now that we’ve got the team setup out of the way, let’s talk about how Product Engineering works at TransferWise. Here are the key principles we follow.

Hire smart people and empower them

Our product is a function of the people who build it. That means how we hire and who we hire has a massive impact on what ends up becoming our live product. For product engineering, our hiring process includes the following steps:

  • Take home test
  • Technical interview with 2 engineers
  • Optional follow-up technical interview
  • Product interview with a product manager and an engineer
  • Final interview with our VP of engineering or co-founder

While this interview loop may seem familiar, most candidates comment about the product interview being a unique experience. In the product interview, we focus on your ability to put yourself in the shoes of a customer, understand what customer problem you are solving, how you would build something to get validation on an idea and then iterate on it to deliver a stellar customer experience. Read Mihkel’s post on demystifying product interviews to get more details on what we cover and why.

Once hired, product engineers are empowered to move mountains. Engineers choose which problem to solve, why, what the customer impact will be, and how to prioritize their tasks. Of course, this should be in line with team goals and not solely based on individual goals.

Weak code ownership

As mentioned above, we believe in teams being independent. A big part of this is that teams don’t have dependencies on other teams. But how does this work in a large organization with an ever evolving and growing product?

Let’s take an example. As our product expands across the world, every country has different rules on what data we are required to verify on our customers. Let’s say we are launching in Australia, and there is a new regulatory requirement to check some additional data on Australian customers. This requires Team Australia - an autonomous and independent team focused on the Australian customers - to make a change to our verification engine. But the verification engine is owned by the Verification team. In a lot of organizations, Team Australia would request the Verification team to pick up this change on their roadmap. But the Verification team also has a lot of such requests from other teams. They also have their own projects to improve the core verification engine to support all our different regions. So what usually ends up happening in other organizations is that Team Australia can’t move as fast as they desire, as they are dependent on the Verification team and their priorities.

This is why we follow the weak code ownership principle. In this setup, every part of the code is owned by a team, but a team is allowed to change any part of another team's code. Sounds like chaos, but there are some basic enforcement rules around this. The owning team sets the rules that other teams have to follow to play in their codebase.

In the above example, instead of the Verification team making the requested change, Team Australia is empowered to make the change in the verification codebase. But they have to follow the rules set by the Verification team to commit to their code base. These rules are up to the owning team to decide on. They could be something like below:

  • Before taking on any major change, the team making the change must go through a design discussion on said changes with the owning team.
  • The team making the change has to follow certain design patterns
  • No code will be accepted without adequate test coverage
  • All changes have to go through peer-reviewed pull requests

This setup keeps product engineering teams independent, helps remove dependencies between teams, and allows them to iterate at the pace they desire.

We compare this setup to an internal open source project. Owners define the rules to play by and own the codebase, and others can commit and make changes as long as they follow the rules. As an additional benefit of this setup, owning teams are incentivized to make their code more modular and add relevant tests so that another team cannot easily break things. This leads to better code readability and higher quality.

Product engineers focus on customer impact

In a lot of companies engineers never talk to real customers. Some engineers we talk to during our interview process don’t really know who they are building the product for.

In a lot of companies, information flows from the customer to the engineer through a long chain of intermediaries.

Information is lost with every person introduced along the way.

At TransferWise, being a product engineer means you get close to customers. You talk directly to customers, look at data to make decisions and understand how the features you build are evolving and being used by different customers in production. We use Looker and Mixpanel as our analytics engine and this is available to everyone in the company. Anyone can run queries and slice and dice the data the way they desire.

Product engineers also take customer calls, chats and respond directly to customer emails. Here’s an example of a challenge our co-founder Kristo set out to inspire engineers to take more calls and get closer to our customers.


The resulting picture speaks for itself. :-)


No one else can build your product but you

Given how involved engineers are in analyzing the data, talking to customers, understanding the reason to make a change, and how fast our iteration cycles are, we believe that we cannot just write down our specifications and have someone outside our company build the product. We don’t do offshoring, outsourcing or use consultants to build our product. This doesn’t mean we don’t have independent workers (i.e. non-salaried employees who work at TransferWise engineering). We do. Some of them have been with us for a long time and are critical contributors. But they are embedded within our teams and operate the same way any other employee does. They get close to our customers, take part in all decisions.

Some rules, more common-sense

We have a few rules that are standard across the entire product engineering organization. We believe teams should be able to pick the tools to get the job done, within certain limits (more on the limits below). All our teams run sprints, but it’s up to them to define their sprint cadence. It just so happens that most teams run a one-week sprint, but now we are seeing some teams looking to move to two-week sprints as their projects get more complex. Similarly, some teams follow scrum to the book, while some do kanban and others run their own variation on scrum.

That said, we have a few common sense rules:

  • Product engineers own their code in production. This means managing your own releases, monitoring your code in production, getting alerts when something goes wrong and being present if your system is having an issue. We believe this incentivizes the right behavior. When you know you will be alerted at 2AM when something goes wrong in production, the quality of code that gets shipped tends to be better.
  • We have weekly technical exchange sessions called “TeX”. It’s a forum where product engineers share knowledge on various technical topics. These can range from design discussions and changes made to a specific part of our system to new technologies we should be investigating.
  • We are a JVM shop. We are open to other languages - we have some PHP and Node running around - but our main stack has always been the JVM, with our monolith application written in Groovy on Grails and our microservices written in Java 8 on Spring Boot. We believe language wars are good conversations over beers but try to avoid them at work and get on with building product.
  • If you want to introduce a new language or that shiny new technology to our system, it’s simple! Do a TeX and invite your fellow engineers. Explain to them the specific benefits of introducing this technology. Do an honest pro and con analysis and explain why it’s worth it for the rest of the engineers to go through the learning curve to pick this technology up. This is crucial! As we have weak code ownership, people need to be able to make changes to parts of the system they don’t own. So new technologies introduced not only impact the specific team owning the service but also other engineering teams.

Honest blameless postmortems

This one is probably our favorite principle. Everyone makes mistakes and when you move fast, things break. The key is how we recover from these mistakes, what we learn and how we prevent them in the future.

In most companies, an individual isn’t really incentivized to ship fast and learn, for fear of breaking things. One is rarely rewarded for taking massive risks to improve something tenfold, as the risk of breaking something is much higher. People tend to get dinged on their annual reviews when they break something leading to a production outage. So what ends up happening is that people triple-check their work and put in more and more safeguards for that one possible failure that could happen.

We want to build a culture where we aren’t afraid to fail, but are accountable for our failures and make the cost of a failure smaller and smaller.

One major tool we use to reflect, learn and be accountable for our failures is public honest blameless postmortems.

Vincent wrote a great post on what makes a good postmortem. The intent of the postmortem is to go deep into why something happened, how many customers were impacted, what our learnings were and what measures we put in place to make sure it doesn’t happen again. People challenge postmortems publicly if a postmortem isn’t deep enough or doesn’t have real learnings.
Culturally this is one of the most empowering and powerful tools we have in the company. We started this in product engineering, but it has evolved to the point where we do public postmortems across the company on most teams.

Challenges

Like any model of organizing, this model has challenges too. Below are a few challenges we have learned along the way:

  • Duplication of effort: With autonomous independent teams, we can have some overlap in work done by different teams. We try and counter this by having a few people in the organization who spend time across teams and have a view on what different teams are building. This would include engineering leads who spend time with different teams and get an understanding of successes and challenges each team has. So when a team starts building a new service with similarities to another service being worked on by another team, we try to consolidate effort and get both teams on the same page to hopefully not duplicate effort.
  • Collective decision making: Sometimes it’s just hard to get the whole team to align on a decision, taking varied opinions into consideration. We counter some of this by running small teams, so there are fewer people who need to get on the same page. Also, when teams get stuck they seek out help from others in the organization who have been in a similar situation before or could help them break a gridlock.
  • Singularity in vision: Given we have devolved decision making to teams, there’s no one person who calls all the shots. We have a company mission but teams can decide their own way to achieve the mission. This can be especially unnerving to some folks given they can't just go over to one person and ask for direction or say "I am doing this as the CEO wants it."
  • Communication: With teams being independent and working on their specific problems, we run the spectrum from teams that over-communicate, to make sure others know what they are working on, to those who under-communicate. TransferWise runs primarily on Slack. We have specific channels for sharing things cross-team. We also have horizontal virtual teams called guilds where engineers get together to work on a problem that cuts across the organization. For example, we have a performance guild which has representatives from different teams. This is a group of individuals who are interested in building tooling, standards, and practices to help all our teams improve the performance of our systems. They focus on building the required monitoring and alerting for everyone to use. That said, we are still learning how to improve communication across teams as our organization grows.

Why do we operate this way?

As a startup we have a major weapon - speed! When people closest to the customers are making the decisions, we can run more tests and iterate quicker compared to a setup where teams rely on managers to make decisions. In the latter, managers become bottlenecks, slowing down decision making. Additionally, they usually aren’t as close to the day-to-day operations and the customer feedback loop to make an informed decision. In our setup, we believe we get more fail and pass signals faster.

We fundamentally believe companies that iterate faster, fail faster and learn faster will succeed in the long run. That means to learn faster than others we need to optimize for speed with checks for building a high-quality customer experience that our customers love. This is the main reason for our setup.

We realize that running this way has its drawbacks as listed above but we believe we can take these drawbacks and solve for them while we optimize for speed.

This is, of course, something that has worked for us so far and we will have more learnings as our product and company evolves. We will share those learnings as we go along. We would love to hear your thoughts on what you optimize for, how you do it, and scenarios where you think the above setup doesn’t work.

Thanks to Harald, Jordan, Martin Magar, Taras, Vincent for their input.

December 05, 2016

Anton Arhipov: Twitterfeed #2

So this is the second issue of my Twitterfeed, the news that I noticed on Twitter. Much more sophisticated compared to the first post, but still no structure and no definite periodicity.


Articles:


Java Annotated Monthly - December 2016. Nice collection of articles about Java 9, Java 8, libraries and frameworks, etc. With this, my Twitterfeed is now officially meta! 😃

RebelLabs published a Java Generics cheat sheet. Print it out and put it on the wall in your office!

Server side rendering with Spring and React. Interesting approach to UI rendering with React. Some parts of the UI are rendered on the server side, and some data is then rendered on the client side.

One year as a Developer Advocate. Vlad Mihalcea reflects on his achievements from the first year in the role of a Developer Advocate for Hibernate. Well done!

IDEA 2016.2 Icon Pack. The IDEA 2016.3 update came with new icons and some people don’t really like them. There is now a plugin to replace the new icons with the old ones. Enjoy!

Oh, and talking about IntelliJ IDEA, there is another great blog post related to the 2016.3 release. Alasdair Nottingham writes about Liberty loose applications support in IDEA: Faster application development with IntelliJ IDEA 2016.3

Reactive programming vs Reactive systems. Jonas Boner and Viktor Klang make it clear what the difference between the two is. "Messages have a clear (single) destination, while events are facts for others to observe".

Good Programmers Write Bug-Free Code, Don’t They? Yegor Bugayenko has a good point about the relation of good programming to bug-free code.

Cyclops Java by Example: N-Queens. A coding kata for the N-Queens problem using "cyclop's for-comprehensions".

Zero downtime deployment with the database. The name says it all.

RxJava 2.0 - an interview with David Karnok about the major release. Here comes support for the Reactive Streams specification!

Reactor by Example. Reactor is very similar to RxJava, but it is also at the core of Spring Framework 5.0’s reactive programming model.

An explanation of the different types of performance testing. I think it is quite important to understand the differences.

Videos:


Spec-ulation by Rich Hickey. As usual, must watch!

Microservices evolution: how to break your monolithic database. Microservices are becoming mainstream, it seems. So we need best practices for building microservices-based systems.

November 29, 2016

TransferWise Tech Blog: Why Over-Reusing is Bad

One of the Holy Grails of software development has always been reuse. In the following post I want to focus on reuse of application logic in its different forms. I will not cover reuse on more abstract levels like ideas or patterns. This will be a two-part series where I explore the topic from both the backend and the frontend development perspective.

I believe in the following idea:

It does not matter how fast you can build the first version of your product. It only matters how fast you can change it later.

On the backend the predominant paradigm right now is microservices. I believe that one of the reasons why this approach is so successful is that it gets reuse quite right. Microservices are really good blocks for reuse as they align very well with the idea of splitting a big system into multiple smaller bounded contexts. As per the concept of bounded contexts, reusing any part of the internal logic between different microservices is considered an anti-pattern.

Changing Shared Code is Inherently Hard

But what is so bad about sharing code? Isn't it something good that we should always strive for? Before answering, let's take a step back. Why do we want to reuse stuff? Because we want to be able to change fast. Now here comes the paradox — by reusing code that has already been written we are able to save some coding time, but everything that is shared inherently becomes harder to change. This is so because once our reusable thingy is out there, we need to keep all the consumers in mind when changing it. As the number of consumers grows, it becomes harder and harder to juggle the different requirements and nuances of each context. Essentially the risk of reusing the wrong abstraction grows over time. It is just so easy to introduce these tiny additional parameters that enable reusing maybe not all but perhaps something like 80% of the original logic, or 60%, or 40%.

The Knowledge Cap

The knowledge cap is another thing to keep in mind. As software development is about building knowledge, any piece that is built by someone else means a potential cap in our team's knowledge. This happens even when this someone else is another team in the same organisation. Often this loss of knowledge is quite OK and totally fine - we don't want every team to re-implement their own versions of AND/OR gates. However, ideally all the assets that are at the core of what the team is doing should be developed by the team itself.

Frequency of Change

In general we can say that reuse makes more sense for peripheral/infrastructure things like accessing a database or doing HTTP calls. However, if some part of our infrastructure code needs to be changed very frequently, then it might still make sense to roll our own technical solution. Ideally, a high frequency of change means that the code is somehow tied to the unique value proposition of our product and the extra implementation effort makes sense anyway. So frequency of change should be at least as important a factor (if not more so) in deciding whether to reuse or build ourselves.

Clear Boundaries

In case we need to reuse something, the best thing we can do is to make the boundaries between our code and the reused code as clear as possible. This is the reason why microservices offer a superior form of reuse compared to components running in the same process. It requires much more discipline to keep the connection points few and explicit when something resides in the same process, as opposed to something that lives on the other side of the network.

So reuse by itself is not bad. However, reuse on the wrong level of granularity or forcefully trying to reuse everything that looks similar at first sight can be much more harmful than duplication. Reuse has a cost as well.

November 28, 2016

TransferWise Tech Blog: The TransferWise Stack - heartbeat of our little revolution


Like any tech startup that's passed its five-year mark, TransferWise has come quite a way from the first lines of code that powered it. Our product is a living organism, with changing needs. What was right yesterday isn't necessarily so today. Nor might our current technology choices withstand the test of time. We hope they do - but maybe they don't. And it's okay.

At conferences and meetups people often walk up to our engineers to ask what languages, tools and technologies we're using. Fair enough, as we haven't done a stellar job of telling our tech-savvier customers and fellow developers much about that. Hopefully we can rectify that a bit by taking time now to reflect in writing.

We'd love to hear back from you about the decisions you would have made differently.

Brief history

Once upon a time, in 2010, there was a founder who wanted to solve a problem. He knew a bit of Java and wanted to get moving quickly. At that time, Groovy and Grails seemed to have brought some of the Ruby on Rails flair to the JVM world. Boom, here's a quick way to bootstrap! By the end of 2013, about a dozen engineers were working on the codebase.

In early 2014, the existing Grails-powered system wasn't cutting it anymore for some workloads. It had been quick and easy to deliver new features but the team had made some system design shortcuts on the way. The time had come to extract some critical batch processing into a separate component. We've been following the path of moving code out ever since.

By late 2016, TransferWise has about 120 engineers working in two dozen teams. Measured by lines of code, more than half our business logic lives in separate services. We're looking at ways to scale the system across data centers and coping with the next 80 engineers joining during 2017. Let's get into the details of how we intend to enable these moves.

Microservices - Spring Boot and Spring Cloud

A few contenders made it to the starting line when we were picking the groundwork for our service stack. We were sure to stay on the JVM. We wanted something that would promise good support for future "cloud" concerns. These included service discovery, centralized config and transaction tracing.

Many people in the team trust the quality of thinking going into the Spring ecosystem, so Spring Boot quickly gained popularity. It provides a good, extensible platform for building and integrating services. We like its annotation-based autowiring of dependencies, YAML configuration files and reasonable conventions. We use a custom Spring Initializr to bootstrap new services. That helps us to make sure all the needed bootstrap config is in place and nobody gets a headache trying to manually bring in the right dependencies. It's all running on Java 8.

Spring Cloud adds integration for many useful tools in a service-oriented environment. Some are Spring projects, like the Config Server. Some leverage battle tested open source components like Netflix Eureka, Zuul and Hystrix.

We are actively adopting Config Server and Eureka, and use Hystrix and Zuul in appropriate places.

Grails and Groovy

Grails and Groovy currently power all our business logic that's not extracted into microservices. That makes up a bit under half of our lines of code. We've taken a clear direction to deprecate this codebase by end of next year.

When Groovy came along, it brought with it a chance to write nice, succinct code and leverage the whole Java ecosystem. It continues to be a good language for DSLs, for instance. Groovy used to have more benefits over Java, like functional constructs, easy collection processing and other boilerplate-reducing tidbits. Java gives us better compile-time checking and less runtime overhead. Hence we've gone with Java in our microservices.

Nor is Grails to blame for anything. It allowed the early team to quickly iterate on a product in its infancy. The convenience features of Grails worked against it over the years. Grails hides complexity away from the developer, which makes it easier to shoot oneself in the foot when trying to deliver value. Had we decided to focus on scalability of the codebase sooner, we would have been able to postpone the migration by another year or so. Yet, in our opinion, Grails makes sense as a platform for a single moderately-sized web app - rather than for dozens of microservices. This made the transition inevitable in any case.

It's worth noting that the latest Grails version, 3.x, is also built on top of Spring Boot. As we're quite happy with plain Spring Boot, we are not currently considering it.

Persistence layer - MySQL, PostgreSQL

We're a financial services company, which instructs us to always prioritise consistency over availability. BASE is fun and eventual consistency sounds like a cool topic to wrap one's head around, but we want our persistence to meet the ACID criteria - Atomic, Consistent, Isolated and Durable.

TransferWise started with MySQL. The widespread adoption, ease of setup and loads of people with some basic experience made it a good choice. With growth, more questions have come up about our DB engine, like:

  • does it support our analytical workloads?
  • is it easy to build High Availability?

MySQL still holds most of our data in production. Yet, migrating our analytical datastore to PostgreSQL is already underway, as our tests show it to be a better fit for our models and tooling. Our financial crime team relies on many nifty PostgreSQL features. We foresee Postgres also being the platform of choice for our future service datastores, driven by the mature availability and resilience features it offers out of the box. We like Postgres.

It's likely that we'll be adopting NoSQL for some use cases down the road, where referential integrity doesn't add value.

Messaging & Kafka

A big part of our business logic swirls around the state changes of a transfer. There are many different things that need to happen around the changes - like fraud checks and analytics updates. Earlier, these were all done synchronously together with the state change.

As the business has grown in complexity, that obvious and simple approach doesn't scale so well. A good rule of thumb in designing data flows is that we can make every signal that doesn't need an immediate response asynchronous. If you can take a document to another department and not wait around for them to process it, you can process it asynchronously in the code as well. We want to unbundle the reactions to an event from the transactional changes of an event. That makes it easier to scale the different parts and isolate potential problems.

In the first iteration of our payout handling, we experimented with Artemis as a messaging platform. We didn't become confident about running Artemis in a HA setup in production, and now most of our messaging has moved to Apache Kafka.

The front end

That's all fine, but Kafka doesn't let you create smooth user experiences!

In 2014 TransferWise web UIs still used good old jQuery. We were not happy with the testability of that codebase. We also knew we'd need to modularize in a while due to team growth. In September that year, we launched our revamped transfer setup page built using AngularJS. We've now adopted Angular for almost all parts of our website.

We use Bower and NPM for dependency management. Grunt automates the builds. Jasmine and Protractor power the tests, and Karma + PhantomJS run them. There's also webpack and Babel doing their part. Some parts of the code already use TypeScript. Whew.

Front-end, of course, isn't only code. Practices matter more than using the latest-and-greatest tools.

The teams in TransferWise are quite independent in how they evolve the product in their areas of ownership. This has, at times, meant trouble governing the visual identity and making sure things look neat, clean and consistent across the site.

To help with that we've taken a battle tested approach of implementing our style guide on top of Bootstrap. And, for the record, we think that LESS is more.

We believe in TDD, which is particularly important in places with less compile-time safety - like JavaScript. Ürgo has blogged before about some of the thinking we apply. There's more to share. Stay tuned for future posts covering different aspects of our experience in detail.

Running in production

So, how do we keep it together? How do we make sure our systems are serving our customers well?

We want to keep our feedback loops short and tight, to keep signal-to-noise ratio high. It means that people building the systems must also operate them. If you own a part of the codebase, you also own the service it provides. This means being responsible for functional and non-functional characteristics of a feature. Ownership applies from feature creation to deprecation. To enable that, we need a shared toolkit for troubleshooting, monitoring and alerting.

Infrastructure monitoring and some operational metrics run on Zabbix. We are adopting Graphite/Grafana to keep an eye on business operations aspects of the system.

We use New Relic to check the HTTP endpoints. It's not cheap but works well, thanks to its instrumentation capabilities. We've defined performance and error alerts for our higher volume endpoints. Respective teams get alerted via VictorOps rotations if performance degrades. New Relic also provides some tracing features to see the time spent on different steps of request processing.

When an exception happens in production, it gets sent to Rollbar. Rollbar groups and counts the occurrences and sends alerts for spikes or new items. In general it allows us to spot glitches in the system and estimate how many customers they affect.

To analyze a specific problem or identify patterns in logs, we use the Elastic stack. The stack consists of LogStash, ElasticSearch and Kibana. They are, respectively, used to parse, index and search/visualize logs.

Slack helps us to channel alerts and comms into one place.

There's a video and slides covering some of this from Baltic DevOps 2016 where we were honored to speak.

Vincent has written a great post on our way of doing post-mortems. We use them to learn from customer-affecting production issues.

Bits & pieces

Vahur wrapped up some of the other tooling that we use.

PHP - There are a few tools in the company built in PHP, some legacy, some purposefully kept in PHP.

Spark - While being a pretty buzzword-averse bunch of engineers, we do like to adopt the right tech for the right task. In our case the prime problem domain to benefit from machine learning has been our financial crime prevention subsystem. We're building up machine learning models on Apache Spark.

Ansible & Terraform - Our infrastructure is growing and humans make mistakes. That makes infrastructure a prime automation target. We're adopting Terraform for declarative infrastructure management. We use Ansible to configure the instances.

Context

We build TransferWise for our current and future customers. They don't care much about which libraries, frameworks and components we bring in to move their money. It's our job as engineers to pick the tools and practices that allow us to deliver value quickly and sustainably.

Often we've chosen the simplest tool for getting the job done, instead of the most powerful. At other times we've gone with a framework that's not the newest kid on the block but has wide production adoption.

We're all proper geeks. We love new tech. We love to play and learn, pilot and break stuff. We do so in our hobby projects and geek-out sessions over Coke and beer. When building for our customers, we optimize for their happiness over the chance of adopting the next left-pad 0.0.3 ;) And you're right - it's as boring as building the HyperLoop of financial networks for millions of people could ever be.

TransferWise stack on StackShare

Thanks to Peep, Vincent and Ürgo for their help.

November 23, 2016

Four Years Remaining: What is the Covariance Matrix?

Basic linear algebra, introductory statistics and some familiarity with core machine learning concepts (such as PCA and linear models) are the prerequisites of this post. Otherwise it will probably make no sense. An abridged version of this text is also posted on Quora.

Most textbooks on statistics cover covariance right in their first chapters. It is defined as a useful "measure of dependency" between two random variables:

    \[\mathrm{cov}(X,Y) = E[(X - E[X])(Y - E[Y])].\]

The textbook would usually provide some intuition on why it is defined as it is, prove a couple of properties, such as bilinearity, define the covariance matrix for multiple variables as {\bf\Sigma}_{i,j} = \mathrm{cov}(X_i, X_j), and stop there. Later on, the covariance matrix would pop up here and there in seemingly random ways. In one place you would have to take its inverse, in another compute the eigenvectors, or multiply a vector by it, or do something else for no apparent reason apart from "that's the solution we came up with by solving an optimization task".

In reality, though, there are some very good and quite intuitive reasons for why the covariance matrix appears in various techniques in one or another way. This post aims to show that, illustrating some curious corners of linear algebra in the process.

Meet the normal distribution

The best way to truly understand the covariance matrix is to forget the textbook definitions completely and depart from a different point instead. Namely, from the definition of the multivariate Gaussian distribution:

We say that the vector \bf x has a normal (or Gaussian) distribution with mean \bf \mu and covariance \bf \Sigma if:

    \[\Pr({\bf x}) =|2\pi{\bf\Sigma}|^{-1/2} \exp\left(-\frac{1}{2}({\bf x} - {\bf\mu})^T{\bf\Sigma}^{-1}({\bf x} - {\bf \mu})\right).\]

To simplify the math a bit, we will limit ourselves to the centered distribution (i.e. {\bf\mu} = {\bf 0}) and refrain from writing out the normalizing constant |2\pi{\bf\Sigma}|^{-1/2}. Now, the definition of the (centered) multivariate Gaussian looks as follows:

    \[\Pr({\bf x}) \propto \exp\left(-0.5{\bf x}^T{\bf\Sigma}^{-1}{\bf x}\right).\]

Much simpler, isn't it? Finally, let us define the covariance matrix as nothing else but the parameter of the Gaussian distribution. That's it. You will see where it will lead us in a moment.

Transforming the symmetric Gaussian

Consider a symmetric Gaussian distribution, i.e. the one with {\bf \Sigma = \bf I} (the identity matrix). Let us take a sample from it, which will of course be a symmetric, round cloud of points:

We know from above that the likelihood of each point in this sample is

(1)   \[P({\bf x}) \propto \exp(-0.5 {\bf x}^T {\bf x}).\]

Now let us apply a linear transformation {\bf A} to the points, i.e. let {\bf y} ={\bf Ax}. Suppose that, for the sake of this example, {\bf A} scales the vertical axis by 0.5 and then rotates everything by 30 degrees. We will get the following new cloud of points {\bf y}:

What is the distribution of {\bf y}? Just substitute {\bf x}={\bf A}^{-1}{\bf y} into (1), to get:

(2)   \begin{align*} P({\bf y}) &\propto \exp(-0.5 ({\bf A}^{-1}{\bf y})^T({\bf A}^{-1}{\bf y}))\\ &=\exp(-0.5{\bf y}^T({\bf AA}^T)^{-1}{\bf y}) \end{align*}

This is exactly the Gaussian distribution with covariance {\bf \Sigma} = {\bf AA}^T. The logic works both ways: if we have a Gaussian distribution with covariance \bf \Sigma, we can regard it as a distribution which was obtained by transforming the symmetric Gaussian by some {\bf A}, and we are given {\bf AA}^T.

More generally, if we have any data, then, when we compute its covariance to be \bf\Sigma, we can say that if our data were Gaussian, then it could have been obtained from a symmetric cloud using some transformation \bf A, and we just estimated the matrix {\bf AA}^T, corresponding to this transformation.
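
To make this concrete, here is a minimal NumPy sketch (mine, not part of the original post): sample a symmetric Gaussian cloud, apply the transformation from the example above (scale the vertical axis by 0.5, then rotate by 30 degrees), and check that the empirical covariance of the result approximates {\bf AA}^T.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100_000, 2))             # symmetric Gaussian cloud, Sigma = I

theta = np.deg2rad(30)
S = np.diag([1.0, 0.5])                           # scale the vertical axis by 0.5
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # rotate by 30 degrees
A = R @ S
Y = X @ A.T                                       # y = A x, points stored row-wise

print(np.cov(Y, rowvar=False))                    # empirical covariance of the data...
print(A @ A.T)                                    # ...is close to A A^T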

Note that we do not know the actual \bf A, and that is mathematically perfectly fine. There can be many different transformations of the symmetric Gaussian which result in the same distribution shape. For example, if \bf A is just a rotation by some angle, the transformation does not affect the shape of the distribution at all. Correspondingly, {\bf AA}^T = {\bf I} for all rotation matrices. When we see a unit covariance matrix we really do not know whether it is the “originally symmetric” distribution or a “rotated symmetric distribution”. And we should not really care - those two are identical.

There is a theorem in linear algebra, which says that any symmetric matrix \bf \Sigma can be represented as:

(3)   \[{\bf \Sigma} = {\bf VDV}^T,\]

where {\bf V} is orthogonal (i.e. a rotation) and {\bf D} is diagonal (i.e. a coordinate-wise scaling). If we rewrite it slightly, we will get:

(4)   \[{\bf \Sigma} = ({\bf VD}^{1/2})({\bf VD}^{1/2})^T = {\bf AA}^T,\]

where {\bf A} = {\bf VD}^{1/2}. This, in simple words, means that any covariance matrix \bf \Sigma could have been the result of transforming the data using a coordinate-wise scaling {\bf D}^{1/2} followed by a rotation \bf V. Just like in our example with \bf x and \bf y above.
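
As a quick sanity check, here is a small sketch of (3)-(4) in NumPy (my own code; the Sigma values are the ones produced by the 0.5-scaling and 30-degree rotation above, computed by me):

import numpy as np

Sigma = np.array([[0.8125, 0.3248],
                  [0.3248, 0.4375]])   # approximately A A^T for the example above
d, V = np.linalg.eigh(Sigma)           # Sigma = V diag(d) V^T
A = V @ np.diag(np.sqrt(d))            # A = V D^{1/2}
print(np.allclose(A @ A.T, Sigma))     # True: this A could have produced the data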

Principal component analysis

Given the above intuition, PCA already becomes a very obvious technique. Suppose we are given some data. Let us assume (or “pretend”) it came from a normal distribution, and let us ask the following questions:

  1. What could have been the rotation \bf V and scaling {\bf D}^{1/2}, which produced our data from a symmetric cloud?
  2. What were the original, “symmetric-cloud” coordinates \bf x before this transformation was applied?
  3. Which original coordinates were scaled the most by \bf D and thus contribute most to the spread of the data now? Can we keep only those and throw the rest out?

All of those questions can be answered in a straightforward manner if we just decompose \bf \Sigma into \bf V and \bf D according to (3). But (3) is exactly the eigenvalue decomposition of \bf\Sigma. I’ll leave you to think for just a bit and you’ll see how this observation lets you derive everything there is about PCA and more.
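
Here is a hedged sketch of that recipe in NumPy (not the author's code; the transformation matrix is made up): estimate \bf\Sigma, eigendecompose it, and use \bf V and \bf D to answer the three questions above.

import numpy as np

rng = np.random.default_rng(1)
A_true = np.array([[3.0, 0.0, 0.0],
                   [0.5, 1.0, 0.0],
                   [0.2, 0.1, 0.1]])
X = rng.standard_normal((1000, 3)) @ A_true.T   # some transformed Gaussian cloud
X -= X.mean(axis=0)

Sigma = X.T @ X / len(X)                # covariance estimate
d, V = np.linalg.eigh(Sigma)            # question 1: the scaling D and the rotation V
order = np.argsort(d)[::-1]             # strongest directions first
d, V = d[order], V[:, order]

Z = (X @ V) / np.sqrt(d)                # question 2: the "symmetric-cloud" coordinates
X_pca = (X @ V)[:, :2]                  # question 3: keep the two most-stretched directions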

The metric tensor

Bear with me for just a bit more. One way to summarize the observations above is to say that we can (and should) regard {\bf\Sigma}^{-1} as a metric tensor. A metric tensor is just a fancy formal name for a matrix that summarizes the deformation of space. However, rather than claiming that it in some sense determines a particular transformation \bf A (which it does not, as we saw), we shall say that it affects the way we compute angles and distances in our transformed space.

Namely, let us redefine, for any two vectors \bf v and \bf w, their inner product as:

(5)   \[\langle {\bf v}, {\bf w}\rangle_{\Sigma^{-1}} = {\bf v}^T{\bf \Sigma}^{-1}{\bf w}.\]

To stay consistent we will also need to redefine the norm of any vector as

(6)   \[|{\bf v}|_{\Sigma^{-1}} = \sqrt{{\bf v}^T{\bf \Sigma}^{-1}{\bf v}},\]

and the distance between any two vectors as

(7)   \[|{\bf v}-{\bf w}|_{\Sigma^{-1}} = \sqrt{({\bf v}-{\bf w})^T{\bf \Sigma}^{-1}({\bf v}-{\bf w})}.\]

Those definitions now describe a kind of a “skewed world” of points. For example, a unit circle (a set of points with “skewed distance” 1 to the center) in this world might look as follows:

And here is an example of two vectors, which are considered “orthogonal”, a.k.a. “perpendicular” in this strange world:

Although it may look weird at first, note that the new inner product we defined is actually just the dot product of the “untransformed” originals of the vectors:

(8)   \[{\bf v}^T{\bf \Sigma}^{-1}{\bf w} = {\bf v}^T({\bf AA}^T)^{-1}{\bf w}=({\bf A}^{-1}{\bf v})^T({\bf A}^{-1}{\bf w}),\]

The following illustration might shed light on what is actually happening in this \Sigma-“skewed” world. Somehow “deep down inside”, the ellipse thinks of itself as a circle and the two vectors behave as if they were (2,2) and (-2,2).

Getting back to our example with the transformed points, we could now say that the point-cloud \bf y is actually a perfectly round and symmetric cloud “deep down inside”, it just happens to live in a skewed space. The deformation of this space is described by the tensor {\bf\Sigma}^{-1} (which is, as we know, equal to ({\bf AA}^T)^{-1}). The PCA now becomes a method for analyzing the deformation of space, how cool is that.
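
In code, the "skewed" inner product, norm and distance of (5)-(7) are exactly the Mahalanobis-style expressions below (a small sketch with illustrative numbers of my choosing; np.linalg.solve is used instead of forming the inverse explicitly):

import numpy as np

Sigma = np.array([[0.8125, 0.3248],
                  [0.3248, 0.4375]])
v = np.array([1.0, 0.0])
w = np.array([0.0, 1.0])

inner = v @ np.linalg.solve(Sigma, w)                     # <v, w> in the Sigma^{-1} metric
norm_v = np.sqrt(v @ np.linalg.solve(Sigma, v))           # |v| in the Sigma^{-1} metric
dist = np.sqrt((v - w) @ np.linalg.solve(Sigma, v - w))   # distance between v and w
print(inner, norm_v, dist)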

The dual space

We are not done yet. There’s one interesting property of “skewed” spaces worth knowing about. Namely, the elements of their dual space have a particular form. No worries, I’ll explain in a second.

Let us forget the whole skewed space story for a moment, and get back to the usual inner product {\bf w}^T{\bf v}. Think of this inner product as a function f_{\bf w}({\bf v}), which takes a vector \bf v and maps it to a real number, the dot product of \bf v and \bf w. Regard the \bf w here as the parameter (“weight vector”) of the function. If you have done any machine learning at all, you have certainly come across such linear functionals over and over, sometimes in disguise. Now, the set of all possible linear functionals f_{\bf w} is known as the dual space to your “data space”.

Note that each linear functional is determined uniquely by the parameter vector \bf w, which has the same dimensionality as \bf v, so apparently the dual space is in some sense equivalent to your data space - just the interpretation is different. An element \bf v of your “data space” denotes, well, a data point. An element \bf w of the dual space denotes a function f_{\bf w}, which projects your data points on the direction \bf w (recall that if \bf w is unit-length, {\bf w}^T{\bf v} is exactly the length of the perpendicular projection of \bf v upon the direction \bf w). So, in some sense, if \bf v-s are “vectors”, \bf w-s are “directions, perpendicular to these vectors”. Another way to understand the difference is to note that if, say, the elements of your data points numerically correspond to amounts in kilograms, the elements of \bf w would have to correspond to “units per kilogram”. Still with me?

Let us now get back to the skewed space. If \bf v are elements of a skewed Euclidean space with the metric tensor {\bf\Sigma}^{-1}, is a function f_{\bf w}({\bf v}) = {\bf w}^T{\bf v} an element of a dual space? Yes, it is, because, after all, it is a linear functional. However, the parameterization of this function is inconvenient, because, due to the skewed tensor, we cannot interpret it as projecting vectors upon \bf w nor can we say that \bf w is an “orthogonal direction” (to a separating hyperplane of a classifier, for example). Because, remember, in the skewed space it is not true that orthogonal vectors satisfy {\bf w}^T{\bf v}=0. Instead, they satisfy {\bf w}^T{\bf \Sigma}^{-1}{\bf v} = 0. Things would therefore look much better if we parameterized our dual space differently. Namely, by considering linear functionals of the form f^{\Sigma^{-1}}_{\bf z}({\bf v}) = {\bf z}^T{\bf \Sigma}^{-1}{\bf v}. The new parameters \bf z could now indeed be interpreted as an “orthogonal direction” and things overall would make more sense.

However when we work with actual machine learning models, we still prefer to have our functions in the simple form of a dot product, i.e. f_{\bf w}, without any ugly \bf\Sigma-s inside. What happens if we turn a “skewed space” linear functional from its natural representation into a simple inner product?

(9)   \[f^{\Sigma^{-1}}_{\bf z}({\bf v}) = {\bf z}^T{\bf\Sigma}^{-1}{\bf v} = ({\bf \Sigma}^{-1}{\bf z})^T{\bf v} = f_{\bf w}({\bf v}),\]

where {\bf w} = {\bf \Sigma}^{-1}{\bf z}. (Note that we can lose the transpose because \bf \Sigma is symmetric).

What it means, in simple terms, is that when you fit linear models in a skewed space, your resulting weight vectors will always be of the form {\bf \Sigma}^{-1}{\bf z}. Or, in other words, {\bf\Sigma}^{-1} is a transformation, which maps from “skewed perpendiculars” to “true perpendiculars”. Let me show you what this means visually.

Consider again the two “orthogonal” vectors from the skewed world example above:

Let us interpret the blue vector as an element of the dual space. That is, it is the \bf z vector in a linear functional {\bf z}^T{\bf\Sigma}^{-1}{\bf v}. The red vector is an element of the “data space”, which would be mapped to 0 by this functional (because the two vectors are “orthogonal”, remember).

For example, if the blue vector was meant to be a linear classifier, it would have its separating line along the red vector, just like that:

If we now wanted to use this classifier, we could, of course, work in the “skewed space” and use the expression {\bf z}^T{\bf\Sigma}^{-1}{\bf v} to evaluate the functional. However, why don’t we find the actual normal \bf w to that red separating line so that we wouldn’t need to do an extra matrix multiplication every time we use the function?

It is not too hard to see that {\bf w}={\bf\Sigma}^{-1}{\bf z} will give us that normal. Here it is, the black arrow:

Therefore, next time, whenever you see expressions like {\bf w}^T{\bf\Sigma}^{-1}{\bf v} or ({\bf v}-{\bf w})^T{\bf\Sigma}^{-1}({\bf v}-{\bf w}), remember that those are simply inner products and (squared) distances in a skewed space, while {\bf \Sigma}^{-1}{\bf z} is a conversion from a skewed normal to a true normal. Also remember that the “skew” was estimated by pretending that the data were normally-distributed.

Once you see it, the role of the covariance matrix in some methods like the Fisher’s discriminant or Canonical correlation analysis might become much more obvious.

The dual space metric tensor

“But wait”, you should say here. “You have been talking about expressions like {\bf w}^T{\bf\Sigma}^{-1}{\bf v} all the time, while things like {\bf w}^T{\bf\Sigma}{\bf v} are also quite common in practice. What about those?”

Hopefully you know enough now to suspect that {\bf w}^T{\bf\Sigma}{\bf v} is again an inner product or a squared norm in some deformed space, just not the “internal data metric space” that we considered so far. Which space is it? It turns out it is the “internal dual metric space”. That is, whilst the expression {\bf w}^T{\bf\Sigma}^{-1}{\bf v} denoted the “new inner product” between the points, the expression {\bf w}^T{\bf\Sigma}{\bf v} denotes the “new inner product” between the parameter vectors. Let us see why it is so.

Consider an example again. Suppose that our space transformation \bf A scaled all points by 2 along the x axis. The point (1,0) became (2,0), the point (3, 1) became (6, 1), etc. Think of it as changing the units of measurement - before we measured the x axis in kilograms, and now we measure it in pounds. Consequently, the norm of the point (2,0) according to the new metric, |(2,0)|_{\Sigma^{-1}} will be 1, because 2 pounds is still just 1 kilogram “deep down inside”.

What should happen to the parameter ("direction") vectors due to this transformation? Can we say that the parameter vector (1,0) also got scaled to (2,0) and that the norm of the parameter vector (2,0) is now therefore also 1? No! Recall that if our initial data denoted kilograms, our dual vectors must have denoted “units per kilogram”. After the transformation they will be denoting “units per pound”, correspondingly. To stay consistent we must therefore convert the parameter vector (”1 unit per kilogram”, 0) to its equivalent (“0.5 units per pound”,0). Consequently, the norm of the parameter vector (0.5,0) in the new metric will be 1 and, by the same logic, the norm of the dual vector (2,0) in the new metric must be 4. You see, the “importance of a parameter/direction” gets scaled inversely to the “importance of data” along that parameter or direction.

More formally, if {\bf x}'={\bf Ax}, then

(10)   \begin{align*} f_{\bf w}({\bf x}) &= {\bf w}^T{\bf x} = {\bf w}^T{\bf A}^{-1}{\bf x}'\\ & =(({\bf A}^{-1})^T{\bf w})^T{\bf x}'=f_{({\bf A}^{-1})^T{\bf w}}({\bf x}'). \end{align*}

This means, that the transformation \bf A of the data points implies the transformation {\bf B}:=({\bf A}^{-1})^T of the dual vectors. The metric tensor for the dual space must thus be:

(11)   \[({\bf BB}^T)^{-1}=(({\bf A}^{-1})^T{\bf A}^{-1})^{-1}={\bf AA}^T={\bf \Sigma}.\]

Remember the illustration of the “unit circle” in the {\bf \Sigma}^{-1} metric? This is how the unit circle looks in the corresponding \bf\Sigma metric. It is rotated by the same angle, but it is stretched in the direction where it was squished before.

Intuitively, the norm (“importance”) of the dual vectors along the directions in which the data was stretched by \bf A becomes proportionally larger (note that the “unit circle” is, on the contrary, “squished” along those directions).

But the “stretch” of the space deformation in any direction can be measured by the variance of the data. It is therefore not a coincidence that {\bf w}^T{\bf \Sigma}{\bf w} is exactly the variance of the data along the direction \bf w (assuming |{\bf w}|=1).
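
A quick numerical check of that last claim (a sketch with made-up data, not from the original post):

import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((50_000, 2)) @ np.array([[2.0, 0.0],
                                                 [0.6, 1.0]]).T
X -= X.mean(axis=0)
Sigma = X.T @ X / len(X)

w = np.array([1.0, 1.0]) / np.sqrt(2)       # a unit-length direction
print((X @ w).var(), w @ Sigma @ w)         # the variance along w equals w^T Sigma w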

The covariance estimate

Once we start viewing the covariance matrix as a transformation-driven metric tensor, many things become clearer, but one thing becomes extremely puzzling: why is the inverse covariance of the data a good estimate for that metric tensor? After all, it is not obvious that {\bf X}^T{\bf X}/n (where \bf X is the data matrix) must be related to the \bf\Sigma in the distribution equation \exp(-0.5{\bf x}^T{\bf\Sigma}^{-1}{\bf x}).

Here is one possible way to see the connection. Firstly, let us take it for granted that if \bf X is sampled from a symmetric Gaussian, then {\bf X}^T{\bf X}/n is, on average, a unit matrix. This has nothing to do with transformations; it is just a consequence of the pairwise independence and unit variance of the variables in the symmetric Gaussian.

Now, consider the transformed data, {\bf Y}={\bf XA}^T (vectors in the data matrix are row-wise, hence the multiplication on the right with a transpose). What is the covariance estimate of \bf Y?

(12)   \[{\bf Y}^T{\bf Y}/n=({\bf XA}^T)^T{\bf XA}^T/n={\bf A}({\bf X}^T{\bf X}){\bf A}^T/n\approx {\bf AA}^T,\]

the familiar tensor.

This is a place where one could see that a covariance matrix may make sense outside the context of a Gaussian distribution, after all. Indeed, if you assume that your data was generated from any distribution P with uncorrelated variables of unit variance and then transformed using some matrix \bf A, the expression {\bf X}^T{\bf X}/n will still be an estimate of {\bf AA}^T, the metric tensor for the corresponding (dual) space deformation.
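
Here is a short sketch of that claim with deliberately non-Gaussian data (uniform components with unit variance; the numbers are my own example):

import numpy as np

rng = np.random.default_rng(3)
n = 200_000
U = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n, 2))   # uniform, zero mean, unit variance
A = np.array([[1.5, 0.0],
              [0.7, 0.4]])
Y = U @ A.T

print(Y.T @ Y / n)     # still close to...
print(A @ A.T)         # ...the tensor A A^T, with no Gaussian anywhere in sight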

However, note that out of all possible initial distributions P, the normal distribution is exactly the one with the maximum entropy, i.e. the “most generic”. Thus, if you base your analysis on the mean and the covariance matrix (which is what you do with PCA, for example), you could just as well assume your data to be normally distributed. In fact, a good rule of thumb is to remember that whenever you even mention the words "covariance matrix", you are implicitly fitting a Gaussian distribution to your data.

November 22, 2016

Anton Arhipov: Twitterfeed #1


Twitterfeed is a collection of news that I find via Twitter. I have no particular system or method for picking the news. Neither do I have a predefined period for grouping the news. It is neither daily, weekly, nor monthly - it is all just random. Enjoy! :)


Java 10 is now officially a project
IntelliJ IDEA 2016.3, my favourite IDE, was released!!! Yay!
CLion 2016.3, an IDE for C/C++ development, was released
Rider, a new IDE for .NET is now publicly available via EAP
Akka 2.4.14 was released
Ceylon 1.3.1 was released
Fedora 25 was released

Heinz Kabutz teaches how to implement our own ArrayList in less than 10 minutes
Martin Kleppmann talks about conflict resolution for eventual consistency
Yegor Bugayenko rants about software architects
Roland Kuhn writes about understanding distribution


November 17, 2016

Four Years Remaining: The Secrets of Spring Motion

Mass on a spring

Imagine a weight hanging on a spring. Let us pull the weight a bit and release it into motion. What will its motion look like? If you remember some of your high-school physics, you should probably answer that the resulting motion is a simple harmonic oscillation, best described by a sinewave. Although this is a fair answer, it actually misses an interesting property of real-life springs. A property most people don't think much about, because it goes a bit beyond the high school curriculum. This property is best illustrated by

The Slinky Drop

The "slinky drop" is a fun little experiment which has got its share of internet fame.

When the top end of a suspended slinky is released, the bottom seems to patiently wait for the top to arrive before starting to fall as well. This looks rather unexpected. After all, we know that things fall down according to a parabola, and we know that springs collapse according to a sinewave, however neither of the two rules seem to apply here. If you browse around, you will see lots of awesome videos demonstrating or explaining this effect. There are news articles, forum discussions, blog posts and even research papers dedicated to the magical slinky. However, most of them are either too sketchy or too complex, and none seem to mention the important general implications, so let me give a shot at another explanation here.

The Slinky Drop Explained Once More

Let us start with the classical, "high school" model of a spring. The spring has some length L in the relaxed state, and if we stretch it, making it longer by \Delta L, the two ends of the spring exert a contracting force of k\Delta L. Assume we hold the top of the spring at the vertical coordinate y_{\mathrm{top}}=0 and have it balance out. The lower end will then position at the coordinate y_{\mathrm{bot}} = -(L+mg/k), where the gravity force mg is balanced out exactly by the spring force.

How would the two ends of the spring behave if we let go of the top now? Here's how:

The falling spring, version 1

The horizontal axis here denotes the time, the vertical axis the vertical position. The blue curve is the trajectory of the top end of the spring, the green curve the trajectory of the bottom end. The dotted blue line is offset from the blue line by exactly L - the length of the spring in the relaxed state.

Observe that the lower end (the green curve), similarly to the slinky, "waits" for quite a long time for the top to approach before starting to move with discernible velocity. Why is that the case? The trajectory of the lower point can be decomposed into two separate movements. Firstly, the point is trying to fall down due to gravity, following a parabola. Secondly, the point is being affected by spring tension and thus follows a cosine trajectory. Here is how the two trajectories look separately:

They are surprisingly similar at the start, aren't they? And indeed, the cosine function does resemble a parabola up to o(x^3). Recall the corresponding Taylor expansion:

    \[\cos(x) = 1 - \frac{x^2}{2} + \frac{x^4}{24} + \dots \approx 1 - \frac{x^2}{2}.\]

If we align the two curves above, we can see how well they match up at the beginning:

Consequently, the two forces happen to "cancel" each other long enough to leave an impression that the lower end "waits" for the upper for some time. This effect is, however, much more pronounced in the slinky. Why so?

Because, of course, a single spring is not a good model for the slinky. It is more correct to regard a slinky as a chain of springs. Observe what happens if we model the slinky as a chain of just three simple springs:

Each curve here is the trajectory of one of the nodes in between the three individual springs. We can see that the top two curves behave just like a single spring did - the green node waits a bit for the blue and then starts moving. The red one, however, has to wait longer, until the green node moves sufficiently far away. The bottom, in turn, will only start moving observably when the red node approaches it close enough, which means it has to wait even longer yet - by that time the top has already arrived. If we consider a more detailed model - the movement of a slinky composed of, say, 9 basic springs - the effect of intermediate nodes "waiting" becomes even more pronounced:
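
For the curious, here is a minimal numerical sketch of such a chain-of-springs model (my own code with assumed toy parameters, not the author's): N point masses joined by identical Hooke springs, started from the hanging equilibrium and then released, integrated with semi-implicit Euler.

import numpy as np

N  = 10          # number of point masses in the chain
k  = 100.0       # stiffness of each small spring (assumed toy value)
m  = 0.05        # mass of each point (assumed)
g  = 9.81
L0 = 0.05        # relaxed length of each small spring
dt = 1e-4

# Hanging equilibrium: spring j (counted from the top) carries the weight
# of the N - 1 - j masses hanging below it.
ext0 = np.array([(N - 1 - j) * m * g / k for j in range(N - 1)])
y = np.concatenate(([0.0], -np.cumsum(L0 + ext0)))   # positions, top mass at 0
v = np.zeros(N)

history = [y.copy()]
for _ in range(int(1.0 / dt)):          # release the top and simulate 1 second
    ext = (y[:-1] - y[1:]) - L0         # extension of each spring
    f = np.full(N, -m * g)              # gravity acts on every mass
    f[:-1] -= k * ext                   # a stretched spring pulls its upper mass down
    f[1:]  += k * ext                   # ...and its lower mass up
    v += (f / m) * dt                   # semi-implicit Euler step
    y += v * dt
    history.append(y.copy())

history = np.array(history)             # column i is the trajectory of mass i;
                                        # the bottom mass barely moves until the wave arrives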

To make a "mathematically perfect" model of a slinky we have to go to the limit of having infinitely many infinitely small springs. Let's briefly take a look at how that solution looks like.

The Continuous Slinky

Let x denote the coordinate of a point on a "relaxed" slinky. For example, in the two discrete models above the slinky had 4 and 10 points, numbered 1,\dots, 4 and 1,\dots, 10 respectively. The continuous slinky will have infinitely many points numbered [0,1].

Let h(x,t) denote the vertical coordinate of a point x at time t. The acceleration of point x at time t is then, by definition \frac{\partial^2 h(x,t)}{\partial^2 t}, and there are two components affecting it: the gravitational pull -g and the force of the spring.

The spring force acting on a point x is proportional to the stretch of the spring at that point \frac{\partial h(x,t)}{\partial x}. As each point is affected by the stretch from above and below, we have to consider a difference of the "top" and "bottom" stretches, which is thus the derivative of the stretch, i.e. \frac{\partial^2 h(x,t)}{\partial^2 x}. Consequently, the dynamics of the slinky can be described by the equation:

    \[\frac{\partial^2 h(x,t)}{\partial^2 t} = a\frac{\partial^2 h(x,t)}{\partial^2 x} - g.\]

where a is some positive constant. Let us denote the second derivatives by h_{tt} and h_{xx}, replace a with v^2 and rearrange to get:

(1)   \[h_{tt} - v^2 h_{xx} = -g,\]

which is known as the wave equation. The name stems from the fact that solutions to this equation always resemble "waves" propagating at a constant speed v through some medium. In our case the medium will be the slinky itself. Now it becomes apparent that, indeed, the lower end of the slinky should not move before the wave of disturbance, unleashed by releasing the top end, reaches it. Most of the explanations of the slinky drop seem to refer to that fact. However when it is stated alone, without the wave-equation-model context, it is at best a rather incomplete explanation.

Given how famous the equation is, it is not too hard to solve it. We'll need to do it twice - first to find the initial configuration of a suspended slinky, then to compute its dynamics when the top is released.

In the beginning the slinky must satisfy h_t(x, t) = 0 (because it is not moving at all), h(0, t) = 0 (because the top end is located at coordinate 0), and h_x(1, t) = 0 (because there is no stretch at the bottom). Combining this with (1) and searching for a polynomial solution, we get:

    \[h(x, t) = h_0(x) = \frac{g}{2v^2}x(x-2).\]

Next, we release the slinky, hence the conditions h_t(x,t)=0 and h(0,t)=0 disappear and we may use d'Alembert's formula with reflected boundaries to get the solution:

    \[h(x,t) = \frac{1}{2}(\phi(x-vt) + \phi(x+vt)) - \frac{gt^2}{2},\]

    \[\text{ where }\phi(x) = h_0(\mathrm{mod}(x, 2)).\]

Here's what the solution looks like visually:

Notice how the part of the slinky to which the wave has not arrived yet, stays completely fixed in place. Here are the trajectories of 4 equally-spaced points on the slinky:

Note how, quite surprisingly, all points of the slinky are actually moving with a constant speed, changing it abruptly at certain moments. Somewhat magically, the mean of all these piecewise-linear trajectories (i.e. the trajectory of the center of mass of the slinky) is still a smooth parabola, just as it should be:
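
The closed-form solution above is easy to evaluate directly. Here is a small sketch (mine, with arbitrary illustrative constants g and v) confirming that the bottom end does not move until the wave reaches it at t = 1/v:

import numpy as np

g, v = 9.81, 1.0                                   # arbitrary illustrative constants

def h0(x):
    return g / (2 * v**2) * x * (x - 2)            # shape of the hanging slinky

def phi(x):
    return h0(np.mod(x, 2))                        # reflected, 2-periodic extension

def h(x, t):
    return 0.5 * (phi(x - v * t) + phi(x + v * t)) - g * t**2 / 2

x = np.linspace(0, 1, 101)                         # points along the relaxed slinky
for t in (0.0, 0.5, 1.0, 1.25, 1.5):
    print(t, h(x, t)[-1])                          # the bottom end stays put until t = 1/v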

The Secret of Spring Motion

Now let us come back to where we started. Imagine a weight on a spring. What will its motion be like? Obviously, any real-life spring is, just like the slinky, best modeled not as Hooke's simple spring, but rather via the wave equation. Which means that when you let go of the weight, the weight will send a deformation wave, which will move along the spring back and forth, affecting the pure sinewave movement you might be expecting from the simple Hooke's law. Watch closely:

Here is what the movement of the individual nodes looks like:

The fat red line is the trajectory of the weight, and it is certainly not a sinewave. It is a curve in between the piecewise-linear "sawtooth" (which is the limit case when the weight is zero) and the true sinusoid (which is the limit case when the mass of the spring is zero). Here's what the zero-weight case looks like:

And this is the other extreme - the massless spring:

These observations can be summarized into the following obviously-sounding conclusion: the basic Hooke's law applies exactly only to the massless spring. Any real spring has a mass and thus forms an oscillation wave traveling back and forth along its length, which will interfere with the weight's simple harmonic oscillation, making it "less simple and harmonic". Luckily, if the mass of the weight is large enough, this interference is negligible.

And that is, in my opinion, one of the interesting, yet often overlooked aspects of spring motion.

November 14, 2016

Raivo Laanemets: The infrastructure setup

I run a number of services to make my work easier when providing services as a developer. Some of these are just for myself (like mail) but others are used by my clients who do not want to maintain their own development infrastructure. The services are:

  • Mail
  • Code repositories
  • Project tracker
  • Test server
  • Logging and monitoring
  • System backups
  • Other things

Mail

Mail is my main communication channel with many clients. I have run my own mail server for the last 6 years. It is based on Debian's standard Postfix installation. For the last 3 years I have been using WebMail as the webmail frontend. I do not recommend running your own mail server as there is a lot of additional software needed and many pitfalls that can make your mail get stuck in recipients' spam folders.

Code repositories

Git is currently the most popular code version control system. Such a system holds source code and other files that need to be tracked for changes over time, often together with a project tracker. A central code repository also makes distributing code to servers easier. I host central git repositories through Gitolite.

Project tracker

I use the Redmine issue tracker for managing projects. It is a generic project tracker with built-in time tracking, multiproject support, multiteam support, a repository viewer and lots of other useful features. I have used it since 2010 and have not seen anything better despite having tried almost all popular trackers.

Test server

A new project gets its first deployment as soon as possible. This is how I work. There is a huge communication boost with the client when all work can be deployed immediately for quick feedback. The test server runs at my home. The server has a second role as a NAS to keep larger files and my desktop backups.

Logging and monitoring

Monitoring makes sure that all deployments are up and that there are no upcoming issues such as expiring domains and expiring SSL certificates. Logging records the application activity and sends it to a central location. Error logs are monitored for new errors. The monitoring system sends mail alerts when something is down, soon to expire or runs with errors. I'm using a setup based on rsyslog and it is described here in more detail. The system uses a custom frontend to browse the logs and implement alerts.

System backups

The backup system makes file and database backups of every important deployment. This has saved us a couple of times when files have been accidentally modified or removed from production or when the whole environment has been destroyed. It rarely happens, but it is better to be prepared. The backup system is a collection of bash scripts and I have described it here and here. I usually run backups for free during development and later, in the maintenance phase, for a fee.

Other things

Besides the important stuff above and the client systems I also run:

  • My tech blog.
  • The "watcher". This periodically monitors a number of online forums and Facebook groups to get possible job leads. I have made interesting contacts through it but the actual jobs have come from the other channels so far.
  • My company homepage. Promoting it some time ago worked well but it tends to bring potential clients with projects that are too big. So it has been kept quite minimal for now.
  • An instance of feeds. I read lots of blogs and they are my main source of tech news.
  • An instance of an online accounting software.

Hardware

The infrastructure is currently running on 3 separate machines:

  1. A Linode VPS instance (mail, code repositories, project tracker - important stuff).
  2. An Amazon EC2 instance (backups)
  3. Home server (test server, logging and monitoring)

The servers run Debian or a Debian-based distro. The current setup costs me about 100-200 EUR per month to run (without considering maintenance time). So far I think it's been worth it. It has been built over the last 6 years but might still get some changes in the future.

November 13, 2016

Raivo Laanemets: Now, 2016-11

This is an update on things related to this blog and my work.

Blogging

Upgrades to the blog code:

  • A list of recent posts under each post page. If it turns out to waste too much space I will remove it.
  • All posts page now has index by year.
  • I have optimized the article print view. It now contains less noise.
  • Flatter style: no border on inline code and code blocks. This improves readability.

I also did some minor edits to existing articles: fixed capitalization and normalized tags.

Work

  • I have updated the generic terms. I have covered some previously unregulated cases concerning the services I provide and when I provide them.
  • I am back on the Scale Laser project. Over the last 2 weeks I have been studying SVG in more depth for the drawing editor component.

Router upgrade

My home network now has a better router. I have replaced the old router with a hardware-NAT Asus RT-AC53. The old router caused disconnects for the main desktop and WiFi devices. The test server running at my home and hosting many projects was also affected by this. The router upgrade has made the issues disappear.

Open Source

  • Fast-feed supports Node.js 7.x. No actual changes were needed; I only updated the Travis conf. The readme now contains instructions on how to install it on Windows.
  • Released MySQL-syslog-viewer. It was a weekend project that ended up taking more than one weekend.
  • Released Session Viewer. It was another weekend project.
  • Blog-Core leaked identify processes. Fixed in release 1.1.1.
  • Minor Sublime EJS highlighter improvement.

Besides these I have 2 more weekend projects to announce. The code for them is already finished.

Other things

Docker issues

I ran into issues with Docker again. The idea of containers and automation is good but current Linux distros and software are not built for it and require too much manual interaction. This makes some things harder than they have to be.

Vue.js

I have looked at Vue.js as a modern JavaScript framework. I can see that it leads to cleaner code than KnockoutJS. I would like to evaluate it on a future project. I have already worked through some small examples and I like it a lot.

Apartment association

Much needed building renovations were delayed as apartment owners voted against them. There were 51 votes for and 12 votes against. We have 102 apartments and the required number of votes for the loan was 50%+1, meaning that we missed it by 1 vote. We have already scheduled a new vote and hope to pass it. This decision has a high impact on my schedule in 2017 as I won't be able to work in my home office anymore due to construction work.

To relieve the air quality problems before reconstruction I have bought a LB88 air humidifier. After renovations it should become unnecessary to use any air quality control equipment as the ventilation system will be completely rebuilt. At the moment we have slight overheating to provide livable temperatures in all apartments. In my apartment with old windows it causes the air humidity to drop below 20% during winters which results in breathing issues. I'm using the humidifier to keep it in the 40-50% range.

November 11, 2016

Four Years Remaining: The Meaning of the Gaussian Kernel

A question on Quora reminded me that I wanted to post this explanation here every time I got a chance to teach SVMs and Kernel methods, but I never found the time. The post expects basic knowledge of those topics from the reader.

Introductory Background

The concept of kernel methods is probably one of the coolest tricks in machine learning. With most machine learning research nowadays being centered around neural networks, they have gone somewhat out of fashion recently, but I suspect they will strike back one day in some way or another.

The idea of a kernel method starts with the curious observation that if you take a dot product of two vectors, x^T y, and square it, the result can be regarded as a dot product of two "feature vectors", where the features are all pairwise products of the original inputs:

    \begin{align*} &k(x,y) = (x^Ty)^2 = (\sum_i x_iy_i)^2 = \sum_{i,j} x_ix_jy_iy_j \\ & = \langle (x_1x_1, x_1x_2,\dots,x_ix_j,\dots,x_nx_n), (y_1y_1,\dots,y_iy_j,\dots,y_ny_n)\rangle \end{align*}

Analogously, if you raise x^Ty to the third power, you are essentially computing a dot product within a space of all possible three-way products of your inputs, and so on, without ever actually having to see those features explicitly.

If you now take any linear model (e.g. linear regression, linear classification, PCA, etc) it turns out you can replace the "real" dot product in its formulation with such a kernel function, and this will magically convert your model into a linear model with nonlinear features (e.g. pairwise or triple products). As those features are never explicitly computed, there is no problem even if there are millions or billions of them.

Consider, for example, plain old linear regression: f(x) = w^Tx + b. We can "kernelize" it by first representing w as a linear combination of the data points (this is called a dual representation):

    \[f(x) = \left(\sum_i \alpha_i x_i\right)^T x + b = \sum_i \alpha_i (x_i^T x) + b,\]

and then swapping all the dot products x_i^T x with a custom kernel function:

    \[f(x) = \sum_i \alpha_i k(x_i,x) + b.\]

If we now substitute k(x,y) = (x^T y)^2 here, our model becomes a second degree polynomial regression. If k(x,y) = (x^T y)^5 it is the fifth degree polynomial regression, etc. It's like magic, you plug in different functions and things just work.
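
As a concrete illustration, here is a minimal kernel ridge regression sketch in NumPy; the ridge regularization and the omission of the bias term b are my simplifications, not part of the recipe above.

import numpy as np

def k(X1, X2, degree=2):
    return (X1 @ X2.T) ** degree                   # the polynomial kernel (x^T y)^degree

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
y = X[:, 0] ** 2 + 0.05 * rng.standard_normal(50)  # a quadratic target with a bit of noise

lam = 1e-3
K = k(X, X)                                        # Gram matrix of all pairwise kernel values
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

X_test = np.linspace(-1, 1, 5).reshape(-1, 1)
print(k(X_test, X) @ alpha)                        # f(x) = sum_i alpha_i k(x_i, x), close to x^2
print(X_test[:, 0] ** 2)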

It turns out that there are lots of valid choices for the kernel function k, and, of course, the Gaussian function is one of these choices:

    \[k(x, y) = \exp\left(-\frac{|x - y|^2}{2\sigma^2}\right).\]

It is not too surprising - the Gaussian function tends to pop up everywhere, after all - but it is not obvious what "implicit features" it should represent when viewed as a kernel function. Most textbooks do not seem to cover this question in sufficient detail, so let me do it here.

The Gaussian Kernel

To see the meaning of the Gaussian kernel we need to understand a couple of ways in which kernel functions can be combined. We saw before that raising a linear kernel to the power i makes a kernel with a feature space which includes all i-wise products. Now let us examine what happens if we add two or more kernel functions. Consider k(x,y) = x^Ty + (x^Ty)^2, for example. It is not hard to see that it corresponds to an inner product of feature vectors of the form

    \[(x_1, x_2, \dots, x_n, \quad x_1x_1, x_1x_2,\dots,x_ix_j,\dots, x_n x_n),\]

i.e. the concatenation of degree-1 (untransformed) features, and degree-2 (pairwise product) features.

Multiplying a kernel function with a constant c is also meaningful. It corresponds to scaling the corresponding features by \sqrt{c}. For example, k(x,y) = 4x^Ty = (2x)^T(2y).

Still with me? Great, now let us combine the tricks above and consider the following kernel:

    \[k(x,y) = 1 + x^Ty + \frac{(x^Ty)^2}{2} + \frac{(x^Ty)^3}{6}.\]

Apparently, it is a kernel which corresponds to a feature mapping, which concatenates a constant feature, all original features, all pairwise products scaled down by \sqrt{2} and all triple products scaled down by \sqrt{6}.

Looks impressive, right? Let us continue and add more members to this kernel, so that it would contain all four-wise, five-wise, and so on up to infinity-wise products of input features. We shall choose the scaling coefficients for each term carefully, so that the resulting infinite sum would resemble a familiar expression:

    \[\sum_{i=0}^\infty \frac{(x^Ty)^i}{i!} = \exp(x^Ty).\]

We can conclude here that k(x,y) = \exp(x^Ty) is a valid kernel function, which corresponds to a feature space, which includes products of input features of any degree, up to infinity.

But we are not done yet. Suppose that we decide to normalize the inputs before applying our linear model. That is, we want to convert each vector x to \frac{x}{|x|} before feeding it to the model. This is quite often a smart idea, which improves generalization. It turns out we can do this “data normalization” without really touching the data points themselves, but by only tuning the kernel instead.

Consider again the linear kernel k(x,y) = x^Ty. If we apply it to normalized vectors, we get

    \[k\left(\frac{x}{|x|}, \frac{y}{|y|}\right) = \left(\frac{x}{|x|}\right)^T\left(\frac{y}{|y|}\right) = \frac{x^Ty}{|x||y|} = \frac{k(x,y)}{\sqrt{k(x,x)k(y,y)}}.\]

With some reflection you will see that the latter expression would normalize the features for any kernel.

Let us see what happens if we apply this kernel normalization to the “infinite polynomial” (i.e. exponential) kernel we just derived:

    \begin{align*} \frac{\exp(x^Ty)}{\sqrt{\exp(x^Tx)\exp(y^Ty)}} &= \frac{\exp(x^Ty)}{\exp(|x|^2/2)\exp(|y|^2/2)}  \\ &= \exp\left(-\frac{1}{2}(|x|^2+|y|^2-2x^Ty)\right) \\ &= \exp\left(-\frac{|x-y|^2}{2}\right). \end{align*}

Voilà, the Gaussian kernel. Well, it still lacks \sigma^2 in the denominator, but by now you hopefully see that adding it is equivalent to scaling the inputs by 1/\sigma.

To conclude: the Gaussian kernel is a normalized polynomial kernel of infinite degree (where feature products of i-th degree are scaled down by \sqrt{i!} before normalization). Simple, right?
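
The whole derivation can be checked numerically in a couple of lines (a sketch with arbitrary test vectors of my choosing):

import numpy as np

x = np.array([0.3, -1.2, 0.5])
y = np.array([1.0, 0.4, -0.2])

normalized_exp = np.exp(x @ y) / np.sqrt(np.exp(x @ x) * np.exp(y @ y))
gaussian = np.exp(-0.5 * np.sum((x - y) ** 2))
print(normalized_exp, gaussian)        # the two values coincide up to floating point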

An Example

The derivations above may look somewhat theoretic if not "magical", so let us work through a couple of numeric examples. Suppose our original vectors are one-dimensional (that is, real numbers), and let x = 1, y = 2. The value of the Gaussian kernel k(x, y) for these inputs is:

    \[k(x, y) = \exp(-0.5|1-2|^2) \approx 0.6065306597...\]

Let us see whether we can obtain the same value as a simple dot product of normalized polynomial feature vectors of a high degree. For that, we first need to compute the corresponding unnormalized feature representation:

    \[\phi'(x) = \left(1, x, \frac{x^2}{\sqrt{2!}}, \frac{x^3}{\sqrt{3!}}, \frac{x^4}{\sqrt{4!}}, \frac{x^5}{\sqrt{5!}}\dots\right).\]

As our inputs are rather small in magnitude, we can hope that the feature sequence quickly approaches zero, so we don't really have to work with infinite vectors. Indeed, here is what the feature sequences look like:

    \phi'(1) = (1, 1, 0.707, 0.408, 0.204, 0.091, 0.037, 0.014, 0.005, 0.002, 0.001, 0.000, 0.000, ...)

    \phi'(2) = (1, 2, 2.828, 3.266, 3.266, 2.921, 2.385, 1.803, 1.275, 0.850, 0.538, 0.324, 0.187, 0.104, 0.055, 0.029, 0.014, 0.007, 0.003, 0.002, 0.001, ...)

Let us limit ourselves to just these first 21 features. To obtain the final Gaussian kernel feature representations \phi we just need to normalize:

    \phi(1) = \frac{\phi'(1)}{|\phi'(1)|} = \frac{\phi'(1)}{1.649} = (0.607, 0.607, 0.429, 0.248, 0.124, 0.055, 0.023, 0.009, 0.003, 0.001, 0.000, ...)

    \phi(2) = \frac{\phi'(2)}{|\phi'(2)|} = \frac{\phi'(2)}{7.389} = (0.135, 0.271, 0.383, 0.442, 0.442, 0.395, 0.323, 0.244, 0.173, 0.115, 0.073, 0.044, 0.025, 0.014, 0.008, 0.004, 0.002, 0.001, 0.000, ...)

Finally, we compute the simple dot product of these two vectors:

    \[\scriptstyle\phi(1)^T\phi(2) = 0.607\cdot 0.135 + 0.607\cdot 0.271 + \dots = {\bf 0.6065306}602....\]

In boldface are the decimal digits, which match the value of \exp(-0.5|1-2|^2). The discrepancy is probably more due to lack of floating-point precision rather than to our approximation.
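
Here is a small sketch reproducing this computation (my code; 21 features, as in the text):

import numpy as np
from math import factorial

def phi(x, degree=20):
    feats = np.array([x ** i / np.sqrt(factorial(i)) for i in range(degree + 1)])
    return feats / np.linalg.norm(feats)           # the normalization step

print(phi(1.0) @ phi(2.0))                         # 0.6065306...
print(np.exp(-0.5 * (1.0 - 2.0) ** 2))             # 0.6065306597...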

A 2D Example

The one-dimensional example might have seemed somewhat too simplistic, so let us also go through a two-dimensional case. Here our unnormalized feature representation is the following:

    \begin{align*} \scriptscriptstyle\phi'(&\scriptscriptstyle x_1, x_2) = \left(1, x_1, \frac{x_1x_1}{\sqrt{2!}}, \frac{x_1x_2}{\sqrt{2!}}, \frac{x_2x_1}{\sqrt{2!}}, \frac{x_2x_2}{\sqrt{2!}}, \right. \\ &\scriptscriptstyle \left.\frac{x_1x_1x_1}{\sqrt{3!}}, \frac{x_1x_1x_2}{\sqrt{3!}}, \frac{x_1x_2x_1}{\sqrt{3!}}, \frac{x_1x_2x_2}{\sqrt{3!}}, \frac{x_2x_1x_2}{\sqrt{3!}},  \frac{x_2x_2x_1}{\sqrt{3!}}, \dots\right).\\ \end{align*}

This looks pretty heavy, and we didn't even finish writing out the third degree products. If we wanted to continue all the way up to degree 20, we would end up with a vector with 2097151 elements!

Note that many products are repeated, however (e.g. x_1x_2 = x_2x_1), hence these are not really all different features. Let us try to pack them more efficiently. As you'll see in a moment, this will open up a much nicer perspective on the feature vector in general.

Basic combinatorics will tell us that each feature of the form \frac{x_1^a x_2^b}{\sqrt{n!}} must be repeated exactly \frac{n!}{a!b!} times in our current feature vector. Thus, instead of repeating it, we could replace it with a single feature, scaled by \sqrt{\frac{n!}{a!b!}}. "Why the square root?" you might ask here. Because when combining a repeated feature we must preserve the overall vector norm. Consider a vector (v, v, v), for example. Its norm is \sqrt{v^2 + v^2 + v^2} = \sqrt{3}|v|, exactly the same as the norm of the single-element vector (\sqrt{3}v).

As we do this scaling, each feature gets converted to a nice symmetric form:

    \[\sqrt{\frac{n!}{a!b!}}\frac{x_1^a x_2^b}{\sqrt{n!}} = \frac{x_1^a x_2^b}{\sqrt{a!b!}} = \frac{x_1^a}{\sqrt{a!}}\frac{x_2^b}{\sqrt{b!}}.\]

This means that we can compute the 2-dimensional feature vector by first expanding each parameter into a vector of powers, like we did in the previous example, and then taking all their pairwise products of total degree up to 20. This way, if we wanted to limit ourselves to maximum degree 20, we would only have to deal with 21\cdot 22/2 = 231 features instead of 2097151. Nice!

Here is a new view of the unnormalized feature vector up to degree 3:

    \[\phi'_3(x_1, x_2) = \scriptstyle\left(1, x_1, x_2, \frac{x_1^2}{\sqrt{2!}}, \frac{x_1x_2}{\sqrt{1!1!}}, \frac{x_2^2}{\sqrt{2!}}, \frac{x_1^3}{\sqrt{3!}}, \frac{x_1^2x_2}{\sqrt{2!1!}}, \frac{x_1x_2^2}{\sqrt{1!2!}}, \frac{x_2^3}{\sqrt{3!}}\right).\]

Let us limit ourselves to this degree-3 example and let x = (0.7, 0.2), y = (0.1, 0.4) (if we picked larger values, we would need to expand our feature vectors to a higher degree to get a reasonable approximation of the Gaussian kernel). Now:

    \phi'_3(0.7, 0.2) = (1, 0.7, 0.2, 0.346, 0.140, 0.028, 0.140, 0.069, 0.020, 0.003),

    \phi'_3(0.1, 0.4) = (1, 0.1, 0.4, 0.007, 0.040, 0.113, 0.000, 0.003, 0.011, 0.026).

After normalization:

    \phi_3(0.7, 0.2) = (0.768, 0.538, 0.154, 0.266, 0.108, 0.022, 0.108, 0.053, 0.015, 0.003),

    \phi_3(0.1, 0.4) = (0.919, 0.092, 0.367, 0.006, 0.037, 0.104, 0.000, 0.003, 0.010, 0.024).

The dot product of these vectors is 0.8196\dots, what about the exact Gaussian kernel value?

    \[\exp(-0.5|x-y|^2) = 0.{\bf 81}87\dots.\]

Close enough. Of course, this could be done for any number of dimensions, so let us conclude the post with the new observation we made:

The features of the unnormalized d-dimensional Gaussian kernel are:

    \[\phi({\bf x})_{\bf a} = \prod_{i = 1}^d \frac{x_i^{a_i}}{\sqrt{a_i!}},\]

where {\bf a} = (a_1, \dots, a_d) \in \mathbb{N}_0^d.

The Gaussian kernel is just the normalized version of that, and we know that the norm to divide by is \sqrt{\exp({\bf x}^T{\bf x})}. Thus we may also state the following:

The features of the d-dimensional Gaussian kernel are:

    \[\phi({\bf x})_{\bf a} = \exp(-0.5|{\bf x}|^2)\prod_{i = 1}^d \frac{x_i^{a_i}}{\sqrt{a_i!}},\]

where {\bf a} = (a_1, \dots, a_d) \in \mathbb{N}_0^d.
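
For completeness, here is a small sketch (mine) that enumerates these features for d = 2 up to a chosen maximum total degree and checks the result against the exact kernel value from the example above:

import numpy as np
from math import factorial
from itertools import product

def gaussian_features(x, max_degree=10):
    d = len(x)
    feats = []
    for a in product(range(max_degree + 1), repeat=d):    # multi-indices a
        if sum(a) <= max_degree:
            feats.append(np.prod([x[i] ** a[i] / np.sqrt(factorial(a[i]))
                                  for i in range(d)]))
    return np.exp(-0.5 * (x @ x)) * np.array(feats)

x = np.array([0.7, 0.2])
y = np.array([0.1, 0.4])
print(gaussian_features(x) @ gaussian_features(y))   # approx. 0.8187
print(np.exp(-0.5 * np.sum((x - y) ** 2)))           # the exact kernel value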

That's it, now you have seen the soul of the Gaussian kernel.

November 10, 2016

TransferWise Tech Blog: When to Throw an Exception

The simplest rule to figure out whether we need an exception is to think whether having a stacktrace will give any value. For example, do we care about a stacktrace when some business rule is violated?

One widely known principle is that exceptions should be used for programming errors. But what is a programming error? Maybe it is easier to understand if we say that exceptions should be used when something happens that cannot be avoided by using common knowledge. Common knowledge is something that we believe all well behaving consumers of our API should have.

For example, we can assume everybody to know that there are 31 days in December. So the following is a valid case to throw an exception:

newYearsEve = Date.of(32, 12, 2016)  

Now let's say we have a library in our application that provides currency exchange rates. In this example the library only supports rates with EUR as the source. When a client calls this API with GBP-USD then we could throw an exception. However, this is not a programming error as we cannot expect everybody to know about this constraint in the library. We can solve this by also exposing the list of supported routes and having the following API:

Rate getRate(Route route)  
boolean isSupported(Route route)  

We are now exporting the knowledge about supported routes to the client. This knowledge is certainly not part of common knowledge but by adding isSupported we have made it possible for the client to easily acquire that knowledge. Hence here we could again opt for throwing an exception when getRate gets called with an unsupported route.
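
A hedged sketch of the same idea in Python (the names, the EUR-only constraint and the placeholder rate are illustrative, not a real API):

class UnsupportedRouteError(Exception):
    pass

class RateProvider:
    SUPPORTED_SOURCES = {"EUR"}

    def is_supported(self, source, target):
        return source in self.SUPPORTED_SOURCES

    def get_rate(self, source, target):
        if not self.is_supported(source, target):
            # the client could have checked is_supported first, so this
            # counts as a programming error and deserves a stacktrace
            raise UnsupportedRouteError(f"{source}-{target} is not supported")
        return 1.08  # placeholder rate for the sketch

provider = RateProvider()
if provider.is_supported("GBP", "USD"):
    print(provider.get_rate("GBP", "USD"))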

Read more about this topic: http://c2.com/cgi/wiki?AvoidExceptionsWheneverPossible

Raivo Laanemets: Announcement: Session Viewer

Some time ago I put together a web site visitor logger. At first I did it as an experiment to see how many of my blog visitors are bots. I also wanted to see if I could use data from it to implement a better anti-bot solution for Google Analytics (GA).

I'm storing session data (a session in my app has the lifespan of a session cookie) in a MySQL database. The data includes easy-to-obtain numerical and boolean values. Besides running analytics I can now also browse individual sessions. This was not possible with GA. At the moment I see the most value of the app in that feature. Referrer spam has also disappeared, but this is because the solution uses a custom protocol.

Bot detection

For bot detection I implemented the following techniques:

  1. Check for the specific JavaScript global properties described in this StackOverflow thread.
  2. Number of mouse/scroll events.
  3. Recording page view/session duration.

Results

I have put together some statistics based on the October data from this blog.

Total number of sessions: 1396
Total number of pageviews: 2148
Total number of distinct IP addresses: 1202

Global variables

By the number of sessions.

window.callPhantom: 3 (0.21%)
window.Buffer: 0 (0.00%)
window.emit: 0 (0.00%)
window.spawn: 0 (0.00%)
window.webdriver: 0 (0.00%)
window.domAutomation: 0 (0.00%)

Mouse and scroll events

Sessions with mouse events: 935 (66.98%)
Sessions with scroll events: 899 (64.40%)
Sessions with mouse or scroll events: 1074 (76.93%)

Session length

Sessions longer than 5 seconds: 1267 (90.67%)
Sessions longer than 15 seconds: 1096 (78.51%)
Sessions longer than 30 seconds: 885 (63.40%)
Sessions longer than 60 seconds: 849 (60.81%)
Sessions longer than 120 seconds: 619 (44.34%)
Sessions longer than 240 seconds: 592 (42.41%)

Results summary

Detection of bots through JavaScript global variables does not work well for generic bots. A number of bots, including Google's and Baidu's indexers, will pass the check. They are executing JavaScript and stay on a page for 5-10 seconds. I checked this by randomly picking short sessions and studying their IP locations (the app has built-in support for this through the IP-Lookup service from WhatIsMyIPAddress.com).

I did find some referral spam through manual inspection. There were 1-2 sessions per week with referral to free-share-buttons spam. The bot executed JavaScript but none of the globals above were set. The session length was less than 10 seconds and there were no mouse or keyboard events.

In the end I decided to filter out sessions shorter than 30 seconds. This seems to keep the statistics clean enough without losing too much information. Most of the interesting data for me is in longer multi-page sessions anyway.

Session Viewer

I added a frontend with a couple of tables and charts once I had enough useful data in the database. It contains very few analytical features: only the last month's top entry pages and top pages are shown. The goal was not to replace GA, which is much more powerful with all its predefined and possible custom views, but to complement it with the ability to browse individual sessions. This feature was very simple to add to the app. The session page just lists all visited pages by the view start time.

Screenshot: list of sessions

Session Viewer - List of sessions

Screenshot: single session

Session Viewer - Single session

Source code and documentation can be found in the GitHub repository.