Pinu planeet

March 31, 2017

TransferWise Tech Blog: Scaling our analytics database

Business intelligence is at the core of any great company, and TransferWise is no exception.
When I started my job as a data engineer in July 2016, my initial task was to solve a long-running issue with the database used for analytics queries.

The Gordian knot of the analytics database

The original setup was MySQL Community Edition 5.6 with an InnoDB buffer pool of 40 GB. The virtual machine had 70 GB of memory and 18 CPUs assigned. The total database size was about 600 GB.

The analysts ran their queries using SQL, Looker and Tableau. To get data in near real time, our live database was replicated into a dedicated schema. To protect our customers' personal data, another dedicated schema with a set of views was used to obfuscate the personal information; the same schema was also used for pre-aggregating some heavy queries. Other schemas were copied from the microservice database on a regular basis.

The frog effect

If you drop a frog in a pot of boiling water, it will of course frantically try to clamber out. But if you place it gently in a pot of tepid water and turn up the heat it will be slowly boiled to death.

The performance issues worsened slowly over time. One of the reasons was the constantly increasing size of the database, combined with the personal data obfuscation.
When selecting from a view, if the returned dataset is large enough, the MySQL optimiser materialises the view on disk before executing the query; the temporary files are removed only when the query ends.

As a result, the analytics tools were slow under normal load. In busy periods the database became almost unusable. The analysts had to spend a lot of time tuning the existing queries rather than writing new ones.

The general thinking was that MySQL was no longer a good fit. However the new solution had to satisfy requirements that were quite difficult to achieve with a single product change.

  • The data for analytics should be available in near real time from the live database
  • PII (personally identifiable information) should be obfuscated for general access
  • PII should be available in the clear for restricted users
  • The system should be able to scale for several years
  • The system should offer modern SQL for better analytics queries

The eye of the storm

The analyst team shortlisted a few solutions covering the requirements: Google BigQuery, Amazon Redshift, Snowflake and PostgreSQL.

Google BigQuery did not have the flexibility required for the new analytics DB. Redshift had more capability but was years behind Snowflake and pure PostgreSQL in terms of modern SQL. So both were removed from the list.

Both PostgreSQL and Snowflake offered very good performance and modern SQL.
But neither of them was able to replicate data from a MySQL database.

Snowflake

Snowflake is a cloud-based data warehouse service. It is built on Amazon S3 and comes in different sizes. Its pricing model is very appealing, and the preliminary tests showed Snowflake outperforming PostgreSQL.

Replication between our systems and Snowflake would have been handled by Fivetran, an impressive multi-technology data pipeline. Unfortunately, there was just one little catch:
Fivetran doesn't have native support for obfuscation.

Customer data security is of the highest priority at TransferWise: if for any reason customer data needs to move outside our perimeter, it must always be obfuscated.

PostgreSQL

Foreseeing this issue, I decided to spend time building a proof of concept based on the replication tool pg_chameleon. The tool is written in Python and uses the python-mysql-replication library to read the MySQL replication protocol and replay the changes into a PostgreSQL database.

The initial tests on a reduced dataset were successful, and adding support for real-time obfuscation required minimal changes.
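To give an idea of the moving parts, here is a minimal sketch of the concept (not pg_chameleon itself): it assumes the python-mysql-replication and psycopg2 libraries, invented connection settings, and a hypothetical PII_COLUMNS map describing which fields to hash before they reach PostgreSQL.

import hashlib

import psycopg2
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import WriteRowsEvent

# Hypothetical map of PII columns per table, and invented connection settings.
PII_COLUMNS = {"users": {"email", "phone"}}
MYSQL = {"host": "mysql.example", "port": 3306, "user": "replica", "passwd": "secret"}

def obfuscate(table, values):
    # Replace PII fields with a hex digest before they leave the perimeter.
    for column in PII_COLUMNS.get(table, set()):
        if values.get(column) is not None:
            values[column] = hashlib.sha256(str(values[column]).encode()).hexdigest()
    return values

def replay_inserts():
    pg = psycopg2.connect("dbname=analytics user=replica host=pg.example")
    stream = BinLogStreamReader(
        connection_settings=MYSQL,
        server_id=100,                  # must be unique among the MySQL replicas
        only_events=[WriteRowsEvent],   # updates and deletes omitted for brevity
        blocking=True,
        resume_stream=True,
    )
    cur = pg.cursor()
    for event in stream:
        for row in event.rows:
            values = obfuscate(event.table, dict(row["values"]))
            columns = ", ".join(values)
            placeholders = ", ".join(["%s"] * len(values))
            cur.execute(
                "INSERT INTO {}.{} ({}) VALUES ({})".format(
                    event.schema, event.table, columns, placeholders),
                list(values.values()),
            )
        pg.commit()                     # one transaction per binlog event

The real tool does considerably more (type mapping, DDL, updates and deletes, batching), but the flow - read a binlog event, transform the row, write it to PostgreSQL - is essentially this.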

The initial idea was to use PostgreSQL to obfuscate the data before feeding it into Fivetran.

However, because PostgreSQL’s performance was good with margins for scaling as our data grows, we decided to use just PostgreSQL for our data analytics and keep our customer’s data behind our perimeter.

A ninja elephant

PostgreSQL offers better performance, a stronger security model and better use of resources.

The issues with the views' validity and speed are now just a bad memory.

Analysts can now use the complex analytics functions offered by PostgreSQL 9.5.
Large tables, previously unusable because of their size, are now partitioned with pg_pathman and their data is usable again.
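As a rough illustration of the partitioning step (table and column names here are invented, and the call assumes the pg_pathman extension is installed), splitting a large table into monthly range partitions looks roughly like this:

import psycopg2

conn = psycopg2.connect("dbname=analytics user=admin host=pg.example")
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS pg_pathman")
    # Split the hypothetical transfers table into monthly range partitions
    # on its creation timestamp; existing rows are migrated into the partitions.
    cur.execute("""
        SELECT create_range_partitions(
            'public.transfers'::regclass,   -- parent table
            'created',                      -- partitioning column
            '2014-01-01'::timestamp,        -- start of the first partition
            interval '1 month'              -- partition size
        )
    """)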

Some code was optimised inside, but actually very little - maybe 10-20% was improved. We’ll do more of that in the future, but not yet. The good thing is that the performance gains we have can mostly be attributed just to PG vs MySQL. So there’s a lot of scope to improve further.
Jeff McClelland - Growth Analyst, data guru

Timing

Procedure | MySQL | PgSQL | PgSQL cached
Daily ETL script | 20 hours | 4 hours | N/A
Select from small table with complex aggregations | Killed after 20 minutes | 3 minutes | 1 minute
Large table scan with simple filters | 6 minutes | 2 minutes | 6 seconds

Resources

Resource | MySQL | PostgreSQL
Storage | 940 GB | 670 GB
CPU | 18 | 8
RAM | 68 GB | 48 GB
Shared Memory | 40 GB | 5 GB

Lessons learned

Never underestimate the resource consumption

During the development of the replication tool, the initialisation process required several improvements.

Resources are always finite and the out-of-memory killer is always happy to remind us of this simple but hard-to-grasp concept. Some tables required a custom slice size because their row length triggered the OOM killer when pulling out the data.

However, even after fixing the memory issues the initial copy took 6 days.

Tuning the copy with unbuffered cursors and row-count estimates improved the initial copy speed; it now completes in 30 hours, including the time required for the index build.
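For illustration, an unbuffered (server-side) cursor streams rows instead of loading the entire result set into client memory; a minimal sketch with pymysql (the driver choice, connection details and table name are assumptions) could look like this:

import pymysql
from pymysql.cursors import SSCursor

def copy_table(table, slice_size=10000):
    # Stream a large MySQL table in slices instead of buffering it all client-side.
    conn = pymysql.connect(host="mysql.example", user="replica",
                           password="secret", db="live", cursorclass=SSCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT * FROM " + table)
            while True:
                rows = cur.fetchmany(slice_size)   # slice size tuned per table
                if not rows:
                    break
                yield rows                          # hand each slice to the PostgreSQL loader
    finally:
        conn.close()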

Strictness is an illusion. MySQL doubly so

MySQL's lack of strictness is not a mystery.

The replica stopped because of the odd way NOT NULL is managed by MySQL.

To prevent any further replica breakdowns, fields with NOT NULL added via ALTER TABLE after the initialisation are created in PostgreSQL as nullable fields.

MySQL silently truncates character strings to the declared varchar size. This is a problem when the field is obfuscated on PostgreSQL, because the hashed string may not fit into the corresponding varchar field. Therefore, all character varying columns in the obfuscated schema are created as text.
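A quick way to see why: a hex digest is much longer than many of the source columns, so an obfuscated value that replaces, say, a varchar(20) field simply would not fit (SHA-256 here is just an example of a hashing choice).

import hashlib

nickname = "bob_the_builder"                            # fits a MySQL varchar(20)
hashed = hashlib.sha256(nickname.encode()).hexdigest()
print(len(nickname), len(hashed))                       # 15 vs 64 characters, hence text columns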

Idle in transaction can kill your database

Over time I saw the PostgreSQL tables used for storing MySQL's row images grow to an unacceptable size (tens of GB). This was caused by misbehaving sessions left idle in transaction.

A session that is idle in transaction holds a database snapshot open until the transaction is committed or rolled back. This is bad because normal vacuuming cannot reclaim dead rows that are still visible to that snapshot.

The quick fix was a cron job which removes those sessions. The long-term fix was to find out why those sessions appeared and fix the code causing the issue.
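The cron job itself can be as small as the sketch below: it asks pg_stat_activity for sessions that have been idle in transaction for too long and terminates them. The connection string and the one-hour threshold are assumptions.

import psycopg2

# Terminate sessions that have been idle in transaction for more than one hour.
with psycopg2.connect("dbname=analytics user=admin host=pg.example") as conn:
    with conn.cursor() as cur:
        cur.execute("""
            SELECT pg_terminate_backend(pid)
            FROM pg_stat_activity
            WHERE state = 'idle in transaction'
              AND state_change < now() - interval '1 hour'
        """)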

March 29, 2017

Four Years Remaining: Blockchain in Simple Terms

The following is an expanded version of an explanatory comment I posted here.

Alice's Diary

Alice decided to keep a diary. For that she bought a notebook, and started filling it with lines like:

  1. Bought 5 apples.
  2. Called mom.
    ....
  132. Gave Bob $250.
  133. Kissed Carl.
  134. Ate a banana.
    ...

Alice did her best to keep a meticulous account of events, and whenever she had a discussion with friends about something that happened earlier, she would quickly resolve all arguments by taking out the notebook and demonstrating her records. One day she had a dispute with Bob about whether she lent him $250 earlier or not. Unfortunately, Alice did not have her notebook at hand at the time of the dispute, but she promised to bring it tomorrow to prove Bob owed her money.

Bob really did not want to return the money, so that night he got into Alice's house, found the notebook, found line 132 and carefully replaced it with "132. Kissed Dave". The next day, when Alice opened the notebook, she did not find any records about money being given to Bob, and had to apologize for making a mistake.

Alice's Blockchain

A year later Bob's conscience got to him and he confessed his crime to Alice. Alice forgave him, but decided to improve the way she kept the diary, to avoid the risk of forging records in the future. Here's what she came up with. The operating system Linups that she was using had a program named md5sum, which could convert any text to its hash - a strange sequence of 32 characters. Alice did not really understand what the program did with the text, it just seemed to produce a sufficiently random sequence. For example, if you entered "hello" into the program, it would output "b1946ac92492d2347c6235b4d2611184", and if you entered "hello " with a space at the end, the output would be "1a77a8341bddc4b45418f9c30e7102b4".

Alice scratched her head a bit and invented the following way of making record forging more complicated for people like Bob in the future: after each record she would insert a hash, obtained by feeding the md5sum program with the text of the record and the previous hash. The new diary now looked as follows:

  0000 (the initial hash; let us limit ourselves to just four digits for brevity)
  1. Bought 5 apples.
  4178 (the hash of "0000" and "Bought 5 apples")
  2. Called mom.
  2314 (the hash of "4178" and "Called mom")
    ...
  4492
  132. Gave Bob $250.
  1010 (the hash of "4492" and "Gave Bob $250")
  133. Kissed Carl.
  8204 (the hash of "1010" and "Kissed Carl")
    ...

Now each record was "confirmed" by a hash. If someone wanted to change line 132 to something else, they would have to change the corresponding hash (it would not be 1010 anymore). This, in turn, would affect the hash of line 133 (which would not be 8204 anymore), and so on all the way to the end of the diary. In order to change one record, Bob would have to rewrite the confirmation hashes for all the following diary records, which is fairly time-consuming. This way, hashes "chain" all records together, and what was before a simple journal has now become a chain of records, or "blocks" - a blockchain.
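For the programmatically inclined, Alice's scheme fits in a few lines of Python. This is only an illustration of the idea - md5 truncated to four characters, as in the diary above - so the hashes it prints will not match the example numbers.

import hashlib

def confirm(prev_hash, record):
    # Hash the previous confirmation together with the new record, keeping 4 characters for brevity.
    return hashlib.md5((prev_hash + record).encode()).hexdigest()[:4]

diary = ["Bought 5 apples.", "Called mom.", "Gave Bob $250.", "Kissed Carl."]
h = "0000"                      # the initial hash
for record in diary:
    h = confirm(h, record)
    print(record, h)
# Changing any earlier record changes its hash, and with it every hash that follows.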

Proof-of-Work Blockchain

Time passed, and Alice opened a bank. She still kept her diary, which now included serious banking records like "Gave out a loan" or "Accepted a deposit". Every record was accompanied by a hash to make forging harder. Everything was fine, until one day a guy named Carl took a loan of $1000000. The next night a team of twelve elite Chinese diary hackers (hired by Carl, of course) got into Alice's room, found the journal and replaced in it the line "143313. Gave out a $1000000 loan to Carl" with a new version: "143313. Gave out a $10 loan to Carl". They then quickly recomputed all the necessary hashes for the following records. For a dozen hackers armed with calculators this did not take too long.

Fortunately, Alice saw one of the hackers retreating and understood what happened. She needed a more secure system. Her new idea was the following: let us append a number (called "nonce") in brackets to each record, and choose this number so that the confirmation hash for the record would always start with two zeroes. Because hashes are rather unpredictable, the only way to do it is to simply try out different nonce values until one of them results in a proper hash:

  0000
  1. Bought 5 apples (22).
  0042 (the hash of "0000" and "Bought 5 apples (22)")
  2. Called mom (14).
  0089 (the hash of "0042" and "Called mom (14)")
    ...
  0057
  132. Gave Bob $250 (33).
  0001
  133. Kissed Carl (67).
  0093 (the hash of "0001" and "Kissed Carl (67)")
    ...

To confirm each record one now needs to try, on average, about a hundred different nonce values, which makes it about a hundred times harder to add new records, or to forge them, than previously. Hopefully even a team of hackers wouldn't manage it in time. Because each confirmation now requires hard (and somewhat senseless) work, the resulting method is called a proof-of-work system.
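In code, the nonce search is nothing but a brute-force loop; a minimal sketch (reusing the toy 4-character hashes from above) might look like this:

import hashlib

def confirm(prev_hash, record):
    return hashlib.md5((prev_hash + record).encode()).hexdigest()[:4]

def find_nonce(prev_hash, record):
    # Try nonce values until the confirmation hash starts with two zeroes.
    nonce = 0
    while True:
        candidate = "{} ({})".format(record, nonce)
        h = confirm(prev_hash, candidate)
        if h.startswith("00"):
            return nonce, h
        nonce += 1

print(find_nonce("0000", "Bought 5 apples"))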

Distributed Blockchain

Tired of having to search for matching nonces for every record, Alice hired five assistants to help her maintain the journal. Whenever a new record needed to be confirmed, the assistants would start seeking a suitable nonce in parallel, until one of them completed the job. To motivate the assistants to work faster, she allowed them to append the name of the person who found a valid nonce to the record, and promised to give promotions to those who confirmed the most records within a year. The journal now looked as follows:

  0000
  1. Bought 5 apples (29, nonce found by Mary).
  0013 (the hash of "0000" and "Bought 5 apples (29, nonce found by Mary)")
  2. Called mom (45, nonce found by Jack).
  0089 (the hash of "0013" and "Called mom (45, nonce found by Jack)")
    ...
  0068
  132. Gave Bob $250 (08, nonce found by Jack).
  0028
  133. Kissed Carl (11, nonce found by Mary).
  0041
    ...

A week before Christmas, two assistants came to Alice seeking a Christmas bonus. Assistant Jack showed a diary in which he had confirmed 140 records and Mary 130, while Mary showed a diary in which she, reportedly, had confirmed more records than Jack. Each of them was showing Alice a journal with all the valid hashes, but with different entries! It turns out that ever since finding out about the promotion, the two assistants had been working hard to keep their own journals, such that all nonces would carry their names. Since they had to maintain their journals individually, each had to do all the confirmation work alone rather than splitting it with the other assistants. This of course made them so busy that they eventually missed some important entries about Alice's bank loans.

Consequently, Jack's and Mary's "own journals" ended up being shorter than the "real journal", which was, luckily, correctly maintained by the three other assistants. Alice was disappointed and, of course, gave neither Jack nor Mary a promotion. "I will only give promotions to assistants who confirm the most records in the valid journal", she said. And the valid journal is the one with the most entries, of course, because the most work has been put into it!

After this rule was established, the assistants had no more motivation to cheat by working on their own journals alone - a collective honest effort always produced a longer journal in the end. This rule allowed the assistants to work from home and completely without supervision. Alice only needed to check that the journal had the correct hashes in the end, when distributing promotions. This way, Alice's blockchain became a distributed blockchain.

Bitcoin

Jack happened to be much more effective at finding nonces than Mary and eventually became a Senior Assistant to Alice. He did not need any more promotions. "Could you transfer some of the promotion credits you got from confirming records to me?", Mary asked him one day. "I will pay you $100 for each!". "Wow", Jack thought, "apparently all the confirmations I did still have some value for me now!". They spoke with Alice and invented the following way to make "record confirmation achievements" transferable between parties.

Whenever an assistant found a matching nonce, they would not simply write their own name to indicate who did it. Instead, they would write their public key. The agreement with Alice was that the corresponding confirmation bonus would belong to whoever owned the matching private key:

  0000
  1. Bought 5 apples (92, confirmation bonus to PubKey61739).
  0032 (the hash of "0000" and "Bought 5 apples (92, confirmation bonus to PubKey61739)")
  2. Called mom (52, confirmation bonus to PubKey55512).
  0056 (the hash of "0032" and "Called mom (52, confirmation bonus to PubKey55512)")
    ...
  0071
  132. Gave Bob $250 (22, confirmation bonus to PubKey61739).
  0088
  133. Kissed Carl (40, confirmation bonus to PubKey55512).
  0012
    ...

To transfer confirmation bonuses between parties a special type of record would be added to the same diary. The record would state which confirmation bonus had to be transferred to which new public key owner, and would be signed using the private key of the original confirmation owner to prove it was really his decision:

  0071
  132. Gave Bob $250 (22, confirmation bonus to PubKey6669).
  0088
  133. Kissed Carl (40, confirmation bonus to PubKey5551).
  0012
    ...
  0099
  284. TRANSFER BONUS IN RECORD 132 TO OWNER OF PubKey1111, SIGNED BY PrivKey6669. (83, confirmation bonus to PubKey4442).
  0071

In this example, record 284 transfers the bonus for confirming record 132 from whoever it belonged to before (the owner of private key 6669, presumably Jack in our example) to a new party - the owner of private key 1111 (who could be Mary, for example). As it is still a record, there is also the usual bonus for having confirmed it, which went to the owner of private key 4442 (who could be John, Carl, Jack, Mary or whoever else - it does not matter here). In effect, record 284 describes two different bonuses - one due to the transfer, and another for the confirmation. These, if necessary, can be further transferred to different parties later using the same procedure.
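The essential step of such a transfer record is an ordinary digital signature. A sketch, assuming the third-party ecdsa package (the key names are, of course, made up):

from ecdsa import SigningKey, NIST256p

# Jack's key pair; the public key is what appears in the diary as "PubKey6669".
jack_private = SigningKey.generate(curve=NIST256p)
jack_public = jack_private.verifying_key

# The transfer claim, signed by the current owner of the bonus.
claim = b"TRANSFER BONUS IN RECORD 132 TO OWNER OF PubKey1111"
signature = jack_private.sign(claim)

# Any assistant (or anyone else) can check the claim against the public key.
assert jack_public.verify(signature, claim)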

Once this system was implemented, it turned out that Alice's assistants and all their friends started actively using the "confirmation bonuses" as a kind of internal currency, transferring them between each other's public keys and even exchanging them for goods and actual money. Note that to buy a "confirmation bonus" one does not need to be Alice's assistant or to register anywhere. One just needs to provide a public key.

This confirmation bonus trading activity became so prominent that Alice stopped using the diary for her own purposes, and eventually all the records in the diary were only about "who transferred which confirmation bonus to whom". This idea of a distributed, proof-of-work-based blockchain with transferable confirmation bonuses is known as Bitcoin.

Smart Contracts

But wait, we are not done yet. Note how Bitcoin is born from the idea of recording "transfer claims", cryptographically signed by the corresponding private key, into a blockchain-based journal. There is no reason we have to limit ourselves to this particular cryptographic protocol. For example, we could just as well make the following records:

  1. Transfer bonus in record 132 to whoever can provide signatures, corresponding to PubKey1111 AND PubKey3123.

This would be an example of a collective deposit, which may only be extracted by a pair of collaborating parties. We could generalize further and consider conditions of the form:

  1. Transfer bonus in record 132 to whoever first provides x, such that f(x) = \text{true}.

Here f(x) could be any predicate describing a "contract". For example, in Bitcoin the contract requires x to be a valid signature, corresponding to a given public key (or several keys). It is thus a "contract", verifying the knowledge of a certain secret (the private key). However, f(x) could just as well be something like:

    \[f(x) = \text{true, if }x = \text{number of bytes in record #42000},\]

which would be a kind of a "future prediction" contract - it can only be evaluated in the future, once record 42000 becomes available. Alternatively, consider a "puzzle solving contract":

    \[f(x) = \text{true, if }x = \text{valid, machine-verifiable proof of a complex theorem}.\]

Finally, the first part of the contract, namely the phrase "Transfer bonus in record ..." could also be fairly arbitrary. Instead of transferring "bonuses" around we could just as well transfer arbitrary tokens of value:

  284. Whoever first provides x, such that f(x) = \text{true}, will be DA BOSS.
    ...
  x = 42 satisfies the condition in record 284.
  Now and forever, John is DA BOSS!

The value and importance of such arbitrary tokens will, of course, be determined by how they are perceived by the community using the corresponding blockchain. It is not unreasonable to envision situations where being DA BOSS gives certain rights in the society, and having this fact recorded in an automatically verifiable public ledger makes it possible to include this knowledge in various automated systems (e.g. consider a door lock which would only open to whoever is currently known as DA BOSS in the blockchain).
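In this generalised view a contract is just a predicate attached to a record. A toy sketch of the idea (in a real blockchain the predicate is written in a restricted, machine-verifiable script language rather than arbitrary Python):

def puzzle_contract(x):
    # A toy "puzzle solving" contract: claimable by whoever provides a root of x*x == 1764.
    return x * x == 1764

def can_claim(contract, x):
    # The bonus in a record moves to whoever first provides an x the contract accepts.
    return contract(x)

print(can_claim(puzzle_contract, 42))   # True: 42 solves the puzzle, so the claimant gets the bonus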

Honest Computing

As you see, we can use a distributed blockchain to keep journals, transfer "coins" and implement "smart contracts". These three applications are, however, all consequences of one general, core property. The participants of a distributed blockchain ("assistants" in the Alice example above, or "miners" in Bitcoin-speak) are motivated to precisely follow all rules necessary for confirming the blocks. If the rules say that a valid block is the one where all signatures and hashes are correct, the miners will make sure these indeed are. If the rules say that a valid block is the one where a contract function needs to be executed exactly as specified, the miners will make sure it is the case, etc. They all seek to get their confirmation bonuses, and they will only get them if they participate in building the longest honestly computed chain of blocks.

Because of that, we can envision blockchain designs where a "block confirmation" requires running arbitrary computational algorithms, provided by the users, and the greedy miners will still execute them exactly as stated. This general idea lies behind the Ethereum blockchain project.

There is just one place in the description provided above where miners have some motivational freedom not to be perfectly honest. It is the decision about which records to include in the next block to be confirmed (or which algorithms to execute, if we consider the Ethereum blockchain). Nothing really prevents a miner from refusing to ever confirm a record "John is DA BOSS", ignoring it as if it never existed at all. This problem is overcome in modern blockchains by having users offer an additional "tip money" reward for each record included in the confirmed block (or for every algorithmic step executed on the Ethereum blockchain). This aligns the motivation of the network towards maximizing the number of records included, making sure none is lost or ignored. Even if some miners had something against John being DA BOSS, there would probably be enough other participants who would not turn down the opportunity of getting an additional tip.

Consequently, the whole system is economically incentivised to follow the protocol, and the term "honest computing" seems appropriate to me.

Now that you know how things work, feel free to transfer all your bitcoins (i.e. block confirmation bonuses for which you know the corresponding private keys) to the address 1JuC76CX4FGo3W3i2Xv7L86Vz4chHHg71m (i.e. a public key, to which I know the corresponding private key).

March 27, 2017

Four Years Remaining: Implication and Provability

Consider the following question:

Which of the following two statements is logically true?

  1. All planets of the Solar System orbit the Sun. The Earth orbits the Sun. Consequently, the Earth is a planet of the Solar System.
  2. God is the creator of all things which exist. The Earth exists. Consequently, God created the Earth.

I've seen this question, or variations of it, pop up as "provocative" posts in social networks several times. At times they might invite lengthy discussions, where the participants would split into camps - some claim that the first statement is true, because the Earth is indeed a planet of the Solar System and God did not create the Earth. Others would laugh at the stupidity of their opponents and argue that, obviously, only the second statement is correct, because it makes a valid logical implication, while the first one does not.

Not once, however, have I ever seen a proper formal explanation of what is happening here. And although it is fairly trivial (once you know it), I guess it is worth writing up. The root of the problem here is the difference between implication and provability - something I myself remember struggling a bit to understand when I first had to encounter these notions in a course on mathematical logic years ago.

Indeed, any textbook on propositional logic will tell you in one of the first chapters that you may write

    \[A \Rightarrow B\]

to express the statement "A implies B". A chapter or so later you will learn that there is also a possibility to write

    \[A \vdash B\]

to express the confusingly similar statement that "B is provable from A". To confirm your confusion, another chapter down the road you should discover that A \Rightarrow B is the same as \vdash A \Rightarrow B, which, in turn, is logically equivalent to A \vdash B. Therefore, indeed, whenever A \Rightarrow B is true, A \vdash B is true, and vice versa. Is there a difference between \vdash and \Rightarrow then, and why do we need the two different symbols at all? The "provocative" question above provides an opportunity to illustrate this.

The spoken language is rather informal, and there can be several ways of formally interpreting the same statement. Both statements in the puzzle are given in the form "A, B, consequently C". Here are at least four different ways to put them formally, which make the two statements true or false in different ways.

The Pure Logic Interpretation

Anyone who has enough experience solving logic puzzles would know that both statements should be interpreted as abstract claims about provability (i.e. deducibility):

    \[A, B \vdash C.\]

As mentioned above, this is equivalent to

    \[(A\,\&\, B) \Rightarrow C.\]

or

    \[\vdash (A\,\&\, B) \Rightarrow C.\]

In this interpretation the first statement is wrong and the second is a correct implication.

The Pragmatic Interpretation

People who have less experience with math puzzles would often assume that they should not exclude their common sense knowledge from the task. The corresponding formal statement of the problem then becomes the following:

    \[[\text{common knowledge}] \vdash (A\,\&\, B) \Rightarrow C.\]

In this case both statements become true. The first one is true simply because the consequent C is true on its own, given common knowledge (the Earth is indeed a planet) - the antecedents and provability do not play any role at all. The second is true because it is a valid reasoning, independently of the common knowledge.

This type of interpretation is used in rhetorical phrases like "If this is true, I am a Dutchman".

The Overly Strict Interpretation

Some people may prefer to believe that a logical statement should only be deemed correct if every single part of it is true and logically valid. The two claims must then be interpreted as follows:

    \[([\text{common}] \vdash A)\,\&\, ([\text{common}] \vdash B)\,\&\, (A, B\vdash C).\]

Here the issue of provability is combined with the question about the truthfulness of the facts used. Both statements are false - the first fails on logic, and the second on facts (assuming that God creating the Earth is not part of common knowledge).

The Oversimplified Interpretation

Finally, people very unfamiliar with strict logic would sometimes tend to ignore the words "consequently", "therefore" or "then", interpreting them as a kind of an extended synonym for "and". In their minds the two statements could be regarded as follows:

    \[[\text{common}] \vdash A\,\&\, B\,\&\, C.\]

From this perspective, the first statement becomes true and the second (again, assuming the aspects of creation are not commonly known) is false.

Although the author of the original question most probably did really assume the "pure logic" interpretation, as is customary for such puzzles, note how much leeway there can be when converting a seemingly simple phrase in English to a formal statement. In particular, observe that questions about provability, where you deliberately have to abstain from relying on common knowledge, may be different from questions about facts and implications, where common sense may (or must) be assumed and you can sometimes skip the whole "reasoning" part if you know the consequent is true anyway.

Here is a quiz question to check whether you understood what I meant to explain.

"The sky is blue, and therefore the Earth is round." True or false?

March 26, 2017

TransferWise Tech Blog: 5 Tips for Getting More Out of Your Unit Tests

State of Application Design


In the vast majority of applications I have seen, the domain logic is implemented using a set of Service classes (Transaction Scripts). The majority of these are based on the DB structure. Entities are typically quite thin DTOs that have little or no logic.

The main benefit of this kind of architecture is that it is very simple and indeed often good enough as a starting point. However, the problem is that over time, as the application gets more complex, this kind of approach does not scale too well. Often you end up with Services that call 6-8 other Services. Many of these Services have no clear responsibilities but are built in an ad-hoc manner as wrappers of existing Services, adding tiny bits of logic needed for some specific new feature.

So how do you avoid, or dig yourself out of, this kind of architecture? One approach I have found very useful is paying attention to the unit tests while writing them. By listening to what my tests are trying to tell me, I am able to build a much better design. This is nothing else but the "Driven" part of TDD, which everybody knows about but which is still quite hard to understand.

Indeed, it is quite easy to write tests before the production code and yet not let these tests have any significant effect on the production code. Sometimes there is also the thinking that testing is supposed to be hard, in which case it is particularly easy to ignore the "smells" coming from the tests.

The following are some rules I try to follow when writing tests. I have found that these ideas help me avoid fighting my tests, and as a result not only the tests but also the production code end up better.

In the following text I use "spec" to refer to a single test class/file.

Rule 1: when a spec is more than 120 lines, split it

When a spec is too long I am no longer able to grasp it quickly. The specific number does not matter, but I have found around 120 lines to be a good threshold for myself. With a very large test file it gets hard to detect duplication/overlap when adding new test methods. It also becomes harder to understand the behavior being tested.

Rule 2: when test names have duplication it is often a sign that you should split the spec

Typically unit tests are 1:1 mapped to each production class. So tests often need to specify what exact part of the target class is being tested. This is especially common for the above mentioned Services which are often just collections of different kinds of procedures.

Let's say that we have a PaymentMethodService which has tests like:

def "when gets payment methods for EUR then returns single card method"()  
def "when gets payment methods for non-EUR then returns debit and credit as separate methods"()  
def "when gets payment methods then returns only enabled methods"()  
def "when gets payment methods for a known user then orders them based on past usage"()  
def "when gets payment methods for transfer amount > 2000 GBP then returns bank transfer as the first method"()  
...

These tests all repeat "when gets payment methods". So maybe we can create a new spec for getting payment methods and simply drop the duplicated prefix from all of the test names. The result will be:

class GetPaymentMethodsSpec {  
  def "returns only enabled methods"()
  def "when user is known then orders methods based on past usage"()
  def "for transfer amount > 2000 GBP bank transfer is the first method"()
  ...
}

Note that the spec name does not contain the name of any production class. If I can find a good name that contains the tested class I don't mind, but if it gets in the way then I'm willing to let go of Ctrl+Shift+T. This aligns with Uncle Bob's idea that test and production code evolve in different directions.

Rule 3: when you have split a spec that grew too long, always consider whether you should split/extract something in the production code as well

If there are many tests for something, it means that the tested behavior is complex. If something is complex, it should be split apart. Lines of code are often not a good indicator of complexity, as you can easily hide multiple branches/conditions in a single line.

From the previous example if we have multiple tests around the ordering of payment methods it may be a good sign that ordering could be extracted into a separate class like PaymentMethodOrder.

Rule 4: when a test contains a lot of interactions, introduce some new concept in the production code

When looking at the tests for such Transaction Script Services, they often contain a lot of interactions. This makes writing the tests very hard, and rightly so: there is clearly too much going on at once and we are better off splitting it.

Rule 5: extract a new class when you find yourself wanting to stub out a method of the tested class

When you think that you need to mock/stub some class partially, this is generally a bad idea. What the test is telling you is that you have too much behavior crammed together.

You have 2 choices:

  • don't mock it and use the production implementation
  • if your test becomes too complex or you need too many similar tests, then extract that logic out into a separate class and test that part of the behavior separately

You can also check out my post from a few years ago for more tips on writing good unit tests.

(Image: still from Ridley Scott's Blade Runner)

March 20, 2017

Four Years Remaining: The Schrödinger's Cat Uncertainty

Ever since Erwin Schrödinger described a thought experiment in which a cat in a sealed box happened to be "both dead and alive at the same time", popular science writers have been relying on it heavily to convey the mysteries of quantum physics to the layman. Unfortunately, instead of providing any useful intuition, this example has laid a solid base for a whole bunch of misconceptions. Having read or heard something about the strange cat, people tend to jump to profound conclusions, such as "according to quantum physics, cats can be both dead and alive at the same time" or "the notion of a conscious observer is important in quantum physics". All of these are wrong, as is the image of a cat that is "both dead and alive at the same time". The corresponding Wikipedia page does not stress this fact well enough, hence I thought the Internet might benefit from yet another explanatory post.

The Story of the Cat

The basic notion in quantum mechanics is a quantum system. Pretty much anything could be modeled as a quantum system, but the most common examples are elementary particles, such as electrons or photons. A quantum system is described by its state. For example, a photon has polarization, which could be vertical or horizontal. Another prominent example of a particle's state is its wave function, which represents its position in space.

There is nothing special about saying that things have state. For example, we may say that any cat has a "liveness state", because it can be either "dead" or "alive". In quantum mechanics we would denote these basic states using the bra-ket notation as |\mathrm{dead}\rangle and |\mathrm{alive}\rangle. The strange thing about quantum mechanical systems, though, is the fact that quantum states can be combined together to form superpositions. Not only could a photon have a purely vertical polarization \left|\updownarrow\right\rangle or a purely horizontal polarization \left|\leftrightarrow\right\rangle, but it could also be in a superposition of both vertical and horizontal states:

    \[\left|\updownarrow\right\rangle + \left|\leftrightarrow\right\rangle.\]

This means that if you asked the question "is this photon polarized vertically?", you would get a positive answer with 50% probability - in another 50% of cases the measurement would report the photon as horizontally-polarized. This is not, however, the same kind of uncertainty that you get from flipping a coin. The photon is not either horizontally or vertically polarized. It is both at the same time.

Amazed by this property of quantum systems, Schrödinger attempted to construct an example, where a domestic cat could be considered to be in the state

    \[|\mathrm{dead}\rangle + |\mathrm{alive}\rangle,\]

which means being both dead and alive at the same time. The example he came up with, in his own words (citing from Wikipedia), is the following:

A cat is penned up in a steel chamber, along with the following device (which must be secured against direct interference by the cat): in a Geiger counter, there is a tiny bit of radioactive substance, so small, that perhaps in the course of the hour one of the atoms decays, but also, with equal probability, perhaps none; if it happens, the counter tube discharges and through a relay releases a hammer that shatters a small flask of hydrocyanic acid. If one has left this entire system to itself for an hour, one would say that the cat still lives if meanwhile no atom has decayed. The first atomic decay would have poisoned it.

The idea is that after an hour of waiting, the radioactive substance must be in the state

    \[|\mathrm{decayed}\rangle + |\text{not decayed}\rangle,\]

the poison flask should thus be in the state

    \[|\mathrm{broken}\rangle + |\text{not broken}\rangle,\]

and the cat, consequently, should be

    \[|\mathrm{dead}\rangle + |\mathrm{alive}\rangle.\]

Correct, right? No.

The Cat Ensemble

Superposition, which means being "in both states at once", is not the only type of uncertainty possible in quantum mechanics. There is also the "usual" kind of uncertainty, where a particle is in either of two states, we just do not know exactly which one. For example, if we measure the polarization of a photon, which was originally in the superposition \left|\updownarrow\right\rangle + \left|\leftrightarrow\right\rangle, there is a 50% chance the photon will end up in the state \left|\updownarrow\right\rangle after the measurement, and a 50% chance the resulting state will be \left|\leftrightarrow\right\rangle. If we do the measurement but do not look at the outcome, we know that the resulting state of the photon must be either of the two options. It is not a superposition anymore. Instead, the corresponding situation is described by a statistical ensemble:

    \[\{\left|\updownarrow\right\rangle: 50\%, \quad\left|\leftrightarrow\right\rangle: 50\%\}.\]

Although it may seem that the difference between a superposition and a statistical ensemble is a matter of terminology, it is not. The two situations are truly different and can be distinguished experimentally. Essentially, every time a quantum system is measured (which happens, among other things, every time it interacts with a non-quantum system) all the quantum superpositions are "converted" to ensembles - concepts native to the non-quantum world. This process is sometimes referred to as decoherence.

Now recall the Schrödinger's cat. For the cat to die, a Geiger counter must register a decay event, triggering a killing procedure. The registration within the Geiger counter is effectively an act of measurement, which will, of course, "convert" the superposition state into a statistical ensemble, just like in the case of a photon which we just measured without looking at the outcome. Consequently, the poison flask will never be in a superposition of being "both broken and not". It will be either, just like any non-quantum object should. Similarly, the cat will also end up being either dead or alive - you just cannot know exactly which option it is before you peek into the box. Nothing special or quantum'y about this.

The Quantum Cat

"But what gives us the right to claim that the Geiger counter, the flask and the cat in the box are "non-quantum" objects?", an attentive reader might ask here. Could we imagine that everything, including the cat, is a quantum system, so that no actual measurement or decoherence would happen inside the box? Could the cat be "both dead and alive" then?

Indeed, we could try to model the cat as a quantum system with |\mathrm{dead}\rangle and |\mathrm{alive}\rangle being its basis states. In this case the cat indeed could end up in the state of being both dead and alive. However, this would not be its most exciting capability. Way more surprisingly, we could then kill and revive our cat at will, back and forth, by simply measuring its liveness state appropriately. It is easy to see how this model is unrepresentative of real cats in general, and the worry about them being able to be in superposition is just one of the many inconsistencies. The same goes for the flask and the Geiger counter, which, if considered to be quantum systems, would get the magical abilities to "break" and "un-break", to "measure" and "un-measure" particles at will. Those would certainly not be a real-world flask or counter anymore.

The Cat Multiverse

There is one way to bring quantum superposition back into the picture, although it requires some rather abstract thinking. There is a theorem in quantum mechanics, which states that any statistical ensemble can be regarded as a partial view of a higher-dimensional superposition. Let us see what this means. Consider a (non-quantum) Schrödinger's cat. As it might be hopefully clear from the explanations above, the cat must be either dead or alive (not both), and we may formally represent this as a statistical ensemble:

    \[\{\left|\text{dead}\right\rangle: 50\%, \quad\left|\text{alive}\right\rangle: 50\%\}.\]

It turns out that this ensemble is mathematically equivalent in all respects to a superposition state of a higher order:

    \[\left|\text{Universe A}, \text{dead}\right\rangle + \left|\text{Universe B}, \text{alive}\right\rangle,\]

where "Universe A" and "Universe B" are some abstract, unobservable "states of the world". The situation can be interpreted by imagining two parallel universes: one where the cat is dead and one where it is alive. These universes exist simultaneously in a superposition, and we are present in both of them at the same time, until we open the box. When we do, the universe superposition collapses to a single choice of the two options and we are presented with either a dead, or a live cat.

Yet, although the universes happen to be in a superposition here, existing both at the same time, the cat itself remains completely ordinary, being either totally dead or fully alive, depending on the chosen universe. The Schrödinger's cat is just a cat, after all.

March 07, 2017

Four Years Remaining: The Difficulties of Self-Identification

Ever since the "Prior Confusion" post I was planning to formulate one of its paragraphs as the following abstract puzzle, but somehow it took me 8 years to write it up.

According to fictional statistical studies, the following is known about a fictional chronic disease "statistite":

  1. About 30% of people in the world have statistite.
  2. About 35% of men in the world have it.
  3. In Estonia, 20% of people have statistite.
  4. Out of people younger than 20 years, just 5% have the disease.
  5. A recent study of a random sample of visitors to the Central Hospital demonstrated that 40% of them suffer from statistite.

Mart, a 19-year-old Estonian male medical student, is standing in the foyer of the Central Hospital, reading these facts from an information sheet and wondering: what are his current chances of having statistite? How should he model himself: should he consider himself primarily "an average man", "a typical Estonian", "just a young person", or "an average visitor of the hospital"? Could he combine the different aspects of his personality to make better use of the available information? How? In general, what would be the best possible probability estimate, given the data?

March 02, 2017

Ingmar Tammeväli: In defence of the forests…

Until now there has been no need to curse and swear - we have thousands of Delfi commenters for that.

But now I had to write a decidedly non-technical post. I am not a forestry specialist, but what my eyes see is horrible.
Since most people can already see this horror, I decided it was time to speak up as well.

A war of sorts has started against the Estonian forest: essentially, wherever you drive there are ravaged forest plots where nothing grows any more.
Various companies have appeared that comb through property registers and felling notices and pressure owners over the phone: sell, sell.

Quite literally, and without irony, our beautiful forests already look like a failed Brazilian wax on a sheep…

My questions:
* Why is it allowed to clear-cut large tracts of forest without having to plant anything in their place?
My proposal: before any felling may begin at all, a forestry official (for example from the municipality) performs an assessment, and when the felling is done a deposit of 35% of the forest's value is charged.
In plain language: plant a new forest in its place (within 6 months) and you get the 35% back; don't plant, and you lose the money.

* Forwarding tractors and timber trucks are destroying the village roads. The requirement to restore them within 6-7 months seems to have been a joke; most forestry companies do not do it and the officials are rather toothless. The police cannot be bothered to deal with them - in plain language, they have no resources.

* Why was the age limit lowered for spruce stands that may be felled?

So the point of this whole text: dear politicians, if you have any respect at all for Estonia's values, put an end to this mafia-style forest management - it is not management, it is clear-cutting!

OECD: only one developed industrial country logs its forests more intensively than Estonia


February 23, 2017

Kuido tehnokajam: Closing a modal overlay window with the ESC key

Surprisingly, I could not find a simple solution for this - you have to fiddle with JavaScript. Visually it looks like this: you click somewhere and a window with a screen overlay opens. We use a ModalPopupExtender, in front of which the contents of a UserControl are shown: <asp:Label runat="server" ID="HForModal" style="display: none" /> <asp:Panel runat="server" ID="P1" ScrollBars="Auto" Wrap="true"  Width="80%" CssClass="modalPopup">

Raivo Laanemets: Chrome 56 on Slackware 14.1

Chrome 56 on Slackware 14.1 requires the upgraded mozilla-nss package. Without the upgraded package you get errors on some HTTPS pages, including on google.com itself:

Your connection is not private.

with a detailed error code below:

NET::ERR_CERT_WEAK_SIGNATURE_ALGORITHM

The error comes from a bug in the NSS package. This is explained here in more detail. Slackware maintainers have released upgrades to the package. Upgrading the package and restarting Chrome fixes the error.

February 18, 2017

Anton Arhipov: Java EE meets Kotlin

Here's an idea - what if one tried implementing a Java EE application with the Kotlin programming language? So I thought a simple example, a servlet with an injected CDI bean, would be sufficient for a start.

Start with a build script:

(Gist: https://gist.github.com/antonarhipov/db4f4002c6a1813d349b)

And the project structure is as follows:

Here comes the servlet:

(Gist: https://gist.github.com/antonarhipov/4fbf350a6a0cdb06ff86)

What's cool about it?

First, it is Kotlin, and it works with the Java EE APIs - that is nice! Second, I kind of like the ability to set aliases for the imported classes: import javax.servlet.annotation.WebServlet as web, in the example.


What's ugly about it?

Safe calls everywhere. As we're working with Java APIs, we're forced to use safe calls in Kotlin code. This is ugly.


Next, in Kotlin, the field has to be initialized. So initializing the 'service' field with the null reference creates a "nullable" type. This also forces us to use either the safe call or the !! operator later in the code. The attempt to "fix" this by using a constructor parameter instead of the field failed for me: the CDI container could not satisfy the dependency on startup.


Alternatively, we could initialize the field with the instance of HelloService. Then, the container would re-initialize the field with the real CDI proxy and the safe call would not be required.


Conclusions

It is probably too early to say anything for sure, as the demo application is so small. One would definitely need to write much more code to uncover the corner cases. However, some of the outcomes are quite obvious:

  • Using Kotlin in a Java web application appears to be quite seamless.
  • The use of Java APIs creates the need for safe calls in Kotlin, which doesn't look very nice.

February 06, 2017

TransferWise Tech Blog: When to Adopt the Next Cool Technology?

What should be the criteria for an organization to decide when it is a good time to update its toolbox?

Recently there has been a lot of discussion about the fatigue around JavaScript and frontend tools in general. Although it seems to be more painful on the frontend, the problem is neither specific to the frontend nor anything new or recent. There are two sides to this. One is the effect it has on one's personal development. The other is how it affects organizations. More specifically, how should an organization decide when it is a good time to bring in new tool/framework/language X?

When we recently discussed this topic my colleague Jordan Valdma came up with the following formula to decide when adoption makes sense:


new features + developer coolness > cost of adoption

Cost of Adoption

Introducing anything new means a loss of efficiency until you have mastered it well enough. Following the model of Shu-Ha-Ri (follow-detach-fluent), it may be relatively easy to get to the first level - "following". However, it is only when moving to the next levels that one starts cashing in more of the potential value. That means looking beyond the specific feature set of the tool, searching for ways to decouple oneself from it and to employ it for something more fundamental. One of my favorite examples is using hexagonal architecture with Ruby on Rails.

New Features

By new features I mean the things that are actually valuable for your product. There are many aspects of any new thing that are hard to measure and quite subjective. These should not go here. For example, "allows writing more maintainable code" - this is very hard to prove and seems more like something one may choose to believe or not. However, there are also things like "supports server-side rendering". If we know our product could take advantage of this, then it is a good, objective reason for adoption.

Developer Coolness

I think when it comes to new/cool technologies it is always good to be pragmatic. In an organization that is heavily business/outcome oriented it may seem that there should be no room for non-rational arguments like how someone feels about some new language/library.

However, it is quite dangerous to completely ignore the attractiveness aspect of a technology. There are two points to keep in mind. First, all good devs like to expand their skill set. Second, technologies that have a certain coolness about them tend to build stronger communities around them and hence have the potential to grow even more compelling features.

February 01, 2017

TransferWise Tech Blog: Building TransferWise, or the road to a product engineer

Soon it will be my five-year anniversary at TransferWise. I looked back. I wrote down what came to my mind.

I was hired as an engineer. I thought I was hired to write code, and that is what I started doing. Simple duty: take a task from the ticketing system, implement it and move on to the next one. Easy. One of my first tickets was the following: "add a checkbox to the page with a certain functionality". Easy. I did that, and then Kristo asked for a call and asked me a very simple question: "Why have you done it?". I tried to reply something, but other questions followed... You know how I felt? I felt miserable, confused and disoriented. I remember I said clearly: "I feel very stupid." Kristo replied: "It is fine." Then we had a long chat and I spent the next couple of weeks on that task. I talked to people, trying to understand why that checkbox was needed and what it actually meant. I designed a new layout for the page. I implemented the solution. After that I kept coding. I still believed that it was my duty and that this is what I was hired for. But guess what? Kristo kept asking questions. Slowly but steadily it dawned on me that it is not the coding that I am supposed to be doing. I found myself doing a variety of activities. Talking to customers and analysing their behavior. Supporting new joiners and building the team. Designing pages. Building a vision. Many other things and, of course, writing code.

At some point I understood: this had stopped being easy. It had become very hard and challenging. All kinds of questions were floating through my head, including these: "Why am I hired at all?", "What should I be doing?", "Am I valuable?", "What is my value?", "What was my impact lately?" An example from my own life helped me clear this up. I have a piece of land and I set out to build a house. I researched the topic. I earned the money needed to fund it. I chose an architectural plan. I found workers. I organised the delivery of building materials. If I am asked about it, I will clearly say: "I am building a house". Then I realised: what if the workers whom I've found are asked as well? Their reply will be exactly the same: "I am building a house". This fact amazed me. Our activities are quite different, but all together we are building that house.

This analogy helped me massively. I came to a simple conclusion: I am here to build and grow TransferWise. Building TransferWise is what is expected from me. Building TransferWise means a variety of different activities. It may be putting bricks together to create a wall. It may be designing the interior and exterior. It may be organising materials delivery. It may be talking to others who have built houses and are living in them. It may be finding and hiring builders. It might be visiting builders in hospital when they get sick.

It also helped me to understand why I am doing it after all. With my own house it is easy, because it is me who will be living there :) Apparently all the other houses in the world are also constructed for someone to live in. I can't imagine builders going for: "Let's start building walls and then we will figure out how many floors we can get to and see if anyone will happen to live in that construction." It always starts from considering people, their needs and their wishes. In the case of TransferWise, from thinking of the customers who will be using it.

That said, I was foolish to evaluate myself by the engineering tasks I had finished. I was foolish to think that what I was used to doing is what I should be doing. Nowadays my aim is to make things happen. My aim is to figure out what needs to be done and to do it. My measure of myself is not the lines of code or the number of meetings I've had. It is not the number of bricks I've placed. My goal is to have people living in the houses I've built. My goal is to see them living a happy life there. My goal is to see happy TransferWise customers.

Eventually my title changed from engineer to product engineer and then to product manager. I am not fully skilled for my job and I constantly make mistakes. But I try and keep trying. My life has become easy again. I have found a better way to be an engineer.

January 22, 2017

Anton Arhipov: Twitterfeed #4

Welcome to the fourth issue of my Twitterfeed. I'm still quite irregular at posting the links, but here are some interesting articles that I think are worth sharing.

News, announces and releases


Atlassian acquired Trello. OMG! I mean... happy for the Trello founders. I just hope that the product remains as good as it was.

Docker 1.13 was released. Using compose-files to deploy swarm mode services is really cool! The new monitoring and build improvements are handy. Also Docker is now AWS and Azure-ready, which is awesome!

Kotlin 1.1 beta was published with a number of interesting new features. I have mixed feelings, however. For instance, I really find type aliases an awesome feature, but the definition keyword, "typealias", feels too verbose. Just "alias" would have been much nicer.
Meanwhile, Kotlin support was announced for Spring 5. I think this is great - Kotlin support in the major frameworks will definitely help adoption.

Is there anyone using Eclipse? [trollface] Buildship 2.0 for Eclipse is available, go grab it! :)

Resonating articles


RethinkDB: Why we failed. Probably the best post-mortem I have ever read. At first you will notice a strange kvetch about the tough market and how no one wants to pay, but reading on, the author honestly lists what really went wrong. Sad that it didn't take off; it was a great project.

The Dark Path - probably the most controversial blog post I've read recently. Robert Martin gives his take on Swift and Kotlin. A lot of people, proponents of strong typing, reacted to the post immediately. "Types are tests!", they said. However, I felt like Uncle Bob wrote this article just to repeat his point about tests: "it doesn't matter if your programming language is strongly typed or not, you should write tests". No one would disagree with that statement, I believe. However, the follow-up article was just strange: "I consider the static typing of Swift and Kotlin to have swung too far in the statically type-checked direction." OMG, really!? Has Robert seen Scala or Haskell? Or Idris? IMO, Swift and Kotlin hit the sweet spot with a type system that actually _helps_ developers without getting in the way. Quite a disappointing read, I have to say.

Java 9


JDK 9 is feature complete. That is great news. Now it would be nice to see how the ecosystem will survive all the issues related to reflective access. Workarounds exist, but there should be a proper solution without such hacks. Jigsaw has caused a lot of concerns here and there, but the bet is that in the long run the benefits will outweigh the inconveniences.

Misc


The JVM is not that heavy
15 tricks for every web dev
Synchronized decorators
Code review as a gateway
How to build a minimal JVM container with Docker and Alpine Linux
Lagom, the monolith killer
Reactive Streams and the weird case of backpressure
Closures don’t mean mutability.
How do I keep my git fork up to date?

Predictions for 2017


Since it is the beginning of 2017, it is trendy to make predictions for the trends of the upcoming year. Here are some predictions by industry thought leaders:

Adam Bien’s 2017 predictions
Simon Ritter’s 2017 predictions
Ted Neward’s 2017 predictions

January 04, 2017

TransferWise Tech BlogEffective Reuse on Frontend

In my previous post I discussed the cost of reuse and some strategies for dealing with it on the backend. What about the frontend? In terms of reuse the two are very similar. When we have more than just a few teams regularly contributing to the frontend, we need to start thinking about how we approach reuse across different contexts/teams.

Exposing an API of our microservice to other teams makes it a published interface. Once this is done we cannot change it that easily anymore. The same happens on the frontend when a team decides to "publish" some frontend component to be reused by other teams. The API (as well as the look) of this component becomes part of the contract exposed to the outside world.
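
For example, consider a hypothetical shared component (the name and props below are invented for illustration, not an actual TransferWise component). The moment another team starts using it, its props and its rendered markup are effectively a published contract:

// A hypothetical published component. Renaming a prop such as `currency`,
// or changing the generated markup and styling, is a breaking change for
// every team that consumes it.
export function CurrencyLabel({ amount, currency }) {
  const el = document.createElement('span');
  el.className = 'currency-label';
  el.textContent = amount.toFixed(2) + ' ' + currency;
  return el;
}

// Consumer code elsewhere depends on this exact API and output:
// document.body.appendChild(CurrencyLabel({ amount: 10, currency: 'EUR' }));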

Hence I believe that:

We should split web frontend into smaller pieces — microapps — much the same way as we split backend into microservices. Development and deployment of these microapps should be as independent of each other as possible.

This aligns quite well with the ideas of Martin Fowler, James Lewis and Udi Dahan, who suggest that a "microservice" is not a backend-only concept. Instead of process boundaries, it should be defined by business capabilities and include its own UI if necessary.

Similarly to microservices we want to promote reuse within each microapp while we want to be careful with reuse across different microapps/teams.

January 02, 2017

Raivo LaanemetsNow, 2017-01, summary of 2016 and plans for 2017

This is an update on things related to this blog and my work.

Last month

Blogging

  • Added a UX improvement: external links have target="_blank" to make them open in a new tab. The justification can be found in this article. It is implemented using a small piece of script in the footer (see the sketch after this list).
  • Updated the list of projects to include work done in 2016.
  • Updated the visual style for better readability. The article page puts more focus on the content and less on the related things.
  • Updated the CV.
  • Found and fixed some non-valid HTML markup on some pages.
  • Wrote announcements for the last of my Open Source projects: DOM-EEE and Dataline.
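
A minimal sketch of the kind of footer script mentioned in the first item above (the selector and the rel attribute are my own assumptions, not necessarily what this blog actually uses):

// Open links pointing to other hosts in a new tab.
document.querySelectorAll('a[href^="http"]').forEach(function (link) {
  if (link.hostname !== window.location.hostname) {
    link.target = '_blank';
    // Commonly set together with target="_blank" as a precaution.
    link.rel = 'noopener noreferrer';
  }
});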

I also discovered that mail notifications were not working. The configuration had been broken for some time and I had disabled alerts on the blog engine's standard error stream. I have fixed the mail configuration and now monitor the error log for mail sending errors.

Work

I built an Electron-based desktop app. I usually do not build desktop applications and consider them a huge pain to build and maintain. This was a small project taking 2 weeks, and I also used it as a chance to evaluate the Vue.js framework. Vue.js works very well with Electron and was very easy to pick up thanks to its similarities with the KnockoutJS library. I plan to write about both in separate articles.

The second part of my work included a DXF file exporter. DXF is a vector drawing format used by AutoCAD and industrial machines. My job was to convert and combine SVG paths from an online CAD editor into a single DXF file for a laser cutter.

While filing my annual report I was positively surprised by how little paperwork I needed to file. It only required a balance sheet, a profit/loss statement, and 3 small additional trivial reports. In previous years I had to file a much more comprehensive report, which is now required from mid-size (on the Estonian scale) companies with about 250 employees.

Infrastructure

I have made some changes to my setup:

  • Logging and monitoring were moved to an OVH VPS.
  • Everything else important has been moved away from the home server. Some client systems are still waiting to be moved.

The changes were necessary as I might travel a bit in 2017, and it won't be possible to fix my own server at home when a hardware failure occurs. I admit that running my own server hardware was one of my stupidest decisions.

Besides these changes:

  • infdot.com now redirects to rlaanemets.com. I am not maintaining a separate company homepage anymore. This gives me more free time for the important things.
  • Rolled out SSL to every site/app of mine where I enter passwords. All the new certs are from Let's Encrypt and are renewed automatically.
  • I am now monitoring my top priority web servers through UptimeRobot.
  • The blog frontend is monitored by Sentry.

Other things

The apartment building's full-scale renovation was finally accepted by the other owners and the contract has been signed with the construction company. Construction starts ASAP. I have been looking for places to rent a quiet office, as the construction noise will likely make working in the home office impossible.

Yearly summary and plans

2016 was an incredibly busy and frustrating year for me. A project at the beginning of the year was left partially unpaid after it turned out to be financially unsuccessful for the client. The project did not have a solid contract, and legal action against the client would have been very difficult. This put me in a tight situation where I took on more work than I could handle to compensate for my financial situation. As the work accumulated:

  • I was not able to keep up with some projects. Deadlines slipped.
  • I was not able to accept better-paying work because of the existing commitments.
  • The increasing workload caused health issues: arm pain, insomnia.

At the end of the year I had to drop some projects, as there was no other way to decrease the workload. The last 2 weeks were finally pretty OK.

In 2017 I want to avoid such situations. Financially I'm already in a much better position. I will require a bit stricter contracts from my clients and will select projects more carefully.

As for technology, I do not see 2017 bringing many changes. My preferred development platforms are still JavaScript (browsers, Node.js, Electron, PhantomJS) and SWI-Prolog.

December 28, 2016

Anton ArhipovTwitterfeed #3

Welcome to the third issue of my Twitterfeed. In the two weeks since the last post I've accumulated a good share of links to news and blog posts, so it is a good time to "flush the buffer".


Let's start with something more fundamental than just the news about frameworks and programming languages. "A tale of four memory caches" is a nice explanation of how browser caching works. Awesome read, nice visuals, useful takeaways. Go read it!

Machine Learning seems to be becoming more and more popular. So here's a nicely structured knowledge base for your convenience: "Top-down learning path: Machine Learning for Software Engineers".

Next, let's see what's new with all the reactive buzz. The trend is highly popular, so I've collected a few links to blog posts about RxJava and related topics.

First, "RxJava for easy concurrency and backpressure" is my own writeup about the beauty of the RxJava for a complex problem like backpressure combined with concurrent task scheduling.

Dávid Karnok published benchmark results for the different reactive libraries.

"Refactoring to Reactive - Anatomy of a JDBC migration" explains how reactive approach can be introduced incrementally into the legacy applications.

The reactive approach is also suitable for the Internet of Things. So here's an article about Vert.x being used in the IoT world.

IoT is actually not only about the devices but also about the cloud. Arun Gupta published a nice write up about using the AWS IoT Button with AWS Lambda and Couchbase. Looks pretty cool!

Now onto the news related to my favourite programming tool, IntelliJ IDEA!

IntelliJ IDEA 2017.1 EAP has started! Nice, but I'm not amused. Who needs those emojis anyway?! I hope IDEA developers will find something more useful in the bug tracker to fix and improve.

Andrey Cheptsov experiments with code folding in IntelliJ IDEA. The Advanced Expressions Folding plugin is available for download - give it a try!

Claus Ibsen announced that the work has started on Apache Camel IntelliJ plugin.

Since we are on the topic of IntelliJ IDEA news, it makes sense to see what's up with Kotlin as well. Kotlin 1.0.6 has been released as a new bugfix and tooling update. It seems Kotlin is gaining popularity, and people are trying to use it in conjunction with popular frameworks like Spring Boot and Vaadin.

That's quite a lot of links already, so I'll stop here. I should start posting these more often :)

December 22, 2016

Raivo LaanemetsAnnouncement: Dataline chart library

Some time ago I built a small library to draw some line charts using the HTML5 canvas. I have been using it in some projects requiring simple responsive line charts. It can do this:

  • Draws min/max/zero line.
  • Draws min/max labels.
  • Single line.
  • Width-responsive.

Multiple lines, ticks, x-axis labels etc. are not supported. There are other libraries that support all of these. It has no dependencies but requires ES5, canvas, and requestAnimationFrame support. The library is extremely lightweight and uses very few resources.

Example

This is the HTML code containing the canvas and input data. The data is embedded directly by using the data-values attribute:

<canvas class="chart" id="chart"
  data-values="1,2,3,-1,-3,0,1,2"></canvas>        
<script src="dataline.js"></script>
<script>Dataline.draw('chart');</script>

And the CSS code to set the chart size:

.chart { width: 100%; height: 200px; }

Live rendering output:

<canvas class="chart" data-values="1,2,3,-1,-3,0,1,2" id="chart" style="width: 100%; height: 200px;"></canvas> <script src="https://rlaanemets.com/announcement-dataline-chart-library/dataline.min.js"></script> <script>Dataline.draw('chart');</script>

The source code of the library, documentation, and the installation instructions can be found in the project repository.

December 21, 2016

Kuido tehnokajamHiding the checkbox when grouping a CheckBoxList

The ASP.NET CheckBoxList component sometimes needs its list grouped for various reasons. The whole trick is based on using CSS3. Since it could cause confusion across the project, it makes sense to use "inline CSS", for example inside the user control that needs it:

<style>
    #CheckBoxListOtsinguMajad input:disabled {
        display: none;
    }
</style>

This limits the rule's effect to CheckBoxListOtsinguMajad.

December 17, 2016

Raivo LaanemetsAnnouncement: DOM-EEE

DOM-EEE is a library to extract structured JSON data from DOM trees. The EEE part in the name means Extraction Expression Evaluator. The library takes a specification in the form of a JSON document containing CSS selectors and extracts data from the page DOM tree. The output is also a JSON document.

I started developing the library while dealing with many web scraping projects. There have been huge differences in navigation logic, page fetch strategies, automatic proxying, and runtimes (Node.js, PhantomJS, browser userscripts), but the data extraction code has been similar. I tried to cover these similarities in this library while making it work in the following environments:

  • Browsers (including userscripts)
  • PhantomJS
  • Cheerio (Node.js)
  • jsdom (Node.js)
  • ES5 and ES6 runtimes

The library is a single file that is easy to inject into any of these environments. As the extraction expressions are kept in JSON format, and the output is a JSON document, any programming platform supporting JSON and HTTP can be coupled to PhantomJS, a headless web browser with a built-in server, to drive the scraping process.

Example usage

This example uses cheerio, a jQuery implementation for Node.js:

var cheerio = require('cheerio');
var eee = require('eee');
var html = '<ul><li>item1</li><li>item2 <span>with span</span></li></ul>';
var $ = cheerio.load(html);
var result = eee($.root(),
    {
        items: {
            selector: 'li',
            type: 'collection',
            extract: { text: { selector: ':self' } },
            filter: { exists: 'span' }
        }
    },
    { env: 'cheerio', cheerio: $ });
console.log(result);

This code will print:

{ items: [ { text: 'item2 with span' } ] }

Alternatives

There are a number of similar projects. Most of them assume a specific runtime environment or try to do too much to be portable. Some examples:

  • artoo.js (client side).
  • noodle (Node.js, not portable enough).
  • x-ray (not portable, coupled with HTTP and pagination and 100 other things).

Documentation

Full documentation of the JSON-based expression language and further examples can be found in the project's code repository.

December 09, 2016

Raivo LaanemetsHello world from DXF

Last week I worked on code to convert SVG to DXF. SVG (Scalable Vector Graphics) is a vector graphics format supported by most browsers. DXF (Drawing Exchange Format) is another vector format, mostly used by CAD applications. Our CAD editor at Scale Laser, a startup focusing on model railroad builders, uses an SVG-based drawing editor in the browser, but the software controlling the actual cutting hardware uses DXF. There were no usable generic SVG to DXF converters available, so we had to write our own. We only deal with SVG <path> elements and do not have to support other SVG elements.

DXF is fairly well specified through a 270-line PDF file here. The low-level data serialization format feels ancient compared to the more structured XML and JSON. Also, it is quite hard to put together a minimal DXF file that can be opened by most programs claiming DXF compatibility, or by AutoCAD itself. AutoCAD is the original program to use the DXF format.

I have put together a minimal file by trial and error. I kept adding stuff until I got the file loading in AutoCAD. The file follows, and I explain its parts below.

Sections

A DXF file consists of sections. The most important section is ENTITIES, which contains the graphical objects. Another important section is HEADER:

  0
SECTION
  2
HEADER
  9
$ACADVER
  1
AC1009
  0
ENDSEC

All sections are made up of group code-value pairs. Such a pair is formatted like:

  <code>
<value>

The group code specifies the type of the value (string, float, etc.), its semantic meaning (X coordinate), or both. A section begins with the SECTION keyword and the section's name. A section ends with the ENDSEC keyword.

I found it was necessary to specify the file/AutoCAD version in the header. Without it, some tools, including AutoCAD, would give errors upon opening the file. This is accomplished by two code-value pairs:

  9
$ACADVER
  1
AC1009

This corresponds to the versions R11 and R12.

Lines

After the header comes the actual content section ENTITIES. It contains a rectangle made up of 4 lines (snippet truncated to show a single line only):

  0
SECTION
  2
ENTITIES
  0
LINE
  8
0
  62
8
  10
169.50
  20
42.33
  11
169.50
  21
94.19
...
  0
ENDSEC
  0

Graphical objects are specified one after another, without any further structure. A line starts with the LINE keyword and ends at the start of another object or at the end of the section. The line object here has the following properties.

The layer index (group code 8, value 0). I was not able to make the file display in most viewers without it:

  8
0

The line color (group code 62, value 8 - gray). Nothing was visible in some viewers without setting it:

  62
8

After that come the start and end coordinates of the line (X1, Y1, X2, Y2 as 10, 20, 11, 21 respectively):

  10
169.50
  20
42.33
  11
169.50
  21
94.19

DXF coordinates have no units such as pixels, mm, etc. Interpretation of units seems to be implicit and application-specific. For example, our laser software assumes mm as the unit.
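
To make the structure above concrete, here is a small JavaScript sketch (not the converter we actually use; the last three corner coordinates of the rectangle are invented for the example) that emits the same group code-value pairs: an AC1009 header followed by an ENTITIES section of gray LINE objects on layer 0:

// Format one group code-value pair, with the code indented by two spaces
// as in the snippets above.
function pair(code, value) {
  return '  ' + code + '\n' + value + '\n';
}

// One LINE entity: layer 0, color 8 (gray), then X1, Y1, X2, Y2
// as group codes 10, 20, 11, 21.
function line(x1, y1, x2, y2) {
  return pair(0, 'LINE') +
    pair(8, 0) +
    pair(62, 8) +
    pair(10, x1.toFixed(2)) + pair(20, y1.toFixed(2)) +
    pair(11, x2.toFixed(2)) + pair(21, y2.toFixed(2));
}

// Minimal file: a HEADER section declaring $ACADVER AC1009, an ENTITIES
// section with the lines, and a closing EOF marker.
function minimalDxf(lines) {
  return pair(0, 'SECTION') + pair(2, 'HEADER') +
    pair(9, '$ACADVER') + pair(1, 'AC1009') +
    pair(0, 'ENDSEC') +
    pair(0, 'SECTION') + pair(2, 'ENTITIES') +
    lines.map(function (l) { return line.apply(null, l); }).join('') +
    pair(0, 'ENDSEC') +
    pair(0, 'EOF');
}

// A rectangle as four lines; only the first line's coordinates come from
// the snippet above.
console.log(minimalDxf([
  [169.50, 42.33, 169.50, 94.19],
  [169.50, 94.19, 220.00, 94.19],
  [220.00, 94.19, 220.00, 42.33],
  [220.00, 42.33, 169.50, 42.33]
]));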

Rendering output

This is the rendering output in de-caff, a simple Java-based DXF viewer:

Minimal DXF rectangle in de-caff

This is the rendering output in AutoCAD 2017:

Minimal DXF rectangle in AutoCAD

The full file containing the rectangle and the header section can be downloaded from here.