Friday, 24 April 2015

IoT: Internet of Things links

This article by Alok Batra started me thinking about the differences between the Enterprise, Mobile, and IoT development spaces and my own personal journey down this path. I am sure my thinking will change as my IoT skills and knowledge mature - this is just a moment in time - but I thought writing it all down would be valuable.

Source: https://www.linkedin.com/pulse/becoming-iot-developer-alok-batra


Thursday, 23 April 2015

How B-tree indexing works in MySQL

http://stackoverflow.com/questions/2362667/how-b-tree-indexing-works-in-mysql

Q.  When I create an index for a table in MySQL, I see that the index_type is type BTREE. Now although I understand about btree(s), I do not quite understand how it stores the index and how the database searches the records based on this.
I mean, btree is excellent for databases that perform reads and writes of large blocks of data. When we create an index for a column of type primary key, what I understand is that it creates a tree, splitting the values for the root based on the value type of the root.
Now, does it store only the primary key ID under the trees or the whole data associated with that primary key?
After finding the wanted primary ID, how does the database extract the record?

A. The database stores the value indexed as a B-Tree key, and the record pointer as a B-Tree value.
Whenever you search for a record holding a certain value of an indexed column, the engine locates the key holding this value in the B-Tree, retrieves the pointer to the record and fetches the record.
What exactly a "record pointer" is depends on the storage engine.
  • In MyISAM, the record pointer is an offset to the record in the MYI file.
  • In InnoDB, the record pointer is the value of the PRIMARY KEY.
In InnoDB, the table itself is a B-Tree with the PRIMARY KEY as the B-Tree key. This is what is called a "clustered index" or "index-organized table". In this case, all other fields are stored as the B-Tree value.
In MyISAM, the records are stored without any special order. This is called "heap storage".
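
To make the two-step InnoDB lookup concrete, here is a toy Python sketch of the idea (illustrative only, not real storage-engine internals): a secondary index maps an indexed value to a PRIMARY KEY, and the clustered index maps that key to the full row.

```python
import bisect

# Toy "clustered index": PRIMARY KEY -> full row (InnoDB-style, simplified)
clustered = {1: ("alice", 30), 2: ("bob", 25), 3: ("carol", 35)}

# Toy secondary index on name: sorted (value, pk) pairs, searched the way
# a B-Tree key is searched (here with binary search over a sorted list)
name_index = sorted((row[0], pk) for pk, row in clustered.items())

def find_by_name(name):
    i = bisect.bisect_left(name_index, (name,))
    if i < len(name_index) and name_index[i][0] == name:
        pk = name_index[i][1]   # the "record pointer" is the PRIMARY KEY
        return clustered[pk]    # second lookup fetches the actual record
    return None

print(find_by_name("bob"))      # ('bob', 25)
```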

Wednesday, 22 April 2015

Basic Search Algorithms


In computer science, binary search trees (BST), sometimes called ordered or sorted binary trees, are a class of data structures used to implement lookup tables and dynamic sets. They store data items, known as keys, and allow fast insertion and deletion of such keys, as well as checking whether a key is present in a tree.
Binary search trees keep their keys in sorted order, so that lookup and other operations can use the principle of binary search: when looking for a key in a tree (or a place to insert a new key), they traverse the tree from root to leaf, making comparisons to keys stored in the nodes of the tree and deciding, based on the comparison, to continue searching in the left or right subtrees. On average, this means that each comparison allows the operations to skip over half of the tree, so that each lookup/insertion/deletion takes time proportional to the logarithm of the number of items stored in the tree. This is much better than the linear time required to find items by key in an unsorted array, but slower than the corresponding operations on hash tables.
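
A minimal Python sketch of the root-to-leaf traversal described above (insertion and lookup only; balancing is what real implementations add on top):

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root                  # duplicate keys are ignored

def contains(root, key):
    while root is not None:
        if key == root.key:
            return True
        # each comparison discards one whole subtree: O(log n) on average
        root = root.left if key < root.key else root.right
    return False

root = None
for k in [8, 3, 10, 1, 6]:
    root = insert(root, k)
print(contains(root, 6), contains(root, 7))   # True False
```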

General Big Data, Big Analytics links

5 technologies that will help big data cross the chasm

Apache Spark

Spark’s popularity is aided by the YARN resource manager for Hadoop and the Apache Mesos cluster-management software, both of which make it possible to run Spark, MapReduce and other processing engines on the same cluster using the same Hadoop storage layer. I wrote in 2012 about the move away from MapReduce as one of five big trends helping us rethink big data, and Spark has stepped up as the biggest part of that migration.

2015: Thomas W. Dinsmore: Predictions for Big Analytics

Apache Spark usage will explode.
Analytics in the cloud will take off.
Python will continue to gain on R as the preferred open source analytics platform.
H2O will continue to win respect and customers in the Big Analytics market.
SAS customers will continue to seek alternatives.


Business Intelligence links

Gartner: Survey Analysis: Customers Rate Their BI Platform Vendor, 2014


High Scalability links

Paper: Staring Into The Abyss: An Evaluation Of Concurrency Control With One Thousand Cores

We implemented seven concurrency control algorithms on a main-memory DBMS and, using computer simulations, scaled our system to 1024 cores. Our analysis shows that all algorithms fail to scale to this magnitude but for different reasons. In each case, we identify fundamental bottlenecks that are independent of the particular database implementation and argue that even state-of-the-art DBMSs suffer from these limitations.

General database links


tl;dr: ACID and NewSQL databases rarely provide true ACID guarantees by default, if they are supported at all. See the table.

Friday, 17 April 2015

Software Architecture links

Ben Morris: What role do architects have in agile development?

Enter the “master developer”

In this case, some kind of design authority is generally required to work across the teams to ensure the integrity of the overall system and spot issues before they become obstacles. This role shouldn’t be confused with governance, where design edicts are sent down to teams from the “ivory towers” and architecture boards. It can only be effective with the consent of the teams.

Akka links

Why do I hate akka?

If you have talked to me for more than a few minutes about the current state of the world of scala programming, you probably have learned that at some point I started hating akka. Why? There are many reasons, but the one I will name first is the one that many will name first. That a partial function Any => Unit is a horrible type to build a framework around.

General Computer Science links

2 + 2 = 5

Interesting attempts to make this work in various popular programming languages.
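
A tame Python variant of the joke (the linked attempts get far more devious) just overloads addition on an int subclass; the Trollint name is made up for this example:

```python
class Trollint(int):
    def __add__(self, other):
        # deliberately off by one
        return Trollint(int.__add__(self, other) + 1)

two = Trollint(2)
print(two + 2)   # 5
```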


Not light reading!

How to Design Programs

Structure and Interpretation of Computer Programs

Dean Wampler: SQL Strikes Back! Recent Trends in Data Persistence and Analysis


Traditional Data Warehouse:
Pros:
  • Mature
  • Rich SQL, analytics functions
  • Scales to “mid-size” data
Cons:
  • Expensive per TB
  • Can’t scale to Hadoop-sized data sets

Data Warehouse vs. Hadoop?
  • Data Warehouse
    + Mature
    + Rich SQL, analytics
    – Scalability
    – $$/TB
  • Hadoop
    – Maturity vs. DWs
    + Growing SQL
    + Massive scalability
    + Excellent $$/TB

Facebook had data in Hadoop.  Facebook’s Data Analysts needed access to it...
 so they created Hive...

Career Development links

Five Signs You Should Be a Low-Code Developer
If some or most of these traits resonate with your approach to work, then you’ve got what it takes to be a low-code developer. In this role, you’d understand that software development is about reaching the business goal and helping end users. You’d want to talk to users, understand their requirements and work closely with them in short, iterative cycles. Most of all, you’d advocate for the business value of IT and find great job satisfaction from making your customers and end users happy.

techcrunch: On Secretly Terrible Engineers

That is the transformation we need in engineering. We need to start with the assumption that engineers are smart learners eager to know more about their craft. No, an individual may not know the specific framework you use for front-end development, but then again, there are so many that it is hard to know all of them. Engage them! Mentor them! Buy them a god damn book! 
We need to move beyond the algorithm bravado to engage more fundamentally with the craft. If people are wired for engineering logic and have programmed in some capacity in the past, they almost certainly can get up to speed in any other part of the field. Let them learn, or even better, help them learn.
I am not unbiased here, having gone through this process myself. I started programming in second grade. I wrote tens of thousands of lines of code in high school, programming games and my own web server. I got a Mathematical and Computational Science degree from Stanford and continued coding. I should have been a software developer, but after a series of interviews, I realized the field was never for me. So much hostility, so little love. 
No one ever offered me a book. No one even offered advice, or suggestions on what was interesting in the field or what was not. No one ever said, “Here is how we are going to bring your skills to the next level and ensure you will be quickly productive on our team.” The only answer I ever got was, “We expect every employee to be ready on day one.” What a scary proposition! Even McDonalds doesn’t expect its burger flippers to be ready from day one. 
That’s not typical in our economy, and as computer science expands in popularity, we need to ensure that the next generation of talent feels welcomed. There are far fewer secretly terrible engineers than we might expect if we give them mentorship and support to do great work. There is a whole group of secretly great engineers ready to be developed, if only we realized our field’s animosity.

Funny and true.
If Carpenters Were Hired Like Programmers


Facebook Coding Interview tips:
https://www.facebook.com/Engineering/videos/10153034561822200/

Aruoba/Fernández-Villaverde: Comparison of Programming Languages in Economics

http://economics.sas.upenn.edu/~jesusfv/comparison_languages.pdf

We solve the stochastic neoclassical growth model, the workhorse of modern macroeconomics, using C++11, Fortran 2008, Java, Julia, Python, Matlab, Mathematica, and R. We implement the same algorithm, value function iteration with grid search, in each of the languages. We report the execution times of the codes on a Mac and on a Windows computer and comment on the strengths and weaknesses of each language.
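
For readers unfamiliar with the method, here is a minimal Python sketch of value function iteration with grid search, applied to a deterministic version of the growth model (log utility, full depreciation); the parameter values are illustrative, not those used in the paper:

```python
import numpy as np

alpha, beta = 0.3, 0.95                      # illustrative parameters
grid = np.linspace(0.05, 0.5, 500)           # grid over capital k
V = np.zeros(len(grid))                      # initial guess for V(k)

# consumption for every (k, k') pair; infeasible choices get -inf utility
c = grid[:, None] ** alpha - grid[None, :]
utility = np.where(c > 0, np.log(np.maximum(c, 1e-12)), -np.inf)

for _ in range(1000):
    V_new = np.max(utility + beta * V[None, :], axis=1)   # grid search over k'
    if np.max(np.abs(V_new - V)) < 1e-6:                  # sup-norm convergence
        V = V_new
        break
    V = V_new

policy = grid[np.argmax(utility + beta * V[None, :], axis=1)]  # optimal k'(k)
```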

Maven, Git, Jenkins software build tool links

http://www.slideshare.net/tarkasteve/understanding-git-voxxed-vienna-2016

19 Tips For Everyday Git Use

I’ve been using git full time for the past 4 years, and I wanted to share the most practical tips that I’ve learned along the way. Hopefully, it will be useful to somebody out there.
If you are completely new to git, I suggest reading Git Cheat Sheet first. This article is aimed at somebody who has been using git for three months or more.
Table of Contents:
  1. Parameters for better logging
  2. Log actual changes in a file
  3. Only Log changes for some specific lines in file
  4. Log changes not yet merged to the parent branch
  5. Extract a file from another branch
  6. Some notes on rebasing
  7. Remember the branch structure after a local merge
  8. Fix your previous commit, instead of making a new commit
  9. Three stages in git, and how to move between them
  10. Revert a commit, softly
  11. See diff-erence for the entire project (not just one file at a time) in a 3rd party diff tool
  12. Ignore the white space
  13. Only “add” some changes from a file
  14. Discover and zap those old branches
  15. Stash only some files
  16. Good commit messages
  17. Git Auto-completion
  18. Create aliases for your most frequently used commands
  19. Quickly find a commit that broke your feature (EXTRA AWESOME)

GitHub Pull Requests


GitHub’s mission is to make it easier to work together than alone. Throughout the company’s history, they have worked toward this goal by providing an easy way to host Git repositories online and surrounding those repositories with a growing set of collaborative mechanisms that work in the browser and through Git itself.
Pull Requests may be the most important of these innovations. They have enabled increased open-source contributions, provided new ways for enterprise teams to work together, and offered a full-featured code review mechanism—all at the cost of a few Git commands and a simple web user interface. Let’s take a look at how pull requests work and how to use them in open-source and enterprise environments.

Neal Ford: Why Everyone (Eventually) Hates (or Leaves) Maven

Which is why every project eventually hates Maven. Maven is a classic contextual tool: it is opinionated, rigid, generic, and dogmatic, which is exactly what is needed at the beginning of a project. Before anything exists, it’s nice for something to impose a structure, and to make it trivial to add behavior via plug-ins and other pre-built niceties. But over time, the project becomes less generic and more like a real, messy project. Early on, when no one knows enough to have opinions about things like lifecycle, a rigid system is good. Over time, though, project complexity requires developers to spawn opinions, and tools like Maven don’t care.

General Python links

The Hitchhiker’s Guide to Python!

Understanding Python Decorators in 12 Easy Steps!

https://caniusepython3.com/

Google Style Guide for Python

Profiling Python in Production

The Elements of Python Style

Code Like a Pythonista: Idiomatic Python
Other languages have "variables", Python has "names"
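
The distinction in a few lines: in Python, assignment binds a name to an object; it never copies the object.

```python
a = [1, 2, 3]
b = a            # 'b' is a second name bound to the same list object
b.append(4)
print(a)         # [1, 2, 3, 4] -- both names refer to one object
```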

Records: SQL for Humans™

Python utilities that should be builtins

Python Mocking 101: Fake It Before You Make It
An Introduction to Mocking in Python


General Software Development process links

Microsoft Research: Exploding Software-Engineering Myths

The logical assumption would be that more code coverage results in higher-quality code. But what Nagappan and his colleagues saw was that, contrary to what is taught in academia, higher code coverage was not the best measure of post-release failures in the field. Code coverage is not indicative of usage.

What the research team found was that the TDD teams produced code that was 60 to 90 percent better in terms of defect density than non-TDD teams. They also discovered that TDD teams took longer to complete their projects—15 to 35 percent longer.

The team observed a definite negative correlation: more assertions and code verifications mean fewer bugs. Looking behind the straight statistical evidence, they also found a contextual variable: experience. Software engineers who were able to make productive use of assertions in their code base tended to be well-trained and experienced, a factor that contributed to the end results. These factors built an empirical body of knowledge that proved the utility of assertions.
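
As a hypothetical illustration of what “productive use of assertions” can look like, here is a sketch of assertions encoding pre- and postconditions (the function and its contract are invented for this example):

```python
def take_region(buffer, size):
    # Preconditions: document the caller's contract
    assert size > 0, "size must be positive"
    assert size <= len(buffer), "request exceeds buffer capacity"
    region = buffer[:size]
    # Postcondition: we hand back exactly what was asked for
    assert len(region) == size
    return region

print(take_region(b"abcdef", 3))   # b'abc'
```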


Thursday, 16 April 2015

zeroturnaround.com: Architecting Large Enterprise Java Projects with Markus Eisele

http://zeroturnaround.com/rebellabs/architecting-large-enterprise-java-projects-by-markus-eisele/

Developers built a lot of applications like that some time ago, and they even build them today! These applications are still working and need maintenance. So we see them sometimes and call them legacy. They tend to have a release cycle of once or twice a year, depend on a proprietary application server environment and most importantly have a single database schema for all data. Naturally, you cannot move very fast with such a beast on your shoulders and must have a large team and QA department even just to maintain it.

The next step in the architecture design was the Enterprise Service Bus age: understanding that changes have to be incorporated into even the oldest and most legacy applications, we (Java developers) started breaking the huge apps into smaller ones. The biggest challenge was to integrate it all together, so the service bus seemed the best solution.
The change wasn’t that big for the operations teams, as they still had everything under their control and centralized, although it was a much more flexible approach. However, the same centralization that adds value creates a raft of problems that the engineering teams had to solve: most importantly, challenges with testing and the single point of failure (SPOF).
Now we’re moving even further away from the monolithic apps and towards the trending buzzword of Microservices.

Then there are a number of patterns you can use to organise the communication between your microservices, like the Aggregator or the Chain; a rough sketch of the Aggregator follows.
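
A rough sketch of the Aggregator idea, with hypothetical service names and asyncio sleeps standing in for real network calls: one edge service fans out to several microservices concurrently and merges their responses.

```python
import asyncio

async def fetch_profile(user_id):
    await asyncio.sleep(0.1)                  # simulate network latency
    return {"name": "Ada"}

async def fetch_orders(user_id):
    await asyncio.sleep(0.1)
    return {"orders": [101, 102]}

async def user_dashboard(user_id):
    # the Aggregator: concurrent fan-out, then one combined response
    profile, orders = await asyncio.gather(
        fetch_profile(user_id), fetch_orders(user_id)
    )
    return {**profile, **orders}

print(asyncio.run(user_dashboard(42)))        # {'name': 'Ada', 'orders': [101, 102]}
```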

Visualisation, HTML5, UX and HTTP links

modeling-languages.com: 10 JavaScript libraries to draw your own diagrams

Comparative table of JavaScript drawing libraries

To finish, here is a basic comparative table of the presented libraries.
(Format: library: license; language/infrastructure; high/low level; built-in editor; GitHub stats as of 04/02/2015.)
  • JointJS: MPL; HTML/JavaScript/SVG; high-level; no built-in editor; 1388 stars, 265 forks
  • Rappid: commercial (1,500.00 €); HTML/JavaScript/SVG; high-level; built-in editor
  • Mxgraph: commercial (4,300.00 €); HTML/JavaScript/SVG; high-level; built-in editor
  • GoJS: commercial ($1,350.00); HTML/Canvas/JavaScript; high-level; built-in editor
  • Raphael: MIT; HTML/JavaScript/SVG; low-level; no built-in editor; 7105 stars, 1078 forks
  • Draw2D: GPL2/commercial; HTML/JavaScript/SVG; medium-level; no built-in editor
  • D3: BSD; HTML/JavaScript/SVG; low-level; no built-in editor; 36218 stars, 9142 forks
  • FabricJS: MIT; HTML/Canvas/JavaScript; low-level; no built-in editor; 4127 stars, 705 forks
  • paperJS: MIT; HTML/Canvas/JavaScript; low-level; no built-in editor; 4887 stars, 496 forks
  • jsPlumb: MIT/GPL2; HTML/JavaScript; medium-level; no built-in editor; 2161 stars, 563 forks

Javascript links

The Original jQuery Source Code, Annotated by John Resig

The mind-boggling universe of JavaScript Module strategies

JavaScript's 'bind' Explained in 5 Minutes

Learn JS Data Data manipulation, munging, and processing in JavaScript


15 Helpful Chrome Extensions for Developers You Need to Know

JavaScript Module Pattern: In-Depth

The module pattern is a common JavaScript coding pattern. It’s generally well understood, but there are a number of advanced uses that have not gotten a lot of attention. 

The Path to Parallel JavaScript

Javascript for Java Developers

You’re Missing the Point of Server-Side Rendered JavaScript Apps

Learn JavaScript Essentials (for all skill levels)

General Java/JavaEE links

Java 8 Stream Tutorial

How to use flatMap() in Java 8 - Stream Example Tutorial

Continuous Delivery with Docker Containers and Java EE

Microservices, DevOps and PaaS - The Impact on Modern Java EE Architecture

What Would ESBs Look Like If They Were Done Today?

EJB and CDI - Alignment and Strategy

Upgrading to Java 8 at Scale

Productive Java EE 7 on Java 8 At Commerzbank

Basics of scaling Java EE applications

Design pattern samples in Java

Java 8’s Method References Put Further Restrictions on Overloading


Dismantling invokedynamic

Javascript for Java Developers

A curated list of awesome Java frameworks, libraries and software

Java 8 Streams cheat sheet

Palladium: Predictive Analytics, Machine Learning framework

Palladium provides means to easily set up predictive analytics services as web services. It is a pluggable framework for developing real-world machine learning solutions. It provides generic implementations for things commonly needed in machine learning, such as dataset loading, model training with parameter search, a web service, and persistence capabilities, allowing you to concentrate on the core task of developing an accurate machine learning model. Having a well-tested core framework that is used for a number of different services can lead to a reduction of costs during development and maintenance due to harmonization of different services being based on the same code base and identical processes. Palladium has a web service overhead of a few milliseconds only, making it possible to set up services with low response times...

blog.xebialabs.com: Before You Go Over the Container Cliff with Docker, Mesos etc: Points to Consider

I’m personally really excited about the potential of microservices and containers, and typically recommend pretty emphatically that our users should research them. But I also add that doing research is absolutely not the same thing as deciding up front to go for full-scale adoption.
Given the incredibly rapid pace of change in this area, it’s essential to develop a clear understanding of the capabilities of the technology in your environment before making any decisions: production is not usually a good arena for R&D.
Based on what we have learned from our users and partners that have been undertaking such research, our own experiences (we use containers quite a lot internally) and lessons from companies such as eBay and Google, here are six important criteria to bear in mind when deciding whether to move from research to adoption...

Java Performance links

Includes some stuff on JVisualVM: Hunting Memory Leaks in Java

Comparing GC Collectors


Alex Zhitnitsky : Java Performance Tuning: Getting the Most Out of Your Garbage Collector

The main question here is this: What do you see as an acceptable criteria for the GC pause frequency and duration in your application? For example, a daily pause of 15 seconds might be acceptable, while a frequency of once in 30min would be an absolute disaster for the product. The requirements come from the domain of each system, where real time and high frequency trading systems would have the most strict requirements.
Overall, seeing pauses of 15-17 seconds is not a rare thing. Some systems might even reach 40-50 seconds pauses, and Haim also had a chance to see 5 minute pauses in a system with a large heap that did batch processing jobs. So pause duration doesn’t play a big factor there.

Wednesday, 15 April 2015

General Scala Links


4 Interview Questions for Scala Developers


  • What’s the difference between the following terms and types in Scala: ‘Nil,’ ‘Null,’ ‘None,’ ‘Nothing’?
  • What is ‘Option’ and how is it used?
  • Explain the difference between ‘concurrency’ and ‘parallelism,’ and name some constructs you can use in Scala to leverage both.
  • Bonus question: What is ‘Unit’ and ‘()’?


Scala DSLs:

short intro: http://www.scala-lang.org/old/node/1403
Using in Camel routes:  http://camel.apache.org/scala-dsl.html
Good presentation, which describes Scala as more succinct than Java and therefore more appropriate:  http://www.slideshare.net/indicthreads/using-scala-for-building-ds-ls-abhijit-sharma


Torsten Möller: Data Visualization Course

http://www2.cs.sfu.ca/~torsten/Teaching/Cmpt467/

Content Description:
Visualization deals with all aspects that are connected with the visual representation of data sets from scientific experiments, simulations, medical scanners, databases, web systems, and the like in order to achieve a deeper understanding or a simpler representation of complex phenomena and to extract important information visually. To obtain these goals, both well-known techniques from the field of interactive computer graphics and completely new methods are applied. The objective of the course is to provide knowledge about visualization algorithms and data structures as well as acquaintance with practical applications of visualization. Through several projects the student is expected to learn methods to explore and visualize different kinds of data sets.
  • Introduction and historical remarks
  • Abstract visualization concepts and the visualization pipeline
  • Data acquisition and representation (sampling and reconstruction; grids and data structures).
  • Basic mapping concepts
  • Visualization of scalar fields (isosurface extraction, volume rendering)
  • Visualization of vector fields (particle tracing, texture-based methods, vector field topology)
  • Tensor fields, multi-attribute data, multi-field visualization
  • Human visual perception + Color
  • Space/Order + Depth/Occlusion
  • Focus+Context; Navigation+Zoom
  • Visualization of graphs and trees and high-dimensional data
  • Evaluation + Interaction models 

Analysis of Algorithms

The Sedgewick/Wayne book (part of a Coursera course) covers this very well:
Analysis of Algorithms

As people gain experience using computers, they use them to solve difficult problems or to process large amounts of data and are invariably led to questions like these:
  • How long will my program take?
  • Why does my program run out of memory?
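
One standard empirical answer to the first question is the doubling test: time the program on inputs of size N and 2N; if the ratio of running times approaches 2^b, the running time grows roughly as N^b. A minimal sketch (the work function is a stand-in for the program under test):

```python
import time

def work(n):
    return sorted(range(n, 0, -1))   # stand-in for the program under test

n, prev = 1 << 12, None
for _ in range(6):
    start = time.perf_counter()
    work(n)
    elapsed = time.perf_counter() - start
    if prev:
        print(f"N = {n:>7}   time = {elapsed:.4f}s   ratio = {elapsed / prev:.2f}")
    prev, n = elapsed, n * 2
```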

Basic Sorting Algorithms (plus Python examples)

This site has animated visualizations of the main algorithms (h/t David R. Martin):
Sorting Algorithm Animations

The Sedgewick/Wayne book (part of a Coursera course) covers this well:
Princeton: Sorting Applications

Quicksort is the fastest general-purpose sort.
In most practical situations, quicksort is the method of choice. If stability is important and space is available, mergesort might be best. In some performance-critical applications, the focus may be on just sorting numbers, so it is reasonable to avoid the costs of using references and sort primitive types instead.
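
For reference, a minimal quicksort sketch in Python (not in-place; library sorts use tuned in-place variants). Note the point about stability: quicksort, unlike mergesort, is not stable.

```python
import random

def quicksort(xs):
    if len(xs) <= 1:
        return xs
    pivot = random.choice(xs)   # random pivot guards against worst-case inputs
    return (quicksort([x for x in xs if x < pivot])
            + [x for x in xs if x == pivot]
            + quicksort([x for x in xs if x > pivot]))

print(quicksort([5, 2, 9, 1, 5, 6]))   # [1, 2, 5, 5, 6, 9]
```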

Tuesday, 7 April 2015

Miko Matsumura: Data Science Is Dead

http://news.dice.com/2014/03/05/data-science-is-dead/

Yes, more and more companies are hoarding every single piece of data that flows through their infrastructure. As Google Chairman Eric Schmidt pointed out, we create more data in a single day today than all the data in human history prior to 2013.
Unfortunately, unless this is structured data, you will be subjected to the data equivalent of dumpster diving. But surfacing insight from a rotting pile of enterprise data is a ghastly process—at best. Sure, you might find the data equivalent of a flat-screen television, but you’ll need to clean off the rotting banana peels. If you’re lucky you can take it home, and oh man, it works! Despite that unappetizing prospect, companies continue to burn millions of dollars to collect and gamely pick through the data under respective roofs. What’s the time-to-value of the average “Big Data” project? How about “Never”?
If the data does happen to be structured data, you will probably be given a job title like Database Administrator, or Data Warehouse Analyst.
When it comes to sorting data, true salvation may lie in automation and other next-generation processes, such as machine learning and evolutionary algorithms; converging transactional and analytic systems also looks promising, because those methods deliver real-time analytic insight while it’s still actionable (the longer data sits in your store, the less interesting it becomes). These systems will require a lot of new architecture, but they will eventually produce actionable results—you can’t say the same of “data dumpster diving.” That doesn’t give “Data Scientists” a lot of job security: like many industries, you will be replaced by a placid and friendly automaton.
So go ahead: put “Data Scientist” on your resume. It may get you additional calls from recruiters, and maybe even a spiffy new job, where you’ll be the King or Queen of a rotting whale-carcass of data. And when you talk to Master Data Management and Data Integration vendors about ways to, er, dispose of that corpse, you’ll realize that the “Big Data” vendors have filled your executives’ heads with sky-high expectations (and filled their inboxes with invoices worth significant amounts of money). Don’t be the data scientist tasked with the crime-scene cleanup of most companies’ “Big Data”—be the developer, programmer, or entrepreneur who can think, code, and create the future.

Monday, 6 April 2015

Trey Causey: Getting started in data science: My thoughts

http://treycausey.com/getting_started.html


One of the primary things that separates a data scientist from someone just building models is the ability to think carefully about things like endogeneity, causal inference, and experimental and quasi-experimental design. Data scientists must understand and think about things like data generating processes and reason through how misspecifying them affects the conclusions they draw from their analyses.

It takes a long time and a lot of training for this to come naturally. I don't think I gave much thought to selecting on the dependent variable and how endemic it is until I got to grad school. Now it sticks out like a sore thumb everywhere I look. Similarly with thinking carefully about outliers (extreme values) or the process by which your data came to have missing values: these are things that often get swept aside by tutorials showing you how to use R. This isn't to say you have to go to grad school (you probably shouldn't) or even to college; it just means that data science is not simply a series of programs and tutorials that automatically make inferences from your data. Oftentimes, what isn't in your data has significant implications for inference. Your software package isn't going to tell you what they are.
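
A tiny simulation of the “selecting on the dependent variable” problem mentioned above: conditioning the sample on high outcomes noticeably weakens the observed relationship, even though the underlying process is unchanged.

```python
import random

random.seed(0)
# true process: y depends on x plus noise
data = [(x := random.gauss(0, 1), x + random.gauss(0, 1)) for _ in range(10_000)]

def corr(pairs):
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs) / n
    vx = sum((x - mx) ** 2 for x, _ in pairs) / n
    vy = sum((y - my) ** 2 for _, y in pairs) / n
    return cov / (vx * vy) ** 0.5

print(f"full sample:              r = {corr(data):.2f}")   # roughly 0.71
print(f"'successes' (y > 1) only: r = {corr([p for p in data if p[1] > 1]):.2f}")
```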

All this being said, I do think we live in an extremely exciting time for democratizing education. I hope some good comes out of it. Enough doom and gloom, and let's get on to the links.