Here at Intent HQ we use Wikipedia and Wikidata as sources of data. They are very important to us because they both encode an enormous amount of information in several languages that we use to build our Topic Graph.
Although our current process for handling these dumps works well enough, we are always interested in finding new and better ways of doing our work. That is why we were very excited when we saw the Reactive Streams initiative [1]. We thought it could be used to process the largest encyclopædia in the world [2].
One might be tempted to reach for some of the (alert, buzzword landing) Big Data tools everybody is using nowadays. Wikidata and Wikipedia are huge (the English version of Wikipedia, for example, has over 4.8 million articles [3] and the uncompressed dump is a single XML file of about 50 GB), which makes them challenging enough to work with. But we can’t call it Big Data: actually, the whole dataset fits “easily” in memory [4].
The proof of concept
So, here we are, trying to solve an old problem with a (fairly) new technology. The main goal of the proof of concept was to evaluate whether it was possible to process the whole Wikidata dump with constant memory usage while making the most of our computer (by using all of our CPU cores, for example).
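To make the "constant memory usage" goal concrete, here is a minimal sketch of backpressure using the JDK's own java.util.concurrent.Flow interfaces (Java 9+), which mirror the Reactive Streams interfaces. It is not the code used in the proof of concept: the names BoundedPipeline and CountingSubscriber are hypothetical, and counting lines stands in for real work. The point is only that a small, bounded buffer sits between the producer and the consumer, so memory use does not grow with the input.

```java
import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Flow;
import java.util.concurrent.SubmissionPublisher;
import java.util.concurrent.atomic.AtomicLong;

final class BoundedPipeline {

    /** Hypothetical subscriber: pulls one element at a time and counts what it sees. */
    static final class CountingSubscriber implements Flow.Subscriber<String> {
        private final AtomicLong count = new AtomicLong();
        private final CompletableFuture<Long> done = new CompletableFuture<>();
        private Flow.Subscription subscription;

        @Override public void onSubscribe(Flow.Subscription s) {
            subscription = s;
            s.request(1);                    // signal demand for the first element
        }

        @Override public void onNext(String line) {
            count.incrementAndGet();         // real per-element work would go here
            subscription.request(1);         // ask for the next element only when ready
        }

        @Override public void onError(Throwable t) { done.completeExceptionally(t); }

        @Override public void onComplete() { done.complete(count.get()); }

        CompletableFuture<Long> result() { return done; }
    }

    /** Pushes every line through a publisher whose per-subscriber buffer is bounded. */
    static long process(Iterable<String> lines, int bufferSize) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CountingSubscriber subscriber = new CountingSubscriber();
        try {
            // The enforced capacity may be rounded up to a power of two by the JDK.
            try (SubmissionPublisher<String> publisher =
                     new SubmissionPublisher<>(executor, bufferSize)) {
                publisher.subscribe(subscriber);
                for (String line : lines) {
                    publisher.submit(line);  // blocks while the subscriber's buffer is full
                }
            }                                // close() leads to onComplete once the buffer drains
            return subscriber.result().join();
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        long seen = process(Collections.nCopies(100_000, "line"), 16);
        System.out.println("processed " + seen + " lines");
    }
}
```

This sketch uses a single consumer thread, so it only illustrates the bounded-memory half of the goal; spreading work across all CPU cores would need additional stages or subscribers.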
The PoC was based on this requirement:
In order to obtain the Wikidata ID for an item given its title and a specific language, we should generate an index containing (title, lang) => canonical-id
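As an illustration of the shape of that index, here is a minimal in-memory sketch in Java (16+ for records). The types TitleIndex, Key and Item are hypothetical, and Item is a deliberately simplified stand-in for a parsed Wikidata entity (just an ID and one title per language); parsing the real dump is not shown.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Stream;

final class TitleIndex {

    /** Composite key: an item title in a specific language. */
    record Key(String title, String lang) {}

    /** Hypothetical, simplified item: canonical ID plus one title per language code. */
    record Item(String id, Map<String, String> titlesByLang) {}

    private final Map<Key, String> index = new HashMap<>();

    /** Consumes items one at a time; only the resulting index is kept in memory. */
    static TitleIndex build(Stream<Item> items) {
        TitleIndex result = new TitleIndex();
        items.forEach(item ->
            item.titlesByLang().forEach((lang, title) ->
                result.index.put(new Key(title, lang), item.id())));
        return result;
    }

    /** Returns the canonical ID for (title, lang), if the pair is in the index. */
    Optional<String> lookup(String title, String lang) {
        return Optional.ofNullable(index.get(new Key(title, lang)));
    }

    public static void main(String[] args) {
        TitleIndex index = build(Stream.of(
            new Item("Q183", Map.of("en", "Germany", "de", "Deutschland")),
            new Item("Q42", Map.of("en", "Douglas Adams"))));

        System.out.println(index.lookup("Deutschland", "de")); // prints Optional[Q183]
        System.out.println(index.lookup("Germany", "de"));     // prints Optional.empty
    }
}
```

Because the key includes the language, the same title can map to different items in different languages without colliding.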
Let’s see what we did…
Reactive Streams
Presentation: Asynchronous Java in a Microservices World
One of the ways that microservices architectures differ from monolithic architectures is that they add latency to request processing. This has consequences for what techniques you need to use when writing highly performant code. In particular, you need to do more things asynchronously. This talk covers some of my learnings from six years of working with asynchronous microservices and gives you some advice about how to choose the right tool for the job. Tools and libraries covered: Guava, RxJava and Spotify’s Trickle (which I am one of the authors of). Filmed at JFokus 2015.
Simple, lean & powerful HTTP apps with Reactive Streams
You can write Ratpack applications in Java 8 or any alternative JVM language that plays well with Java. Specific support for the Groovy language is provided, utilizing the latest static compilation and typing features.
Ratpack does not take a heavily opinionated approach as to what libraries and tools you should use to compose your application. As the developer of the application, you are in control. Direct integration of tools and libraries is favored over generic abstractions.
Ratpack is for nontrivial, high performance, low resource usage, HTTP applications.
GitHub RxJava
RxJava is a Java VM implementation of ReactiveX (Reactive Extensions): a library for composing asynchronous and event-based programs by using observable sequences.
For more information about ReactiveX, see the Introduction to ReactiveX page.
RxJava is Lightweight
RxJava tries to be very lightweight. It is implemented as a single JAR that is focused on just the Observable abstraction and related higher-order functions. (You could implement a composable Future that is similarly unbiased, but Akka Futures for example come tied in with an Actor library and a lot of other stuff.)
RxJava is a Polyglot Implementation
RxJava supports Java 6 or higher and JVM-based languages such as Groovy, Clojure, JRuby, Kotlin and Scala.
RxJava is meant for a more polyglot environment than just Java/Scala, and it is being designed to respect the idioms of each JVM-based language. (This is something we’re still working on.)
RxJava Libraries
The following external libraries can work with RxJava:
- Hystrix - latency and fault tolerance bulkheading library
- Camel RX provides an easy way to reuse any of the Apache Camel components, protocols, transports and data formats with the RxJava API
- rxjava-http-tail allows you to follow logs over HTTP, like tail -f
- mod-rxvertx - Extension for VertX that provides support for Reactive Extensions (RX) using the RxJava library
- rxjava-jdbc - use RxJava with jdbc connections to stream ResultSets and do functional composition of statements
- rtree - immutable in-memory R-tree and R*-tree with RxJava api including backpressure