
Tuesday, 22 December 2015

db-engines.com Database Popularity Rankings (updated)

Database Rankings

Rank  Prev. month  Prev. year  DBMS                  Database model     Score    +/- month  +/- year
1     1            1           Oracle                Relational DBMS    1497.55  +16.61     +37.76
2     2            2           MySQL                 Relational DBMS    1298.54  +11.70     +29.96
3     3            3           Microsoft SQL Server  Relational DBMS    1123.16  +0.83      -76.89
4     4            5           MongoDB               Document store     301.39   -3.22      +54.87
5     5            4           PostgreSQL            Relational DBMS    280.09   -5.60      +26.09
6     6            6           DB2                   Relational DBMS    196.13   -6.40      -14.13
7     7            7           Microsoft Access      Relational DBMS    140.21   -0.75      +0.31
8     8            9           Cassandra             Wide column store  130.84   -2.08      +36.78
9     9            8           SQLite                Relational DBMS    100.85   -2.60      +6.15
10    10           10          Redis                 Key-value store    100.54   -1.87      +12.66
11    11           11          SAP Adaptive Server   Relational DBMS    81.47    -2.24      -4.52
12    12           12          Solr                  Search engine      79.15    -0.63      +0.73
13    14           16          Elasticsearch         Search engine      76.57    +1.79      +30.67
14    13           13          Teradata              Relational DBMS    75.72    -1.37      +8.32
15    16           17          Hive                  Relational DBMS    55.27    +0.36      +18.90
16    15           15          HBase                 Wide column store  54.25    -2.21      +3.17
17    17           14          FileMaker             Relational DBMS    50.12    -1.61      -2.10
18    18           20          Splunk                Search engine      43.86    -0.76      +12.39
19    19           21          SAP HANA              Relational DBMS    38.86    -0.76      +11.05
20    20           18          Informix              Relational DBMS    36.40    -2.05      +1.28
21    21           23          Neo4j                 Graph DBMS         33.18    -0.86      +8.02



Friday, 18 September 2015

Hadoop, Spark, Storm (and ecosystem) links



What MapReduce can't do

We discuss here a large class of big data problems where MapReduce can't be used - not in a straightforward way at least - and we propose a rather simple analytic, statistical solution.

MapReduce is a technique that splits big data sets into many smaller ones, processes each small data set separately (but simultaneously) on different servers or computers, then gathers and aggregates the results of all the sub-processes to produce the final answer. Such a distributed architecture allows you to process big data sets 1,000 times faster than traditional (non-distributed) designs, if you use 1,000 servers and split the main process into 1,000 sub-processes.

MapReduce works very well in contexts where variables or observations are processed one by one. For instance, you analyze 1 terabyte of text data, and you want to compute the frequencies of all keywords found in your data. You can divide the 1 terabyte into 1,000 data sets, each 1 gigabyte. Now you produce 1,000 keyword frequency tables (one for each subset) and aggregate them to produce a final table.

However, when you need to process variables or data sets jointly, that is 2 by 2 or 3 by 3, MapReduce offers no benefit over non-distributed architectures. One must come up with a more sophisticated solution.
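
To make the keyword-frequency example concrete, here is a minimal single-machine sketch (not from the original post): the map step builds a frequency table per chunk, the reduce step merges the tables, and a Python process pool stands in for the 1,000 servers, with a tiny chunks list standing in for the 1-gigabyte subsets.

```python
# Minimal single-machine sketch of the keyword-frequency example:
# the "map" step counts keywords in each chunk independently, and the
# "reduce" step merges the per-chunk tables into one final table.
from collections import Counter
from multiprocessing import Pool

def map_chunk(chunk_of_text):
    """Map step: build a keyword-frequency table for one chunk."""
    return Counter(chunk_of_text.lower().split())

def reduce_counts(per_chunk_counts):
    """Reduce step: aggregate the per-chunk tables into the final table."""
    total = Counter()
    for counts in per_chunk_counts:
        total.update(counts)
    return total

if __name__ == "__main__":
    # In a real deployment each chunk would live on a different server;
    # here a process pool stands in for the cluster.
    chunks = ["big data is big",
              "map reduce splits big data",
              "data is processed in parallel"]
    with Pool() as pool:
        partial_tables = pool.map(map_chunk, chunks)
    print(reduce_counts(partial_tables).most_common(3))
```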

Here we are talking about self-joining 10,000 data sets with themselves, each data set being a time series of 5 or 10 observations. My point in the article is that the distributed architecture of MapReduce does not bring any smart processing to these computations, just brute force (better than a non-parallel approach, sure), and it is totally incapable of removing the massive redundancy involved, both in computation and in storage, since the same data set must be stored on a bunch of servers to allow all the cross-correlations to be computed.
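
A rough numpy sketch of that all-pairs workload, scaled down from 10,000 series to 100 so it runs instantly: the point is only that the number of correlations grows as N*(N-1)/2 and every series is reused N-1 times, which a per-record map step cannot exploit.

```python
# Rough sketch of the all-pairs workload described above: with N series
# there are N*(N-1)/2 correlations to compute, and every series is touched
# N-1 times -- the redundancy that a per-record map step cannot remove.
import itertools
import numpy as np

rng = np.random.default_rng(0)
N, T = 100, 10                      # 100 series of 10 observations (the post uses 10,000)
series = rng.normal(size=(N, T))

correlations = {}
for i, j in itertools.combinations(range(N), 2):   # all N*(N-1)/2 pairs
    correlations[(i, j)] = np.corrcoef(series[i], series[j])[0, 1]

print(len(correlations))            # 4950 pairs for N = 100
```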


This underscores the importance of a major upcoming change to Hadoop: the introduction of YARN, which will open up Hadoop's distributed data and compute framework for use with non-MapReduce paradigms and frameworks. One such example is RushAnalytics, which, incidentally, also adds fine-grained parallelism within each node of a cluster.

Wednesday, 27 May 2015

General Data Science links



NY Times: For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights

Yet far too much handcrafted work — what data scientists call “data wrangling,” “data munging” and “data janitor work” — is still required. Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.
“Data wrangling is a huge — and surprisingly so — part of the job,” said Monica Rogati, vice president for data science at Jawbone, whose sensor-filled wristband and software track activity, sleep and food consumption, and suggest dietary and health tips based on the numbers. “It’s something that is not appreciated by data civilians. At times, it feels like everything we do.”
Several start-ups are trying to break through these big data bottlenecks by developing software to automate the gathering, cleaning and organizing of disparate data, which is plentiful but messy. The modern Wild West of data needs to be tamed somewhat so it can be recognized and exploited by a computer program.
“It’s an absolute myth that you can send an algorithm over raw data and have insights pop up,” said Jeffrey Heer, a professor of computer science at the University of Washington and a co-founder of Trifacta, a start-up based in San Francisco.


Wednesday, 22 April 2015

General Big Data, Big Analytics links

5 technologies that will help big data cross the chasm

Apache Spark

Spark’s popularity is aided by the YARN resource manager for Hadoop and the Apache Mesos cluster-management software, both of which make it possible to run Spark, MapReduce and other processing engines on the same cluster using the same Hadoop storage layer. I wrote in 2012 about the move away from MapReduce as one of five big trends helping us rethink big data, and Spark has stepped up as the biggest part of that migration.
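
Assuming pyspark is available, a minimal sketch of "Spark over the same Hadoop storage layer" might look like the word count below; the application name and the hdfs:// path are placeholders, and the job would be pointed at YARN via spark-submit --master yarn.

```python
# Minimal PySpark sketch: the same word count that MapReduce would run,
# reading from the existing HDFS storage layer. The application name and
# the hdfs:// path are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("wordcount-sketch")
         .getOrCreate())            # pass --master yarn via spark-submit to run on YARN

lines = spark.sparkContext.textFile("hdfs:///data/corpus/*.txt")
counts = (lines.flatMap(lambda line: line.lower().split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

for word, count in counts.takeOrdered(10, key=lambda kv: -kv[1]):
    print(word, count)

spark.stop()
```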

Friday, 17 April 2015

Dean Wampler: SQL Strikes Back! Recent Trends in Data Persistence and Analysis


Traditional Data Warehouse
Pros:
–Mature
–Rich SQL, analytics functions
–Scales to “mid-size” data
Cons:
–Expensive per TB
–Can’t scale to Hadoop-sized data sets

Data Warehouse vs. Hadoop?
• Data Warehouse
  +Mature
  +Rich SQL, analytics
  –Scalability
  –$$/TB
• Hadoop
  –Maturity vs. DWs
  +Growing SQL
  +Massive scalability
  +Excellent $$/TB

Facebook had data in Hadoop.  Facebook’s Data Analysts needed access to it...
 so they created Hive...
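
Hive itself is queried in HiveQL; as a rough illustration of the same SQL-on-Hadoop idea in today's terms, a Spark session with Hive support enabled can run plain SQL over warehouse tables (the table and column names below are invented).

```python
# Rough illustration of the SQL-on-Hadoop idea that Hive introduced:
# a Spark session with Hive support runs plain SQL over tables stored
# in the Hadoop warehouse. Table and column names here are made up.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("sql-on-hadoop-sketch")
         .enableHiveSupport()
         .getOrCreate())

daily_counts = spark.sql("""
    SELECT event_date, COUNT(*) AS events
    FROM warehouse.page_views          -- hypothetical Hive table
    GROUP BY event_date
    ORDER BY event_date
""")
daily_counts.show(10)
spark.stop()
```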

Wednesday, 15 April 2015

Torsten Möller: Data Visualization Course

http://www2.cs.sfu.ca/~torsten/Teaching/Cmpt467/

Content Description:
Visualization deals with all aspects of the visual representation of data sets from scientific experiments, simulations, medical scanners, databases, web systems, and the like, in order to achieve a deeper understanding or a simpler representation of complex phenomena and to extract important information visually. To achieve these goals, both well-known techniques from the field of interactive computer graphics and completely new methods are applied. The objective of the course is to provide knowledge about visualization algorithms and data structures as well as acquaintance with practical applications of visualization. Through several projects the student is expected to learn methods to explore and visualize different kinds of data sets.
  • Introduction and historical remarks
  • Abstract visualization concepts and the visualization pipeline
  • Data acquisition and representation (sampling and reconstruction; grids and data structures).
  • Basic mapping concepts
  • Visualization of scalar fields (isosurface extraction, volume rendering; see the sketch after this list)
  • Visualization of vector fields (particle tracing, texture-based methods, vector field topology)
  • Tensor fields, multi-attribute data, multi-field visualization
  • Human visual perception + Color
  • Space/Order + Depth/Occlusion
  • Focus+Context; Navigation+Zoom
  • Visualization of graphs and trees and high-dimensional data
  • Evaluation + Interaction models 
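
As a tiny, unofficial companion to the "visualization of scalar fields" topic above (not part of the course materials), the sketch below samples a 2D scalar field on a regular grid and extracts labelled isolines, the 2D analogue of isosurface extraction; the field itself is made up.

```python
# Sample a 2D scalar field on a regular grid and extract isolines
# (the 2D analogue of isosurface extraction). Illustrative only.
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-2.0, 2.0, 200)
y = np.linspace(-2.0, 2.0, 200)
X, Y = np.meshgrid(x, y)
Z = np.exp(-(X**2 + Y**2)) * np.cos(3 * X)    # the sampled scalar field

fig, ax = plt.subplots()
contours = ax.contour(X, Y, Z, levels=8)      # isoline extraction
ax.clabel(contours, inline=True, fontsize=8)  # label each isovalue
ax.set_title("Isolines of a sampled 2D scalar field")
plt.show()
```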

Tuesday, 7 April 2015

Miko Matsumura: Data Science Is Dead

http://news.dice.com/2014/03/05/data-science-is-dead/

Yes, more and more companies are hoarding every single piece of data that flows through their infrastructure. As Google Chairman Eric Schmidt pointed out, we create more data in a single day today than all the data in human history prior to 2013.
Unfortunately, unless this is structured data, you will be subjected to the data equivalent of dumpster diving. But surfacing insight from a rotting pile of enterprise data is a ghastly process—at best. Sure, you might find the data equivalent of a flat-screen television, but you’ll need to clean off the rotting banana peels. If you’re lucky you can take it home, and oh man, it works! Despite that unappetizing prospect, companies continue to burn millions of dollars to collect and gamely pick through the data under respective roofs. What’s the time-to-value of the average “Big Data” project? How about “Never”?
If the data does happen to be structured data, you will probably be given a job title like Database Administrator, or Data Warehouse Analyst.
When it comes to sorting data, true salvation may lie in automation and other next-generation processes, such as machine learning and evolutionary algorithms; converging transactional and analytic systems also looks promising, because those methods deliver real-time analytic insight while it’s still actionable (the longer data sits in your store, the less interesting it becomes). These systems will require a lot of new architecture, but they will eventually produce actionable results—you can’t say the same of “data dumpster diving.” That doesn’t give “Data Scientists” a lot of job security: like many industries, you will be replaced by a placid and friendly automaton.
So go ahead: put “Data Scientist” on your resume. It may get you additional calls from recruiters, and maybe even a spiffy new job, where you’ll be the King or Queen of a rotting whale-carcass of data. And when you talk to Master Data Management and Data Integration vendors about ways to, er, dispose of that corpse, you’ll realize that the “Big Data” vendors have filled your executives’ heads with sky-high expectations (and filled their inboxes with invoices worth significant amounts of money). Don’t be the data scientist tasked with the crime-scene cleanup of most companies’ “Big Data”—be the developer, programmer, or entrepreneur who can think, code, and create the future.