Tuesday, 30 December 2014

TEN trends in DATA SCIENCE 2015

There is a certain irony in talking about trends in data science, as much of data science is geared primarily to detecting and extrapolating trends from disparate data patterns. This piece is part of a series of analyses I've written for over a decade, looking at what I see as the key areas that most heavily impact the particular area of technology I'm focusing on. For the last few years, that has been a set of technologies which have increasingly been subsumed under the rubric of Data Science.
I tend to use the term to embrace an understanding of four key areas - Data Acquisition (how you get data into a usable form and set of stores or services), Data Awareness (how you provide context to this data so that it can work more effectively across or between enterprises), Data Analysis (turning this aware data into usable information for decision makers and data consumers) and Data Governance (establishing the business structures, provenance maintenance and continuity for that data). These I collectively call the Data Cycle, and it seems to be the broad arc that most data (whether Big Data or Small Data) follows in its life cycle. I'll cover this cycle in more detail later, but for now, it provides a reasonably good scope for what I see as the trends that are emerging in this field.
This has been a remarkably good year in the field of data science - the Big Data field both matured and spawned a few additional areas of study, semantics went from being an obscure term to getting attention in the C-suite, and the demand for good data visualizers went from tepid to white hot.
2015 looks to be more of the same, with the focus shifting more to the analytics and semantic side, and Hadoop (and Map/Reduce without Hadoop) becoming more mainstream. These trends benefit companies looking for a more comprehensive view of their information environment (both within and outside the company), and represent opportunities in the consulting space for talented analysts, programmers and architects.
The following trends are particularly noteworthy in the coming year:

Rise of Data Virtualization

The areas of natural language queries, semantics, and data hubs are converging in the realm of data virtualization. At its simplest, data virtualization is the process of opening up company data silos and making them accessible to one another through the use of hybrid data systems capable of both storing and retrieving content in a wide variety of formats.
With data virtualization, data can come in from multiple channels and formats (traditional ETL, data feeds in XML and JSON, word processing, spreadsheet and slideshow documents, and so forth), be mined for semantic attachments, and then be stored internally within a data system. Queries to this database can be done using natural language questions - "Who were our top five clients by net revenue?", "Show me a graph of earnings by quarter starting in 2012", and so forth. Beyond such questions, data virtualization is also able to present output in a variety of different forms for more sophisticated uses of the data, including providing it as a data stream for reporting purposes and visualization tools.
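To make this concrete, here is a minimal Node.js sketch of what querying such a virtualization layer might look like. The host name, the /query endpoint and the shape of the response are hypothetical stand-ins for whatever interface a given hybrid store actually exposes.

    // Hypothetical example: POST a natural language question to a data
    // virtualization endpoint and print the tabular JSON it returns.
    var http = require('http');

    var question = JSON.stringify({ q: "Who were our top five clients by net revenue?" });

    var req = http.request({
      hostname: 'data.example.com',   // hypothetical virtualization hub
      path: '/query',                 // hypothetical endpoint
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Content-Length': Buffer.byteLength(question)
      }
    }, function (res) {
      var body = '';
      res.on('data', function (chunk) { body += chunk; });
      res.on('end', function () {
        // Assume the hub answers with rows of { client, netRevenue }
        JSON.parse(body).rows.forEach(function (row) {
          console.log(row.client + '\t' + row.netRevenue);
        });
      });
    });

    req.on('error', function (err) { console.error(err); });
    req.write(question);
    req.end();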

Hybrid Data Stores Become More Common

One thing that makes such systems feasible is the rise of hybrid data stores. Such stores are capable of storing information in different ways and transforming it internally, along with providing more sophisticated mid-tier logic. Such systems might include the ability to work with XML, JSON, RDF and relational data in a single system, provide deep query capability in multiple modes (JSON query, XQuery, SPARQL, SQL), and take advantage of the fact that information doesn't have to be serialized out to text and back in order to be processed, which can make such operations an order of magnitude faster.
This is most readily seen in the XML database space, since those products have generally been around long enough to have thoroughly worked out the indexing structures necessary to support such capabilities. MarkLogic and eXist-db are perhaps my two favorites in this regard, with MarkLogic in particular bridging the gap. One thing that differentiates these from other systems is the robustness of the mid-tier API, with many things that have traditionally been the province of application logic now moving into the data tier. However, if you look at other NoSQL systems such as Couchbase or MongoDB, you see the same philosophy making its way into those systems, where JavaScript within the server becomes the glue for handling data orchestration, transformations, and rules logic.
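As an illustration of that last point, here is a hedged sketch of the kind of server-side JavaScript glue such a store might host. The store.insert and store.addTriples calls are hypothetical placeholders rather than any vendor's actual API; the point is simply that ingest rules, document storage and semantic enrichment can all live together inside the data tier.

    // Hypothetical server-side module in a hybrid store: accept an incoming
    // JSON order, derive a few triples from it, and persist both - without
    // ever serializing the document back out to the application tier.
    function ingestOrder(store, order) {
      // Simple rules logic applied at ingest time.
      order.priority = order.total > 10000 ? 'high' : 'normal';

      // Persist the JSON document itself (hypothetical call).
      var uri = '/orders/' + order.id + '.json';
      store.insert(uri, order);

      // Record relationships as triples alongside the document (hypothetical call).
      store.addTriples([
        { subject: uri, predicate: 'placedBy',   object: '/customers/' + order.customerId },
        { subject: uri, predicate: 'orderTotal', object: order.total }
      ]);

      return uri;
    }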

Semantics Becomes Standard

In the semantics sphere, with the SPARQL 1.1 specification and SPARQL 1.1 Update now finalized, 2014 saw the release of the first products to fully incorporate these standards. Over the course of the next year, this major upgrade to the SPARQL standard will become the de facto mechanism for communicating with triple stores, which will in turn drive the utilization of new semantics-based applications.
Semantics already figures pretty heavily in recommendation engines and similar applications, since these kinds of applications deal more heavily with searching and making connections between types of resources, and it plays fairly heavily in areas such as machine learning and NLP. With Update, you gain the ability to script changes, much as you do with SQL's UPDATE capability, adding, modifying and removing graphs (roughly analogous to databases) and resources, and you can do so against any SPARQL-compliant data store.
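A minimal sketch of what that looks like over the wire: posting an INSERT DATA request from Node.js to a triple store's SPARQL 1.1 update endpoint. The host, the endpoint path and the example graph are placeholders.

    // Send a SPARQL 1.1 Update (INSERT DATA) to a SPARQL-compliant store.
    var http = require('http');

    var update =
      'PREFIX ex: <http://example.org/ns#>\n' +
      'INSERT DATA { GRAPH <http://example.org/clients> {\n' +
      '  ex:client42 ex:name "Acme Corp" ;\n' +
      '              ex:netRevenue 1250000 .\n' +
      '} }';

    var req = http.request({
      hostname: 'triples.example.com',  // placeholder endpoint host
      path: '/sparql/update',           // placeholder update endpoint path
      method: 'POST',
      headers: {
        'Content-Type': 'application/sparql-update',
        'Content-Length': Buffer.byteLength(update)
      }
    }, function (res) {
      console.log('Update status:', res.statusCode);
    });

    req.on('error', function (err) { console.error(err); });
    req.write(update);
    req.end();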
I'm also expecting that the SPARQL Service Description vocabulary will become more common, providing a standardized mechanism for discovering data sets. This is pretty critical, as it allows people (and computer systems) to discover the structure of data sets programmatically and build tools that make accessing those datasets easy and clean, without the often confusing overhead of dealing with RDF's namespace complexity.
My suspicion is that this will also be the year when a number of existing names in the semantics field become targets for acquisition, as semantics becomes increasingly seen as a must-have complement to other types of data storage.

Hadoop Yarn and Hadoop without Hadoop

There is no question that 2014 was a banner year for Hadoop - Hortonworks, one of the biggest Hadoop players, completed its IPO, and if you are a Java developer, you almost certainly have a Hadoop test project or three under your belt by now in order to take advantage of Hadoop opportunities.
In this space, Yarn and the container model are perhaps the most radical of the changes - the realization that Hadoop is more than just map/reduce has been brewing for some time, and Yarn offers the capability of building containers that can extend beyond orchestration into data federation. Yarn drives Storm, which handles more efficient streaming operations, and Tez, which establishes a generalized execution engine that will eventually subsume Map/Reduce. Finally, Spark looks to take on in-memory processing, targeting the rise of solid state drives in the data sphere.
However, success almost invariably breeds competition, and there is no question that this is happening in the Big Data sphere. Amazon, for instance, has created its own Elastic MapReduce capability as part of its AWS offerings, while MongoDB has its own built-in map/reduce capability, and MarkLogic's flexible replication and multi-tier storage is built with similar M/R processing in mind (including from Hadoop stores). To me, this represents a shift from playing nicely with Hadoop infrastructure to attempting to offer alternatives to Hadoop for different stacks, such as the emerging JavaScript stack.
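For a sense of how that looks outside of Hadoop proper, here is a small map/reduce job written for the MongoDB shell, summing order totals per customer; the orders collection and its fields are hypothetical, and both the map and reduce steps are plain JavaScript functions.

    // Run in the mongo shell: group order totals by customer,
    // no Hadoop infrastructure required.
    var mapFn = function () {
      emit(this.customerId, this.total);    // key: customer, value: order total
    };

    var reduceFn = function (key, values) {
      return Array.sum(values);             // sum the totals for that customer
    };

    db.orders.mapReduce(mapFn, reduceFn, {
      query: { status: "shipped" },         // restrict to completed orders
      out: "totals_by_customer"             // write results to this collection
    });

    db.totals_by_customer.find().sort({ value: -1 }).limit(5);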

Databases Become Working Memory

One of the more subtle shifts that has been happening for a while, but will accelerate through 2015, is the erosion of tiered development in favor of intelligent nodes with local "working storage". This is in fact a natural consequence of the rise of lightweight (mainly RESTful) services - applications are no longer concentrated on any one tier. Instead, what seems to be emerging is a model whereby every node - whether a laptop, a mobile device, a server, or even a simple IoT sensor - now has enough processing power to make decisions, and has the ability to store relevant state data either locally or within an intermediate data node.
Some of this can be seen in the rise of "offline" apps. Such applications, whether on a computer or a tablet (the smartphone having become a micro-tablet with calling capabilities), have databases that are local to the device, so that you can type an email, create a drawing, or work with a spreadsheet even when not connected to the web. When connected, that data is then synchronized with a reflection of it on a server through some type of cloud service that maintains its own internal smarts, and that server is able to communicate with other nodes through built-in http capabilities.
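A hedged, browser-side sketch of that pattern: drafts are written to local storage immediately, and a sync routine pushes them to a hypothetical /sync endpoint whenever the device comes back online.

    // Minimal offline-first sketch: persist drafts locally, sync when online.
    function saveDraft(draft) {
      var drafts = JSON.parse(localStorage.getItem('drafts') || '[]');
      drafts.push(draft);
      localStorage.setItem('drafts', JSON.stringify(drafts));  // survives being offline
      if (navigator.onLine) { syncDrafts(); }
    }

    function syncDrafts() {
      var drafts = JSON.parse(localStorage.getItem('drafts') || '[]');
      if (drafts.length === 0) { return; }

      var xhr = new XMLHttpRequest();
      xhr.open('POST', '/sync', true);                 // hypothetical sync endpoint
      xhr.setRequestHeader('Content-Type', 'application/json');
      xhr.onload = function () {
        if (xhr.status === 200) {
          localStorage.removeItem('drafts');           // the server now holds the state
        }
      };
      xhr.send(JSON.stringify(drafts));
    }

    // Retry the push automatically whenever connectivity returns.
    window.addEventListener('online', syncDrafts);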
These server hosts are increasingly using such communication channels to more effectively cache data services from other nodes in the network, using profile data held in the cloud to customize this information. Voice recognition takes advantage of this to push the heavy processing of converting a sound file to meaningful text onto the server, and you are seeing more applications where this is being expanded to image and video recognition as well. At each stage, persistent memory is being used primarily as a mechanism for storing intermediate data states and for powering autonomous agents that can respond to external events to change that state.
This will have a broad, long-term effect on the way that applications are developed, with the bulk of the information that moves between systems shifting away from the passing of rendered UI toward the passing of data streams consumed by templatized web frameworks and polyfills.

Towards a Universal Data Query Language

I've often thought of query languages as being analogous to the fundamental forces. Each query language has its own characteristics, its own syntax, and its own way of dealing with data stores, just as each force at first appears unique. Electricity and magnetism seem distinct, yet James Clerk Maxwell showed in the 19th century that the two were in fact different aspects of the same force. The electromagnetic and weak forces were unified in the latter half of the 20th century, and the strong force was brought into the same overall framework not long after, though a full unification remains a work in progress. There are strong hints that gravity may be unified as well, but because it is so weak at short distances it's remarkably difficult to prove out how that unification takes place.
In a similar fashion, SQL, XQuery and SPARQL were each put forth as solutions for querying relational, XML and RDF data respectively, and there are a number of different proposals for querying JSON, depending upon the database in question, from Daniela Florescu's and 28msec's JSONiq to Couchbase's N1QL, though no single query language has become standardized there. In my experience, the problem of querying JSON stems primarily from the fact that JSON itself is a comparatively new format.
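To see why unification is appealing, here is the same question - "top five clients by net revenue" - phrased for each of the three established languages, held below as JavaScript string constants against a hypothetical clients data set; notably, there is still no equally standard way to write the JSON-native version.

    // The same question asked three different ways (schema and vocabulary are hypothetical).
    var sqlQuery =
      "SELECT name, net_revenue FROM clients " +
      "ORDER BY net_revenue DESC LIMIT 5";

    var xqueryQuery =
      "(for $c in /clients/client " +
      " order by number($c/netRevenue) descending " +
      " return $c/name)[position() le 5]";

    var sparqlQuery =
      "PREFIX ex: <http://example.org/ns#> " +
      "SELECT ?name ?rev WHERE { ?c ex:name ?name ; ex:netRevenue ?rev } " +
      "ORDER BY DESC(?rev) LIMIT 5";

    // For JSON there is no single standardized equivalent yet - the gap that
    // JSONiq, N1QL and the other proposals are each trying to fill.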
David Lee at MarkLogic has been at the forefront of developing a comprehensive data algebra for extending the XML and HTML object model to handle JSON entities more completely. This work is critical, because it puts JSON data stores on the same mathematical footing that XML, RDF and SQL data stores currently enjoy.
One thing that I expect to see in 2015 is the emergence of a consensus within the various data communities that 1) a standardized mechanism for querying JSON should be taken up by one or more standards bodies (ECMA would be the logical place for this, but it may be a joint effort between ECMA and the W3C), and 2) a generalized query language standard should be proposed that merges the primary query environments. This won't completely replace proprietary "standards", but it will provide a minimal level of conformance that should make it possible for data to be queried consistently throughout the stack. I don't see either of these efforts completing in 2015, but I'm hoping to see them get underway.

Data Analytics Moves Beyond SQL

Analytics is emerging as THE hot data science profession. Formerly, analytics nerds were shunned by their contemporaries as not being real programmers, but with salaries for experienced analysts now pushing up into the $200,000+ range, the need to know about chi-square tests and regression analysis is forcing a lot of programmers and BAs to dig through their libraries for their dusty stochastics and probability theory textbooks from college.
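As a reminder of what those textbooks cover, here is a tiny JavaScript sketch of the chi-square statistic for observed versus expected counts, the kind of building block these analytics roles lean on; the sample numbers are made up.

    // Chi-square statistic: sum over categories of (observed - expected)^2 / expected.
    function chiSquare(observed, expected) {
      var total = 0;
      for (var i = 0; i < observed.length; i++) {
        var diff = observed[i] - expected[i];
        total += (diff * diff) / expected[i];
      }
      return total;
    }

    // Made-up example: did this week's traffic split across four channels
    // deviate from the expected split?
    var observed = [120,  90, 45, 45];
    var expected = [100, 100, 50, 50];
    console.log(chiSquare(observed, expected));  // 6 - compare against a chi-square
                                                 // table at 3 degrees of freedom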
This boom has also, however, caught the analytics industry flat-footed, and propelled otherwise obscure products (for programmers, anyway) like MATLAB and R into must-learn tools. The challenge that companies such as Pentaho and Cognizant face is that most of their toolsets have been geared towards SQL analytics, yet most of the big data revolution has happened primarily in the NoSQL and semantics space. This is creating a rush as these companies (along with the creators of generalized business intelligence and reporting tools) struggle to better integrate these alternative systems, and it likely represents a significant investment opportunity for the first companies that can successfully do so.
The difficulty, though, involves more than simply writing a JSON plugin. Most data analytics tools "end" at the text analytics stage, which basically deals with search-based systems. Yet the data being analyzed now is no longer simple rows and columns but unstructured and semi-structured narrative content, which has its own language of analytics. Semantics and graph databases open up a major venue for graph analytics as well, but here again none of this is integrated with either text or data analytics. What this points to is that the field of analytics itself is undergoing a major evolution, as people increasingly look for commonalities and patterns that allow for a total view of the dataspace, regardless of the type of information being represented.

The JavaScript Stack Solidifies

I'm going to date myself badly here. The first book that I started to write, back in 1995, was a book on this toy language called JavaScript, which Netscape had just released that year. I eventually had to cancel it - there was just not enough "there" to make a good book, as its primary purpose at the time seemed to be calling alert dialogs. It would take another couple of years before the language had become solid enough that I could write a book on JavaScript, and even then it was largely in the context of XHTML and CSS.
Twenty years later, JavaScript is a completely different language, incorporating bits and pieces of everything from Smalltalk to Haskell to Lisp. Some of the emerging features of the language, such as the use of promises in programming, are profound, and the JavaScript arms race of the past five years has made JavaScript one of the most widely used languages on the planet. (I am working on a piece about some of the cooler features of modern EcmaScript and its implementation in Google's V8 engine, so will dig into this in more detail in the future.)
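For readers who have not run into them yet, here is a small sketch of the promise style, using the Promise constructor that V8 and newer browsers have begun shipping (older environments need a promise library or polyfill); the delay helper is just an illustration.

    // Wrap a timeout in a promise, then chain work onto it instead of nesting callbacks.
    function delay(ms) {
      return new Promise(function (resolve) {
        setTimeout(function () { resolve(ms); }, ms);
      });
    }

    delay(500)
      .then(function (waited) {
        console.log('waited', waited, 'ms');
        return delay(250);                 // returning a promise chains the next step
      })
      .then(function (waited) {
        console.log('then waited another', waited, 'ms');
      })
      .catch(function (err) {
        console.error('something in the chain failed:', err);
      });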
Node.js has become the nucleus of a new stack that is likely to relegate Ruby and Python to has-been languages, and might even end up dethroning PHP. Node by itself provides considerable power as a generalized server language - handling not only http traffic but websockets, ftp, ssh and a slew of other low-level communication protocols. Node has APIs into both relational and NoSQL data stores, in some cases more sophisticated than those stores' native APIs, and increasingly ties together web client frameworks such as Angular, Ember and Dart. It also serves as the virtual machine for other languages, including CoffeeScript and its various support stacks.
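A minimal sketch of why Node works as that nucleus: a few lines of JavaScript give you a working data service (the JSON payload here is just a stand-in).

    // A tiny Node.js data service: every request gets a JSON response.
    var http = require('http');

    var server = http.createServer(function (req, res) {
      var payload = { path: req.url, served: new Date().toISOString() };
      res.writeHead(200, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify(payload));
    });

    server.listen(8080, function () {
      console.log('data service listening on http://localhost:8080/');
    });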
Much of the code that is being written today in the data space is being written in JavaScript. The OpenCPU project, for instance, lets you call out to a bridge for R (R can invoke shells directly, so it represents a significant security risk unless called through a facade). This way you can use the powerful analytics and visualization tools of R from any JavaScript system, and then pass the results on as JSON or SVG (which seems to have found a second life in the visualization space). It also means that many of the same libraries used for that work can be reused within Node - which hints at where so much of the power of the JavaScript stack really lies.
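A hedged sketch of what that bridge can look like from Node: posting a request to an OpenCPU server and asking for the result back as JSON. The host is a placeholder, and the endpoint shape follows OpenCPU's HTTP API as I understand it, so verify the details against the current OpenCPU documentation before relying on it.

    // Ask an OpenCPU server to run R's rnorm() and return the result as JSON.
    var http = require('http');

    var form = 'n=5&mean=0&sd=1';

    var req = http.request({
      hostname: 'opencpu.example.com',          // placeholder OpenCPU host
      path: '/ocpu/library/stats/R/rnorm/json', // endpoint shape per OpenCPU's HTTP API
      method: 'POST',
      headers: {
        'Content-Type': 'application/x-www-form-urlencoded',
        'Content-Length': Buffer.byteLength(form)
      }
    }, function (res) {
      var body = '';
      res.on('data', function (chunk) { body += chunk; });
      res.on('end', function () {
        console.log('five random draws from R:', JSON.parse(body));
      });
    });

    req.on('error', function (err) { console.error(err); });
    req.write(form);
    req.end();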

Data Science Teams

Back in college, I had several courses in stochastic methods and probability theory as part of completing a physics degree, but that was a long time ago. In the interim, I have learned how to program, learned the basics of data structures and algorithms, and spent a long time exploring the mathematical boundaries between data stores and data services (something that admittedly gets very deep at times). I know about data, but I would be hard pressed to say that I could do all of the tasks that tend to be expected of data scientists if you read the want ads. To be honest, I think this is because what many recruiters are looking for are unicorns - people who have all the hot requirements and deep experience but are willing to work for comparatively low wages. Data science is a broad field, and most people within that field are in fact specialists in one aspect or another of it.
I see the emergence within organizations of data science teams. Typically, such teams will be made up of a number of different specialties:
  • Integrator. A programmer or DBA that specializes in data ingestion and ETL from multiple different sources. Their domain will tend to be services and databases, and as databases become data application platforms, their role primarily shifts from being responsible for schemas to being responsible for building APIs. Primary focus: Data Acquisition
  • Data Translation Specialist. This will typically be a person focused on Hadoop, Map/Reduce and similar intermediate processing necessary to take raw data and clean it, transform it, and simplify it. They will work with both integrators and ontologists. Primary Focus: Data Acquisition
  • Ontologist. The ontologist is a data architect specializing in building canonical models, working with different models, and establishing relationships between data sets. They will often have semantics or UML backgrounds. Primary focus: Data Awareness.
  • Curators. These people are responsible for the long term management, sourcing and provenance of data. This role is often held by librarians or archivists. They will often work closely with the ontologists. Primary Focus: Data Awareness.
  • Stochastic Analyst (Data Scientist?). This role is becoming a specialist one, in which people versed in increasingly sophisticated stochastic and semantic analysis tools take the contextual data and extract trends, patterns and anti-patterns from it. They usually have a strong mathematical or statistical background, and will typically work with domain experts. Primary Focus: Data Analysis
  • Domain Expert. Typically these are analysts who know their particular domain, but aren't necessarily expert on informatics. These may be financial specialists, business analysts, researchers, and so forth, depending upon the specific enterprise focus. Primary Focus: Data Analysis
  • Visualizers. These are typically going to be web interface developers with skills in areas such as SVG or Canvas and the suites of visualization tools that are emerging in this area. Their role is typically to take the data at hand and turn it into usable, meaningful information. They will work closely with both domain experts and stochastic analysts, as well as with the ontologist to better coerce the information coming from the data systems into meaningful patterns. Primary Focus: Data Analysis
  • Data Science Manager. This person is responsible for managing the team, understanding all of the domains reasonably well enough to interface with the client, and coordinating efforts. This person also is frequently the point person for establishing governance. Primary Focus: All.
Not all teams will have one person for each of these roles (some people may have skills in two or three of these areas, while in other cases there may be several people working to build visualizations or perform analysis), but all of these roles will be represented over the lifecycle of data. Note that the term "data scientist" currently seems to be most readily appropriated by the stochastic analyst, and it's likely that the term will stick, but in reality that person will almost certainly be focused primarily on analytic functions.
The creation of such teams will likely become much more obvious in 2015, though it has been going on quietly throughout 2014. Significantly, the data science manager may either be, or report to, the CIO of a company.

Data Visualization and Flexible Reporting Becomes Real

People who make pictures get no respect. That's in great part because most visualizations - maps, charts, graphs and so forth - can be taken in quickly in comparison to written reports, and so the underlying assumption is that if it seems simple to understand, it is simple to create.
In reality, good visualizations are incredibly difficult (and potentially time consuming) to create, and just as the skills for being a good data scientist are rare, so too are the skills for being a good visualization specialist. Not only do you need to be a good programmer, but you also need to have strong graphical skills and the ability to condense a lot of information into an information graphic. This is a rare combination, especially when domain knowledge also needs to be considered.
This is why data visualization is breaking away from UX specialists - people who work on workflow and user interface design. The two have some overlap, but not as much as most people believe. Indeed, in many cases, the best preparation for strong data visualization work is mobile app (and especially game) development. Similarly, cartographic visualization is becoming a specialty even within the visualization field, combining cartographic skills with knowledge of data systems and stochastics.
Some of the more basic visualizations are often later encoded into software (this is where companies such as Tableau make their money), but this is an area that is as much art as it is science.
One key point to understand is that Big Data is typically complex data - data that is in fact generally too complex for most people to directly comprehend. The data visualizer may in fact be the only person who can make things such as data models, trends analysis, and geospatial information meaningful to decision makers or the public in general. Expect this to become a critical part of any data science team, and a key player in making information consumable.
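For a taste of the mechanics involved, here is a minimal sketch that turns a few made-up quarterly figures into an SVG bar chart as a string; real visualization work layers scales, axes, labels and interaction on top of exactly this kind of foundation, usually via a library.

    // Build a bare-bones SVG bar chart from quarterly revenue figures (made-up data).
    function barChart(values, width, height) {
      var max = Math.max.apply(null, values);
      var barWidth = width / values.length;
      var bars = values.map(function (v, i) {
        var barHeight = (v / max) * height;
        return '<rect x="' + (i * barWidth) + '" y="' + (height - barHeight) +
               '" width="' + (barWidth - 4) + '" height="' + barHeight +
               '" fill="steelblue"/>';
      }).join('');
      return '<svg xmlns="http://www.w3.org/2000/svg" width="' + width +
             '" height="' + height + '">' + bars + '</svg>';
    }

    console.log(barChart([120, 180, 90, 210], 400, 150));  // save as an .svg file or embed in a page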
As a corollary, the changes that are occurring in visualization will also be occurring at the reporting level, though I think this may be a bigger story for 2016 or 2017, simply because it will require tools better able to handle the increasing shift of information away from SQL connections (and to a certain extent XML) to JSON as the primary delivery vehicle driving these reporting tools.
Note that as with visualizations, the important thing to understand here is that such reporting documents are "live" - they are driven by data that is circumscribed in time (or by similar constraints) but is otherwise driven by data services. The typical "report" within the next few years will most likely be an app connected to the Internet (or at least that maintains its own intermediate data store independent of the document itself).

Summary

Every decade seems to have a set of overlapping themes. Much of the foundation for the current data revolution was established in the 2000s, even as the mobile revolution went through its most visible consumer arc (with a lot of mobile's first tentative steps in turn happening in the 1990s). Unless you are already immersed in it, much of the software world currently looks pretty dull, because the changes that you are seeing are happening in the cloud, on distributed servers and in invisible databases within the browser or mobile devices. It's not flashy. On the other hand, it is the foundation upon which a great deal of incredible innovation will rest in the 2020s, as the Internet of Things and the robotics revolution (I see these as being the same thing) comes into its own.
The key development of the next few years will be the gradual shift within enterprises and institutions away from isolated data silos toward open data platforms. Information sharing becomes easier not just at the person-to-person level but also at the machine-to-machine level, as layers of semantic standards, best practices, innovative data technologies and organizational changes transform the role of data in our lives.
