A Columnar and Analytic Platform for Data Mining?
As we generate and collect massive amounts of data at ever-increasing rates, storing, mining, and analyzing Big Data has become the topic du jour. Interest abounds in every sector, from research institutes and universities to industrial companies across the globe. As expected, many Big Data solutions are available. Still, the question remains, “Which solution best solves my problems?”
This talk focuses on the HP Vertica Columnar and Analytic Database, a modern commercial RDBMS. While presenting a classical relational interface, HP Vertica simultaneously achieves the high performance expected from modern “web scale” analytic systems by making appropriate architectural choices.
The discussion begins with a high-level description of the HP Vertica columnar, massively parallel processing (MPP) RDBMS. The benefits of encoding schemes for sorted and distributed columns of data explain why the HP Vertica database is such an outstanding solution for analytic queries. In addition, HP Vertica includes a built-in Database Designer (DBD) that customizes designs optimized for various scenarios and applications. For a given workload and space budget, the DBD automatically recommends a physical design that balances query performance, storage footprint, fault tolerance, and recovery to meet different customer requirements.
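To see why encoding sorted columns pays off, here is a minimal sketch of run-length encoding (RLE), one of the classic schemes used by columnar databases. This is an illustrative toy, not HP Vertica's actual implementation; the function names are hypothetical.

```python
# Illustrative sketch of run-length encoding (RLE) for a sorted column.
# Not HP Vertica's implementation -- just the underlying idea: runs of
# repeated values collapse to (value, count) pairs, and some queries can
# be answered directly on the compressed form.

def rle_encode(column):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [(v, n) for v, n in runs]

def rle_count(encoded, predicate):
    """Answer COUNT(*) WHERE predicate(value) without decompressing."""
    return sum(n for v, n in encoded if predicate(v))

# A sorted column compresses to one pair per distinct value.
column = ["MA"] * 4 + ["NH"] * 2 + ["VT"] * 3
encoded = rle_encode(column)
assert encoded == [("MA", 4), ("NH", 2), ("VT", 3)]
assert rle_count(encoded, lambda v: v == "VT") == 3
```

Because sorting maximizes run lengths, a sorted column of millions of rows with few distinct values shrinks to a handful of pairs, and scans touch only the compressed pairs rather than every row.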
Besides many analytic-specific SQL functions, such as time series and pattern matching, HP Vertica supports user-defined functions. Users develop the functions they need in C++, R, or Java, package them as external shared libraries, and then load those libraries into HP Vertica to work in tandem with the native functions. User-defined functions are an ideal solution whenever customers have analytic operations that:
- Are difficult to perform in SQL
- Must be performed frequently enough that speed is a major concern
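To make the idea concrete, here is a minimal analogue of a user-defined scalar function using SQLite's Python API rather than the HP Vertica UDx SDK: custom logic that would be awkward in plain SQL is registered once, then called from SQL alongside native functions. The table and function names are invented for the example.

```python
import math
import sqlite3

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km -- awkward to express in plain SQL."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = (math.sin(dp / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

conn = sqlite3.connect(":memory:")
# Register the Python function so SQL can call it like a built-in.
conn.create_function("haversine_km", 4, haversine_km)
conn.execute("CREATE TABLE cities (name TEXT, lat REAL, lon REAL)")
conn.executemany("INSERT INTO cities VALUES (?, ?, ?)",
                 [("Boston", 42.36, -71.06), ("New York", 40.71, -74.01)])
# The UDF composes with native SQL functions (here, ROUND).
row = conn.execute(
    "SELECT ROUND(haversine_km(a.lat, a.lon, b.lat, b.lon)) "
    "FROM cities a, cities b "
    "WHERE a.name = 'Boston' AND b.name = 'New York'"
).fetchone()
```

The same pattern, scaled up, is what the HP Vertica UDx mechanism provides: the custom code runs inside the engine, next to the data, instead of shipping rows out to an external program.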
HP Vertica Distributed R is the HP Vertica integrated, scalable, and high-performance open platform for the R language. This platform:
- Lets business analysts gain insights from predictive analytic results using familiar BI tools
- Helps DBAs avoid managing data outside of the database
- Gives data scientists the ability to achieve scalability and performance without losing the R interactive tools and visualizations
While fundamentally a traditional RDBMS, HP Vertica can leverage its column-based, structured data-processing machinery to store and query semi-structured data in flexible tables. Flexible tables offer the smooth data loading and exploration flexibility of NoSQL solutions, while maintaining a unified SQL interface over structured and semi-structured data. An automated optimization mechanism mitigates or eliminates the performance degradation associated with flexible tables, while keeping the SQL query interface identical and retaining the tables’ flexible nature. Flexible tables dramatically improve usability. How? By decreasing time-to-insight through the removal of upfront schema costs, while providing full SQL compatibility to support standard visualization and reporting tools.
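The schema-on-read idea behind flexible tables can be sketched in a few lines. This toy is not HP Vertica's implementation; it only shows the principle that semi-structured records load without an upfront schema, and a column set is derived afterward so ordinary SQL-style projection still works, with absent keys surfacing as NULLs.

```python
import json

# Toy sketch of the "flexible table" idea: JSON records of varying shape
# load as-is (no CREATE TABLE schema), keys observed in the data become
# queryable columns, and missing keys project as None (SQL NULL).

raw = [
    '{"user": "ann", "clicks": 3}',
    '{"user": "bob", "clicks": 5, "referrer": "ad"}',
    '{"user": "cat"}',
]

rows = [json.loads(r) for r in raw]                  # schema-on-read load
columns = sorted({k for row in rows for k in row})   # keys -> virtual columns

def select(rows, cols):
    """Project the requested columns, padding absent keys with None."""
    return [tuple(row.get(c) for c in cols) for row in rows]

assert columns == ["clicks", "referrer", "user"]
assert select(rows, ["user", "referrer"]) == [
    ("ann", None), ("bob", "ad"), ("cat", None)]
```

In a real flexible-table system the frequently queried keys would additionally be promoted to materialized, encoded columns, which is what the automated optimization mechanism mentioned above addresses.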
From these insights into HP Vertica techniques and technology, this talk addresses the question “A Columnar and Analytic Platform for Data Mining?”
Ms. Nga Tran is the manager of the Query Optimizer and Database Designer team at HP Vertica. She was one of the original participants in the column-store project, a collaboration between MIT, Brandeis, Brown, and UMass Boston, led by Professor Michael Stonebraker. Dr. Stonebraker is a co-founder of HP Vertica, which commercialized an academic prototype of a column-store database. Ms. Tran has worked at HP Vertica since the company’s early start-up days.
Ms. Tran earned her Bachelor’s degree from the Polytechnic University (Đại Học Bách Khoa), Saigon, Vietnam. She graduated summa cum laude with a Computer Engineering degree. She then worked as a software engineer for the Swiss software company ELCA (previously Electro-Calcul). Ms. Tran spent her first year at ELCA in Lausanne, Switzerland at company headquarters, and the next three years at ELCA’s office in Vietnam.
Ms. Tran left ELCA to pursue her Master’s degree at the University of New South Wales in Sydney, Australia. She earned her Master’s degree in Information Science, with a major in Database studies. Ms. Tran then worked towards her PhD at Brandeis University in Waltham, Massachusetts. Later, as part of an HP Vertica team, Ms. Tran applied her extensive query optimization research to build the third, and current, generation of the HP Vertica Query Optimizer. Ms. Tran has published six papers on column-store databases, the HP Vertica Analytic Database, and query optimization. She also holds two US patents in database query optimizer technology.
Insight gaining from OLAP queries via data movies
Can we answer user queries with data movies? Why should query results be treated simply as sets of tuples returned by the DBMS, as if they were to be visualized on an orange CRT of the ’70s? So far, database systems assume their work is done once results are produced, effectively preventing even well-educated end users from working further with them. Can we do something better?
In this talk, we will discuss how we can revise traditional assumptions of query answering in favor of insight gaining. We believe that insight gaining can be based on two pillars:
- Bringing out query results by making them:
  - properly visualized,
  - textually exploitable (i.e., enriched with automatically extracted text that comments on the result), and
  - vocally enriched (i.e., accompanied by audio that allows the user not only to see the result, but also to hear it)
- Accompanying a query result with the results of complementary queries that allow the user to contextualize and analyze the information content of the original query.
Interestingly, an insightful sequence of related queries that provide context and depth to the original query, “dressed” with the appropriate visualization and sound, turns out to be nothing other than a data movie in which cubes star. We will also discuss how our CineCubes system addresses the aforementioned challenges.
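The second pillar, deriving complementary queries, can be illustrated mechanically. The sketch below is a toy, not the CineCubes system itself: given one OLAP query with selection predicates, it relaxes each predicate in turn so the original answer can be compared against its sibling values. The `sales` table, measure, and dimension names are invented for the example.

```python
# Toy illustration of complementary-query generation (not CineCubes):
# for each selection predicate of the original query, emit a query that
# groups by that dimension instead of filtering on it, so the user can
# compare the original answer against its siblings.

def cube_query(measure, group_by, where):
    sel = ", ".join(group_by)
    cond = " AND ".join(f"{d} = '{v}'" for d, v in where.items())
    return (f"SELECT {sel}, AVG({measure}) FROM sales"
            + (f" WHERE {cond}" if cond else "")
            + f" GROUP BY {sel}")

def put_in_context(measure, group_by, where):
    """One complementary query per selection: drop the predicate on a
    dimension and group by it, comparing against all sibling values."""
    acts = []
    for dim in where:
        relaxed = {d: v for d, v in where.items() if d != dim}
        acts.append((f"compare against all values of {dim}",
                     cube_query(measure, group_by + [dim], relaxed)))
    return acts

original = cube_query("price", ["month"], {"region": "Europe"})
context = put_in_context("price", ["month"], {"region": "Europe"})
for title, sql in context:
    print(title)
    print(sql)
```

Each derived query becomes, in effect, one more “scene”: the same measure, viewed from a neighboring angle, which is exactly the kind of sequence the data-movie metaphor strings together.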
I was born in Athens in 1972. I graduated from the Varvakio Experimental School in 1990 and obtained my Diploma in Electrical Engineering and my PhD from the Department of Electrical and Computer Engineering of the National Technical University of Athens (NTUA) in 1995 and 2000, respectively. I joined the Department of Computer Science of the University of Ioannina in 2002, and since then I have been a member of the Distributed Management of Data (DMOD) Laboratory.
So far, my research has focused on data warehouse technology, with particular interest in issues such as data warehouse metadata repositories and metadata modeling, data warehouse quality, On-Line Analytical Processing (OLAP), and Extraction-Transformation-Loading (ETL). Currently, my ongoing research also targets metadata-rich, data-centric information systems, with particular emphasis on the modeling, pattern-based design, and evolution of their underlying database infrastructure, as well as on Web services, with particular emphasis on SOA maintenance.