- Basic statistics: the mean, standard deviation, minimum and maximum of numerical attributes.
- Distribution: what is the distribution of the values of an attribute?
- Uniqueness: how many unique values does an attribute have? Does an attribute that is supposed to be a unique key contain only unique values?
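
The three kinds of profile information listed above can be computed with a few lines of code. The following is only a minimal sketch using the Python standard library; the function names and the sample attributes (`ages`, `customer_ids`) are hypothetical and are not part of the Social-3 Personal Data Framework.

```python
import statistics
from collections import Counter


def profile_numeric(values):
    """Basic statistics for a numerical attribute."""
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def value_distribution(values):
    """Distribution: how often each distinct value occurs."""
    return Counter(values)


def uniqueness(values):
    """Uniqueness: distinct count, and whether every value occurs only once."""
    distinct = len(set(values))
    return {"distinct": distinct, "is_unique": distinct == len(values)}


# Hypothetical sample attributes, for illustration only.
ages = [23, 35, 35, 41, 52]
customer_ids = ["c1", "c2", "c3", "c2", "c5"]  # contains a duplicate

print(profile_numeric(ages))
print(value_distribution(ages).most_common(1))
print(uniqueness(customer_ids))
```

An attribute that is supposed to be a unique key should report `is_unique` as true; in this sample the duplicated `"c2"` makes the check fail.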
The Social-3 Personal Data Framework contains a data catalogue that allows data consumers to select interesting datasets and put them in a “shopping basket” to indicate which datasets they want to use and how they want to use them.

Before using a dataset with any algorithm, it is essential to understand what the data looks like, what the edge cases are, and how each attribute is distributed. The questions to be answered concern the distribution of the attributes (the columns of the table) and the completeness of the data, that is, how much of it is missing.

The findings of EDA can be translated, in a subsequent cleansing step, into constraints or rules that are then enforced. For instance, after discovering that the most frequent pattern for phone numbers is (ddd)ddd-dddd, this pattern can be promoted to the rule that all phone numbers must be formatted accordingly.
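
The phone number example can be sketched in code: each value is abstracted into a pattern, the most frequent pattern is identified, and that pattern is then turned into a rule that flags non-conforming values. This is an illustrative sketch only; the helper names and the sample numbers are hypothetical, and a real cleansing step would likely need a richer pattern language.

```python
import re
from collections import Counter


def to_pattern(value):
    """Abstract a value into a pattern: every digit becomes 'd'."""
    # Simplification: a literal letter 'd' in the input would be
    # indistinguishable from a digit placeholder.
    return re.sub(r"\d", "d", value)


def most_frequent_pattern(values):
    """Return the dominant pattern and the share of values that follow it."""
    counts = Counter(to_pattern(v) for v in values)
    pattern, hits = counts.most_common(1)[0]
    return pattern, hits / len(values)


def pattern_to_regex(pattern):
    """Promote a discovered pattern to an enforceable rule (a compiled regex)."""
    # Each 'd' becomes a digit class; every other character is matched literally.
    parts = (r"\d" if ch == "d" else re.escape(ch) for ch in pattern)
    return re.compile("".join(parts))


# Hypothetical sample data, for illustration only.
phones = ["(555)123-4567", "(555)987-6543", "555-222-3333", "(555)000-1111"]

pattern, share = most_frequent_pattern(phones)
rule = pattern_to_regex(pattern)
violations = [p for p in phones if not rule.fullmatch(p)]

print(pattern, share)
print(violations)
```

Reporting the share of values that follow the dominant pattern helps decide whether the pattern is common enough to be promoted to a rule at all.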
Data profiling is the process of examining the data available in an existing data source (e.g. a database or a file) and collecting statistics and information about that data. The purpose of these statistics may be to find out whether existing data can easily be used for other purposes.

Before any dataset is used for advanced data analytics, an exploratory data analysis (EDA) or data profiling step is necessary. EDA or data profiling helps assess which data might be useful and reveals the as yet unknown characteristics of a new dataset, including data quality and data transformation requirements, before data analytics is applied.

The Social-3 Personal Data Framework provides metadata and data profiling information for each available dataset. This is an ideal solution for datasets containing personal data, because only aggregated data are shown. Data consumers can browse the datasets available in the data lake of the Social-3 Personal Data Framework, gain insight into them, and make informed decisions on their usage and privacy requirements.

One of the earliest steps after data ingestion is the automated creation of a data profile.
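
A data profile for a file-based source could be produced along the following lines. The sketch reads a CSV source and returns only per-column aggregates (row count, missing count, completeness), so no individual record is exposed. The function name `profile_source` and the sample data are assumptions made for illustration; this is not how the Social-3 Personal Data Framework itself is implemented.

```python
import csv
import io


def profile_source(text_stream):
    """Collect per-column aggregates from a CSV source; raw values are not returned."""
    reader = csv.DictReader(text_stream)
    profile = {name: {"rows": 0, "missing": 0} for name in (reader.fieldnames or [])}
    for row in reader:
        for name, stats in profile.items():
            stats["rows"] += 1
            # Empty or whitespace-only cells (and absent trailing cells) count as missing.
            if not (row[name] or "").strip():
                stats["missing"] += 1
    for stats in profile.values():
        rows = stats["rows"]
        stats["completeness"] = (1 - stats["missing"] / rows) if rows else None
    return profile


# Hypothetical sample source, for illustration only.
sample = "name,email,age\nAnn,ann@example.org,34\nBob,,41\nCy,cy@example.org,\n"

for column, stats in profile_source(io.StringIO(sample)).items():
    print(column, stats)
```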