Data science work is iterative: there’s no smooth, monotonic path from the business problem to the solution and its automation. We believe we have all experienced that.

Once the business pain point has been brought to the surface, two main questions need to be asked:

  • Is my problem one of prediction, inference or something in between? In other words, what level of interpretability does the current task require?
  • What data sources do I have and how can I utilise them to improve the task at hand?

It is often said that 70% of an analytical task (we say it can be much more) is data preparation, and only the rest is the modelling itself. So it pays to focus on preparing the sample and engineering the features properly, so that they both meet the business case requirements and optimise the model.

In this article, the first in our Technology Innovation sub-series, we talk about an interesting and innovative approach to improving your data science exercise. We would like to draw your attention to the feature engineering part and to how we can utilise our feature space for optimal results. Optimal in our case means:

  • Highest predictive power with the proper level of model interpretation;
  • Utilising a variety of features so that an enhanced profile that predicts the target is created;
  • As a side benefit, but not a negligible one: improved future data gathering and, potentially as a consequence, improved customer experience and general state of the data. We will talk about that in a minute.

The topic we have opened is quite broad, we know. Therefore we will focus on one specific example of how to improve your data science exercise while maintaining the logical flow and the business requirements: handling features that contain an excessive* number of attributes/levels, so that these levels are reduced to a comprehensible number. The structure of the approach is presented graphically below, followed by an elaboration on each item:

Note: any technical term or library mentioned refers to the R environment, the statistical software we used for this specific exercise.

1)     A (multilevel) feature of interest: ‘occupation’ (as in job position)

We start with more than 2300 levels that we would like to make sense of.

2)     Text Analytics

Leverage text analytics to create a corpus of terms that is non-sparse, non-redundant (stemming and similarity matching were used) and consistent (accounting for typos, punctuation, extra blanks, lower/upper case, etc.).

See the graph on the right for an example of going from levels to a concise corpus of terms.
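To make the clean-up step concrete, here is a minimal sketch in Python (not the R code used for the original exercise). The stop-word list, the suffix rules and the sample occupation values are all assumptions made up for illustration, and the crude suffix stripping only stands in for a proper stemmer. Typo correction is left out.

```python
import re
from collections import Counter

# Hypothetical stop words and suffix rules, chosen only for this illustration.
STOP_WORDS = {"of", "and", "the", "for", "in"}
SUFFIXES = ("ing", "ers", "er", "s")


def crude_stem(word):
    """Strip one common suffix; a stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def level_to_terms(level):
    """Lower-case, drop punctuation and extra blanks, remove stop words, stem."""
    cleaned = re.sub(r"[^a-z\s]", " ", level.lower())
    return [crude_stem(w) for w in cleaned.split() if w not in STOP_WORDS]


# Made-up raw levels of the 'occupation' feature.
levels = [
    "Senior  Teacher",
    "teachers",
    "TEACHER (primary)",
    "Bank Manager",
    "manager, banking",
]

# The corpus: how often each normalised term occurs across all levels.
corpus = Counter(term for level in levels for term in level_to_terms(level))
print(corpus.most_common())
```

Five differently written levels collapse into five terms here, with the variants of "teacher" counted under a single stem.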

3)     Graph analysis data transformation

Using iGraph, transform the data based on term frequency; the result is a similarity matrix, which we feed into the hierarchical clustering exercise that follows.
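One way to picture such a similarity matrix is the Python sketch below. It does not use iGraph and is not the transformation from the original exercise. As a simplifying assumption, it scores two terms by the cosine similarity of their presence/absence across levels, so terms that tend to appear in the same occupation values score high. The sample data is made up.

```python
import math
from itertools import combinations


def term_similarity(documents):
    """Cosine similarity between terms, based on the levels they co-occur in."""
    terms = sorted({t for doc in documents for t in doc})
    # For each term, the set of level indices in which it appears.
    occurs = {t: {i for i, doc in enumerate(documents) if t in doc} for t in terms}
    sim = {t: {u: 0.0 for u in terms} for t in terms}
    for t in terms:
        sim[t][t] = 1.0
    for t, u in combinations(terms, 2):
        shared = len(occurs[t] & occurs[u])
        score = shared / math.sqrt(len(occurs[t]) * len(occurs[u]))
        sim[t][u] = sim[u][t] = score
    return terms, sim


# Made-up levels, already reduced to normalised terms.
documents = [
    ["senior", "teach"],
    ["teach"],
    ["teach", "primary"],
    ["bank", "manag"],
    ["manag", "bank"],
]

terms, sim = term_similarity(documents)
for t in terms:
    print(t, [round(sim[t][u], 2) for u in terms])
```

The matrix is symmetric with ones on the diagonal, which makes it easy to turn into a distance (one minus similarity) for clustering.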

4)     Hierarchical clustering – pre-final grouping

Apply a machine learning clustering approach that yields a manageable number of grouped terms. In our scenario we chose 15, but there is flexibility here depending on the business case and the data scientist’s understanding of the data. Please see below for the clustering dendrogram:
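The grouping idea can be sketched in a few lines of Python. This is a naive average-linkage agglomerative clustering written for readability, not the R routine used in the exercise. The linkage method, the toy similarity values and the choice of two clusters (instead of 15) are all assumptions for illustration.

```python
def agglomerate(items, distance, k):
    """Average-linkage agglomerative clustering, merged until k clusters remain."""
    clusters = [[item] for item in items]

    def linkage(a, b):
        # Average pairwise distance between the members of two clusters.
        return sum(distance(x, y) for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge the closest pair and drop the absorbed cluster.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy similarity values (made up); unlisted pairs count as similarity 0.
toy_sim = {
    frozenset(["teach", "primary"]): 0.6,
    frozenset(["teach", "senior"]): 0.6,
    frozenset(["bank", "manag"]): 1.0,
}


def distance(a, b):
    """Turn a similarity into a distance."""
    return 1.0 - toy_sim.get(frozenset([a, b]), 0.0)


terms = ["bank", "manag", "primary", "senior", "teach"]
clusters = agglomerate(terms, distance, k=2)
print(sorted(sorted(c) for c in clusters))
```

The loop is cubic in the number of terms, so for a real corpus a library implementation would be the sensible choice. The value of k plays the role of the cut height on the dendrogram.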

5)     Neo4j – final grouping and representation of results

Present the result of the clustering in an interpretable, fast and interactive manner so that the analyst can validate the business logic and make corrections wherever necessary. Because we can easily see and analyse the relationships between the different terms (please see the picture below representing cluster 1 within Neo4j), we can validate our clustering and, whenever necessary, make manual one-off amendments to fine-tune the feature.
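As a rough illustration of how clustered terms might be loaded into Neo4j, the Python sketch below generates Cypher MERGE statements that could be pasted into the Neo4j browser. The node labels `Cluster` and `Term`, the relationship type `BELONGS_TO` and the property names are hypothetical. The article does not describe the graph model that was actually used.

```python
def cluster_to_cypher(cluster_id, terms):
    """Build one Cypher MERGE statement per term of a cluster.

    The labels Cluster/Term and the relationship BELONGS_TO are
    hypothetical names chosen for this illustration.
    """
    statements = []
    for term in terms:
        # Escape backslashes and single quotes for a Cypher string literal.
        safe = term.replace("\\", "\\\\").replace("'", "\\'")
        statements.append(
            f"MERGE (c:Cluster {{id: {int(cluster_id)}}}) "
            f"MERGE (t:Term {{name: '{safe}'}}) "
            f"MERGE (t)-[:BELONGS_TO]->(c);"
        )
    return statements


# Made-up content for cluster 1.
statements = cluster_to_cypher(1, ["teach", "primary", "senior"])
for statement in statements:
    print(statement)
```

MERGE creates a node or relationship only if it does not exist yet, so re-running the statements after a manual amendment does not duplicate anything.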

. . .

Now that we have the clusters and have applied some interactive exploratory analysis techniques to them, we apply the clusters to the original data and thereby create a new, concise feature that we can use in a modelling exercise.
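Mapping the clusters back onto the original levels could look like the Python sketch below. The majority-vote rule, the fallback value "other", the cluster names and the sample records are assumptions for illustration, not the rule used in the original exercise.

```python
from collections import Counter


def assign_cluster(level_terms, term_to_cluster, default="other"):
    """Pick the cluster that most of a level's terms belong to.

    Ties go to the cluster encountered first; levels with no known
    term fall back to the default value.
    """
    votes = Counter(term_to_cluster[t] for t in level_terms if t in term_to_cluster)
    if not votes:
        return default
    return votes.most_common(1)[0][0]


# Hypothetical term-to-cluster mapping with meaningful cluster names.
term_to_cluster = {
    "teach": "education",
    "primary": "education",
    "bank": "finance",
    "manag": "finance",
}

# Made-up original levels with their normalised terms.
records = {
    "TEACHER (primary)": ["teach", "primary"],
    "Bank Manager": ["bank", "manag"],
    "Astronaut": ["astronaut"],
}

new_feature = {
    level: assign_cluster(terms, term_to_cluster) for level, terms in records.items()
}
print(new_feature)
```

The resulting column has a handful of values instead of thousands, and the fallback bucket shows which levels still need a manual look.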

We advise giving these clusters meaningful, generalising names whenever possible, e.g. public sector, manufacturing, etc.

Remember the side benefit we talked about above? We can use the newly developed cluster names as a data-validation input when our customers fill in their information online or on paper, rather than giving them a blank space in which to enter any possible value. Thus, we will:

  • Improve customer experience by giving customers a shortlist of meaningful options to choose from;
  • Improve internal data quality through the data-validation procedure;
  • Eventually boost the model’s predictive power by utilising a stronger, more reliable feature that enhances the overall customer profile.

To summarise what we have achieved:

  • Utilised a hard-to-interpret multi-level feature and boosted our own knowledge of it in the process;
  • Potentially enhanced our model performance with an improved profile that uses an additional feature;
  • Created a basis for a data-validation process when collecting data, and improved customer experience in that same process.

Not bad, eh? 🙂

____________ ______________ ______________ ______________

*Excessive here means a number of levels that makes the feature strenuous and time-consuming for the analyst to handle, both technically and business-wise.

____________ ______________ ______________ ______________

With this series, our aim is to increase data science coverage and to make data-driven decision-making an integral part of more and more companies around the globe.