Feature engineering is often complex and time-intensive. A subset of data preparation for machine learning workflows within data engineering, feature engineering is the process of using domain knowledge to transform data into features that ML algorithms can understand. Regardless of how much algorithms improve, feature engineering remains a difficult process that requires human intelligence and domain expertise. In the end, the quality of feature engineering often drives the quality of a machine learning model.
Assume two input variables, customers’ salaries and ages, and a target variable, the likelihood of purchasing a product. In this dataset, salaries range between $30,000 and $200,000, and ages are between 10 and 90. You understand that an age difference of 20 years is more significant than a salary difference of $20, but to an algorithm, they are just numbers it needs to fit a curve through. To enable computers to produce useful outcomes, you need to account for these differences in scale. In this case, you could do so by scaling (e.g., mapping both salary and age to a range between 0.0 and 1.0). You could also use binning (also known as bucketing) to place values into one of a fixed number of value ranges (e.g., salary bands numbered 0 to 6).
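As a minimal sketch of the two techniques above, the following Python snippet applies min-max scaling and equal-width binning to a small, hypothetical sample of salaries and ages (the values and band count are illustrative, not from any real dataset):

```python
import numpy as np

# Hypothetical sample data: annual salaries (USD) and customer ages
salaries = np.array([30_000, 85_000, 120_000, 200_000], dtype=float)
ages = np.array([10, 35, 52, 90], dtype=float)

def min_max_scale(values: np.ndarray) -> np.ndarray:
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    return (values - values.min()) / (values.max() - values.min())

scaled_salary = min_max_scale(salaries)  # $30,000 -> 0.0, $200,000 -> 1.0
scaled_age = min_max_scale(ages)         # age 10 -> 0.0, age 90 -> 1.0

# Binning (bucketing): map each salary into one of 7 equal-width bands, 0 to 6
edges = np.linspace(30_000, 200_000, num=8)      # 8 edges define 7 bands
salary_band = np.digitize(salaries, edges[1:-1])  # band index in 0..6
```

After scaling, a 20-year age gap and a $20 salary gap are no longer numerically identical to the model, and the salary bands give a coarse categorical view of income.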
Another example of a feature might be a customer score (derived from raw data) for a churn model or a calculated variable called “length of time as a customer.” These may be based on structured data. Similarly, user activity may arrive as semi-structured data that can define a calculated feature such as “is active in last month.” And to take advantage of every data type, unstructured data needs to be processed, normalized, and converted into numeric values that a machine learning algorithm can understand.
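A sketch of deriving an “is active in last month” feature from semi-structured data might look like the following. The JSON event records, field names, and 30-day window are all illustrative assumptions:

```python
import json
from datetime import datetime, timedelta, timezone

# Hypothetical semi-structured event log: one JSON record per user action
raw_events = [
    '{"user_id": 1, "event": "login", "ts": "2024-05-28T10:15:00+00:00"}',
    '{"user_id": 2, "event": "login", "ts": "2024-03-02T08:00:00+00:00"}',
]

def is_active_in_last_month(events, user_id, now):
    """Derive a binary feature: 1 if the user has any event in the last 30 days."""
    cutoff = now - timedelta(days=30)
    for line in events:
        record = json.loads(line)
        if record["user_id"] == user_id and datetime.fromisoformat(record["ts"]) >= cutoff:
            return 1
    return 0

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
features = {uid: is_active_in_last_month(raw_events, uid, now) for uid in (1, 2)}
```

The raw timestamps never reach the model; only the derived 0/1 flag does, which is the essence of turning semi-structured activity data into a feature.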
Data scientists spend a lot of time on feature engineering, constructing new derived attributes that can better represent the problem being solved. Domain experience is often required, as is an understanding of the parameter requirements for each model. If data scientists experience compute bottlenecks, this iterative process is elongated, costing valuable time and resources.
A significant trend in machine learning products is support for automated feature engineering, in which tooling can automatically develop features. Some tools address tasks such as imputing missing values for specific algorithms, calculating certain functions (mean, etc.), or calculating ratios. Some are more sophisticated. Some only work on relational data. Some are best for image data.
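Two of the simpler automated tasks mentioned above, mean imputation and ratio features, can be sketched as follows. The toy table and column names are hypothetical, and real AutoML tools apply many such transformations systematically:

```python
# Hypothetical toy table: rows of {column: value}, with missing values as None
rows = [
    {"revenue": 100.0, "cost": 40.0},
    {"revenue": None,  "cost": 60.0},
    {"revenue": 200.0, "cost": 50.0},
]

def impute_mean(rows, column):
    """Fill missing values in a column with the column mean (a common automated step)."""
    present = [r[column] for r in rows if r[column] is not None]
    mean = sum(present) / len(present)
    for r in rows:
        if r[column] is None:
            r[column] = mean
    return rows

def add_ratio(rows, numerator, denominator, name):
    """Derive a new ratio feature from two existing columns."""
    for r in rows:
        r[name] = r[numerator] / r[denominator]
    return rows

rows = impute_mean(rows, "revenue")  # the missing revenue becomes 150.0
rows = add_ratio(rows, "revenue", "cost", "revenue_to_cost")
```

An automated tool would generate many candidate features like `revenue_to_cost`; as the next paragraph notes, a domain expert still decides which of them carry predictive value.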
Although automation does not replace the data scientist, it can assist and cut down on time spent developing new features. The domain expert can review the features provided by the tool and select the features that may provide predictive value. Given the productivity benefits, many organizations are adopting a hybrid approach that utilizes manual and automated feature engineering.
Data scientists need powerful compute resources for feature engineering. Many tools available today for data preparation and feature engineering are either highly inefficient or overly complicated to operate, resulting in brittle, expensive, and time-consuming data pipelines. But with Snowflake, data engineers and data scientists can perform feature engineering on large, petabyte-scale datasets without the need for sampling. Transforming data with SQL makes feature engineering accessible to a broader audience of data workers and can result in speed and efficiency boosts.
Snowflake’s architecture dedicates compute clusters to each workload and team, ensuring there is no resource contention between data engineering, business intelligence, and data science workloads. Snowflake allows teams to extract and transform data into rich features with the reliability and performance of ANSI SQL and the efficiency of the functional programming and data frame constructs supported in Java and Python.
The Snowflake platform and broader partner ecosystem can boost automated machine learning (AutoML) speeds by pushing the feature engineering process down into the Snowflake Data Cloud. Snowflake also enables manual feature engineering with Python, Apache Spark, and ODBC/JDBC connectors. To learn more about the Snowflake Data Cloud, visit our page.