

Build a Decision Tree in Polars from Scratch

Decision Tree algorithms have always fascinated me. They are easy to implement and achieve good results on various classification and regression tasks. Combined with boosting, decision trees are still state-of-the-art in many applications.

Frameworks such as sklearn, LightGBM, XGBoost, and CatBoost have done a very good job so far. However, in the past few months, I have been missing support for Arrow datasets. While LightGBM recently added support for it, it is still missing in most other frameworks. The Arrow data format could be a perfect match for decision trees since its columnar structure is optimized for efficient data processing. Pandas has already added support for it, and polars takes advantage of it as well.

Polars has shown significant performance advantages over most other data frameworks. It uses data efficiently and avoids unnecessary copies. It also provides a streaming engine that allows processing datasets larger than memory. This is why I decided to use polars as the backend for building a decision tree from scratch.

The goal is to explore the advantages of using polars for decision trees in terms of memory and runtime. And, of course, learning more about polars, efficiently defining expressions, and the streaming engine.

The code for the implementation can be found in this repository.

To get a first overview of the code, I will show the structure of the DecisionTreeClassifier first:

The first thing to note is the imports. It was important to me to keep the import section clean and the number of dependencies as small as possible, and the only dependencies are polars, pickle, and typing.

The init method allows defining whether the polars streaming engine should be used. The max_depth of the tree can also be set here. Another feature is the definition of categorical columns. These are handled differently from numerical features, using a target encoding.

It is possible to save and load the decision tree model. It is represented as a nested dict and can be saved to disk as a pickled file.
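As a rough sketch of what those two methods can look like (the attribute name self.tree for the nested dict is my assumption, not confirmed by the repository):

import pickle

class _Sketch:  # methods as they could appear on DecisionTreeClassifier
    def save_model(self, path: str) -> None:
        # Persist the nested-dict representation of the tree to disk.
        with open(path, "wb") as f:
            pickle.dump(self.tree, f)  # self.tree is assumed to hold the nested dict

    def load_model(self, path: str) -> None:
        # Restore a previously pickled tree.
        with open(path, "rb") as f:
            self.tree = pickle.load(f)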

The polars magic happens in the fit() and _build_tree() methods. These accept both LazyFrames and DataFrames to support in-memory processing as well as streaming.

There are two prediction methods available, predict() and predict_many() . The predict() method can be used on a small number of examples, and the data needs to be provided as dicts. If we have a big test set, it is more efficient to use the predict_many() method. Here, the data can be provided as a polars DataFrame or LazyFrame.

import pickle
from typing import Iterable, List, Union

import polars as pl


class DecisionTreeClassifier:
    def __init__(self, streaming=False, max_depth=None, categorical_columns=None):
        ...

    def save_model(self, path: str) -> None:
        ...

    def load_model(self, path: str) -> None:
        ...

    def apply_categorical_mappings(self, data: Union[pl.DataFrame, pl.LazyFrame]) -> Union[pl.DataFrame, pl.LazyFrame]:
        ...

    def fit(self, data: Union[pl.DataFrame, pl.LazyFrame], target_name: str) -> None:
        ...

    def predict_many(self, data: Union[pl.DataFrame, pl.LazyFrame]) -> List[Union[int, float]]:
        ...

    def predict(self, data: Iterable[dict]):
        ...

    def get_majority_class(self, df: Union[pl.DataFrame, pl.LazyFrame], target_name: str) -> str:
        ...

    def _build_tree(
        self,
        data: Union[pl.DataFrame, pl.LazyFrame],
        feature_names: list[str],
        target_name: str,
        unique_targets: list[int],
        depth: int,
    ) -> dict:
        ...

To train the decision tree classifier, the fit() method needs to be used.

def fit(self, data: Union[pl.DataFrame, pl.LazyFrame], target_name: str) -> None:
    """
    Fit method to train the decision tree.

    :param data: Polars DataFrame or LazyFrame containing the training data.
    :param target_name: Name of the target column
    """
    columns = data.collect_schema().names()
    feature_names = [col for col in columns if col != target_name]

    # Shrink dtypes
    data = data.select(pl.all().shrink_dtype()).with_columns(
        pl.col(target_name).cast(pl.UInt64).shrink_dtype()
    )

    # Prepare categorical columns with target encoding
    if self.categorical_columns:
        categorical_mappings = {}
        for categorical_column in self.categorical_columns:
            categorical_mappings[categorical_column] = {
                value: index
                for index, value in enumerate(
                    data.lazy()
                    .group_by(categorical_column)
                    .agg(pl.col(target_name).mean().alias("avg"))
                    .sort("avg")
                    .collect(streaming=self.streaming)[categorical_column]
                )
            }
        self.categorical_mappings = categorical_mappings
        data = self.apply_categorical_mappings(data)

    unique_targets = data.select(target_name).unique()
    if isinstance(unique_targets, pl.LazyFrame):
        unique_targets = unique_targets.collect(streaming=self.streaming)
    unique_targets = unique_targets[target_name].to_list()

    self.tree = self._build_tree(data, feature_names, target_name, unique_targets, depth=0)

It receives a polars LazyFrame or DataFrame that contains all features and the target column. To identify the target column, the target_name needs to be provided.

Polars provides a convenient way to optimize the memory usage of the data.

With that, all columns are selected and evaluated. Each column is converted to the smallest dtype that can still hold its values.
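As a small illustration (my example, not from the repository), shrink_dtype() picks the narrowest dtype that still fits the values:

import polars as pl

df = pl.DataFrame({"age_years": [29, 41, 64], "cardio": [0, 1, 1]})
print(df.dtypes)                                  # [Int64, Int64]
print(df.select(pl.all().shrink_dtype()).dtypes)  # typically [Int8, Int8] for these small values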

To encode categorical values, a target encoding is used. For that, all instances of a categorical feature will be aggregated, and the average target value will be calculated. Then, the instances are sorted by the average target value, and a rank is assigned. This rank will be used as the representation of the feature value.

(
    data.lazy()
    .group_by(categorical_column)
    .agg(pl.col(target_name).mean().alias("avg"))
    .sort("avg")
    .collect(streaming=self.streaming)[categorical_column]
)

Since it is possible to provide polars DataFrames and LazyFrames, I call data.lazy() first. If the given data is a DataFrame, it will be converted to a LazyFrame. If it is already a LazyFrame, it simply returns itself. With that trick, it is possible to ensure that the data is processed in the same way for LazyFrames and DataFrames and that the collect() method can be used, which is only available for LazyFrames.
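A tiny illustration of that trick (toy data, mine): .lazy() converts a DataFrame and is essentially a no-op on a LazyFrame, so both input types end up on the same code path:

import polars as pl

df = pl.DataFrame({"gluc": [1, 2, 3], "cardio": [0, 0, 1]})
lf = df.lazy()

# Both calls yield a LazyFrame, so .collect() is always available afterwards.
print(type(df.lazy()))  # LazyFrame
print(type(lf.lazy()))  # LazyFrame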

To illustrate the outcome of the calculations in the different steps of the fitting process, I apply it to a dataset for heart disease prediction, which can be found on Kaggle.

Here is an example of the categorical feature representation for the glucose levels:

┌──────┬──────┬───────────┐
│ rank ┆ gluc ┆ avg       │
│ ---  ┆ ---  ┆ ---       │
│ u32  ┆ i8   ┆ f64       │
╞══════╪══════╪═══════════╡
│ 0    ┆ 1    ┆ [website] │
│ 1    ┆ 2    ┆ [website] │
│ 2    ┆ 3    ┆ [website] │
└──────┴──────┴───────────┘

For each of the glucose levels, the probability of having a heart disease is calculated. This is sorted and then ranked so that each of the levels is mapped to a rank value.
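To make that concrete, here is a small standalone sketch of how such a rank mapping can be built and applied with replace() (toy data; not the repository code):

import polars as pl

df = pl.DataFrame({"gluc": [1, 1, 2, 3, 3], "cardio": [0, 1, 0, 1, 1]})

# Rank the categories by their average target value.
order = (
    df.group_by("gluc")
    .agg(pl.col("cardio").mean().alias("avg"))
    .sort("avg")["gluc"]
)
mapping = {value: rank for rank, value in enumerate(order)}

# Replace each category with its rank.
print(df.with_columns(pl.col("gluc").replace(mapping)))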

As the last part of the fit() method, the unique target values are determined.

unique_targets = data.select(target_name).unique()
if isinstance(unique_targets, pl.LazyFrame):
    unique_targets = unique_targets.collect(streaming=self.streaming)
unique_targets = unique_targets[target_name].to_list()

self.tree = self._build_tree(data, feature_names, target_name, unique_targets, depth=0)

This serves as the last preparation before calling the _build_tree() method recursively.

After the data is prepared in the fit() method, the _build_tree() method is called. This is done recursively until a stopping criterion is met, e.g., the max depth of the tree is reached. The first call is executed from the fit() method with a depth of zero.

def _build_tree(
    self,
    data: Union[pl.DataFrame, pl.LazyFrame],
    feature_names: list[str],
    target_name: str,
    unique_targets: list[int],
    depth: int,
) -> dict:
    """
    Builds the decision tree recursively.
    If max_depth is reached, returns a leaf node with the majority class.
    Otherwise, finds the best split and creates internal nodes for left and right children.

    :param data: The dataframe to evaluate.
    :param feature_names: Name of the feature columns.
    :param target_name: Name of the target column.
    :param unique_targets: unique target values.
    :param depth: The current depth of the tree.
    :return: A dictionary representing the node.
    """
    if self.max_depth is not None and depth >= self.max_depth:
        return {"type": "leaf", "value": self.get_majority_class(data, target_name)}

    # Make data lazy here to avoid evaluating it in each loop iteration.
    data = data.lazy()

    # Evaluate entropy per feature:
    information_gain_dfs = []
    for feature_name in feature_names:
        feature_data = data.select([feature_name, target_name]).filter(pl.col(feature_name).is_not_null())
        feature_data = feature_data.rename({feature_name: "feature_value"})

        # No streaming (yet)
        information_gain_df = (
            feature_data.group_by("feature_value")
            .agg(
                [
                    pl.col(target_name)
                    .filter(pl.col(target_name) == target_value)
                    .len()
                    .alias(f"class_{target_value}_count")
                    for target_value in unique_targets
                ]
                + [pl.len().alias("count_examples")]
            )
            .sort("feature_value")
            .select(
                [
                    pl.col(f"class_{target_value}_count").cum_sum().alias(f"cum_sum_class_{target_value}_count")
                    for target_value in unique_targets
                ]
                + [
                    pl.col(f"class_{target_value}_count").sum().alias(f"sum_class_{target_value}_count")
                    for target_value in unique_targets
                ]
                + [
                    pl.col("count_examples").cum_sum().alias("cum_sum_count_examples"),
                    pl.col("count_examples").sum().alias("sum_count_examples"),
                ]
                + [
                    # From previous select
                    pl.col("feature_value"),
                ]
            )
            .filter(
                # At least one example available
                pl.col("sum_count_examples") > pl.col("cum_sum_count_examples")
            )
            .select(
                [
                    (pl.col(f"cum_sum_class_{target_value}_count") / pl.col("cum_sum_count_examples")).alias(
                        f"left_proportion_class_{target_value}"
                    )
                    for target_value in unique_targets
                ]
                + [
                    (
                        (pl.col(f"sum_class_{target_value}_count") - pl.col(f"cum_sum_class_{target_value}_count"))
                        / (pl.col("sum_count_examples") - pl.col("cum_sum_count_examples"))
                    ).alias(f"right_proportion_class_{target_value}")
                    for target_value in unique_targets
                ]
                + [
                    (pl.col(f"sum_class_{target_value}_count") / pl.col("sum_count_examples")).alias(
                        f"parent_proportion_class_{target_value}"
                    )
                    for target_value in unique_targets
                ]
                + [
                    # From previous select
                    pl.col("cum_sum_count_examples"),
                    pl.col("sum_count_examples"),
                    pl.col("feature_value"),
                ]
            )
            .select(
                (
                    -1
                    * pl.sum_horizontal(
                        [
                            (
                                pl.col(f"left_proportion_class_{target_value}")
                                * pl.col(f"left_proportion_class_{target_value}").log(base=2)
                            ).fill_nan(0.0)
                            for target_value in unique_targets
                        ]
                    )
                ).alias("left_entropy"),
                (
                    -1
                    * pl.sum_horizontal(
                        [
                            (
                                pl.col(f"right_proportion_class_{target_value}")
                                * pl.col(f"right_proportion_class_{target_value}").log(base=2)
                            ).fill_nan(0.0)
                            for target_value in unique_targets
                        ]
                    )
                ).alias("right_entropy"),
                (
                    -1
                    * pl.sum_horizontal(
                        [
                            (
                                pl.col(f"parent_proportion_class_{target_value}")
                                * pl.col(f"parent_proportion_class_{target_value}").log(base=2)
                            ).fill_nan(0.0)
                            for target_value in unique_targets
                        ]
                    )
                ).alias("parent_entropy"),
                # From previous select
                pl.col("cum_sum_count_examples"),
                pl.col("sum_count_examples"),
                pl.col("feature_value"),
            )
            .select(
                (
                    pl.col("cum_sum_count_examples") / pl.col("sum_count_examples") * pl.col("left_entropy")
                    + (pl.col("sum_count_examples") - pl.col("cum_sum_count_examples"))
                    / pl.col("sum_count_examples")
                    * pl.col("right_entropy")
                ).alias("child_entropy"),
                # From previous select
                pl.col("parent_entropy"),
                pl.col("feature_value"),
            )
            .select(
                (pl.col("parent_entropy") - pl.col("child_entropy")).alias("information_gain"),
                # From previous select
                pl.col("parent_entropy"),
                pl.col("feature_value"),
            )
            .filter(pl.col("information_gain").is_not_nan())
            .sort("information_gain", descending=True)
            .head(1)
            .with_columns(pl.lit(feature_name).alias("feature"))
        )
        information_gain_dfs.append(information_gain_df)

    if isinstance(information_gain_dfs[0], pl.LazyFrame):
        information_gain_dfs = pl.collect_all(information_gain_dfs, streaming=self.streaming)

    information_gain_dfs = pl.concat(information_gain_dfs, how="vertical_relaxed").sort(
        "information_gain", descending=True
    )

    information_gain = 0
    if len(information_gain_dfs) > 0:
        best_params = information_gain_dfs.row(0, named=True)
        information_gain = best_params["information_gain"]

    if information_gain > 0:
        left_mask = data.select(filter=pl.col(best_params["feature"]) <= best_params["feature_value"])
        if isinstance(left_mask, pl.LazyFrame):
            left_mask = left_mask.collect(streaming=self.streaming)
        left_mask = left_mask["filter"]

        # Split data
        left_df = data.filter(left_mask)
        right_df = data.filter(~left_mask)

        left_subtree = self._build_tree(left_df, feature_names, target_name, unique_targets, depth + 1)
        right_subtree = self._build_tree(right_df, feature_names, target_name, unique_targets, depth + 1)

        if isinstance(data, pl.LazyFrame):
            target_distribution = (
                data.select(target_name)
                .collect(streaming=self.streaming)[target_name]
                .value_counts()
                .sort(target_name)["count"]
                .to_list()
            )
        else:
            target_distribution = data[target_name].value_counts().sort(target_name)["count"].to_list()

        return {
            "type": "node",
            "feature": best_params["feature"],
            "threshold": best_params["feature_value"],
            "information_gain": best_params["information_gain"],
            "entropy": best_params["parent_entropy"],
            "target_distribution": target_distribution,
            "left": left_subtree,
            "right": right_subtree,
        }
    else:
        return {"type": "leaf", "value": self.get_majority_class(data, target_name)}

This method is the heart of building the trees and I will explain it step by step. First, when entering the method, it is checked if the max depth stopping criterion is met.

if self.max_depth is not None and depth >= self.max_depth:
    return {"type": "leaf", "value": self.get_majority_class(data, target_name)}

If the current depth is equal to or greater than the max_depth , a node of the type leaf will be returned. The value of the leaf corresponds to the majority class of the data. This is calculated as follows:

def get_majority_class(self, df: Union[pl.DataFrame, pl.LazyFrame], target_name: str) -> str:
    """
    Returns the majority class of a dataframe.

    :param df: The dataframe to evaluate.
    :param target_name: Name of the target column.
    :return: majority class.
    """
    majority_class = df.group_by(target_name).len().filter(pl.col("len") == pl.col("len").max()).select(target_name)
    if isinstance(majority_class, pl.LazyFrame):
        majority_class = majority_class.collect(streaming=self.streaming)
    return majority_class[target_name][0]

To get the majority class, the count of rows per target is determined by grouping over the target column and aggregating with len() . The target instance, which is present in most of the rows, is returned as the majority class.
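A quick usage sketch with toy data (this assumes the DecisionTreeClassifier class from this article is available):

import polars as pl

clf = DecisionTreeClassifier()  # the class defined in this article
df = pl.DataFrame({"cardio": [0, 0, 1, 0, 1]})

# Class 0 appears in three of five rows, so it is returned as the majority class.
print(clf.get_majority_class(df, target_name="cardio"))  # 0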

To find a good split of the data, the information gain is used.

Equation 1 — Calculation of information gain. Image by author.
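In symbols (the standard definition, matching the description in this section; notation mine), with n_left and n_right examples going to the left and right child out of n in total:

\mathrm{IG} = H(\mathrm{parent}) - \left( \frac{n_{\mathrm{left}}}{n} \, H(\mathrm{left}) + \frac{n_{\mathrm{right}}}{n} \, H(\mathrm{right}) \right)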

To get the information gain, the parent entropy and child entropy need to be calculated.

Equation 2 — Calculation of entropy. Image by author.
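The entropy of a node is the standard Shannon entropy over the class proportions p_c (again, notation mine):

H = -\sum_{c \in \mathrm{classes}} p_c \, \log_2 p_c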

Calculating the Information Gain in Polars

The information gain is calculated for each feature value that is present in a feature column.

information_gain_df = (
    feature_data.group_by("feature_value")
    .agg(
        [
            pl.col(target_name)
            .filter(pl.col(target_name) == target_value)
            .len()
            .alias(f"class_{target_value}_count")
            for target_value in unique_targets
        ]
        + [pl.len().alias("count_examples")]
    )
    .sort("feature_value")

The feature values are grouped, and the count of each of the target values is assigned to it. Additionally, the total count of rows for that feature value is saved as count_examples . In the last step, the data is sorted by feature_value . This is needed to calculate the splits in the next step.

For the heart disease dataset, after the first calculation step, the data looks like this:

┌───────────────┬───────────────┬───────────────┬────────────────┐
│ feature_value ┆ class_0_count ┆ class_1_count ┆ count_examples │
│ ---           ┆ ---           ┆ ---           ┆ ---            │
│ i8            ┆ u32           ┆ u32           ┆ u32            │
╞═══════════════╪═══════════════╪═══════════════╪════════════════╡
│ 29            ┆ 2             ┆ 0             ┆ 2              │
│ 30            ┆ 1             ┆ 0             ┆ 1              │
│ 39            ┆ 1068          ┆ 331           ┆ 1399           │
│ 40            ┆ 975           ┆ 263           ┆ 1238           │
│ 41            ┆ 1052          ┆ 438           ┆ 1490           │
│ …             ┆ …             ┆ …             ┆ …              │
│ 60            ┆ 1054          ┆ 1460          ┆ 2514           │
│ 61            ┆ 695           ┆ 1408          ┆ 2103           │
│ 62            ┆ 566           ┆ 1125          ┆ 1691           │
│ 63            ┆ 572           ┆ 1517          ┆ 2089           │
│ 64            ┆ 479           ┆ 1217          ┆ 1696           │
└───────────────┴───────────────┴───────────────┴────────────────┘

Here, the feature age_years is processed. Class 0 stands for “no heart disease,” and class 1 stands for “heart disease.” The data is sorted by the age_years feature, and the columns contain the count of class 0 , class 1 , and the total count of examples with the respective feature value.

In the next step, the cumulative sum over the count of classes is calculated for each feature value.

.select(
    [
        pl.col(f"class_{target_value}_count").cum_sum().alias(f"cum_sum_class_{target_value}_count")
        for target_value in unique_targets
    ]
    + [
        pl.col(f"class_{target_value}_count").sum().alias(f"sum_class_{target_value}_count")
        for target_value in unique_targets
    ]
    + [
        pl.col("count_examples").cum_sum().alias("cum_sum_count_examples"),
        pl.col("count_examples").sum().alias("sum_count_examples"),
    ]
    + [
        # From previous select
        pl.col("feature_value"),
    ]
)
.filter(
    # At least one example available
    pl.col("sum_count_examples") > pl.col("cum_sum_count_examples")
)

The intuition behind it is that when a split is executed over a specific feature value, it includes the count of target values from smaller feature values. To be able to calculate the proportion, the total sum of the target values is calculated. The same procedure is repeated for count_examples , where the cumulative sum and the total sum are calculated as well.
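As a small, self-contained illustration of this step (a few rows from the table above; my sketch, not the repository code), the cumulative sums describe the left side of a split at each candidate feature value, while "total minus cumulative" describes the right side:

import polars as pl

counts = pl.DataFrame(
    {"feature_value": [39, 40, 41], "class_1_count": [331, 263, 438], "count_examples": [1399, 1238, 1490]}
)

print(
    counts.select(
        pl.col("feature_value"),
        pl.col("class_1_count").cum_sum().alias("cum_sum_class_1_count"),    # class-1 examples left of the split
        pl.col("class_1_count").sum().alias("sum_class_1_count"),            # total class-1 examples (broadcast)
        pl.col("count_examples").cum_sum().alias("cum_sum_count_examples"),  # examples left of the split
        pl.col("count_examples").sum().alias("sum_count_examples"),          # total examples (broadcast)
    )
)
# Right side of a split = sum_* - cum_sum_*.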

After the calculation, the data looks like this:

┌───────────────────────┬───────────────────────┬───────────────────┬───────────────────┬────────────────────────┬────────────────────┬───────────────┐
│ cum_sum_class_0_count ┆ cum_sum_class_1_count ┆ sum_class_0_count ┆ sum_class_1_count ┆ cum_sum_count_examples ┆ sum_count_examples ┆ feature_value │
│ ---                   ┆ ---                   ┆ ---               ┆ ---               ┆ ---                    ┆ ---                ┆ ---           │
│ u32                   ┆ u32                   ┆ u32               ┆ u32               ┆ u32                    ┆ u32                ┆ i8            │
╞═══════════════════════╪═══════════════════════╪═══════════════════╪═══════════════════╪════════════════════════╪════════════════════╪═══════════════╡
│ 3                     ┆ 0                     ┆ 27717             ┆ 26847             ┆ 3                      ┆ 54564              ┆ 29            │
│ 4                     ┆ 0                     ┆ 27717             ┆ 26847             ┆ 4                      ┆ 54564              ┆ 30            │
│ 1097                  ┆ 324                   ┆ 27717             ┆ 26847             ┆ 1421                   ┆ 54564              ┆ 39            │
│ 2090                  ┆ 595                   ┆ 27717             ┆ 26847             ┆ 2685                   ┆ 54564              ┆ 40            │
│ 3155                  ┆ 1025                  ┆ 27717             ┆ 26847             ┆ 4180                   ┆ 54564              ┆ 41            │
│ …                     ┆ …                     ┆ …                 ┆ …                 ┆ …                      ┆ …                  ┆ …             │
│ 24302                 ┆ 20162                 ┆ 27717             ┆ 26847             ┆ 44464                  ┆ 54564              ┆ 59            │
│ 25356                 ┆ 21581                 ┆ 27717             ┆ 26847             ┆ 46937                  ┆ 54564              ┆ 60            │
│ 26046                 ┆ 23020                 ┆ 27717             ┆ 26847             ┆ 49066                  ┆ 54564              ┆ 61            │
│ 26615                 ┆ 24131                 ┆ 27717             ┆ 26847             ┆ 50746                  ┆ 54564              ┆ 62            │
│ 27216                 ┆ 25652                 ┆ 27717             ┆ 26847             ┆ 52868                  ┆ 54564              ┆ 63            │
└───────────────────────┴───────────────────────┴───────────────────┴───────────────────┴────────────────────────┴────────────────────┴───────────────┘

In the next step, the proportions are calculated for each feature value.

.select(
    [
        (pl.col(f"cum_sum_class_{target_value}_count") / pl.col("cum_sum_count_examples")).alias(
            f"left_proportion_class_{target_value}"
        )
        for target_value in unique_targets
    ]
    + [
        (
            (pl.col(f"sum_class_{target_value}_count") - pl.col(f"cum_sum_class_{target_value}_count"))
            / (pl.col("sum_count_examples") - pl.col("cum_sum_count_examples"))
        ).alias(f"right_proportion_class_{target_value}")
        for target_value in unique_targets
    ]
    + [
        (pl.col(f"sum_class_{target_value}_count") / pl.col("sum_count_examples")).alias(
            f"parent_proportion_class_{target_value}"
        )
        for target_value in unique_targets
    ]
    + [
        # From previous select
        pl.col("cum_sum_count_examples"),
        pl.col("sum_count_examples"),
        pl.col("feature_value"),
    ]
)

To calculate the proportions, the results from the previous step can be used. For the left proportion, the cumulative sum of each target value is divided by the cumulative sum of the example count. For the right proportion, we need to know how many examples we have on the right side for each target value. That is calculated by subtracting the cumulative sum of the target value from its total sum. The same approach is used to determine the total count of examples on the right side, by subtracting the cumulative sum of the example count from the total sum of the example count. Additionally, the parent proportion is calculated. This is done by dividing the sum of the target value counts by the total count of examples.

This is the result data after this step:

┌─────────────────────────┬─────────────────────────┬──────────────────────────┬──────────────────────────┬───┬───────────────────────────┬────────────────────────┬────────────────────┬───────────────┐
│ left_proportion_class_0 ┆ left_proportion_class_1 ┆ right_proportion_class_0 ┆ right_proportion_class_1 ┆ … ┆ parent_proportion_class_1 ┆ cum_sum_count_examples ┆ sum_count_examples ┆ feature_value │
│ ---                     ┆ ---                     ┆ ---                      ┆ ---                      ┆   ┆ ---                       ┆ ---                    ┆ ---                ┆ ---           │
│ f64                     ┆ f64                     ┆ f64                      ┆ f64                      ┆   ┆ f64                       ┆ u32                    ┆ u32                ┆ i8            │
╞═════════════════════════╪═════════════════════════╪══════════════════════════╪══════════════════════════╪═══╪═══════════════════════════╪════════════════════════╪════════════════════╪═══════════════╡
│ [website]               ┆ [website]               ┆ [website]                ┆ [website]                ┆ … ┆ [website]                 ┆ 3                      ┆ 54564              ┆ 29            │
│ [website]               ┆ [website]               ┆ [website]                ┆ [website]                ┆ … ┆ [website]                 ┆ 4                      ┆ 54564              ┆ 30            │
│ [website]               ┆ [website]               ┆ [website]                ┆ [website]                ┆ … ┆ [website]                 ┆ 1428                   ┆ 54564              ┆ 39            │
│ [website]               ┆ [website]               ┆ [website]                ┆ [website]                ┆ … ┆ [website]                 ┆ 2709                   ┆ 54564              ┆ 40            │
│ [website]               ┆ [website]               ┆ [website]                ┆ [website]                ┆ … ┆ [website]                 ┆ 4146                   ┆ 54564              ┆ 41            │
│ …                       ┆ …                       ┆ …                        ┆ …                        ┆ … ┆ …                         ┆ …                      ┆ …                  ┆ …             │
│ [website]               ┆ [website]               ┆ [website]                ┆ [website]                ┆ … ┆ [website]                 ┆ 44419                  ┆ 54564              ┆ 59            │
│ [website]               ┆ [website]               ┆ [website]                ┆ [website]                ┆ … ┆ [website]                 ┆ 46922                  ┆ 54564              ┆ 60            │
│ [website]               ┆ [website]               ┆ [website]                ┆ [website]                ┆ … ┆ [website]                 ┆ 49067                  ┆ 54564              ┆ 61            │
│ [website]               ┆ [website]               ┆ [website]                ┆ [website]                ┆ … ┆ [website]                 ┆ 50770                  ┆ 54564              ┆ 62            │
│ [website]               ┆ [website]               ┆ [website]                ┆ [website]                ┆ … ┆ [website]                 ┆ 52859                  ┆ 54564              ┆ 63            │
└─────────────────────────┴─────────────────────────┴──────────────────────────┴──────────────────────────┴───┴───────────────────────────┴────────────────────────┴────────────────────┴───────────────┘

Now that the proportions are available, the entropy can be calculated.

.select(
    (
        -1
        * pl.sum_horizontal(
            [
                (
                    pl.col(f"left_proportion_class_{target_value}")
                    * pl.col(f"left_proportion_class_{target_value}").log(base=2)
                ).fill_nan(0.0)
                for target_value in unique_targets
            ]
        )
    ).alias("left_entropy"),
    (
        -1
        * pl.sum_horizontal(
            [
                (
                    pl.col(f"right_proportion_class_{target_value}")
                    * pl.col(f"right_proportion_class_{target_value}").log(base=2)
                ).fill_nan(0.0)
                for target_value in unique_targets
            ]
        )
    ).alias("right_entropy"),
    (
        -1
        * pl.sum_horizontal(
            [
                (
                    pl.col(f"parent_proportion_class_{target_value}")
                    * pl.col(f"parent_proportion_class_{target_value}").log(base=2)
                ).fill_nan(0.0)
                for target_value in unique_targets
            ]
        )
    ).alias("parent_entropy"),
    # From previous select
    pl.col("cum_sum_count_examples"),
    pl.col("sum_count_examples"),
    pl.col("feature_value"),
)

For the calculation of the entropy, Equation 2 is used. The left entropy is calculated using the left proportion, and the right entropy uses the right proportion. For the parent entropy, the parent proportion is used. In this implementation, pl.sum_horizontal() is used to sum the per-class terms in order to make use of possible optimizations from polars. This can also be replaced with the python-native sum() method.
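For illustration, the python-native alternative mentioned above would look roughly like this (a sketch for the left entropy only; summing a list of polars expressions still yields a single expression):

import polars as pl

unique_targets = [0, 1]

left_entropy = (
    -1
    * sum(
        (
            pl.col(f"left_proportion_class_{t}")
            * pl.col(f"left_proportion_class_{t}").log(base=2)
        ).fill_nan(0.0)
        for t in unique_targets
    )
).alias("left_entropy")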

The data with the entropy values looks as follows:

┌──────────────┬───────────────┬────────────────┬────────────────────────┬────────────────────┬───────────────┐
│ left_entropy ┆ right_entropy ┆ parent_entropy ┆ cum_sum_count_examples ┆ sum_count_examples ┆ feature_value │
│ ---          ┆ ---           ┆ ---            ┆ ---                    ┆ ---                ┆ ---           │
│ f64          ┆ f64           ┆ f64            ┆ u32                    ┆ u32                ┆ i8            │
╞══════════════╪═══════════════╪════════════════╪════════════════════════╪════════════════════╪═══════════════╡
│ [website]    ┆ [website]     ┆ [website]      ┆ 3                      ┆ 54564              ┆ 29            │
│ [website]    ┆ [website]     ┆ [website]      ┆ 4                      ┆ 54564              ┆ 30            │
│ [website]    ┆ [website]     ┆ [website]      ┆ 1427                   ┆ 54564              ┆ 39            │
│ [website]    ┆ [website]     ┆ [website]      ┆ 2694                   ┆ 54564              ┆ 40            │
│ [website]    ┆ [website]     ┆ [website]      ┆ 4177                   ┆ 54564              ┆ 41            │
│ …            ┆ …             ┆ …              ┆ …                      ┆ …                  ┆ …             │
│ [website]    ┆ [website]     ┆ [website]      ┆ 44483                  ┆ 54564              ┆ 59            │
│ [website]    ┆ [website]     ┆ [website]      ┆ 46944                  ┆ 54564              ┆ 60            │
│ [website]    ┆ [website]     ┆ [website]      ┆ 49106                  ┆ 54564              ┆ 61            │
│ [website]    ┆ [website]     ┆ [website]      ┆ 50800                  ┆ 54564              ┆ 62            │
│ [website]    ┆ [website]     ┆ [website]      ┆ 52877                  ┆ 54564              ┆ 63            │
└──────────────┴───────────────┴────────────────┴────────────────────────┴────────────────────┴───────────────┘

Almost there! The final step is missing, which is calculating the child entropy and using that to get the information gain.

.select(
    (
        pl.col("cum_sum_count_examples") / pl.col("sum_count_examples") * pl.col("left_entropy")
        + (pl.col("sum_count_examples") - pl.col("cum_sum_count_examples"))
        / pl.col("sum_count_examples")
        * pl.col("right_entropy")
    ).alias("child_entropy"),
    # From previous select
    pl.col("parent_entropy"),
    pl.col("feature_value"),
)
.select(
    (pl.col("parent_entropy") - pl.col("child_entropy")).alias("information_gain"),
    # From previous select
    pl.col("parent_entropy"),
    pl.col("feature_value"),
)
.filter(pl.col("information_gain").is_not_nan())
.sort("information_gain", descending=True)
.head(1)
.with_columns(pl.lit(feature_name).alias("feature"))
)
information_gain_dfs.append(information_gain_df)

For the child entropy, the left and right entropy are weighted by the count of examples for the feature values. The sum of both weighted entropy values is used as the child entropy. To calculate the information gain, we simply need to subtract the child entropy from the parent entropy, as can be seen in Equation 1. The best feature value is determined by sorting the data by information gain and selecting the first row. It is appended to a list that gathers the best feature values of all features.

Before applying .head(1) , the data looks as follows:

┌──────────────────┬────────────────┬───────────────┐
│ information_gain ┆ parent_entropy ┆ feature_value │
│ ---              ┆ ---            ┆ ---           │
│ f64              ┆ f64            ┆ i8            │
╞══════════════════╪════════════════╪═══════════════╡
│ [website]        ┆ [website]      ┆ 54            │
│ [website]        ┆ [website]      ┆ 52            │
│ [website]        ┆ [website]      ┆ 53            │
│ [website]        ┆ [website]      ┆ 50            │
│ [website]        ┆ [website]      ┆ 51            │
│ …                ┆ …              ┆ …             │
│ [website]        ┆ [website]      ┆ 62            │
│ [website]        ┆ [website]      ┆ 39            │
│ [website]        ┆ [website]      ┆ 63            │
│ [website]        ┆ [website]      ┆ 30            │
│ [website]        ┆ [website]      ┆ 29            │
└──────────────────┴────────────────┴───────────────┘

Here, it can be seen that the age_years feature value of 54 has the highest information gain. This feature value is collected for the age_years feature and then has to compete against the best values of the other features.

Selecting the Best Split and Defining the Sub-Trees

To select the best split, the highest information gain needs to be found across all features.

if isinstance(information_gain_dfs[0], pl.LazyFrame):
    information_gain_dfs = pl.collect_all(information_gain_dfs, streaming=self.streaming)

information_gain_dfs = pl.concat(information_gain_dfs, how="vertical_relaxed").sort(
    "information_gain", descending=True
)

For that, the pl.collect_all() method is used on information_gain_dfs . This evaluates all LazyFrames in parallel, which makes the processing very efficient. The result is a list of polars DataFrames, which are concatenated and sorted by information gain.
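For illustration, here is a minimal, standalone sketch of pl.collect_all() (toy frames, not the repository code): it evaluates a list of LazyFrames in one go, so their work can run in parallel.

import polars as pl

lf = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}).lazy()
plans = [lf.select(pl.col("a").sum()), lf.select(pl.col("b").mean())]

# Returns a list of DataFrames, one per LazyFrame, evaluated together.
results = pl.collect_all(plans)
print(results[0], results[1])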

For the heart disease example, the data looks like this:

┌──────────────────┬────────────────┬───────────────┬─────────────┐
│ information_gain ┆ parent_entropy ┆ feature_value ┆ feature     │
│ ---              ┆ ---            ┆ ---           ┆ ---         │
│ f64              ┆ f64            ┆ f64           ┆ str         │
╞══════════════════╪════════════════╪═══════════════╪═════════════╡
│ [website]        ┆ [website]      ┆ [website]     ┆ ap_hi       │
│ [website]        ┆ [website]      ┆ [website]     ┆ ap_lo       │
│ [website]        ┆ [website]      ┆ [website]     ┆ cholesterol │
│ [website]        ┆ [website]      ┆ [website]     ┆ age_years   │
│ [website]        ┆ [website]      ┆ [website]     ┆ bmi         │
│ …                ┆ …              ┆ …             ┆ …           │
│ [website]        ┆ [website]      ┆ [website]     ┆ active      │
│ [website]        ┆ [website]      ┆ [website]     ┆ height      │
│ [website]        ┆ [website]      ┆ [website]     ┆ smoke       │
│ [website]        ┆ [website]      ┆ [website]     ┆ alco        │
│ [website]        ┆ [website]      ┆ [website]     ┆ gender      │
└──────────────────┴────────────────┴───────────────┴─────────────┘

Out of all features, the ap_hi (systolic blood pressure) feature value of 129 results in the best information gain and thus will be selected for the first split.

information_gain = 0
if len(information_gain_dfs) > 0:
    best_params = information_gain_dfs.row(0, named=True)
    information_gain = best_params["information_gain"]

In some cases, information_gain_dfs might be empty, for example, when all splits result in having only examples on the left or right side. If this is the case, the information gain is zero. Otherwise, we get the feature value with the highest information gain.

if information_gain > 0:
    left_mask = data.select(filter=pl.col(best_params["feature"]) <= best_params["feature_value"])
    if isinstance(left_mask, pl.LazyFrame):
        left_mask = left_mask.collect(streaming=self.streaming)
    left_mask = left_mask["filter"]

    # Split data
    left_df = data.filter(left_mask)
    right_df = data.filter(~left_mask)

    left_subtree = self._build_tree(left_df, feature_names, target_name, unique_targets, depth + 1)
    right_subtree = self._build_tree(right_df, feature_names, target_name, unique_targets, depth + 1)

    if isinstance(data, pl.LazyFrame):
        target_distribution = (
            data.select(target_name)
            .collect(streaming=self.streaming)[target_name]
            .value_counts()
            .sort(target_name)["count"]
            .to_list()
        )
    else:
        target_distribution = data[target_name].value_counts().sort(target_name)["count"].to_list()

    return {
        "type": "node",
        "feature": best_params["feature"],
        "threshold": best_params["feature_value"],
        "information_gain": best_params["information_gain"],
        "entropy": best_params["parent_entropy"],
        "target_distribution": target_distribution,
        "left": left_subtree,
        "right": right_subtree,
    }
else:
    return {"type": "leaf", "value": self.get_majority_class(data, target_name)}

When the information gain is greater than zero, the sub-trees are defined. For that, the left mask is defined using the feature value that resulted in the best information gain. The mask is applied to the parent data to get the left data frame. The negation of the left mask is used to define the right data frame. Both left and right data frames are used to call the _build_tree() method again with an increased depth+1. As the last step, the target distribution is calculated. This is used as additional information on the node and will be visible when plotting the tree along with the other information.

When information gain is zero, a leaf instance will be returned. This contains the majority class of the given data.

It is possible to make predictions in two different ways. If the input data is small, the predict() method can be used.

def predict(self, data: Iterable[dict]):
    def _predict_sample(node, sample):
        if node["type"] == "leaf":
            return node["value"]
        if sample[node["feature"]] <= node["threshold"]:
            return _predict_sample(node["left"], sample)
        else:
            return _predict_sample(node["right"], sample)

    predictions = [_predict_sample(self.tree, sample) for sample in data]
    return predictions

Here, the data can be provided as an iterable of dicts. Each dict contains the feature names as keys and the feature values as values. By using the _predict_sample() method, the path in the tree is followed until a leaf node is reached. This contains the class that is assigned to the respective example.
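A small usage sketch (clf is assumed to be a DecisionTreeClassifier already fitted on the heart disease data; the feature values below are made up):

samples = [
    {"age_years": 54, "ap_hi": 140, "ap_lo": 90, "cholesterol": 2, "gluc": 1,
     "smoke": 0, "alco": 0, "active": 1, "gender": 1, "height": 168, "weight": 72, "bmi": 25.5},
]
print(clf.predict(samples))  # e.g. [1] -- only the features used in the tree's nodes are looked up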

def predict_many(self, data: Union[pl.DataFrame, pl.LazyFrame]) -> List[Union[int, float]]:
    """
    Predict method.

    :param data: Polars DataFrame or LazyFrame.
    :return: List of predicted target values.
    """
    if self.categorical_mappings:
        data = self.apply_categorical_mappings(data)

    def _predict_many(node, temp_data):
        if node["type"] == "node":
            left = _predict_many(node["left"], temp_data.filter(pl.col(node["feature"]) <= node["threshold"]))
            right = _predict_many(node["right"], temp_data.filter(pl.col(node["feature"]) > node["threshold"]))
            return pl.concat([left, right], how="diagonal_relaxed")
        else:
            return temp_data.select(pl.col("temp_prediction_index"), pl.lit(node["value"]).alias("prediction"))

    data = data.with_row_index("temp_prediction_index")
    predictions = _predict_many(self.tree, data).sort("temp_prediction_index").select(pl.col("prediction"))

    # Convert predictions to a list
    if isinstance(predictions, pl.LazyFrame):
        # Despite the execution plan saying there is no streaming, using streaming here significantly
        # increases the performance and decreases the memory footprint.
        predictions = predictions.collect(streaming=True)
    predictions = predictions["prediction"].to_list()
    return predictions

If a big example set should be predicted, it is more efficient to use the predict_many() method. This makes use of the advantages that polars provides in terms of parallel processing and memory efficiency.

The data can be provided as a polars DataFrame or LazyFrame. Similarly to the _build_tree() method in the training process, a _predict_many() method is called recursively. All examples in the data are filtered into sub-trees until the leaf node is reached. Examples that went the same path to the leaf node get the same prediction value assigned. At the end of the process, all sub-frames of examples are concatenated again. Since the order can not be preserved with that, a temporary prediction index is set at the beginning of the process. When all predictions are done, the original order is restored with sorting by that index.

A usage example for the decision tree classifier can be found here. The decision tree is trained on a heart disease dataset. A train and test set is defined to test the performance of the implementation. After the training, the tree is plotted and saved to a file.

With a max depth of four, the resulting tree looks as follows:

Decision tree for heart disease dataset. Image by author.

It achieves a train and test accuracy of 73% on the given data.

One goal of using polars as a backend for decision trees is to explore the runtime and memory usage and compare it to other frameworks. For that, I created a memory profiling script that can be found here.

The script compares this implementation, which is called “efficient-trees” against sklearn and lightgbm. For efficient-trees, the lazy streaming variant and non-lazy in-memory variant are tested.

Comparison of runtime and memory usage. Image by author.

In the graph, it can be seen that lightgbm is the fastest and most memory-efficient framework. Since it added support for Arrow datasets a while ago, the data can be processed efficiently. However, since the whole dataset still needs to be loaded and can’t be streamed, there are still potential scaling issues.

The next best framework is efficient-trees without and with streaming. While efficient-trees without streaming has a superior runtime, the streaming variant uses less memory.

The sklearn implementation achieves the worst results in terms of memory usage and runtime. Since the data needs to be provided as a numpy array, the memory usage grows a lot. The runtime can be explained by the fact that only one CPU core is used. Support for multi-threading or multi-processing doesn’t exist yet.

As can be seen in the comparison of the frameworks, the possibility of streaming the data instead of holding it in memory sets this implementation apart from the other frameworks. However, the streaming engine is still considered an experimental feature, and not all operations are compatible with streaming yet.

To get a better understanding of what happens in the background, a look into the execution plan is useful. Let’s jump back into the training process and get the execution plan for the following operation:

def fit(self, data: Union[pl.DataFrame, pl.LazyFrame], target_name: str) -> None:
    """
    Fit method to train the decision tree.

    :param data: Polars DataFrame or LazyFrame containing the training data.
    :param target_name: Name of the target column
    """
    columns = data.collect_schema().names()
    feature_names = [col for col in columns if col != target_name]

    # Shrink dtypes
    data = data.select(pl.all().shrink_dtype()).with_columns(
        pl.col(target_name).cast(pl.UInt64).shrink_dtype()
    )

The execution plan for data can be created with the following command:
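With polars, the plan of a LazyFrame can be printed via explain(); presumably something along these lines was used here (the streaming flag depends on the polars version):

print(data.explain(streaming=True))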

This returns the execution plan for the LazyFrame.

WITH_COLUMNS: [col("cardio").strict_cast(UInt64).shrink_dtype().alias("cardio")] SELECT [col("gender").shrink_dtype(), col("height").shrink_dtype(), col("weight").shrink_dtype(), col("ap_hi").shrink_dtype(), col("ap_lo").shrink_dtype(), col("cholesterol").shrink_dtype(), col("gluc").shrink_dtype(), col("smoke").shrink_dtype(), col("alco").shrink_dtype(), col("active").shrink_dtype(), col("cardio").shrink_dtype(), col("age_years").shrink_dtype(), col("bmi").shrink_dtype()] FROM STREAMING: DF ["gender", "height", "weight", "ap_hi"]; PROJECT 13/13 COLUMNS; SELECTION: None.

The keyword that is essential here is STREAMING. It can be seen that the initial dataset loading happens in streaming mode, but when shrinking the dtypes, the whole dataset needs to be loaded into memory. Since the dtype shrinking is not a necessary part, I remove it temporarily to explore up to which operation streaming is supported.

The next problematic operation is mapping the categorical features.

def apply_categorical_mappings(self, data: Union[pl.DataFrame, pl.LazyFrame]) -> Union[pl.DataFrame, pl.LazyFrame]:
    """
    Apply categorical mappings on input frame.

    :param data: Polars DataFrame or LazyFrame with categorical columns.
    :return: Polars DataFrame or LazyFrame with mapped categorical columns
    """
    return data.with_columns(
        [pl.col(col).replace(self.categorical_mappings[col]).cast(pl.UInt32) for col in self.categorical_columns]
    )

The replace expression doesn’t support the streaming mode. Even after removing the cast, streaming is not used which can be seen in the execution plan.

WITH_COLUMNS: [col("gender").replace([Series, Series]), col("cholesterol").replace([Series, Series]), col("gluc").replace([Series, Series]), col("smoke").replace([Series, Series]), col("alco").replace([Series, Series]), col("active").replace([Series, Series])] STREAMING: DF ["gender", "height", "weight", "ap_hi"]; PROJECT */13 COLUMNS; SELECTION: None.

Moving on, I also remove the support for categorical features. What happens next is the calculation of the information gain.

information_gain_df = (
    feature_data.group_by("feature_value")
    .agg(
        [
            pl.col(target_name)
            .filter(pl.col(target_name) == target_value)
            .len()
            .alias(f"class_{target_value}_count")
            for target_value in unique_targets
        ]
        + [pl.len().alias("count_examples")]
    )
    .sort("feature_value")
)

Unfortunately, the streaming mode is already not supported anymore in the first part of the calculation. Here, using pl.col().filter() inside the aggregation prevents us from streaming the data.

SORT BY [col("feature_value")] AGGREGATE [col("cardio").filter([(col("cardio")) == (1)]).count().alias("class_1_count"), col("cardio").filter([(col("cardio")) == (0)]).count().alias("class_0_count"), col("cardio").count().alias("count_examples")] BY [col("feature_value")] FROM STREAMING: RENAME simple π 2/2 ["gender", "cardio"] DF ["gender", "height", "weight", "ap_hi"]; PROJECT 2/13 COLUMNS; SELECTION: col("gender").is_not_null().

Since this is not so easy to change, I will stop the exploration here. It can be concluded that in the decision tree implementation with polars backend, the full potential of streaming can’t be used yet since essential operators are still missing streaming support. Since the streaming mode is under active development, it might be possible to run most of the operators or even the whole calculation of the decision tree in the streaming mode in the future.

In this blog post, I presented my custom implementation of a decision tree using polars as a backend. I showed implementation details and compared it to other decision tree frameworks. The comparison reveals that this implementation can outperform sklearn in terms of runtime and memory usage. But there are still other frameworks like lightgbm that provide a better runtime and more efficient processing. There is a lot of potential in the streaming mode when using the polars backend. Currently, some operators prevent an end-to-end streaming approach due to a lack of streaming support, but this is under active development. When polars makes progress with that, it is worth revisiting this implementation and comparing it to other frameworks again.


Is Perplexity's Sonar really more 'factual' than its AI rivals? See for yourself


AI search engine Perplexity says its latest release goes above and beyond for user satisfaction -- especially compared to OpenAI's GPT-4o.

On Tuesday, Perplexity showcased a new version of Sonar, its proprietary model. Based on Meta's open-source Llama [website] 70B, the updated Sonar "is optimized for answer quality and user experience," the company says, having been trained to improve the readability and accuracy of its answers in search mode.

Also: The billion-dollar AI firm no one is talking about - and why you should care.

Perplexity says Sonar scored higher than GPT-4o mini and Claude models on factuality and readability. The company defines factuality as a measure of "how well a model can answer questions using facts that are grounded in search results, and its ability to resolve conflicting or missing information." However, there isn't an external benchmark to measure this.

Instead, Perplexity displays several screenshot examples of side-by-side answers from Sonar and competitor models including GPT-4o and Claude [website] Sonnet. They do, in my opinion, differ in directness, completion, and scannability, often favoring Sonar's cleaner formatting (a subjective preference) and higher number of citations -- though that doesn't speak directly to source quality, only quantity. The sources a chatbot cites are also influenced by the publisher and media partner agreements of its parent company, which Perplexity and OpenAI each have.

More importantly, the examples don't include the queries themselves, only the answers, and Perplexity does not clarify its methodology for how it prompted or measured the responses -- differences between queries, the number of queries run, etc. -- instead leaving the comparisons up to users to "see the difference." ZDNET has reached out to Perplexity for comment.

One of Perplexity's "factuality and readability" examples. Perplexity.

Perplexity says that online A/B testing showed that users were much more satisfied and engaged with Sonar than with GPT-4o mini, Claude [website] Haiku, and Claude [website] Sonnet, but it didn't expand on the specifics of these results.

Also: The work tasks people use Claude AI for most.

"Sonar significantly outperforms models in its class, like GPT-4o mini and Claude [website] Haiku, while closely matching or exceeding the performance of frontier models like GPT-4o and Claude [website] Sonnet for user satisfaction," Perplexity's announcement states.

According to Perplexity, Sonar's speed of 1,200 tokens per second enables it to answer queries almost instantly and work 10 times faster than Gemini [website] Flash. Testing showed Sonar surpassing GPT-4o mini and Claude [website] Haiku "by a substantial margin," but the company doesn't clarify the details of that testing. The company also says Sonar outperforms more expensive frontier models like Claude [website] Sonnet "while closely approaching the performance of GPT-4o."

Sonar did beat its two competitors, among others, on academic benchmark tests IFEval and MMLU, which evaluate how well a model follows user instructions and its grasp of "world knowledge" across disciplines.

Also: Cerebras CEO on DeepSeek: Every time computing gets cheaper, the market gets bigger.

Want to try it for yourself? The upgraded Sonar is available for all Pro users, who can make it their default model in their settings or access it through the Sonar API.


The Future of Content is ‘Headless’


Content management has come a long way since the early days of the internet. In the early 2000s, websites were the sole digital channel, and managing content was a straightforward but rigid process. Then came the social media revolution, followed by the android boom between 2008 and 2009. However, the real game-changer was the rise of cloud software, which completely transformed how content is created, delivered and managed.

Traditionally, launching a website meant installing a content management system (CMS), where content creation and presentation were tightly linked. Eventually, as digital landscapes evolved, businesses needed more flexibility. This is when headless CMS, a modern solution that separates content creation from its presentation, was introduced.

Unlike traditional CMS platforms, headless CMS allows developers to use APIs to distribute content seamlessly across websites, mobile apps, and other digital platforms. This separation of content and design not only empowers teams to work independently but also speeds up content updates, enhances user experience, and ensures consistency across multiple channels.

Artificial intelligence (AI) plays a significant role in content management. Nishant Patel, founder and CTO at Contentstack, noted, “AI today is largely centred around generative AI, which excels at content generation. Many marketing websites and mobile apps leverage AI tools built into our software to create dynamic content effortlessly.”.

One of Contentstack’s AI-driven innovations is Brand Kit, a tool designed to capture a brand’s unique voice and tone. When companies use generative AI within their CMS, they ensure that the content aligns with their identity, which makes messaging more authentic and impactful.

Another breakthrough product, Automate, is changing the game in content integration. Patel shared a compelling use case from Golfbreaks, a travel enterprise specialising in golf vacation packages.

Previously, compiling these packages was a manual, time-consuming task that took days. By integrating AI-powered LLMs like Llama, Gemini, and OpenAI into Automate, Golfbreaks reduced this process to mere hours. The system curates deals, structures them in a suitable format, uploads them to its website, and prepares them for approval – all in record time.

When it comes to the global adoption of headless CMS, North America is leading the charge. The US and Canada have been at the forefront because of their robust digital infrastructure, a thriving tech ecosystem, and a high concentration of businesses seeking agile content management solutions.

Sectors like e-commerce, media, and technology have embraced headless CMS to deliver seamless content experiences across multiple platforms. This early adoption has spurred continuous innovation, with both startups and established players driving the evolution of headless CMS solutions.

Speaking about competition, Patel mentioned, “We focus on mid-sized companies and large enterprises like Fortune 2000 companies and major brands around the world that are using Contentstack to deliver their content. This includes websites, mobile apps, gaming software, and more.”.

He mentioned that in the traditional content management space, there are incumbents like Adobe, Sitecore, and Optimizely, and that they have only recently started making investments in headless CMS.

The global headless CMS software market is on the rise. Valued at $[website] billion in 2023, it’s set to skyrocket to $[website] billion by 2032, growing at a CAGR of [website].

The surge is driven by businesses prioritising SEO, performance optimisation, and omnichannel content distribution. Headless CMS is becoming the backbone of modern digital experiences. The integration of AI, automation, and cloud-based collaboration is reshaping content management, making it more dynamic and future-proof.

For developers, marketers, and business leaders, adopting a headless CMS is more than just an upgrade; it could be a transformative shift.


Market Impact Analysis

Market Growth Trend

Year    2018   2019   2020   2021   2022   2023   2024
Growth  23.1%  27.8%  29.2%  32.4%  34.2%  35.2%  35.6%

Quarterly Growth Rate

Quarter      Q1 2024  Q2 2024  Q3 2024  Q4 2024
Growth Rate  32.5%    34.8%    36.2%    35.6%

Market Segments and Growth Drivers

Segment                      Market Share  Growth Rate
Machine Learning             29%           38.4%
Computer Vision              18%           35.7%
Natural Language Processing  24%           41.5%
Robotics                     15%           22.3%
Other AI Technologies        14%           31.8%

Technology Maturity Curve

Different technologies within the ecosystem are at varying stages of maturity:

[Chart: hype-cycle stages — Innovation Trigger, Peak of Inflated Expectations, Trough of Disillusionment, Slope of Enlightenment, Plateau of Productivity — with AI/ML, Blockchain, VR/AR, Cloud, and Mobile positioned along the curve]

Competitive Landscape Analysis

Company       Market Share
Google AI     18.3%
Microsoft AI  15.7%
IBM Watson    11.2%
Amazon AI     9.8%
OpenAI        8.4%

Future Outlook and Predictions

The decision tree and broader AI tooling landscape is evolving rapidly, driven by technological advancements and shifting business requirements. Based on current trends and expert analyses, we can anticipate several significant developments across different time horizons:

Year-by-Year Technology Evolution

Based on current trajectory and expert analyses, we can project the following development timeline:

2024: Early adopters begin implementing specialized solutions with measurable results
2025: Industry standards emerging to facilitate broader adoption and integration
2026: Mainstream adoption begins as technical barriers are addressed
2027: Integration with adjacent technologies creates new capabilities
2028: Business models transform as capabilities mature
2029: Technology becomes embedded in core infrastructure and processes
2030: New paradigms emerge as the technology reaches full maturity

Technology Maturity Curve

Different technologies within the ecosystem are at varying stages of maturity, influencing adoption timelines and investment priorities:

[Chart: adoption/maturity over time and development stage — Innovation, Early Adoption, Growth, Maturity, Decline/Legacy — distinguishing emerging tech, the current focus, established tech, and mature solutions. Interactive diagram available in full report.]

Innovation Trigger

  • Generative AI for specialized domains
  • Blockchain for supply chain verification

Peak of Inflated Expectations

  • Digital twins for business processes
  • Quantum-resistant cryptography

Trough of Disillusionment

  • Consumer AR/VR applications
  • General-purpose blockchain

Slope of Enlightenment

  • AI-driven analytics
  • Edge computing

Plateau of Productivity

  • Cloud infrastructure
  • Mobile applications

Technology Evolution Timeline

1-2 Years
  • Improved generative models
  • specialized AI applications
3-5 Years
  • AI-human collaboration systems
  • multimodal AI platforms
5+ Years
  • General AI capabilities
  • AI-driven scientific breakthroughs

Expert Perspectives

Leading experts in the ai tech sector provide diverse perspectives on how the landscape will evolve over the coming years:

"The next frontier is AI systems that can reason across modalities and domains with minimal human guidance."

— AI Researcher

"Organizations that develop effective AI governance frameworks will gain competitive advantage."

— Industry Analyst

"The AI talent gap remains a critical barrier to implementation for most enterprises."

— Chief AI Officer

Areas of Expert Consensus

  • Acceleration of Innovation: The pace of technological evolution will continue to increase
  • Practical Integration: Focus will shift from proof-of-concept to operational deployment
  • Human-Technology Partnership: Most effective implementations will optimize human-machine collaboration
  • Regulatory Influence: Regulatory frameworks will increasingly shape technology development

Short-Term Outlook (1-2 Years)

In the immediate future, organizations will focus on implementing and optimizing currently available technologies to address pressing ai tech challenges:

  • Improved generative models
  • specialized AI applications
  • enhanced AI ethics frameworks

These developments will be characterized by incremental improvements to existing frameworks rather than revolutionary changes, with emphasis on practical deployment and measurable outcomes.

Mid-Term Outlook (3-5 Years)

As technologies mature and organizations adapt, more substantial transformations will emerge in how security is approached and implemented:

  • AI-human collaboration systems
  • multimodal AI platforms
  • democratized AI development

This period will see significant changes in security architecture and operational models, with increasing automation and integration between previously siloed security functions. Organizations will shift from reactive to proactive security postures.

Long-Term Outlook (5+ Years)

Looking further ahead, more fundamental shifts will reshape how cybersecurity is conceptualized and implemented across digital ecosystems:

  • General AI capabilities
  • AI-driven scientific breakthroughs
  • new computing paradigms

These long-term developments will likely require significant technical breakthroughs, new regulatory frameworks, and evolution in how organizations approach security as a fundamental business function rather than a technical discipline.

Key Risk Factors and Uncertainties

Several critical factors could significantly impact the trajectory of ai tech evolution:

Ethical concerns about AI decision-making
Data privacy regulations
Algorithm bias

Organizations should monitor these factors closely and develop contingency strategies to mitigate potential negative impacts on technology implementation timelines.

Alternative Future Scenarios

The evolution of technology can follow different paths depending on various factors including regulatory developments, investment trends, technological breakthroughs, and market adoption. We analyze three potential scenarios:

Optimistic Scenario

Responsible AI driving innovation while minimizing societal disruption

Key Drivers: Supportive regulatory environment, significant research breakthroughs, strong market incentives, and rapid user adoption.

Probability: 25-30%

Base Case Scenario

Incremental adoption with mixed societal impacts and ongoing ethical challenges

Key Drivers: Balanced regulatory approach, steady technological progress, and selective implementation based on clear ROI.

Probability: 50-60%

Conservative Scenario

Technical and ethical barriers creating significant implementation challenges

Key Drivers: Restrictive regulations, technical limitations, implementation challenges, and risk-averse organizational cultures.

Probability: 15-20%

Scenario Comparison Matrix

Factor                   Optimistic      Base Case    Conservative
Implementation Timeline  Accelerated     Steady       Delayed
Market Adoption          Widespread      Selective    Limited
Technology Evolution     Rapid           Progressive  Incremental
Regulatory Environment   Supportive      Balanced     Restrictive
Business Impact          Transformative  Significant  Modest

Transformational Impact

Redefinition of knowledge work, automation of creative processes. This evolution will necessitate significant changes in organizational structures, talent development, and strategic planning processes.

The convergence of multiple technological trends—including artificial intelligence, quantum computing, and ubiquitous connectivity—will create both unprecedented security challenges and innovative defensive capabilities.

Implementation Challenges

Ethical concerns, computing resource limitations, talent shortages. Organizations will need to develop comprehensive change management strategies to successfully navigate these transitions.

Regulatory uncertainty, particularly around emerging technologies like AI in security applications, will require flexible security architectures that can adapt to evolving compliance requirements.

Key Innovations to Watch

Multimodal learning, resource-efficient AI, transparent decision systems. Organizations should monitor these developments closely to maintain competitive advantages and effective security postures.

Strategic investments in research partnerships, technology pilots, and talent development will position forward-thinking organizations to leverage these innovations early in their development cycle.

Technical Glossary

Key technical terms and definitions to help understand the technologies discussed in this article.

Understanding the following technical concepts is essential for grasping the full implications of the technologies discussed in this article. These definitions provide context for both technical and non-technical readers.


platform (intermediate)

Platforms provide standardized environments that reduce development complexity and enable ecosystem growth through shared functionality and integration capabilities.

API (beginner)

APIs serve as the connective tissue in modern software architectures, enabling different applications and services to communicate and share data according to defined protocols and data formats.
Example: Cloud service providers like AWS, Google Cloud, and Azure offer extensive APIs that allow organizations to programmatically provision and manage infrastructure and services.

Other terms referenced: generative AI, algorithm, encryption.