Answer by Alexander Heath for Best way to process a click stream to create features in Pandas

DataFrame.groupby(...).agg() is your friend here. You're right that the initial method iterates over the entire dataset for EACH call. So what we can do is define all of the heavy lifting up front, in a single aggregation, and let pandas handle the internal optimizations. With these libraries, you can VERY rarely hand-code something that beats the built-in routines.

What's nice about this method is that you only have to do the heavy calculation ONCE. All of the fine-tuned feature creation then runs on the much smaller aggregated dataset, which is far faster.
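As a minimal sketch of that aggregate-once pattern, assuming a hypothetical six-row click stream (the `clicks` frame and its values are made up for illustration; only the column names follow this answer):

```python
import pandas as pd

# Hypothetical toy click stream; the values are invented for illustration.
clicks = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "product": ["tablet", "tablet", "phone", "tablet", "phone", "phone"],
    "event": ["search", "view", "search", "search", "view", "view"],
    "price": [100.0, 120.0, 50.0, 90.0, 60.0, 40.0],
})

# Expensive step, done once: a single pass over the raw rows.
summary = (
    clicks.groupby(["product", "event"])["price"]
    .agg(["count", "sum", "max", "min"])
    .reset_index()
)

# Cheap steps: each feature is a filter plus a reduction on the small summary.
tablet = summary[summary["product"] == "tablet"]
features = {
    "number_of_search_events": int(
        summary.loc[summary["event"] == "search", "count"].sum()
    ),
    "number_of_tablets": int(tablet["count"].sum()),
    "tablet_price_sum": float(tablet["sum"].sum()),
    "tablet_price_max": float(tablet["max"].max()),
}
print(features)
```

Adding another feature later only touches `summary`, so the raw rows are never scanned again.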

This reduces runtime by about 65% (from roughly 76 s to 26 s in the timings under Results). And next time you want a new statistic, you can just use the result of featurize2 and not have to run the heavy computation again.

import timeit

import numpy as np

# make_data() and featurize() are the functions defined in the question.
df = make_data()
# include this to be able to calculate standard deviations correctly
df['price_sq'] = df['price'] ** 2.


def featurize2(df):
    # The one expensive pass: aggregate per (id, product, event).
    grouped = df.groupby(['id', 'product', 'event'])
    initial = grouped.agg({
        'price': ['count', 'max', 'min', 'mean', 'std', 'sum', 'size'],
        'date': ['max', 'min'],
        'price_sq': ['sum'],
    }).reset_index()
    return initial


def featurize3(initial):
    # Features 5-8
    features = initial.groupby('product').sum()['price']['count'].agg(
        ['max', 'min', 'mean', 'std']).rename({
            'max': 'max_product_events',
            'min': 'min_product_events',
            'mean': 'mean_product_events',
            'std': 'std_product_events',
        })

    searches = initial[initial['event'] == 'search']['price']
    # Feature 1: Number of search events
    features['number_of_search_events'] = searches['count'].sum()

    tablets = initial[initial['product'] == 'tablet']['price']
    tablets_sq = initial[initial['product'] == 'tablet']['price_sq']
    # Feature 2: Number of tablets
    features['number_of_tablets'] = tablets['count'].sum()
    # Feature 9 total price for tablet products
    features['tablet_price_sum'] = tablets['sum'].sum()
    # Feature 10 max price for tablet products
    features['tablet_price_max'] = tablets['max'].max()
    # Feature 11 min price for tablet products
    features['tablet_price_min'] = tablets['min'].min()
    # Feature 12 mean price for tablet products
    features['tablet_price_mean'] = (
        tablets['mean'] * tablets['count']).sum() / tablets['count'].sum()
    # Feature 13 std price for tablet products
    features['tablet_price_std'] = np.sqrt(
        tablets_sq['sum'].sum() / tablets['count'].sum()
        - features['tablet_price_mean'] ** 2.)

    # Feature 3: Total time
    features['total_time'] = (
        initial['date']['max'].max() - initial['date']['min'].min()
    ) / np.timedelta64(1, 'D')
    # Feature 4: Total number of events
    features['events'] = initial['price']['count'].sum()
    return features


def new_featurize(df):
    initial = featurize2(df)
    final = featurize3(initial)
    return final


original = featurize(df)
final = new_featurize(df)

print("featurize(df): {}".format(timeit.timeit(
    "featurize(df)", "from __main__ import featurize, df", number=3)))
print("featurize2(df): {}".format(timeit.timeit(
    "featurize2(df)", "from __main__ import featurize2, df", number=3)))
print("new_featurize(df): {}".format(timeit.timeit(
    "new_featurize(df)", "from __main__ import new_featurize, df", number=3)))

# Check that every new feature matches the original implementation.
for x in final.index:
    print("outputs for index {} are equal: {}".format(
        x, np.isclose(final[x], original[x])))

Results

featurize(df): 76.0546050072
featurize2(df): 26.5458261967
new_featurize(df): 26.4640090466
outputs for index max_product_events are equal: [ True]
outputs for index min_product_events are equal: [ True]
outputs for index mean_product_events are equal: [ True]
outputs for index std_product_events are equal: [ True]
outputs for index number_of_search_events are equal: [ True]
outputs for index number_of_tablets are equal: [ True]
outputs for index tablet_price_sum are equal: [ True]
outputs for index tablet_price_max are equal: [ True]
outputs for index tablet_price_min are equal: [ True]
outputs for index tablet_price_mean are equal: [ True]
outputs for index tablet_price_std are equal: [ True]
outputs for index total_time are equal: [ True]
outputs for index events are equal: [ True]
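The tablet mean and standard deviation in featurize3 are rebuilt from per-group counts, sums and sums of squares rather than from the raw rows, which is why the price_sq column is needed. Here is a rough sketch of that arithmetic in plain Python, using made-up price groups (the `groups` list is hypothetical). Note that this identity gives the population standard deviation (dividing by n), whereas pandas' `std` defaults to the sample version (`ddof=1`), so the two can differ slightly on small data.

```python
import math
import statistics

# Hypothetical price groups, standing in for rows that were grouped together.
groups = [[100.0, 120.0], [90.0], [80.0, 110.0, 130.0]]

# Per-group aggregates: the only values that survive the group-by.
counts = [len(g) for g in groups]
sums = [sum(g) for g in groups]
sums_sq = [sum(x * x for x in g) for g in groups]

# Combine the aggregates into overall statistics.
n = sum(counts)
mean = sum(sums) / n
# Population variance identity: Var(x) = E[x^2] - (E[x])^2
std = math.sqrt(sum(sums_sq) / n - mean ** 2)

# Cross-check against a direct computation over the raw values.
all_prices = [x for g in groups for x in g]
print(math.isclose(mean, statistics.fmean(all_prices)))
print(math.isclose(std, statistics.pstdev(all_prices)))
```

Per-group standard deviations cannot simply be averaged, which is why the sketch carries the sum of squares through the aggregation instead.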
