DataFrame.agg() is your friend here. You're right that the initial method iterates over the entire dataset for EACH call. What we can do instead is define all of the heavy lifting up front, in a single groupby/agg, and let pandas handle the internal optimizations. With libraries like this, there is VERY rarely a case where hand-written code beats the built-in routines.
What's nice about this method is that you only have to do the heavy calculation ONCE. All of the fine-tuned feature creation then runs on the much smaller aggregated dataset, which is far faster.
This reduces runtime by about 65% (from roughly 76 s to 26 s for three runs in the timings below), which is quite large. And the next time you want a new statistic, you can just read it from the result of featurize2 instead of running the computation again.
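To make the "aggregate once, reuse many times" idea concrete, here is a minimal sketch on a made-up toy frame. The data and the names `toy` and `summary` are assumptions for illustration only; just the column names mirror the ones used in this answer.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the answer, values are made up.
toy = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "product": ["tablet", "tablet", "phone", "tablet", "phone", "phone"],
    "event": ["search", "buy", "search", "search", "search", "buy"],
    "price": [100.0, 120.0, 300.0, 90.0, 310.0, 305.0],
})

# One grouped pass computes every per-group statistic we might need later.
summary = (
    toy.groupby(["product", "event"])["price"]
    .agg(["count", "sum", "min", "max"])
    .reset_index()
)

# New statistics are now cheap lookups on the small summary table.
n_search = summary.loc[summary["event"] == "search", "count"].sum()
tablet_max = summary.loc[summary["product"] == "tablet", "max"].max()
phone_total = summary.loc[summary["product"] == "phone", "sum"].sum()

print(n_search, tablet_max, phone_total)
```

The summary table has one row per (product, event) pair, so each follow-up statistic touches a handful of rows rather than the full dataset.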
    import timeit

    import numpy as np

    # make_data() and featurize() are the functions defined in the question.
    df = make_data()
    # include this to be able to calculate standard deviations correctly
    df['price_sq'] = df['price'] ** 2.


    def featurize2(df):
        # Single pass over the data: every per-group statistic is computed here.
        grouped = df.groupby(['id', 'product', 'event'])
        initial = grouped.agg({
            'price': ['count', 'max', 'min', 'mean', 'std', 'sum', 'size'],
            'date': ['max', 'min'],
            'price_sq': ['sum'],
        }).reset_index()
        return initial


    def featurize3(initial):
        # Features 5-8: statistics of the number of events per product.
        # The count column is selected first so that only it gets summed.
        events_per_product = initial['price']['count'].groupby(initial['product']).sum()
        features = events_per_product.agg(['max', 'min', 'mean', 'std']).rename({
            'max': 'max_product_events',
            'min': 'min_product_events',
            'mean': 'mean_product_events',
            'std': 'std_product_events',
        })

        searches = initial[initial['event'] == 'search']['price']
        # Feature 1: Number of search events
        features['number_of_search_events'] = searches['count'].sum()

        tablets = initial[initial['product'] == 'tablet']['price']
        tablets_sq = initial[initial['product'] == 'tablet']['price_sq']
        # Feature 2: Number of tablets
        features['number_of_tablets'] = tablets['count'].sum()
        # Feature 9: total price for tablet products
        features['tablet_price_sum'] = tablets['sum'].sum()
        # Feature 10: max price for tablet products
        features['tablet_price_max'] = tablets['max'].max()
        # Feature 11: min price for tablet products
        features['tablet_price_min'] = tablets['min'].min()
        # Feature 12: mean price for tablet products (count-weighted mean of group means)
        features['tablet_price_mean'] = (
            tablets['mean'] * tablets['count']).sum() / tablets['count'].sum()
        # Feature 13: std price for tablet products, as sqrt(E[x^2] - mean^2)
        features['tablet_price_std'] = np.sqrt(
            tablets_sq['sum'].sum() / tablets['count'].sum()
            - features['tablet_price_mean'] ** 2.)

        # Feature 3: Total time, in days
        features['total_time'] = (
            initial['date']['max'].max() - initial['date']['min'].min()
        ) / np.timedelta64(1, 'D')
        # Feature 4: Total number of events
        features['events'] = initial['price']['count'].sum()
        return features


    def new_featurize(df):
        initial = featurize2(df)
        final = featurize3(initial)
        return final


    original = featurize(df)
    final = new_featurize(df)

    # Time three runs of each version.
    print("featurize(df): {}".format(timeit.timeit(
        "featurize(df)", "from __main__ import featurize, df", number=3)))
    print("featurize2(df): {}".format(timeit.timeit(
        "featurize2(df)", "from __main__ import featurize2, df", number=3)))
    print("new_featurize(df): {}".format(timeit.timeit(
        "new_featurize(df)", "from __main__ import new_featurize, df", number=3)))

    # Check that every new feature matches the original implementation.
    for x in final.index:
        print("outputs for index {} are equal: {}".format(
            x, np.isclose(final[x], original[x])))
Results
    featurize(df): 76.0546050072
    featurize2(df): 26.5458261967
    new_featurize(df): 26.4640090466
    outputs for index max_product_events are equal: [ True]
    outputs for index min_product_events are equal: [ True]
    outputs for index mean_product_events are equal: [ True]
    outputs for index std_product_events are equal: [ True]
    outputs for index number_of_search_events are equal: [ True]
    outputs for index number_of_tablets are equal: [ True]
    outputs for index tablet_price_sum are equal: [ True]
    outputs for index tablet_price_max are equal: [ True]
    outputs for index tablet_price_min are equal: [ True]
    outputs for index tablet_price_mean are equal: [ True]
    outputs for index tablet_price_std are equal: [ True]
    outputs for index total_time are equal: [ True]
    outputs for index events are equal: [ True]
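The price_sq column is what allows Features 12 and 13 to be rebuilt from per-group results: group means combine as a count-weighted average, and the variance follows from E[x^2] - (E[x])^2. The sketch below checks both identities on made-up numbers. The arrays and the helper name `combine` are assumptions for illustration, not part of the original code. Note that this formula gives the population standard deviation (ddof=0, the default of `np.std`), not the sample standard deviation that pandas' `std` returns by default.

```python
import numpy as np


def combine(counts, sums, sq_sums):
    """Merge per-group count, sum and sum-of-squares into overall mean and std."""
    n = np.sum(counts)
    mean = np.sum(sums) / n
    # Population variance: E[x^2] - (E[x])^2
    var = np.sum(sq_sums) / n - mean ** 2
    return mean, np.sqrt(var)


# Hypothetical prices split into two groups.
groups = [np.array([100.0, 120.0, 90.0]), np.array([300.0, 310.0])]

# Per-group partial statistics, as a grouped aggregation would produce them.
counts = [len(g) for g in groups]
sums = [g.sum() for g in groups]
sq_sums = [(g ** 2).sum() for g in groups]

mean, std = combine(counts, sums, sq_sums)

# Compare with a direct computation over all values at once.
all_values = np.concatenate(groups)
print(np.isclose(mean, all_values.mean()), np.isclose(std, all_values.std()))
```

One caveat with this shortcut: subtracting two large, nearly equal numbers can lose precision when the mean is large relative to the spread, so it is worth comparing against a direct computation on a sample, as the equality check above does.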