Using Python in KNIME – Dominik Schauer

Using Python in KNIME

At first I found using Python in KNIME quite confusing, but fortunately I was able to figure out how the integration works. Here is how I did it.

The Extensions

First off, I am going to assume that you have a working distribution of Python and the KNIME Analytics Platform installed on your machine. If you can check off these two points, let’s continue. Currently (last update: November 2015) there are two public KNIME Extensions to help you integrate Python in KNIME. These are:

  • KNIME Python Scripting Extension (KNIME Community Contribution – Other)
  • KNIME Python Integration (KNIME Labs Extension)

You can find both of them easily by typing “python” after clicking on File… -> Install KNIME Extensions…. If you’d like a more detailed description, read my post on how to install KNIME extensions. Both of them support Python 2.x (Python 2.7 recommended), as stated on the Extension’s GitHub Wiki. After installing the Extensions you should find them in your Node Repository.

[Image: Python plugins for KNIME]

KNIME Python Scripting Extension

Personally I prefer this extension over the other one for simple transformations because of its simpler interface. Its only flaw is that it doesn’t use Pandas DataFrames out of the box, which is what most users probably want. Instead the incoming data is handled as an OrderedDictionary. Here is how you can work around that; just type the following into the Python Snippet dialog:

import pandas as pd

# Load your workflow data into a dataframe
df = pd.DataFrame(data=kIn)

# Do some manipulations here

# Send your dataframe down the workflow
pyOut = df

And there you go. Now you can do all sorts of things you would normally do in the Python IDE of your choice, including using libraries. There are two mandatory variables here: kIn (read: K-in) holds the data coming from the previous node, and pyOut tells KNIME which data you’d like to pass on to the next node.
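
If you want to try the conversion outside of KNIME first, here is a minimal sketch. Note that kIn only exists inside the Python Snippet node, so the stand-in below is hypothetical: I am assuming it behaves like an ordered dictionary that maps column names to lists of values, and the player data is made up for illustration.

```python
from collections import OrderedDict

import pandas as pd

# Hypothetical stand-in for kIn (assumption: column name -> list of values).
# Inside the Python Snippet node KNIME provides this variable for you.
kIn = OrderedDict([
    ("player", ["A", "B", "C"]),
    ("height", [180, 175, 190]),
    ("weight", [75, 70, 85]),
])

# Same conversion as in the snippet above
df = pd.DataFrame(data=kIn)

# Example manipulation: derive a new column from two existing ones
df["bmi"] = df["weight"] / (df["height"] / 100) ** 2

# Inside KNIME, this is what gets handed to the next node
pyOut = df
```

The column order of the dictionary carries over to the DataFrame, and new columns such as bmi are appended at the end.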

In case you want to create a plot, you can also use the Python Plot node. It has the same flaw, but on the bright side it is just as easy to fix. Here is an example I did for visualizing soccer player data (a Scatter Matrix using Seaborn):

import seaborn as sns
import pandas as pd

# Load the workflow data into a dataframe
data = kIn
df = pd.DataFrame.from_dict(data)

sns.pairplot(df, vars=["height", "weight", "age", "rating"], hue="position",  diag_kind="kde", diag_kws=dict(shade=True))

[Image: Python scatter plot]

KNIME Python Integration

Earlier I said that I prefer the other extension over this one for simple manipulations. Once it gets trickier though this one is clearly more useful to me. Its most prominent advantages are that

  • a) you don’t need to convert the data, this node just gives you a DataFrame right away
  • b) you have a console to work with interactively like you would do in a full-fledged Python IDE and
  • c) there are special nodes for training a model and using models for predictions.

In the Python Script node, the equivalents of the Python Snippet’s kIn and pyOut are input_table and output_table. Here is the simplest thing you can do without breaking the Python Script node.

# Copy input to output
output_table = input_table.copy()

The Python Script (2:1) node is functionally the same. The only difference is that it takes two tables as input instead of just one.

# Do pandas inner join
output_table = input_table_1.join(input_table_2, how='inner', lsuffix=' (left)', rsuffix=' (right)')
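
To see what that join produces, here is a small sketch you can run outside of KNIME. The two tables are hypothetical stand-ins for input_table_1 and input_table_2, with made-up row labels and values. The one thing to keep in mind is that pandas’ DataFrame.join matches rows on the index, not on a column.

```python
import pandas as pd

# Hypothetical stand-ins for the two input tables of the (2:1) node
input_table_1 = pd.DataFrame({"rating": [80, 75, 90]},
                             index=["Row0", "Row1", "Row2"])
input_table_2 = pd.DataFrame({"rating": [82, 91]},
                             index=["Row0", "Row2"])

# Inner join on the index: only labels present in both tables survive.
# The suffixes disambiguate the two columns that are both named "rating".
output_table = input_table_1.join(input_table_2, how="inner",
                                  lsuffix=" (left)", rsuffix=" (right)")

print(output_table)
```

Only Row0 and Row2 end up in the result, and the clashing column names become "rating (left)" and "rating (right)".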

The Python Source node is functionally the same as well but it takes no input at all.

from pandas import DataFrame
# Create empty table
output_table = DataFrame()
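
An empty table is not very exciting, so here is a slightly more useful sketch of what a source script could look like. The column names and values are made up for illustration; the only thing taken from above is that the result has to end up in output_table.

```python
import pandas as pd

# Build a small table from scratch (hypothetical example data)
ids = list(range(1, 6))
output_table = pd.DataFrame({
    "id": ids,
    "square": [i * i for i in ids],
})
```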

There is also an equivalent to the Python Plot node in the form of the Python View node. I found that one a little unwieldy though, since it requires you to save a Figure object into a buffer. For reference, here is the same plot I did above, but in the Python View node. You can clearly see that it is much more verbose:

from io import BytesIO
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

# Create a buffer to write into
buffer = BytesIO()

# Copy data and create plot
df = input_table
sns_plot = sns.pairplot(df, hue="positionText", diag_kind="kde", diag_kws=dict(shade=True), vars = ["height", "weight", "age", "rating"])

# Get the Figure object from the plot
# and save it into the buffer
sns_plot.fig.savefig(buffer, format='png')

# Set the buffer's content as output image
output_image = buffer.getvalue()
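
The buffer pattern itself is not specific to Seaborn. Here is a stripped-down sketch with plain matplotlib that you can run outside of KNIME to check what ends up in output_image. The data points are made up, and selecting the Agg backend is my own assumption so that the script also runs on a machine without a display.

```python
from io import BytesIO

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no display needed
import matplotlib.pyplot as plt

# Draw a simple scatter plot from made-up data
fig, ax = plt.subplots()
ax.scatter([180, 175, 190], [75, 70, 85])
ax.set_xlabel("height")
ax.set_ylabel("weight")

# Write the figure into an in-memory buffer as PNG
buffer = BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)

# The raw PNG bytes are what gets assigned to output_image
output_image = buffer.getvalue()
```

The result is a plain bytes object that starts with the PNG file signature, so writing it to a file with a .png extension is a quick way to inspect the image.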

There are more nodes in the Python Integration Extension, namely the Python Object Reader and Python Object Writer as well as the Python Learner and Python Predictor. The latter two sound especially relevant to me. I will take a look at them in a follow-up post.
