Using Python in KNIME
When I started out, I found using Python in KNIME quite confusing. Fortunately I was able to figure out how the integration works. Here is how I did it.
First off, I am going to assume that you have a working Python distribution and KNIME Analytics Platform installed on your machine. If you can check off both points, let’s continue. Currently (last update: November 2015) there are two public KNIME extensions that help you integrate Python into KNIME. These are:
- KNIME Python Scripting Extension (KNIME Community Contribution – Other)
- KNIME Python Integration (KNIME Labs Extension)
You can find both of them easily by typing in “python” after clicking on File… -> Install KNIME Extensions…. If you’d like a more detailed description, read my post on how to install KNIME extensions. Both of them support Python 2.x (Python 2.7 recommended), as stated on the extension’s GitHub wiki. After installing the extensions you should find them in your Node Repository.
KNIME Python Scripting Extension
Personally I prefer this extension over the other one for simple transformations because its interface is simpler. Its only flaw is that it doesn’t use pandas DataFrames out of the box, as most users would probably want. Instead the incoming data is handled as an ordered dictionary (OrderedDict). Here is how you can work around that. Just type the following into the Python Snippet dialog:
    import pandas as pd

    # Load your workflow data into a dataframe
    df = pd.DataFrame(data=kIn)

    # Do some manipulations here

    # Send your dataframe down the workflow
    pyOut = df
And there you go. Now you can do all sorts of things you would normally do in the Python IDE of your choice, including using libraries. There are two mandatory variables here: kIn (read: K-in) holds the data coming from the previous node, and pyOut tells KNIME which data you’d like to pass on to the next node.
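If you want to try out a snippet in a regular Python session before pasting it into the dialog, you can fake the two variables. This is only a minimal sketch: it assumes kIn behaves like an ordered dictionary that maps column names to lists of values, and the sample columns and numbers are made up for illustration.

```python
from collections import OrderedDict

import pandas as pd

# Hypothetical stand-in for the kIn variable that KNIME would provide.
# Assumed shape: column name -> list of cell values.
kIn = OrderedDict([
    ("player", ["A", "B", "C"]),
    ("height", [180, 175, 190]),
    ("weight", [75, 70, 85]),
])

# Load the workflow data into a DataFrame
df = pd.DataFrame(data=kIn)

# Example manipulation: add a derived column and sort by it
df["bmi"] = df["weight"] / (df["height"] / 100.0) ** 2
df = df.sort_values("bmi").reset_index(drop=True)

# Hand the result back to the workflow
pyOut = df
```

Inside KNIME you would drop the fake kIn block and keep the rest.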
If you want to create a plot, you can use the Python Plot node. It has the same flaw as the snippet node, but on the bright side it is just as easy to fix. Here is an example I did for visualizing soccer player data (a scatter matrix using Seaborn):
    import seaborn as sns
    import pandas as pd

    sns.set()
    data = kIn
    df = pd.DataFrame.from_dict(data)
    sns.pairplot(df, vars=["height", "weight", "age", "rating"], hue="position",
                 diag_kind="kde", diag_kws=dict(shade=True))
KNIME Python Integration
Earlier I said that I prefer the other extension over this one for simple manipulations. Once things get trickier, though, this one is clearly more useful to me. Its most prominent advantages are that
- a) you don’t need to convert the data, this node just gives you a DataFrame right away
- b) you have a console to work with interactively like you would do in a full-fledged Python IDE and
- c) there are special nodes for training a model and using models for predictions.
In the Python Script node, the equivalents of kIn and pyOut from the Python Snippet are input_table and output_table. Here’s the simplest thing you can do without breaking the Python Script node:
    # Copy input to output
    output_table = input_table.copy()
The Python Script (2:1) node is functionally the same. Its small difference is that it takes two tables as input instead of just one.
    # Do pandas inner join
    output_table = input_table_1.join(input_table_2, how='inner',
                                      lsuffix=' (left)', rsuffix=' (right)')
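To see what that join actually does, here is a small sketch you can run outside KNIME. The two tables are hypothetical stand-ins, and I am assuming that the KNIME row IDs end up as the DataFrame index.

```python
import pandas as pd

# Hypothetical stand-ins for the two tables KNIME would pass in
input_table_1 = pd.DataFrame(
    {"name": ["Ann", "Ben", "Cem"], "rating": [7, 8, 6]},
    index=["Row0", "Row1", "Row2"],
)
input_table_2 = pd.DataFrame(
    {"name": ["Ben", "Cem", "Dan"], "age": [24, 31, 28]},
    index=["Row1", "Row2", "Row3"],
)

# join() matches rows on the index, not on a column.
# Only index labels present in both tables survive an inner join,
# and column names that exist in both tables get the suffixes.
output_table = input_table_1.join(
    input_table_2, how="inner", lsuffix=" (left)", rsuffix=" (right)"
)
```

Only Row1 and Row2 remain, and the shared name column shows up twice, once per suffix. If you would rather match rows on a column value, pandas’ merge function with its on argument is the usual alternative.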
The Python Source node is functionally the same as well but it takes no input at all.
    from pandas import DataFrame

    # Create empty table
    output_table = DataFrame()
There is also an equivalent to the Python Plot node in the form of the Python View node. I found that one a little unwieldy, though, since it requires you to save a Figure object into a buffer. For reference, here is the same plot I did above, but in the Python View node. You can clearly see that it is much more verbose:
    from io import BytesIO
    import matplotlib
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Create a buffer to write into
    buffer = BytesIO()

    # Copy data and create plot
    df = input_table
    sns.set()
    sns_plot = sns.pairplot(df, hue="positionText", diag_kind="kde",
                            diag_kws=dict(shade=True),
                            vars=["height", "weight", "age", "rating"])

    # Get the Figure object from the plot
    # and save it into the buffer
    sns_plot.fig.savefig(buffer, format='png')

    # Set the buffer's content as output image
    output_image = buffer.getvalue()
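If you use the Python View node more than once, the buffer handling can be tucked away in a small helper. This is just a sketch: figure_to_png_bytes is a name I made up, and the example uses a plain matplotlib figure with made-up numbers so it runs without KNIME or Seaborn.

```python
from io import BytesIO

import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt


def figure_to_png_bytes(fig):
    """Render a matplotlib Figure to PNG and return the raw bytes."""
    buffer = BytesIO()
    fig.savefig(buffer, format="png")
    return buffer.getvalue()


# Example with a plain matplotlib figure and made-up data
fig, ax = plt.subplots()
ax.scatter([180, 175, 190], [75, 70, 85])
ax.set_xlabel("height")
ax.set_ylabel("weight")

# In the Python View node this variable carries the image
output_image = figure_to_png_bytes(fig)
plt.close(fig)
```

With the Seaborn plot from above, the call would be figure_to_png_bytes(sns_plot.fig).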
There are more nodes in the Python Integration extension, namely the Python Object Reader and Python Object Writer as well as the Python Learner and Python Predictor. The latter two in particular sound relevant to me. I will take a look at them in a follow-up post.