[Palantir Foundry] Titanic Survivor Prediction Project (1/2) — Data Preprocessing

Using Palantir Foundry's no-code Pipeline Builder, we upload the Kaggle Titanic dataset, handle missing values, and perform feature engineering to get the data ready for modeling.

Tags: Palantir, Foundry, Titanic, Data, No-code

In this post, we’ll use Kaggle’s classic beginner dataset—Titanic: Machine Learning from Disaster—and walk through how to create a project in Palantir Foundry and preprocess the data using Pipeline Builder (no-code) up to the point where it’s ready for modeling.

1. Create and configure a project

1.1 Create a new project

First, create a project to hold this work.

  • Click New project to start creating a project.
New project
  • On the template selection screen, choose Production project (recommended for collaboration and access control).
Production project template
  • Set the project name to something clear, e.g. Titanic, and create the project.
Project name

2. Upload the data

Once the project is created, bring in the data you want to analyze. From the Kaggle Titanic competition page, download the following three files:

  • train.csv
  • test.csv
  • gender_submission.csv

Kaggle competition page: Titanic - Machine Learning from Disaster

In your Foundry project:

  • Click + New
New button
  • Click Upload files and upload all three files.
Upload files
  • When prompted for the data format, select Upload as individual structured datasets (recommended). This converts CSV (structured) files into Foundry datasets that are immediately usable.
Upload as structured datasets

3. Preprocess data with Pipeline Builder

Now it’s time to transform the data. We’ll use Pipeline Builder, which lets you build logic without writing code.

  • Click New
New pipeline
  • Select Pipeline Builder.
Pipeline Builder
  • Keep the defaults (Batch pipeline, Standard mode) and click Create pipeline.
Create pipeline

Next, add your input dataset:

  • Click Add Foundry data
Add Foundry data
  • Select the uploaded train dataset.
Select train dataset
  • Click Add data
Add data

4. Handle missing values (Age)

If you inspect the data, you’ll notice missing (Null) values in the Age column. Instead of dropping those rows, we’ll fill missing Age values using the overall mean age.

Missing Age
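
If you wanted to run the same check in code, a minimal Python sketch might look like this. The rows are made-up toy values rather than the real Kaggle data, and `count_nulls` is a hypothetical helper name, not a Foundry function.

```python
# Toy rows standing in for the train dataset; None marks a missing (null) value.
rows = [
    {"PassengerId": 1, "Age": 22.0, "Embarked": "S"},
    {"PassengerId": 2, "Age": None, "Embarked": "C"},
    {"PassengerId": 3, "Age": 26.0, "Embarked": None},
    {"PassengerId": 4, "Age": None, "Embarked": "S"},
]


def count_nulls(rows):
    """Return {column: number of rows where that column is None}."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            # bool is added as 0 or 1, so each null bumps the counter by one
            counts[column] = counts.get(column, 0) + (value is None)
    return counts


null_counts = count_nulls(rows)
print(null_counts)
```

In this toy input, Age is missing twice and Embarked once, which mirrors the two columns we fill in this post.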

4.1 Compute the mean age

  • From the train node, choose Transform.
Transform
  • Click Aggregate.
Aggregate
  • Set Aggregation to Mean, Expression to Age, and Output to Mean_Age, then click Apply.
Apply aggregate
  • Confirm the output, then Close.
Aggregate result
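
For intuition, the Aggregate step is roughly equivalent to the plain-Python sketch below. It assumes the Mean aggregation skips null values, as SQL-style averages do; the ages are toy values, not the real dataset.

```python
from statistics import fmean

# Toy Age column; None marks a missing value.
ages = [22.0, None, 26.0, None, 35.0]

# Keep only the known ages, then average them.
known_ages = [age for age in ages if age is not None]
mean_age = fmean(known_ages)

print(round(mean_age, 2))
```

The result is a single number, which is why the aggregate node outputs a one-row table.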

4.2 Join the mean back to the original rows

Now we need to attach the computed mean age (≈ 29.7) to each row in the original dataset.

  • Select the train node, then click Join.
Join
  • Click Transform path → Start (Left: train, Right: Transform path).
Transform path start
  • Set Join type to Cross join, then click Apply and Close. Because the aggregate output has a single row, the cross join appends the same Mean_Age value to every row of train.
Cross join

You should see Mean_Age added at the far right of the table:

Mean_Age column
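
Here is a small sketch of what a cross join against a one-row table does, written in plain Python. The rows and the 29.7 value are illustrative stand-ins, and this is not how Foundry implements the join internally.

```python
from itertools import product

# Left side: toy rows standing in for train.
train = [
    {"PassengerId": 1, "Age": 22.0},
    {"PassengerId": 2, "Age": None},
    {"PassengerId": 3, "Age": 26.0},
]

# Right side: the one-row aggregate output.
aggregate = [{"Mean_Age": 29.7}]

# A cross join pairs every left row with every right row.
# With a single right row, the row count stays the same.
joined = [{**left, **right} for left, right in product(train, aggregate)]

for row in joined:
    print(row)
```

If the right side had more than one row, the output would multiply accordingly, so this trick is only safe for single-row aggregates.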

4.3 Fill Null Age with Mean_Age

We’ll create logic that says: if Age is null, use Mean_Age; otherwise, keep the original Age.

  • From the Join node, click Transform.
Transform after join
  • Choose Case.
Case transform
  • Condition: Is null, Expression: Age
Case apply
  • Set the value next to ‘is equal to’ to true, Then to Mean_Age, and Else to Age, then click Apply.
Case result

Apply this so the Age column is overwritten with the filled value.
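
The Case logic above can be sketched in plain Python as follows. `fill_age` is a hypothetical helper name and the rows are toy values.

```python
def fill_age(row):
    """Return a copy of row where a null Age is replaced by Mean_Age."""
    filled = dict(row)
    if filled["Age"] is None:
        filled["Age"] = filled["Mean_Age"]
    return filled


# Toy rows as they might look after the cross join.
joined = [
    {"PassengerId": 1, "Age": 22.0, "Mean_Age": 29.7},
    {"PassengerId": 2, "Age": None, "Mean_Age": 29.7},
]

filled_rows = [fill_age(row) for row in joined]
print([row["Age"] for row in filled_rows])  # [22.0, 29.7]
```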

4.4 Drop the temporary Mean_Age column

After filling, Mean_Age is no longer needed. To keep the dataset clean:

  • Use Apply Multiple expressions to exclude Mean_Age and keep the remaining columns.
Apply multiple expressions
  • Click Add item, select everything except Mean_Age, uncheck Keep remaining columns, then click Apply.
Exclude Mean_Age
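
In code, keeping every column except Mean_Age could look like the sketch below. `drop_columns` is a hypothetical helper and the rows are toy values.

```python
def drop_columns(rows, to_drop):
    """Return new rows that keep every column not listed in to_drop."""
    return [
        {column: value for column, value in row.items() if column not in to_drop}
        for row in rows
    ]


filled_rows = [
    {"PassengerId": 1, "Age": 22.0, "Mean_Age": 29.7},
    {"PassengerId": 2, "Age": 29.7, "Mean_Age": 29.7},
]

cleaned = drop_columns(filled_rows, {"Mean_Age"})
print(cleaned)
```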

5. Handle missing values (Embarked)

When you check the distribution of Embarked, you’ll typically find that S (Southampton) is the most frequent value. We’ll fill missing Embarked values with the mode: S.

Embarked distribution
  • Choose Case.
Embarked case
  • Condition: Embarked Is null
  • Then: "S" (a literal string)
  • Else: Embarked
  • Click Apply.
Embarked apply
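
As a rough code equivalent, you could find the mode and fill with it like this. The values are a toy column, not the real distribution; in the pipeline above we simply hard-code "S" after looking at the data.

```python
from collections import Counter

# Toy Embarked column; None marks a missing value.
embarked = ["S", "C", None, "S", "Q", "S", None]

# Count only the known values and take the most frequent one.
counts = Counter(value for value in embarked if value is not None)
mode_value, mode_count = counts.most_common(1)[0]

# Replace nulls with the mode, keep everything else.
embarked_filled = [mode_value if value is None else value for value in embarked]

print(mode_value, embarked_filled)
```

Computing the mode instead of hard-coding it makes the step reusable if the input data changes.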

6. Feature engineering

To improve downstream model performance, let’s create a few additional columns from existing data.

6.1 Family size

  • SibSp: number of siblings/spouses aboard the Titanic
  • Parch: number of parents/children aboard the Titanic

Adding these two columns together estimates how many family members were traveling with a passenger. We’ll also add 1 to count the passenger themselves.

  • Click Add numbers.
Add numbers
  • Expressions: SibSp, Parch, 1
  • Output: FamilySize
  • Click Apply.
FamilySize result
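
The same calculation in plain Python is a one-liner per row. The passengers below are toy values and `family_size` is a hypothetical helper name.

```python
def family_size(row):
    """SibSp + Parch + 1, where the 1 counts the passenger themselves."""
    return row["SibSp"] + row["Parch"] + 1


passengers = [
    {"SibSp": 1, "Parch": 0},
    {"SibSp": 0, "Parch": 0},
    {"SibSp": 3, "Parch": 2},
]

for passenger in passengers:
    passenger["FamilySize"] = family_size(passenger)

print([p["FamilySize"] for p in passengers])  # [2, 1, 6]
```

A passenger traveling alone gets a FamilySize of 1, never 0.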

6.2 Extract title from name

We can extract the honorific (e.g., Mr, Mrs, Miss) from the Name field using a regex.

  • Click Regex extract.
Regex extract
  • Expression: Name
  • Pattern: ([A-Za-z]+)\.
  • Group: 1
  • Output: Title
  • Click Apply.
Regex Apply
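
To see what this pattern captures, here is a sketch using Python's `re` module. It assumes the Regex extract transform returns the first match and yields null when nothing matches; that is an assumption about Pipeline Builder, and `extract_title` is a hypothetical helper name.

```python
import re

# Same pattern as in the pipeline: a run of letters followed by a literal dot.
TITLE_PATTERN = re.compile(r"([A-Za-z]+)\.")


def extract_title(name):
    """Return capture group 1 of the first match, or None if there is none."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else None


names = [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "No title here",
]

titles = [extract_title(name) for name in names]
print(titles)  # ['Mr', 'Mrs', 'Miss', None]
```

The capture group excludes the trailing dot, so the output is "Mr" rather than "Mr.".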

6.3 Encode categorical values (Sex)

Machine learning models typically work better with numeric features than raw strings. Let’s convert Sex (male, female) into a numeric column.

  • Choose Case.
Sex case
  • If Sex == "male" → 1
  • If Sex == "female" → 0
  • Else → Null
  • Set Output to Sex_Encoded, then click Apply.
Sex encoded
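
The mapping above can be sketched in Python with a small lookup table. `SEX_CODES` and `encode_sex` are hypothetical names used only for illustration.

```python
# Lookup table mirroring the Case branches: male is 1, female is 0.
SEX_CODES = {"male": 1, "female": 0}


def encode_sex(value):
    """Return the numeric code, or None for anything unexpected (the Else branch)."""
    return SEX_CODES.get(value)


sexes = ["male", "female", "unknown", None]
sex_encoded = [encode_sex(value) for value in sexes]
print(sex_encoded)  # [1, 0, None, None]
```

Returning null for unexpected values, rather than guessing, makes bad input easy to spot later.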

7. Write out the cleaned dataset

Once preprocessing is complete, save the final dataset for modeling.

  • Click Add output.
Add output
  • Click New dataset.
New dataset
  • Set the dataset name to titanic_cleaned_train.
Dataset name
  • Click the green upward arrow (save all changes).
Save changes
  • Click Deploy → Deploy pipeline.
Deploy menu
Deploy pipeline

After a short wait, the deployment should complete with the message "Successfully deployed pipeline", and you’ll have a clean, processed dataset ready for training.

Deploy success

Wrap-up

Today we used Pipeline Builder to preprocess the Titanic dataset without writing code: filling missing values, creating derived features, and encoding a categorical column. In the next post, we’ll take the resulting titanic_cleaned_train dataset, train a machine learning model, and visualize survival predictions in Workshop.