K-Nearest Neighbors (KNN) Implementation Tutorial with PXL
This tutorial demonstrates how to classify data using the K-Nearest Neighbors (KNN) algorithm with the PXL framework. We cover implementations of KNN both on the host CPU and in parallelized environments using PXL’s Map and Parallel execution modes.
Prerequisites:
Before starting, it is recommended to review the following tutorials:
- SDK Workflow: Provides an overview of the SDK’s architecture and workflow.
- Get Started: Guides you through setting up your development environment.
- Hello Sort: Introduces basic concepts using a simple sorting example.
- Emulator on Docker: Guides you through setting up the emulator on Docker.
Overview
The KNN algorithm works by:
- Finding the K nearest neighbors to a target point.
- Taking a majority vote of their labels to classify the target.
Our implementation includes three main approaches:
- Host Implementation: Uses the CPU for computation (a minimal reference sketch follows this list).
- Map Implementation: Leverages PXL’s Map execution mode for parallel processing.
- Parallel Implementation: Distributes computation tasks across multiple parallel resources.
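The host implementation itself is not listed in this tutorial, so below is a minimal sketch of what a CPU baseline might look like. The function name `knn_host` is illustrative, not part of the SDK, and it uses the squared Euclidean distance, matching the kernel code shown later (omitting the square root does not change the neighbor ordering).

```cpp
#include <algorithm>
#include <map>
#include <utility>
#include <vector>

// Hypothetical CPU baseline (not the tutorial's helper): returns the majority
// label among the k nearest points to `target`, using squared Euclidean
// distance, which preserves the nearest-neighbor ordering.
int knn_host(const std::vector<std::vector<float>>& data, const std::vector<int>& label,
             const std::vector<float>& target, int k) {
  std::vector<std::pair<float, int>> distances(data.size());
  for (size_t i = 0; i < data.size(); i++) {
    float d = 0;
    for (size_t j = 0; j < target.size(); j++) {
      float diff = data[i][j] - target[j];
      d += diff * diff;
    }
    distances[i] = {d, label[i]};
  }
  std::sort(distances.begin(), distances.end());

  // Majority vote over the k closest labels.
  std::map<int, int> votes;
  int best = -1, bestCount = 0;
  for (int i = 0; i < k && i < static_cast<int>(distances.size()); i++) {
    int count = ++votes[distances[i].second];
    if (count > bestCount) { bestCount = count; best = distances[i].second; }
  }
  return best;
}
```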
Important Notes
- Ensure that QEMU is started with the `--no-kvm` option when running the application in a QEMU environment to avoid issues.

  ```sh
  cd emulator
  ./run.sh --no-kvm
  ```
Detailed Implementation
1. Build Your Compute Kernel
- Write Distance Calculation Functions

  The kernel code, implemented in `mu_knn.cpp`, calculates distances using two metrics:

  - Euclidean Distance
  - Dot Product

  ```cpp
  #include <algorithm>
  #include <utility>
  #include <vector>

  #include "mu/mu.hpp"
  #include "mu/vector.hpp"  // for vector engine
  #include "mu_knn_config.hpp"

  // Calculate the squared Euclidean distance. The square root is omitted
  // because it does not change the nearest-neighbor ordering.
  void calculate_euclidean_distance(float* point0, float* point1, int pointSize, float* result) {
    float distance = 0;
    for (int i = 0; i < pointSize; i++) {
      distance += (point0[i] - point1[i]) * (point0[i] - point1[i]);
    }
    *result = distance;
  }

  // Calculate the dot product
  void calculate_dot_product(float* point0, float* point1, int pointSize, float* result) {
    float distance = 0;
    for (int i = 0; i < pointSize; i++) {
      distance += point0[i] * point1[i];
    }
    *result = distance;
  }
  ```
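To make the convention concrete, here is a small usage example (the point values are made up). Note that `calculate_euclidean_distance` returns the squared distance, which is sufficient for ranking neighbors.

```cpp
#include <cstdio>

// Assumes the two distance functions above are in scope.
int main() {
  float a[3] = {1.0f, 2.0f, 3.0f};
  float b[3] = {4.0f, 6.0f, 3.0f};
  float d = 0;
  calculate_euclidean_distance(a, b, 3, &d);
  // (1-4)^2 + (2-6)^2 + (3-3)^2 = 9 + 16 + 0 = 25 (squared distance)
  std::printf("distance = %f\n", d);
  return 0;
}
```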
- Implement KNN Functions

  Note: Using user-defined types (e.g., `struct` or `class`) as a parameter type for `map` is currently not supported. However, user-defined types are supported in the `parallel` implementation.

  - KNN with Map

    ```cpp
    void knn_with_map(float* data, int* label, float* target, int k, int dim,
                      int numPoints, float* result, int* labelResult, bool useDotProduct) {
      std::vector<std::pair<float, int>> distances(numPoints);
      for (int i = 0; i < numPoints; i++) {
        float distance = 0;
        if (useDotProduct) {
          calculate_dot_product(data + i * dim, target, dim, &distance);
        } else {
          calculate_euclidean_distance(data + i * dim, target, dim, &distance);
        }
        // Store each point's flat offset alongside its distance.
        distances[i] = std::make_pair(distance, i * dim);
      }
      std::sort(distances.begin(), distances.end());
      for (int i = 0; i < k; i++) {
        for (int j = 0; j < dim; j++) {
          result[i * dim + j] = data[distances[i].second + j];
        }
        labelResult[i] = label[distances[i].second / dim];
      }
    }
    ```

  - KNN with Parallel

    The chunking arithmetic used here is illustrated with a worked example after this section.

    ```cpp
    void knn_with_parallel(float* data, int* label, float* target, const knnConfig& config,
                           int numTasks, float* result, int* labelResult) {
      auto taskIdx = mu::getTaskIdx();
      int chunkSize = config.dataCount / numTasks;
      int start = taskIdx * chunkSize;
      // The last task also absorbs any remainder left by the integer division.
      int end = (taskIdx == numTasks - 1) ? config.dataCount : start + chunkSize;

      std::vector<std::pair<float, int>> distances(end - start);
      for (int i = start; i < end; i++) {
        float distance = 0;
        if (config.useDotProduct) {
          calculate_dot_product(data + i * config.dimension, target, config.dimension, &distance);
        } else {
          calculate_euclidean_distance(data + i * config.dimension, target, config.dimension, &distance);
        }
        distances[i - start] = std::make_pair(distance, i * config.dimension);
      }
      std::sort(distances.begin(), distances.end());

      for (int i = 0; i < config.k; i++) {
        for (int j = 0; j < config.dimension; j++) {
          int dataIndex = distances[i].second + j;
          result[taskIdx * config.k * config.dimension + i * config.dimension + j] = data[dataIndex];
        }
        labelResult[taskIdx * config.k + i] = label[distances[i].second / config.dimension];
      }
    }

    MU_KERNEL_ADD(knn_with_map)
    MU_KERNEL_ADD(knn_with_parallel)
    ```
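The index arithmetic in `knn_with_parallel` deserves a closer look: integer division gives each task `chunkSize` points, and the last task also handles the remainder. A small host-side illustration with assumed values (`dataCount = 10`, `numTasks = 4`):

```cpp
#include <cstdio>

// Illustrates the chunking scheme used by knn_with_parallel.
int main() {
  const int dataCount = 10, numTasks = 4;
  const int chunkSize = dataCount / numTasks;  // 2
  for (int taskIdx = 0; taskIdx < numTasks; taskIdx++) {
    int start = taskIdx * chunkSize;
    int end = (taskIdx == numTasks - 1) ? dataCount : start + chunkSize;
    std::printf("task %d handles points [%d, %d)\n", taskIdx, start, end);
  }
  // Prints: task 0 -> [0, 2), task 1 -> [2, 4), task 2 -> [4, 6), task 3 -> [6, 10)
  return 0;
}
```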
2. Create Your Host Application
- Setup PXL Runtime Instances

  You can initialize PXL runtime instances as shown below:

  ```cpp
  uint32_t deviceId = 0;
  auto ctx = pxl::runtime::createContext(deviceId);
  auto job = ctx->createJob();
  auto module = pxl::createModule(filename);
  job->load(module);
  ```

  Alternatively, use a custom `PxlResources` struct to manage the lifecycle of PXL resources. The `PxlResources` struct is implemented in `pxl_helper.hpp`.

  ```cpp
  struct PxlResources {
    pxl::runtime::Context* context;
    pxl::runtime::Job* job;
    pxl::runtime::Map* map;
    pxl::runtime::Parallel* parallel;

    ~PxlResources() {
      if (context) {
        if (job) {
          context->destroyJob(job);
        }
        pxl::runtime::destroyContext(context);
      }
    }
  };
  ```

  ```cpp
  int main(int argc, char* argv[]) {
    const uint32_t numTasks = 4;
    auto pxlResources = initializePxl(numTasks, MAP);  // or PARALLEL
    ...
  }
  ```
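The body of `initializePxl` is not listed in this tutorial. The sketch below is only a guess at its shape based on the calls shown above; in particular, `job->createMap` and `job->createParallel` are assumed factory functions and the module filename is a placeholder, so consult `pxl_helper.hpp` for the actual implementation.

```cpp
#include <cstdint>
#include <memory>
#include "pxl_helper.hpp"  // PxlResources, shown above

enum ExecMode { MAP, PARALLEL };

// Hedged sketch only. The Map/Parallel factory calls (job->createMap,
// job->createParallel) and the module filename are ASSUMPTIONS, not
// documented PXL APIs; see pxl_helper.hpp for the real implementation.
std::unique_ptr<PxlResources> initializePxl(uint32_t numTasks, ExecMode mode) {
  auto resources = std::make_unique<PxlResources>();  // members zero-initialized
  uint32_t deviceId = 0;
  resources->context = pxl::runtime::createContext(deviceId);
  resources->job = resources->context->createJob();
  auto module = pxl::createModule("mu_knn.pxl");  // placeholder filename
  resources->job->load(module);
  if (mode == MAP) {
    resources->map = resources->job->createMap("knn_with_map", numTasks);                // assumed API
  } else {
    resources->parallel = resources->job->createParallel("knn_with_parallel", numTasks); // assumed API
  }
  return resources;
}
```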
- Setup Data

  - Reading Data and Configuring KNN Parameters

    First, read the data from the input files using a helper function, which is implemented in `helper.hpp`.

    ```cpp
    std::string dataFilename = "knn_data/dim_5/data.txt";
    std::string targetFilename = "knn_data/dim_5/targets.txt";
    auto [data, label] = readData<float>(dataFilename);
    auto target = readTarget<float>(targetFilename);

    const uint32_t k = 3;
    const bool useDotProduct = false;
    const uint32_t dimension = data[0].size();
    const uint32_t dataCount = data.size();
    knnConfig config = {k, dimension, dataCount, useDotProduct};
    ```
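For completeness, here is a hedged sketch of what `readData` might look like. The actual helper lives in `helper.hpp` and its file format is not documented here; this version assumes each line of `data.txt` holds the feature values followed by a single integer label, all whitespace-separated.

```cpp
#include <fstream>
#include <iterator>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hedged sketch only; the on-disk format (features followed by one integer
// label per line) is an assumption, not taken from the tutorial.
template <typename T>
std::pair<std::vector<std::vector<T>>, std::vector<int>> readData(const std::string& filename) {
  std::vector<std::vector<T>> data;
  std::vector<int> labels;
  std::ifstream in(filename);
  std::string line;
  while (std::getline(in, line)) {
    std::istringstream ss(line);
    std::vector<T> values{std::istream_iterator<T>(ss), std::istream_iterator<T>()};
    if (values.empty()) continue;
    labels.push_back(static_cast<int>(values.back()));  // last column as the label
    values.pop_back();
    data.push_back(std::move(values));
  }
  return {data, labels};
}
```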
  - Allocating Device Memory

    You can allocate device memory in a similar manner to that demonstrated in the Hello Sort tutorial. In this tutorial, we use the helper function `allocatePxlMemory` along with a custom deleter implemented through the `PxlMemoryDeleter` struct. This ensures that memory allocated by PXL is properly freed when it is no longer needed. The `allocatePxlMemory` function is a template that returns a `PxlUniquePtr`, which is a `std::unique_ptr` with a custom deleter; the deleter frees the allocated memory using the PXL runtime's `memFree` function.

    ```cpp
    template <typename T>
    struct PxlMemoryDeleter {
      pxl::runtime::Context* context;

      PxlMemoryDeleter(pxl::runtime::Context* ctx) : context(ctx) {
        if (!ctx) throw std::invalid_argument("Null context provided to PxlMemoryDeleter");
      }

      void operator()(T* ptr) {
        if (ptr && context) context->memFree(ptr);
      }
    };

    template <typename T>
    using PxlUniquePtr = std::unique_ptr<T, PxlMemoryDeleter<T>>;

    template <typename T>
    PxlUniquePtr<T> allocatePxlMemory(pxl::runtime::Context* context, size_t count) {
      T* ptr = reinterpret_cast<T*>(context->memAlloc(count * sizeof(T)));
      if (!ptr) {
        throw std::runtime_error("Failed to allocate PXL memory");
      }
      return PxlUniquePtr<T>(ptr, PxlMemoryDeleter<T>(context));
    }
    ```

    Allocate memory for the data and labels:

    ```cpp
    int main(int argc, char* argv[]) {
      ...
      auto muData = allocatePxlMemory<float>(pxlResources->context, dataCount * dimension);
      auto dataTensor = pxl::runtime::Tensor<float>(
          muData.get(),
          {static_cast<size_t>(numTasks), (dataCount * dimension) / static_cast<size_t>(numTasks)});  // in case of MAP

      auto muLabel = allocatePxlMemory<int>(pxlResources->context, dataCount);
      auto labelTensor = pxl::runtime::Tensor<int>(
          muLabel.get(),
          {static_cast<size_t>(numTasks), dataCount / static_cast<size_t>(numTasks)});  // in case of MAP
      ...
    ```
  - Write Data to Device Memory

    ```cpp
      for (size_t i = 0; i < dataCount; i++) {
        std::memcpy(muData.get() + i * dimension, data[i].data(), dimension * sizeof(float));
      }
      std::memcpy(muLabel.get(), label.data(), dataCount * sizeof(int));
      ...
    }
    ```
- Run KNN

  Note that `knnParallel` takes raw device pointers (`float*`, `int*`), so the device buffers `muData` and `muLabel` are passed rather than the host-side vectors.

  ```cpp
  // Iterate over each target
  for (size_t i = 0; i < target.size(); i++) {
    std::vector<std::vector<float>> mapResult(numTasks * config.k,
                                              std::vector<float>(config.dimension));
    std::vector<int> labelResult(numTasks * config.k);

    knnMap(pxlResources->context, pxlResources->map, dataTensor, labelTensor,
           target[i], config, numTasks, mapResult, labelResult);  // Map
    knnParallel(pxlResources->context, pxlResources->parallel, muData.get(), muLabel.get(),
                target[i], config, numTasks, mapResult, labelResult);  // Parallel

    // Validate and process results
    ...
  }
  ```
  - Map

    The `knnMap` function executes the KNN algorithm using the PXL framework's Map mode. It:

    - Prepares the necessary tensors (target tensor, result tensor, and label tensor).
    - Executes the Map operation via the `map->execute` function.
    - Converts the results into a format that can be returned to the host.

    ```cpp
    void knnMap(pxl::runtime::Context* context, pxl::runtime::Map* map,
                pxl::runtime::Tensor<float> dataTensor, pxl::runtime::Tensor<int> labelTensor,
                const std::vector<float>& target, const knnConfig& config, int numTasks,
                std::vector<std::vector<float>>& result, std::vector<int>& labelResult) {
      // Allocate and prepare the target tensor. Note that we allocate target
      // memory for each task because every task needs the same target.
      auto muTarget = allocatePxlMemory<float>(context, numTasks * config.dimension);
      auto targetTensor = pxl::runtime::Tensor<float>(
          muTarget.get(), {static_cast<size_t>(numTasks), config.dimension});
      for (int j = 0; j < numTasks; j++) {
        std::memcpy(muTarget.get() + j * config.dimension, target.data(),
                    config.dimension * sizeof(float));
      }

      // Allocate the result tensor
      auto muResult = allocatePxlMemory<float>(context, numTasks * config.dimension * config.k);
      auto resultTensor = pxl::runtime::Tensor<float>(
          muResult.get(),
          {static_cast<size_t>(numTasks), config.dimension * static_cast<size_t>(config.k)});

      // Allocate the label result tensor
      auto muLabelResult = allocatePxlMemory<int>(context, numTasks * config.k);
      auto labelResultTensor = pxl::runtime::Tensor<int>(
          muLabelResult.get(), {static_cast<size_t>(numTasks), static_cast<size_t>(config.k)});

      // Execute the map operation
      if (!map->execute(dataTensor, labelTensor, targetTensor, config.k, config.dimension,
                        config.dataCount / numTasks, resultTensor, labelResultTensor,
                        config.useDotProduct)) {
        throw std::runtime_error("Map Execute Failed");
      }
      map->synchronize();

      // Convert the result to the return format
      for (int i = 0; i < numTasks * config.k; i++) {
        for (int j = 0; j < config.dimension; j++) {
          result[i][j] = muResult.get()[i * config.dimension + j];
        }
        labelResult[i] = muLabelResult.get()[i];
      }
    }
    ```

    Note: `map` automatically divides the data into chunks and distributes them to each task. As a result, the target tensor is also divided into chunks, so ensure that the target is correctly written to each chunk.
Parallel The
knnParallel
function is responsible for executing the KNN algorithm using the PXL framework’s Parallel mode. This method divides the workload among multiple tasks to achieve efficient parallel computation.- Allocates and prepares the necessary memory for the target.
- Executes the Parallel operation via the
parallel->execute
function. - Converts the results into a format that can be returned to the host.
void knnParallel(pxl::runtime::Context* context, pxl::runtime::Parallel* parallel, float* data, int* label, const std::vector<float>& target, const knnConfig& config, int numTasks, std::vector<std::vector<float>>& result, std::vector<int>& labelResult) { // Allocate and prepare target tensor, ensuring the target memory is correctly written for processing auto muTarget = allocatePxlMemory<float>(context, config.dimension); std::memcpy(muTarget.get(), target.data(), config.dimension * sizeof(float)); // Allocate result memory auto muResult = allocatePxlMemory<float>(context, numTasks * config.dimension * config.k); // Allocate label result memory auto muLabelResult = allocatePxlMemory<int>(context, numTasks * config.k); // Execute the parallel operation if (!parallel->execute(data, label, muTarget.get(), config, numTasks, muResult.get(), muLabelResult.get())) { throw std::runtime_error("Parallel Execute Failed"); } parallel->synchronize(); // Convert result to return format for (int i = 0; i < numTasks * config.k; i++) { for (int j = 0; j < config.dimension; j++) { result[i][j] = muResult.get()[i * config.dimension + j]; } labelResult[i] = muLabelResult.get()[i]; } }
- Validate and Process Results

  Once the KNN algorithm has been executed in Map mode, the results need to be:

  - Validated against the CPU-based host results.
  - Processed for errors.
  - Displayed or logged as the final classified label.

  The following code handles these tasks:

  - Validation: Compare the `mapResult` and `labelResult` from Map mode with the `hostTopKResult` and `hostMajorityLabel` from the host.
  - Error Handling: Log discrepancies using `printErrorDetails`.
  - Output: Print the final classified label to the console.

  ```cpp
  // Get the top k result from the whole data set
  auto [hostTopKResult, hostMajorityLabel] =
      getTopK(data, label, target[i], config.k, config.dimension, dataCount, config.useDotProduct);

  // Get the top k from the kernel result
  auto [mapTopKResult, mapMajorityLabel] =
      getTopK(mapResult, labelResult, target[i], config.k, config.dimension,
              numTasks * config.k, config.useDotProduct);

  if (mapTopKResult != hostTopKResult || mapMajorityLabel != hostMajorityLabel) {
    printErrorDetails<float>(directoryName, target[i], hostTopKResult, mapTopKResult,
                             hostMajorityLabel, mapMajorityLabel, k);
  } else {
    std::cout << "Classified label : " << mapMajorityLabel[0] << std::endl;
  }
  ```
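A brief note on why comparing against only `numTasks * config.k` candidates is sufficient: each of the true k nearest points must fall inside some task's chunk, and therefore inside that task's local top k, so a second top-k pass over the concatenated candidates recovers the global answer. The sketch below illustrates this reduction over candidate distances (it is not the tutorial's `getTopK`):

```cpp
#include <algorithm>
#include <vector>

// Illustrative two-stage reduction: merge every task's local top-k candidate
// distances, then keep the global k smallest. Any true k-nearest point is
// necessarily in its own task's local top k, so no candidate is lost.
std::vector<float> mergeTopK(const std::vector<std::vector<float>>& perTaskTopK, size_t k) {
  std::vector<float> all;
  for (const auto& t : perTaskTopK) all.insert(all.end(), t.begin(), t.end());
  size_t keep = std::min(k, all.size());
  std::partial_sort(all.begin(), all.begin() + keep, all.end());
  all.resize(keep);
  return all;
}
```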
3. Build and Run
- Build

  Build the application using the `build.sh` script.

  ```sh
  ./build.sh
  ```

- Run the executables and check the results

  ```sh
  ./map_knn
  ./parallel_knn
  ```

  Expected output:

  ```
  Classified label : 5
  ```