Intelligent selection of diverse examples from JSON datasets
- 🔄 Smart Selection: Intelligently selects diverse examples from your dataset
- 📊 Completeness Aware: Optionally prioritizes more complete records
- 🎛️ Configurable Distance: Custom distance functions for your specific needs
- 💪 Efficient Processing: Smart sampling for large datasets
- 🎮 CLI Support: Easy command-line interface for quick analysis
- 📦 TypeScript Ready: Full TypeScript support with comprehensive types
# From npm
bunx pickasso data.json -n 5
# Using key path for nested data
bunx pickasso response.json -n 5 -k "data.items"
# Prioritize complete records
bunx pickasso users.json -n 10 -p
bunx pickasso <file> -n <number_of_examples>
bun install pickasso
bunx pickasso <input-file> [options]
Options:
- -n, --num-examples <number>: Number of examples to select (required)
- -s, --sample-size <number>: Size of random sample to consider
- -p, --prioritize-complete: Consider record completeness in selection
- -k, --key-path <path>: Path to array in nested JSON (e.g., 'data.items')
- -o, --out-file <file>: Output file (defaults to stdout)
- -w, --completeness-weight <number>: Balance between diversity and completeness (0-1, only used with -p)
  - 0: Pure diversity-based selection
  - 1: Pure completeness-based selection
  - 0.3 (default): Balanced selection favoring diversity
import { selectDiverseExamples } from "pickasso";
const dataset = [
{ id: 1, name: "John", age: 25 },
{ id: 2, name: "Jane", age: 30 },
// ... more objects
];
const diverseExamples = selectDiverseExamples(dataset, {
numExamples: 5,
prioritizeComplete: true,
completenessWeight: 0.3,
});
Define how similarity is calculated between objects:
const customDistance = (a: any, b: any) => {
// Custom logic to calculate distance
// Returns a number between 0 and 1 (clamped, because an age gap can exceed 100)
return Math.min(1, Math.abs(a.age - b.age) / 100);
};
const selected = selectDiverseExamples(dataset, {
numExamples: 5,
distanceFunction: customDistance,
});
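A custom distance does not have to look at a single field. The sketch below is one possible approach, not Pickasso's built-in distance: it flattens both objects into dotted keys and averages a per-key difference. The names flatten and fieldwiseDistance are hypothetical, and counting a key that is missing on one side as fully different is an assumption made for illustration.

```typescript
type Flat = Record<string, unknown>;

// Hypothetical helper: flatten nested objects into dotted keys,
// e.g. { a: { b: 1 } } becomes { "a.b": 1 }. Arrays are kept as leaf values.
function flatten(value: unknown, prefix = "", out: Flat = {}): Flat {
  if (value !== null && typeof value === "object" && !Array.isArray(value)) {
    const obj = value as Record<string, unknown>;
    for (const key of Object.keys(obj)) {
      flatten(obj[key], prefix ? `${prefix}.${key}` : key, out);
    }
  } else if (prefix) {
    out[prefix] = value;
  }
  return out;
}

// Hypothetical distance: mean per-key difference, always in [0, 1].
function fieldwiseDistance(a: unknown, b: unknown): number {
  const fa = flatten(a);
  const fb = flatten(b);
  const keys = Object.keys({ ...fa, ...fb }); // union of keys from both sides
  if (keys.length === 0) return 0;
  let total = 0;
  for (const key of keys) {
    const x = fa[key];
    const y = fb[key];
    if (x === undefined || y === undefined) {
      total += 1; // assumption: a key missing on one side counts as fully different
    } else if (typeof x === "number" && typeof y === "number") {
      const scale = Math.abs(x) + Math.abs(y);
      total += scale === 0 ? 0 : Math.abs(x - y) / scale; // normalized to [0, 1]
    } else {
      total += JSON.stringify(x) === JSON.stringify(y) ? 0 : 1;
    }
  }
  return total / keys.length;
}
```

A function shaped like this could be passed as distanceFunction in the options shown above.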
For large datasets, set sampleSize so that selection runs on a random subset of the records instead of the whole array:
// For large datasets, use sample size to control processing
const selected = selectDiverseExamples(largeDataset, {
numExamples: 10,
sampleSize: 1000, // Consider 1000 random items
});
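One common way to draw such a sample is a partial Fisher-Yates shuffle. The sketch below assumes uniform sampling without replacement; the text above does not say how Pickasso samples internally, and randomSample is a hypothetical helper name.

```typescript
// Hypothetical helper: draw up to `size` items uniformly at random, without replacement.
// `random` is injectable so the behavior can be made deterministic in tests.
function randomSample<T>(
  items: readonly T[],
  size: number,
  random: () => number = Math.random
): T[] {
  const pool = items.slice(); // copy so the caller's array is left untouched
  const count = Math.max(0, Math.min(Math.floor(size), pool.length));
  for (let i = 0; i < count; i++) {
    // Pick an index from the not-yet-chosen tail and swap it into position i.
    const j = i + Math.floor(random() * (pool.length - i));
    const tmp = pool[i];
    pool[i] = pool[j];
    pool[j] = tmp;
  }
  return pool.slice(0, count);
}
```

Only the first `size` positions are shuffled, so the cost grows with the sample size rather than with the full dataset, apart from the initial copy.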
When working with real-world data, you often want examples that are both diverse and well-populated. Pickasso lets you control this balance:
// Default behavior: Pure diversity-based selection
const diverse = selectDiverseExamples(dataset, {
numExamples: 5,
});
// Prioritize complete records while maintaining diversity
const balancedSelection = selectDiverseExamples(dataset, {
numExamples: 5,
prioritizeComplete: true, // Enable completeness consideration
completenessWeight: 0.3, // 30% completeness, 70% diversity
});
// Strongly favor complete records
const completeRecords = selectDiverseExamples(dataset, {
numExamples: 5,
prioritizeComplete: true,
completenessWeight: 0.8, // 80% completeness, 20% diversity
});
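The weight can be read as a linear blend of two scores. The sketch below assumes that completeness is the fraction of expected fields holding a non-empty value and that the combined score is (1 - w) * diversity + w * completeness. Both are illustrative assumptions, and completenessScore and blendedScore are hypothetical names, not Pickasso exports.

```typescript
// Hypothetical: fraction of the expected keys that hold a non-empty value.
function completenessScore(record: Record<string, unknown>, expectedKeys: string[]): number {
  if (expectedKeys.length === 0) return 0;
  const filled = expectedKeys.filter((key) => {
    const value = record[key];
    return value !== undefined && value !== null && value !== "";
  }).length;
  return filled / expectedKeys.length;
}

// Hypothetical: linear blend of a diversity score and a completeness score.
// weight 0 gives pure diversity, weight 1 gives pure completeness.
function blendedScore(diversity: number, completeness: number, weight = 0.3): number {
  if (weight < 0 || weight > 1) {
    throw new RangeError("weight must be between 0 and 1");
  }
  return (1 - weight) * diversity + weight * completeness;
}
```

Under this reading, a weight of 0.8 lets a fully populated but fairly ordinary record outrank a sparse outlier, which matches the "strongly favor complete records" case above.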
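Both the CLI and the API work with nested data structures. The CLI takes a key path with -k, while the API expects you to pass the nested array directly: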
// CLI
pickasso complex.json -n 5 -k "response.data.items"
// API
const data = {
response: {
data: {
items: [/* ... */]
}
}
};
const selected = selectDiverseExamples(data.response.data.items, {
numExamples: 5
});
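A dotted key path such as response.data.items can be resolved with a short walk over the object. This is a sketch of the idea, not the CLI's actual implementation; getByPath is a hypothetical helper.

```typescript
// Hypothetical helper: follow a dotted path and return the array found there.
function getByPath(root: unknown, path: string): unknown[] {
  let current: unknown = root;
  for (const segment of path.split(".")) {
    if (current === null || typeof current !== "object") {
      throw new Error(`Cannot read "${segment}" while resolving "${path}"`);
    }
    current = (current as Record<string, unknown>)[segment];
  }
  if (!Array.isArray(current)) {
    throw new Error(`Expected an array at "${path}"`);
  }
  return current;
}
```

With a helper like this, getByPath(data, "response.data.items") would return the same array as data.response.data.items in the API example above.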
Pickasso uses a multi-step algorithm to select diverse examples:
- Initial Selection
  - Randomly samples from the dataset if needed
  - Optionally starts with the most complete item
- Iterative Selection
  - Calculates distances between candidates and selected items
  - Maximizes minimum distance to ensure diversity
  - Optionally weights completeness scores
- Distance Calculation
  - Flattens nested objects for comparison
  - Normalizes numerical differences
  - Handles missing values gracefully
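The iterative step resembles greedy farthest-point (max-min) selection. The sketch below is an illustration under simplifying assumptions, not Pickasso's source: it starts from the first item rather than a random or most-complete one, takes the distance function as a parameter, and ignores completeness. greedyMaxMin is a hypothetical name.

```typescript
type Distance<T> = (a: T, b: T) => number;

// Hypothetical sketch: repeatedly pick the candidate whose nearest
// already-selected item is farthest away.
function greedyMaxMin<T>(items: readonly T[], numExamples: number, distance: Distance<T>): T[] {
  if (items.length === 0 || numExamples <= 0) return [];
  const selected: T[] = [items[0]]; // assumption: seed with the first item
  const remaining = items.slice(1);
  // minDist[i] is the distance from remaining[i] to its nearest selected item.
  const minDist = remaining.map((item) => distance(item, items[0]));
  while (selected.length < numExamples && remaining.length > 0) {
    // Find the candidate with the largest minimum distance.
    let best = 0;
    for (let i = 1; i < remaining.length; i++) {
      if (minDist[i] > minDist[best]) best = i;
    }
    const chosen = remaining.splice(best, 1)[0];
    minDist.splice(best, 1);
    selected.push(chosen);
    // The newly selected item may now be the nearest one for some candidates.
    for (let i = 0; i < remaining.length; i++) {
      minDist[i] = Math.min(minDist[i], distance(remaining[i], chosen));
    }
  }
  return selected;
}
```

Keeping a running minimum per candidate means each round costs one distance call per remaining item, instead of recomputing distances against every selected item.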
- Node.js 14 or later
- TypeScript 4.5+ (for development)
Contributions are welcome! Check out our contribution guidelines for details.
Created by Hrishi Olickel • Support Pickasso by starring our GitHub repository