# Nub Sieve

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
 ≠

Nub Sieve (), Nubsieve, or Unique Mask, is a monadic function that returns a Boolean vector indicating which major cells of its argument would be included in the result of Unique (Nub). More precisely, it indicates all cells which do not match any earlier indicated cell. The Nub Sieve is more informative than Unique because it encodes not only which cells are unique but where they appear in the argument. This means it can be used for tasks such as filtering secondary arrays and working with duplicates that are difficult to do with Unique and slow to compute with Key.

## Examples

Nub Sieve returns a 1 for each major cell in the argument that is unique, and a 0 if it's a duplicate:

'Hello, World'
1 1 1 0 1 1 1 1 0 1 0 1

Using Key we can see which elements were labelled unique and duplicate:

( {⍺⍵} ) 'Hello, World!'
┌─┬──────────┐
1Helo, Wrd!
├─┼──────────┤
0lol
└─┴──────────┘
Works in: Dyalog APL

Nub Sieve can be used to filter a matrix by some of its columns, or find its unique rows.

a  (6) ∘.! 3+⍳3
4  5  6
6 10 15
4 10 20
1  5 15
0  1  6
0  0  1
(11a)  a           ⍝ Keep only one row for each first column value
4  5  6
6 10 15
1  5 15
0  1  6
a                     ⍝ Which entire rows are unique?
1 1 1 1 1 1
a                  ⍝ All of them.
1

## Description

While the Nub Sieve satisfies (A) (A)A, this criterion is not a full definition because it doesn't specify which cell should be selected from among duplicates. The Nub Sieve should indicate cells by following the same procedure as Unique, that is, by considering all major cells in order and indicating those which do not match any cell already indicated.

Formally, let x be an array of rank at least 1. The Nub Sieve m of x is the array which satisfies the following two axioms (with ⎕IO0) for all i in ⍳≢x.

notin  (⍳=≢)               ⍝ Extension of (~∊) to rank >1
(m)  ,≢x                     ⍝ Shape of m
m[i]  (ix) notin m⌿⍥(i)x   ⍝ Items of m

The search in notin depends on ⎕CT and so the Nub Sieve depends on comparison tolerance. The two comparisons are taken to be independent of ⎕CT; m is exactly Boolean.

When the argument is a scalar, Unique returns its ravel, an instance of scalar rank extension. Correspondingly, Nub Sieve of a scalar is defined to be the singleton vector ,1.

### APL models

An implementation based on the above definition follows:

⎕IO0
notin  (⍳=≢)
x  1/                          ⍝ Treat scalar as vector
m  {m,(x) notin mx}¨ ⍳≢x
}

The Unique Mask can also be derived from the self-classify or index-in-nub function (∪⍳1/). The following two implementations arise from the fact that 1s in the Unique Mask appear exactly when a new value appears in the self-classify.

(≢=⍳)(∪⍳1/)      ⍝ Pre-correction of "bad meme" (⍳≢⍵)=⍵⍳⍵
(2/¯1,⌈\)(∪⍳1/)    ⍝ Optimised self-classify → Nub Sieve conversion

## Use cases

One use of Nub Sieve is to remove duplicates from one array and also corresponding arrays, which may contain attributes of cells in that array. The statement a b⌿⍨⊂≠a filters a and corresponding b in this way. A minor variation is to filter an array based on only one attribute of its major cells: a⌿⍨31a gets the rows of matrix a which introduce new values in column 3. Code like this might be used to compute, for example, the first Nobel prize winner from each country given a list of winners sorted by date. While Key can be used for such computations (for instance {}⌸⍨ in place of ⌿⍨∘), the Nub Sieve is slightly simpler and considerably faster in most implementations: Key recomputes uniqueness information for its left argument each time.

The combination with Indices ⍸≠A finds the indices of unique elements of an array, which might be useful to iterate through these elements without extracting them into a separate array (it could even be used to modify them in their original positions). Again, this combination can be computed with Key, using {}A, but the Nub Sieve implementation is more obvious and readable ("Where is A unique?") and much faster (while the particular Key operand {} could be sped up, the function can be written in many ways, and recognising all of them is difficult).

Nub Sieve can be used to find all elements of an array that are duplicates, by inverting the result. This is useful in order to see or manipulate what is being discarded by Unique. A program might print duplicate values before discarding them:

mvals
In A Programming Language, the forward set selector ${\displaystyle \sigma /\mathbf {v} }$ is a special notation indicating the nub sieve of ${\displaystyle \mathbf {v} }$, so that unique elements are found with a Compression ${\displaystyle (\sigma /\mathbf {v} )/\mathbf {v} }$.