XSUM - Remove Duplicate Records using SORT

Removing Duplicates with SUM FIELDS=NONE and XSUM

Removing duplicate records while also capturing the discarded duplicates can be done in one sort step with SUM FIELDS=NONE and the XSUM operand. The SUM control statement normally adds up numeric summary fields across records that share a sort key. Specifying FIELDS=NONE skips the summation, so only one record per key is retained. XSUM, a SyncSORT operand, writes the discarded duplicates to the data set defined by the SORTXSUM DD statement. DFSORT accepts SUM FIELDS=NONE but not XSUM; its equivalent, ICETOOL SELECT, is covered later in this article.

SUM FIELDS=NONE
Eliminates all but one record for each unique sort key.
XSUM
Writes the removed duplicate records into the SORTXSUM DD dataset.

Internally, the SORT utility performs the following steps:

Sorts records based on the defined keys.
Keeps the first occurrence of each key in SORTOUT.
Sends all subsequent duplicate occurrences to SORTXSUM.

Which record counts as the first occurrence depends on these options:

If the EQUALS option is in effect, the first record in input order for each key is kept.
If the NOEQUALS option is in effect, the retained record is unpredictable.

When numeric fields are summed (that is, not with FIELDS=NONE), two further options apply:

The ZDPRINT option converts positive summed ZD values to printable numbers.
The NZDPRINT option leaves positive summed ZD values unconverted.
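
The sort, keep, and divert steps described above can be modelled outside the mainframe. The Python sketch below only illustrates the semantics with EQUALS in effect; it is not how the sort product is implemented. The function name `sum_fields_none_xsum` and its `start`/`length` arguments are assumptions made for this example.

```python
def sum_fields_none_xsum(records, start, length):
    """Model SUM FIELDS=NONE,XSUM for one ascending CH key with EQUALS in effect.

    start is 1-based, as in SORT FIELDS=(start,length,CH,A).
    Returns (sortout, sortxsum) as two lists.
    """
    def key(rec):
        return rec[start - 1:start - 1 + length]

    # sorted() is stable, so records with equal keys keep their input order.
    # That mirrors EQUALS. Note that Python compares code points, not EBCDIC.
    ordered = sorted(records, key=key)

    sortout, sortxsum = [], []
    for rec in ordered:
        if sortout and key(sortout[-1]) == key(rec):
            sortxsum.append(rec)  # later occurrence of a key already kept
        else:
            sortout.append(rec)   # first occurrence of this key
    return sortout, sortxsum
```

Calling `sum_fields_none_xsum(records, 5, 4)` corresponds to a 4-byte key starting at position 5. With NOEQUALS there is no guarantee about which duplicate is kept, so this model does not apply.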

How DFSORT handles short SUM fields depends on whether the VLSHRT or NOVLSHRT option is in effect. A short field is one that extends beyond the end of a variable-length record. With NOVLSHRT, DFSORT terminates when it finds a short summary field; with VLSHRT, it continues and leaves records with short summary fields unsummed.

JCL Syntax

//STEP1    EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//SYSPRINT DD SYSOUT=*
//SORTIN   DD DSN=your.input.dataset,DISP=SHR
//SORTOUT  DD DSN=your.output.dataset,
//            DISP=(NEW,CATLG,DELETE),
//            UNIT=SYSDA,SPACE=(CYL,(5,1),RLSE),
//            DCB=(RECFM=FB,LRECL=80)
//SORTXSUM DD DSN=your.xsum.dataset,
//            DISP=(NEW,CATLG,DELETE),
//            UNIT=SYSDA,SPACE=(CYL,(5,1),RLSE),
//            DCB=(RECFM=FB,LRECL=80)
//SYSIN    DD *
  SORT FIELDS=(start,length,CH,A)
  SUM FIELDS=NONE,XSUM
/*

SUM Statement Formats:

  SUM FIELDS=NONE [,XSUM]
  SUM FIELDS=(p1,l1,f1 [,p2,l2,f2] ... ) [,XSUM]
  SUM FIELDS=(p1,l1 [,p2,l2] ... ),FORMAT=f [,XSUM]

When XSUM is used, the dropped records are written to the dataset defined in SORTXSUM.

//SORTXSUM DD DSN=XXXXXX.OUTPUT.SORTOUT,
//            DISP=OLD

SORTXSUM: Output file for a SORT or MERGE function. It contains the records eliminated during SUM processing.

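The following three statements are equivalent. FORMAT=f supplies the format for any field coded without one:

SUM FIELDS=(5,5,ZD,12,6,PD,21,3,PD,35,7,ZD)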
SUM FORMAT=ZD,FIELDS=(5,5,12,6,PD,21,3,PD,35,7)
SUM FIELDS=(5,5,ZD,12,6,21,3,35,7,ZD),FORMAT=PD

Summary Field Formats	Length	Description
BI	2, 4, or 8 bytes	Unsigned binary
FI	2, 4, or 8 bytes	Signed fixed-point
FL	4, 8, or 16 bytes	Signed hexadecimal floating-point
PD	1 to 16 bytes	Signed packed decimal
ZD	1 to 18 bytes	Signed zoned decimal

XSUM Examples in JCL

In the example below, the key starts at position 5 and is 4 bytes long. The first record with each key is written to SORTOUT, and every later record with the same key is written to SORTXSUM.

//STEP01    EXEC PGM=SORT
//SYSOUT    DD SYSOUT=*
//SORTIN    DD DSN=INPUT.FILE1,DISP=SHR
//* Output DSNs are placeholders; each must differ from SORTIN
//SORTOUT   DD DSN=OUTPUT.FILE1,
//          DISP=(NEW,CATLG,DELETE),UNIT=3390,
//          SPACE=(CYL,(5,1)),DCB=(LRECL=22)
//SORTXSUM  DD DSN=XSUM.FILE1,
//          DISP=(NEW,CATLG,DELETE),UNIT=3390,
//          SPACE=(CYL,(5,1)),DCB=(LRECL=22)
//SYSIN     DD *
  SORT FIELDS=(5,4,CH,A)
  SUM FIELDS=NONE,XSUM
/*

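In this modified example, INCLUDE filters the input before sorting, so only records with 'XYZ' at positions 45 to 47 are processed. This assumes input records of at least 47 bytes, unlike the 22-byte records above. Among the included records, the first with each 4-byte key at position 5 is written to SORTOUT and later duplicates go to SORTXSUM. Records without 'XYZ' appear in neither data set.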

//SYSIN     DD *
  SORT FIELDS=(5,4,CH,A)
  INCLUDE COND=(45,3,CH,EQ,C'XYZ')
  SUM FIELDS=NONE,XSUM
/*
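
The effect of combining INCLUDE with duplicate removal can be sketched in Python. This is a rough model of the control statements above under the assumption that EQUALS is in effect; the function name `include_then_dedupe` is hypothetical.

```python
def include_then_dedupe(records):
    """Model INCLUDE COND=(45,3,CH,EQ,C'XYZ') combined with
    SORT FIELDS=(5,4,CH,A) and SUM FIELDS=NONE,XSUM (EQUALS assumed)."""
    # INCLUDE is applied first: records without 'XYZ' in columns 45-47
    # are dropped and reach neither SORTOUT nor SORTXSUM.
    included = [r for r in records if r[44:47] == "XYZ"]

    # Stable sort on columns 5-8 keeps input order within equal keys.
    ordered = sorted(included, key=lambda r: r[4:8])

    sortout, sortxsum = [], []
    for rec in ordered:
        if sortout and sortout[-1][4:8] == rec[4:8]:
            sortxsum.append(rec)  # duplicate key among included records
        else:
            sortout.append(rec)   # first included record with this key
    return sortout, sortxsum
```

The point of the sketch is the ordering: filtering happens before the duplicate check, so a record excluded by INCLUDE can never become the retained "first" record.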

Using Multiple Fields as Key:

//SYSIN     DD *
  SORT FIELDS=(1,2,CH,A,4,3,CH,A,8,3,CH,A)
  SUM FIELDS=NONE,XSUM
/*
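
A composite key simply means that all listed fields must match for two records to count as duplicates. The Python sketch below models that idea for the three fields in the statement above; `composite_key`, `split_duplicates`, and `FIELDS` are names invented for this illustration, and EQUALS is assumed.

```python
# Mirrors SORT FIELDS=(1,2,CH,A,4,3,CH,A,8,3,CH,A) as 1-based (start, length) pairs.
FIELDS = [(1, 2), (4, 3), (8, 3)]

def composite_key(record, fields=FIELDS):
    """Build a tuple key from the listed fields; bytes outside them are ignored."""
    return tuple(record[s - 1:s - 1 + n] for s, n in fields)

def split_duplicates(records, fields=FIELDS):
    """Return (kept, dropped): first record per composite key, then the rest."""
    ordered = sorted(records, key=lambda r: composite_key(r, fields))  # stable
    kept, dropped = [], []
    for rec in ordered:
        if kept and composite_key(kept[-1], fields) == composite_key(rec, fields):
            dropped.append(rec)
        else:
            kept.append(rec)
    return kept, dropped
```

Bytes 3 and 7 are not part of the key here, so two records that differ only in those positions are still treated as duplicates.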

This technique helps isolate duplicate records using composite keys spread across multiple fields and can be useful in reporting, deduplication, and validation tasks in mainframe environments.

With `SUM FIELDS=NONE` and `EQUALS` in effect, DFSORT eliminates “duplicate records” by writing the first record with each key to the SORTOUT data set and deleting subsequent records with each key. A competitive product offers an `XSUM` operand that allows the deleted duplicate records to be written to a SORTXSUM dataset. While DFSORT does not support the XSUM operand, DFSORT does provide the equivalent function and a lot more with the `SELECT` operator of ICETOOL. SELECT lets you put the records that are selected in the `TO` data set and the records that are not selected in the `DISCARD` data set. So an ICETOOL SELECT job to do the XSUM function might look like this:

//XSUM JOB ...
//DOIT EXEC PGM=ICETOOL
//TOOLMSG DD SYSOUT=*
//DFSMSG  DD SYSOUT=*
//IN      DD DSN=... input data set
//OUT     DD DSN=... first record with each key
//SORTXSUM DD DSN=... subsequent records with each key
//TOOLIN  DD * 
  SELECT FROM(IN) TO(OUT) ON(1,3,CH) FIRST DISCARD(SORTXSUM)
/*

This will put the first occurrence of each ON field (sort key) in the OUT data set and the rest of the records in the SORTXSUM data set.

If IN contained the following records:

J03 RECORD 1
M72 RECORD 1
M72 RECORD 2
J03 RECORD 2
A52 RECORD 1
M72 RECORD 3

OUT would contain the following records:
A52 RECORD 1
J03 RECORD 1
M72 RECORD 1

SORTXSUM would contain the following records:
J03 RECORD 2
M72 RECORD 2
M72 RECORD 3
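
The result above can be reproduced with a small Python model of SELECT ... FIRST DISCARD. This is only a sketch of the selection logic, not of ICETOOL itself; the function name `select_first` is an assumption.

```python
from itertools import groupby

def select_first(records, start, length):
    """Model SELECT ... ON(start,length,CH) FIRST DISCARD(...).

    Returns (to, discard): the first record per key, then the remaining records.
    """
    def key(rec):
        return rec[start - 1:start - 1 + length]

    to, discard = [], []
    # Stable sort, then walk each run of equal keys in input order.
    for _, run in groupby(sorted(records, key=key), key=key):
        group = list(run)
        to.append(group[0])
        discard.extend(group[1:])
    return to, discard

IN = [
    "J03 RECORD 1", "M72 RECORD 1", "M72 RECORD 2",
    "J03 RECORD 2", "A52 RECORD 1", "M72 RECORD 3",
]
out, sortxsum = select_first(IN, 1, 3)
print(out)       # ['A52 RECORD 1', 'J03 RECORD 1', 'M72 RECORD 1']
print(sortxsum)  # ['J03 RECORD 2', 'M72 RECORD 2', 'M72 RECORD 3']
```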

SELECT also allows you to use multiple ON fields (that is, multiple keys), and DFSORT control statements (for example, INCLUDE or OMIT), such as in this SELECT statement:

//TOOLIN DD *
  SELECT FROM(IN) TO(OUT1) ON(1,3,CH) ON(25,3,PD) FIRST -
  DISCARD(XSUM1) USING(CTL1)
//CTL1CNTL DD *
  INCLUDE COND=(11,7,CH,EQ,C'PET-RAT')
/*

And SELECT can do much more than that. Besides FIRST, it also lets you use:

FIRST(n)
FIRSTDUP
FIRSTDUP(n)
LAST
LASTDUP
ALLDUPS
NODUPS
HIGHER(x)
LOWER(y)
EQUAL(v)

You can use TO(outdd) alone, DISCARD(savedd) alone, or both, to manage selected and non-selected records for various use cases.

* Put duplicates in DUPS and non-duplicates in NODUPS:

SELECT FROM(DATA) TO(DUPS) ON(5,8,CH) ALLDUPS DISCARD(NODUPS)

* Put records with 5 occurrences (of the key) in EQ5:

SELECT FROM(DATA) TO(EQ5) ON(5,8,CH) EQUAL(5)

* Put records with more than 3 occurrences (of the key) in GT3,
* and records with 3 or fewer occurrences in LE3:

SELECT FROM(DATA) TO(GT3) ON(5,8,CH) HIGHER(3) DISCARD(LE3)

* Put records with 9 or more occurrences in OUT2:

SELECT FROM(DATA) ON(5,8,CH) LOWER(9) DISCARD(OUT2)

* Put last of each set of duplicates in DUP1:

SELECT FROM(DATA) TO(DUP1) ON(5,8,CH) LASTDUP
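
To compare what the different operands select, the following Python sketch models several of them on a single CH key. It is a simplified illustration: the function name `select` and its argument layout are hypothetical, and FIRST(n) and FIRSTDUP(n) are left out.

```python
from itertools import groupby

def select(records, start, length, operand, n=0):
    """Model a subset of ICETOOL SELECT operands; returns (to, discard)."""
    def key(rec):
        return rec[start - 1:start - 1 + length]

    to, discard = [], []
    for _, run in groupby(sorted(records, key=key), key=key):
        group = list(run)
        count = len(group)
        none, everything = (0, 0), (0, count)
        # Each operand maps to the slice of the key group that is selected.
        lo, hi = {
            "FIRST": (0, 1),
            "LAST": (count - 1, count),
            "FIRSTDUP": (0, 1) if count > 1 else none,
            "LASTDUP": (count - 1, count) if count > 1 else none,
            "ALLDUPS": everything if count > 1 else none,
            "NODUPS": everything if count == 1 else none,
            "HIGHER": everything if count > n else none,
            "LOWER": everything if count < n else none,
            "EQUAL": everything if count == n else none,
        }[operand]
        to.extend(group[lo:hi])
        discard.extend(group[:lo] + group[hi:])  # everything not selected
    return to, discard
```

For example, `select(data, 5, 8, "HIGHER", 3)` corresponds to the GT3/LE3 statement above: the first list holds records whose key occurs more than three times, and the second holds the rest.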

Best Practices

Define Keys Precisely: Ensure your SORT FIELDS cover all bytes that constitute uniqueness.
Manage DISP & DCB: Let SORT supply DCB attributes when possible; override only if needed.
Validate Results: Always verify both SORTOUT and SORTXSUM for expected counts.
Performance: If a job only needs to know whether a data set is empty, an IDCAMS PRINT with COUNT(1) in a prior step is enough; it ends with return code 4 when the data set is empty. This is unrelated to XSUM but useful in broader workflows.

Conclusion

Using SUM FIELDS=NONE,XSUM (SyncSORT) or ICETOOL SELECT…FIRST DISCARD (DFSORT) provides a straightforward, single-step method to both remove duplicates from your main output and capture all dropped records separately. This dual-output approach simplifies auditing, downstream processing, and data quality checks in your batch JCL workflows.