merge — Merge datasets 11
If you specify the update option, however, then all missing values of overlapping variables in
matched observations are replaced with values from the using data. Because of this new behavior,
the merge codes change somewhat. Codes 1 and 2 keep their old meaning. Code 3 splits into codes
3, 4, and 5. Codes 3, 4, and 5 are filtered according to the following rules; the first applicable rule
is used.
5 corresponds to matched observations where at least one overlapping variable had conflicting
nonmissing values.
4 corresponds to matched observations where at least one missing value was updated, but there
were no conflicting nonmissing values.
3 means observations matched, and there were neither updated missing values nor conflicting
nonmissing values.
If you specify both the update and replace options, then the _merge==5 cases are updated with
values from the using data.
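The rules above can be seen on a small pair of datasets. The following is a minimal sketch, meant to be run as a do-file. The variable names id and x and all data values are invented for illustration, and the codes noted in the comments are what the rules above imply, not captured output.

```stata
* Hypothetical using data: one key (id) and one overlapping variable (x).
clear
input id x
1 10
2 20
3 35
5 50
end
tempfile u
save `u'

* Hypothetical master data; id 2 has a missing x, id 3 conflicts with using.
clear
input id x
1 10
2  .
3 30
4 40
end
tempfile m
save `m'

* update only: missing master values are filled in from the using data.
merge 1:1 id using `u', update
list id x _merge
* By the rules above: id 1 -> 3 (matched, nothing changed),
* id 2 -> 4 (missing x updated), id 3 -> 5 (30 versus 35 conflict),
* id 4 -> 1 (master only), id 5 -> 2 (using only).

* update and replace: the _merge==5 case also takes the using value.
use `m', clear
merge 1:1 id using `u', update replace
list id x _merge
```

Comparing the two listings for id 3 shows the only difference that replace makes: the conflicting master value is kept in the first merge and overwritten in the second.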
Sort order
As we have mentioned, in the 1:1, 1:m, and m:1 match merges, the sort orders of the master and
using datasets do not affect the data in the merged dataset. This is not the case for m:m merges, which we
recommend you never use.
Sorting is used by merge internally for efficiency, so the merged result can be produced most
quickly when the master and using datasets are already sorted by the key variable(s) before merging.
You are not required to have the dataset sorted before using merge, however, because merge will
sort behind the scenes, if necessary. If the using dataset is not sorted, then a temporary copy is made
and sorted to ensure that the current sort order on disk is not affected.
All of this is to reassure you that 1) your datasets on disk will not be modified by merge and
2) despite the fact that our discussion has ignored sort issues, merge is, in fact, efficient behind the
scenes.
It hardly makes any difference in run times, but if you know that the master and using data are
already sorted by the key variable(s), then you can specify the sorted option. All that will be saved
is the time merge would spend discovering that fact for itself.
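As a minimal sketch of the sorted option, assuming two invented datasets keyed on a hypothetical variable id, both sorted by id before the merge:

```stata
* Hypothetical using data, sorted by the key before saving.
clear
input id y
1 100
2 200
3 300
end
sort id
tempfile u
save `u'

* Hypothetical master data, also sorted by the key.
clear
input id x
1 10
2 20
3 30
end
sort id

* Both datasets are already sorted by id, so merge can skip its own check.
merge 1:1 id using `u', sorted
```

Omitting sorted here would produce the same merged data; the option only saves the time spent checking the sort order.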
The merged result produced by merge orders the variables and observations in a special and
sometimes useful way. If you think of datasets as tables, then the columns for the new variables
appear to the right of what was the master. If the master data originally had k variables, then the new
variables will be the (k + 1)st, (k + 2)nd, and so on. The new observations are similarly ordered so
that they all appear at the end of what was the master. If the master originally had N observations,
then the new observations, if any, are the (N + 1)st, (N + 2)nd, and so on. Thus the original master
data can be found from the merged result by extracting the first k variables and first N observations.
If merge with the update option was specified, however, then be aware that the extracted master
may have some updated values.
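The extraction described above can be sketched as follows. The datasets, the variable names id, x, and z, and the macro names k, N, vars, first, and last are all invented for illustration. The sketch records the master's dimensions before the merge and then keeps the first N observations and first k variables by position.

```stata
* Hypothetical using data.
clear
input id z
2 200
3 300
4 400
end
tempfile u
save `u'

* Hypothetical master data, left in memory.
clear
input id x
1 10
2 20
3 30
end

* Record the master's dimensions before merging.
local k = c(k)
local N = c(N)

merge 1:1 id using `u'

* Recover the master: first N observations, then first k variables.
keep in 1/`N'
unab vars : _all
local first : word 1 of `vars'
local last : word `k' of `vars'
keep `first'-`last'
```

Note that _merge is itself one of the new variables, so it is dropped along with z by the final keep.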
If you care about the ordering of observations in the data after a merge, then you should sort the
data after the merge. You should sort it in such a way that it has a unique ordering; see Sorting with
ties in [D] sort. If, against this recommendation, you wish to have a reproducible ordering after a
merge, then read the next paragraph. But be forewarned: just because something is reproducible does
not mean it is useful. Again, see Sorting with ties in [D] sort.
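One way to follow the recommendation above is to sort on a key that uniquely identifies observations. This is a minimal sketch with invented data; id is a hypothetical key variable, and isid is used to confirm that it has no ties before sorting.

```stata
* Hypothetical using data.
clear
input id z
3 300
1 100
4 400
end
tempfile u
save `u'

* Hypothetical master data.
clear
input id x
2 20
1 10
3 30
end

merge 1:1 id using `u'

* Confirm that id uniquely identifies observations, then sort on it.
isid id
sort id
describe, short
```

Because id has no ties, the ordering after sort id is unique, and describe should then report the data as sorted by id.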
The resulting dataset after any merge is unsorted. That is to say, if you type describe, the “Sorted
by” result will be empty. That is not to say that the data will not be ordered; a dataset always has
an order. After 1:1 merges, the ordering will always be in the original order of the master dataset,