Handling Non-BMP Unicode Characters in Data Loss Prevention 16.0.1

Non-BMP Unicode characters, such as emojis, cause detection issues in Symantec Data Loss Prevention 16.0.1. These characters must be removed or replaced.
If policies and data identifiers include non-BMP Unicode characters, the correctness of matcher results and incident snapshots is compromised.
Using the Update Readiness Tool
When you upgrade from DLP 16.0 to 16.0.1 using the Update Readiness Tool (URT), policies and data identifiers (DIs) containing non-BMP characters are logged verbosely and the update fails. You can use the update logs to identify which policies and data identifiers contain non-BMP characters.
You must remove the characters and then rerun the URT. Consult the following topics to learn more about non-BMP characters and the update process.
Detection Resiliency
During detection, Symantec DLP now handles non-BMP characters in several ways.
  • Non-BMP characters in the content that DLP scans are replaced by the Unicode Replacement Character 0xFFFD before scanning.
  • Condition matches are consistently detected with a correct offset or span across all platforms.
  • For Conditions that allow partial string matching, you can match strings containing non-BMP Unicode code points. For example, the regular expression "sensitive.*" matches "sensitive🙂file" and the highlight is shown as "sensitive��file".
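The replacement behavior described above can be sketched as follows. This is a minimal illustration, not Symantec DLP's implementation: the helper name replace_non_bmp is hypothetical, and the choice of two replacement characters per non-BMP character (one per UTF-16 code unit) is an assumption inferred from the two-character highlight in the example.

```python
import re

REPLACEMENT = "\uFFFD"  # Unicode Replacement Character

def replace_non_bmp(text):
    """Replace each non-BMP character with two U+FFFD characters.

    Assumption: one replacement per UTF-16 code unit of the surrogate pair,
    which matches the highlight shown in the example above.
    """
    return "".join(REPLACEMENT * 2 if ord(ch) > 0xFFFF else ch for ch in text)

content = "sensitive\U0001F642file"   # "sensitive", U+1F642, "file"
scanned = replace_non_bmp(content)

# A partial-match pattern still matches across the replaced characters.
match = re.search(r"sensitive.*", scanned)
print(match.group(0))
print(match.span())  # (0, 15)
```

Because the replacement keeps the same number of 16-bit units as the original surrogate pair, offsets computed on the replaced text line up with offsets in the UTF-16 form of the original content.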
Policy Authoring
The Enforce user interface prevents you from entering non-BMP Unicode characters into fields that are used for detection during message scanning. Non-BMP characters are flagged when you try to Save. An error message identifies the fields containing non-BMP Unicode characters, so that you can remove them.
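Before you save a policy or rerun the URT, it can help to check policy text for non-BMP characters yourself. The following is a minimal sketch of such a check. The function names find_non_bmp and strip_non_bmp are hypothetical and are not part of Symantec DLP; the sketch only assumes that you have the field text available as a string.

```python
def find_non_bmp(text):
    """Return (index, code point label) for every character above U+FFFF."""
    return [
        (index, f"U+{ord(ch):04X}")
        for index, ch in enumerate(text)
        if ord(ch) > 0xFFFF
    ]

def strip_non_bmp(text):
    """Return the text with every non-BMP character removed."""
    return "".join(ch for ch in text if ord(ch) <= 0xFFFF)

field_value = "sensitive\U0001F642file"
print(find_non_bmp(field_value))   # [(9, 'U+1F642')]
print(strip_non_bmp(field_value))  # sensitivefile
```

The index reported here counts Unicode code points, so it identifies the position of the character as a person reads the text, not its UTF-16 offset.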
Incident Snapshots
In Incident Snapshots, non-BMP characters in the extracted content of files are replaced by the Unicode replacement characters �� (Unicode 0xFFFD).
Non-BMP Unicode Characters Explained
The Basic Multilingual Plane (BMP) includes characters and symbols that are used by most modern languages. Characters such as emojis are included in the Supplementary Multilingual Plane and are considered non-BMP.
In UTF-16, characters in the BMP are represented by a single 16-bit code unit. Non-BMP characters are represented by an ordered pair (called a Surrogate Pair in the Unicode vocabulary) of two 16-bit code units. Even though non-BMP characters are human readable as a single character, they are treated as two characters. This treatment may lead to unexpected problems when iterating over the characters in a string. Symantec Data Loss Prevention handles non-BMP characters by flagging them for removal.
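The surrogate pair arithmetic can be worked through for the emoji U+1F642. The sketch below follows the standard UTF-16 encoding rule; the function name to_surrogate_pair is chosen for illustration only.

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x1F642)
print(f"{high:04X} {low:04X}")  # D83D DE42

# Number of 16-bit code units that UTF-16 needs for one character.
def utf16_units(ch):
    return len(ch.encode("utf-16-le")) // 2

print(utf16_units("A"))           # 1
print(utf16_units("\U0001F642"))  # 2
```

Code that iterates over a string one 16-bit unit at a time sees the two surrogates as separate characters, which is the source of the offset and span problems described above.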
For more information on Unicode characters, see Programming with Unicode.