DETECTING JAILBREAK ATTEMPTS ON GENERATIVE MODELS

Abstract: a computer-implemented method is provided that detects jailbreak attempts against generative models, which may involve a shift between benign and malicious content. the method includes determining a probability-based metric for each of a plurality of tokens in a target text using a language model, the probability-based metric being based on a probability at least one preceding token. the probability-based metrics are processed to identify a subset of the plurality of tokens having a change in the probability-based metric with respect to others of the plurality of the tokens not within the subset of the plurality of tokens. a jailbreak attempt in the target text is detected in response to identifying the change in the probability-based metric in the subset of the plurality of tokens.

Inventor(s): Roee OZ, Royi RONEN, Abedelkader ASI, Roy EISENSTADT, Alexander TSVETKOV

CPC Classification: G06F21/1014 ({to tokens})

Search for rejections for patent application number 20250181679