Detecting URLs/links one after the other

As a developer working on a system that scans people’s messages for malicious links, I recently encountered a bug during testing that raised some concerns. The issue arises when a user sends a malicious URL in a specific format that bypasses our detection mechanism.

Typically, our system detects malicious URLs whether they are sent in markup, hidden, or simply included in the message. However, we found that when a user sends a URL twice in a row with no separator, in the following format: https://example.com/https://example.com, the message does not get flagged as expected.

Upon further investigation, I reviewed the regular expression the system currently uses to extract links from messages. Here is the code snippet:

import re

def extract_links(text):
    # Define a regular expression pattern to match URLs
    url_pattern = r'(https?://\S+?)(?:\)|\s|$)'
    # Find all matches of URLs in the text
    matches = re.findall(url_pattern, text)
    # Return the list of URLs found in the message
    return matches

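The purpose of this function is to return a list of all URLs found in the message. However, \S+? keeps consuming non-whitespace characters until it reaches a closing parenthesis, whitespace, or the end of the text, so two URLs joined without a separator, as in https://example.com/https://example.com, come back as a single match. The pattern has no check for a second https:// inside the same link, which seems to be why the repeated URL is never extracted on its own.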
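To make the behavior concrete, here is a small reproduction. It reuses the pattern from the snippet, with the import and return statement filled in so it runs on its own; the sample message is made up for illustration.

```python
import re

def extract_links(text):
    # Pattern from the post; import and return added so the snippet runs
    url_pattern = r'(https?://\S+?)(?:\)|\s|$)'
    return re.findall(url_pattern, text)

# Illustrative message containing the doubled URL
message = "check this https://example.com/https://example.com"
print(extract_links(message))
# ['https://example.com/https://example.com']
```

The whole doubled string comes back as one list entry, so anything downstream that compares entries against known URLs only ever sees the concatenated form.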

This is a significant concern, because it lets a malicious URL evade detection. I will need to revise the regex pattern to handle this scenario so that all URLs, including repeated ones, are accurately identified and flagged in the messages.
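One possible direction, as a rough sketch rather than the final fix: keep the existing pattern for the outer match, then cut each match wherever another http:// or https:// begins and report the pieces as well. The names extract_links_with_embedded, OUTER_URL, and SCHEME_START are hypothetical and not part of our system.

```python
import re

# Pattern from the post: one candidate per run of non-whitespace after a scheme
OUTER_URL = re.compile(r'(https?://\S+?)(?:\)|\s|$)')
# Assumed helper: zero-width match wherever an http(s) scheme begins
SCHEME_START = re.compile(r'(?=https?://)')


def extract_links_with_embedded(text):
    """Return each matched link plus any URLs concatenated inside it."""
    links = []
    for candidate in OUTER_URL.findall(text):
        links.append(candidate)
        starts = [m.start() for m in SCHEME_START.finditer(candidate)]
        if len(starts) > 1:
            # Cut the candidate at every scheme start so each piece is checked too
            ends = starts[1:] + [len(candidate)]
            links.extend(candidate[b:e] for b, e in zip(starts, ends))
    # Drop duplicates while keeping first-seen order
    return list(dict.fromkeys(links))


print(extract_links_with_embedded("https://example.com/https://example.com"))
# ['https://example.com/https://example.com', 'https://example.com/', 'https://example.com']
```

Keeping the full match alongside the pieces is a deliberate choice in this sketch: legitimate links sometimes carry another URL inside them (a redirect parameter, for instance), so the scanner would see both the original link and every URL embedded in it.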

Stay tuned for updates on this bug fix and further improvements to our message scanning system. Your security is our top priority.
