Validate a String as HTML Using SQL

Is it possible to check if a string contains valid HTML using just SQL?

We would like to write a SQL query that takes an input string and returns TRUE if the string is valid HTML and FALSE otherwise.

That said, let’s dive into the restricted problem of checking whether a string is valid HTML under our own definition. What do we consider valid HTML for the purposes of this article?

We would like to be able to validate the following HTML documents.

[1] In the document below, the </head> tag is missing.

[2] In the document below, the </html> tag appears before the </head> tag.

First, we tokenize the input string and extract the open and close tags in the order they appear in the document string. We use the query below to extract the HTML tags into a separate table with one row per HTML tag in the document.

This table is then again processed to assign the following to each row:

We then remove all the unpaired tags (i.e. tags that are not supposed to have an open and close pair). For this article, there are only 2 such tags, namely <br> and <br/>.
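The tokenizing and filtering steps above can be modeled outside SQL. The following is a minimal Python sketch, not the article's query: the regular expression, the `tokenize` helper, and the `(name, kind)` row shape are assumptions made for illustration, and list order stands in for the row position a SQL table would store in an explicit column.

```python
import re

# Assumption: a tag looks like <name ...>, </name>, or <name/>; attributes are ignored.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^<>]*>")

# The only unpaired tags handled in this article are <br> and <br/>.
UNPAIRED = {"br"}


def tokenize(document):
    """Return one (name, kind) row per paired tag, in document order."""
    rows = []
    for match in TAG_RE.finditer(document):
        slash, name = match.groups()
        name = name.lower()
        if name in UNPAIRED:
            continue  # drop <br> and <br/>, which never have a partner
        rows.append((name, "close" if slash else "open"))
    return rows


print(tokenize("<html><head></head><body>Hi<br/>there<br></body></html>"))
```

Text between tags is discarded, since only the sequence of open and close tags matters for the checks that follow.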

The first solution is seemingly correct but fails to correctly identify our 2nd invalid document above as invalid HTML.

This solution fails to invalidate the 2nd invalid HTML document because we fail to track the order in which the close tags appear relative to other tags in the document. While we correctly check that a close tag appears after its corresponding open tag, we don’t check whether there are any OTHER unclosed open tags between this close tag and its corresponding open tag.

In the example below, the <head> tag on line 2 is unclosed whereas the enclosing <html> tag (on line 1) has been closed by the </html> tag (on line 3).

Runtime complexity: The runtime complexity of this solution is O(n²), where n is the number of tags in the document. This is because we join each tag with every other tag before it in the query above. This is the dominant cost in the entire solution.
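To see why a per-tag check of this kind misses badly nested tags, here is a hedged Python model. It is an assumed reconstruction of the idea (each close tag is compared against every tag before it), not the article's SQL; `naive_is_valid` and the `(name, kind)` row shape are hypothetical names chosen for the sketch.

```python
def naive_is_valid(tags):
    """tags: list of (name, kind) pairs in document order; kind is 'open' or 'close'."""
    for i, (name, kind) in enumerate(tags):
        if kind != "close":
            continue
        # Compare this close tag with every earlier tag: the O(n^2) part.
        earlier = tags[:i]
        opens = sum(1 for n, k in earlier if n == name and k == "open")
        closes = sum(1 for n, k in earlier if n == name and k == "close")
        if opens <= closes:
            return False  # no earlier unmatched open tag with this name
    # Every open tag must also be closed somewhere in the document.
    for name in {n for n, _ in tags}:
        opens = sum(1 for n, k in tags if n == name and k == "open")
        closes = sum(1 for n, k in tags if n == name and k == "close")
        if opens != closes:
            return False
    return True


missing_close = [("html", "open"), ("head", "open"), ("html", "close")]
wrong_order = [("html", "open"), ("head", "open"), ("html", "close"), ("head", "close")]
print(naive_is_valid(missing_close))  # False: <head> is never closed
print(naive_is_valid(wrong_order))    # True: the nesting error goes unnoticed
```

The second input is accepted because each tag name is balanced on its own; nothing in the check looks at which other tags are still open when a close tag arrives.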

The “inside out solution” processes strings from the innermost matching pair of open and close HTML tags. This solution relies on the fact that there is at least 1 matching open/close pair of tags that appear next to each other (on adjacent rows) in a valid HTML document.

If we remove this matching pair, then we can find and eliminate the next matching pair, and so on until no more tags remain (in a valid HTML document). We know that in a document with 2N tags, we will perform this match-and-eliminate step at most N times to reach an empty list of tags. If we still have some tags left after N rounds of matching and elimination, the HTML document is invalid.
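The match-and-eliminate idea can be sketched in a few lines of Python. This is an illustrative model under stated assumptions rather than the article's recursive CTE: `inside_out_is_valid` is a hypothetical helper, tags are `(name, kind)` pairs in document order, and exactly one adjacent pair is removed per round.

```python
def inside_out_is_valid(tags):
    """tags: list of (name, kind) pairs in document order; kind is 'open' or 'close'."""
    remaining = list(tags)
    max_rounds = len(remaining) // 2  # 2N tags need at most N eliminations
    for _ in range(max_rounds):
        for i in range(len(remaining) - 1):
            (name_a, kind_a), (name_b, kind_b) = remaining[i], remaining[i + 1]
            if kind_a == "open" and kind_b == "close" and name_a == name_b:
                del remaining[i:i + 2]  # eliminate the innermost adjacent pair
                break
        else:
            break  # no adjacent matching pair left, so stop early
    return not remaining  # valid only if every tag was eliminated


valid = [("html", "open"), ("head", "open"), ("head", "close"),
         ("body", "open"), ("body", "close"), ("html", "close")]
wrong_order = [("html", "open"), ("head", "open"), ("html", "close"), ("head", "close")]
print(inside_out_is_valid(valid))        # True
print(inside_out_is_valid(wrong_order))  # False
```

Each round scans the remaining list once and there are at most N rounds, so this sketch also does quadratic work in the number of tags.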

Example-1: For the input below,

This is what the recursive execution looks like and the animation below shows the order in which the pairs of open/close tags are matched and eliminated.

Second solution processing a valid HTML string (Image by author)

Example-2: For the input below,

This is what the recursive execution looks like and the animation below shows the order in which the pairs of open/close tags are matched and eliminated. Since this is invalid input, the processing stops when there’s no matching pair of adjacent tags left to process.

Second solution processing an invalid HTML string (Image by author)

Runtime complexity: The runtime complexity of this solution is O(n²), where n is the number of tags in the document. We analyze the cost on both a valid as well as an invalid input:

We saw a way to check if a string is valid HTML or not. This method utilizes recursive CTEs to iteratively reduce the problem set size at every step.

Recursive CTEs in SQL are powerful tools that can solve a variety of problems if used creatively. However, recursive CTEs aren’t very space efficient. Where traditional imperative programming languages allow you to perform in-place updates, recursive CTEs require you to copy the data at every step.
