It is incorrect to "normalize" // in HTTP URL paths

Posted by pabs3 4 hours ago


Comments

Comment by echoangle 58 minutes ago

> Wait, are there any implementations that wrongly collapse double-slashes?

> nginx with merge_slashes

How can it be wrong if it is server-side? If the server wants to treat those paths as equivalent, it can.

It would only be wrong if a client does it and requests a different URL than the user entered, right?

Comment by leni536 23 minutes ago

It can't be. It's the same confusion as "email address normalization" being wrong (for example when Gmail ignores dots when mapping an address to an inbox).

It matters where the normalization happens, and server-side behavior is out of scope for these identifier RFCs.

Comment by OoooooooO 23 minutes ago

Yeah I would say that falls under the origin defining both paths as equivalent.

> Therefore, collapsing // to / in HTTP URL path segments is not correct normalization. It produces a different, non-equivalent identifier unless the origin explicitly defines those two paths as equivalent.

Comment by MattJ100 2 hours ago

URL parsing/normalisation/escaping/unescaping is a minefield. There are many edge cases where every implementation does things differently. This is a perfect example.

It gets worse if you are mapping URLs to a filesystem (e.g. for serving files). Even though they look similar, URL paths have different capabilities and rules than filesystems, and different filesystems also vary. This is also an example of that (I don't think most filesystems support empty directory names).
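To illustrate the mismatch, here is a minimal sketch using Python's `posixpath.normpath`, which applies POSIX-style filesystem rules. The `/files/...` paths are made up for the example. A URL path with `//` has an empty segment, but filesystem normalization silently drops it:

```python
import posixpath

# Hypothetical request paths; only the second contains an empty segment.
plain = "/files/report.txt"
doubled = "/files//report.txt"

# As URL paths, the two have different segment lists.
print(plain.split("/"))    # ['', 'files', 'report.txt']
print(doubled.split("/"))  # ['', 'files', '', 'report.txt']

# Filesystem-style normalization maps both to the same file path.
print(posixpath.normpath(plain))    # /files/report.txt
print(posixpath.normpath(doubled))  # /files/report.txt
```

So a server that maps URL paths straight onto a POSIX filesystem ends up treating the two URLs as the same file, whether or not that was intended.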

Comment by dale_glass 1 hour ago

But maybe you should anyway.

Because maybe you use S3, which treats `foo/bar.txt` and `foo//bar.txt` as entirely separate things. Because to S3, directories don't exist and those are literally the exact names of the keys under which data is stored.

So you have script A concatenate "foo" + "/bar" and script B concatenate "foo/" + "/bar", and suddenly you have a weird problem.

I can't imagine a real use case where you'd think this is desirable.
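A rough sketch of the two-script problem described above. A plain dict stands in for the bucket here (this is not the S3 API), and `put` is a hypothetical helper; the point is only that keys are opaque strings:

```python
# A dict stands in for a bucket: keys are flat strings, not directory paths.
bucket = {}

def put(prefix: str, name: str, body: str) -> str:
    """Store body under prefix + name, exactly as concatenated."""
    key = prefix + name
    bucket[key] = body
    return key

key_a = put("foo", "/bar.txt", "written by script A")
key_b = put("foo/", "/bar.txt", "written by script B")

print(key_a)        # foo/bar.txt
print(key_b)        # foo//bar.txt
print(len(bucket))  # 2
```

Two objects now exist where the authors of both scripts expected one.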

Comment by Mordisquitos 16 minutes ago

> I can't imagine a real use case where you'd think this is desirable.

Not S3, but here's a literal real use case: the entry for the Iraqw word /ameeni (woman) in Wiktionary.

https://en.wiktionary.org/wiki//ameeni

If for whatever reason your S3 keys contained English words and their translations separated by a slash, you would have a real problem if one of your scripts were to concatenate woman, / and /ameeni as woman/ameeni instead of woman//ameeni in the English/Iraqw case.
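As a small sketch of the Wiktionary case: Python's `urllib.parse.urlsplit` leaves the path untouched, so the page title can be recovered intact. The `/wiki/` prefix handling below is an assumption made for illustration, not a description of how Wiktionary routes requests:

```python
from urllib.parse import urlsplit

url = "https://en.wiktionary.org/wiki//ameeni"
path = urlsplit(url).path  # urlsplit does not collapse slashes

WIKI_PREFIX = "/wiki/"  # assumed article prefix, for illustration only

# Everything after the prefix is the page title, leading slash included.
title = path[len(WIKI_PREFIX):]

# Collapsing // first silently changes which title is requested.
collapsed_title = path.replace("//", "/")[len(WIKI_PREFIX):]

print(path)             # /wiki//ameeni
print(title)            # /ameeni
print(collapsed_title)  # ameeni
```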

Comment by secondcoming 1 hour ago

If a user of S3 knows that directories aren't real why would they expect directory-related normalisation to happen?

Comment by PunchyHamster 1 hour ago

We cut those and a few others because historically there were exploits relying on them.

Nothing on the web is "correct", deal with it.

Comment by leni536 33 minutes ago

I don't think it's incorrect for distinct paths to point to the same resource.

Of course you shouldn't assume that in a client. If you are implementing against an API don't deviate regarding // and trailing / from the API documentation.

Comment by sfeng 1 hour ago

What I’ve learned from doing this type of normalization is that, whatever the specification says, you will always find some website that uses some insane URL tweak to decide what content it should show.

Comment by renewiltord 1 hour ago

I’m going to keep doing it.

Comment by mjs01 2 hours ago

// is useful if the server needs to serve both static files from the filesystem and embedded files like a webpage. // can be used in the embedded files' URLs because they will never conflict with filesystem paths.

Comment by PunchyHamster 1 hour ago

....just serve it from other paths

Comment by janmarsal 1 hour ago

i'm gonna do it anyway

Comment by leni536 1 hour ago

Wait until you try http:/example.com and http://////example.com in your browser.

Comment by stanac 54 minutes ago

In both cases I get https://example.com/ in Firefox.

Comment by WesolyKubeczek 2 hours ago

It is probably “incorrect”, but given the established actual usage over the decades, it’s most likely what you need to do nevertheless.

Not doing it is like punishing people for not using Oxford commas, or entering an hour long debate each time someone writes “would of” instead of “would have”. It grinds my gears too, but I have different hills to die on.

Comment by bazoom42 1 hour ago

If different clients do it differently, you have incompatibilities. This punishes everybody. Since normalizing // to / removes information which may be significant, the obviously correct choice is following the spec.

Comment by PunchyHamster 1 hour ago

if it is significant, you coded your app wrong, plain and simple

Comment by jeroenhd 1 hour ago

Of course not. It's an explicit feature of every specification.

Plenty of websites rewrite paths like /a/b/c/d into a backend service call like /?w=a&x=b&y=c&z=d. In that scheme, /a//c/d would rewrite to /?w=a&x=&y=c&z=d, something entirely distinct from /a/c/d, which works out to /?w=a&x=c&y=d.

It's not the application's fault that the people attempting to configure web server URLs don't know how web server URLs work.
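A minimal sketch of that kind of positional rewrite, assuming the parameter names w, x, y, z from the example above. The `rewrite` function is hypothetical and not taken from any particular web server:

```python
from urllib.parse import urlencode

PARAM_NAMES = ["w", "x", "y", "z"]  # positional parameter names from the example

def rewrite(path: str) -> str:
    """Map each path segment, by position, to a query parameter."""
    segments = path.split("/")[1:]  # drop the empty string before the leading slash
    pairs = list(zip(PARAM_NAMES, segments))
    return "/?" + urlencode(pairs)

print(rewrite("/a/b/c/d"))  # /?w=a&x=b&y=c&z=d
print(rewrite("/a//c/d"))   # /?w=a&x=&y=c&z=d
print(rewrite("/a/c/d"))    # /?w=a&x=c&y=d
```

The empty segment survives as an empty `x` value. Collapsing `//` before the rewrite would shift every later segment one position to the left.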

Comment by bazoom42 1 hour ago

Why?

Comment by Etheryte 2 hours ago

Not sure I agree. The correct thing is to not mess with the URL at all if you're unsure what to do with it. Doing nothing is the easiest option of all, so why not do that?

Comment by j16sdiz 2 hours ago

Because you need some consistency or normalisation before applying ACLs or doing routing?
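One way this concern can show up, sketched under made-up assumptions: an access check that looks at the raw path and a backend that collapses slashes before routing. Both functions and the `/admin/` prefix are hypothetical:

```python
import re

def acl_allows(raw_path: str) -> bool:
    """Hypothetical front-end check: block anything under /admin/."""
    return not raw_path.startswith("/admin/")

def backend_route(raw_path: str) -> str:
    """Hypothetical backend that collapses repeated slashes before routing."""
    collapsed = re.sub(r"/{2,}", "/", raw_path)
    return "admin" if collapsed.startswith("/admin/") else "public"

request_path = "//admin/users"

print(acl_allows(request_path))     # True: the check sees no /admin/ prefix
print(backend_route(request_path))  # admin: the backend routes it there anyway
```

The mismatch comes from the two layers disagreeing, not from either rule on its own. Applying the same rule in both places, whether that means collapsing or not collapsing, closes the gap.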

Comment by jeroenhd 1 hour ago

URL normalization is defined and it doesn't include collapsing slashes.

Not that you can't include custom normalization rules (like collapsing slashes, tolower()ing the entire path, or removing the query part of the URL), but that's not part of the standard. If you're doing anything extra, the risk of breaking stuff is on you.
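For reference, the path-related step of RFC 3986 syntax-based normalization is dot-segment removal (section 5.2.4). Below is a sketch of that algorithm, written from the RFC's step-by-step description; the function name follows the RFC, but this is an illustrative implementation rather than a vetted library:

```python
def remove_dot_segments(path: str) -> str:
    """Sketch of RFC 3986 section 5.2.4; empty segments are left alone."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]          # "/./" becomes "/"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]          # "/../" becomes "/"
            if output:
                output.pop()         # drop the previous segment
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/") to the output.
            start = 1 if path.startswith("/") else 0
            idx = path.find("/", start)
            if idx == -1:
                output.append(path)
                path = ""
            else:
                output.append(path[:idx])
                path = path[idx:]
    return "".join(output)

print(remove_dot_segments("/a/b/c/./../../g"))  # /a/g
print(remove_dot_segments("/a//b"))             # /a//b
```

The first call is one of the RFC's own examples. The second shows that a doubled slash passes through unchanged, since nothing in the algorithm treats an empty segment specially.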
