Add mime-support as upstream source for MIME types. #205

Open: wants to merge 3 commits into base: master

@sgpinkus sgpinkus commented Jul 20, 2020

Adds the mime-support mime.types file as a fourth non-custom upstream source.

The file structure is identical to Apache's, as the upstream file is a drop-in replacement for the Apache mime.types file (actually, it's the other way around).
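As a rough illustration of that layout, here is a minimal sketch of a parser for Apache-style mime.types text, where each non-comment line is a type followed by zero or more extensions. The function name `parseMimeTypes` and the sample lines are hypothetical and are not taken from this PR or from the upstream file.

```javascript
// Hypothetical helper (not part of this PR): parse Apache-style mime.types text.
// Assumed layout: "<type> <ext> <ext> ..." per line, "#" starts a comment line.
function parseMimeTypes (text) {
  const result = {}
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim()
    // Skip blank lines and comment lines.
    if (line === '' || line.startsWith('#')) continue
    const [type, ...extensions] = line.split(/\s+/)
    const key = type.toLowerCase()
    if (!result[key]) result[key] = []
    // Merge extensions, ignoring duplicates for the same type.
    for (const ext of extensions) {
      if (!result[key].includes(ext)) result[key].push(ext)
    }
  }
  return result
}

// Made-up sample input, for illustration only.
const sampleMimeTypesText = [
  '# a comment line',
  'application/x-example\t\tex1 ex2',
  'text/x-example-no-extensions'
].join('\n')

console.log(parseMimeTypes(sampleMimeTypesText))
```

Types listed without extensions are kept with an empty array so that later steps can still see that the upstream file knows about them.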

In this PR I've added a basic script to print out how many MIME types come from each source. I've also labelled custom types explicitly to help with tracing the source in db.json:

Before:

total 2193
iana 1861
apache 250
nginx 13
mime-support 0
custom 0
undefined 0
other 69

After:

total 2381
iana 1861
apache 250
nginx 13
mime-support 192
custom 65
undefined 0
other 0
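The per-source tally above could be produced along these lines. This is only a sketch under assumptions: the function name `countBySource`, the bucket list, and the sample data are hypothetical, and it is not the PR's actual scripts/stats.js. In particular, how the PR's script separates "undefined" from "other" is not shown in this thread; the sketch treats a missing `source` field as "undefined" and any unrecognized value as "other".

```javascript
// Hypothetical sketch (not the PR's scripts/stats.js): tally db.json-style
// entries by their "source" field.
const KNOWN_SOURCES = ['iana', 'apache', 'nginx', 'mime-support', 'custom']

function countBySource (db) {
  const counts = { total: 0 }
  for (const source of KNOWN_SOURCES) counts[source] = 0
  counts.undefined = 0
  counts.other = 0

  for (const entry of Object.values(db)) {
    counts.total++
    if (entry.source === undefined) {
      counts.undefined++ // assumption: no "source" field at all
    } else if (KNOWN_SOURCES.includes(entry.source)) {
      counts[entry.source]++
    } else {
      counts.other++ // assumption: a source value outside the known list
    }
  }
  return counts
}

// Made-up miniature database, for illustration only.
const sampleStatsDb = {
  'application/x-example-a': { source: 'iana' },
  'application/x-example-b': { source: 'mime-support', extensions: ['exb'] },
  'application/x-example-c': { source: 'custom' }
}

for (const [name, count] of Object.entries(countBySource(sampleStatsDb))) {
  console.log(name, count)
}
```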

  • Copyright info: https://salsa.debian.org/debian/mime-support/-/blob/master/debian/copyright
  • DB lineage: A major upstream source of these MIME types is Freedesktop.org's shared-mime-info. Beyond that the lineage is unclear, but you can see the commit history here. I feel it is much the same as Apache and Nginx: the maintainers add entries as they see fit when it's reasonable. Also note that this file is the default mime.types file, and the one an Apache server (or probably any other kind of HTTP server) will use on a Debian-based distro when Apache is installed from the package manager.

Contributor

@dougwilson dougwilson left a comment

Awesome, thank you for putting this together! I left a couple of comments, plus some questions that the PR text does not answer:

How does the mime-support project source and gather its type data? I see lots of new data in here, so I would like to understand the history of the incoming data and how they vet new data they add over time.

If they are sourcing from "shared-mime-info", then why don't we just source from there instead of mime-support? What does the intermediate dependency add apart from complexity and indirection? In addition, we need to get in contact with the maintainers to answer the question of how they add new ones instead of just making assumptions :)

README.md Outdated
/**
* URL for the mime.types file in the Apache HTTPD project source.
*/
var URL = 'https://salsa.debian.org/debian/mime-support/-/raw/master/mime.types?inline=false'
Contributor

Please make sure to follow up here with the site operator's permission for us to start polling / scraping their endpoint with an automated process. I couldn't find any public TOS on a quick look, so if there is one that says it's OK, then that's fine and we don't need their explicit permission.

Author

Yes, agreed, that is required for merge. Will do.

scripts/stats.js Outdated
@@ -1066,7 +1117,8 @@
     "extensions": ["pgp"]
   },
   "application/pgp-keys": {
-    "source": "iana"
+    "source": "iana",
+    "extensions": ["key"]
Contributor

Where does this come from? The official record for this type (https://tools.ietf.org/html/rfc3156#section-9.3) states the file extension is ".asc". It jumped out at me since ".key" is the extension for Apple Keynote files.

Author

gnupg uses this for keys.

Contributor

Gotcha. Can you link to the source of that information?

@@ -768,7 +809,7 @@
   },
   "application/mathematica": {
     "source": "iana",
-    "extensions": ["ma","nb","mb"]
+    "extensions": ["ma","nb","mb","nbp"]
Contributor

Where does this new extension come from? I see there is a registration from Wolfram Research (the owners of the spec) at https://www.iana.org/assignments/media-types/application/mathematica but this extension is not listed there.

   },
   "application/font-tdpfr": {
     "source": "iana",
     "extensions": ["pfr"]
   },
   "application/font-woff": {
     "source": "iana",
-    "compressible": false
+    "compressible": false,
+    "extensions": ["woff"]
Contributor

This mapping is obsolete; many of these font types (like .woff) moved under the font/ tree (font/woff in this case). More information can be found here: http://tools.ietf.org/rfc/rfc8081.txt

Author

Obsolete, yes. Is not being obsolete a prerequisite for including the extension? Many files are going to have that extension regardless of whether someone declares it obsolete.

Contributor

The MIME type is what is obsolete, not the extension. The files would simply use the non-obsolete file extension. My comment is about mapping the extension here to the obsolete type instead of the current type.

@@ -502,20 +523,26 @@
     "source": "iana"
   },
   "application/font-sfnt": {
-    "source": "iana"
+    "source": "iana",
+    "extensions": ["otf","ttf"]
Contributor

This mapping is obsolete; many of these font types moved under the font/ tree. More information can be found here: http://tools.ietf.org/rfc/rfc8081.txt

Author

Obsolete, yes. Is not being obsolete a prerequisite for including the extension? Many files are going to have that extension regardless of whether someone declares it obsolete.

Contributor

The MIME type is what is obsolete, not the extension. The files would simply use the non-obsolete file extension. My comment is about mapping the extension here to the obsolete type instead of the current type.

@@ -2351,7 +2419,8 @@
     "source": "iana"
   },
   "application/vnd.debian.binary-package": {
-    "source": "iana"
+    "source": "iana",
+    "extensions": ["deb","ddeb","udeb"]
Contributor

I see deb and udeb listed in the type registration https://www.iana.org/assignments/media-types/application/vnd.debian.binary-package , but where does ddeb come from?

-    "compressible": false
+    "source": "mime-support",
+    "compressible": false,
+    "extensions": ["m3u8"]
Contributor

It was my understanding this extension is for the type "application/vnd.apple.mpegurl", right? Looking at the spec https://tools.ietf.org/html/rfc8216#section-4 the section says:

Each Playlist file MUST be identifiable either by the path component
of its URI or by HTTP Content-Type. In the first case, the path MUST
end with either .m3u8 or .m3u. In the second, the HTTP Content-Type
MUST be "application/vnd.apple.mpegurl" or "audio/mpegurl". Clients
SHOULD refuse to parse Playlists that are not so identified.

Author

The mapping is not reversible. They both have the same extension?

Contributor

I'm not sure what your comment means, I'm sorry. Can you state it a different way?

Author

Files with MIME type application/x-mpegurl and application/vnd.apple.mpegurl can both have the extension m3u8.

Contributor

@dougwilson dougwilson Jul 20, 2020

Yep, I get it. My question is: can you cite where the MIME type "application/x-mpegurl" comes from? I don't see it anywhere in the specification. Without a source, one could always argue that the "foo/bar" MIME type is also m3u8 :)

Author

@sgpinkus sgpinkus Jul 20, 2020

Not sure. Answering all of your questions, which are legit, will require more info from the maintainer on how this DB was pulled together. I don't expect you to merge without that. The fact remains, though, that this is the source of the Debian Linux /etc/mime.types file, so in my mind that gives it as much legitimacy as the Apache or Nginx versions.

Contributor

@dougwilson dougwilson Jul 20, 2020

Sure, but that doesn't mean they do not have outdated or invalid entries. When we pulled in NGINX and Apache back in the day, we did this same process and fixed a lot of bad data in their files. That is what we'd need to do in this same case.

Author

Ah, OK. That clears it up. Not sure if I have the time right now for all that!

@@ -7418,7 +8024,7 @@
   },
   "text/calendar": {
     "source": "iana",
-    "extensions": ["ics","ifb"]
+    "extensions": ["ics","ifb","icz"]
Contributor

The section of the text/calendar spec about the type (https://tools.ietf.org/html/rfc5545#section-8.1) lists the ics and ifb file extensions, but makes no mention of the icz extension.

   },
   "font/ttf": {
     "source": "iana",
     "compressible": true,
-    "extensions": ["ttf"]
+    "extensions": ["ttf","otf"]
Contributor

Why add .otf to this? My understanding is that .otf = font/otf and .ttf = font/ttf (which is what the database was prior to adding this source).

@dougwilson
Contributor

@sgpinkus I'm looking into what is coming in because there are several known places that use this module to provide file extension -> mime type mappings: The Node.js package https://www.npmjs.com/package/mime and GitHub Pages https://docs.github.com/en/enterprise/2.13/user/articles/mime-types-on-github-pages are two very notable ones. So we have to be very careful pulling in conflicting entries and understand how they are going to affect the behavior of these projects if there are conflicting mappings pulled in.
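One way to surface the conflicting mappings mentioned here is to invert the database and list every extension claimed by more than one type. The sketch below assumes db.json-style entries with an optional `extensions` array; the function name `findExtensionConflicts` is hypothetical, and the sample entries only mirror mappings discussed in this review, not the full database.

```javascript
// Hypothetical sketch: report extensions that map to more than one MIME type.
function findExtensionConflicts (db) {
  const typesByExtension = new Map()
  for (const [type, entry] of Object.entries(db)) {
    for (const ext of entry.extensions || []) {
      if (!typesByExtension.has(ext)) typesByExtension.set(ext, [])
      typesByExtension.get(ext).push(type)
    }
  }

  // Keep only extensions claimed by two or more types.
  const conflicts = {}
  for (const [ext, types] of typesByExtension) {
    if (types.length > 1) conflicts[ext] = types
  }
  return conflicts
}

// Illustrative sample modelled on entries discussed in this review.
const sampleConflictDb = {
  'application/font-woff': { source: 'iana', extensions: ['woff'] },
  'font/woff': { source: 'iana', extensions: ['woff'] },
  'font/ttf': { source: 'iana', extensions: ['ttf', 'otf'] },
  'font/otf': { source: 'iana', extensions: ['otf'] }
}

console.log(findExtensionConflicts(sampleConflictDb))
```

Running a check like this before and after adding a new upstream source would show which extension lookups could change for downstream consumers.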

@sgpinkus
Author

> @sgpinkus I'm looking into what is coming in because there are several known places that use this module to provide file extension -> mime type mappings ...

OK. Understood. Still waiting on more info from the maintainer. Will get back to you with any new info. If you can't merge, that is totally fine.

@sgpinkus
Author

sgpinkus commented Jul 21, 2020

@dougwilson I got a reply from the maintainer of mime-support, Charles Plessy. It seems many of the MIME types in the mime-support DB were added by hand over the years. He indicated that pulling directly from IANA would have been preferable now that IANA has become more receptive to adding new MIME types.

I pointed out that you already have a script that pulls from IANA, and directed him to this repository and this PR.

Still, it would be useful to get some of the types in mime-support added here. I guess, in retrospect, the proper process for doing this is: 1. IANA; 2. before IANA registration, the ad hoc "custom" type registration process you have set up here.


For reference, here are the 192 MIME types in mime-support that are not in mime-db, in CSV and JSON. The script used to generate them is included.
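The attached script is not reproduced in this thread. As a rough sketch of how such a list could be generated, the following compares a parsed upstream map (type to extensions) against db.json-style entries and emits CSV. The function names `missingTypes` and `toCsv`, the CSV column layout, and the sample data are all assumptions made for illustration.

```javascript
// Hypothetical sketch: list upstream types that are absent from the database.
function missingTypes (upstream, db) {
  return Object.keys(upstream)
    .filter((type) => !Object.prototype.hasOwnProperty.call(db, type))
    .sort()
}

// Assumed CSV layout: a header row, then "type,space-separated extensions".
function toCsv (types, upstream) {
  const rows = types.map((type) => `${type},${upstream[type].join(' ')}`)
  return ['type,extensions', ...rows].join('\n')
}

// Made-up sample data, for illustration only.
const sampleUpstream = {
  'application/x-example-new': ['exn'],
  'application/x-example-known': ['exk']
}
const sampleExistingDb = {
  'application/x-example-known': { source: 'iana', extensions: ['exk'] }
}

const sampleMissing = missingTypes(sampleUpstream, sampleExistingDb)
console.log(toCsv(sampleMissing, sampleUpstream))
console.log(JSON.stringify(sampleMissing, null, 2))
```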

@younggun23

https://github.com/younggun23/mime-db/Add mime-support as upstream source for MIME types


3 participants