[File] new feature of recognizing JSON (1.35) / -k not working?

Yaroslav Halchenko list-file at onerussian.com
Fri Apr 26 21:16:57 UTC 2019


Dear Magic(al) File people,

A little background: while using git-annex to manage repositories with both
text files to be committed to git and data files to go under git-annex control,
we rely on libmagic to provide the mimetype that drives git-annex's decision.
We say that all files of some text/ mimetype should get committed directly into
git and the rest managed by git-annex.  This worked really splendidly -- all
scripts, .py, .txt, .md, .json, .yaml etc. files got automagically committed
into git, while data/binary files of any kind (images, neuroimaging data etc.)
went to annex.
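
For concreteness, the rule amounts to something like the Python sketch below.
It is an illustration only, not our actual configuration, and the helper name
goes_to_git is made up for this example.

```python
def goes_to_git(mime_type: str) -> bool:
    """Return True when a file with this MIME type would be committed to git.

    Hypothetical helper: mirrors the "every text/* type goes to git" rule.
    """
    return mime_type.strip().lower().startswith("text/")


# Everything that is not text/* ends up in the annex.
for mime in ("text/plain", "text/x-python", "application/json", "image/png"):
    target = "git" if goes_to_git(mime) else "annex"
    print(f"{mime} -> {target}")
```

With such a rule, any type that moves from text/* to application/* silently
changes where the file ends up.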

Today, after some time of troubleshooting (a bit of a pain, since we used a
bundled newer libmagic1 while the system-wide file was an old one ;)), we
realized that since libmagic/file 5.35 there is special detection of .json
files, and they are no longer reported as text/*:

	$> file --mime-type sample.json
	sample.json: application/json

	$> file --version 
	file-5.35

which of course breaks our workflow, and I see a potential problem with such
detection in general:

	JSON is a subset of YAML, so any JSON file can also be reported as a YAML
	file.  It is not reported that way ATM since there seems to be no
	application/yaml detection.  But what will happen when such detection
	appears?  Change again?

And I wondered how we could remedy the situation for our use case.  I looked at
the file --help options, and it seems that -k, --keep-going could
potentially report all the matches, from more specific (application/json) to
less specific (e.g. application/yaml and then text/plain).  But that
option does not seem to work correctly in our .json case:

	$> file --mime -kr 1.json 
	1.json: application/json
	- 
	- ; charset=utf-8

and the output is similarly ugly for a sample .log file lying around:

	$> file --mime -kr tests.log
	tests.log: text/plain
	- ; charset=utf-8


so it seems that in the .json case it detects some additional mimetype
(not sure what it would be for the .log, if any), but does not actually output
it.  Note that with an older file/magic version there is no strange empty
second entry for that .log file:

	$> file --mime -kr tests.log 
	tests.log: text/plain; charset=utf-8

	$> file --version                  
	file-5.22
	magic file from /etc/magic:/usr/share/misc/magic


Bug?  

More info -- for some files I found it reporting multiple types:

	$> file -kr --mime-type dcmqrscp
	dcmqrscp: application/x-pie-executable
	- application/octet-stream

and for a sweep under /usr/bin:

	hopa:/usr/bin
	$> file --mime-type -k * | grep '/.*/' | sed -e 's,.*: *,,g' | sort | uniq -c
		187 application/x-executable\012- application/octet-stream
	   2828 application/x-pie-executable\012- application/octet-stream
		  9 text/x-perl\012- text/html\012- 
		 35 text/x-perl\012- text/x-c\012- 
		  1 text/x-perl\012- text/x-java\012- 
		  1 text/x-perl\012- text/x-makefile\012- 
		 46 text/x-perl\012- text/x-perl\012- 
		  2 text/x-perl\012- text/x-tex\012- 
		  1 text/x-shellscript\012- application/octet-stream


but for many files there is that blank second entry, as in our .json case:

	hopa:/usr/bin
	$> file --mime-type -k * | sed -e 's,.*: *,,g' | grep '\\012- \\012-' | sort | uniq -c
		406 text/x-perl\012- \012- 
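
To post-process such lines on our side, a minimal Python sketch could look like
the one below.  It assumes the non-raw form shown above (without -r the newline
between matches is printed as the literal characters \012 followed by "- ");
the function name split_matches is made up for this example.

```python
import re

# Without -r, file prints the newline between matches as the literal
# characters \012, followed by "- " (the trailing space may be missing).
_SEPARATOR = re.compile(r"\\012-[ \t]*")


def split_matches(types: str) -> list[str]:
    """Split the type part of a `file -k` line, dropping blank entries."""
    return [part.strip() for part in _SEPARATOR.split(types) if part.strip()]


print(split_matches(r"text/x-perl\012- text/html\012- "))
# prints ['text/x-perl', 'text/html']
print(split_matches(r"text/x-perl\012- \012- "))
# prints ['text/x-perl']
```

This only hides the blank entries; it does not tell us which type, if any, was
supposed to be printed in their place.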


The overall question is how we could reliably (now, in the future, and ideally
also with past versions of file/magic ;)) discover whether a given file could
be considered some kind of text/ file?
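
One direction we could imagine, sketched in Python under stated assumptions and
not confirmed to hold across file versions: look at the charset that
file --mime reports next to the type, and keep a small allowlist for the
--mime-type case.  The names looks_like_text and TEXT_LIKE_TYPES are made up
for this example, and the idea that charset=binary marks non-text content is an
assumption about file's behavior.

```python
# Assumption: application/* types that we would still want treated as text.
TEXT_LIKE_TYPES = {"application/json"}


def looks_like_text(mime_output: str) -> bool:
    """Guess text-ness from the part of `file --mime` output after 'name: '.

    Example input: "application/json; charset=utf-8".
    """
    mime_type, _, params = mime_output.partition(";")
    mime_type = mime_type.strip().lower()
    if mime_type.startswith("text/") or mime_type in TEXT_LIKE_TYPES:
        return True
    # Assumption: file reports charset=binary for content it cannot decode
    # as text, so any other reported charset hints at a text file.
    charset = ""
    for param in params.split(";"):
        key, _, value = param.partition("=")
        if key.strip().lower() == "charset":
            charset = value.strip().lower()
    return charset not in ("", "binary")
```

For example, looks_like_text("application/json; charset=utf-8") is True, while
an input with charset=binary or with no charset and a non-text type is False.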

Thank you in advance!
-- 
Yaroslav O. Halchenko
Center for Open Neuroscience     http://centerforopenneuroscience.org
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
Phone: +1 (603) 646-9834                       Fax: +1 (603) 646-1419
WWW:   http://www.linkedin.com/in/yarik        

