[File] new feature of recognizing JSON (5.35) / -k not working?

Christos Zoulas christos at zoulas.com
Sat Apr 27 23:19:30 UTC 2019


I understand what you are trying to do, and this is a valid request. There are three separate issues here:
1. You want to just identify text vs. binary files.
    There is no direct way to do this: file tries to print "text" in the description, but not in the
    mime output when an application/* type matches. Perhaps you can use --mime-encoding.
2. New magic changes the output (JSON in this case). You can exclude the JSON identification
    with -e json. In fact, perhaps you should exclude all the tests except "text" in your application.
3. The -k option is buggy. Please file a bug report at https://bugs.astron.com/ with reproducers.
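
For point 1, a minimal sketch of the --mime-encoding idea, written in Python, could look like the following. It assumes that file reports "binary" for non-text data and a charset name (us-ascii, utf-8, ...) otherwise; whether values such as "unknown-8bit" should count as text is left to the caller. The names NON_TEXT_ENCODINGS, is_text_encoding and looks_like_text are made up for illustration and are not part of file or libmagic.

```python
import os
import subprocess

# Assumption: every label other than these means "some kind of text".
NON_TEXT_ENCODINGS = {"binary"}


def is_text_encoding(encoding: str) -> bool:
    """Classify one line of `file --mime-encoding` output."""
    enc = encoding.strip().lower()
    return bool(enc) and enc not in NON_TEXT_ENCODINGS


def looks_like_text(path: str) -> bool:
    """Ask file(1) for the encoding of `path` and classify it."""
    # file(1) may exit 0 and print an error message for a missing path,
    # so check for the file ourselves before trusting the output.
    if not os.path.isfile(path):
        raise FileNotFoundError(path)
    result = subprocess.run(
        # --brief drops the "name: " prefix so only the encoding is printed.
        ["file", "--brief", "--mime-encoding", "--", path],
        capture_output=True, text=True, check=True,
    )
    return is_text_encoding(result.stdout)
```

Because the decision is based on the encoding rather than the mime type, a new magic entry such as application/json should not change the answer.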

Perhaps we can improve things by adding an --include flag that runs only the specified tests,
by fixing -k, and by adding a separate option that prints file's own idea of whether a file contains text or is binary.
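
Until -k is fixed, a consumer could tolerate the empty entries it currently prints. The sketch below, in Python, splits the part of the -k output that follows "name: " into its individual mime types, accepting both the real newline printed with -r and the escaped \012 form, and dropping blank entries and charset-only fragments. The function names split_keep_going and any_text_type are hypothetical, and the separator format is inferred only from the sample outputs quoted below, not from a specification.

```python
import re
from typing import List

# Separator between matches: a newline (with -r) or the literal text "\012"
# (without -r), followed by "-" and an optional space.
_SEPARATOR = re.compile(r"(?:\n|\\012)- ?")


def split_keep_going(description: str) -> List[str]:
    """Return the non-empty mime types found in one -k description."""
    types = []
    for part in _SEPARATOR.split(description):
        # Drop any "; charset=..." suffix, keeping only the mime type.
        mime = part.split(";", 1)[0].strip()
        if mime:
            types.append(mime)
    return types


def any_text_type(description: str) -> bool:
    """True if at least one reported type is text/*."""
    return any(m.startswith("text/") for m in split_keep_going(description))
```

This only works around the output format; it cannot recover a type that -k detected but failed to print.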

Best,

christos

> On Apr 26, 2019, at 5:16 PM, Yaroslav Halchenko <list-file at onerussian.com> wrote:
> 
> Dear Magic(al) File people,
> 
> A little background:  while using git-annex for managing repositories with both
> text files to be committed to git and data files to go under git-annex control,
> we rely on libmagic to provide the mimetype that drives git-annex's decision.  We say
> that all files of some  text/  mimetype should get committed directly into git
> and the rest managed by git-annex.  Worked really splendidly -- all
> scripts .py, .txt, .md, .json, .yaml etc files got automagically committed into
> git, while data/binary files of any kind (images, neuroimaging data etc) -- to
> annex.
> 
> Today, after some time of troubleshooting (a bit of a pain, since we used a
> bundled newer libmagic1 while the system-wide file was an old one ;)), we
> realized that since libmagic/file 5.35 there is special detection of .json
> files, and they are no longer reported as text/*:
> 
> 	$> file --mime-type sample.json
> 	sample.json: application/json
> 
> 	$> file --version 
> 	file-5.35
> 
> which sure thing breaks our workflow, and I see  a potential problem with such
> discovery in general:
> 
> 	json is a subset of yaml.  So any json file can also be reported as a YAML file.
> 	It is not reported ATM since there seems to be no application/yaml parser.
> 	But then what will happen when such parser appears?  Change again?
> 
> And I wondered how we could remedy the situation for our use case.  I looked at
> the file --help options, and the only candidate seems to be -k, --keep-going,
> which could potentially report all the matches, from more specific
> (application/json) to less specific (e.g. application/yaml and then
> text/plain).  But that option does not seem to work correctly in our .json case:
> 
> 	$> file --mime -kr 1.json 
> 	1.json: application/json
> 	- 
> 	- ; charset=utf-8
> 
> and similarly ugly on a sample .log file lying around:
> 
> 	$> file --mime -kr tests.log
> 	tests.log: text/plain
> 	- ; charset=utf-8
> 
> 
> So it seems that in the json case it detects that there is some additional mime
> type (not sure what it is for the .log, if any), but does not actually output it.
> Note that on an older file/magic version there is no strange empty second
> entry for that .log file:
> 
> 	$> file --mime -kr tests.log 
> 	tests.log: text/plain; charset=utf-8
> 
> 	$> file --version                  
> 	file-5.22
> 	magic file from /etc/magic:/usr/share/misc/magic
> 
> 
> Bug?  
> 
> More info -- for some files I found it reporting multiple types:
> 
> 	$> file -kr --mime-type dcmqrscp
> 	dcmqrscp: application/x-pie-executable
> 	- application/octet-stream
> 
> and for a sweep under /usr/bin:
> 
> 	hopa:/usr/bin
> 	$> file --mime-type -k * | grep '/.*/' | sed -e 's,.*: *,,g' | sort | uniq -c
> 		187 application/x-executable\012- application/octet-stream
> 	   2828 application/x-pie-executable\012- application/octet-stream
> 		  9 text/x-perl\012- text/html\012- 
> 		 35 text/x-perl\012- text/x-c\012- 
> 		  1 text/x-perl\012- text/x-java\012- 
> 		  1 text/x-perl\012- text/x-makefile\012- 
> 		 46 text/x-perl\012- text/x-perl\012- 
> 		  2 text/x-perl\012- text/x-tex\012- 
> 		  1 text/x-shellscript\012- application/octet-stream
> 
> 
> but for many it is that second blank one like in our case with .json:
> 
> 	hopa:/usr/bin
> 	$> file --mime-type -k * | sed -e 's,.*: *,,g' | grep '\\012- \\012-' | sort | uniq -c
> 		406 text/x-perl\012- \012- 
> 
> 
> Overall question: how could we reliably (now, in the future, and ideally
> also with past versions of file/magic ;)) determine whether a file can be
> considered some kind of text/ file?
> 
> Thank you in advance!
> -- 
> Yaroslav O. Halchenko
> Center for Open Neuroscience     http://centerforopenneuroscience.org
> Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
> Phone: +1 (603) 646-9834                       Fax: +1 (603) 646-1419
> WWW:   http://www.linkedin.com/in/yarik        
