issue when reading a file with duplicates to be excluded #1

@npsonis

Hi,
I used the -C option to compare two EIGENSTRAT databases and found a few duplicates.
So I ran the following to remove the duplicates from one of the databases:
eigenstrat_database_tools.py -g v54.1_1240K_public_Olalde2019.geno -s v54.1_1240K_public_Olalde2019.snp -i v54.1_1240K_public_Olalde2019.ind -o v54.1_1240K_public_Olalde2019_no_duplicates -L Olalde2019_duplicates.txt -R

and got the following error:

Traceback (most recent call last):
  File "/home/psonis/software/EigenStratDatabaseTools/eigenstrat_database_tools.py", line 86, in <module>
    validate_eigenstrat(args.genoFn, args.snpFn, args.indFn)
  File "/home/psonis/software/EigenStratDatabaseTools/eigenstrat_database_tools.py", line 21, in validate_eigenstrat
    dimsGeno = [file_len(genof), file_width(genof)]
  File "/home/psonis/software/EigenStratDatabaseTools/eigenstrat_database_tools.py", line 8, in file_len
    for i, l in enumerate(f):
  File "/home/psonis/miniconda3/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 90: invalid start byte

Any thoughts on how to resolve this?

Nikos

I just figured out that the geno files in the Reich dataset are PACKEDANCESTRYMAP (binary), whereas your tool needs the unpacked EIGENSTRAT format, so I converted the file with convertf. I think you should either warn the user that files with the .geno extension may not be EIGENSTRAT, or make the tool accept binary files.
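
For the first option, here is a rough sketch of a check that could run before the file is read as text. It is not taken from the tool's code: the names looks_like_text_geno and require_text_geno and the 4096-byte sample size are hypothetical. It only assumes that a text EIGENSTRAT .geno file contains nothing but the genotype characters 0, 1, 2 and 9 plus line endings, so any other byte in the first chunk suggests a binary file.

```python
# Bytes allowed in a text EIGENSTRAT .geno file: genotype codes plus line endings.
TEXT_GENO_BYTES = frozenset(b"0129\r\n")


def looks_like_text_geno(path, sample_size=4096):
    """Return True if the first bytes of `path` are only 0/1/2/9 and newlines.

    Hypothetical helper: it inspects a sample, not the whole file.
    """
    with open(path, "rb") as handle:  # binary mode, so nothing is decoded
        sample = handle.read(sample_size)
    return len(sample) > 0 and all(byte in TEXT_GENO_BYTES for byte in sample)


def require_text_geno(path):
    """Raise a readable error instead of a UnicodeDecodeError on binary input."""
    if not looks_like_text_geno(path):
        raise ValueError(
            f"{path} does not look like a text EIGENSTRAT .geno file; "
            "it may be PACKEDANCESTRYMAP (binary). Convert it with convertf first."
        )
```

Calling something like require_text_geno(args.genoFn) before validate_eigenstrat would turn the traceback above into a message that points the user to convertf. Because only the start of the file is sampled, this is a quick sanity check rather than a full validation.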
