
Uniquely identify a human being with two questions

Here are two multiple-choice questions that could be used to uniquely characterize each human who will ever exist on Earth. Even twins will have different answers. No two human beings are expected to have the same answers.

First question: Order the following types of food, from your favorite (#1) to the one you like least (#9). Possible choices: fruit, vegetable, dairy, carbohydrate, red meat, poultry, fish, seafood, dessert.

Second question: Order the following types of environment, from your favorite (#1) to the one you like least (#9). Possible choices: beach, mountain, desert, plain / rural, urban, small town, lake / river bank, hills, forest.

The number of potential answers (that is, the number of potential orderings) for each question is 9 factorial (9! = 362,880). The total number of potential answers for both questions is the square of 9!, that is, about 132 billion.
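The arithmetic above can be checked with a short Python sketch. The constant names are illustrative only; the figures 9 and 2 come from the two questions above.

```python
from math import factorial

CHOICES_PER_QUESTION = 9  # items to rank in each question
QUESTIONS = 2             # food and environment

# Each question admits one answer per ordering of its 9 items.
orderings_per_question = factorial(CHOICES_PER_QUESTION)

# Answers to the two questions are combined, so the counts multiply.
total_ids = orderings_per_question ** QUESTIONS

print(f"{orderings_per_question:,} orderings per question")
print(f"{total_ids:,} possible IDs")
```

This counts possible IDs only; it says nothing about how evenly real answers would spread over them.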

Of course some combinations are more likely to appear than others, some people will have a hard time ranking and would rather allow for ties, and if you've lived all your life in the same place eating the same food, you can't correctly answer these questions. Same if you are a little kid. But for most of us, this works and could even be used by companies such as match.com or advertisers. Also, this type of ID has the following advantages:

  • It is universal (it could even apply to dogs),
  • It is personal, unlike arbitrary social security numbers,
  • You know what's in your ID (government IDs such as SSN might be hiding some encoded data about you, in your ID, for profiling purposes),
  • It's easy to retrieve if lost (at least partially, which might be good enough) by answering the two questions,
  • Unlike the genome, this ID is (to a large extent) independent of gender and race (or age).

It may change over time as tastes change, but I think this is OK: your ID follows your personality. You might want to add a third question (maybe about favorite colors or climates) to increase the discriminating power, but I think it is not necessary.

Potential Improvement

Another option is to have more questions with fewer choices. For instance, 8 questions each with 4 choices (rather than 2 questions, each with 9 choices) would allow for pretty much the same number of unique IDs (a bit above 100 billion) but would be less error-prone, as people are more likely to correctly remember how they rank 4 items (e.g. colors) than 9 items. If you allow for only 2 choices per question, then you would need to ask 37 questions to cover 100+ billion unique IDs.
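To compare these designs, here is a minimal sketch that computes the size of the ID space for a given number of questions and choices, and the smallest number of questions needed to reach a target. The function names and the 100 billion target constant are illustrative, and each question is assumed to ask for a full ranking of its choices.

```python
from math import factorial

def id_space(questions: int, choices: int) -> int:
    """Number of distinct IDs when every question is a full ranking of `choices` items."""
    return factorial(choices) ** questions

def min_questions(choices: int, target: int) -> int:
    """Smallest number of questions whose ID space reaches at least `target`."""
    per_question = factorial(choices)
    if per_question < 2:
        raise ValueError("need at least 2 choices per question")
    questions, space = 0, 1
    while space < target:
        space *= per_question
        questions += 1
    return questions

TARGET = 100_000_000_000  # "100+ billion" unique IDs

for choices in (9, 4, 2):
    q = min_questions(choices, TARGET)
    print(f"{choices} choices: {q} questions, {id_space(q, choices):,} IDs")
```

With 4 choices per question the loop stops at 8 questions, and with 2 choices at 37, matching the figures quoted above.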

Experimental design to choose good questions and good choices 

The possible choices (answers) should be determined using experimental design and testing, not the other way around. Let's say that your first question is about food, with two choices: fish versus dirt. You run a test and find that everybody ranks fish as #1. The test tells you that this is not a good pair of choices: there will be lots of people with the same ID. You change your choices from fish/dirt to fish/meat. Now you see that the distribution is more uniform. You continue testing until you have something good enough.

You can even test choice stability: ask a person to rank 9 choices today and again in 7 days, then retain the choices that

  1. are most stable over time and
  2. provide an even distribution (or as close as possible to uniform distribution)
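The two criteria can be turned into simple scores. The sketch below is one possible way to do it, not a method prescribed by the article: uniformity is measured as the Shannon entropy of the observed rankings divided by its maximum, and stability as the fraction of respondents who give the same ranking in both rounds. The function names and the toy fish/dirt/meat data are made up for illustration.

```python
from collections import Counter
from math import factorial, log

def normalized_entropy(rankings, n_choices):
    """Entropy of the observed rankings divided by log(n_choices!).

    1.0 means every ordering is equally common; 0.0 means everyone
    gave the same ordering. Requires n_choices >= 2.
    """
    counts = Counter(tuple(r) for r in rankings)
    total = sum(counts.values())
    entropy = sum((c / total) * log(total / c) for c in counts.values())
    return entropy / log(factorial(n_choices))

def stability(first_round, second_round):
    """Fraction of respondents whose ranking is identical in both rounds."""
    same = sum(tuple(a) == tuple(b) for a, b in zip(first_round, second_round))
    return same / len(first_round)

# Toy data (hypothetical): ten respondents, two choices per question.
fish_dirt = [("fish", "dirt")] * 10
fish_meat = [("fish", "meat")] * 5 + [("meat", "fish")] * 5

# Second round, 7 days later: one respondent flips their answer.
fish_meat_day7 = [("meat", "fish")] + fish_meat[1:]

print("fish/dirt uniformity:", normalized_entropy(fish_dirt, 2))
print("fish/meat uniformity:", normalized_entropy(fish_meat, 2))
print("fish/meat stability:", stability(fish_meat, fish_meat_day7))
```

With 9 choices, a real test would need far more respondents than there are orderings for the entropy estimate to be meaningful, so in practice one might score pairs or small subsets of choices instead.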

Views: 5055

Comment

Comment by David Jurist on December 30, 2013 at 1:19pm

 

Vincent,

Thank you for this intriguing thread. I have thought about it for the last several days, and feel it is valid.

However, permit me to add just one refining remark.

Despite your huge permutation (132 billion) of the number of possibilities relative to the human population of planet earth, the possibility still theoretically does exist that your paired selection strings could be duplicated as an "id." [On Mount Baldy (New Mexico, USA) lightning does strike more than once in the same spot.] (8-)

If the premises of the survey steadfastly remain as is, and replies are at the onset stochastic, then in order to ensure uniqueness the "surveyor" must force a new reply to be mutually exclusive (compared to prior replies) and constrain the data universe by prohibiting duplications.

This implies that an intelligent surveyor is conducting the survey and collecting the resulting data, rather than the data only being evaluated by an analyst after the fact on independently generated samples collected from an indeterminate source. If this is not the case, then an analyst could impose another degree of uniqueness by adding context to the duplications or clustered strings.

Thank you very much for your consideration. Best wishes.

Comment by Steve O on December 20, 2013 at 6:39am

Why should an ID be permanent? Because I am permanent. We have a unique ID because all these other things are changing or shared with other people (e.g., my address). There is a need to associate the person that is me with all the things I have done in my life. Much of this, of course, will continue even after I die (my estate and the biological traits I have passed on to my children, for example). So, for a variety of reasons, we need a way to get back to or uniquely identify each of us. Like Vincent, we all have a number of different IDs: SSA, driver's license number, account numbers, cell phone number, etc. But at some level they all resolve back to the person who is us. Thus, there is some ID of some form that is unique to each of us. There is always some way to separate any two people.

It does raise an interesting philosophical question; what if we changed this paradigm, what if our identification changed?  Could I make a lot of money, then change my ID so the new me doesn't have to pay the taxes?

Which takes us all the way back to the first post.  What are we talking about here? A unique ID for every person on the earth who ever lived and ever will live?  A way to prevent identity theft?  An easy to remember, computer readable label (ID)?  What is the goal?  What are the parameters?

If the goal is uniqueness, then by Occam's Razor an individually assigned number is it.  20 or 30 digits should do.  But this is not easy to remember.  It's also easy to steal, just pick a number between 1 and a gazillion and you've identified somebody.

I find the concept of someone stealing something from me as sharing that thing with me interesting. If someone steals my ID that doesn't make them me.  We don't share an ID, just a number. Think about it.  They didn't steal from me.  They stole from the bank or credit card company.  The bank is just trying to make me pay for their mis-identification mistake.

Comment by Davide Imperati on December 20, 2013 at 2:22am

Vince, your idea is really interesting, but the design is very sensitive.
I see a number of issues, e.g.:
Depending on the questions chosen, the coding can give rise to clustering.
This problem can be reduced by choosing a set of answers that are less prone to clustering, but those questions are likely to be either intrinsically unrelated to subjective events or preferences, or strongly related to unique features of the individual.
In the former case, test/retest reliability is very low: e.g., if the questions are exotic, the same individual is likely to give different answers in different repetitions of the test.
In the latter case, the result of the test has higher reproducibility, but it fundamentally encodes information about the features of the individual.
Moreover, the test is prone to reverse engineering; see for example the design specifications of personality or psychiatric test batteries like the MMPI-2-RF.
Those tests use a similar idea, with as many as 338 questions, for a totally different purpose, and they cluster finely enough to be commonly used in forensics and medicine to cluster patients.

How do you think those issues would influence the value of your proposal?

Comment by Vincent Granville on December 17, 2013 at 10:17am

@Ben: Exactly, this is a nice experimental design / statistical problem to find combinations that minimize cluster formation.

Comment by Steve O on December 17, 2013 at 7:35am

I'm not seeing how a unique ID keeps it from being stolen.  It doesn't matter if it's a 9 digit SSN, 37 SSN, or a unique question/answer set.  If someone hacks my bank, store, or credit card company, they have the information regardless of its form.

Comment by Ben Dutta on December 17, 2013 at 3:26am

If we ask just 2-3 questions, there might still be a chance of underlying clusters forming. For example, if we use colour, shape and sound, we might still find that people who really, really like that black pentagram also like the sound of heavy metal!

Since we are trying to avoid clusters from forming and we have the liberty of experimental design, one option is to add another level. The 9 items could be randomly selected from a choice of 20 types. In that case even if people are still replying to 2 questions, the possible combination is (20C9)^2 instead of (9!)^2. To make this even more robust, it is possible to have an adaptive algorithm that avoids common clusters.
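A quick sketch of the counts involved, under stated assumptions: the pool size 20 and the 9 items shown are from the comment above, while the variable names and the third quantity are added here for comparison. Whether the ID space grows depends on what is recorded: only which 9 items were drawn, or the draw together with the ranking of those items.

```python
from math import comb, factorial

POOL = 20       # types available per question
SHOWN = 9       # items actually presented and ranked
QUESTIONS = 2

# Fixed list of 9 items, full ranking recorded: (9!)^2
fixed_list = factorial(SHOWN) ** QUESTIONS

# Only the drawn subset recorded, no ranking: (20C9)^2
subsets_only = comb(POOL, SHOWN) ** QUESTIONS

# Drawn subset and its ranking both recorded: (20C9 * 9!)^2
subsets_and_ranking = (comb(POOL, SHOWN) * factorial(SHOWN)) ** QUESTIONS

print(f"(9!)^2          = {fixed_list:,}")
print(f"(20C9)^2        = {subsets_only:,}")
print(f"(20C9 * 9!)^2   = {subsets_and_ranking:,}")
```

On its own, (20C9)^2 is smaller than (9!)^2; the gain from the extra level appears only when the ranking of the drawn items is kept as part of the ID.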


As in the above example, people who choose that black pentagram are not presented with the sound of heavy metal at all. (They could be asked to choose from a range of 9 soft toys... all different shades of pink. Ok, that might be pushing it).

The "avoidance algorithm" will need background research on human psychology though. OK, so it won't be random, but a feedback loop might keep the choices being offered always selected from "rare / sparse" combinations.

Comment by Vincent Granville on December 16, 2013 at 7:16pm

@Lynne: You wrote "dessert ranks before veggies MUCH more often than the reverse". If that's the case, the experimental design / testing described in my article would take care of it, and replace veggies and/or dessert by something else with better (more uniform) distribution. Maybe it would recommend ordering 9 colors, 9 shapes (square, circle etc.) or 9 types of sounds, rather than 9 food types.

As for uniqueness, I am still convinced that as long as collisions are extremely rare (1 per 1 million), it is OK. No system is perfect, innocent people get sent to jail by error, with a much higher rate than 1 in 1 million. It does not make the judicial system worthless. Should we abolish justice because of very rare false positives? It's all a question of how efficient the system is, measured in dollars. If you want no duplicate IDs, use 3 questions with 10 choices carefully selected. But I believe the system with 2 questions / 9 choices, while subject to rare duplicates, is better. And using just one question would create so many duplicates that it is not an option. Two questions / 10 choices per question seems to be the optimum.

Comment by Lynne Mysliwiec on December 16, 2013 at 6:51pm

I'm sorry, but the fact that a credit card number or SSN has been stolen is completely independent of its attributes (like issuing bank or uniqueness). An identity has an economic value, therefore it gets stolen -- it has nothing to do with whether the number itself is unique.  In other words, the identity theft argument makes no sense -- please try something else. 

I agree that Talbot is on exactly the right track -- especially since there is likely to be a great deal of commonality in the top four items (I would predict that dessert ranks before veggies MUCH more often than the reverse).

Are you looking for a battery of better security and identity validation questions to be asked in concert with presentation of said identity to allow for its use? If so, then why doesn't the blog post say so?

Comment by Vincent Granville on December 16, 2013 at 11:04am

@Mike: "because they're based on an assumption of total randomness in people's responses". No, choices are selected through careful experimental design, to guarantee that answers are as close as possible to randomness. It's not the other way around as you suggest (creating arbitrary choices, and hoping for the best). 

Also the ID collision problem is far smaller than the ID theft issue. Having 500 people in the world sharing the same ID is nothing compared to 5 million having their ID stolen or misused (e.g. someone is dead, his SSN still in use, for instance to participate in Chicago elections - I'm not making this up). In any data that you collect, any issue that is much smaller than the inherent noise (present in all data), is a non-issue.

Comment by Mike Tamada on December 16, 2013 at 10:56am

"Maybe SSN IDs are truly unique, but ask the people who got their ID or credit card number stolen, and you'll hear a different story about how "unique" these IDs are: both your ID thieve(s) and you share the same ID."

But that doesn't address Talbot's point.  Identity theft is a problem for all systems.  This food+location ID system has an ADDITIONAL problem:  the inherent non-uniqueness of the ID numbers that Talbot pointed out.  Plus I suspect that clustering of preferences will make the problem worse than what the theoretical calculations show, because they're based on an assumption of total randomness in people's responses.
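One way to sketch a theoretical calculation of this kind is the standard birthday-problem approximation. This is an idealized sketch, not the calculation referred to in the thread: it assumes answers are uniform and independent, and the population figure is a rough assumption that does not come from the article.

```python
from math import exp, factorial

ID_SPACE = factorial(9) ** 2   # (9!)^2 possible IDs, from the article
POPULATION = 7_000_000_000     # assumption: rough world population

# Expected number of pairs of people who share an ID
# (each of the n*(n-1)/2 pairs collides with probability 1/ID_SPACE).
expected_pairs = POPULATION * (POPULATION - 1) / (2 * ID_SPACE)

# Probability that a given person shares an ID with at least one
# other person: 1 - (1 - 1/N)^(n-1), approximated with exp().
p_shared = 1 - exp(-(POPULATION - 1) / ID_SPACE)

print(f"expected colliding pairs: {expected_pairs:.3g}")
print(f"chance a given person shares an ID: {p_shared:.2%}")
```

Under these assumptions the chance that a given person shares an ID comes out on the order of 5%; any clustering of preferences would push it higher, which is the concern raised above.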

© 2014   AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC
