The Complete KnowledgeHound Data Quality Guide

All things related to data, limits, performance, tips and tricks.

Written by Adam Swing
Updated over a year ago

Loading survey data into KnowledgeHound for analysis can be done in 3 ways:

  1. Data Loading for Managers ("self-loading"), which allows managers to upload .sav files for KnowledgeHound to automatically prepare and present in minutes

  2. Sending data to KnowledgeHound's Data Processing Team for professional data cleaning

  3. Via an integration with your data supplier

This article covers data that is loaded using one of the first two methods.

To watch training videos related to adding research to KnowledgeHound, visit this article.

Acceptable Data File Types

We guarantee compatibility with SPSS files. If a manager chooses to self-load a data file, the following file formats are required for our system to begin automatically cleaning the data:

  1. .sav

  2. .sav.zip

Not all organizations have data in this format or the tools to produce it efficiently. KnowledgeHound's professional data cleaning service can transform files for you. Currently, we can accept:

  1. Respondent-level CSV files

  2. JMP

  3. Other file formats (on an experimental basis)

We strive for compatibility with a range of files and tools, but cannot guarantee compatibility with each of the hundreds of combinations of tools and file formats.  Some tools don’t gracefully export to competing file formats and some export processes truncate or mangle important data elements.

We would be happy to evaluate your files and, if we can’t import them into KnowledgeHound, recommend a process you can use to create compatible files or discuss additional data curation services.

What is an SAV file?

SAV is the data file format used by SPSS, and it is commonly used in the market research industry. SAV files contain the response of every respondent to every question they answered. They also contain some of the metadata about the study, such as the prompt of each question and the possible responses to categorical questions.

Why SAV Files, not Tables?

Because respondent-level data gives your users the most flexibility in determining their own cuts, and because highly structured files like SPSS’s SAV are more reliable, we prefer respondent-level SPSS files.

Datafile and Dataset Limitations

We currently support data files up to 1GB. If your data files are larger than 1GB, you can break them into several smaller files or exclude unnecessary variables. If your contract includes white glove data processing services, data files greater than 1GB can be sent to our team and we will work to load them.

We cannot promise that every data file larger than 1GB will be suitable for the KnowledgeHound platform. Our data processing team will consult with you on the best path forward in each case.
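As a rough illustration, the file type and size rules above can be checked before a self-load is attempted. This is a minimal sketch, not KnowledgeHound's own validation: the function name `self_load_problems` is hypothetical, and the byte value used for "1GB" is an assumption, since the article does not define it.

```python
from pathlib import Path

# Assumption: "1GB" is treated as 10**9 bytes; the article does not specify.
MAX_BYTES = 1_000_000_000
# File types accepted for self-loading, as listed earlier in this article.
SELF_LOAD_SUFFIXES = (".sav", ".sav.zip")


def self_load_problems(path):
    """Return a list of reasons a file may not be suitable for self-loading."""
    path = Path(path)
    problems = []
    if not path.name.lower().endswith(SELF_LOAD_SUFFIXES):
        problems.append("file type is not .sav or .sav.zip")
    # The size check only runs when the file is actually present on disk.
    if path.exists() and path.stat().st_size > MAX_BYTES:
        problems.append("file is larger than 1GB")
    return problems


print(self_load_problems("survey.csv"))
print(self_load_problems("survey.sav.zip"))
```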

File size is not the only factor that affects end-user performance when analyzing survey data in KnowledgeHound. The structure of the dataset matters more to a quality analysis experience.

Acceptable performance criteria and limitations

In-app performance when analyzing and interacting with your data is foundational to getting the most out of KnowledgeHound. When optimizing performance, the size of the data file is not the most important factor. The more important factors are:

  1. The number of respondents (rows)

  2. The number of variables (columns)

  3. The number of response options for a specific variable

  4. The number of open-ended variables

  5. Whether or not you conduct statistical testing

Dataset Limits to optimize performance

Below we outline the current limits that will ensure optimized performance. Beyond these limits, you should expect load times to grow, or, in the worst cases, the data will not load.

  1. Up to 400,000 respondents (rows)

  2. Up to 20,000 variables

  3. Up to 300 response options for a single variable

  4. Up to 30 breakout options on a matrix/grid question

  5. Up to 100,000 statistical comparisons for a given analysis (stat testing across means on matrix/grid questions is not currently supported)
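The limits above can be compared against a summary of your dataset before loading. The following is a minimal sketch under stated assumptions: the dictionary keys and the `check_limits` function are illustrative names, and you would need to compute the summary values with your own tools.

```python
# Limits copied from the list above; the key names are illustrative.
LIMITS = {
    "respondents": 400_000,
    "variables": 20_000,
    "response_options": 300,
    "grid_breakouts": 30,
    "stat_comparisons": 100_000,
}


def check_limits(summary):
    """Return a warning string for every limit the dataset exceeds.

    `summary` uses the same keys as LIMITS and holds the largest value
    observed in the dataset (for example, the most response options on
    any single variable). Missing keys are treated as zero.
    """
    warnings = []
    for key, limit in LIMITS.items():
        value = summary.get(key, 0)
        if value > limit:
            warnings.append(f"{key}: {value:,} exceeds the limit of {limit:,}")
    return warnings


summary = {"respondents": 520_000, "variables": 1_200,
           "response_options": 45, "grid_breakouts": 30}
for warning in check_limits(summary):
    print(warning)
```

Values exactly at a limit are treated as acceptable, matching the "up to" wording of the list.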

Tips to Improve Performance

Excluding unnecessary variables

You don’t need to upload data your users won’t see.  For example, you might have collected “system variables” that are not of interest to analysts.  Since this data will be unavailable to users, you can eliminate it from your files before uploading to shrink the file size.  To do so, simply export only the desired variables from your current file.

Breaking Files into Several Smaller Files

If all data in a file is important and the files are still too large, we recommend breaking a file into smaller files by exporting subsets of variables into each of the smaller files.

Tip: When exporting variable subsets into many files, make sure each and every file contains the important demographic and segmentation variables your users will want to compare results across.
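One way to plan such a split is to decide the variable groups first, then export each group from your own tool. The sketch below only plans the groups and does not read or write SAV files. The function name `split_variables` and the variable names are hypothetical.

```python
def split_variables(all_vars, key_vars, max_per_file):
    """Split variable names into groups, repeating key_vars in every group.

    key_vars are the demographic and segmentation variables that every
    smaller file must contain.
    """
    key_set = set(key_vars)
    others = [v for v in all_vars if v not in key_set]
    room = max_per_file - len(key_vars)
    if room <= 0:
        raise ValueError("max_per_file must exceed the number of key variables")
    return [list(key_vars) + others[i:i + room]
            for i in range(0, len(others), room)]


all_vars = ["age", "region", "q1", "q2", "q3", "q4", "q5"]
groups = split_variables(all_vars, key_vars=["age", "region"], max_per_file=4)
for group in groups:
    print(group)
```

Each printed group is the list of variables to export into one smaller file.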

Data Quality

Data quality is a big topic, much too big to cover completely in a few pages. You and your research supplier likely pay attention to many critical topics like appropriate sampling, bias, instrument design, ‘speeders’, and more.  

Here, we’ll discuss some additional facets of data quality you may not be familiar with, especially if historically your data has usually been accessed and analyzed by specialists. Paying close attention to these facets will help you get the most out of your data.

Supported Variable Types

KnowledgeHound supports categorical variables (Single Response and Multiple Response) and numerical variables.

For categorical variables, each response should be the full text of the response selected by users. Responses should not restate the question prompt.

Good example:  

Q: Where do you shop most often?

○ Grocery Stores

○ Club Stores

○ Convenience Stores

Bad example 1:  

Q: Where do you shop most often?

○ 1

○ 2

○ 3

Bad example 2:  

Q: Where do you shop most often?

○ Where do you shop most often - Grocery Stores

○ Where do you shop most often - Club Stores

○ Where do you shop most often - Convenience Stores
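The two bad patterns above can be flagged with a simple check on the response labels. This is a minimal sketch, assuming you have already extracted the question prompt and its response labels as plain strings. The function name `label_problems` is hypothetical.

```python
def label_problems(prompt, labels):
    """Flag response labels that are bare codes or that restate the prompt."""
    problems = []
    # Compare against the prompt without trailing punctuation or spaces.
    stem = prompt.rstrip("?:. ").lower()
    for label in labels:
        text = label.strip()
        if text.isdigit():
            problems.append((label, "numeric code instead of response text"))
        elif stem and text.lower().startswith(stem):
            problems.append((label, "restates the question prompt"))
    return problems


prompt = "Where do you shop most often?"
print(label_problems(prompt, ["Grocery Stores", "Club Stores"]))
print(label_problems(prompt, ["1", "2", "3"]))
print(label_problems(prompt, ["Where do you shop most often - Grocery Stores"]))
```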

Variable Type Conventions

KnowledgeHound works best when SPSS files use the proper variable type for the questions they describe. Almost all suppliers do this by default.  In the rare cases that improper variable types are used, incompatible tools are usually to blame.

When properly formatted, a data file should have exactly one variable for every question asked of respondents. Each variable should match the type of question given to respondents. For example, if respondents are allowed to select more than one possible answer, the file should use the SPSS variable type “Multiple Response”.

Using the right variable type:  

Variable 1: What colors do you like (check all that apply)?

▢ Red

▢ Yellow

▢ Blue

Using the wrong variable type:  

Variable 1: What colors do you like (check all that apply) - Red

○ True

○ False

Variable 2: What colors do you like (check all that apply) - Yellow

○ True

○ False

Variable 3: What colors do you like (check all that apply) - Blue

○ True

○ False
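A file that uses the wrong variable type in this way can often be spotted by looking for several True/False variables that share a question stem. The sketch below is one heuristic for doing so. It assumes variable labels and response labels have already been extracted into a dictionary, and that the stem and option are separated by " - " as in the example above. The function name is hypothetical.

```python
from collections import defaultdict


def find_split_multiple_response(variables):
    """Find True/False variables that look like one multiple-response question.

    `variables` maps each variable label to its list of response labels.
    Returns a dict of shared stem -> list of options found for that stem.
    """
    stems = defaultdict(list)
    for label, responses in variables.items():
        is_binary = {r.strip().lower() for r in responses} == {"true", "false"}
        # rpartition splits at the last " - "; sep is empty if none is found.
        stem, sep, option = label.rpartition(" - ")
        if is_binary and sep:
            stems[stem].append(option)
    return {stem: opts for stem, opts in stems.items() if len(opts) > 1}


variables = {
    "What colors do you like (check all that apply) - Red": ["True", "False"],
    "What colors do you like (check all that apply) - Yellow": ["True", "False"],
    "What colors do you like (check all that apply) - Blue": ["True", "False"],
    "Where do you shop most often?": ["Grocery Stores", "Club Stores"],
}
print(find_split_multiple_response(variables))
```

Any stem reported here is a candidate for conversion to a single Multiple Response variable.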

There will be cases where it makes sense to include two variables for a single question. For example, if you asked respondents to report their age as a number, your supplier might have helpfully delivered a file with two age variables, one numeric (the raw responses), and one categorical variable describing respondents’ age brackets. KnowledgeHound gladly accommodates these convenience variables.

If you’re unsure whether or not your variables are properly formatted, we’d be happy to examine a sample file and its related questionnaire and let you know. 

Skip logic

Some surveys use skip patterns, which ask particular questions only to subsets of respondents. When you use a skip pattern, be sure that your data differentiates between respondents who (a) never saw the question, (b) declined to answer the question, or (c) simply didn’t select an answer because none of the responses applied to them.

Because these differences are integral to drawing inferences from your data, as in the sample question below, be sure that responses are coded appropriately.  

Sample Question:  

Q:  When was the last time you shopped at a Country Store?

○ In the past 7 days

○ In the past 30 days

○ In the past 365 days

For example, if a respondent saw this question but did not answer it, they were probably indicating they had not shopped at a country store in the last 365 days.

Tip: If applicable, provide a “None of the above” option on categorical questions and require a response to all questions presented to respondents.
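The three cases can be kept apart by recording whether the question was shown separately from the answer itself. The following is a minimal sketch of that idea. The `Status` names, the `classify` function, and the record layout are all hypothetical, and your survey tool may code these cases differently.

```python
from collections import Counter
from enum import Enum


class Status(Enum):
    NOT_SHOWN = "never saw the question"
    DECLINED = "saw the question, declined to answer"
    NONE_APPLY = "saw the question, no response applied"
    ANSWERED = "answered"


def classify(was_shown, answer):
    """Classify one respondent's record for a skip-pattern question.

    was_shown: True if the survey logic presented the question.
    answer: the response text, "None of the above", or None if blank.
    """
    if not was_shown:
        return Status.NOT_SHOWN
    if answer is None:
        return Status.DECLINED
    if answer == "None of the above":
        return Status.NONE_APPLY
    return Status.ANSWERED


records = [(True, "In the past 7 days"), (True, None),
           (False, None), (True, "None of the above")]
counts = Counter(classify(shown, answer) for shown, answer in records)
for status, count in counts.items():
    print(status.value, count)
```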

Matrix/Grid Questions

When a questionnaire presents a question as a grid of columns and rows, the row options will be displayed in the “Response” column, and each column choice needs its own line item, preceded by the full question to give users context.

For example, Q16. How much do you agree or disagree with the following statements?

Statement A

Statement B

Statement C

Statement D

Strongly Agree

Somewhat Agree

Neither agree nor disagree

Somewhat disagree

Strongly disagree

The properly formatted data file will display the following for the 4 statement variables. Please note the importance of the “space dash space” format between the question stem and the different statements. This "space dash space" format is critical to making complete grid/matrix questions. Failure to adhere to this standard will result in incomplete and undesirable results. Below is an example of that format.

How much do you agree or disagree with the following statements? – Statement A

How much do you agree or disagree with the following statements? – Statement B

How much do you agree or disagree with the following statements? – Statement C

How much do you agree or disagree with the following statements? – Statement D

Note that when self-loading a data file, an algorithm will detect grid questions based on question labels in the SPSS file. Professional data cleaning services require a data file to be properly formatted, as described above.
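To check labels before loading, you can group them by the text before the separator and confirm that each grid comes out complete. This sketch is not KnowledgeHound's detection algorithm, which this article does not describe. It is a simple heuristic with a hypothetical function name, and it accepts either a hyphen or an en dash between the spaces as an assumption, because the example above is rendered with an en dash.

```python
import re

# Stem, then space + dash + space, then statement. The greedy first group
# means the split happens at the last separator in the label.
# Assumption: either "-" or an en dash (U+2013) counts as the dash.
SEPARATOR = re.compile(r"^(.*\S) [-\u2013] (\S.*)$")


def group_grid_labels(labels):
    """Group variable labels that share a question stem into grids."""
    grids = {}
    for label in labels:
        match = SEPARATOR.match(label)
        if match:
            stem, statement = match.groups()
            grids.setdefault(stem, []).append(statement)
    # A grid needs at least two statements under the same stem.
    return {stem: items for stem, items in grids.items() if len(items) > 1}


stem = "How much do you agree or disagree with the following statements?"
labels = [
    f"{stem} - Statement A",
    f"{stem} - Statement B",
    f"{stem} - Statement C",
    f"{stem}-Statement D",  # missing spaces, so it is left out of the grid
]
print(group_grid_labels(labels))
```

In this example, Statement D is dropped from the grid because its label lacks the spaces around the dash, which illustrates the kind of incomplete result the format requirement is meant to prevent.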

Searchability and Readability

To get the most from KnowledgeHound, the question prompts and listed responses for each question in your data set should be both readable and searchable.  

Readable means that any of your colleagues who find the question in KnowledgeHound will know immediately from the question’s prompt what respondents were asked.

Readable example:  

Q: Where have you shopped the most often in the Past 6 Months (select the most accurate response)?

A1: Grocery Stores

A2: Club Stores

A3: Convenience Stores

Bad example:  

Q: Whr Shp Mst P6M?

A1: Grcry Strs

A2: Clb Strs

A3: Cnvnc Strs

Searchable means that the words in the question prompt capture the essence of the question.

Searchable example:  

Q: Where do you shop most often (select the most accurate response)?

A1: Grocery Stores

A2: Club Stores

A3: Convenience Stores

Bad example:  

Q: Which is most true of you?

A1: I most often shop at grocery stores

A2: I most often shop at club stores

A3: I most often shop at convenience stores

Each SPSS variable (which generally represents one question in your questionnaire) comes with, among other things, a name and a description. Our interest is in the description. Some tools and users abbreviate the question prompt in the description, sometimes inadvertently making the question unsearchable and unreadable.

Tip: The description field should contain the full question prompt without abbreviations

Categorical SPSS variables, questions that ask users to select their response from a list of predefined responses, also come with that list of responses embedded in the SPSS file. To make sure your colleagues can search for text that appears in those answers, for example when they search “How often do people shop at grocery stores?”, these too need to be readable and searchable.

Tip: The response selections to categorical questions should be as searchable and readable as question prompts

Tip: If you decide to edit the response selections to categorical questions, make sure to check afterward that the file has saved your data properly by looking at the summary for that variable. In some cases, SPSS may corrupt the data for a question if the response names are changed.

Screener Data

To determine whether screener questions should be included, first check the base sizes of those screener questions. If the base is higher in a screener question than in the main survey questions, the data file probably includes terminated data, meaning data from respondents who did not qualify for the study. In these cases, you can do one of two things: delete the terminated data from the data file or omit those questions.

We do not want to include this kind of data because it is not representative of the sample for the study. If somebody pulls a data point from one of these questions and doesn’t realize that the base size differs from that of the main portion of the study, they could use that data point in the wrong context.
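The base size comparison can be sketched as follows. This is a minimal illustration that assumes respondent-level data has been read into a list of dictionaries, with `None` marking a question the respondent did not answer. The function and variable names are hypothetical.

```python
def base_sizes(rows):
    """Count non-missing answers per variable across all respondents."""
    bases = {}
    for row in rows:
        for name, value in row.items():
            if value is not None:
                bases[name] = bases.get(name, 0) + 1
    return bases


def likely_terminated(rows, screener_vars, main_vars):
    """Return screener variables whose base exceeds the largest main base."""
    bases = base_sizes(rows)
    main_base = max(bases.get(v, 0) for v in main_vars)
    return [v for v in screener_vars if bases.get(v, 0) > main_base]


rows = [
    {"s1": "Yes", "q1": "Grocery Stores"},
    {"s1": "Yes", "q1": "Club Stores"},
    {"s1": "No", "q1": None},  # screened out before the main survey
]
print(likely_terminated(rows, screener_vars=["s1"], main_vars=["q1"]))
```

A screener variable reported here has more respondents than any main survey question, which suggests the file includes terminated respondents.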

References to Information not Included in the SPSS File

You may find that you have some data you can’t possibly include inside your SPSS file. For example, you may have respondents examine a set of concept images before offering their impressions. When this happens, be sure that you include enough information in your SPSS file that your colleagues searching KnowledgeHound will be able to make sense of the questions you’ve asked.

Tip: When referring to concept tests, give each concept a description of up to three words that differentiates it from the others. We encourage using the benefit statement or the key offering of the stimulus. Instead of calling it “Concept #1”, call it “Freshness Concept” or, better yet, “100% Freshness Guaranteed”. The same applies to claims testing. Instead of calling it “Claim 1”, use the actual claim tested, such as “5 Times Fresher”.

Tip: When referring to outside assets, such as images from concept tests, make sure that you or your supplier also upload those assets to KnowledgeHound.

Working with Suppliers

Issues with data quality can be handled anywhere in the data pipeline, but the later they’re addressed the more costly they can become.

The least expensive and most effective place to address data quality issues is at the source. Whenever possible, it’s better to have suppliers deliver files that meet these standards. Better yet is for suppliers to use tools that require little or no additional effort from them to meet these standards.
