Tips before starting data analysis


General tips

Prior to embarking on analysis of Scottish Health Survey data, consider the following points:

  • When is it appropriate to use a combined dataset? See 'appropriate use of the combined dataset' under the Change Over Time section below.
  • Can you replicate previous results? This will help confirm the correct variables are used in your analysis. For help on replicating published tables, see our section Replicating Report Tables. This includes links to documents which list the variables used to produce the published tables along with the corresponding syntax to recreate the published tables.
  • Double check that the analysis variable name, definition and response options are relevant to the purpose of your analysis. This sounds obvious, but there are over two thousand variables in each survey dataset, including both original and derived variables, and some variable names are quite similar. For example, the various smoking status variables have similar names but use slightly different bases to produce different survey estimates. This is an essential pre-analysis task when doing time series analysis. See the survey documentation for more information.
  • Ensure you are using the correct base from which to calculate your estimates. For example, consider whether you wish to include the 'missings': respondents who were either not eligible to respond to a question (categorised as 'Not Applicable') or eligible to respond but declined to provide a response. Are you analysing the correct group (e.g. just males or females, those aged 16 to 18, etc)?
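
The effect of the choice of base can be illustrated with a minimal Python sketch. The response codes (-1 for 'Not Applicable', -9 for a declined response) and the data are hypothetical and are not taken from any survey dataset; the actual codes are defined in the survey documentation.

```python
# Hypothetical codes for the 'missings'; check the survey documentation
# for the codes actually used in each dataset.
NOT_APPLICABLE = -1
REFUSED = -9


def proportion(responses, target, include_missing=False):
    """Return the share of the base whose response equals `target`.

    By default the 'missings' are dropped from the base; pass
    include_missing=True to keep them in the denominator.
    """
    missing = {NOT_APPLICABLE, REFUSED}
    if include_missing:
        base = list(responses)
    else:
        base = [r for r in responses if r not in missing]
    if not base:
        raise ValueError("the base is empty")
    return sum(1 for r in base if r == target) / len(base)


# Illustrative responses only: 1 = yes, 2 = no.
responses = [1, 2, 1, NOT_APPLICABLE, REFUSED, 2, 1, 2]

valid_base = proportion(responses, target=1)
full_base = proportion(responses, target=1, include_missing=True)
print(valid_base, full_base)
```

The same three 'yes' responses give a different estimate depending on whether the base is the six valid responses or all eight respondents, which is why the base should be settled before comparing against published tables.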


Weighting the data

  • Ensure you apply the correct weighting variable to your analysis (e.g. whether adults or children, or analysing blood data or nurse data, or questions analysed are only asked in Version A of the questionnaire).  For full details on the weights included in each dataset, see the survey documentation.
  • Note that combined datasets (e.g. 2008/2009, 2012/2013/2014), and therefore their weights, do not currently exist for periods spanning different survey cycles. It is therefore only possible to analyse combinations of years within the same survey cycle (e.g. 2008-2011 or 2012-2015).
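
As a rough illustration of why the weighting variable matters, the sketch below compares an unweighted and a weighted proportion in Python. The variable names `smoker` and `int_wt` and the values are hypothetical; the correct weight for a given analysis is listed in the survey documentation. The sketch produces a point estimate only and does not cover standard errors or confidence intervals.

```python
def weighted_proportion(records, variable, target, weight):
    """Weighted share of records where `variable` equals `target`."""
    total = sum(r[weight] for r in records)
    if total <= 0:
        raise ValueError("the sum of weights must be positive")
    hits = sum(r[weight] for r in records if r[variable] == target)
    return hits / total


# Illustrative records only; `int_wt` is a hypothetical weight name.
records = [
    {"smoker": 1, "int_wt": 0.5},
    {"smoker": 0, "int_wt": 1.5},
    {"smoker": 1, "int_wt": 1.0},
    {"smoker": 0, "int_wt": 1.0},
]

unweighted = sum(1 for r in records if r["smoker"] == 1) / len(records)
weighted = weighted_proportion(records, "smoker", 1, "int_wt")
print(unweighted, weighted)
```

Passing the weight name as a parameter makes it explicit which weight was applied, which helps when a dataset carries several weights for different respondent groups or questionnaire versions.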


Change over time

Other considerations particularly relevant to those undertaking time series analysis are:

  • Combined datasets should only be used when analysing multiple years together. The weights in combined datasets are slightly different from those in the individual survey year datasets. This is because, with a larger dataset, extra constraints can be added to the weights to ensure they produce a representative age/sex split when analysing by those NHS Health Boards large enough to perform analysis on two years of data. This does not have much impact on the results at Scotland level. However, for certain NHS Boards, it will have an impact at board level. Therefore, we strongly recommend that analysts performing time series analysis, or analysis of specific single years, use only the single year datasets.
  • Question modifications - check whether the variable(s) analysed have changed across the survey years eg in terms of definition, response options, or the subset of respondents eligible to respond to questions. For example, have single response questions changed to multiple response questions since the previous survey(s), or vice versa? See the survey documentation to view each survey questionnaire.
  • Methodology differences - for example the introduction of the revised variables.
  • Changes to sample structure - age restrictions on participation in the survey have changed. In the 1995 survey, respondents aged 16-64 were interviewed; in the 1998 survey this was extended to those aged 2 to 74; and from 2003 onwards respondents of all ages (0+) were interviewed. In time series analysis, the analysis group must therefore be limited to those aged 16-64 to enable comparability across all survey years (1995 onwards).
  • Geographical changes - e.g. changes to NHS Board administrative boundaries in 2006 and 2014.
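
The age restriction described under 'Changes to sample structure' can be sketched in Python as follows. The field names `year` and `age` and the records are hypothetical; only the 16-64 age range comes from the text above.

```python
# Age range interviewed in every survey year from 1995 onwards.
COMMON_AGE_MIN = 16
COMMON_AGE_MAX = 64


def comparable_sample(records, age_field="age"):
    """Keep only respondents in the age range common to all survey years."""
    return [
        r for r in records
        if COMMON_AGE_MIN <= r[age_field] <= COMMON_AGE_MAX
    ]


# Illustrative records only.
records = [
    {"year": 1995, "age": 30},
    {"year": 1998, "age": 70},
    {"year": 2003, "age": 5},
    {"year": 2003, "age": 64},
    {"year": 2003, "age": 80},
]

kept = comparable_sample(records)
print([(r["year"], r["age"]) for r in kept])
```

Applying the same filter to every year before calculating estimates keeps the analysis group consistent, so that differences between years are not simply the result of the changing age limits.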