Choosing geographic scales for the analysis of labour market and related statistics
- Author
- Goold, Sally-Ann
- Subjects
- Geographic scale, Spatial statistics, Labour market statistics, Labour market outcomes, Nomenclature of Territorial Units for Statistics (NUTS), Multilevel models, Hierarchical modelling
- Abstract
Researchers often have access only to aggregated statistics about people and businesses rather than individual data. This means that research into the relationships between labour market outcomes for geographic areas and demographic characteristics often has to rely on the analysis of statistics for areas rather than for individuals. This presents a problem, as the analysis of aggregated areal data often produces different results depending on the geographic scale of aggregation. When areal statistics are available for different geographic scales, researchers building statistical models have to choose which geographic scales to include in their models. When areal statistics are not available for different geographic scales, researchers have to consider whether the results of their research would have been different had areal statistics been available for a different geographic scale to the one that they were forced to use. That different geographic scales can give rise to different results is important if the results are to be used to inform policies (to improve labour market outcomes, for example). It is also important that researchers and those using their research are aware of this, as it may focus attention on the choice of which scales to use. It may also help explain differences between the results of similar research projects.
The specific aim of this project was to assess which geographic scales are the most appropriate and useful to include in the statistical modelling of selected UK labour market statistics and which geographic scales provide unhelpful or misleading information.
The wider aim of this project was to develop an approach, built using one set of labour market statistics, that could subsequently be applied to other labour market statistics or to other business or socioeconomic statistics in order to provide guidance to researchers on the effects of using different geographic scales for the analysis of areal data. The intention was to create transferable guidance on levels and methods of analysis rather than solely to analyse a single data set. This project contributes to knowledge by providing some original information about which geographic scales to include in models of various labour market outcomes. Moreover, it contributes to professional practice by describing the different stages used in choosing the geographic scales to include in the modelling of labour market outcomes.
The research described in this report was conducted using multilevel modelling. The R statistical programming environment (R Cran Project, 2019) was used to build the models and produce all the figures in the final report. Earlier model building was carried out using both MLwiN software and R. Whilst MLwiN produces user-friendly output that helps in understanding multilevel models, R was chosen for the main modelling because it allowed model building and the creation of charts in one language, which could be documented and replicated easily in the form of R scripts, examples of which are included in the Annex to the report. The scripts did not contain functions written as part of the research. Instead, they contained sections of code that built models using parameters named 'Output_variable' and 'Predictor_variable', which could be set to each of the variables required for the models in an earlier section of the script. The data used by the R scripts were read in from CSV files stored separately from the scripts rather than being contained in packages.
The use of scripts rather than packages simply evolved as the code was written and was sufficient to produce and run the models required for the research. If the work were developed further, the writing of packages containing code and data, to make it easier for other researchers to run the models, could be considered.
The data used in the research were all downloaded from official UK government statistics websites. The dataset used for the main section of model building described in chapters 4 and 5 of this thesis consists of outcome variables at local authority level for the 326 English local authority districts and unitary authorities in existence up until early 2019, together with predictor variables mainly at local authority level.
The research presented in chapter 5 of this thesis consisted of three stages: investigating the geographic scale of variation in the outcome variables, choosing the geographic scale to use for predictor variables, and choosing the geographic scales to include as levels in multilevel models. Many of the multilevel models contained one or more of the Europe-wide 'Nomenclature of Territorial Units for Statistics' (NUTS) geographic scales (Eurostat, 2018) as model level(s). This nomenclature provides a set of hierarchical areas for the collection and analysis of statistics. In the UK, the NUTS 1 areas are Scotland, Wales, Northern Ireland and the nine former government office regions in England. NUTS 2 areas in the UK generally consist of one or more counties, depending on county population sizes. A single NUTS 3 area in the UK can be a single unitary authority, a group of local authorities or a single county, depending on local population sizes, or a single London borough.
The amount of variation at different geographic scales is important as it helps to show how similar units within the same areas are to each other and how different units in different areas tend to be from each other.
The geographic scale at which units within areas are similar to each other and units in different areas are different to each other is important in finding which geographic scales it is helpful to have in multilevel models. The main conclusions from investigating the geographic scale of variation in the outcome variables were that:
• for local authority unemployment rates there were higher proportions of variance at NUTS 1 and NUTS 3 area levels than at NUTS 2 area level;
• for local authority employment rates and workplace earnings there were broadly similar proportions of variance at NUTS 1, NUTS 2 and NUTS 3 area levels;
• for local authority mean hours and median hours variables there were negligible amounts of variance at NUTS 3 area level;
• for job density there was a negligible amount of variance at NUTS 1 area level;
• for the median residents' earnings variable there were equal proportions of variance at NUTS 2 and NUTS 3 area levels and twice that proportion at NUTS 1 area level.
The main finding from the investigation into the geographic scale to use for predictor variables in models of local authority level outcomes was that it was usually better to use local authority level predictor variables rather than predictors calculated at higher geographic scales, and that it was unnecessary to use predictors calculated at multiple geographic scales. For that reason, the main modelling part of the project was devoted to multilevel models of local authority level outcomes using only local authority level predictors. The research consisted of building a large number of models for each outcome variable using different predictor variables, all measured or calculated at local authority level, but within multilevel models that grouped the local authorities at different geographic levels.
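As an illustration of what "proportion of variance at a level" means, the following minimal sketch turns the variance components of a null multilevel model into proportions. The function name and the numbers are hypothetical and chosen only for illustration; they are not results from the thesis, whose models were built in R.

```python
def variance_proportions(components):
    """Return each level's share of the total variance.

    `components` maps a level name to its estimated variance component,
    as would be reported by a null (intercept-only) multilevel model.
    """
    total = sum(components.values())
    if total <= 0:
        raise ValueError("total variance must be positive")
    return {level: variance / total for level, variance in components.items()}


# Hypothetical variance components, made up for this example.
example_components = {
    "NUTS 1": 1.5,
    "NUTS 2": 0.5,
    "NUTS 3": 1.0,
    "local authority (residual)": 2.0,
}

proportions = variance_proportions(example_components)
for level, share in proportions.items():
    print(f"{level}: {share:.1%} of total variance")
```

A level whose share is close to zero adds little as a grouping level in a random intercept model, which is the sense in which a scale can be unhelpful to include.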
The models were then analysed to see which fitted the data better by comparing the Akaike information criterion (AIC) values for groups of models that used the same outcome and predictor variables in different ways. This comparison found the following models to be among the best for the various outcome variables:
• four-level random intercept models for unemployment rates, residents' earnings and workplace earnings;
• two-level random intercept models with grouping by NUTS 2 areas for mean hours worked and job density;
• a variety of models for employment rates, depending on the predictor variable used.
An overall finding from the results was that there was often a choice to be made between complex (random coefficient) models with just two levels and simpler (random intercept) models with four levels. Given that this choice may have to be made, it was suggested that consideration should be given to what sort of information is sought from the model in order to help choose which geographic levels to include. To learn about influences coming from different geographic scales, a random intercept model with many different levels is likely to be appropriate. However, to learn about different strengths of effects in different parts of a study area, a random coefficient multilevel model with just two levels, or a four-level model with random coefficients at just one level, may be more useful.
The recommendations of this project include guidance to researchers on how to choose which geographic scales to include in models. The guidance is presented in the form of a set of steps.
The steps cover:
• choosing outcome variables that have distributions suitable for linear modelling;
• dealing with outliers;
• building null models to investigate the proportion of variance of the outcome variables that occurs at different geographic scales;
• considering the intended purpose of the model to determine whether a random coefficient model would be helpful, and being aware that the geographic scales to use for random coefficient models may be different to those to use for null or random intercept models;
• comparing the AIC values of models that include different geographic scale levels to assess which fit the data better;
• where appropriate, checking for any spatial patterns in the random coefficients estimated by a model.
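The AIC comparison step can be sketched as follows, using the standard definition AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximised likelihood. The model labels, log-likelihoods and parameter counts below are hypothetical and are not taken from the thesis; the sketch only shows how candidate models fitted to the same outcome and predictor might be ranked.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical fitted models: label -> (maximised log-likelihood, parameter count).
candidates = {
    "two-level: local authorities within NUTS 3": (-520.0, 4),
    "two-level: local authorities within NUTS 2": (-515.0, 4),
    "four-level: local authorities within NUTS 3, NUTS 2, NUTS 1": (-505.0, 6),
}

# Compute the AIC for each candidate and rank from lowest (best) to highest.
scores = {label: aic(ll, k) for label, (ll, k) in candidates.items()}
ranked = sorted(scores.items(), key=lambda item: item[1])
best_label, best_aic = ranked[0]

for label, value in ranked:
    # The difference from the best model is easier to interpret than the raw AIC.
    print(f"{label}: AIC = {value:.1f}, delta = {value - best_aic:.1f}")
```

Lower AIC indicates a better trade-off between fit and complexity, and such comparisons are only meaningful between models fitted to the same data.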
- Published
- 2021