The availability of acceleration sensors creates exciting new opportunities for data mining and predictive analytics applications. In this post, we use accelerometer data to perform activity recognition: identifying the physical activity a user is performing. Many applications follow from this: activity reports, calorie computation, sedentary alerts, matching music to the activity… In brief, a lot of applications to promote and encourage health and fitness. This post is inspired by the WISDM Lab’s study, and the data come from there.

## Data description

We use labeled accelerometer data collected from users via a device carried in their pocket during different activities (walking, sitting, jogging, ascending stairs, descending stairs, and standing).

The accelerometer measures acceleration along all three spatial axes, as follows:

- Z-axis captures the forward movement of the leg
- Y-axis captures the upward and downward movement of the leg
- X-axis captures the horizontal movement of the leg

The plots below show the characteristics of each activity. Because of the periodicity of these activities, a window of a few seconds is sufficient.

Understanding these plots is essential for spotting the patterns of each activity and then recognizing it. For example, we observe repeating waves and peaks for the repetitive activities: walking, jogging, ascending stairs, and descending stairs. We observe no periodic behavior for more static activities like standing or sitting, but their amplitudes differ.

The data set provides data from 37 different users, and each user performed each activity several times. So I have defined several windows for each user and each activity to retrieve more samples.

Just below is an example of a Geo Time Series (GTS) displayed with our Cityzen Data widget.

More about the data right here.

## Determine and compute features for the model

Each of these activities demonstrates characteristics that we will use to define the features of the model.
For example, the plot for walking shows a series of high peaks on the y-axis spaced at approximately 0.5-second intervals, while the interval is closer to 0.25 seconds for jogging.
We also notice that the range of the y-axis acceleration is greater for jogging than for walking, and so on.
This analysis step is essential and **takes time** to determine the best features to use for our model.

We determine a window (a few seconds) over which we will compute all these features.
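
To make this windowing step concrete, here is a minimal plain-Java sketch (a hypothetical helper, not the Einstein code used in this post) that cuts a stream of samples into fixed-size, non-overlapping windows; the class and method names are mine:

```java
import java.util.ArrayList;
import java.util.List;

public class Windowing {
    // Split a series of samples into consecutive, non-overlapping windows
    // of windowSize samples each; a trailing partial window is dropped.
    static List<double[]> splitIntoWindows(double[] samples, int windowSize) {
        List<double[]> windows = new ArrayList<>();
        for (int start = 0; start + windowSize <= samples.length; start += windowSize) {
            double[] w = new double[windowSize];
            System.arraycopy(samples, start, w, 0, windowSize);
            windows.add(w);
        }
        return windows;
    }

    public static void main(String[] args) {
        double[] samples = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7};
        // 7 samples with a window of 3 -> 2 full windows, last sample dropped
        System.out.println(splitIntoWindows(samples, 3).size());
    }
}
```

With a constant sampling rate, the window size in samples is simply the window duration multiplied by that rate.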

After several tests with different feature combinations, the ones I have chosen are described below:

- Average acceleration (for each axis)
- Standard deviation (for each axis)
- Average absolute difference (for each axis)
- Average resultant acceleration ((1/n) * Σ √(x² + y² + z²))
- Average time between peaks (max) (for each axis)
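
Outside Einstein, the first four features can be sketched in plain Java as follows (a hypothetical illustration for a single window of samples; the method names are mine, not from the study):

```java
public class Features {
    // Mean of a window of samples (average acceleration on one axis).
    static double mean(double[] v) {
        double s = 0;
        for (double x : v) s += x;
        return s / v.length;
    }

    // Population standard deviation (no Bessel correction).
    static double stdDev(double[] v) {
        double m = mean(v), s = 0;
        for (double x : v) s += (x - m) * (x - m);
        return Math.sqrt(s / v.length);
    }

    // Average absolute difference from the mean: (1/n) * Σ |x - mean|.
    static double avgAbsDiff(double[] v) {
        double m = mean(v), s = 0;
        for (double x : v) s += Math.abs(x - m);
        return s / v.length;
    }

    // Average resultant acceleration: (1/n) * Σ √(x² + y² + z²).
    static double avgResultant(double[] x, double[] y, double[] z) {
        double s = 0;
        for (int i = 0; i < x.length; i++)
            s += Math.sqrt(x[i] * x[i] + y[i] * y[i] + z[i] * z[i]);
        return s / x.length;
    }

    public static void main(String[] args) {
        double[] x = {0.0, 3.0}, y = {4.0, 0.0}, z = {0.0, 0.0};
        // (√(0+16+0) + √(9+0+0)) / 2 = (4 + 3) / 2 = 3.5
        System.out.println(avgResultant(x, y, z));
    }
}
```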

Now let’s use Einstein to compute all of these features!

No… Not this one…

## Just a few words about Einstein

Einstein is our home-made language which allows you to manipulate Geo Time Series and perform statistical computations. It is composed of several frameworks and several hundred functions.

### Bucketize framework

The BUCKETIZE framework provides the tooling for putting the data of a Geo Time Series into regularly spaced buckets.

### Mapper framework

The MAP framework allows you to apply a function on values of a Geo Time Series that fall into a sliding window.

### Reduce framework

The REDUCE framework operates on equivalence classes forming a partition of a set of Geo Time Series.

## Features computation with Einstein

Let’s use Einstein to compute all of these features!

### Average acceleration and Standard deviation

```
$data // call the data
false // no Bessel correction
MUSIGMA
'standev_x' STORE // store the standard deviation
'mean_x' STORE // store the mean
```

### Average absolute difference

```
$data // call the data
DUP // duplicate the data. Don't forget Einstein use a stack
// compute the mean
bucketizer.mean
0 0 1 // lastbucket bucketspan bucketcount
5 ->LIST
BUCKETIZE
VALUES LIST-> DROP LIST-> DROP // As BUCKETIZE returns a GTS, extract the value of the GTS
'mean' STORE
// Here we do: x - mean for each point x
-1 $mean * // multiply by -1
mapper.add // and add this value
0 0 0 // sliding window of 1 (0 pre and 0 post), no options
5 ->LIST
MAP
// Then apply an absolute value: |x - mean|
mapper.abs
0 0 0
5 ->LIST
MAP
// And compute the mean: (1 / n )* sum |x - mean|
// where n is the length of the time series
bucketizer.mean
0 0 1
5 ->LIST
BUCKETIZE
// extract and store the result
VALUES LIST-> DROP LIST-> DROP 'avg_abs_x' STORE
```

### Average resultant acceleration

```
$data // call the data
// Compute the square of each value
2.0 // power 2.0
mapper.pow
0 0 0 // sliding window of 1 (0 pre and 0 post), no options
5 ->LIST
MAP
// Now add up!
[] // create one equivalence class with all Geo Time Series
reducer.sum
3 ->LIST
REDUCE // it returns only one GTS because we have one equivalence class
// Then take the square root: √(x² + y² + z²)
0.5
mapper.pow
0 0 0
5 ->LIST
MAP
// And apply a mean function: 1/n * sum [√(x² + y² + z²)]
bucketizer.mean
0 0 1
5 ->LIST
BUCKETIZE
VALUES LIST-> DROP LIST-> DROP 'res_acc' STORE // store the returned value
```

### Average time between peaks

```
$data // call the data
DUP
// Now let's find the maximum
bucketizer.max
0 0 1 // lastbucket bucketspan bucketcount
5 ->LIST
BUCKETIZE // return a GTS
// extract the max value and store it
VALUES LIST-> DROP LIST-> DROP 'max_x' STORE
// keep data points for which the value is greater than 0.9 * max
$max_x 0.9 *
mapper.ge // ge, i.e. greater than or equal
0 0 0
5 ->LIST
MAP
// just return the tick of each datapoint
mapper.tick
0 0 0
5 ->LIST
MAP
// compute the delta between each tick
mapper.delta
1 0 0
5 ->LIST
MAP
// keep it if the delta is not equal to zero
0
mapper.ne
0 0 0
5 ->LIST
MAP
// compute the mean of the delta
bucketizer.mean
0 0 1
5 ->LIST
BUCKETIZE
// and store the value
VALUES LIST-> DROP LIST-> DROP 'peak_x' STORE
```
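
The logic of the script above (keep the ticks whose value is at least 90% of the maximum, then average the gaps between consecutive kept ticks) can be sketched in plain Java like this (hypothetical helper names, not Einstein code):

```java
public class PeakInterval {
    // Average time between peaks: keep ticks whose value is at least
    // threshold * max, then average the gaps between consecutive kept ticks.
    static double avgTimeBetweenPeaks(long[] ticks, double[] values, double threshold) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) max = Math.max(max, v);
        double cutoff = threshold * max;
        long prev = -1;
        double sum = 0;
        int gaps = 0;
        for (int i = 0; i < ticks.length; i++) {
            if (values[i] >= cutoff) {
                if (prev >= 0) { sum += ticks[i] - prev; gaps++; }
                prev = ticks[i];
            }
        }
        return gaps == 0 ? 0.0 : sum / gaps;
    }

    public static void main(String[] args) {
        long[] ticks = {0, 100, 200, 300, 400, 500};
        double[] values = {1.0, 0.2, 1.0, 0.1, 0.95, 0.3};
        // Peaks at ticks 0, 200 and 400 -> gaps of 200 and 200 -> mean 200.0
        System.out.println(avgTimeBetweenPeaks(ticks, values, 0.9));
    }
}
```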

## Decision Trees, Random Forest and Multinomial Logistic Regression

Just to recap: we want to determine the user’s activity from the data, and the possible activities are walking, jogging, sitting, standing, downstairs and upstairs. So it is a classification problem.

After aggregating all these data, we will use a training data set to create predictive models using classification algorithms (supervised learning), and then use those models to predict the activity performed by users.
Here we have chosen the implementations of the **Random Forest**, **Decision Trees** and **Multinomial Logistic Regression** algorithms from MLlib, Spark’s scalable machine learning library.

The algorithms are applied on 6 classes: Jogging, Walking, Standing, Sitting, Downstairs and Upstairs.

*Remark:* with the chosen features we get poor results when predicting upstairs and downstairs, so we need to define more relevant features to build a better prediction model.

Below is the code that splits our dataset into training and test sets.

```
// Split data into 2 sets: training (60%) and test (40%).
JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[]{0.6, 0.4});
JavaRDD<LabeledPoint> trainingData = splits[0].cache();
JavaRDD<LabeledPoint> testData = splits[1];
```

### Random Forest

Let's use the RandomForest.*trainClassifier* method to fit a random forest model. The model is then evaluated against the test dataset.

More about Random Forest.

```
Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<>();
int numTrees = 10;
int numClasses = 6; // Jogging, Walking, Standing, Sitting, Downstairs and Upstairs
String featureSubsetStrategy = "auto";
String impurity = "gini";
int maxDepth = 9;
int maxBins = 100;
// create model
RandomForestModel model = RandomForest.trainClassifier(trainingData,
numClasses,
categoricalFeaturesInfo,
numTrees,
featureSubsetStrategy,
impurity,
maxDepth,
maxBins,
12345);
// Evaluate model on test instances and compute test error
JavaPairRDD<Double, Double> predictionAndLabel =
testData.mapToPair(p -> new Tuple2<Double, Double>(model.predict(p.features()), p.label()));
// the error
Double testErr =
1.0 * predictionAndLabel.filter(pl -> !pl._1().equals(pl._2())).count() / testData.count();
```

### Decision Trees

Let's use DecisionTree.*trainClassifier* to fit a decision tree model. The model is then evaluated against the test dataset.

More about Decision Tree.

```
Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<>();
int numClasses = 6;
String impurity = "gini";
int maxDepth = 9;
int maxBins = 100;
// create model
final DecisionTreeModel model = DecisionTree.trainClassifier(trainingData,
numClasses,
categoricalFeaturesInfo,
impurity,
maxDepth,
maxBins);
// Evaluate model on test instances and compute test error
JavaPairRDD<Double, Double> predictionAndLabel =
testData.mapToPair(p -> new Tuple2<>(model.predict(p.features()), p.label()));
// the error
Double testErrDT =
1.0 * predictionAndLabel.filter(pl -> !pl._1().equals(pl._2())).count() / testData.count();
```

### Multinomial Logistic Regression

Now let’s use the LogisticRegressionWithLBFGS class to fit a multinomial logistic regression model. The model is then evaluated against the test dataset.

More about Multinomial Logistic Regression.

```
LogisticRegressionModel model = new LogisticRegressionWithLBFGS()
.setNumClasses(6)
.run(trainingData.rdd());
JavaRDD<Tuple2<Object, Object>> predictionAndLabel =
testData.map(p -> new Tuple2<>(model.predict(p.features()), p.label()));
// Evaluate metrics
MulticlassMetrics metrics = new MulticlassMetrics(predictionAndLabel.rdd());
// precision of the model. error = 1 - precision
Double precision = metrics.precision();
```
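
As the comment notes, error = 1 - precision; here the precision returned by MulticlassMetrics (without a label argument) is the overall fraction of correctly classified test points. Computed by hand on prediction/label pairs, it is simply (a hypothetical plain-Java sketch, not MLlib code):

```java
public class Accuracy {
    // Overall fraction of correctly classified points; the test error is 1 - accuracy.
    static double accuracy(double[] predictions, double[] labels) {
        int correct = 0;
        for (int i = 0; i < predictions.length; i++)
            if (predictions[i] == labels[i]) correct++;
        return (double) correct / predictions.length;
    }

    public static void main(String[] args) {
        double[] pred = {0, 1, 2, 2};
        double[] lab  = {0, 1, 1, 2};
        // 3 of 4 predictions match their labels -> 0.75
        System.out.println(accuracy(pred, lab));
    }
}
```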

## Results

For 37 users: 1380 samples for 4 classes, 2148 samples for 6 classes.

| nb classes | mean error (Random Forest) | mean error (Decision Tree) | mean error (Multinomial Logistic Regression) |
|---|---|---|---|
| 4 | 1.4% | 2.3% | 7.2% |
| 6 | 17% | 20% | 43% |

So these features give pretty good results for 4 classes, but rather poor results for 6.

## Conclusion

In this post we first demonstrated how to use Einstein functions and frameworks to extract features.

The feature extraction step takes quite a long time, because you need to test and experiment to find the best possible features.

We also had to prepare the data before computing the features and pushing them to the Cityzen Data platform, which can also take a while.

Finally, if you are using Spark in your developments, it can be useful to turn to MLlib, the Spark component that provides many common machine learning algorithms.